Site Reliability Engineer
Computer Enterprises Inc
Celebration, United States of America
7 days ago
Role details
Contract type: Temporary to permanent
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Compensation: $146K
Job location: Celebration, United States of America
Tech stack
Java
Microsoft Windows
Amazon Web Services (AWS)
Build Automation
Azure
Cloud Computing
Computer Security
Information Systems
Continuous Integration
Linux
DevOps
Distributed Systems
Monitoring of Systems
Python
PowerShell
Reliability Engineering
Ansible
Ruby
Shell Script
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
System Availability
Delivery Pipeline
GitLab
Procedural Programming
Kubernetes
Information Technology
Terraform
Go
Job description
This role supports Technology Operations by focusing on build automation, monitoring, and operational excellence across lower and production environments. The position partners closely with application, infrastructure, and operations teams to improve reliability, incident recovery, and system availability.
Key Responsibilities
- Drive a DevOps culture across engineering and application teams.
- Design, build, and support platforms, development pipelines, and automated infrastructure.
- Develop and maintain automation and monitoring solutions for lower and production environments.
- Perform systems administration across Windows, Linux, and Kubernetes environments.
- Design and implement robust monitoring and telemetry solutions for Windows, Linux, and containerized systems.
- Support incident and problem management, including root cause analysis, long-term fixes, and interim mitigations.
- Coordinate and lead retrospectives following major incidents to improve reliability and recovery times.
- Collaborate cross-functionally to ensure timely resolution of system issues and release alignment.
- Support deployment and coordination of software builds across multiple lower environments.
- Provide operational support for production systems supporting business-critical applications.
- Engage application owners in deep technical discussions and promote adoption of monitoring and automation tools.
- Work with vendors to design mitigations and monitoring when code fixes are delayed.
- Implement automation and documentation to accelerate recovery and reduce time to restore service.
Requirements
- Strong systems knowledge in Linux, Windows, and Kubernetes.
- Hands-on experience with container orchestration, troubleshooting, and platform administration.
- Solid scripting skills in Python, PowerShell, and shell scripting.
- Cloud experience with AWS, Azure, or Google Cloud.
- Experience with infrastructure as code and automation tools, including Terraform and Ansible.
- Hands-on experience with CI/CD tools such as GitLab, Azure DevOps, and AWX.
- Applied understanding of observability principles and monitoring tools.
- Experience with procedural programming languages such as Python, Java, Go, Ruby, or PowerShell.
- Strong troubleshooting skills across systems, networking, and distributed environments.
- Ability to identify root causes in large-scale, complex systems.
Required Education
- Bachelor's degree in Computer Science, Information Systems, Software, Electrical or Electronics Engineering, or comparable field of study, and/or equivalent work experience.
Preferred Skills
- Experience engaging application teams and enforcing problem management practices.
- Collaboration with Security Operations teams to deliver secure solutions.
- Experience leading technical initiatives and supporting smooth project delivery.
- Familiarity with incident management, problem management, and on-call rotations.
- Interest in emerging technologies and continuous improvement of operational practices.
This role offers hands-on ownership of automation, monitoring, and reliability for enterprise-scale systems. You will play a key role in improving operational resilience while partnering with diverse technical teams.