Site Reliability Engineer

Computer Enterprises Inc

Celebration, United States of America

7 days ago

Role details

Contract type

Temporary to permanent

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$146K

Job location

Celebration, United States of America

Tech stack

Java

Microsoft Windows

Amazon Web Services (AWS)

Build Automation

Azure

Cloud Computing

Computer Security

Information Systems

Continuous Integration

Linux

DevOps

Distributed Systems

Monitoring of Systems

Python

PowerShell

Reliability Engineering

Ansible

Ruby

Shell Script

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

System Availability

Delivery Pipeline

GitLab

Procedural Programming

Kubernetes

Information Technology

Terraform

Job description

This role supports Technology Operations by focusing on build automation, monitoring, and operational excellence across lower and production environments. The position partners closely with application, infrastructure, and operations teams to improve reliability, incident recovery, and system availability.

Key Responsibilities

Drive a DevOps culture across engineering and application teams.
Design, build, and support platforms, development pipelines, and automated infrastructure.
Develop and maintain automation and monitoring solutions for lower and production environments.
Perform systems administration across Windows, Linux, and Kubernetes environments.
Design and implement robust monitoring and telemetry solutions for Windows, Linux, and containerized systems.
Support incident and problem management, including root cause analysis, long-term fixes, and interim mitigations.
Coordinate and lead retrospectives following major incidents to improve reliability and recovery times.
Collaborate cross-functionally to ensure timely resolution of system issues and release alignment.
Support deployment and coordination of software builds across multiple lower environments.
Provide operational support for production systems that run business-critical applications.
Engage application owners in deep technical discussions and promote adoption of monitoring and automation tools.
Work with vendors to design mitigations and monitoring when code fixes are delayed.
Implement automation and documentation to accelerate recovery and reduce time to restore service.

Requirements

Strong systems knowledge in Linux, Windows, and Kubernetes.
Hands-on experience with container orchestration, troubleshooting, and platform administration.
Solid scripting skills in Python, PowerShell, and shell scripting.
Cloud experience with AWS, Azure, or Google Cloud.
Experience with infrastructure as code and automation tools, including Terraform and Ansible.
Hands-on experience with CI/CD tools such as GitLab, Azure DevOps, and AWX.
Applied understanding of observability principles and monitoring tools.
Experience with procedural programming languages such as Python, Java, Go, Ruby, or PowerShell.
Strong troubleshooting skills across systems, networking, and distributed environments.
Ability to identify root causes in large-scale, complex systems.

Required Education

Bachelor's degree in Computer Science, Information Systems, Software, Electrical, or Electronics Engineering, or a comparable field of study, and/or equivalent work experience.

Preferred Skills

Experience engaging application teams and enforcing problem management practices.
Collaboration with Security Operations teams to deliver secure solutions.
Experience leading technical initiatives and supporting smooth project delivery.
Familiarity with incident management, problem management, and on-call rotations.
Interest in emerging technologies and continuous improvement of operational practices.

This role offers hands-on ownership of automation, monitoring, and reliability for enterprise-scale systems. You will play a key role in improving operational resilience while partnering with diverse technical teams.