Site Reliability Engineer

KEY BUSINESS SOLUTIONS
Alpharetta, United States of America

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate

Job location

Alpharetta, United States of America

Tech stack

Java
Artificial Intelligence
Amazon Web Services (AWS)
Systems Engineering
Azure
Unix
Computer Clusters
Computer Programming
Data Governance
Disaster Recovery
Distributed Data Store
Distributed Systems
DNS
Fault Tolerance
Infrastructure as a Service (IaaS)
Python
Linux System Administration
Nagios
Routing
Performance Tuning
Reliability Engineering
Ansible
Prometheus
TCP/IP
Datadog
Data Logging
Scripting (Bash/Python/Go/Ruby)
Load Balancing
Data Storage Technologies
Data Ingestion
System Availability
Grafana
Generative AI
Cloudformation
Build Management
Containerization
Kubernetes
Terraform
Docker
Go

Job description

Skill set: Expertise in UNIX/Linux administration, AWS/Azure cloud monitoring, Terraform/Ansible, and Prometheus/Grafana observability.

Experience: 6+ years as a Site Reliability Engineer or in a similar role, with hands-on experience supporting IaaS platforms and with networking and systems engineering knowledge.

  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
  • Optimize cost vs. performance tradeoffs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
  • Maintain runbooks, operational playbooks, documentation, and training materials
  • Participate in on-call rotations and respond to production incidents 24/7 as needed
  • Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Requirements

  • Production experience in SRE, infrastructure, or operations roles for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
  • Experience with infrastructure-as-code tooling (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Skills: Python, Docker, Kubernetes, Site Reliability Engineering (SRE)

Experience required: 6-8 years

Required skill: Site Reliability Engineering (SRE); required: yes; importance: 1; experience: 4-7 years
