DevOps Engineer

TechniPros, LLC
Atlanta, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Remote
Atlanta, United States of America

Tech stack

Agile Methodologies
Amazon Web Services (AWS)
Application Layers
Azure
Bash
Cloud Computing
Cloud Engineering
Configuration Management
Computer Networks
Continuous Integration
DevOps
Disaster Recovery
Distributed Systems
Python
Linux System Administration
Powershell
Reliability Engineering
Prometheus
Datadog
Data Logging
Google Cloud Platform
Cloud Platform System
Grafana
Reliability of Systems
Infrastructure as Code (IaC)
Kubernetes
Splunk
Docker
Go

Job description

We are seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in Incident Management, Change Control, Error Budgeting, Remediation, and Production Operations. The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of cloud-native platforms and distributed systems. This role requires deep expertise in cloud infrastructure, automation, observability, incident response, and operational governance.

  • Manage and improve platform reliability, availability, and performance across production environments.
  • Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews.
  • Drive change control processes and ensure operational governance standards are followed.
  • Monitor and manage error budgets while implementing reliability improvements.
  • Design, build, and maintain scalable cloud infrastructure and automation frameworks.
  • Deploy and manage containerized applications using Kubernetes and Docker.
  • Develop and maintain CI/CD pipelines to support efficient software delivery.
  • Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management.
  • Establish observability strategies using monitoring, logging, and alerting platforms.
  • Collaborate with development, infrastructure, security, and business teams to ensure platform stability.
  • Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers.
  • Continuously improve operational processes, automation, and system resilience.

Requirements

  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations.
  • Strong experience managing workloads in cloud environments:
    • Microsoft Azure
    • Amazon Web Services (AWS)
    • Google Cloud Platform (GCP)
  • Hands-on experience with:
    • Kubernetes
    • Docker
    • CI/CD Pipelines
    • Infrastructure as Code (IaC)
  • Strong scripting and automation expertise using:
    • Python
    • Bash
    • PowerShell
    • Go (Golang)
  • Experience with observability and monitoring platforms:
    • Datadog
    • Grafana
    • Prometheus
    • Splunk
  • Strong understanding of:
    • Networking concepts
    • Linux Administration
    • Windows Administration
    • Distributed Systems
    • Cloud-Native Architectures
  • Experience with:
    • Incident Response
    • Production Troubleshooting
    • Operational Governance

Preferred qualifications

  • Experience implementing reliability engineering best practices and SRE methodologies.
  • Experience supporting large-scale enterprise production environments.
  • Familiarity with high-availability and disaster recovery architectures.
  • Experience automating operational workflows and infrastructure management.
  • Knowledge of security best practices within cloud environments.
  • Experience working in Agile and DevOps-driven organizations.

Mandatory Skills: Site Reliability Engineering (SRE), Incident Management, Change Control, Error Budgeting, Production Remediation, Microsoft Azure, AWS, Google Cloud Platform, Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC), Python, Bash, PowerShell, Go (Golang), Datadog, Grafana, Prometheus, Splunk, Linux Administration, Windows Administration, Networking, Distributed Systems, Cloud-Native Architectures, Production Troubleshooting, Operational Governance
