DevOps Engineer

TechniPros, LLC

Atlanta, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Remote

Atlanta, United States of America

Tech stack

Agile Methodologies

Amazon Web Services (AWS)

Application Layers

Azure

Bash

Cloud Computing

Cloud Engineering

Configuration Management

Computer Networks

Continuous Integration

DevOps

Disaster Recovery

Distributed Systems

Python

Linux System Administration

PowerShell

Reliability Engineering

Prometheus

Datadog

Data Logging

Google Cloud Platform

Cloud Platform System

Grafana

Reliability of Systems

Infrastructure as Code (IaC)

Kubernetes

Splunk

Docker

Job description

We are seeking an experienced Site Reliability Engineer (SRE) / DevOps Engineer with a strong background in Incident Management, Change Control, Error Budgeting, Remediation, and Production Operations. The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of cloud-native platforms and distributed systems. This role requires deep expertise in cloud infrastructure, automation, observability, incident response, and operational governance.

Manage and improve platform reliability, availability, and performance across production environments.
Lead and participate in incident management, root cause analysis, remediation planning, and post-incident reviews.
Drive change control processes and ensure operational governance standards are followed.
Monitor and manage error budgets while implementing reliability improvements.
Design, build, and maintain scalable cloud infrastructure and automation frameworks.
Deploy and manage containerized applications using Kubernetes and Docker.
Develop and maintain CI/CD pipelines to support efficient software delivery.
Implement Infrastructure as Code (IaC) solutions for automated provisioning and configuration management.
Establish observability strategies using monitoring, logging, and alerting platforms.
Collaborate with development, infrastructure, security, and business teams to ensure platform stability.
Troubleshoot complex production issues across cloud, networking, infrastructure, and application layers.
Continuously improve operational processes, automation, and system resilience.

Requirements

7+ years of experience in Site Reliability Engineering (SRE), DevOps, Cloud Infrastructure, or Production Operations.
Strong experience managing workloads in cloud environments:
Microsoft Azure
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Hands-on experience with:
Kubernetes
Docker
CI/CD Pipelines
Infrastructure as Code (IaC)
Strong scripting and automation expertise using:
Python
Bash
PowerShell
Go (Golang)
Experience with observability and monitoring platforms:
Datadog
Grafana
Prometheus
Splunk
Strong understanding of:
Networking concepts
Linux Administration
Windows Administration
Distributed Systems
Cloud-Native Architectures
Experience with:
Incident Response
Production Troubleshooting
Operational Governance

Preferred Qualifications:

Experience implementing reliability engineering best practices and SRE methodologies.
Experience supporting large-scale enterprise production environments.
Familiarity with high-availability and disaster recovery architectures.
Experience automating operational workflows and infrastructure management.
Knowledge of security best practices within cloud environments.
Experience working in Agile and DevOps-driven organizations.

Mandatory Skills: Site Reliability Engineering (SRE), Incident Management, Change Control, Error Budgeting, Production Remediation, Microsoft Azure, AWS, Google Cloud Platform, Kubernetes, Docker, CI/CD Pipelines, Infrastructure as Code (IaC), Python, Bash, PowerShell, Go (Golang), Datadog, Grafana, Prometheus, Splunk, Linux Administration, Windows Administration, Networking, Distributed Systems, Cloud-Native Architectures, Production Troubleshooting, Operational Governance
