SRE (Kubernetes)

OpenKyber LLC

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Remote

Tech stack

Amazon Web Services (AWS)

Azure

Bash

Cloud Computing

Cloud Engineering

Computer Programming

DevOps

Distributed Systems

Python

Machine Learning

OpenShift

Performance Tuning

Reliability Engineering

Site Reliability Engineering Practices

Prometheus

Scripting (Bash/Python/Go/Ruby)

Google Cloud Platform

System Availability

Grafana

Containerization

Kubernetes

Infrastructure Automation Frameworks

ArcSight Event Correlation

Splunk

Dynatrace

DevSecOps

Docker

ServiceNow

Job description

Lead SRE strategy, architecture, and reliability initiatives across large-scale distributed systems
Design and implement AIOps-driven monitoring and incident management solutions
Build proactive observability frameworks using Dynatrace and related monitoring platforms
Drive automation, self-healing, root cause analysis, and performance optimization initiatives
Collaborate with DevOps, Cloud, Platform Engineering, and Application teams
Improve system availability, scalability, resiliency, and operational excellence
Define SLOs, SLIs, SLAs, reliability metrics, and operational best practices
Lead production incident management, problem management, and postmortem processes
Mentor engineering teams on SRE practices and operational maturity

Requirements

Do you have experience in system performance monitoring?

Hiring: SRE Architect Lead, AIOps & Dynatrace
Location: Atlanta, GA (local to GA candidates only)
Work Mode: Hybrid

We are looking for a highly skilled SRE Architect Lead with strong experience in AIOps, observability, and enterprise reliability engineering to join a fast-paced enterprise environment.

Strong experience in Site Reliability Engineering (SRE) architecture and leadership
Hands-on expertise with Dynatrace (Monitoring, APM, Observability, Dashboarding, Alerting)
Experience with AIOps platforms, event correlation, anomaly detection, and automation
Strong cloud experience with AWS / Azure / Google Cloud Platform
Expertise in Kubernetes, Docker, OpenShift, or containerized environments
Experience with CI/CD pipelines and Infrastructure Automation
Scripting/Programming experience in Python, Bash, or Go
Knowledge of Incident Management, RCA, Capacity Planning, and Reliability Engineering
Experience supporting enterprise-scale production environments

Nice to Have:

Experience with ServiceNow, Splunk, Grafana, Prometheus, ELK, or Moogsoft
Exposure to ML-driven observability or predictive analytics
DevSecOps and cloud-native architecture experience
