SRE (Kubernetes)
OpenKyber LLC
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Job location: Remote
Tech stack
Amazon Web Services (AWS)
Azure
Bash
Cloud Computing
Cloud Engineering
Computer Programming
DevOps
Distributed Systems
Python
Machine Learning
OpenShift
Performance Tuning
Reliability Engineering
Site Reliability Engineering Practices
Prometheus
Scripting (Bash/Python/Go/Ruby)
Google Cloud Platform
System Availability
Grafana
Containerization
Kubernetes
Infrastructure Automation Frameworks
ArcSight Event Correlation
Splunk
Dynatrace
DevSecOps
Docker
ServiceNow
Job description
- Lead SRE strategy, architecture, and reliability initiatives across large-scale distributed systems
- Design and implement AIOps-driven monitoring and incident management solutions
- Build proactive observability frameworks using Dynatrace and related monitoring platforms
- Drive automation, self-healing, root cause analysis, and performance optimization initiatives
- Collaborate with DevOps, Cloud, Platform Engineering, and Application teams
- Improve system availability, scalability, resiliency, and operational excellence
- Define SLOs, SLIs, SLAs, reliability metrics, and operational best practices
- Lead production incident management, problem management, and postmortem processes
- Mentor engineering teams on SRE practices and operational maturity
Requirements
Do you have experience in system performance monitoring?
Hiring: SRE Architect Lead, AIOps & Dynatrace
Location: Atlanta, GA (local to GA candidates only)
Work Mode: Hybrid
We are looking for a highly skilled SRE Architect Lead with strong experience in AIOps, Observability, and Enterprise Reliability Engineering to join a fast-paced enterprise environment.
- Strong experience in Site Reliability Engineering (SRE) Architecture & Leadership
- Hands-on expertise with Dynatrace (Monitoring, APM, Observability, Dashboarding, Alerting)
- Experience with AIOps platforms, event correlation, anomaly detection, and automation
- Strong cloud experience with AWS / Azure / Google Cloud Platform
- Expertise in Kubernetes, Docker, OpenShift, or containerized environments
- Experience with CI/CD pipelines and Infrastructure Automation
- Scripting/Programming experience in Python, Bash, or Go
- Knowledge of Incident Management, RCA, Capacity Planning, and Reliability Engineering
- Experience supporting enterprise-scale production environments
Nice to Have:
- Experience with ServiceNow, Splunk, Grafana, Prometheus, ELK, or Moogsoft
- Exposure to ML-driven observability or predictive analytics
- DevSecOps and cloud-native architecture experience