Site Reliability Engineer
Job description
- Partner with application teams to instrument services using OpenTelemetry and Dynatrace
- Improve reliability, availability, and performance of critical production applications
- Analyze system behavior and implement proactive monitoring and alerting
- Contribute to release engineering processes and CI/CD pipeline improvements
- Design and implement automation solutions to reduce manual operations
- Support hybrid environments (legacy + modern/cloud-based systems)
- Drive adoption of SRE best practices (SLIs/SLOs, error budgets, observability)
- Collaborate with engineering, platform, and operations teams to enhance delivery tooling
- Act as a technical influencer to improve engineering practices across teams
Requirements
- Strong software engineering background
- Ability to read, debug, and modify application code (Java, .NET, Python, or similar)
- Experience with observability and instrumentation tools
  - OpenTelemetry, Dynatrace (or equivalent)
- Solid understanding of Site Reliability Engineering principles
- Hands-on experience with:
  - Monitoring & alerting
  - Automation (scripting, tooling)
  - Incident response & root cause analysis
- Familiarity with DevOps practices and CI/CD pipelines
- Experience working in enterprise-scale or complex environments
- Strong communication and collaboration skills
- Ability to work with diverse application teams and influence change
Preferred Qualifications (Nice to Have)
- Experience in financial services or regulated environments
- Exposure to large-scale platform rollouts or "factory model" transformations
- Experience with release engineering practices
- Familiarity with hybrid cloud environments (Azure preferred)
- Knowledge of modern DevOps tools and delivery platforms
- Experience modernizing legacy applications