Site Reliability Engineer - MX based
Job description
We are looking for a Senior Site Reliability Engineer (SRE) to help build, maintain, and scale highly reliable cloud infrastructure and enterprise applications. This role is focused on ensuring platform stability, performance, scalability, automation, and operational excellence across AWS environments.
The ideal candidate combines strong software engineering fundamentals with deep operational and infrastructure expertise, and thrives in high-scale production environments.
Responsibilities
- Design, implement, and maintain highly available and scalable infrastructure on AWS
- Improve platform reliability, observability, and operational efficiency
- Automate infrastructure provisioning and management using Terraform
- Manage and support containerized environments using EKS or ECS
- Build and enhance CI/CD pipelines and deployment automation processes
- Monitor production systems and proactively identify reliability and performance issues
- Lead incident response, troubleshooting, root cause analysis, and postmortem processes
- Design and manage escalation response plans across monitoring, response, remediation, and retrospective activities
- Collaborate with software engineering teams to improve system resilience and scalability
- Optimize application performance for high-concurrency workloads, including through effective caching strategies
- Drive reliability engineering best practices, automation, and continuous improvement initiatives
- Participate in architecture reviews and operational readiness processes
Requirements
- Strong experience as an SRE, Cloud Engineer, DevOps Engineer, or Software Engineer supporting production infrastructure
- Hands-on experience with AWS in large-scale production environments
- Experience with infrastructure-as-code technologies, preferably Terraform
- Experience with containerization and orchestration platforms, preferably EKS or ECS
- Strong troubleshooting experience across web server platforms, application platforms, operating systems, networking components, virtualization technologies, storage systems, and database platforms
- Experience working with CI/CD and continuous deployment environments
- Experience supporting high-concurrency systems and caching strategies
- Strong incident management, root cause analysis, and systems engineering skills
- Ability to design and manage operational escalation processes in proactive and collaborative environments
- Demonstrated experience managing highly scaled cloud infrastructure
- Strong communication and problem-solving skills
- Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
Nice to Have
- Experience with observability and monitoring platforms
- Kubernetes ecosystem knowledge
- Experience with distributed systems and microservices architectures
- Familiarity with SLOs, SLIs, and error budgets
- Experience with performance tuning and capacity planning
- Exposure to DevSecOps and cloud security best practices