Senior Site Reliability Engineer
Job description
- Own the AWS Cloud Infrastructure: Design, implement, and manage highly reliable, scalable, and cost-efficient services utilizing core AWS tools (e.g., EC2, ECS/EKS, Lambda, RDS, S3, CloudWatch).
- Drive Operational Excellence: Implement and maintain robust CI/CD pipelines, automating infrastructure deployment and configuration.
- Enhance System Observability: Establish comprehensive monitoring, logging, and alerting strategies to proactively identify and resolve performance and reliability issues.
- SRE/DevOps Collaboration: Work closely with the software development teams to define and enforce Service Level Objectives (SLOs) and improve the entire service lifecycle, from design through deployment.
- Platform Leadership: Provide technical vision and mentorship in discussions around software architecture, infrastructure scaling, and the roadmap for future platform development.
- Incident Response: Lead and participate in incident response, root cause analysis (RCA), and continuous improvement processes to minimize downtime and prevent recurrence.
Requirements
Do you have experience in software development? We are looking for a Senior Site Reliability Engineer who is passionate about technology and always looking for new ways to tackle complex issues. As our SRE, you will be in charge of our production infrastructure, focusing on reliability, performance, observability, and cost-efficiency.
You'll be a key player in ensuring our state-of-the-art AI solutions are delivered with five-nines (99.999%) reliability, driving a culture of automation and infrastructure-as-code. We envision the person in this role as a future leader of a Platform Engineering team, and we expect you to contribute to strategic, long-term technical direction.
The usual day-to-day tasks include designing and deploying cloud architecture, automating deployments (CI/CD), enhancing system monitoring and alerting, and troubleshooting complex production issues.
The role will be conducted entirely in English, and CVs/resumes in other languages will not be considered.
A strong candidate will ideally possess deep expertise in the following areas:
- Senior-Level AWS Proficiency: Extensive, hands-on experience designing, deploying, and managing complex, production-grade workloads on AWS.
- Expertise in Python: Strong working knowledge of Python for scripting, automation, and building SRE tools and services.
- Containerization and Orchestration: Deep understanding of, and hands-on experience with, Docker in a production environment.
- CI/CD Pipeline Design: Experience designing, maintaining, and troubleshooting automated delivery pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, AWS CodePipeline).
- Monitoring & Observability: Strong experience with monitoring stacks and centralized logging (e.g., OpenSearch).
- Networking and Security: Solid understanding of cloud networking (VPC, security groups, load balancers) and security best practices.
- Troubleshooting: Expert ability to diagnose and resolve complex issues across distributed systems.