Principal Site Reliability Engineer (AI-first SRE)
Role details
Job description
- Architect and maintain self-healing systems with 99.9%+ availability targets.
- Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
- Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
- Build AIOps-based observability and auto-remediation pipelines.
- Apply predictive modeling to forecast failures before they impact users.
- Lead chaos, performance, and resilience testing programs.
- Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
- Mentor engineers and drive reliability standards across teams.
- Partner with platform, data, and product teams to ensure stability aligns with business goals.
- Support major incident response and incident reviews, and participate in on-call rotations.
Skills
Software/systems engineering, Site Reliability Engineering (SRE), GCP, AWS, Kubernetes, Terraform, Python, Go, Observability stacks, AIOps
Requirements
A global e-commerce platform seeks a Principal Site Reliability Engineer in Valencia to drive AI-driven reliability and design self-healing systems. The role requires leadership in incident response, mentoring engineers, and enhancing service resilience with a focus on predictive modeling. Ideal candidates will have 10+ years in software engineering, expertise in GCP, and proficiency in Python or Go. This position offers a chance to significantly impact business performance while fostering a culture of innovation.
- 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
- Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
- Proficiency in Python or Go for automation and tooling.
- Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
- Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
- Strong communication and influencing skills: data over hierarchy.