Principal Site Reliability Engineer (AI-first SRE)

Municipality of Madrid, Spain
15 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Municipality of Madrid, Spain

Tech stack

Testing (Software)
Amazon Web Services (AWS)
Systems Engineering
Python
Machine Learning
Reliability Engineering
Prometheus
Software Engineering
Istio
Grafana
Kubernetes
Real Time Data
Terraform

Job description

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
  • Mentor engineers and drive reliability standards across teams.
  • Partner with platform, data, and product teams to ensure stability aligns with business goals.
  • Support major incident response and incident review, and participate in on-call rotations.

Skills

Software/systems engineering
Site Reliability Engineering (SRE)
GCP
AWS
Kubernetes
Terraform
Python
Go
Observability stacks
AIOps

Requirements

A global e-commerce platform seeks a Principal Site Reliability Engineer in Valencia to drive AI-driven reliability and design self-healing systems. The role requires leadership in incident response, mentoring engineers, and enhancing service resilience with a focus on predictive modeling. Ideal candidates will have 10+ years in software engineering, expertise in GCP, and proficiency in Python or Go. This position offers a chance to significantly impact business performance while fostering a culture of innovation.

  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
  • Proficiency in Python or Go for automation and tooling.
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
  • Strong communication and influencing skills - data over hierarchy.
