Principal Site Reliability Engineer (AI-first SRE)

Municipality of Madrid, Spain

15 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Municipality of Madrid, Spain

Tech stack

Testing (Software)

Amazon Web Services (AWS)

Systems Engineering

Python

Machine Learning

Reliability Engineering

Prometheus

Software Engineering

Istio

Grafana

Kubernetes

Real Time Data

Terraform

Job description

Architect and maintain self-healing systems with 99.9%+ availability targets.
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
Build AIOps-based observability and auto-remediation pipelines.
Apply predictive modeling to forecast failures before they impact users.
Lead chaos, performance, and resilience testing programs.
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
Mentor engineers and drive reliability standards across teams.
Partner with platform, data, and product teams to ensure stability aligns with business goals.
Support major incident response and incident reviews, and participate in on-call rotations.

Skills

Software/systems engineering
Site Reliability Engineering (SRE)
GCP
AWS
Kubernetes
Terraform
Python
Go
Observability stacks
AIOps

Requirements

A global e-commerce platform seeks a Principal Site Reliability Engineer in Valencia to drive AI-driven reliability and design self-healing systems. The role requires leadership in incident response, mentoring engineers, and enhancing service resilience with a focus on predictive modeling. Ideal candidates will have 10+ years in software engineering, expertise in GCP, and proficiency in Python or Go. This position offers a chance to significantly impact business performance while fostering a culture of innovation.

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
Proficiency in Python or Go for automation and tooling.
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
Strong communication and influencing skills: data over hierarchy.