Senior Site Reliability Engineer

Trust In Soda
Municipality of Madrid, Spain
5 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English

Job location

Municipality of Madrid, Spain

Tech stack

Amazon Web Services (AWS)
Cloud Computing
Python
Node.js
Reliability Engineering
Prometheus
Ruby
Software Engineering
Datadog
Data Logging
Pulumi
Grafana
CloudFormation
Kubernetes
Operational Systems
CloudWatch
Terraform

Job description

You will operate at the intersection of software engineering, cloud infrastructure and reliability engineering. This role goes beyond execution and delivery. You will be expected to design, plan and lead initiatives, shaping how reliability, observability and incident management are implemented across the organisation.

You will partner closely with engineering teams, influence architectural decisions early, and help define how reliability is measured and improved as the platform scales.

Requirements

  • Led initiatives across multiple teams or domains rather than working solely within one squad

  • Designed and evolved systems with clear reasoning around trade-offs, failure modes and long-term impact

  • Strong communication skills and confidence presenting technical decisions in larger group settings

  • Experience in scale-ups or mid-sized tech environments where structure is still evolving and ownership is high

Technical background

You bring strong depth across:

  • Cloud infrastructure, ideally AWS, with solid networking and service-level understanding

  • Containers and orchestration such as Kubernetes, ECS or similar

  • Infrastructure as Code using tools like Terraform, Pulumi or CloudFormation

  • Observability and monitoring, including metrics, logging and alerting, using tools such as Prometheus, Grafana, Datadog or CloudWatch

  • CI/CD and automation practices with a focus on reliability and safety

You also have a strong software engineering background, with experience building and operating systems (not just scripting) in languages such as Python, Node.js, Ruby or similar.

Reliability mindset

You are comfortable with:

  • Defining and using SLOs and SLIs to make reliability measurable

  • Using error budgets to guide engineering priorities

  • Leading or participating in incident response and post-incident improvement

  • Improving production readiness and on-call quality, and reducing recurring failure patterns

Why this role stands out

  • High impact senior role with real ownership and influence

  • Opportunity to shape reliability practices in a growing engineering organisation

  • Strong engineering culture with an emphasis on autonomy and trust

If you are a senior engineer who enjoys designing systems, leading initiatives and improving reliability at scale, this role offers the scope and autonomy to make a real impact.

Benefits & conditions

  • Competitive salary, equity and a flexible hybrid working model
