Senior Site Reliability Engineer

Trust In Soda

Municipality of Madrid, Spain

5 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Job location

Municipality of Madrid, Spain

Tech stack

Amazon Web Services (AWS)

Cloud Computing

Python

Node.js

Reliability Engineering

Prometheus

Ruby

Software Engineering

Datadog

Data Logging

Pulumi

Grafana

CloudFormation

Kubernetes

Operational Systems

CloudWatch

Terraform

Job description

You will operate at the intersection of software engineering, cloud infrastructure and reliability engineering. This role goes beyond execution and delivery. You will be expected to design, plan and lead initiatives, shaping how reliability, observability and incident management are implemented across the organisation.

You will partner closely with engineering teams, influence architectural decisions early, and help define how reliability is measured and improved as the platform scales.

Requirements

Led initiatives across multiple teams or domains rather than working solely within one squad
Designed and evolved systems with clear reasoning around trade-offs, failure modes and long-term impact
Strong communication skills and confidence presenting technical decisions in larger group settings
Experience in scale-ups or mid-sized tech environments where structure is still evolving and ownership is high

Technical background

You bring strong depth across:

Cloud infrastructure, ideally AWS, with solid networking and service-level understanding
Containers and orchestration such as Kubernetes, ECS or similar
Infrastructure as Code using tools like Terraform, Pulumi or CloudFormation
Observability and monitoring, including metrics, logging and alerting, using tools such as Prometheus, Grafana, Datadog or CloudWatch
CI/CD and automation practices with a focus on reliability and safety

You also have a strong software engineering background, with experience building and operating systems in languages such as Python, Node.js, Ruby or similar, beyond scripting alone.

Reliability mindset

You are comfortable with:

Defining and using SLOs and SLIs to make reliability measurable
Using error budgets to guide engineering priorities
Leading or participating in incident response and post-incident improvement
Improving production readiness and on-call quality, and reducing recurring failure patterns

Why this role stands out

High impact senior role with real ownership and influence
Opportunity to shape reliability practices in a growing engineering organisation
Strong engineering culture with an emphasis on autonomy and trust

If you are a senior engineer who enjoys designing systems, leading initiatives and improving reliability at scale, this role offers the scope and autonomy to make a real impact.