Senior Site Reliability Engineer - Automation Platform

DOCTOLIB SAS

Canton de Nantes-1, France

4 months ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Canton de Nantes-1, France

Tech stack

Java

Amazon Web Services (AWS)

Azure

Elasticsearch

Python

Reliability Engineering

Logstash

Prometheus

Ruby

Datadog

Data Logging

Google Cloud Platform

Containerization

Kubernetes

Docker

Programming Languages

Job description

As a Senior Site Reliability Engineer within the Core Reliability & Observability team, you will play a pivotal role in shaping the company's observability strategy and ensuring our platform remains reliable, debuggable, and scalable. This role sits at the intersection of infrastructure, developer experience, and product engineering, with a particular focus on building and evolving the foundations of logging, metrics, tracing, and alerting across the organization.

Lead the observability strategy across the platform, with an emphasis on building scalable, developer-friendly logging and tracing capabilities.
Identify and lead large-scale cross-cutting reliability initiatives, including improvements to our incident detection, response, and postmortem analysis capabilities.
Take part in the on-call rotation, and actively contribute to improving our on-call experience by refining alerting, reducing noise, and ensuring actionable telemetry.

Requirements

Have solid hands-on experience (3+ years) on a large-scale production platform
Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
Have a solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Have a strong understanding of Helm for managing Kubernetes manifests and ArgoCD for GitOps workflows
Have deep expertise in observability tooling and architecture, such as:

Logging: Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Logstash, Vector
Tracing: OpenTelemetry or proprietary APMs
Metrics: Prometheus, Thanos, Datadog, or equivalent

Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and a deep understanding of infrastructure-as-code principles
Have experience with monitoring and observability tools
Like troubleshooting performance issues in complex environments
Speak English

Benefits & conditions

Free Health Insurance for you & your family
Up to 14 days of RTT
Parental care program (1 month off in addition to the legal parental leave and 0.5 days off per child when school starts)
Wellbeing program (free mental health and coaching offer with our partner moka.care)
A flexible workplace policy offering both hybrid and office-based modes
Flexibility days allowing you to work from EU countries and the UK for 10 days per year
Lunch vouchers with a Swile card
Works Council subsidy to refund part of a sports club membership or creative class
Bicycle subsidy

The interview process

Recruiter interview
Technical SRE interview
System Design interview
Behavioral interview
Background / Reference check
Offer!
