Senior Site Reliability Engineer - Automation Platform
DOCTOLIB SAS
Canton de Nantes-1, France
2 days ago
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Canton de Nantes-1, France
Tech stack
Java
Amazon Web Services (AWS)
Azure
Elasticsearch
Python
Reliability Engineering
Logstash
Prometheus
Ruby
Datadog
Data Logging
Google Cloud Platform
Containerization
Kubernetes
Docker
Go
Programming Languages
Job description
As a Senior Site Reliability Engineer within the Core Reliability & Observability team, you will play a pivotal role in shaping the company's observability strategy and ensuring our platform remains reliable, debuggable, and scalable. This role sits at the intersection of infrastructure, developer experience, and product engineering, with a particular focus on building and evolving the foundations of logging, metrics, tracing, and alerting across the organization.
- Lead the observability strategy across the platform, with an emphasis on building scalable, developer-friendly logging and tracing capabilities.
- Identify and lead large-scale cross-cutting reliability initiatives, including improvements to our incident detection, response, and postmortem analysis capabilities.
- Take part in the on-call rotation, and actively contribute to improving our on-call experience by refining alerting, reducing noise, and ensuring actionable telemetry.
Requirements
- Have solid hands-on experience (3+ years) on a large-scale production platform
- Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
- Have a solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
- Have a strong understanding of Helm for managing Kubernetes manifests and ArgoCD for GitOps workflows
- Have deep expertise in observability tooling and architecture, such as:
  - Logging: Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Logstash, Vector
  - Tracing: OpenTelemetry or proprietary APMs
  - Metrics: Prometheus, Thanos, Datadog, or equivalent
- Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and a deep understanding of infrastructure as code principles
- Have experience with monitoring and observability tools
- Enjoy troubleshooting performance issues in complex environments
- Speak English
Benefits & conditions
- Free Health Insurance for you & your family
- Up to 14 days of RTT
- Parental care program (1 month off in addition to the legal parental leave, and 0.5 days off per child when school starts)
- Wellbeing program (free mental health and coaching offer with our partner moka.care)
- A flexible workplace policy offering both hybrid and office-based modes
- Flexibility days allowing you to work from EU countries and the UK 10 days per year
- Lunch vouchers with a Swile card
- Works Council subsidy to refund part of a sports club membership or creative class
- Bicycle subsidy
The interview process
- Recruiter interview
- Technical SRE interview
- System Design interview
- Behavioral interview
- Background / Reference check
- Offer!