Site Reliability Engineer

deepset GmbH

Berlin, Germany

5 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English, German

Experience level

Intermediate

Job location

Berlin, Germany

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Software as a Service

Continuous Integration

Software Debugging

GitHub

Machine Learning

Reliability Engineering

Site Reliability Engineering Practices

Prometheus

Service-Oriented Architecture

Private Cloud Environment

Datadog

Large Language Models

Kubernetes

Terraform

Job description

Build and operate real-world infrastructure: Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).
Make self-hosted production-ready: Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks, not months.
Drive automation & platform maturity: Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.
Reduce complexity and cost: Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.
Shape how we build: Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.

Requirements

2-5 years of experience working with large-scale production infrastructure
Fluent German language skills
Experience with distributed or service-oriented architectures
Hands-on expertise with:

AWS
Kubernetes
CI/CD and GitOps (e.g. ArgoCD)

Working knowledge of Infrastructure as Code (Terraform preferred)
Solid troubleshooting skills - you can debug across systems, not just within one layer
A pragmatic mindset: you balance speed, simplicity, and reliability
Ownership and accountability - you take responsibility for systems end-to-end
Ability to work independently while staying aligned with the team's goals
Familiarity with observability stacks (e.g. Datadog, Prometheus)
Experience optimizing cloud costs at scale
Interest or experience in Machine Learning / LLM systems
Experience improving developer experience and platform tooling using AI agents
Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture

Benefits & conditions

Remote-first setup with flexible hours & tech of your choice
30 days of vacation + extra days for family sick leave
Competitive salary & stock options for every team member
Monthly sports & mental health support allowance with Oliva
Annual learning & development budget
Monthly team socials & in-person meetups
Dog-friendly Berlin HQ

About the company

Founded in 2018, deepset builds open and enterprise-grade tools that help teams build AI with purpose. From Haystack, our open-source framework, to the Haystack Enterprise Platform, we give developers and organizations the building blocks to solve complex, high-impact challenges with AI while keeping full control, transparency, and sovereignty. Backed by GV and Balderton, we're growing the world's production AI community and a customer base solving challenges too critical to get wrong.