Site Reliability Engineer

deepset

Berlin, Germany

2 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Shift work

Languages

English

Experience level

Intermediate

Job location

Remote

Berlin, Germany

Tech stack

Artificial Intelligence

Amazon Web Services (AWS)

Software as a Service

Continuous Integration

Software Debugging

GitHub

Machine Learning

Reliability Engineering

Site Reliability Engineering Practices

Prometheus

Service-Oriented Architecture

Private Cloud Environment

Datadog

Large Language Models

Kubernetes

Terraform

Job description

You won't just "keep things running" - you'll help define how our platform is built, deployed, and scaled across cloud and customer environments.

Build and operate real-world infrastructure. Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).
Make self-hosted production-ready. Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks - not months.
Drive automation & platform maturity. Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.
Reduce complexity and cost. Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.
Shape how we build. Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.

Requirements

2-5 years of experience working with large-scale production infrastructure

Experience with distributed or service-oriented architectures
Hands-on expertise with:

AWS
Kubernetes
CI/CD and GitOps (e.g. ArgoCD)

Working knowledge of Infrastructure as Code (Terraform preferred)
Solid troubleshooting skills - you can debug across systems, not just within one layer
A pragmatic mindset: you balance speed, simplicity, and reliability
Ownership and accountability - you take responsibility for systems end-to-end
Ability to work independently while staying aligned with the team's goals

Nice to have

Familiarity with observability stacks (e.g. Datadog, Prometheus)
Experience optimizing cloud costs at scale
Interest or experience in Machine Learning / LLM systems
Experience improving developer experience and platform tooling using AI agents
Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture

Benefits & conditions

Remote-first setup with flexible hours & tech of your choice
30 days vacation + extra days for family sick leave
Competitive salary & stock options for every team member
Monthly sports & mental health support allowance with Oliva
Annual learning & development budget
Monthly team socials & in-person meetups
Dog-friendly Berlin HQ

About the company

We are building the next enterprise search engine, fueled by NLP and open source. Building on the latest NLP research, we leverage question answering and transfer learning to provide granular, semantic search results tailored to your domain.
