Site Reliability Engineer

deepset
Berlin, Germany
2 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Shift work
Languages
English
Experience level
Intermediate

Job location

Remote
Berlin, Germany

Tech stack

Artificial Intelligence
Amazon Web Services (AWS)
Software as a Service
Continuous Integration
Software Debugging
GitHub
Machine Learning
Reliability Engineering
Site Reliability Engineering Practices
Prometheus
Service-Oriented Architecture
Private Cloud Environment
Datadog
Large Language Models
Kubernetes
Terraform

Job description

You won't just "keep things running" - you'll help define how our platform is built, deployed, and scaled across cloud and customer environments.

  • Build and operate real-world infrastructure. Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).
  • Make self-hosted production-ready. Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks - not months.
  • Drive automation & platform maturity. Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.
  • Reduce complexity and cost. Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.
  • Shape how we build. Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.

Requirements

  • 2-5 years of experience working with large-scale production infrastructure
  • Experience with distributed or service-oriented architectures
  • Hands-on expertise with:
      • AWS
      • Kubernetes
      • CI/CD and GitOps (e.g. ArgoCD)
  • Working knowledge of Infrastructure as Code (Terraform preferred)
  • Solid troubleshooting skills - you can debug across systems, not just within one layer
  • A pragmatic mindset: you balance speed, simplicity, and reliability
  • Ownership and accountability - you take responsibility for systems end-to-end
  • Ability to work independently while staying aligned with the team's goals

Nice to have

  • Familiarity with observability stacks (e.g. Datadog, Prometheus)
  • Experience optimizing cloud costs at scale
  • Interest or experience in Machine Learning / LLM systems
  • Experience improving developer experience and platform tooling using AI agents
  • Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture

Benefits & conditions

  • Remote-first setup with flexible hours & tech of your choice
  • 30 days vacation + extra days for family sick leave
  • Competitive salary & stock options for every team member
  • Monthly sports & mental health support allowance with Oliva
  • Annual learning & development budget
  • Monthly team socials & in-person meetups
  • Dog-friendly Berlin HQ

About the company

We are building the next enterprise search engine, fueled by NLP and open source. Building on the latest NLP research, we use question answering and transfer learning to provide granular, semantic search results tailored to your domain.
