Site Reliability Engineer
Role details
Job description
We're a small, tool-agnostic team that owns the observability infrastructure behind Proton's services - the logs, metrics, traces, and alerts that keep systems running smoothly for the millions of users who trust us with their privacy. We run on open-source stacks across Proton's on-premise data centers, and we dogfood heavily: we're our own first customers. We favor simple, solid solutions over large engineering efforts, and we believe good systems emerge iteratively. You'll join a group that values frank, open communication and a problem-solving mentality - if you want narrow scope and a fixed backlog, this isn't the right fit.
Tech stack
- Languages: Python, Go
- Observability: open-source stacks for logs, metrics, traces, alerting; OpenTelemetry
- Orchestration: Kubernetes
- GitOps: ArgoCD
- Infrastructure-as-code: Terraform, Ansible, Puppet
- Storage at scale: ClickHouse
- Platform: Linux, on-premise data centers
Responsibilities
- Design, deploy, and operate observability pipelines for logs, metrics, traces, and alerts across Proton's services using open-source technologies.
- Partner with development and platform teams to ship practical alerting, dashboarding, and integration solutions that engineers actually rely on.
- Build reusable templates and tooling that streamline onboarding, incident response, and analysis.
- Champion observability best practices across teams and raise the bar for how Proton instruments its systems.
- Build AI-powered tooling that sharpens detection, analysis, and response capabilities.
- Evolve the observability platform iteratively to meet the real needs of internal stakeholders.
Requirements
- Extensive experience in an SRE, DevOps, or Platform Engineering role.
- Comfortable writing Python and/or Go for tooling and automation.
- Hands-on experience operating open-source observability stacks (logs, metrics, traces, alerting).
- Working knowledge of Kubernetes and GitOps workflows.
- Practical experience with infrastructure-as-code (Terraform, Ansible, Puppet, or similar) and solid Linux system administration skills.
Nice to Have
- Familiarity with OpenTelemetry.
- Experience running ClickHouse for log and metric storage at scale.
- Interest in or experience with AI/ML tooling.