Senior Site Reliability Engineer
Job description
As a Senior Site Reliability Engineer, you will take ownership of the reliability and operational foundations of our platform. You will work across Kubernetes, AWS infrastructure, CI/CD, observability, and security to ensure that our systems remain scalable, resilient, and secure as we grow.
- Manage cloud infrastructure: Design and operate scalable AWS infrastructure (EKS, RDS, ALB, IAM, VPC, S3) using Infrastructure as Code, with strong IaC discipline and clear change management.
- Strengthen observability: Improve our Sentry, Grafana, Prometheus, and Loki setup so teams can define SLOs, debug fast, and operate their services with confidence.
- Lead on security: Own our security posture across infrastructure and application layers - IAM, secrets management, network segmentation, container and dependency scanning, vulnerability management, supply chain security, and audit readiness. Embed security as a design constraint, not a bolted-on review step.
- Improve incident response: Strengthen our on-call practices, runbooks, and post-incident learning. We treat reliability as a product feature.
- Enable product teams: Provide tooling, guidance, and self-service capabilities that help product engineers adopt better operational and deployment practices - make the good path the easy path.
- Support the broader platform surface: Temporal workflows, PostgreSQL operations, S3, our Estuary CDC pipeline, and AI service infrastructure on GCP/Azure as we expand our AI capabilities.
Requirements
- Do you have experience in TypeScript?
- Do you have a Master's degree?
- Engineering experience: 5+ years building and operating production systems in cloud environments, including real ownership of non-trivial systems at scale.
- Kubernetes depth: Deep, hands-on production Kubernetes experience - beyond kubectl apply, including operators, networking, autoscaling, and debugging.
- AWS expertise: Strong working knowledge of EKS, RDS, ALB, IAM, VPC, S3, and the operational realities of running services on AWS.
- Infrastructure as Code: Solid Terraform experience with disciplined IaC practices.
- GitOps: Hands-on experience with ArgoCD, Helm, or equivalent declarative deployment tooling.
- Security expertise applied to cloud-native environments: IAM best practices, secrets management, secure network architecture, container and dependency vulnerability scanning, secure SDLC principles, and familiarity with compliance frameworks (e.g. ISO 27001, SOC 2, or comparable). You proactively identify risks and contribute to incident response and audit readiness.
- Observability instincts: You've built dashboards, defined SLOs, run real incidents, and used the resulting learning to improve systems.
- Automation fluency: Comfortable scripting and building tooling in Python, Go, Bash, or similar.
- Communication: Excellent written and verbal English (C1+). You document decisions, write runbooks people actually use, and explain tradeoffs clearly.
Nice to have
- Experience with Temporal or other workflow orchestration systems
- Exposure to CDC pipelines (Estuary, Debezium, or similar)
- Multi-cloud experience (AWS primary, GCP/Azure for AI services)
- Background running platforms for Spring/Java services at scale
- Experience in regulated environments (financial services, real estate, healthcare)
- Familiarity with AI/LLM infrastructure patterns or agentic engineering workflows
- Prior experience as an early platform hire - building foundations without over-engineering
How you work
- You think in systems and prefer building self-service capabilities over becoming a ticket queue.
- You're pragmatic about quality - protecting long-term adaptability without gold-plating.
- You communicate openly about tradeoffs, mistakes, and unknowns.
- You see security and compliance as enablers of speed, not obstacles.
- You're comfortable with autonomy and high ownership in a small, focused team.
- You're curious about how AI is changing platform engineering - and you want to help figure it out.
Benefits & conditions
- Flexible schedule