Senior Site Reliability Engineer (Database)
Job description
Join Kombo as one of our first Database Reliability Engineers. You'll take ownership of our Postgres infrastructure, ensuring performance, scalability, and reliability as we grow. This is a high-impact, high-autonomy role with the chance to shape Kombo's database reliability practices from the ground up.
We're looking for a Site Reliability Engineer (Database) to lead how we scale, operate, and optimize Kombo's Postgres environments. You'll work closely with our platform team, product engineers, and CTO to evolve our database architecture, balancing performance, scalability, and developer experience.
This is a hands-on role for someone who loves digging into performance metrics, understanding query plans, and making distributed systems fast and reliable.
What You'll Do
- Own & evolve Postgres infrastructure: design, optimize, and maintain databases handling terabytes of data.
- Performance & tuning: manage indexing, vacuuming, partitioning, and connection pooling (e.g. PgBouncer).
- Scalability: plan and execute replication, sharding, and read replica strategies.
- Observability: instrument monitoring, alerting, and metrics for database performance and reliability (e.g. query latency, I/O, locks).
- Incident response: lead database-related incidents, drive root cause analysis, and define preventive measures.
- Reliability automation: use Terraform and scripting (Python, Go, or SQL) to automate database provisioning, backups, and maintenance.
- Collaboration: partner with the platform and backend teams to build systems that are fast, observable, and fault-tolerant.
- Operational excellence: define SLIs/SLOs/error budgets for data reliability, and ensure we can sleep soundly at night.
Requirements
- 6-8+ years in database reliability, infrastructure, or backend engineering roles.
- Deep expertise with Postgres internals, performance tuning, and optimization at scale (terabyte+ datasets).
- Strong understanding of replication, failover, and disaster recovery strategies.
- Familiarity with Kubernetes and GCP environments (or equivalent cloud infra).
- Proficiency with Infrastructure-as-Code (Terraform) and automation.
- Solid troubleshooting experience across performance, scaling, and reliability incidents.
- Hands-on coding skills in SQL and one modern programming language (Python, Go, or TypeScript).
- Collaborative, pragmatic, and calm under pressure; low ego, high ownership.
Nice to Have
- Experience with database sharding and multi-tenant architectures.
- Knowledge of database observability tools (e.g. pg_stat_statements, Prometheus, Grafana).
- Familiarity with CI/CD pipelines and schema migration tooling.
- Experience mentoring engineers or contributing to database best practices.
- Exposure to compliance or security frameworks (SOC 2, ISO 27001, GDPR).