Site Reliability Engineer II
Job description
Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (K8s) cluster operations for database services.
Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).
Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.
Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.
- Research and Development of Database Platform Infrastructure
Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.
Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.
Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.
- Large-Scale Distributed Systems Management & Tooling
Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.
Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.
Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.
- Network Architecture and Cloud-Native Optimization
Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.
Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.
- Incident Management and Security Performance
Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.
Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).
Telecommuting may be permitted. When not telecommuting, must report to worksite.
Requirements
- Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or related field.
- 2 years of experience as a Site Reliability Engineer II, or in any other related occupation, job title, or position.