Site Reliability Engineer II

Alibaba Cloud
Bellevue, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
$ 173K

Job location

Remote
Bellevue, United States of America

Tech stack

Java
Big Data
Cloud Computing
Cloud Engineering
Databases
Distributed Systems
Identity and Access Management
Information Sciences
Python
Network Architecture
NoSQL
Reliability Engineering
Prometheus
Memory Leaks
SQL Databases
Database Engines
Virtualization Technology
Data Logging
Load Balancing
Cloud Platform System
Istio
System Availability
Grafana
Amazon Web Services (AWS)
Containerization
Kubernetes
Information Technology
Deployment Automation
Go

Job description

Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (k8s) cluster operations for database services.

Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).

Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.

Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.

  1. Research and Development of Database Platform Infrastructure

Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.

Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.

Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.

  2. Large-Scale Distributed Systems Management & Tooling

Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.

Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.

Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.

  3. Network Architecture and Cloud-Native Optimization

Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.

Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.

  4. Incident Management and Security Performance

Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.

Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).

Telecommuting may be permitted. When not telecommuting, must report to worksite.

Requirements

  • Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or related field.
  • 2 years of experience as a Site Reliability Engineer II or in a related occupation or position.
