Site Reliability Engineer II

Alibaba Cloud

Bellevue, United States of America

1 month ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Intermediate

Compensation

$173K

Job location

Remote

Bellevue, United States of America

Tech stack

Java

Big Data

Cloud Computing

Cloud Engineering

Databases

Distributed Systems

Identity and Access Management

Information Sciences

Python

Network Architecture

NoSQL

Reliability Engineering

Prometheus

Memory Leaks

SQL Databases

Database Engines

Virtualization Technology

Data Logging

Load Balancing

Cloud Platform System

Istio

System Availability

Grafana

Amazon Web Services (AWS)

Containerization

Kubernetes

Information Technology

Deployment Automation

Job description

Platform Stability & High Availability: Conduct health checks, risk assessments, and preventive maintenance for database platform components. Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and cloud-native technologies. Optimize network architecture and Kubernetes (k8s) cluster operations for database services.

Operational Tooling & Automation: Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics. Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).

Incident Management & Optimization: Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats. Collaborate with product teams to refine architectures, reduce latency, and improve availability.

Cross-Functional Collaboration: Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.

Research and Development of Database Platform Infrastructure

Systems & Products: The employee will design and support Database-as-a-Service (DBaaS) platforms. This includes cloud-native database engines (such as PolarDB, RDS, or similar distributed SQL/NoSQL databases) and their control-plane orchestration systems.

Research Areas: Conduct research on Distributed Consensus Protocols (e.g., Paxos, Raft) to ensure data consistency and high availability. Research Adaptive Disaster Resilience algorithms to automate failover across multi-region cloud architectures.

Process: Lead the end-to-end lifecycle of high-availability solutions, from architectural design and prototyping to automated stress testing and chaos engineering to validate system robustness under extreme failure modes.

Large-Scale Distributed Systems Management & Tooling

Equipment & Systems: Work extensively with Kubernetes (K8s) orchestration, focusing on Custom Resource Definitions (CRDs) and Operators to manage stateful database workloads.

Tools & Technologies: Develop and maintain internal automation platforms using languages such as Go (Golang), Java, or Python. Utilize Prometheus, Grafana, and OpenTelemetry to build advanced observability frameworks that provide real-time telemetry and predictive diagnostics for thousands of database nodes.

Specific Projects: Development of an automated Database Fleet Management System that handles seamless patching, scaling, and migration of large-scale distributed clusters without service interruption.

Network Architecture and Cloud-Native Optimization

Technical Focus: Optimize the networking stack within virtualized environments (e.g., Service Mesh, VPC configurations, Load Balancers) to minimize tail latency and maximize throughput for database traffic.

Industry Application: These duties are situated within the Cloud Computing and Information Technology Services industry, specifically focusing on Infrastructure-as-Software and Large-Scale Data Management.

Incident Management and Security Performance

Process: Implement a systematic approach to Root Cause Analysis (RCA) for complex live-site incidents involving performance bottlenecks, such as CPU saturation, I/O wait times, or memory leaks in distributed environments.

Security: Design and implement automated security auditing tools to ensure database components comply with industry standards (e.g., encryption at rest/in transit, identity and access management).

Telecommuting may be permitted. When not telecommuting, must report to worksite.

Requirements

Bachelor's degree or foreign degree equivalent in Computer Science, Information Science, or related field.
2 years of experience as a Site Reliability Engineer II, or in any other related occupation or position.
