Site Reliability Engineer supporting backend services
Job Cloud Inc.
San Jose, United States of America
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Intermediate
Job location: San Jose, United States of America
Tech stack
Artificial Intelligence
Systems Engineering
Bash
Software as a Service
Cloud Computing
Software Documentation
Collaborative Software
Continuous Integration
Linux
Distributed Systems
Python
Operational Data Store
Reliability Engineering
Git
Kubernetes
Infrastructure Automation Frameworks
Information Technology
Software Version Control
Docker
Go
Programming Languages
Job description
As a Site Reliability Engineer supporting backend services for a large-scale SaaS collaboration platform, you will play a critical role in ensuring the reliability, scalability, and resilience of services used by millions of users globally. This role focuses on operational excellence, automation, and continuous improvement across cloud and hybrid environments. The position is based in San Jose and requires onsite presence three days per week.
- Own deployment, operation, and reliability of critical collaboration services across cloud and hybrid environments
- Design, enhance, and optimize CI/CD pipelines and automation frameworks, including AI-driven tooling for deployment, monitoring, and incident response
- Lead complex production incident response, perform root cause analysis, and drive long-term reliability and performance improvements
- Leverage observability and operational data to support capacity planning, scaling decisions, and resource optimization
- Establish and promote operational best practices, documentation standards, and a culture of reliability, accountability, and continuous improvement
Technologies: Docker, Kubernetes, Linux, Python, Go, Bash, CI/CD platforms, Git-based version control, cloud and hybrid infrastructure, monitoring and observability tooling
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- Three to five years of experience in Site Reliability Engineering, Cloud Operations, Systems Engineering, or a related role
- Strong hands-on experience operating production services using Docker and Kubernetes in cloud or hybrid environments
- Proficiency in one or more scripting or programming languages such as Python, Go, or Bash for automation and operational tooling
- Experience with monitoring, observability, incident response, on-call operations, and post-incident reviews in production environments
- Solid understanding of Linux systems, networking, distributed systems, CI/CD pipelines, infrastructure as code, and Git-based development workflows