DevOps / Site Reliability Engineer

Bayside Solutions

Cupertino, United States of America

11 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Compensation

$135K

Job location

Remote

Cupertino, United States of America

Tech stack

Cloud Computing

Continuous Integration

DevOps

Distributed Systems

Monitoring of Systems

Python

Network Troubleshooting

Log Analysis

Reliability Engineering

Prometheus

Datadog

Data Logging

Computer Networking Systems

Grafana

Kubernetes

Deployment Automation

Splunk

Job description

We are looking for a highly motivated DevOps / Site Reliability Engineer to support large-scale Kubernetes-based infrastructure and platform operations. This role is focused on building, automating, and operating highly reliable systems that power critical engineering platforms and services.

Design, build, automate, and support scalable Kubernetes-based platforms and services
Operate and troubleshoot production environments running at scale
Develop automation and tooling to improve operational efficiency and reliability
Monitor platform health, performance, and availability using observability tooling
Troubleshoot infrastructure, application, and networking issues across distributed systems
Work closely with engineering teams to improve deployment, reliability, and scalability practices
Participate in operational support, incident response, and root cause analysis
Improve CI/CD workflows and deployment automation
Drive operational excellence through documentation, automation, and process improvements
Take ownership of projects and independently drive deliverables to completion

Requirements

Strong hands-on experience with Kubernetes platforms such as EKS, GKE, AKS, or similar
Experience running and supporting applications on Kubernetes at scale
Strong understanding of containerized infrastructure and distributed systems
Experience with monitoring and observability tools, preferably Grafana and Prometheus
Experience with CI/CD pipelines and deployment automation
Experience with Splunk logging, log analysis, and troubleshooting
Strong scripting and automation experience using Python and/or Golang
Experience troubleshooting production systems under pressure
Strong communication and collaboration skills
Self-starter mentality with strong ownership and accountability

Experience operating Ray clusters/services
Strong networking and troubleshooting experience
Experience with cloud infrastructure and platform services
Experience with Infrastructure as Code and automation frameworks
Experience supporting high-scale production systems
Familiarity with SRE principles and operational best practices
