DevOps / Site Reliability Engineer

Bayside Solutions
Cupertino, United States of America
11 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$135K

Job location

Remote
Cupertino, United States of America

Tech stack

Cloud Computing
Continuous Integration
DevOps
Distributed Systems
Monitoring of Systems
Python
Network Troubleshooting
Log Analysis
Reliability Engineering
Prometheus
Datadog
Data Logging
Computer Networking Systems
Grafana
Kubernetes
Deployment Automation
Splunk
Go

Job description

We are looking for a highly motivated DevOps / Site Reliability Engineer to support large-scale Kubernetes-based infrastructure and platform operations. This role is focused on building, automating, and operating highly reliable systems that power critical engineering platforms and services.

  • Design, build, automate, and support scalable Kubernetes-based platforms and services
  • Operate and troubleshoot production environments running at scale
  • Develop automation and tooling to improve operational efficiency and reliability
  • Monitor platform health, performance, and availability using observability tooling
  • Troubleshoot infrastructure, application, and networking issues across distributed systems
  • Work closely with engineering teams to improve deployment, reliability, and scalability practices
  • Participate in operational support, incident response, and root cause analysis
  • Improve CI/CD workflows and deployment automation
  • Drive operational excellence through documentation, automation, and process improvements
  • Take ownership of projects and independently drive deliverables to completion

Requirements

  • Strong hands-on experience with Kubernetes platforms such as EKS, GKE, AKS, or similar
  • Experience running and supporting applications on Kubernetes at scale
  • Strong understanding of containerized infrastructure and distributed systems
  • Experience with monitoring and observability tools, preferably Grafana and Prometheus
  • Experience with CI/CD pipelines and deployment automation
  • Experience with Splunk logging, log analysis, and troubleshooting
  • Strong scripting and automation experience using Python and/or Golang
  • Experience troubleshooting production systems under pressure
  • Strong communication and collaboration skills
  • Self-starter mentality with strong ownership and accountability
  • Experience operating Ray clusters/services
  • Strong networking and troubleshooting experience
  • Experience with cloud infrastructure and platform services
  • Experience with Infrastructure as Code and automation frameworks
  • Experience supporting high-scale production systems
  • Familiarity with SRE principles and operational best practices
