DevOps / Site Reliability Engineer
Bayside Solutions
Cupertino, United States of America
11 days ago
Role details
Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Compensation
$135K
Job location
Remote
Cupertino, United States of America
Tech stack
Cloud Computing
Continuous Integration
DevOps
Distributed Systems
Monitoring of Systems
Python
Network Troubleshooting
Log Analysis
Reliability Engineering
Prometheus
Datadog
Data Logging
Computer Networking Systems
Grafana
Kubernetes
Deployment Automation
Splunk
Go
Job description
We are looking for a highly motivated DevOps / Site Reliability Engineer to support large-scale Kubernetes-based infrastructure and platform operations. This role is focused on building, automating, and operating highly reliable systems that power critical engineering platforms and services.
- Design, build, automate, and support scalable Kubernetes-based platforms and services
- Operate and troubleshoot production environments running at scale
- Develop automation and tooling to improve operational efficiency and reliability
- Monitor platform health, performance, and availability using observability tooling
- Troubleshoot infrastructure, application, and networking issues across distributed systems
- Work closely with engineering teams to improve deployment, reliability, and scalability practices
- Participate in operational support, incident response, and root cause analysis
- Improve CI/CD workflows and deployment automation
- Drive operational excellence through documentation, automation, and process improvements
- Take ownership of projects and independently drive deliverables to completion
Requirements
- Strong hands-on experience with Kubernetes platforms such as EKS, GKE, AKS, or similar
- Experience running and supporting applications on Kubernetes at scale
- Strong understanding of containerized infrastructure and distributed systems
- Experience with monitoring and observability tools, preferably Grafana and Prometheus
- Experience with CI/CD pipelines and deployment automation
- Experience with Splunk logging, log analysis, and troubleshooting
- Strong scripting and automation experience using Python and/or Golang
- Experience troubleshooting production systems under pressure
- Strong communication and collaboration skills
- Self-starter mentality with strong ownership and accountability

- Experience operating Ray clusters/services
- Strong networking and troubleshooting experience
- Experience with cloud infrastructure and platform services
- Experience with Infrastructure as Code and automation frameworks
- Experience supporting high-scale production systems
- Familiarity with SRE principles and operational best practices