Senior Site Reliability Engineer
Job description
focused only on Kubernetes, AWS, or infrastructure administration. The hiring manager is seeking a senior-level SRE/Infrastructure professional who understands the broader infrastructure ecosystem, production operations, observability, reliability engineering, and business impact of the systems they support.
Key Responsibilities
Infrastructure & Reliability Engineering
- Own and maintain production services and infrastructure.
- Ensure platform availability, reliability, scalability, and performance.
- Monitor and troubleshoot infrastructure across cloud and on-prem environments.
- Take ownership of services end-to-end rather than only supporting individual technologies.
- Participate in incident response and production issue management.
Observability & Monitoring
- Design, build, and maintain monitoring solutions.
- Create and manage:
  - SLIs (Service Level Indicators)
  - SLOs (Service Level Objectives)
  - SLAs (Service Level Agreements)
- Build monitoring dashboards and reliability metrics in Grafana.
- Measure system health, traffic, performance, error rates, and resource utilization.
Production Operations
- Participate in on-call rotation (every 3 weeks).
- Respond to production incidents and service outages.
- Coordinate with global teams during incidents.
- Drive root-cause analysis and service improvements.
Cross-Functional Collaboration
- Work independently across multiple teams.
- Drive initiatives from inception to completion.
- Coordinate with engineering, platform, infrastructure, and operations teams.
- Operate in an agile/sprint-based environment with strong accountability.
AI & Automation
- Demonstrate practical understanding of AI beyond basic prompting.
- Understand:
  - AI-assisted automation
  - AI SDK deployment
  - MCP (Model Context Protocol)
  - AI workflows and operational use cases
- Leverage AI to improve infrastructure automation and operational efficiency.
Requirements
- 4+ years in Site Reliability Engineering, DevOps, or related operational roles, with proven experience in Linux/Unix systems administration
- Proficiency in scripting and programming languages such as Python, Bash, or Go for automation and tool development
- Strong experience with cloud infrastructure and services across GCP, AWS, and OCI, as well as container orchestration tools like Kubernetes
- Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, and Nagios
- Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Helm
- Proven ability to develop and track SLIs, SLOs, and SLAs to drive reliability improvements
Technical Knowledge
- Deep understanding of networking, DNS, load balancing, and CDN technologies
- Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
- Knowledge of CI/CD pipelines, GitLab, and deployment automation
- Experience with workflow automation platforms is a strong plus
Benefits & conditions
Estimated Min Rate: $130,000.00
Estimated Max Rate: $150,000.00
What's In It for You?
We welcome you to join one of the largest and most legendary global staffing companies to meet your career aspirations. Yoh's network of client companies has been employing professionals like you for over 65 years in the U.S., UK, and Canada. Join Yoh's extensive talent community to gain access to Yoh's vast network of opportunities, including this exclusive one. Benefit eligibility is in accordance with applicable laws and client requirements.
Benefits include:
- Medical, Prescription, Dental & Vision Benefits (for employees working 20+ hours per week)
- Health Savings Account (HSA) (for employees working 20+ hours per week)
- Life & Disability Insurance (for employees working 20+ hours per week)
- MetLife Voluntary Benefits
- Employee Assistance Program (EAP)
- 401(k) Retirement Savings Plan
- Direct Deposit & weekly e-payroll
- Referral Bonus Programs
- Certification and training opportunities