Staff Observability Platform Engineer (SRE)

CVS Health
Richardson, United States of America
14 days ago

Role details

Contract type
Temporary contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Intermediate
Compensation
$237K

Tech stack

Java
Amazon Web Services (AWS)
Azure
BigTable
Business Software
Cloud Computing
Cloud Engineering
Databases
Continuous Integration
Relational Databases
DevOps
Monitoring of Systems
Python
PostgreSQL
MySQL
NoSQL
Octopus Deploy
Online Transaction Processing
Reliability Engineering
Prometheus
Software Engineering
Data Streaming
Systems Integration
Cloud Platform System
Computer Network Technologies
Istio
System Availability
Grafana
Reliability of Systems
Cloudformation
Kubernetes
Infrastructure Automation Frameworks
Cassandra
Kafka
Data Management
Vertica
Terraform
Splunk
Appdynamics
Data Pipelines
Docker

Job description

CVS Health PBM is looking for hands-on, passionate people to join a high-energy, growing team at the forefront of digital innovation, one that aims to reinvent what a pharmacy and health care company can be in the digital world.

As a Lead Platform Reliability Engineer, you will design and implement metrics and observability frameworks with a strong focus on service level objectives (SLOs), service level indicators (SLIs), error budgets, and cloud infrastructure scaling and capacity estimation.

This individual contributor role is critical to enhancing our monitoring and observability capabilities, while also driving automation initiatives related to quality gates within the release engineering process. You will work closely with cross-functional teams to ensure the reliability, performance, and scalable growth of our cloud-based systems.

Expectations for the Role:

Metrics Development: Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance. Ensure alignment with business objectives and operational goals.

Error Budgets: Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery. Analyze incidents and outages to inform adjustments to error budgets.

Monitoring & Observability: Design and implement comprehensive monitoring solutions to provide real-time visibility into system health. Use tools such as Prometheus, Grafana, Loki, Tempo, and other observability platforms to create dashboards and alerts.

Cloud Infrastructure Scaling: Architect, design, and implement scalable cloud infrastructure capable of supporting multiple business applications, ensuring reliability, performance, and future growth.

Quality Gates Automation: Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards. Lead the release DevOps team to integrate these gates into the CI/CD pipeline.

Incident Management: Assist in incident response efforts by providing insights from metrics and monitoring tools. Conduct post-mortem analyses to identify root causes and recommend preventive measures.

Requirements

  • 10+ years of experience in Software Engineering, Platform Engineering, or SRE.

  • 7+ years of experience with observability practices, including SLIs/SLOs/SLAs, alerting, and incident management.

  • 7+ years building production-grade backend services in Java/Python.

  • 7+ years implementing and operating OpenTelemetry, including OTLP, semantic conventions, and instrumentation patterns.

  • 7+ years with cloud-native and containerized platforms (Docker, Kubernetes, Argo CD).

  • 7+ years working with public cloud platforms (AWS, GCP, or Azure).

  • 5+ years designing and scaling distributed, high-volume data pipelines.

  • 5+ years working with Grafana OSS or comparable observability backends (e.g., Grafana, Loki, Tempo, Prometheus).

  • 5+ years with relational databases (PostgreSQL, MySQL).

Preferred qualifications

  • Excellent analytical skills and the ability to communicate complex technical concepts to non-technical stakeholders

  • Experience with service meshes and networking technologies such as Envoy and Istio

  • Experience integrating or operating commercial observability platforms (Splunk, AppDynamics, etc.)

  • Experience with streaming and data platforms such as Kafka, Pulsar, or similar technologies

  • Familiarity with time-series, NoSQL, or analytical databases (ClickHouse, Bigtable, Cassandra, etc.)

  • Experience with Infrastructure as Code tools such as Terraform or CloudFormation

  • Experience with cost optimization and capacity planning for large-scale cloud infrastructure

  • Experience with chaos engineering, resiliency testing, or fault injection

  • Background in security-aware platform design, including secure service-to-service communication

  • Experience mentoring senior engineers and influencing platform standards across organizations

  • Strong operational experience supporting 24x7 production systems, including on-call responsibilities

  • Knowledge of security best practices in cloud environments

  • Bachelor's degree or equivalent experience (HS diploma + 4 years relevant experience)

Benefits & conditions

$118,450.00 - $236,900.00

This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families.

This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.
