Staff Software Development Engineer

CVS Health

Baton Rouge, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$ 284K

Job location

Baton Rouge, United States of America

Tech stack

Java

Artificial Intelligence

Amazon Web Services (AWS)

Azure

Software as a Service

Cloud Engineering

Cluster Analysis

Decision Support Systems

DevOps

Distributed Systems

Fault Tolerance

Monitoring of Systems

Information Technology Operations

Python

Machine Learning

Performance Tuning

Systems Development Life Cycle

Reliability Engineering

Site Reliability Engineering Practices

Prometheus

Software Engineering

Datadog

Diagnostic Tools

Grafana

MTTR

CloudFormation

Kubernetes

Information Technology

Performance Monitor

ArcSight Event Correlation

Terraform

AppDynamics

Dynatrace

Microservices

Job description

Join Fortune 7 CVS Health as a Staff Software Engineer to lead and advance our Site Reliability Engineering (SRE), AIOps, Observability, and Monitoring capabilities in the CVS Digital team. This role is critical in advancing intelligent, automated, and scalable reliability practices across our platforms. You will drive the evolution from traditional monitoring to AI-driven operations (AIOps), leveraging automation, machine learning, and advanced analytics to improve system resilience, reduce operational toil, and accelerate incident detection and resolution. As a technical leader, you will influence architecture, build platforms, and mentor teams to embed reliability, observability, and automation into the software delivery lifecycle.

SRE Strategy & Reliability Engineering

Define and implement enterprise-wide SRE practices, including SLIs, SLOs, error budgets, and reliability governance.
Drive a culture of reliability, automation, and continuous improvement across engineering teams.
Establish metrics-driven approaches to measure system health, availability, and performance.

AIOps & Intelligent Operations

Lead adoption of AIOps solutions to enable predictive monitoring, anomaly detection, and automated root cause analysis.
Integrate machine learning models and analytics into monitoring pipelines to proactively detect and prevent incidents.
Develop intelligent alerting systems to reduce noise and improve signal quality.

Observability & Monitoring Platforms

Architect and build scalable observability frameworks covering metrics, logs, traces, and events.
Define standards for instrumentation, telemetry collection, and distributed tracing.
Enable real-time insights into system performance across microservices and cloud-native architectures.

Incident Management & Automation

Lead incident response practices, including on-call readiness, RCA, postmortems, and continuous learning loops.
Build self-healing systems and automate remediation workflows to reduce Mean Time to Resolution (MTTR).
Implement runbooks, playbooks, and automated escalations.

Platform Engineering & Tooling

Develop internal platforms and tools for observability, monitoring, and performance optimization.
Integrate observability into CI/CD pipelines to enable proactive quality and reliability checks.
Drive infrastructure automation using Infrastructure as Code (IaC) frameworks and GitOps principles.

Collaboration & Technical Leadership

Partner with engineering, platform, and product teams to embed reliability and observability into system design.
Mentor engineers and lead design reviews focused on scalability, resilience, and operability.
Influence enterprise architecture decisions and promote best practices across teams.

Requirements

Do you have experience in system performance monitoring?

5+ years of experience in software engineering, SRE, or production engineering in large-scale distributed systems.
Hands-on experience with Observability tools such as AppDynamics, Grafana, Prometheus, Datadog, OpenTelemetry, or similar.
Experience with AIOps or intelligent monitoring platforms, including anomaly detection and event correlation.
Strong expertise in cloud platforms (AWS, Azure, or GCP) and cloud-native architectures (Kubernetes, containers, microservices).
Proficiency in at least one programming language (e.g., Python, Java, Go).
Strong understanding of distributed systems, resiliency patterns, and fault tolerance.
Experience implementing incident management, on-call processes, and root cause analysis.
Hands-on expertise with Infrastructure as Code (Terraform, ARM, CloudFormation) and CI/CD pipelines.
Experience using GenAI and automation tools and frameworks such as OpenAI, Copilot, Gemini, Claude, MCP, etc.
Proven ability to design scalable, reliable, and observable systems.

Experience designing and implementing AIOps platforms or predictive reliability systems at scale.
Strong knowledge of machine learning applications in IT operations (e.g., anomaly detection, forecasting, clustering).
Experience defining and managing SLIs/SLOs and error budgets at scale.
Experience with OpenTelemetry and modern observability standards.
Familiarity with chaos engineering, resilience testing, and fault injection frameworks.
Exposure to GenAI-driven operations or AI-assisted troubleshooting tools.
Experience in healthcare, finance, enterprise SaaS, or highly regulated industries.
Demonstrated leadership in driving cross-functional initiatives and influencing senior stakeholders.
Contributions to open-source projects in SRE, observability, or AIOps domains.

Education:

Bachelor's degree or equivalent work experience in Computer Science, Engineering, or related discipline.
Certifications in AIOps, SRE, OpenTelemetry, cloud platforms, or DevOps are a plus.

Leadership Competencies:

Strategic thinking and execution excellence
Strong communication and stakeholder influence
Data-driven decision making
Continuous improvement mindset

Benefits & conditions

Pulled from the full job description

Health insurance
Paid time off
Vision insurance
Dental insurance

$106,605.00 - $284,280.00

This pay range represents the base hourly rate or base annual full-time salary for all positions in the job grade within which this position falls. The actual base salary offer will depend on a variety of factors including experience, education, geography and other relevant factors. This position is eligible for a CVS Health bonus, commission or short-term incentive program in addition to the base pay range listed above. This position also includes an award target in the company's equity award program.

Our people fuel our future. Our teams reflect the customers, patients, members and communities we serve and we are committed to fostering a workplace where every colleague feels valued and that they belong.

Great benefits for great people

We take pride in offering a comprehensive and competitive mix of pay and benefits that reflects our commitment to our colleagues and their families. This full-time position is eligible for a comprehensive benefits package designed to support the physical, emotional, and financial well-being of colleagues and their families. The benefits for this position include medical, dental, and vision coverage, paid time off, retirement savings options, wellness programs, and other resources, based on eligibility.
