Sr. Site Reliability Engineer

VDart, Inc.

Frisco, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Compensation

$147K

Job location

Frisco, United States of America

Tech stack

Java

Agile Methodologies

Artificial Intelligence

Amazon Web Services (AWS)

Audit Trail

Azure

Bash

Oracle WebLogic Server

Unix

Cloud Engineering

Computer Programming

Databases

Continuous Integration

Information Engineering

Data Governance

Data Systems

Cursor (Graphical User Interface Elements)

DevOps

Disaster Recovery

Middleware

Monitoring of Systems

Python

Key Management

Log Analysis

Windows Server

MySQL

Oracle Data Service Integrator

Oracle Applications

Productivity Software

RabbitMQ

Reliability Engineering

Ansible

Standard SQL

Software Engineering

SQL Databases

Private Cloud Environment

Data Logging

Scripting (Bash/Python/Go/Ruby)

CyberArk

System Availability

Delivery Pipeline

Large Language Models

Snowflake

Prompt Engineering

Reliability of Systems

Generative AI

GitLab

Containerization

Kubernetes

Infrastructure Automation Frameworks

Information Technology

Deployment Automation

Kafka

Virtual Agents

REST

Terraform

Splunk

AppDynamics

Data Pipelines

Docker

Jenkins

Databricks

Microservices

Job description

The Senior Engineer, Systems Reliability (SRE) - Privacy ensures the stability, performance, and reliability of IT services and infrastructure. This role combines software engineering and operations expertise to build and maintain highly available, scalable systems. As a leader in DevOps and cloud reliability practices, the engineer supports continuous improvement of automation, deployment pipelines, observability, and incident management, while mentoring junior engineers and optimizing production workflows. The position plays a critical part in enabling software to be delivered faster, better, and more reliably to support business and customer needs.

What You'll Do

Build and maintain CI/CD pipelines for data engineering deployments using GitLab and Azure DevOps.
Design and maintain CI/CD pipelines and DevOps automation solutions for REST APIs and microservices.
Implement robust monitoring, alerting, and logging for data pipelines, Snowflake and Azure services.
Respond to production incidents, troubleshoot failures and restore services quickly.
Perform root cause analysis and implement preventive measures.
Ensure high availability and disaster recovery planning for critical data systems.
Tune SQL queries, Snowflake features and Databricks clusters for optimal performance and cost efficiency.
Automate operational tasks to improve deployment reliability and reduce manual intervention.
Manage secrets and credentials using Azure Key Vault and CyberArk.
Hands-on experience with Terraform, Helm, or Ansible for infrastructure provisioning.
Working knowledge of containerization (Docker) and Kubernetes orchestration.
Hands-on experience with cloud platforms (Azure; AWS or GCP).
Understanding of deployment strategies (blue/green, rolling, canary), GitOps, and artifact management.
Ensure compliance with data governance, privacy regulations and organizational security standards.
Work closely with data engineers, analysts and cloud teams to ensure smooth operations.
Maintain detailed runbooks, operational documentation and incident reports.
Perform regular OS patching on Unix and Windows servers to address security vulnerabilities and maintain system stability.
Apply critical and cumulative updates for middleware components such as Oracle Data Integrator (ODI), WebLogic and related software to mitigate risks and enhance performance.
Coordinate patching schedules with application and infrastructure teams to minimize downtime and ensure business continuity.
Use AI productivity tools daily (Claude and Cursor or a similar IDE) across the SRE lifecycle, including pipeline development, scripting, runbook authoring, log analysis, and incident response.
Design, build, and operate AI agents to automate SRE tasks such as incident triage, root cause analysis, alert correlation, runbook execution, and patching workflows.
Apply foundation models, prompt engineering, and RAG patterns to operational use cases including, but not limited to, querying runbooks, summarizing incidents, and surfacing remediation guidance.
Implement audit logging, observability, and human-in-the-loop controls for AI agents and AI-assisted workflows operating in Tier-0 production environments
Build and host AI agents, identify gaps and convert them into AI agent use cases, and implement solutions to further modernize the SRE platform

Requirements

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
5-7 years of experience in systems reliability, software engineering, DevOps, or related technical roles
Experience working in Agile and DevOps delivery environments
Demonstrated ability to mentor engineers and influence technical outcomes
Strong problem-solving skills with a systems-level perspective
Strong automation and agentic AI skills
Familiarity with foundation models, prompt engineering, retrieval-augmented generation (RAG), and AI agent development applied to SRE and operational use cases

Must Have Skills

CI/CD tooling and automation experience (GitLab, Azure DevOps, Jenkins)
Experience working in public or private cloud environments
Proficiency in one or more programming or scripting languages (Python, Java, Shell, etc.)
Experience with monitoring, logging, and APM tools such as AppDynamics, Splunk, or equivalents
Strong understanding of system reliability concepts including scalability, performance, availability, and resilience
Strong experience in writing SQL queries, analyzing logs, and troubleshooting issues
Databases: SQL (Oracle, MySQL, Snowflake)
Messaging: Kafka, RabbitMQ
Hands-on experience with AI productivity tools (Claude and Cursor or a similar IDE) and working knowledge of foundation models, prompt engineering, RAG, and AI agent development
Experience with containerization and orchestration technologies such as Docker and Kubernetes

Nice to Have

Experience migrating systems to cloud-native architectures
Familiarity with reliability metrics, service monitoring, or operational dashboards
Exposure to platform engineering or shared services environments

Key Skills: CI/CD, automation, AppDynamics, Splunk, Kafka, RabbitMQ, AI productivity tools
