Site Reliability Engineer

National Oilwell Varco

Houston, United States of America

yesterday

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Houston, United States of America

Tech stack

Amazon Web Services (AWS)

Azure

Bash

C Sharp (Programming Language)

Cloud Computing

Computer Programming

Continuous Integration

DevOps

Distributed Systems

GitHub

Python

PostgreSQL

Performance Tuning

PowerShell

Query Optimization

Reliability Engineering

Prometheus

Datadog

Data Logging

Grafana

GitLab

GitLab CI

Kubernetes

Low Latency

Deployment Automation

Terraform

Software Version Control

Bamboo

Docker

Job description

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management

Maintain and monitor production systems for availability, latency, and performance.
Lead incident response efforts, including communication, resolution, and postmortem documentation.
Design and implement health checks, alerting systems, and automated remediation workflows.
Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights

Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.
Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.
Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization

Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.
Work with developers to evolve architecture and improve system throughput, latency, and stability.
Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation

Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.
Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.

Requirements

5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
Expertise in Kubernetes and container orchestration at scale.
Strong experience with Akka.NET or similar actor-based frameworks.
Proficiency with scripting and automation (Bash, PowerShell, Python).
Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).
Hands-on experience with cloud platforms (AWS, Azure, or GCP).
Strong PostgreSQL knowledge: performance tuning, query optimization, and maintenance.
Proven ability to lead incident management and drive postmortem processes.
A builder's mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience

CI/CD: GitHub Actions, Azure Pipelines, GitLab CI
Infrastructure: Kubernetes, Docker, Terraform
Monitoring: Phobos (Akka.NET), Datadog, Prometheus
Source Control: GitHub, GitLab, Azure DevOps
Programming: C#, Python, Bash, PowerShell
