Site Reliability Engineer

National Oilwell Varco
Houston, United States of America
yesterday

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Houston, United States of America

Tech stack

Amazon Web Services (AWS)
Azure
Bash
C# (Programming Language)
Cloud Computing
Computer Programming
Continuous Integration
DevOps
Distributed Systems
GitHub
Python
PostgreSQL
Performance Tuning
PowerShell
Query Optimization
Reliability Engineering
Prometheus
Datadog
Data Logging
Grafana
GitLab
GitLab CI
Kubernetes
Low Latency
Deployment Automation
Terraform
Software Version Control
Bamboo
Docker

Job description

As a Site Reliability Engineer, you will be responsible for:

Operational Excellence & Incident Management

  • Maintain and monitor production systems for availability, latency, and performance.

  • Lead incident response efforts, including communication, resolution, and postmortem documentation.

  • Design and implement health checks, alerting systems, and automated remediation workflows.

  • Drive root cause analysis and implement permanent resolutions for recurring issues.

Observability & Insights

  • Set up and maintain full observability stacks (logging, metrics, tracing) using tools like Prometheus, Grafana, Datadog, OpenTelemetry, or ELK.

  • Analyze telemetry and logs to identify trends, anomalies, and opportunities for improvement.

  • Conduct post-incident reviews and use insights to inform future engineering investments.

Performance & Systems Optimization

  • Tune and optimize distributed systems, including Akka.NET actors, for performance and resource efficiency.

  • Work with developers to evolve architecture and improve system throughput, latency, and stability.

  • Optimize PostgreSQL performance, queries, and maintenance strategies.

CI/CD & Automation

  • Design and maintain modern CI/CD pipelines using GitHub Actions, Azure Pipelines, or GitLab CI.

  • Automate deployment, testing, and rollback processes to reduce friction and increase deployment frequency.

Requirements

  • 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.

  • Expertise in Kubernetes and container orchestration at scale.

  • Strong experience with Akka.NET or similar actor-based frameworks.

  • Proficiency with scripting and automation (Bash, PowerShell, Python).

  • Experience with observability tools (Phobos, Datadog, Prometheus, Grafana, OpenTelemetry, ELK).

  • Hands-on experience with cloud platforms (AWS, Azure, or GCP).

  • Strong PostgreSQL knowledge: performance tuning, query optimization, and maintenance.

  • Proven ability to lead incident management and drive postmortem processes.

  • A builder's mindset with high standards for operational excellence and technical ownership.

Preferred Tools & Ecosystem Experience

  • CI/CD: GitHub Actions, Azure Pipelines, GitLab CI

  • Infrastructure: Kubernetes, Docker, Terraform

  • Monitoring: Phobos (Akka.NET), Datadog, Prometheus

  • Source Control: GitHub, GitLab, Azure DevOps

  • Programming: C#, Python, Bash, PowerShell
