Site Reliability Engineer Lead
ESG
Houston, United States of America
yesterday
Role details
Contract type: Permanent contract
Employment type: Full-time (> 32 hours)
Working hours: Regular working hours
Languages: English
Experience level: Senior
Job location: Houston, United States of America
Tech stack
Airflow
Amazon Web Services (AWS)
Google BigQuery
Cloud Storage
Continuous Integration
Information Engineering
Data Systems
DevOps
Data Flow Control
Github
Identity and Access Management
Python
Operational Data Store
Reliability Engineering
Site Reliability Engineering Practices
Cloud Services
Data Streaming
Data Logging
Google Cloud Platform
Cloud Platform System
Cloud Monitoring
Gitlab-ci
Kubernetes
Information Technology
Data Management
Terraform
Job description
- Lead SRE practices for Google Cloud Platform-based data platforms
- Design and own SLIs, SLOs, error budgets, and reliability metrics
- Build and maintain cloud-native observability (monitoring, logging, alerting)
- Lead incident response for production cloud systems and drive postmortems
- Partner with data engineering and platform teams to design reliable architectures
- Automate operational workflows using Python
- Drive improvements in CI/CD, infrastructure as code, and deployment safety
- Mentor engineers and set SRE best practices across the team
Requirements
We are seeking a Site Reliability Engineer Lead to own and evolve the reliability, scalability, and operational excellence of cloud-native data platforms running primarily on Google Cloud Platform (GCP). This role supports data systems that ingest, process, and serve large volumes of operational data from oilfield and energy environments. The ideal candidate is a cloud-first SRE with deep Google Cloud Platform experience, strong Python engineering skills, and a track record of leading reliability initiatives for data-intensive systems.
- 7+ years in SRE, Cloud Platform Engineering, or DevOps
- Strong hands-on experience with Google Cloud Platform, including:
- Google Cloud Platform: GKE, Compute Engine, Cloud Storage, Pub/Sub (or equivalents)
- Cloud Monitoring & Logging
- BigQuery
- Dataflow
- Datastream
- IAM and networking
- Composer/Airflow
- Kubernetes: deployment, scaling, reliability patterns
- CI/CD: GitHub Actions, GitLab CI, or similar
- Observability: Google Cloud Platform Cloud Monitoring, Logging
- Experience supporting cloud-native data systems (batch and streaming)
- Production experience with Python for automation, tooling, or services
- Infrastructure as Code experience (Terraform strongly preferred)
- Experience operating systems in 24/7 production environments
- Bachelor's degree in Business, Information Technology, Computer Science, or a related field
- 5+ years experience in Site Reliability Engineering, Cloud Platform Engineering, or DevOps
- 3+ years operating production workloads on Google Cloud Platform (GCP)
- Prior technical leadership experience (lead engineer, tech lead, or ownership of reliability initiatives)
- Ability to understand and speak English at a level of proficiency that allows the employee to issue, receive, and respond to both safety and operations-related directions
Preferred Qualifications:
- Oil and Gas Industry knowledge
- Technology/Digital Industry knowledge
About the company
The Evolving Oil Field Demands Evolving Service Providers
NexTier is a leading provider of integrated completions that employs sustainable practices and equipment to support our customers' ESG goals while accelerating production in the most demanding US land basins.
Patterson-UTI is committed to a workplace free from discrimination and harassment, offering equal employment opportunities to all individuals regardless of personal characteristics protected by law. Employees are encouraged to report any concerns through multiple channels.