Lead Data SRE (Hybrid - Chennai, India)

Insight Global
Irvine, United States of America
1 month ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior
Compensation
$58K

Job location

Irvine, United States of America

Tech stack

Amazon Web Services (AWS)
Automation of Tests
Azure
Google BigQuery
Cloud Computing
Information Engineering
Data Infrastructure
ETL
Data Systems
Data Warehousing
Distributed Systems
Fault Tolerance
Performance Tuning
Reliability Engineering
Prometheus
Data Streaming
Datadog
System Availability
Snowflake
Grafana
CloudFormation
Containerization
Kubernetes
Kafka
Data Management
Terraform
Data Pipelines
Docker
Databricks

Job description

The Data SRE Lead is responsible for ensuring the reliability, scalability, performance, and operational excellence of the organization's data platforms and pipelines. This role bridges Data Engineering and Site Reliability Engineering practices, applying SRE principles to modern data ecosystems (batch, streaming, warehousing, and ML data infrastructure). This is a hybrid role based at the client's Chennai, India location 3 days per week.

Reliability & Operations

Define and own SLIs, SLOs, and SLAs for data platforms and pipelines
Design and implement monitoring, alerting, and observability solutions
Lead incident response, root cause analysis (RCA), and postmortems
Reduce toil through automation and self-healing infrastructure

Data Platform Stability

Ensure high availability of data warehouses and lakehouses, streaming systems, ETL/ELT pipelines, and orchestration frameworks
Implement capacity planning and performance tuning strategies
Improve data pipeline reliability, freshness, and latency metrics

Infrastructure & Automation

Manage infrastructure-as-code (IaC) frameworks
Improve CI/CD pipelines for data workflows
Implement automated testing and validation for data infrastructure
Drive resilience patterns such as retries, circuit breakers, and graceful degradation

Leadership & Strategy

Lead and mentor a team of Data SREs
Define operational standards and reliability roadmaps
Collaborate cross-functionally with Data, Engineering, and Product leadership
Drive a culture of reliability and operational excellence

Requirements

8+ years in Site Reliability Engineering, Platform Engineering, or Data Engineering
3+ years in a technical leadership role
Strong experience with cloud platforms (AWS, GCP, or Azure), Infrastructure as Code (Terraform, CloudFormation), monitoring tools (Prometheus, Datadog, Grafana), and containerization and orchestration (Docker, Kubernetes)
Deep understanding of distributed systems and failure modes
Experience supporting large-scale data systems (batch & streaming)

Nice to Have Skills & Experience

Experience with modern data platforms (Snowflake, BigQuery, Databricks)
Experience with streaming systems (Kafka, Pub/Sub, Kinesis)
Knowledge of data quality frameworks and data observability
Familiarity with ML platform reliability

Benefits & conditions

Benefit packages for this role will start on the 1st day of employment and include medical, dental, and vision insurance, as well as HSA, FSA, and DCFSA account options, and 401k retirement account access with employer matching. Employees in this role are also entitled to paid sick leave and/or other paid time off as provided by applicable law.
