Senior Site Reliability Engineer

Insight Global
Downers Grove, United States of America
20 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Job location

Downers Grove, United States of America

Tech stack

Java
API
Artificial Intelligence
Airflow
Application Performance Management
Azure
Bash
Cloud Database
ETL
Query Languages
Distributed Systems
Intrusion Detection Systems
JSON
Python
MongoDB
Role-Based Access Control
Reliability Engineering
Prometheus
Cloudera
YAML
Data Logging
Grafana
Apigee
Git
Event Driven Architecture
Kubernetes
Low Latency
Drilldown
Kafka
Data Management
Data Pipelines
Dynatrace
PagerDuty

Job description

Lead SRE project planning and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.

  • Design and implement comprehensive SRE monitoring for distributed applications
  • Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
  • Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
  • Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards set by enterprise teams
  • Implement zero-code instrumentation for monitoring and traceability
  • Define and work with core SRE models such as SLIs, SLOs, and error budgets
  • Design dashboards for reliability-focused metrics (latency, request rate, errors, duration, availability)
  • Build service health dashboards with drill-down capabilities and error message analysis
  • Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting
  • Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams

Requirements

7+ years in SRE with proven implementation experience in Azure and GCP observability, the Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation.

  • Technical: Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
  • Logging & Tracing: Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
  • Structured Logging: JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
  • Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
  • Infrastructure: CI/CD pipelines and AI tools such as GitHub Copilot
  • Observability Tools & Query Languages: PromQL for querying metrics (Grafana)
  • Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying and maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
  • OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
  • Alerting and Incident Management: Implementing structured processes for handling failures and conducting reviews that focus on fixing system issues
  • Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL, as well as Azure-based monitoring
  • On-call Systems: Designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
  • AI experience
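
For illustration only, the structured logging and W3C Trace Context requirements above might look roughly like the following Python sketch. The field names (trace_id, service.name, log.level, customer.id, request.id) come from the requirements list. The function names, the "message" field, and the sample values are hypothetical and are not taken from the employer's actual standards.

```python
import json
import re

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags (lowercase hex).
# Sketch only: this accepts the four-field shape used by version 00.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace ID from a traceparent header, or None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, _flags = match.groups()
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id


def build_log_line(message, level, service_name, traceparent, customer_id, request_id):
    """Build one JSON log line carrying the correlation fields named in the posting."""
    record = {
        "message": message,  # hypothetical field, not listed in the posting
        "trace_id": parse_traceparent(traceparent),
        "service.name": service_name,
        "log.level": level,
        "customer.id": customer_id,
        "request.id": request_id,
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Sample header in the format shown in the W3C Trace Context specification.
    sample = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(build_log_line("order accepted", "INFO", "orders-api", sample, "c-123", "r-456"))
```

Because every log line carries the same trace_id as the incoming request, logs can be joined with traces and metrics in a backend such as Grafana, which is the correlation the role description asks for.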
