Senior Site Reliability Engineer

Insight Global

Downers Grove, United States of America

20 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Downers Grove, United States of America

Tech stack

Java

API

Artificial Intelligence

Airflow

Application Performance Management

Azure

Bash

Cloud Database

ETL

Query Languages

Distributed Systems

Intrusion Detection Systems

JSON

Python

MongoDB

Role-Based Access Control

Reliability Engineering

Prometheus

Cloudera

YAML

Data Logging

Grafana

Apigee

Git

Event Driven Architecture

Kubernetes

Low Latency

Drilldown

Kafka

Data Management

Data Pipelines

Dynatrace

PagerDuty

Job description

Lead SRE project planning and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.

Design and implement comprehensive SRE monitoring for distributed applications
Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
Integrate GCP and Azure monitoring, logging, and tracing with the existing OpenTelemetry standards set by enterprise teams
Implement zero-code instrumentation for monitoring and traceability
Experience defining and working with core SRE models such as SLIs, SLOs, and error budgets
Design reliability-focused metrics dashboards (latency, request rate, errors, duration, availability)
Build service health dashboards with drill-down capabilities and error message analysis
Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting
Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams
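
For readers less familiar with the SLI/SLO/error-budget model named above, here is a minimal sketch of how the three relate. The function names, the 99.9% target, and the request counts are hypothetical illustrations, not figures or tooling from this role.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1 - failed_requests / total_requests


def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget used, 0.0 means fully spent,
    and a negative value means the SLO has been breached.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = total_requests * (1 - slo_target)
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(availability_sli(1_000_000, 250))
    print(error_budget_remaining(1_000_000, 250, 0.999))
```

In this hypothetical window the SLO tolerates roughly 1,000 failures, so 250 failures use about a quarter of the budget.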

Requirements

7+ years in SRE with proven implementation experience in Azure and GCP observability, the Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation.

Technical: Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
Logging & Tracing: Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
Structured Logging: JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
Infrastructure: CI/CD pipelines and AI tools such as GitHub Copilot
Observability Tools & Query Languages: PromQL for querying metrics (Grafana)
Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
Alerting and incident management: implementing structured processes for handling failures and conducting reviews that focus on fixing system issues
Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL; Azure-based monitoring; and on-call systems, including designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
AI experience
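
To make the logging and tracing requirements above concrete, the following sketch parses a W3C Trace Context `traceparent` header and emits a JSON log line carrying the fields listed (trace_id, service.name, log.level, customer.id, request.id). It is an illustration only: the function names, the extra `timestamp` and `message` fields, and the sample values are assumptions, and in practice an OpenTelemetry SDK would normally handle context propagation.

```python
import json
import re
from datetime import datetime, timezone

# Version "00" traceparent layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields, rejecting invalid values."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def build_log_line(message: str, level: str, service_name: str,
                   traceparent: str, customer_id: str, request_id: str) -> str:
    """Return one JSON log line using the field names from the posting."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed extra field
        "message": message,                                   # assumed extra field
        "trace_id": parse_traceparent(traceparent)["trace_id"],
        "service.name": service_name,
        "log.level": level,
        "customer.id": customer_id,
        "request.id": request_id,
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Hypothetical values; the header is the example from the W3C specification.
    print(build_log_line(
        "payment failed", "ERROR", "checkout-api",
        "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
        "c-123", "r-456",
    ))
```

Because every service writes the same trace_id into its logs, a Grafana dashboard can jump from a log line in Loki to the matching trace in Tempo.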
