Sr. Site Reliability Engineer

Qode LLC
Austin, United States of America
5 days ago

Role details

Contract type
Permanent contract
Employment type
Full-time (> 32 hours)
Working hours
Regular working hours
Languages
English
Experience level
Senior

Tech stack

API
Artificial Intelligence
Amazon Web Services (AWS)
Azure
Cloud Computing
Cloud Engineering
Noise Reduction
Distributed Systems
JSON
Machine Learning
Reliability Engineering
Prometheus
Runbook
Data Streaming
Large Language Models
Grafana
MTTR
Generative AI
Kafka
Terraform
Dynatrace
Microservices

Job description

Role: Sr. Site Reliability Engineer (SRE) - Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time

Role Summary

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.

Key Responsibilities

Observability & Reliability Engineering

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds

Proactive Detection & AIOps

  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring

Distributed Systems & Dependency Analysis

  • Monitor and troubleshoot multi-service architectures involving:
      • Microservices
      • Downstream APIs
      • Kafka / streaming platforms
      • Cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from:
      • Upstream/downstream dependencies
      • Streaming platform
      • Infrastructure
      • Application code

Tooling & Platforms

  • Deep hands-on experience with Dynatrace (mandatory)
  • Experience with:
      • OpenTelemetry
      • Prometheus / Grafana
      • ELK / EFK
      • Cloud-native monitoring (AWS/Azure/GCP)
  • Strong JSON-based telemetry manipulation and enrichment

GenAI & LLM Enablement

  • Apply GenAI / LLMs for:
      • Incident summarization
      • Root cause explanation
      • Runbook recommendations
      • Auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely

Requirements

Required Skills & Experience

  • 15+ years in SRE / Production Engineering
  • Strong Unified Observability background (not infra-only)
  • Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  • SLI/SLO engineering experience in production systems
  • Experience implementing dynamic thresholds and anomaly detection
  • Knowledge of AI/ML concepts applied to Ops (AIOps)
  • Distributed systems troubleshooting expertise
  • Experience with Kafka or streaming data platforms

Differentiators (Highly Valued)

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows
