Sr. Site Reliability Engineer

Qode LLC

Austin, United States of America

5 days ago

Role details

Contract type

Permanent contract

Employment type

Full-time (> 32 hours)

Working hours

Regular working hours

Languages

English

Experience level

Senior

Job location

Austin, United States of America

Tech stack

API

Artificial Intelligence

Amazon Web Services (AWS)

Azure

Cloud Computing

Cloud Engineering

Noise Reduction

Distributed Systems

JSON

Machine Learning

Reliability Engineering

Prometheus

Runbook

Data Streaming

Large Language Models

Grafana

MTTR

Generative AI

Kafka

Terraform

Dynatrace

Microservices

Job description

Role: Sr. Site Reliability Engineer (SRE) - Unified Observability & AIOps
Location: Austin, TX / Fort Mill, SC (Hybrid)
Job Type: Full Time

Role Summary

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.

Key Responsibilities

Observability & Reliability Engineering

Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
Build actionable dashboards for operations, engineering, and leadership
Implement alerting strategies using static and dynamic thresholds

Proactive Detection & AIOps

Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
Transition monitoring from reactive alerts to proactive insights
Implement noise reduction, alert correlation, and root cause analysis
Apply baseline modeling, seasonality detection, and anomaly scoring

Distributed Systems & Dependency Analysis

Monitor and troubleshoot multi-service architectures involving microservices, downstream APIs, Kafka / streaming platforms, and cloud infrastructure (Terraform, IaC)
Identify whether issues originate from upstream/downstream dependencies, the streaming platform, infrastructure, or application code

Tooling & Platforms

Deep hands-on experience with Dynatrace (mandatory)
Experience with OpenTelemetry, Prometheus / Grafana, ELK / EFK, and cloud-native monitoring (AWS/Azure/GCP)
Strong JSON-based telemetry manipulation and enrichment

GenAI & LLM Enablement

Apply GenAI / LLMs for incident summarization, root cause explanation, runbook recommendations, and auto-remediation suggestions
Collaborate with platform teams to operationalize GenAI safely

Requirements

Required Skills & Experience

15+ years in SRE / Production Engineering
Strong Unified Observability background (not infra-only)
Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
SLI/SLO engineering experience in production systems
Experience implementing dynamic thresholds and anomaly detection
Knowledge of AI/ML concepts applied to Ops (AIOps)
Distributed systems troubleshooting expertise
Experience with Kafka or streaming data platforms

Differentiators (Highly Valued)