Senior AI/ML Observability Engineer
Job description
- Design, build, and deploy AI/ML models for anomaly detection across telemetry data (logs, metrics, traces, KPIs)
- Translate early-stage use cases into generalized, reusable observability solutions
- Modify and extend models to support multiple applications and teams
- Apply ML techniques to predict system anomalies before production impact
Telemetry & System Monitoring
- Analyze and correlate logs, metrics, traces, and system KPIs
- Identify early warning signals of instability or degradation
- Build dashboards and alerts using observability platforms
Collaboration & Strategy
- Work closely with Infrastructure, SRE, Developers, and Architects
- Contribute to enterprise observability strategy
- Act as a subject matter expert for AI-driven observability
- Operate independently within a small, high-impact team
Requirements
We are seeking a Senior AI/ML Observability Engineer to join a strategic observability team focused on building reusable, enterprise-wide anomaly detection solutions. This role blends hands-on AI/ML engineering, observability expertise, and automation to proactively detect system issues and improve production reliability.
The ideal candidate has strong Python-based ML experience, a solid grasp of observability principles (logs, metrics, traces), and has worked closely with Infrastructure, SRE, and Engineering teams to implement scalable observability solutions across complex systems.
This is a senior individual contributor role requiring independence, initiative, and subject matter expertise.
- 6+ years of experience in AI/ML engineering, SRE, or observability-focused roles
- Strong expertise in Python for data processing and ML development
- Hands-on experience building ML models for anomaly detection
- Solid understanding of observability principles (logs, metrics, traces)
- Experience with observability tools such as:
  - Grafana (preferred)
  - Splunk
  - Dynatrace
- Familiarity with OpenTelemetry
- Strong automation skills (pipelines, workflows, reusable components)
- Experience working in cloud environments
- Excellent problem-solving and communication skills

- Experience designing predictive models for system reliability
- Background supporting production systems in large-scale environments
- Experience building reusable ML platforms or shared services
- Exposure to enterprise-wide monitoring or observability programs

- Senior-level, hands-on engineer
- Strong ownership mindset; able to drive work end-to-end
- Comfortable operating with limited supervision
- Strategic thinker with pragmatic execution skills
- Passionate about reliability, automation, and proactive problem detection