Senior AI Engineer - APM Experiences
Job description
Datadog's APM Experiences team owns the core product experience for Application Performance Monitoring, including distributed tracing, service representation, and more. We're building a new wave of AI-powered capabilities that help customers detect, resolve, and prevent performance issues faster. In this role, you will lead end-to-end development of LLM- and agent-based features that can:
- Debug and investigate application performance issues down to the root cause, as both a developer assistant and a fully autonomous agent
- Proactively recommend performance and reliability optimizations to prevent the next incident
- Automatically create intelligent monitors and SLOs for the most important business flows and critical paths
This is a highly product-minded engineering role: you'll work from problem discovery and UX all the way to reliable, scalable production systems.
What you'll do:
- Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.
- Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.
- Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human-in-the-loop paths.
- Integrate with Datadog's platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end-to-end value in the APM UI.
- Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences.
- Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high-throughput, low-latency data systems.
Requirements
Product-minded engineer who ships AI to production
- 4+ years building backend or real-time ML systems; you value simplicity, correctness, and performance
- Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails)
- Comfortable owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics
Strong ML / applied science fundamentals
- Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals)
- Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model
- Fluency with offline/online evals for AI systems; can build reliable golden sets and automatic regressions
Distributed systems & observability savvy
- Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns
- Proficient in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on-call rotation)
Nice to have
- Hands-on with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines
- Exposure to planning/agent frameworks, tool-use orchestration, RAG, and retrieval/indexing for observability data
- Familiarity with SLO/SLA practices and incident response
Benefits & conditions
- Build tools for software engineers just like yourself, and use the tools we build to accelerate our own development.
- Have a lot of influence on product direction and impact on the business.
- Work with skilled, knowledgeable, and kind teammates who are happy to teach and learn.
- Competitive global benefits.
- Continuous professional development.
Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog.