Senior AI Engineer - APM Experiences
Job description
Datadog's APM Experiences team owns the core product experience for Application Performance Monitoring, including distributed tracing, service representation, and more. We're building a new wave of AI-powered capabilities that help customers detect, resolve, and prevent performance issues faster. In this role, you will lead end-to-end development of LLM- and agent-based features that can:
- Debug and investigate application performance issues down to the root cause, as both a developer assistant and a fully autonomous agent
- Proactively recommend performance and reliability optimizations to prevent the next incident
- Automatically create intelligent monitors and SLOs for the most important business flows and critical paths
This is a highly product-minded engineering role: you'll work from problem discovery and UX all the way to reliable, scalable production systems.
At Datadog, we place value in our office culture - the relationships that it builds, the creativity it brings to the table, and the collaboration of being together. We operate as a hybrid workplace to ensure our employees can create a work-life harmony that best fits them.
What you'll do:
- Shape AI experiences for APM. Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.
- Own the full loop. Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.
- Build robust agent systems. Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human-in-the-loop paths.
- Integrate with Datadog's platform. Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end-to-end value in the APM UI.
- Partner deeply. Collaborate with PM, Design, and partner teams to build cohesive experiences.
- Raise the bar on engineering. Write performant, maintainable backend code, own services in production, and improve reliability for high-throughput, low-latency data systems.
- Hands-on with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines
- Exposure to planning/agent frameworks, tool-use orchestration, RAG, and retrieval/indexing for observability data
- Familiarity with SLO/SLA practices and incident response
Requirements
Product-minded engineer who ships AI to production:
- 4+ years building backend or real-time ML systems; you value simplicity, correctness, and performance
- Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails)
- Comfortable owning user journeys, iterating from prototype to alpha to GA, and measuring impact with clear product metrics
Strong ML / applied science fundamentals:
- Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals)
- Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model
- Fluency with offline/online evals for AI systems; can build reliable golden sets and automatic regressions
Distributed systems & observability savvy:
- Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns
- Proficient in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on-call rotation)
Benefits & conditions
- New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
- Continuous professional development, product training, and career pathing
- Intradepartmental mentor and buddy program for in-house networking
- An inclusive company culture, ability to join our Community Guilds (Datadog employee resource groups)
- Access to Inclusion Talks, our internal panel discussions
- Free, global mental health benefits for employees and dependents age 6+
- Competitive global benefits
Benefits and Growth listed above may vary based on the country of your employment and the nature of your employment with Datadog.