How DoorDash Powers a Reliable, High-Performance Ad Platform with Deductive AI

Executive Summary

DoorDash’s Ads Platform team operates one of the company’s most latency-sensitive systems — an ad-serving platform that powers real-time auctions and delivers high-quality ads in under 100 milliseconds. Every millisecond of performance and every minute of uptime directly translate into business impact, making reliability and responsiveness essential.

As part of its ongoing investments in engineering excellence, the Ads Engineering team has been focused on strengthening incident response and reducing the cognitive load on on-call engineers. By integrating Deductive AI’s on-call agent into its alert response workflows, the team is leveraging AI to accelerate triage, improve system reliability, and create a calmer, more efficient on-call experience.

Deductive automatically correlates code, telemetry, and change metadata to identify likely root causes within minutes – transforming incident response from a manual, high-pressure process into an intelligent, data-driven workflow. The result is reduced mean time to mitigation (MTTM) and faster triage of alerts.

Our Ads Platform operates at a pace where manual, slow-moving investigations are no longer viable. Every minute of downtime directly affects company revenue. In those high-stress, ambiguous early moments of an incident, AI-driven triage plays a crucial role in accelerating our path to mitigation—supporting our 2026 goal of a 10-minute resolution window. Deductive has become a critical extension of our team, rapidly synthesizing signals across dozens of services and surfacing the insights that matter—within minutes.

— Shahrooz Ansari, Senior Director of Engineering, DoorDash

Diagnosing Ad-Exchange Incidents in Minutes with AI-Powered Root Cause Analysis

The Ad-Exchange platform is responsible for search, real-time auction, and ranking across DoorDash’s marketplace. Every transaction happens in milliseconds and depends on a network of streaming jobs, RPCs, feature stores, and configuration pipelines. Even a minor service degradation can affect latency, ad delivery quality, or auction fairness, directly impacting revenue and user experience.

As in any large-scale microservice architecture, incidents can ripple across dozens of interconnected services, making it difficult to quickly pinpoint what changed and why. Engineers needed to determine whether a failure stemmed from a deployment, a configuration flag, or a dependency issue, all while piecing together signals scattered across observability tools.

When an alert fired, engineers consulted multiple dashboards, reviewed logs, inspected deployment histories, and manually reconstructed dependency paths, a slow process when done by hand. The Ad-Exchange team recognized an opportunity to augment this workflow with Deductive, a system that could reason holistically across telemetry and code, learn from prior incidents, and dramatically accelerate time-to-mitigation while reducing operational load.

The Solution: An AI SRE Agent Embedded in the Team’s Workflows

DoorDash collaborated with Deductive AI to deploy an intelligent, code-aware SRE agent that reasons about complex systems the way a seasoned engineer would. Deductive is integrated directly into the company’s existing operational environment, connecting to live metrics, internal logging systems, incident management tools, code, configuration, and deployment metadata. This deep integration enabled Deductive to build a unified context for every incident, bridging the gap between observability and action.

When an alert fired, Deductive immediately initiated an investigation by gathering live system data, identifying recent code and configuration changes, and correlating these with telemetry patterns. Within minutes, it produced a structured summary highlighting the likely root cause and supporting evidence. The findings were shared directly in Slack, allowing engineers to interact with the system via natural-language feedback or quick “Helpful” or “Not Helpful” responses. Each interaction helped Deductive refine its reasoning model, reinforcing patterns that led to accurate diagnoses and discarding those that did not. Over time, the agent developed a rich internal model of DoorDash’s systems, learning the typical failure modes and dependencies that defined Ad Platform reliability.
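The first step described above, finding the changes that landed shortly before an alert, can be illustrated with a minimal Python sketch. It is not Deductive’s implementation: the `ChangeEvent` type, the 30-minute lookback, and every service name below are hypothetical, chosen only to show the idea of narrowing the search to recent changes and ranking them by proximity to the alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    service: str   # hypothetical service name
    kind: str      # e.g. "deploy" or "config"
    at: datetime   # when the change landed


def candidate_changes(alert_at, changes, lookback=timedelta(minutes=30)):
    """Return changes that landed within `lookback` before the alert,
    ordered from closest to the alert to furthest."""
    window_start = alert_at - lookback
    in_window = [c for c in changes if window_start <= c.at <= alert_at]
    return sorted(in_window, key=lambda c: alert_at - c.at)


alert_at = datetime(2025, 1, 1, 12, 0)
changes = [
    ChangeEvent("ranking-service", "config", datetime(2025, 1, 1, 9, 15)),
    ChangeEvent("feature-store", "deploy", datetime(2025, 1, 1, 11, 40)),
    ChangeEvent("auction-service", "deploy", datetime(2025, 1, 1, 11, 55)),
]
suspects = candidate_changes(alert_at, changes)
print([c.service for c in suspects])  # ['auction-service', 'feature-store']
```

Proximity in time is only a starting heuristic; a change inside the window is a suspect, not a confirmed cause.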

How It Works: From Alert to Action in Minutes

When a production alert (such as a QPS drop or latency spike) occurred, Deductive launched an end-to-end investigation. It traced the relationships between affected services, correlated recent changes with telemetry anomalies, and mapped dependencies across upstream systems. Rather than simply listing correlated signals, the agent performed causal reasoning to identify the precise change that triggered the incident. For example, a configuration update to an upstream service might coincide with a drop in QPS. Instead of treating that coincidence as proof, Deductive could analyze dependency graphs and request traces to determine whether the update actually propagated to the affected service.
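One simple piece of that check is asking whether the changed service sits anywhere on the affected service’s call path. The sketch below does this with a breadth-first search over a call graph. The graph, the service names, and the `depends_on` helper are all hypothetical illustrations, not DoorDash’s topology or Deductive’s algorithm, and a real investigation would also need trace evidence that requests actually flowed along the path.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "ad-exchange": ["ranking", "feature-store"],
    "ranking": ["model-serving"],
    "feature-store": [],
    "model-serving": [],
    "billing": ["ledger"],
    "ledger": [],
}


def depends_on(graph, service, changed):
    """Return True if `service` reaches `changed` by following call edges.

    A service is treated as depending on itself, since a change to the
    affected service can obviously affect it.
    """
    seen = {service}
    queue = deque([service])
    while queue:
        current = queue.popleft()
        if current == changed:
            return True
        for callee in graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


print(depends_on(CALLS, "ad-exchange", "model-serving"))  # True
print(depends_on(CALLS, "ad-exchange", "ledger"))         # False
```

A change to a service that is not reachable this way can be deprioritized even if its timing lines up with the alert.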

Every major incident leaves behind lessons that we want our systems to internalize. Deductive learns from those lessons and applies them intelligently in future investigations.

— Dmitry Nikitin, Engineering Leader, DoorDash

Throughout the investigation, Deductive surfaced insights in real time on a shared incident page and corresponding Slack thread. Engineers could guide the analysis by adding comments or upvoting insights that aligned with their hypotheses. As new information emerged, the agent dynamically refined its conclusions, converging on the most probable root cause. The workflow mirrored how an experienced engineer would think, but at machine speed and scale.

The Results: Inside a Real Ads Platform Investigation

In one of many production incidents where Deductive assisted the team, a spike in P90 latency for an API triggered an alert. Within seconds, the alert information was passed to the AI on-call agent, which began its investigation automatically.
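For readers unfamiliar with percentile alerts, the following Python sketch shows how a P90 latency check might be computed with the nearest-rank method. The 100 ms threshold is an assumption borrowed from the serving budget mentioned earlier, not the team’s actual alert rule, and the sample latencies are made up.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100:
    the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def should_alert(latencies_ms, threshold_ms=100):
    """Fire when the P90 latency exceeds the (assumed) threshold."""
    return percentile(latencies_ms, 90) > threshold_ms


healthy = [40, 42, 45, 47, 50, 52, 55, 60, 70, 250]
degraded = [40, 42, 45, 47, 50, 52, 55, 60, 180, 250]

print(percentile(healthy, 90), should_alert(healthy))    # 70 False
print(percentile(degraded, 90), should_alert(degraded))  # 180 True
```

Note that a single slow request (250 ms) does not move P90 in the healthy sample; the alert fires only once the slow tail grows to more than a tenth of requests.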

Deductive AI pinpoints the sequence of events leading to an incident within minutes of detection.

Deductive first reconstructed the full system context, incorporating historical signals from Slack discussions and tracing upstream dependencies to identify which components could influence Ad-Exchange latencies. Leveraging knowledge from prior incidents, the system followed the same logical steps an experienced on-call engineer would take: reviewing dashboards, examining recent deployments and configuration updates, and correlating anomalies across metrics, traces, and logs.

Automated root-cause analysis surfaces correlated metrics and contributing factors for faster triage.

Each phase of this reasoning, including Historical Incident Patterns, Impact Quantification, Change Event Discovery, and Trace Analysis, was displayed in real time, creating a transparent view of how the AI analyzed the problem. Engineers could follow the investigation as it unfolded, review evidence in a shared interface, and guide it directly from Slack by upvoting or commenting on insights.

Real-time anomaly detection highlights performance degradation and recovery trends across services.

In this particular case, Deductive’s analysis revealed that the latency spike originated from timeouts to an upstream system, the ML Platform Sibyl. It also highlighted that Sibyl was undergoing a deployment at the time the alert fired. This finding was subsequently confirmed by reviewing log volume and individual traces from the same period.
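The kind of evidence described here, timeouts concentrated in a deployment window, can be checked by comparing the timeout rate during the deployment with the rate just before it. The Python sketch below uses fabricated call records and a hypothetical `timeout_rate` helper; it illustrates the comparison only and does not reflect the actual data from this incident.

```python
from datetime import datetime, timedelta


def timeout_rate(events, start, end):
    """Fraction of calls in [start, end) that timed out (0.0 if no calls).

    `events` is a list of (timestamp, timed_out) pairs.
    """
    in_range = [timed_out for at, timed_out in events if start <= at < end]
    return sum(in_range) / len(in_range) if in_range else 0.0


# Fabricated upstream calls, one per minute: one timeout in the ten minutes
# before the deployment, six in the ten minutes during it.
base = datetime(2025, 1, 1, 11, 50)
events = [(base + timedelta(minutes=i), i == 3) for i in range(10)]
events += [(base + timedelta(minutes=10 + i), i < 6) for i in range(10)]

deploy_start = datetime(2025, 1, 1, 12, 0)
deploy_end = datetime(2025, 1, 1, 12, 10)

baseline = timeout_rate(events, base, deploy_start)
during = timeout_rate(events, deploy_start, deploy_end)
print(baseline, during)  # 0.1 0.6
```

A jump like this supports the deployment hypothesis but does not prove it, which is why the finding was also checked against log volume and individual traces.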

Automated root-cause analysis summary generated by Deductive AI, correlating upstream timeouts, latency metrics, and deployment changes to pinpoint incident drivers.

This sequence demonstrated Deductive’s ability to move beyond surface-level correlation and perform causal reasoning—explaining not just what changed, but how and why it impacted production behavior.

Knowledge generated by Deductive AI connects DoorDash’s telemetry, code insights, and collaboration context to accelerate RCA and learning.

Under the hood, these insights are powered by Deductive’s learning map — a continuously evolving model that organizes every signal, alert, and observation into a shared context. Rather than relying on static rules, Deductive dynamically clusters related telemetry, code, and change events to discover patterns and relationships that repeat across incidents. When new alerts appear, the system uses these clusters to recognize similar failure modes, retrieve relevant insights, and refine its hypotheses. This gives engineers not just visibility into what happened, but a deeper understanding of how their systems behave under stress.
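The source does not describe how the learning map clusters or retrieves incidents. As a deliberately simplified stand-in, the Python sketch below retrieves similar past incidents by Jaccard similarity over sets of observed signals. This is plain set-overlap retrieval rather than clustering, and the incident names and signal labels are invented for illustration.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical incident history: name -> set of signals seen during it.
PAST_INCIDENTS = {
    "upstream-deploy-timeouts": {"p90_latency_spike", "upstream_timeouts", "deploy_in_progress"},
    "bad-config-flag": {"qps_drop", "config_change", "error_rate_spike"},
    "feature-store-lag": {"p90_latency_spike", "stale_features", "consumer_lag"},
}


def most_similar(signals, incidents, top_k=2):
    """Return the `top_k` past incidents ranked by signal overlap."""
    scored = [(jaccard(signals, past), name) for name, past in incidents.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, round(score, 2)) for score, name in scored[:top_k]]


new_alert = {"p90_latency_spike", "upstream_timeouts"}
print(most_similar(new_alert, PAST_INCIDENTS))
# [('upstream-deploy-timeouts', 0.67), ('feature-store-lag', 0.25)]
```

Even this crude overlap measure shows why shared context helps: a new alert that resembles an earlier incident can start from that incident’s hypotheses instead of from scratch.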

Our systems generate many events. Deductive's ability to interpret that data in the right context and provide actionable insight that otherwise would take us a long time to achieve, has fundamentally improved how we operate under pressure.
— Igor Nodelman, Engineering Leader, DoorDash

As a result, the Ads Platform team mitigated the issue far more quickly by identifying the upstream service responsible and coordinating with the ML Platform team to roll back the problematic deployment. The entire investigation was captured with a clear audit trail showing how the AI reasoned through each step. Over time, Deductive has continued to refine its analysis through feedback from engineers, turning each investigation into new learning data for future incidents. The outcome is a continuously improving on-call partner that combines human intuition with machine reasoning to keep DoorDash’s ad platform reliable and fast.

Deductive’s Take: AI SREs for Real-Time Systems

The collaboration with DoorDash represents one of the most advanced applications of AI-driven reliability engineering in production today. The Ad-Exchange environment is a true stress test for any operational intelligence system — real-time, high throughput, and extremely interdependent. Working with DoorDash’s infrastructure and SRE teams allowed Deductive to refine its reasoning engine in an environment that demands both speed and precision. Together, the teams built an agent capable of navigating a production-scale graph of code, telemetry, and change metadata with the same intuition and discipline that expert engineers bring to on-call. The partnership went beyond integration. It was a joint effort to push the limits of what AI reasoning can achieve in mission-critical systems.

DoorDash has been an incredible partner in helping us bring the next generation of AI-driven reliability to life. Their systems operate at real-time scale, and that has challenged and inspired us to make Deductive smarter, faster, and more adaptive with every incident. What we’ve built together shows what’s possible when AI becomes a true collaborator in engineering.
— Sameer Agarwal, Co-Founder & CTO, Deductive AI

Deductive’s reinforcement learning framework was strengthened through this collaboration, continuously adapting to how DoorDash engineers think, triage, and communicate during incidents. What emerged is not just a faster way to debug, but a fundamentally more intelligent way to operate, combining human intuition with machine consistency.