Engineering

Datadog, Thank You for Blocking Us

Datadog cut off our observability overnight. We migrated to an open Grafana stack in 48 hours. Here’s why vendor lock-in is fading in an AI-native world.

Karan Abrol
Yating Zhou
Pratyush Verma
Aditya Bhandari
Sameer Agarwal
January 9, 2026
  -  
9
Share this article

On December 15, 2025, the Deductive team received an unexpected email from Datadog notifying us that our account was under review. While it is always a little flattering to receive that level of attention from an industry incumbent, we initially assumed this would be resolved quickly through clarification. It did not immediately occur to us that a public company at Datadog’s scale could view a much smaller startup like Deductive as a potential competitive threat.

Datadog access was revoked, and production telemetry stopped immediately

We responded within minutes to clarify what appeared to be a misunderstanding. Deductive had been using Datadog’s APM platform since February 2025 strictly to instrument our own production systems with logs, metrics, traces, and application performance data, in exactly the same way most teams operating distributed systems at scale rely on telemetry. This usage had no relationship to Bits AI, which was first announced in 2023 and operates as a direct competitor to Deductive.

Despite this clarification, our Datadog accounts and API keys were deactivated shortly thereafter. Telemetry ingestion stopped across all clusters almost immediately, and within minutes, we lost visibility into production systems that depend fundamentally on continuous observability signals to operate safely.

A Single Point of Failure, or a Single Point of Opportunity?

Given how deeply integrated Datadog was into our production environment, the initial instinct was to wait for a response and attempt to resolve the issue through support channels. This would have been the rational choice in a world where observability tooling carries high switching costs and deeply entrenched operational dependencies.

That world no longer exists.

When it became clear that access would not be restored in a timely manner, we treated the situation not as an outage, but as a forced experiment in the real cost of vendor lock-in under modern tooling assumptions. To understand why this moment felt risky at all, it helps to be explicit about what Datadog gets right.

The Emperor Has Great Clothes

In 2025, Gartner once again placed Datadog in the “Leader” quadrant in their Observability Magic Quadrant for the fifth consecutive year, with user reviews consistently highlighting the platform’s breadth of services and polished user experience.

That assessment is not wrong. Datadog’s UX is objectively excellent. We chose Datadog precisely because of its legacy as a best-in-class observability vendor.

A great interface does not change the power dynamics of lock-in

Two things, however, were consistently true throughout our usage. First, our Datadog bills were steep, roughly 2-3x of what we would otherwise expect to pay for equivalent telemetry storage and retention. Second, despite the richness of the platform, we rarely used Datadog for anything beyond being a reliable system of record for logs, metrics, and traces. We were paying for workflows we almost never touched.

We were aware of this imbalance, but like many teams, we did not have the bandwidth to think seriously about switching.

Two weeks into operating entirely without Datadog, it is clear that its dashboards and workflows remain more refined. What is striking, however, is how little this difference has mattered in practice.

Switching Costs and the Shrinking Perils of Vendor Lock-in

When we first adopted Datadog, we weren’t naïve about switching costs - we accepted them deliberately. Although OpenTelemetry already existed as a vendor‑neutral alternative, Datadog’s native SDK and agent offered faster time-to-value and broader functionality out of the box. We therefore chose the most tightly integrated configuration available: Datadog’s native SDK paired with the Datadog agent, maximizing feature coverage across tracing, logs, custom metrics, monitors, and incident response while minimizing operational overhead. This decision reflected our organizational reality at the time: no dedicated infrastructure team, no appetite for maintaining abstraction layers, and no expectation that we would ever need to migrate.

In short, we consciously opted into vendor lock‑in.

What surprised us was not that Datadog could shut us off, but how little that lock‑in ultimately mattered in practice. Within the same day, we selected an alternative stack and produced a working end-to-end demo. By the end of the following day, all logs, traces, metrics, and alerts were fully restored.

Inaction is Expensive

Conventional wisdom suggests that losing a deeply embedded observability platform should trigger a multi-week migration. That expectation is increasingly disconnected from reality. Not because observability is simpler, but because the cost of making targeted changes in code has collapsed.

Within hours, we had telemetry flowing again using a fully open-source stack.

Observability fully restored on an open stack in under 48 hours

Grafana Cloud was not selected by default. We evaluated several alternatives in parallel, including New RelicSigNoz, and ClickHouse-backed configurations,  examining trade-offs around log search semantics, cost curves, operational complexity, and long-term portability. We ultimately converged on an architecture that maximized our use of open standards at the application layer while delegating the operational responsibility of storage, scaling, and availability to Grafana Cloud:

As a result, the practical impact of a similar shutdown event today is close to zero. Grafana Cloud is now the only managed dependency, but the entire stack remains open source. It can be self-hosted with a configuration change that propagates across the fleet in minutes rather than hours.

The core lesson from this experience is not about observability tooling per se, but about the economics of software change: code is cheap.

While OpenTelemetry configurations required to reach feature parity with a vertically integrated platform like Datadog are non-trivial to author manually, modern AI-assisted development tools such as Claude Code, Cursor, and Deductive itself have reduced this effort from a specialized infrastructure task to routine engineering work. As a result, the historical moat created by superior UX, proprietary integrations, and expansive managed feature sets is eroding rapidly.

Closing the Loop: Code, Telemetry, and MCP

During the migration, one factor materially compressed our iteration time: we integrated Cursor directly with Deductive via MCP, with Deductive pulling live telemetry from Grafana while the code was being written and deployed.

This created a tight feedback loop where the system could inspect the logs, metrics, and traces emitted by each change almost immediately, reason about whether the instrumentation and behavior matched expectations, and then guide subsequent code changes accordingly. In practice, this meant the code was being written while simultaneously being evaluated against the telemetry it produced.

Code changes validated continuously against live telemetry

Rather than treating observability as a post-hoc verification step, the migration became an interactive process where code and telemetry co-evolved. Instrumentation gaps, misconfigured exporters, and unexpected signal shapes were identified and corrected inline, without context switching between editors, dashboards, or manual inspection workflows.

Today, Deductive is the primary consumer of this telemetry. Nearly all of our debugging and investigation happens through Deductive itself, with Grafana serving as a clean, reliable backend rather than a human-facing control plane.

Two Structural Shifts in Observability

Our migration surfaced two broader shifts that matter more than any individual vendor choice. These are structural changes in how observability systems are built and used.

1. The "Broad Integration" Moat is Crumbling

For years, observability platforms competed on integration breadth and depth, betting that tighter coupling across agents, SDKs, dashboards, and workflows would create durable switching costs. That strategy made sense when integration work was slow and expensive.

With OpenTelemetry as a common substrate and AI-assisted development tools handling much of the integration glue, the marginal cost of reaching feature parity has collapsed. Integration breadth still matters, but it no longer compounds into a lasting moat. Once organizational inertia is broken, replacing even deeply embedded tooling is far easier than it used to be.

2. AI-Native Observability Is Displacing Dashboard-Centric Workflows

Traditional observability assumes humans explore systems through dashboards, charts, and query builders. In practice, engineers reason in hypotheses, code paths, and recent changes, not visualizations.

In an AI-native world, agents are the real users of this data. Dashboards do not disappear, but they become secondary. Mechanical tasks like building queries or traversing trace graphs increasingly belong with machines, especially during time-sensitive incidents.

Closing Thoughts: Observability in an AI-Native World

The lesson from this migration is not that Datadog is a bad product, or that Grafana is categorically better. Datadog remains an exceptionally well-built platform with a best-in-class user experience, and that investment shows. Similarly, the Grafana ecosystem has earned its place as a robust, principled foundation for modern observability, with strong open-source roots, clear abstractions, and a composable architecture that scales from individual teams to large production environments.

The deeper lesson is that the center of gravity in observability is shifting. If you are paying Datadog primarily as a high-end storage and ingestion layer, it is worth seriously reconsidering whether that spend aligns with how observability is actually consumed in an AI-native workflow.

In that world, resilience does not come from avoiding change, but from making change cheap. Our forced migration demonstrated that what once looked like catastrophic vendor lock-in is now closer to a recoverable configuration change. That is a meaningful shift in the operational economics of modern software systems.

Observability is no longer just about collecting signals. It is about enabling fast, accurate reasoning under uncertainty. The tools that win will be the ones that accept this reality and are designed for a future in which humans and AI agents collaborate directly at the level of intent, not through interfaces.

That future is already here.

Get ready to redefine the way your developers and SREs
root cause software issues

Request early access and offload your debugging to Deductive AI

Codeblock
Deductive monogram
Codeblock
CTA Image