For engineering leaders running large, distributed systems, speed and stability are often in tension. Shipping software quickly can unlock innovation, but even small mistakes can ripple through complex infrastructures and lead to costly outages. At Datadog, whose platform is relied on globally to monitor and diagnose system failures, reliability is not optional. It must be built into the development process long before code reaches production.

As Datadog’s engineering organization grew, the limits of traditional code review became clear. Human reviewers, no matter how experienced, struggle to maintain full context across sprawling codebases and tightly coupled services. Code review remained a critical safeguard, but it was increasingly difficult to scale without gaps.
To address this challenge, Datadog’s AI Development Experience (AI DevX) team integrated OpenAI’s Codex into its code review workflow. The goal was not to replace engineers, but to surface systemic risks that are easy to miss when changes are reviewed in isolation.
Why earlier automation wasn’t enough
Automated code review tools are not new in enterprise environments. Static analysis and linting tools have long helped teams catch syntax errors and enforce style guidelines. However, these tools often fall short when it comes to understanding how a single change affects a broader system.
At Datadog, earlier tools frequently generated feedback that engineers dismissed as irrelevant. The problem was context. Identifying a problematic line of code is useful, but understanding how that change interacts with dependencies, tests, and adjacent services is what prevents real-world incidents.
The Codex-based review agent was designed to close that gap. Integrated directly into one of Datadog’s most active repositories, it reviews every pull request automatically. Instead of scanning only for surface-level issues, it evaluates the developer’s intent, runs tests, and reasons about how a change behaves across the system.
Proving value through real incidents
For many technology leaders, adopting generative AI comes down to evidence. Productivity gains are appealing, but they can be difficult to quantify. Datadog took a different approach by measuring the tool’s impact on reliability.
The AI DevX team built what they call an “incident replay harness,” recreating past pull requests that were later linked to production incidents. These were changes that had already passed human review. The AI agent was then run against those historical cases to see whether it would have flagged the risks.
The results were striking. In more than 10 cases, roughly 22% of the incidents examined, the AI flagged issues that, had they been addressed, could have prevented the outage, even though every one of those changes had already passed human review. For Datadog’s leadership, this shifted the conversation. As AI DevX lead Brad Carter noted, efficiency matters, but preventing incidents at scale matters more.
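The article does not describe how the incident replay harness is implemented, so what follows is only a minimal sketch of the general idea: replay historical pull requests through a reviewer and measure how often a relevant risk is flagged. Every name here (`HistoricalPR`, `replay_incidents`, `catch_rate`, the toy reviewer and relevance check) is hypothetical, and the sample data is invented for illustration. In a real harness the reviewer would be the AI agent, and deciding whether a finding actually matches the incident would likely need human judgment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class HistoricalPR:
    """A past pull request that passed human review and was later linked to an incident."""
    pr_id: str
    diff: str
    incident_summary: str


@dataclass(frozen=True)
class ReplayResult:
    pr_id: str
    findings: List[str]
    flagged: bool  # True if at least one finding was judged relevant to the incident


def replay_incidents(
    prs: Iterable[HistoricalPR],
    reviewer: Callable[[str], List[str]],
    is_relevant: Callable[[str, HistoricalPR], bool],
) -> List[ReplayResult]:
    """Run the reviewer over each historical PR and record whether it flagged the risk."""
    results = []
    for pr in prs:
        findings = reviewer(pr.diff)
        flagged = any(is_relevant(finding, pr) for finding in findings)
        results.append(ReplayResult(pr.pr_id, findings, flagged))
    return results


def catch_rate(results: List[ReplayResult]) -> float:
    """Fraction of replayed incidents where the reviewer raised a relevant finding."""
    if not results:
        return 0.0
    return sum(r.flagged for r in results) / len(results)


# --- Toy stand-ins (assumptions, not Datadog's reviewer or data) ---

def toy_reviewer(diff: str) -> List[str]:
    """Placeholder for the AI reviewer: simple keyword rules instead of a model."""
    findings = []
    if "timeout=None" in diff:
        findings.append("timeout removed on outbound call")
    if "retry" in diff:
        findings.append("retry behavior changed")
    return findings


def toy_relevance(finding: str, pr: HistoricalPR) -> bool:
    """Placeholder relevance check: the finding's first word appears in the incident summary."""
    return finding.split()[0] in pr.incident_summary.lower()


SAMPLE_PRS = [
    HistoricalPR("pr-1", "client.get(url, timeout=None)", "Requests hung after the timeout was removed"),
    HistoricalPR("pr-2", "retry_count = 10", "Cache stampede after a config change"),
    HistoricalPR("pr-3", "rename variable", "Disk filled up on the ingest hosts"),
    HistoricalPR("pr-4", "x = 1", "Deploy ordering caused a schema mismatch"),
]

RESULTS = replay_incidents(SAMPLE_PRS, toy_reviewer, toy_relevance)
print(f"catch rate: {catch_rate(RESULTS):.0%}")
```

With the invented sample above, only the first pull request is flagged, so the toy catch rate is one in four. That number says nothing about Datadog's results; the point is only that a replay harness turns "would the reviewer have caught this?" into a measurable rate over incidents that already happened.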
Changing how engineers review code
The rollout of AI-assisted code review to over 1,000 Datadog engineers has reshaped internal workflows. Rather than acting as an automated critic, the system functions as a second set of eyes with an unusually broad view.
Engineers report that the AI often flags missing test coverage, hidden cross-service dependencies, and interactions with modules untouched by the immediate change. These insights are difficult to catch during manual reviews, especially under time pressure.
Carter describes the experience as working alongside an engineer with unlimited time and attention. By handling the cognitive load of tracing system-wide effects, the AI allows human reviewers to focus on higher-level questions around design, architecture, and long-term maintainability.
From bug detection to reliability strategy
Datadog’s experience highlights a broader shift in how enterprises may think about code review. Rather than serving only as a final checkpoint or a proxy for development speed, code review is becoming part of a larger reliability system.

By uncovering risks that extend beyond what any individual reviewer can hold in context, AI-assisted review helps organizations scale confidence alongside team growth. For Datadog, whose customers depend on its platform during critical failures, that reliability directly supports trust.
As Carter puts it, Datadog is the platform teams turn to when everything else is breaking. Using AI to prevent incidents before they happen strengthens that role. The case suggests that the real value of AI in software development may lie less in writing code faster and more in protecting systems, customers, and the bottom line.