The production gap in enterprise AI isn’t about models. It’s about the missing layer between experimentation and reliable operation — and there’s a discipline that already knows how to build it.

TL;DR: The 12:1 ratio between AI experimentation and production deployment isn’t a model problem — it’s an infrastructure problem. Organizations conflate two distinct relationships (“AI as a tool” vs. “AI as a product”) that have completely different failure modes. The discipline of platform engineering already knows how to close this gap. Measure DORA metrics, not “% of code written by AI.” If you read nothing else, skip to the five diagnostic questions.


Here’s a number that should concern anyone leading an AI program: while 71% of Chief Data Officers are experimenting with generative AI, only 6% have successfully deployed it in production. That’s not a rounding error — it’s a 12:1 ratio between experimentation and operational reality.

The picture gets worse. S&P Global reports that 42% of companies abandoned most AI initiatives in 2025, up from 17% the year before. Gartner predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027 — not because the models failed, but because of rising costs, unclear business value, and inadequate risk controls. MIT’s 2025 research found that only 5% of enterprise AI pilots deliver measurable business value, despite tens of billions in investment.

The distance between “works in a demo” and “runs reliably at scale” has a name. It’s called the platform layer — deployment patterns, observability, governance, cost management. And there’s a discipline that’s been solving exactly this problem for traditional software workloads over the past four years: platform engineering.

This post explains why the production gap exists and what it takes to close it. Part 2 goes deep on the technical architecture for platform teams building this layer. Part 3 takes the AWS-opinionated view — how AgentCore and the new Agent Registry map to these challenges, grounded in a real enterprise case study.

Two Relationships, One Word

When executives talk about “AI agents,” they’re usually conflating two fundamentally different things. Separating them is the first step toward a coherent strategy.

AI as a tool for your teams — AI agents that help your engineers build and ship software faster. They write code, query internal systems, trigger deployments, read documentation. Think of them as a new class of developer that works alongside your human teams. They consume your organization’s infrastructure and knowledge to do their job.

AI as part of your product — AI agents that you ship to customers or run as part of your business operations. A customer-facing chatbot, an automated claims processor, an internal HR agent. These are production workloads that need to be deployed, monitored, scaled, and governed — just like any other software your organization operates.

The failure modes are completely different, and they map to different business risks:

Getting the first wrong — AI tools for your teams — means your engineering organization adopts AI coding tools that generate plausible but incorrect code because they can’t see service dependencies, compliance requirements, or organizational context. The result: more code, same (or worse) quality, growing technical debt. Individual productivity metrics look great while organizational delivery stalls.

Getting the second wrong — AI as part of your product — means your customer-facing AI products crash in production because nobody owns their reliability contracts. The result: reputational damage, compliance violations, and the kind of incident that makes the board question the entire AI investment.

Same word — “agent” — entirely different problems, entirely different solutions.

Key takeaway: Before any AI investment discussion, ask: “Are we talking about AI as a tool for our teams, or AI as part of our product?” The answer determines the architecture, the risk profile, and who owns the problem.

Measuring the Wrong Things

The data on AI developer tools tells a cautionary tale.

Port analyzed 63 public earnings calls from major tech companies discussing AI developer tools. The finding: not a single company connected AI code generation metrics to actual engineering outcomes — deployment frequency, mean time to recovery, or change failure rates.

The individual numbers look impressive. HubSpot reports that 97% of committed code involved AI assistance. Google says roughly half of its code is written by AI agents. Pinterest reports similar figures. DoorDash sees 90%+ daily active AI usage among its engineers.

But a study of 10,000 developers tells the rest of the story: individual output rose 21%, PRs merged increased 98% — yet organizational delivery velocity remained flat. The bottleneck shifted. PR review times grew 91%. PR sizes increased 154%. AI accelerated code writing but created downstream congestion in review, testing, and integration.

This is the “AI as a tool” problem made visible. Without organizational infrastructure — automated testing gates, intelligent review routing, deployment guardrails, standardized quality checks — more code production simply creates more work-in-progress. The constraint moves downstream, and throughput doesn’t improve.
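A back-of-the-envelope queueing model makes the mechanics concrete. By Little’s Law, time spent in a stage equals work-in-progress divided by throughput, so when a fixed-capacity stage (review, testing, integration) sits downstream of accelerated code production, lead time grows even as output climbs. A minimal sketch in Python, with illustrative numbers rather than the study’s:

```python
# Little's Law: avg time in a stage = WIP / throughput.
# If upstream output doubles but a downstream stage has fixed capacity,
# WIP piles up in front of it and end-to-end lead time grows:
# more code, same delivery. Numbers are illustrative, not the study's.

def lead_time_days(arrival_per_day: float, stage_capacity_per_day: float,
                   days: int) -> float:
    """Days a new item waits behind a fixed-capacity stage after `days` of arrivals."""
    wip = 0.0
    for _ in range(days):
        processed = min(stage_capacity_per_day, wip + arrival_per_day)
        wip = max(wip + arrival_per_day - processed, 0.0)
    throughput = min(stage_capacity_per_day, arrival_per_day)
    return wip / throughput  # Little's Law applied to the queue

print(lead_time_days(arrival_per_day=10, stage_capacity_per_day=10, days=20))  # 0.0: balanced
print(lead_time_days(arrival_per_day=20, stage_capacity_per_day=10, days=20))  # 20.0: queue grows daily
```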

The lesson for leadership: “percentage of code written by AI” is an input metric, not an outcome metric. Deployment frequency, lead time, change failure rate, and mean time to recovery — the DORA metrics — are what actually correlate with engineering performance. If your AI adoption dashboard doesn’t track these, you’re measuring activity, not value.

Key takeaway: If your AI dashboard tracks “% of code written by AI” but not deployment frequency or change failure rate, you’re measuring activity, not value. The bottleneck shifts downstream — more code without platform-level orchestration just creates more work-in-progress.
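For teams that want to instrument this, the four DORA metrics fall out of deployment and incident events most organizations already record. A minimal sketch, assuming a simple event shape rather than any particular tool’s schema:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    commit_time: datetime                  # change committed
    deploy_time: datetime                  # change live in production
    failed: bool                           # caused a production incident?
    restored_time: datetime | None = None  # service restored, if it failed

def dora_metrics(deploys: list[Deployment], window_days: int = 30) -> dict:
    """The four DORA outcome metrics from one window of deployment events."""
    if not deploys:
        return {}
    failures = [d for d in deploys if d.failed]
    lead_times = sorted(d.deploy_time - d.commit_time for d in deploys)
    recoveries = [d.restored_time - d.deploy_time
                  for d in failures if d.restored_time]
    hours = timedelta(hours=1)
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "median_lead_time_hours": lead_times[len(lead_times) // 2] / hours,
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_hours": (sum(recoveries, timedelta()) / len(recoveries)) / hours
                      if recoveries else None,
    }
```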

Why the Gap Exists

The production gap isn’t a model quality problem. Models are good enough. The gap exists because most organizations lack the operational layer between “AI experiment” and “AI in production.” Specifically:

No standardized deployment patterns. Teams are independently figuring out how to containerize, deploy, and scale AI agents. Every team reinvents the wheel — different frameworks, different monitoring, different cost structures. This is what platform engineering practitioners call “organizational debt,” and it compounds fast.
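What “standardized” can look like in practice: one versioned deployment spec that every agent ships with, validated by the platform before anything reaches production. A hypothetical sketch in Python, not any vendor’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentDeploymentSpec:
    """The single spec every agent ships with; the platform turns it into
    containers, dashboards, and budgets so teams don't reinvent them."""
    name: str
    owner_team: str                      # accountable team (governance)
    model_id: str                        # which foundation model it calls
    tools: list[str] = field(default_factory=list)  # systems it may touch
    monthly_budget_usd: float = 500.0    # hard cost ceiling
    autonomy_level: int = 3              # L3: AI recommends, humans approve

def validate(spec: AgentDeploymentSpec) -> list[str]:
    """Platform-side gate that runs before any agent reaches production."""
    errors = []
    if not spec.owner_team:
        errors.append("every agent needs an accountable owner")
    if spec.autonomy_level > 3:
        errors.append("L4+ autonomy needs a platform exception, not a default")
    if spec.monthly_budget_usd <= 0:
        errors.append("a cost ceiling is mandatory, not optional")
    return errors
```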

No AI-specific observability. Traditional application monitoring tracks request latency and error rates. AI workloads need additional signals: token consumption, tool call success rates, inference latency, cost per request, reasoning trace visibility. Without these, you can’t diagnose why an agent is producing poor results or burning through budget.
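These signals are easiest to capture once, at the platform layer, by wrapping every model call rather than trusting each team to instrument its own. A minimal sketch of the idea; the response fields and per-token prices are assumptions for illustration:

```python
import time
from dataclasses import dataclass

@dataclass
class CallRecord:
    agent: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    tool_calls: int
    tool_failures: int
    cost_usd: float

PRICE_IN, PRICE_OUT = 0.003, 0.015  # illustrative per-1K-token rates

def emit(record: CallRecord) -> None:
    print(record)  # stand-in for your metrics backend

def observe(agent: str, call) -> dict:
    """Wrap a model call and emit the AI-specific signals traditional
    APM misses: tokens, tool success rates, and cost per request."""
    start = time.monotonic()
    response = call()  # assumed to return a dict with usage and tool_calls
    usage, tools = response["usage"], response.get("tool_calls", [])
    emit(CallRecord(
        agent=agent,
        latency_s=time.monotonic() - start,
        input_tokens=usage["input_tokens"],
        output_tokens=usage["output_tokens"],
        tool_calls=len(tools),
        tool_failures=sum(1 for t in tools if t.get("error")),
        cost_usd=(usage["input_tokens"] * PRICE_IN
                  + usage["output_tokens"] * PRICE_OUT) / 1000,
    ))
    return response
```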

No governance framework. Who approved the agent that’s now making decisions on behalf of your customers? What are its reliability contracts? What happens when it produces a wrong answer in a regulated context? The EU AI Act requires auditable logs, explainable decision chains, and human-in-the-loop for high-risk actions. Most organizations can’t answer these questions today. One financial services team learned this the hard way: their customer-facing agent had been live for three weeks before anyone realized it was citing an internal policy document that had been superseded six months earlier. The agent was confident, the answers were plausible, and no one had built the feedback loop to catch it.
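Those requirements translate directly into mechanism: an append-only audit record for every consequential agent action, and a hard stop on high-risk actions without a named human approver. A minimal sketch, with hypothetical risk tiers:

```python
import hashlib, json
from datetime import datetime, timezone

HIGH_RISK = {"approve_claim", "update_customer_record"}  # illustrative tiers

audit_log: list[dict] = []  # stand-in for an append-only, tamper-evident store

def execute(agent: str, action: str, payload: dict,
            approver: str | None = None) -> dict:
    """Record every agent action; block high-risk ones without a named human."""
    if action in HIGH_RISK and approver is None:
        raise PermissionError(f"'{action}' is high-risk: human approval required")
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "action": action,
        "payload_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest(),
        "approver": approver,  # who, what, when, approved by whom
    }
    audit_log.append(entry)
    # ...perform the action itself here...
    return entry
```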

No cost visibility. Traditional cloud cost management tracks spend by service and team. With AI agents, a single request may trigger a chain of actions across multiple services. The question shifts from “who spent what?” to “which agents consumed resources, and were those decisions justified?” Without this visibility, AI costs spiral unpredictably.
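Answering “which agents consumed resources” starts with attributing spend to the agent and the originating request rather than the service. A minimal rollup sketch, assuming call records shaped like the observability example above:

```python
from collections import defaultdict

def cost_by_agent(records) -> dict[str, dict]:
    """Attribute spend to agents, not just services, so one request's
    chain of actions is traceable to whatever triggered it."""
    totals: dict[str, dict] = defaultdict(lambda: {"cost_usd": 0.0, "calls": 0})
    for r in records:  # CallRecord-like objects (see the sketch above)
        totals[r.agent]["cost_usd"] += r.cost_usd
        totals[r.agent]["calls"] += 1
    return dict(totals)

# One customer request that fanned out across three agent invocations:
# cost_by_agent([triage_call, claims_call_1, claims_call_2])
# -> {"triage": {"cost_usd": ..., "calls": 1}, "claims": {..., "calls": 2}}
```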

These aren’t model problems. They’re infrastructure problems. And they’re exactly the kind of problems that platform engineering — the discipline of building self-service internal platforms for engineering teams — was designed to solve.

Key takeaway: The production gap has four components: no standardized deployment, no AI-specific observability, no governance framework, and no cost visibility. All four are infrastructure problems, not model problems.

The Platform Engineering Connection

Platform engineering emerged because the DevOps movement, for all its benefits, left too much cognitive load on individual teams. Everyone was expected to own their own infrastructure, tooling, and deployment pipelines. The result was inconsistency, duplication, and teams spending more time on plumbing than on product.

The discipline’s answer: build an Internal Developer Platform — a self-service layer that abstracts infrastructure complexity and provides standardized, governed paths to production. The correlation with performance is striking. A 2023 benchmarking study found that 93% of top-performing engineering teams operate on an Internal Developer Platform maintained by a dedicated team. Among low performers, the figure was 2%.

The agentic AI wave is creating the same pattern one level up. Every team is building its own agent infrastructure, its own integrations, its own governance. The organizations that recognize this pattern early, and invest in a shared platform layer for AI workloads, will close the production gap. The ones that don’t risk seeing their projects join the more than 40% of agentic AI initiatives that Gartner predicts will be cancelled by the end of 2027.

Google Cloud’s research found that 94% of organizations identify AI as critical to the future of platform engineering, and 86% believe platform engineering is essential for realizing AI’s business value. The convergence is not in question. The question is whether your organization builds the platform layer before or after the organizational debt compounds.

The Autonomy Question

One framework that’s useful for leadership decisions: how much autonomy should AI agents have in your organization?

The industry is converging on a five-level model, analogous to self-driving vehicles — from L1 (manual, script-based) through L3 (AI recommends, humans approve) to L5 (full autonomy). The model applies to both relationships: for AI as a tool, it describes how much you trust coding agents to act without human approval; for AI as a product, it describes operational maturity from manually supervised to self-healing.

Most organizations sit at L2–L3 today. ServiceNow’s 2025 Enterprise AI Maturity Index found that fewer than 1% of organizations scored above 50 out of 100 on AI operational maturity. L3 is a reasonable near-term target. L4 requires platform maturity that most teams don’t yet have. The responsible approach is what practitioners call “bounded autonomy”: agents operate within platform-defined constraints, never as independent actors. Part 2 maps this autonomy ladder to specific platform engineering requirements for both the tool and product relationships.
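Bounded autonomy is enforceable in code: the platform, not the agent, decides which actions at which level may proceed without a human. A minimal sketch of such a gate, using the L1–L5 ladder above; the risk rules and the L2 label are illustrative assumptions:

```python
from enum import IntEnum

class Autonomy(IntEnum):
    L1_MANUAL = 1      # manual, script-based (per the post)
    L2_ASSISTED = 2    # AI drafts, humans do the work (assumed label)
    L3_SUPERVISED = 3  # AI recommends, humans approve (per the post)
    L4_BOUNDED = 4     # AI operates within defined boundaries (per the post)
    L5_FULL = 5        # full autonomy (per the post)

def may_act_without_human(granted: Autonomy, action_risk: str) -> bool:
    """Platform-defined constraint: autonomy is a ceiling, and risk lowers it."""
    if granted <= Autonomy.L3_SUPERVISED:
        return False                       # L3 and below always need approval
    if action_risk == "high":
        return False                       # high-risk stays human-gated even at L4+
    return granted >= Autonomy.L4_BOUNDED  # low-risk actions may proceed at L4+

# may_act_without_human(Autonomy.L4_BOUNDED, "low")  -> True
# may_act_without_human(Autonomy.L4_BOUNDED, "high") -> False
```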

Five Questions for Your Next AI Strategy Review

These map back to the arguments above. Questions 3 and 4 are the direct diagnostics for the two relationships; answer “no” to both and you have gaps in both:

1. Is there a single team accountable for AI infrastructure, or is every team building independently? If GPU orchestration, model serving, and AI governance are being reinvented by every team, that’s organizational debt — and it tends to compound.

2. Are your AI metrics measuring business outcomes or just adoption? Deployment frequency, lead time, change failure rate, and mean time to recovery correlate with performance. “Percentage of code written by AI” does not.

3. Can AI tools access your organizational context, or are they working blind? If your internal systems aren’t machine-readable, AI coding tools generate plausible but wrong code because they can’t see service dependencies, compliance requirements, or team ownership. This is the “AI as a tool” diagnostic.

4. Can your organization ship AI agents to production with the same rigor as traditional software? Standardized deployment paths, observability for AI-specific signals, reliability contracts, governance frameworks — this is the “AI as a product” diagnostic.

5. What level of autonomy have you granted, and what needs to be true to move up? L3 (AI recommends, humans approve) is reasonable for most organizations today. Moving to L4 (AI operates within defined boundaries) requires platform maturity that most teams don’t yet have.

What Comes Next

The production gap is real, but it’s not inevitable. Organizations that invest in the platform layer — standardized deployment patterns, AI-specific observability, governance frameworks, cost visibility — will close it. The ones that treat AI as purely a model selection problem will continue to see pilots that never reach production.

In Part 2, we go deep on the technical architecture: what the platform layer looks like in practice, how the vendor landscape is evolving, the specific protocols and standards emerging for agent communication and observability, and concrete examples from organizations that have built this layer successfully. That post is written for the platform teams and engineering leaders who will actually build it.


This is Part 1 of the Platform Engineering for AI Agents series. Part 2: The Platform Engineering Playbook covers the technical architecture. Part 3: The AWS Playbook covers the AWS-opinionated view with AgentCore and Agent Registry.


Sources

  1. AWS: Accelerating GenAI with Platform Engineering (November 2025) — 71% CDO experimentation vs 6% production deployment — aws.amazon.com/blogs/machine-learning
  2. S&P Global (2025) — 42% of companies abandoned most AI initiatives, up from 17% in 2024
  3. Gartner (June 2025) — More than 40% of agentic AI projects will be cancelled by end of 2027
  4. MIT (2025) — Only 5% of enterprise AI pilots deliver measurable business value
  5. Port: “63 Earnings Calls, 0 Engineering Outcomes Tied to AI” (March 2026) — port.io/blog
  6. Faros AI: Developer Productivity Study (10,000 developers) — cited via Port’s analysis (source 5)
  7. Humanitec DevOps Benchmarking Study 2023 — humanitec.com/whitepapers/devops-benchmarking-study-2023
  8. Google Cloud: AI and Platform Engineering Survey (2025) — futurumgroup.com
  9. ServiceNow Enterprise AI Maturity Index 2025
  10. Gartner Hype Cycle for Emerging Technologies (2022–2024); Gartner prediction on 80% platform team adoption by 2026
  11. DORA Accelerate State of DevOps Report 2024 — dora.dev/research/2024/dora-report
  12. EU AI Act — requirements for auditable logs, explainable decision chains, human-in-the-loop
  13. AWS: “When Software Thinks and Acts — Reimagining Cloud Platform Engineering for Agentic AI” (Parasuraman, Madden, March 2026) — aws.amazon.com/blogs/migration-and-modernization