Enterprise AI Agents Are Getting a DevOps Stack
Agent ops is becoming a real engineering discipline. Microsoft and Cyara both shipped operational tooling for AI agents in March 2026. Here's what that means.
Two things happened in the last week of March 2026 that, taken separately, look like product announcements. Together, they signal that enterprise AI is passing a threshold.
Microsoft's Azure Developer CLI shipped a full local run-and-debug loop for AI agents. Cyara launched agentic testing and AI governance capabilities for validating and controlling AI agents across customer service channels — before and after deployment. Different vendors, different markets, same basic message: agents need operational infrastructure. The demo phase is closing.
What happened
The Azure Developer CLI (azd) March 2026 release added a terminal-native agent workflow through its azure.ai.agents extension. Developers can now run agents locally with azd ai agent run, send them messages with azd ai agent invoke, inspect container health with azd ai agent show, and stream live logs with azd ai agent monitor. Deployment to Microsoft Foundry is integrated into the same flow.
Those are not demo verbs. Run. Invoke. Show. Monitor. Deploy. That's a production workflow.
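The loop those verbs describe can be sketched as a terminal session. The subcommand names are the ones cited in the release; exact flags and arguments are omitted here, since the precise invocation depends on the shipped azure.ai.agents extension.

```shell
# A sketch of the local run-and-debug loop, assuming the
# azure.ai.agents extension is installed and an agent project
# exists in the current directory.

# Run the agent locally
azd ai agent run

# Send the running agent a test message and observe the response
azd ai agent invoke

# Inspect container health, like checking a service's status endpoint
azd ai agent show

# Stream live logs, the same way you'd tail an application log
azd ai agent monitor
```

The point of the workflow is that every step happens before the agent touches Microsoft Foundry, with deployment folded into the same flow.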
At roughly the same time, Cyara shipped new capabilities specifically for validating AI agents before and after they go live on voice and digital channels. The company's CEO, Sushil Kumar, was fairly direct about the logic: "Every enterprise wants to deploy AI agents in their contact center. The ones who actually will are the ones who can prove those agents work, before customers find out they don't."
The assurance has to match the autonomy. That's the point.
Why this matters
For two years, enterprise AI conversations have centered on capability: what models can do, which benchmarks they hit, whether a proof of concept could handle a real use case. Enterprises could forgive a lot in a demo. They cannot forgive it in production.
The questions have shifted. It's no longer just "can the agent handle this task?" It's "can we verify it before it goes live?", "what happens when it makes a mistake?", "how do we know it's still behaving correctly next month?", and "if something goes wrong, who can audit what it did and why?"
The model doesn't answer those questions. The operational layer built around it does. And until recently, that layer barely existed outside of well-resourced research teams.
What Microsoft and Cyara shipped is the productization of that layer. When two major vendors independently ship run/debug/test/monitor tooling for agents in the same week, it means the category is ready — and that enterprises asking for it are reaching critical mass.
The breakdown
Run, debug, deploy is now the baseline
Software systems have had this loop for decades. Write code, run it locally, find the bug, deploy when it works, monitor in production. AI agents had the model and the API and not much else.
The azd extension changes that. You can run an agent locally before it ever touches a production environment. You can invoke it with test messages and observe how it behaves. You can stream container logs the same way you'd tail application logs. You can inspect health the same way you'd check a service's status endpoint.
This is not a minor feature. It's the difference between treating agents as magic and treating them as software.
Validation and trust are becoming product features
Cyara's framing is instructive. They're not selling testing in the abstract — they're selling confidence that a deployed agent will handle real customer interactions correctly, follow regulations, and not introduce bias. That's a trust product.
Gartner's projection that agentic AI will autonomously resolve 80% of common customer service issues by 2029 is widely cited. What's less discussed is that the gap between that projection and current performance is largely a trust problem. Customers and enterprises don't trust the agents yet. The path to closing that gap runs through validation and monitoring, not through better models alone.
Why agent ops goes beyond customer service
Cyara's launch is in the customer service space, and it's easy to read this as a contact-center story. It isn't. It's an early signal in a pattern that will repeat across every enterprise function where agents are deployed.
Healthcare agents that reference clinical guidelines need the same kind of pre-deployment validation. Legal agents that summarize contracts need auditable behavior trails. Sales agents that pull from pricing and product documentation need someone to verify they're not giving customers stale information. The operational requirements are the same regardless of domain.
The governance frameworks being built for customer service will be repurposed. The tooling will generalize. Agent ops is not a customer service discipline — it's an enterprise software discipline that customer service happened to need first.
The missing layer: governed knowledge in production
Here's where the conversation usually stops short. Monitoring tells you what the agent did. Testing validates whether it does the right thing given a prompt. But neither tells you whether the knowledge the agent was reasoning from was trustworthy in the first place.
Production-grade agent ops has to include the knowledge layer. An agent that passes all its tests against a document set that was accurate in January is not a reliable agent in April if those documents haven't been updated. Runtime health checks won't surface that problem. Test suites won't catch it unless someone thought to update the test fixtures too.
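To make the staleness problem concrete, here is a minimal sketch of a knowledge-freshness check that runs independently of runtime health monitoring. Everything here is hypothetical and illustrative (the document IDs, the 90-day window, the function name), not any vendor's API.

```python
from datetime import datetime, timedelta

# Hypothetical freshness window: documents older than this are
# treated as stale regardless of whether the agent's tests pass.
FRESHNESS_WINDOW = timedelta(days=90)

def stale_documents(docs, now=None):
    """Return the IDs of documents whose last update has aged out.

    `docs` is a list of dicts with 'id' and 'last_updated' (datetime).
    """
    now = now or datetime.utcnow()
    return [d["id"] for d in docs if now - d["last_updated"] > FRESHNESS_WINDOW]

docs = [
    {"id": "refund-policy", "last_updated": datetime(2026, 1, 10)},
    {"id": "pricing-2026", "last_updated": datetime(2026, 3, 25)},
]

# An agent validated in January can silently go stale by April:
print(stale_documents(docs, now=datetime(2026, 4, 15)))
# -> ['refund-policy']
```

A check like this belongs in the deployment gate, not in the runtime dashboard: by the time a log line reveals the agent quoting a January policy in April, the customer has already heard it.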
As we've covered in earlier work on agentic failure rates, a significant share of agent failures in production trace back to knowledge issues — contradictory documents, outdated policies, ungoverned content that the agent treats as authoritative because nobody flagged it otherwise.
Observability tells you what the agent did. Governed knowledge helps explain whether it should have done it. Those are different instruments measuring different things, and production-grade agent ops needs both.
Tracing alone isn't trust — a point that gets clearer once agents are running continuously against real document repositories instead of curated demo data.
What it means for enterprise AI teams
Teams standing up agent infrastructure right now should treat this as a checklist shift. If your current setup can't answer the following, you have gaps:
- Can you test agent behavior before deployment against a representative prompt set?
- Can you monitor agent behavior after deployment and detect drift?
- Can you audit what sources the agent used when it made a specific decision?
- Can you confirm those sources were current and free of contradictions when the agent used them?
- Can you update the knowledge the agent depends on without a redeployment cycle?
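The five questions can be wired into a single release gate. This is an illustrative sketch, with invented check names; the checks themselves would be supplied by whatever testing, observability, and governance tooling a team actually runs.

```python
def release_gate(checks: dict[str, bool]) -> list[str]:
    """Return the names of failed checks; an empty list means clear to deploy."""
    required = [
        "behavior_tested_pre_deploy",
        "drift_monitoring_enabled",
        "source_attribution_auditable",
        "knowledge_current_and_consistent",
        "knowledge_updatable_without_redeploy",
    ]
    return [name for name in required if not checks.get(name, False)]

# A team with runtime tooling in place but no knowledge governance:
print(release_gate({
    "behavior_tested_pre_deploy": True,
    "drift_monitoring_enabled": True,
    "source_attribution_auditable": True,
    "knowledge_current_and_consistent": False,
    "knowledge_updatable_without_redeploy": False,
}))
# -> ['knowledge_current_and_consistent', 'knowledge_updatable_without_redeploy']
```

The example failure mode is the common one: the first three checks map to the tooling Microsoft and Cyara shipped, the last two to the layer most teams haven't built.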
The first three questions are what Microsoft and Cyara are helping answer. The last two are the knowledge governance layer — the part most teams haven't wired up yet.
That's not a criticism; it's just where the industry is. The runtime observability tooling came first because it was more visible. The knowledge health question is next, because once agents are observable, it becomes obvious that "what the agent did" and "whether the agent had good information to act on" are separate concerns.
Platforms like Mojar AI are built around the second problem: keeping the knowledge an agent reads current, attributed, and contradiction-free — automatically, at the source, not as a downstream audit. That's not a monitoring tool. It's the foundation that makes monitoring meaningful.
What to watch
The next 12 months will distinguish the teams that ran production agents from the teams that ran production agents reliably. The difference won't be model selection or prompt engineering. It will be operational discipline — testability, observability, policy enforcement, and the unglamorous work of keeping knowledge bases accurate and governed.
Vendors will keep shipping pieces of this stack. The enterprises that win will be the ones that treat agent ops as a first-class engineering concern rather than a problem to retrofit after something goes wrong in front of a customer.
The pattern in enterprise software is consistent: capability ships first, then reliability, then operations, then governance. With agents, all four phases are collapsing into roughly the same 18-month window. That's the pressure that's now showing up in product announcements.
Frequently Asked Questions
What is agent ops?
Agent ops is the emerging discipline of running AI agents in production — covering deployment, monitoring, testing, validation, rollback, and governance. It borrows concepts from DevOps and SRE but applies them to AI agent behavior, not just infrastructure. The term is gaining traction as enterprises move from agent demos to agents that actually run continuously.
Why isn't observability alone enough?
Because observability only tells you what an agent did, not whether it should have done it. If the agent acts on stale, contradictory, or permission-blind documents, runtime health checks won't catch the problem. Governed knowledge — current, attributed, contradiction-free — is what makes an agent's actions trustworthy, not just trackable.
What did Microsoft ship for agent developers?
The Azure Developer CLI (azd) March 2026 release added a full local agent run-and-debug loop: run agents locally, invoke them with messages, inspect container health, stream logs in real time, and deploy to Microsoft Foundry. The tooling treats agents as software systems that need a proper development and operational workflow.