The agent orchestration framework market is loud, fragmented, and converging fast. Ten serious contenders, three architectural families, and a market in which, per the Gartner projection cited by the 2026 HBR Analytic Services / Deloitte paper, 40% of agentic AI projects will be canceled by the end of 2027. The framework you pick rarely causes the failure, but the wrong family for your stage makes everything downstream harder. This guide scores the ten on production criteria and works through the sequencing question alongside the scores.
Key takeaways
- Three families — Graph-based (LangGraph, AutoGen v0.4), role-based (CrewAI, OpenAI Agents SDK), and durable-execution (Temporal, Inngest, Mastra, Letta). Pick the family first.
- Production criteria — State durability, observability, language coverage, and pricing-model fit with token economics. The fifth, community size, matters less than people think.
- What the HBR/Deloitte paper found — The 2026 HBR Analytic Services white paper cites Gartner's projection that 40% of agentic AI projects will be canceled by the end of 2027, and a September 2025 Gartner survey of 360 IT application leaders in which only 13% strongly agreed they had the right governance in place.
- Sequencing — Anil Vijayan at Everest Group: start with low-error high-volume task automation, move to end-to-end multisystem processes, then function-by-function transformation. Frameworks differ by which stage they're optimised for.
Why orchestration is the bottleneck
A single agent that calls a model and a tool is a function. The orchestration question shows up the moment you need two agents to coordinate, a workflow to survive a process restart, a human to approve a step, or an audit log to explain why an agent took an action three weeks ago. Most production agentic systems hit at least three of those constraints in the first month.
The 2026 HBR Analytic Services white paper, sponsored by Deloitte, frames the category problem cleanly. It cites Gartner's projection that 40% of agentic AI projects will be canceled by the end of 2027 and attributes most cancellations to four categories of organizational debt that an orchestration framework alone cannot fix:
- Process debt. Workflows that were designed for human execution and cannot be run by agents without redesign.
- Data debt. Fragmented or inconsistent information that prevents reliable agent decision-making.
- Technical debt. Legacy systems that do not integrate smoothly with the orchestration layer.
- Cultural resistance. The friction that surfaces whenever human roles shift, even when the new arrangement is objectively better.
Framework choice cannot fix any of those four directly. What it can do is meet you where you are and make the rest of the work cheaper. A framework that makes durability free pays for itself the first time a long-running workflow restarts mid-step. A framework that ships native observability pays for itself the first time you have to explain a regulated decision to compliance. A framework with the wrong economic shape (seat-based pricing on a token-economy workload) creates an argument with finance that the technology cannot win.
Choose the family first, the framework second.
The three families
Underneath the ten contenders are three architectural families. Most of the meaningful differences live at this level; the framework-by-framework differences inside a family are smaller than the marketing decks suggest.
Graph-based
The agent system is modelled as a state machine. Nodes are functions or agent calls; edges are transitions; the runtime executes the graph and persists state at checkpoints. Strongest for systems where you need explicit control over what runs next, deterministic resumability, and a visible execution path you can show to auditors. LangGraph is the category default. AutoGen v0.4 is a credible second, particularly inside Microsoft estates.
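A minimal sketch of the graph-based shape, using LangGraph's StateGraph and checkpointer APIs (the node names, state fields, and thread id are illustrative, and API details shift between versions):

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import END, StateGraph

# Shared state that every node reads from and writes to.
class State(TypedDict):
    question: str
    draft: str

# Nodes are plain functions (or agent calls) that return state updates.
def research(state: State) -> dict:
    return {"draft": f"notes on {state['question']}"}

def review(state: State) -> dict:
    return {"draft": state["draft"] + " (reviewed)"}

builder = StateGraph(State)
builder.add_node("research", research)
builder.add_node("review", review)
builder.set_entry_point("research")
builder.add_edge("research", "review")
builder.add_edge("review", END)

# The checkpointer persists state at each step; swap MemorySaver for a
# durable backend in production so interrupted runs resume at the last node.
graph = builder.compile(checkpointer=MemorySaver())
result = graph.invoke(
    {"question": "Which family fits our stage?"},
    config={"configurable": {"thread_id": "run-1"}},
)
```

The execution path is the graph itself, which is what makes the audit story straightforward.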
Role-based
The agent system is modelled as a crew of agents with named roles, goals, and communication patterns. The runtime handles the message passing between them and the agent-to-tool calls. Strongest for the fast prototype, for systems where the decomposition is naturally social ("planner, researcher, writer, critic"), and for teams that prize ergonomics over explicit control. CrewAI and the OpenAI Agents SDK (the rebranded and matured Swarm) are the two with real adoption.
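A minimal sketch of the role-based shape on CrewAI's Agent/Task/Crew surface (the roles, goals, and task text are illustrative; constructor parameters vary by version):

```python
from crewai import Agent, Crew, Task

# Roles are the unit of decomposition; the runtime owns the message passing.
researcher = Agent(
    role="Researcher",
    goal="Collect the facts relevant to the brief",
    backstory="Methodical, cites sources.",
)
writer = Agent(
    role="Writer",
    goal="Turn the research into a short memo",
    backstory="Plain prose, no filler.",
)

research_task = Task(
    description="Research orchestration framework pricing models.",
    expected_output="A bullet list of findings with sources.",
    agent=researcher,
)
writing_task = Task(
    description="Write a one-page memo from the research.",
    expected_output="A memo in markdown.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer], tasks=[research_task, writing_task])
result = crew.kickoff()
```

Note what is absent: no explicit state machine and no checkpointer. That is exactly the trade the table below scores.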
Durable-execution
The agent system runs on top of a workflow engine. Steps are durable, retryable, and survive process restarts; the agent loop is one kind of step among many. Strongest for any system that already runs durable workflows for non-agent reasons, for compliance-heavy industries, and for backend-heavy systems where the "agent" is one of several long-running concerns. Temporal is the heavyweight; Inngest Agent Kit is the developer-experience-first lightweight; Mastra and Letta are the two newer entrants built durable-first.
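A minimal sketch of the durable-execution shape on Temporal's Python SDK (`call_agent` is a stand-in for a real model call; names and timeouts are illustrative):

```python
from datetime import timedelta

from temporalio import activity, workflow

# Activities hold the side effects (model calls, tool calls) and are
# retried independently of the workflow that invokes them.
@activity.defn
async def call_agent(prompt: str) -> str:
    return f"agent output for: {prompt}"  # the model call would live here

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, prompt: str) -> str:
        # Each step is durable: if the worker process dies here, the
        # workflow replays from history rather than starting over.
        draft = await workflow.execute_activity(
            call_agent, prompt, start_to_close_timeout=timedelta(minutes=5)
        )
        return await workflow.execute_activity(
            call_agent,
            f"review this draft: {draft}",
            start_to_close_timeout=timedelta(minutes=5),
        )
```

The agent loop is just another activity among many, which is the whole point of the family.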
Scored comparison
The scoring rubric: architecture family, primary language, state durability, native observability, human-in-the-loop primitives, open-source status, hosted offering, and pricing-model fit with token economics. Eight axes in total, scored across the ten frameworks.
| Feature | LangGraph | CrewAI | AutoGen v0.4 | OpenAI Agents SDK | Letta | Mastra | Inngest Agent Kit | Temporal | Claude Agent SDK | Vercel AI SDK |
|---|---|---|---|---|---|---|---|---|---|---|
| **Architecture family** | | | | | | | | | | |
| Family | Graph-based (explicit state machine) | Role-based (crew of agents) | Graph-based (v0.4 rewrite) | Role-based with handoffs | Stateful agents with memory | Durable-execution (graph + workflows) | Durable-execution (step-functions style) | Durable-execution (workflow engine) | Tool-loop with subagents | Streaming SDK + UI primitives |
| Primary language | Python + JS/TS | Python (TS in beta) | Python + .NET | Python + JS/TS | Python + JS/TS | TypeScript-first | TypeScript-first | Go, Java, TS, Python, .NET, PHP, Ruby | Python + JS/TS | TypeScript-first |
| **Production readiness** | | | | | | | | | | |
| State durability | Checkpointer abstraction | In-memory by default | Pluggable persistence | Caller-managed | Native, persistent | Postgres-backed | Durable step-execution | Battle-tested workflow durability | Caller-managed | Streaming state only |
| Native observability | LangSmith built-in | Hooks for external tools | OpenTelemetry first-class | OpenAI traces UI | Web UI included | Built-in eval + tracing | Inngest dashboard | Temporal UI | Bring your own | AI SDK Inspector (dev only) |
| Human-in-the-loop primitives | Interrupt + resume | Approval callbacks | UserProxyAgent | Approval before tool call | Persistent inbox | Step approvals | Workflow pauses | Signals + waitForSignal | App-layer responsibility | App-layer responsibility |
| **Economics and licensing** | | | | | | | | | | |
| Open source | MIT | MIT | MIT | MIT | Apache 2.0 | Apache 2.0 | SDK open, runtime hosted | MIT (self-host) + Temporal Cloud | Apache 2.0 | Apache 2.0 |
| Hosted offering | LangGraph Platform | CrewAI Enterprise | Via Azure AI Foundry | OpenAI platform | Letta Cloud | Mastra Cloud (beta) | Inngest Cloud | Temporal Cloud | Via Anthropic API | Vercel |
| Pricing model fit with token economics | Usage-based, transparent | Per-execution tiers | Usage-based | Per-token at runtime | Per-agent + token passthrough | Usage-based | Per-step pricing | Per-action billing | Per-token at runtime | Usage-based + Vercel hosting |
The radar verdict
Same data, grouped by recommendation tier. The same framework can be "Recommended" for one team and "Specialist" for another; treat the labels as starting points, not endpoints.
Recommended
- LangGraph. The default choice for graph-based orchestration with serious production needs. Strongest observability story (LangSmith), best checkpoint primitives, broadest community. Tax: tighter coupling to the LangChain ecosystem than some teams want.
- Temporal. If your operation already runs Temporal workflows, the Temporal-for-AI extensions are the lowest-risk path. Durability and language coverage are unmatched. Tax: workflow engines have a learning curve unrelated to agents.
- Mastra. TypeScript-first, durable, opinionated, with eval and tracing built in. Strongest pick for teams whose stack is Node.js end-to-end and who don't want to reach for Python services.
- OpenAI Agents SDK. Easiest on-ramp for OpenAI-stack teams. Handoff primitives match Vijayan's phased-autonomy model cleanly. Tax: vendor lock-in by design; cross-provider use is possible but unsupported as a first-class path.
Strong contenders
- AutoGen v0.4. The v0.4 rewrite from Microsoft Research recovered most of the architectural ground v0.2 was losing. OpenTelemetry-first is rare in this category. Best fit for teams already in the Azure AI Foundry / .NET ecosystem.
- Letta. Stateful agents with persistent memory as the design centre, not an add-on. The right pick if your product needs each agent to be a long-running identity rather than a stateless function.
- Claude Agent SDK. Anthropic's tool-loop SDK; tight, well-documented, opinionated. Best for narrow agent surfaces where you want Anthropic's defaults. Subagent pattern is the most thought-through in the category.
- Inngest Agent Kit. Durable steps wrapped around an agent loop, with a friendly TypeScript surface and a hosted dashboard. Solid choice for product-engineering teams allergic to workflow-engine sprawl.
Specialist or watching
- CrewAI. Strong onboarding story and a clean role-based mental model. Production gaps around state durability and observability widen against the pack as the system scales.
- Vercel AI SDK. Best-in-class as a streaming SDK for chat-first apps; not really an orchestration framework. Often paired with one of the others rather than competing.
Sequencing: which framework for which stage
Anil Vijayan at Everest Group, interviewed for the HBR/Deloitte paper, lays out a three-stage adoption sequence that maps cleanly to framework choice:
- Task automation. Low cost of error, high volume, single-system. The cost of getting a framework choice wrong here is small. CrewAI and the OpenAI Agents SDK have the lowest on-ramp; LangGraph and Mastra are fine if you expect to grow into them.
- End-to-end multisystem processes. Workflows that touch several backends, run for hours or days, and must survive failures. Durable-execution frameworks dominate this stage. Temporal if you have the engineering bench; Inngest or Mastra if you want a faster ramp.
- Function-by-function transformation. Whole business functions reimagined around agents. Observability becomes a first-class concern because the impact surface is larger than any one team. Graph-based frameworks with deep tracing (LangGraph, AutoGen v0.4) tend to win here, often paired with a durable-execution layer underneath.
Governance and the human-in-the-loop ramp
The same HBR/Deloitte paper cites a September 2025 Gartner survey of 360 IT application leaders worldwide in which only 13% strongly agreed they had the right governance structures in place to manage agentic AI. The number is consistent with what practitioners report: governance is the slowest-moving piece of the puzzle, and the cost of getting it wrong scales with autonomy.
Vijayan describes the right operating model as a phased transition: humans review every decision an agent makes, then move to spot-checking a sample of decisions as confidence grows, then reach a point where humans step in only when something goes wrong. Frameworks differ in how cheaply they let you operationalise that ramp.
- LangGraph, Letta, Temporal, and the OpenAI Agents SDK ship native interrupt-and-resume primitives. The phased-autonomy ramp is configuration, not application code (see the sketch after this list).
- Mastra, Inngest, and AutoGen v0.4 expose approvals as durable steps or message types. Slightly more wiring; same result.
- The Claude Agent SDK, Vercel AI SDK, and CrewAI leave most of the human-in-the-loop wiring to the application. Faster to prototype, more expensive to operate at autonomy levels two and three.
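As a concrete instance of the first group, a sketch of the interrupt-and-resume pattern via LangGraph's `interrupt_before` compile option, reusing the `builder` from the graph sketch earlier in this guide (node and thread names remain illustrative):

```python
from langgraph.checkpoint.memory import MemorySaver

# Autonomy level one: pause before every "review" step for human sign-off.
graph = builder.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["review"],
)
config = {"configurable": {"thread_id": "run-2"}}

# The first invoke stops at the interrupt; state is checkpointed, not lost.
graph.invoke({"question": "Refund the customer?"}, config=config)

# A human inspects the pending state and approves out of band.
print(graph.get_state(config).next)  # -> ("review",)

# Invoking with None resumes from the checkpoint instead of restarting.
graph.invoke(None, config=config)
```

Moving to spot-checking is then a matter of narrowing `interrupt_before` to fewer nodes or sampling threads: configuration, not application code.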
Kunal Basal, chief digital officer at Mankind Pharma and one of the HBR/Deloitte paper's case studies, frames the adoption pattern that travels well: agentic recommendations as "decision support, not decision replacement," with outputs that are "transparent, contextual, and explainable rather than prescriptive." Frameworks with strong observability make that pattern easier to implement and audit; frameworks without it make the pattern depend on the application team's discipline alone.
Economics: tokens, autonomy levels, and the death of seat pricing
The HBR/Deloitte paper makes the economic point that practitioners feel monthly: every agent interaction consumes tokens, each token carries a trackable cost, and total spend scales nonlinearly with usage. Classic total-cost-of-ownership frameworks built for licenses and seats do not capture this shape.
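A back-of-the-envelope model of that shape (the rates below are placeholders, not any provider's published pricing): spend tracks runs and steps, not head count.

```python
def monthly_agent_spend(
    runs_per_month: int,
    steps_per_run: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
) -> float:
    """Token spend grows with runs * steps; adding seats changes nothing."""
    per_step = (
        input_tokens_per_step / 1000 * usd_per_1k_input
        + output_tokens_per_step / 1000 * usd_per_1k_output
    )
    return runs_per_month * steps_per_run * per_step

# Placeholder rates: doubling usage doubles spend, and the nonlinearity
# arrives when retries, tool loops, and longer contexts inflate steps per run.
print(monthly_agent_spend(10_000, 8, 2_000, 500, 0.003, 0.015))  # -> 1080.0
```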
Alex Bakker at ISG Research, quoted in the same paper, describes the pricing innovation that is starting to land in agentic AI contracts: autonomy-level pricing. Predefined tiers established at the start of a multiyear contract give providers clear incentives to pursue automation without requiring a change order and governance review every time an improvement is proposed. The mechanism is still rare; the direction is clear.
For framework selection, the practical implication is to weight the pricing-model axis as heavily as the technical axes. A framework that prices per execution or per token aligns with the workload; a framework that prices per seat or per fixed deliverable will create a financial argument the technology cannot win.
Field evidence from CTAIO Labs
CTAIO Labs is the practitioner surface of our network. Season 2 ran a head-to-head field test of six orchestration frameworks against the same workload, measured on the same axes scored in the table above. The methodology and per-framework results are the empirical layer underneath this guide.
Frequently asked questions
What is an agent orchestration framework?
An agent orchestration framework is the software layer that coordinates multiple AI agents, tools, and state across long-running workflows. It handles state management between agent turns, retries when calls fail, observability so you can see what the agents did, human-in-the-loop interrupts, and the communication patterns between agents. The framework sits between the model APIs and the application that delivers the user-facing product.
What are the three families of orchestration frameworks?
Graph-based frameworks model an agent system as a state machine with explicit nodes and edges; LangGraph and AutoGen v0.4 are the canonical examples. Role-based frameworks let you instantiate a 'crew' of agents with named roles and let them communicate; CrewAI and OpenAI's Agents SDK lean here. Durable-execution frameworks come from the workflow-engine world (Temporal, Inngest) or are built on top of one (Mastra, Letta) and emphasise restartability, durability, and step-level retries. Pick the family that matches the failure mode you most need to engineer around.
Why do so many agentic AI projects fail?
The 2026 HBR Analytic Services white paper sponsored by Deloitte cites Gartner's projection that 40% of agentic AI projects will be canceled by the end of 2027. The interviews in that paper attribute most cancellations to organizational rather than technical causes: process debt (workflows built for humans), data debt (fragmented information), technical debt (legacy systems that won't integrate), and cultural resistance to role changes. Framework choice matters, but it is rarely the binding constraint on success.
How do you choose between LangGraph, CrewAI, and AutoGen?
If you need explicit state transitions, deterministic resumability, and the best observability story in the category, LangGraph. If you want the fastest path to a prototype and a clean role-based mental model, CrewAI, but plan to harden the production layer yourself. If you are inside the Microsoft or .NET ecosystem, or your team is comfortable with OpenTelemetry-first observability, AutoGen v0.4. CTAIO Labs' Season 2 ran a head-to-head on six frameworks against the same workload; the methodology and per-framework scores are on ctaio.dev.
Where does Temporal fit?
Temporal is a general-purpose workflow engine that added first-class agent primitives across 2025 and 2026. If you already run Temporal for non-agent workloads, adding agents on top is the lowest-risk durable-execution path. If you don't run Temporal, the learning curve is shaped less by agents and more by the workflow-engine model itself. The trade: heavier upfront investment, far fewer surprises at scale.
What is the right way to handle human-in-the-loop oversight?
Anil Vijayan at Everest Group, quoted in the HBR/Deloitte paper, describes a phased transition that starts with humans reviewing every decision an agent makes, then shifts to spot-checking a sample of decisions as confidence grows, and eventually reaches a point where humans step in only when something goes wrong. Frameworks that ship native interrupt-and-resume primitives (LangGraph, Letta, Temporal, OpenAI Agents SDK with approval-before-tool-call) make this transition cheaper to operationalise than frameworks where it lives in application code.
Why does pricing matter so much in this category?
Because token economics breaks classic total-cost-of-ownership frameworks. The HBR/Deloitte paper makes the point sharply: every agent interaction consumes tokens, each carrying a cost that scales with usage rather than seat count. Frameworks priced per execution, per token, or by autonomy tier (ISG Research's Alex Bakker discusses 'autonomy-level pricing' in the paper) align provider incentives with progressive automation. Frameworks priced per seat or per fixed deliverable do not.
Is agent orchestration the same as enterprise AI agent platforms?
Related but not the same. The platforms surface (OpenAI Frontier, Microsoft Agent365, Amazon Bedrock AgentCore, Google Vertex AI Agent Builder) is covered in the sister guide at /en/guides/enterprise-ai-agent-platforms-2026/. Those platforms are end-to-end vendor offerings with hosting, security, and governance bundled in. The frameworks in this guide are code-level libraries that you compose into your own stack. Most production deployments end up using both: a hyperscaler platform for hosting and identity, and a framework for the in-code orchestration.
What is the sequencing recommendation for adopting agents?
Vijayan at Everest Group lays it out in the HBR/Deloitte paper: start with automation of tasks with a low cost of error and high volume; then move to end-to-end multisystem processes with more meaningful business KPIs; then move to function-by-function transformation. The framework choice often follows the stage. Task-level: anything works; CrewAI and OpenAI Agents SDK have the best on-ramp. Multisystem: durable-execution frameworks (Temporal, Inngest, Mastra) start to dominate. Function-level: graph-based frameworks with deep observability (LangGraph, AutoGen v0.4) tend to win.