May 20, 2026

AI Inference Platforms 2026: Vendor Comparison

Ten AI inference platforms compared on speed, model coverage, fine-tuned-model support, pricing, and procurement fit: Fireworks, Together, Modal, Baseten, Replicate, RunPod, OpenRouter, Anyscale, and Vercel AI Gateway, plus a hyperscaler baseline.

We The Flywheel Research & Analysis

Inference is where AI products actually live, and where the unit economics get decided. The landscape: ten platforms split across five lanes, a roughly 90-percent decline in OSS-model inference cost between mid-2023 and early 2026, and a procurement model that looks nothing like the compute layer underneath it. This is the buyer-side guide: how the lanes split, which platform wins which lane, and the levers that actually move effective cost.

Key takeaways

Five lanes — Optimised OSS hosts (Fireworks, Together) for the fastest path on open weights; serverless GPU (Modal, Baseten, Replicate, RunPod) for bring-your-own-model; aggregators (OpenRouter, Vercel AI Gateway) for multi-model routing; distributed frameworks (Anyscale) for complex pipelines; hyperscaler-managed for enterprise procurement.
What changes in 2026 — OSS models (Llama 3.x, Qwen 2.5, DeepSeek V3, Mistral Large 2) closed the quality gap with closed models for most production use cases. The economics shifted toward OSS-on-optimised-hosts as the default for cost-sensitive workloads.
What buyers underweight — Tail latency and the variance profile. P50 looks fine almost everywhere; P99 separates the mature platforms from the rest. Caching strategy, batch scheduling, and continuous-batching maturity show up at the tail.
What buyers overweight — Per-token list price. Effective cost depends on prompt-cache hit rate, reservation discounts, fine-tuned-adapter pricing, and the network egress that the architecture forces. The headline rate explains under half of real spend.

10 Platforms in this comparison

5 Distinct lanes (optimised OSS, serverless GPU, routers, distributed, hyperscaler)

~90% Decline in OSS-model inference cost on the named vendors between mid-2023 and early 2026

50–90% Effective cost reduction on agentic workloads with mature prompt caching

The five lanes

Inference platforms marketed against each other often serve different jobs. The lane split is the first cut.

1. Optimised OSS hosts

Fireworks AI and Together AI. The fastest path to production on open-weights models, with deeply optimised serving stacks (continuous batching, speculative decoding, prefix caching, fused kernels). Fireworks is the speed leader on latency-sensitive workloads; Together is the catalogue and full-stack leader. Pick this lane whenever the model is OSS and either latency or breadth matters.

2. Serverless GPU

Modal Labs, Baseten, Replicate, and RunPod. Bring-your-own-model platforms with per-second billing and container-based packaging. The right shape when the workload is not pure LLM inference, when the model is custom-architecture, or when the team needs control over the serving runtime. Modal is the container-native default. Baseten is the framework-led enterprise-friendly pick. Replicate is strongest for discovery and prototyping. RunPod is the cost-optimised lane.

3. Aggregators and routers

OpenRouter and Vercel AI Gateway. A layer above the serving providers that exposes a unified API, routes across underlying vendors, and centralises observability. No longer optional for any product running a multi-model strategy. OpenRouter is the broad default; Vercel AI Gateway is the right pick inside the Vercel ecosystem.

4. Distributed-inference frameworks

Anyscale, built on Ray Serve by the same team that created Ray. Fits when the inference workload is not a single model call but a graph with preprocessing, multi-model fan-out, or shared state that needs Ray's distributed primitives. Sales-led commercial posture and a steeper learning curve than the managed serverless platforms, in exchange for flexibility the managed platforms cannot match.

5. Hyperscaler-managed

AWS Bedrock, Azure OpenAI, and GCP Vertex AI. The procurement-friendly default and the only path to certain closed models under enterprise contract. List per-token pricing runs higher than the optimised OSS hosts. The gap closes once enterprise discounts and the savings from staying inside an existing compliance perimeter are loaded in.

Caching is the biggest cost lever in the stack

For any workload with substantial repeated context (long system prompts, RAG with stable retrievals, multi-turn agentic chains), prompt caching cuts effective per-call cost by 50 to 90 percent. The mechanism is simple: the platform stores the KV cache for the prefix portion of the prompt and skips the compute on subsequent calls that share that prefix. Per-prefix caching with configurable TTLs is mature on Fireworks, Together, Baseten, and the hyperscalers. Modal, RunPod, and Replicate leave caching mostly to the application layer.
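The prefix-reuse mechanism described above can be sketched as a toy model. This is an illustration only, not any vendor's implementation: the `PrefixCache` class, its method names, and the string placeholder standing in for stored KV state are all hypothetical, and production serving stacks typically cache at block granularity with eviction and TTLs rather than storing every prefix.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache: tracks which token prefixes have already been processed."""

    def __init__(self) -> None:
        # Maps a token prefix to a placeholder for its stored KV state (hypothetical).
        self._store: Dict[Tuple[str, ...], str] = {}

    def longest_cached_prefix(self, tokens: List[str]) -> int:
        """Return the length of the longest prefix of `tokens` already in the cache."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._store:
                return end
        return 0

    def process(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (tokens_reused, tokens_computed) for one request."""
        reused = self.longest_cached_prefix(tokens)
        computed = len(tokens) - reused
        # Record every prefix so later requests can match at any boundary.
        for end in range(1, len(tokens) + 1):
            self._store[tuple(tokens[:end])] = "kv-state"
        return reused, computed


cache = PrefixCache()
system_prompt = ["sys"] * 8  # stands in for a long, stable system prompt

first = cache.process(system_prompt + ["q1"])        # cold call: nothing to reuse
second = cache.process(system_prompt + ["q2", "x"])  # shares the 8-token prefix
print(first, second)  # (0, 9) (8, 2)
```

The second call computes only the two tokens that differ, which is where the saving on long shared prefixes comes from.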

For any agentic workload (long system prompts, persistent tool definitions, long-running conversation context), evaluate caching maturity before locking in a platform. The headline per-token rate is genuinely misleading when caching can move the effective rate by an order of magnitude.
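To see how far caching can move the effective rate, the arithmetic can be worked through in a few lines. The function below is a sketch under stated assumptions: the list price, the cached share of the prompt, and the discount applied to cached tokens are placeholder inputs, not figures from any vendor's price sheet.

```python
def effective_input_cost(
    prompt_tokens: int,
    cached_fraction: float,
    price_per_million: float,
    cached_discount: float,
) -> float:
    """Input cost in dollars for one call.

    cached_fraction: share of prompt tokens served from cache (0.0 to 1.0).
    cached_discount: fraction taken off list price for cached tokens
                     (0.9 means cached tokens bill at 10% of list).
    All values are hypothetical inputs supplied by the caller.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    per_token = price_per_million / 1_000_000
    cached = prompt_tokens * cached_fraction
    uncached = prompt_tokens - cached
    return uncached * per_token + cached * per_token * (1.0 - cached_discount)


# Placeholder scenario: 10,000-token prompt, 90% of it a stable prefix,
# $1.00 per million input tokens, cached tokens billed at a 90% discount.
list_cost = effective_input_cost(10_000, 0.0, 1.00, 0.9)
cached_cost = effective_input_cost(10_000, 0.9, 1.00, 0.9)
saving = 1.0 - cached_cost / list_cost
print(f"list ${list_cost:.4f}, with cache ${cached_cost:.4f}, saving {saving:.0%}")
```

With these placeholder inputs the saving works out to about 81 percent. The result is sensitive to both the hit rate and the discount, which is why both belong in the vendor conversation.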

LoRA hot-swapping is the 2026 differentiator

Multi-tenant fine-tuned-model serving is the inference economics question for any product running customer-specific or use-case-specific fine-tunes. Vendors that treat LoRA hot-swapping as first-class (Fireworks, Together, Anyscale) serve many adapters on the same base-model GPU memory at near-base-model economics. Vendors that do not support it force each fine-tune onto a dedicated endpoint, which typically multiplies cost per unit of throughput by 5 to 20x, depending on how well each endpoint is utilised.

If the roadmap involves serving fine-tuned models at scale, this is the single criterion that will dominate the platform decision two years into the contract. Ask explicitly about multi-LoRA throughput, adapter swap cost, and per-adapter pricing before signing.
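A rough way to reason about the dedicated-endpoint penalty is to count GPUs under each model. The sketch below makes simplifying assumptions that should be checked against a real vendor: one GPU per dedicated endpoint, adapters that pack perfectly onto shared GPUs, and no memory or swap overhead per adapter. The function names and the utilisation figures are hypothetical.

```python
def dedicated_gpus(adapters: int) -> int:
    """One endpoint, and therefore one GPU, per fine-tune (simplifying assumption)."""
    return adapters


def shared_gpus(adapters: int, utilisation_pct: int) -> int:
    """GPUs needed when adapters share base-model memory.

    utilisation_pct: share of one GPU that each adapter's traffic keeps busy,
    as a whole-number percentage. Assumes perfect packing and zero swap cost.
    """
    if adapters <= 0 or not 0 < utilisation_pct <= 100:
        raise ValueError("adapters must be positive and utilisation_pct in 1..100")
    # Integer ceiling of adapters * utilisation_pct / 100, with at least one GPU.
    return max(1, (adapters * utilisation_pct + 99) // 100)


def cost_multiplier(adapters: int, utilisation_pct: int) -> float:
    """How many times more GPUs the dedicated model needs than the shared one."""
    return dedicated_gpus(adapters) / shared_gpus(adapters, utilisation_pct)


for pct in (5, 10, 20):
    print(pct, cost_multiplier(20, pct))
# 5 20.0
# 10 10.0
# 20 5.0
```

Under these assumptions, twenty fine-tunes that each keep a GPU 5 to 20 percent busy land between 20x and 5x, which shows why the multiplier depends so heavily on per-endpoint utilisation.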

Scored comparison

The scoring rubric: lane positioning, strongest capability, OSS catalogue depth, closed-model access, LoRA hot-swap support, prompt caching, continuous batching and speculative decoding, dedicated endpoint option, pricing model, multi-region, compliance footprint, and observability. Twelve axes across ten platforms.

Feature	Fireworks AI	Together AI	Modal Labs	Baseten	Replicate	RunPod	OpenRouter	Anyscale	Vercel AI Gateway	Hyperscalers (Bedrock / Azure OpenAI / Vertex)
Lane and positioning
Primary lane	Optimised OSS host; speed leader	Optimised OSS host; full-stack	Serverless GPU; container-native	Serverless GPU; framework-led (Truss)	Serverless GPU; model marketplace	Serverless GPU; cheapest serverless	Aggregator and router	Distributed-inference framework (Ray)	Routing + caching + observability layer	Hyperscaler-managed; enterprise procurement
Strongest at	Lowest P99 latency on OSS models; FireOptimizer; FireAttention	Broadest OSS model catalogue; training-plus-inference from one vendor	Custom containers; flexible runtime; scientific compute	Opinionated serving framework; enterprise-friendly contracts	Discovery, prototyping; pay-per-second on a wide catalogue	Lowest serverless cost; cold-start ergonomics	Unified API across 200+ models; price-aware routing	Complex pipelines; multi-model graphs; Ray-native shops	Edge-cached routing across providers; observability	Enterprise compliance; integrated stack; closed-model access
Model and adapter coverage
OSS model catalogue	Curated, deeply-optimised set (Llama, Qwen, DeepSeek, Mixtral)	Broadest catalogue (200+ OSS models)	Bring-your-own (any container)	Bring-your-own (Truss); curated catalogue available	Curated catalogue; community models	Bring-your-own; templates for major models	Routes to OSS-host providers under the hood	Bring-your-own; Ray Serve abstraction	Aggregates across providers	Curated; Llama / Mistral / Cohere where available
Closed-model access	Not the focus	Not the focus	Bring-your-own only	Bring-your-own only	OSS catalogue only	Bring-your-own only	OpenAI, Anthropic, Google, xAI, more via aggregation	Bring-your-own only	Aggregates across providers including closed	Bedrock (Claude, Cohere); Azure OpenAI; Vertex (Gemini)
LoRA / fine-tuned-adapter hot-swap	First-class; multi-LoRA serving at base-model cost	First-class; multi-LoRA	Supported via container	Supported via Truss configuration	Per-model; limited multi-LoRA	Supported via container	Not applicable (aggregator)	Supported via Ray Serve	Not applicable (router)	Supported (Bedrock custom models; Azure fine-tunes)
Performance and economics
Prompt caching	Mature; per-prefix, configurable TTL	Mature	Application-managed	Built-in; configurable	Per-model; limited	Application-managed	Routes to underlying-provider cache when present	Ray-managed; application-led	Edge-cached at the gateway	Bedrock prompt caching; Azure OpenAI cache
Continuous batching / speculative decoding	Both; deep optimisation work	Both	Depends on user container	Continuous batching	Per-model	Depends on user container	Inherits from underlying provider	Configurable via Ray Serve	Inherits from underlying provider	Provider-managed
Dedicated endpoint option	Available; per-GPU pricing	Available	Native model	Native model	Available	Native model	Not applicable	Native model	Not applicable	Provisioned throughput / PTUs
Pricing model	Public per-token + dedicated	Public per-token + dedicated	Public per-second GPU	Public + sales-led for enterprise	Public per-second	Public per-second	Pass-through with markup; public	Sales-led	Public; usage-based	Public + Savings Plans / PTUs
Operations and procurement
Multi-region	Multi-region	Multi-region	Multi-region	Multi-region	Multi-region	Multi-region	Routes by region	Multi-region	Edge network	Global regions; sovereign tiers
Compliance footprint	SOC 2, HIPAA	SOC 2, HIPAA	SOC 2	SOC 2, HIPAA	SOC 2	SOC 2	SOC 2	SOC 2, HIPAA	SOC 2	FedRAMP / IL5 / sovereign; full enterprise stack
Observability	Built-in; per-request traces	Built-in	Built-in	Built-in; deep	Built-in	Basic	Strong; cross-provider	Ray Dashboard + integrations	Strong; cross-provider	CloudWatch / Monitor / Cloud Operations

The verdict by lane

Same data, organised by lane and recommendation. Most production AI products end up with two relationships: an optimised OSS host or serverless GPU platform for the bulk of inference, and an aggregator or hyperscaler for routing flexibility and closed-model access.

Recommended for fastest OSS-model production

Fireworks AI. The default when latency and tail-latency matter. FireOptimizer and FireAttention deliver the lowest P99 on Llama, Qwen, and DeepSeek across the named vendors, with mature continuous batching, speculative decoding, and per-prefix caching. First-class multi-LoRA serving means fine-tunes run at near-base-model economics. Tax: catalogue is curated rather than exhaustive; bring-your-own-model is supported but not the focus.
Together AI. The broadest OSS catalogue in the category and the only vendor that combines compute, fine-tuning, and dedicated inference under one contract. Right when the lifecycle should not be split across vendors. Tax: not always the lowest P99 on the most-optimised models, where Fireworks pulls ahead.

Recommended for bring-your-own-model serving

Modal Labs. The container-native pick. Any model that runs in Python or a custom container, with developer ergonomics that match how research teams already work. Strong on scientific compute and non-LLM workloads (image, audio, video). Tax: less opinionated than Baseten on serving framework; some teams want more scaffolding.
Baseten. The framework-led pick. Truss is opinionated about how a model should be packaged, which translates into faster paths to production and stronger SLAs once it is there. Strongest enterprise-friendly commercial posture in the serverless-GPU lane. Tax: framework-led means the team has to learn Truss; for ad-hoc research workloads it can feel heavy.
Replicate. The discovery and prototyping pick. The catalogue is wide, the pay-per-second model is genuinely friction-free, and the Cog runtime is straightforward to package against. Right for early-stage product work where the model lineup is still in flux. Tax: harder to push to production-grade SLAs than Modal or Baseten without moving to dedicated endpoints.
RunPod. The cost-optimised pick in the serverless-GPU lane. AMD MI300X alongside NVIDIA, fast cold-starts, and per-second pricing that ends up cheapest on suitable workloads. Tax: less mature observability and SLAs than Modal or Baseten; right for cost-sensitive deployments and research, not enterprise-grade production.

Recommended for multi-model routing

OpenRouter. The default aggregator. Unified API across 200+ models (OSS and closed), price-aware routing, and a markup that is genuinely modest. Right for any product that wants to switch models without rewriting integration code, or that benefits from price-routing across providers. Tax: pass-through pricing means the underlying-provider cache profile matters; some optimisations are not portable through the aggregator.
Vercel AI Gateway. The Vercel-native pick. Edge-cached routing, observability across providers, and integration with the broader Vercel platform. Right for teams already on Vercel where the gateway becomes a natural extension of the deployment surface. Tax: tighter integration with the Vercel ecosystem; less of a fit outside it.

Recommended for distributed and enterprise

Anyscale. The Ray-native pick. Right when the inference workload involves complex multi-model graphs, heavy preprocessing, or shared state that Ray handles well. Distributed-inference framework rather than a managed endpoint product. Tax: sales-led commercial posture; the right tool when the team already runs on Ray.
Hyperscalers (Bedrock, Azure OpenAI, Vertex). The procurement-friendly default. Bedrock surfaces Claude, Cohere, and Llama; Azure OpenAI is the canonical path to GPT-4-class models under an enterprise contract; Vertex covers Gemini and a curated catalogue. Right when existing master agreements, compliance perimeter, or closed-model access require it. Tax: per-token list price runs higher than the optimised OSS hosts; the gap closes once enterprise discounts and integration savings are loaded in.

The six-step procurement playbook

What separates working inference procurement from the version most teams settle for.

1. Profile the workload before the first sales call. Token shape (input length, output length, ratio), QPS profile, P99 target, expected fine-tune count, caching opportunity. Without these, every vendor benchmarks against a workload that flatters their stack.
2. Shortlist three platforms per lane. Three forces real differentiation and preserves negotiating leverage. Vendors in different lanes are not really competing; comparing one optimised OSS host against one serverless GPU platform against one hyperscaler is a category error.
3. Benchmark on production-shaped traffic. Run a paid pilot with realistic prompts, realistic concurrency, and realistic cache opportunity. The metrics that matter are P50 and P99 latency, sustained throughput, effective per-call cost net of caching, and tail behaviour under load. Sales decks are not predictive.
4. Validate the caching and LoRA story. Ask for the prompt-caching configuration, the TTL behaviour, the cross-region cache posture, and the LoRA hot-swap throughput. Vendors who answer cleanly have mature serving infrastructure. Vendors who deflect are usually serving fine-tunes on dedicated endpoints under the hood.
5. Negotiate around effective cost, not list price. Per-token rate, cache hit-rate behaviour, fine-tune pricing, dedicated-endpoint floor, reservation discounts, and the egress and routing overhead. The headline rate explains under half of real spend.
6. Build the multi-platform strategy explicitly. An optimised OSS host or serverless GPU platform for the bulk of inference; an aggregator for routing flexibility and closed-model access; a hyperscaler relationship for the parts of the product that need it. Document which workload lives where and why.
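For the benchmarking step, the gap between P50 and P99 is easy to compute from raw pilot latencies. The snippet below is a minimal sketch using the nearest-rank percentile definition; the latency samples are synthetic and chosen only to show how a healthy median can hide a slow tail.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic pilot data in milliseconds: 98 fast calls and two slow outliers.
latencies_ms = [200.0] * 98 + [1500.0, 4000.0]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
print(p50, p99, mean)  # 200.0 1500.0 251.0
```

The median and the mean both look acceptable here, while P99 is more than seven times the median. Collect enough samples that the tail percentile rests on more than a handful of requests.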

When to combine platforms

Production AI products converge on multi-platform strategies. The combinations that work in practice:

Fireworks or Together for the bulk + OpenRouter for routing and closed-model access. The optimised-OSS-plus-aggregator pattern. Maximum performance on the OSS models that dominate inference spend, with a routing layer that handles closed-model calls and price-aware fallback.
Modal or Baseten for custom models + Fireworks for LLM inference. The bring-your-own-plus-optimised pattern. Right when the product runs both custom models (image, audio, custom-architecture) and standard LLM inference; trying to force both onto a single platform usually compromises one of them.
Anyscale for the inference graph + Fireworks or Together as the LLM engine inside the graph. The framework-plus-engine pattern. Right for complex pipelines where Ray's distributed primitives matter and LLM inference is one node in a larger graph.
Hyperscaler for the regulated surface + an optimised OSS host for the rest. The compliance-plus-economics pattern. Closed-model calls and regulated workloads on Bedrock or Azure OpenAI; bulk OSS inference on Fireworks or Together where the economics are 3 to 10x better.

Frequently asked questions

What is an AI inference platform?

A service that serves AI model predictions over an API. The 2026 category covers five lanes: optimised OSS hosts that deliver maximum speed on open-weights models (Fireworks, Together); serverless GPU platforms that host customer-supplied models (Modal, Baseten, Replicate, RunPod); aggregators and routers that abstract over multiple providers (OpenRouter, Vercel AI Gateway); distributed-inference frameworks for complex pipelines (Anyscale); and hyperscaler-managed services that ship under enterprise contracts (Bedrock, Azure OpenAI, Vertex).

How do Fireworks AI and Together AI compare?

Both are optimised OSS hosts. Fireworks is the speed leader, with FireOptimizer and FireAttention delivering the lowest P99 on Llama, Qwen, and DeepSeek across the category, plus mature continuous batching and speculative decoding. Together is broader, with a 200-plus model catalogue and a full-stack story (compute, fine-tuning, dedicated inference from one vendor). If latency is the binding constraint and the model lineup is small and stable, Fireworks. If catalogue breadth matters or the lifecycle is consolidated under one vendor, Together.

When does a serverless GPU platform beat an optimised OSS host?

When the model is not on the optimised catalogue, or when the workload is not pure LLM inference. Modal, Baseten, Replicate, and RunPod accept any model packaged in a container, which is the right shape for image, audio, video, and custom-architecture workloads where the optimised OSS hosts are not the focus. The tradeoff is per-token economics: the optimised hosts have invested heavily in inference-specific optimisations that a generic serverless platform does not match on like-for-like LLM workloads.

Is OpenRouter cheaper than going direct?

Roughly the same in most cases. OpenRouter passes through underlying provider pricing with a modest markup, and price-aware routing can save money on workloads where the cheapest model that meets quality varies request-to-request. The savings come from routing flexibility, not from arbitrage. The real value is the unified API surface and the ability to switch models without rewriting integration code, which is increasingly important as multi-model strategies become standard.
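The idea of picking the cheapest model that still meets a quality bar can be illustrated with a short sketch. This is not OpenRouter's routing algorithm: the model names, prices, and quality scores below are invented for illustration, and in practice the quality score would come from the team's own evaluations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str                 # hypothetical identifier
    price_per_million: float  # hypothetical blended dollars per 1M tokens
    quality: float            # hypothetical 0-1 score from in-house evals


def cheapest_meeting_quality(
    options: List[ModelOption], min_quality: float
) -> Optional[ModelOption]:
    """Return the lowest-priced option whose quality meets the bar, or None."""
    eligible = [o for o in options if o.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.price_per_million)


catalogue = [
    ModelOption("small-oss", 0.10, 0.70),
    ModelOption("medium-oss", 0.60, 0.85),
    ModelOption("large-closed", 3.00, 0.95),
]

easy = cheapest_meeting_quality(catalogue, 0.60)
hard = cheapest_meeting_quality(catalogue, 0.90)
print(easy.name, hard.name)  # small-oss large-closed
```

When the required quality varies per request, the average price paid falls below the price of always calling the strongest model, which is the saving the routing layer makes possible.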

What is LoRA hot-swapping and why does it matter?

Low-Rank Adaptation (LoRA) fine-tunes are small adapter weights that modify a base model's behaviour without retraining the full model. Hot-swapping means the serving stack can route requests to different LoRA adapters on top of the same base-model GPU memory, so many fine-tunes can be served at near-base-model economics rather than at dedicated-endpoint prices. Fireworks, Together, and Anyscale treat this as first-class. Vendors that do not support it force every fine-tune onto a dedicated endpoint, which can multiply cost by 5 to 20x for the same throughput. For any product that serves multiple customer-specific or use-case-specific fine-tunes, LoRA hot-swapping is the difference between viable and unviable economics.
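A toy version of one layer makes the hot-swap idea concrete. The sketch below keeps a single shared base weight matrix and applies a per-request low-rank update, computing base·x + up·(down·x). It is written in plain Python for readability, omits the usual LoRA scaling factor, and uses hypothetical class and adapter names; real serving stacks batch this on the GPU.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class MultiLoraLayer:
    """One linear layer with a shared base weight and swappable low-rank adapters."""

    def __init__(self, base: Matrix) -> None:
        self.base = base  # loaded once, shared by every adapter
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def register(self, name: str, down: Matrix, up: Matrix) -> None:
        """Store an adapter: `down` is rank x d_in, `up` is d_out x rank."""
        self.adapters[name] = (down, up)

    def forward(self, x: Vector, adapter: Optional[str] = None) -> Vector:
        y = matvec(self.base, x)
        if adapter is None:
            return y
        down, up = self.adapters[adapter]
        delta = matvec(up, matvec(down, x))  # low-rank correction for this request
        return [a + b for a, b in zip(y, delta)]


layer = MultiLoraLayer(base=[[1.0, 0.0], [0.0, 1.0]])
layer.register("customer-a", down=[[1.0, 0.0]], up=[[0.0], [1.0]])
layer.register("customer-b", down=[[0.0, 1.0]], up=[[2.0], [0.0]])

x = [1.0, 2.0]
print(layer.forward(x))                # [1.0, 2.0]
print(layer.forward(x, "customer-a"))  # [1.0, 3.0]
print(layer.forward(x, "customer-b"))  # [5.0, 2.0]
```

Each adapter here adds only a rank-1 pair of small matrices, while the base weights stay resident and unchanged. That is why many fine-tunes can share one copy of the base model.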

How does prompt caching change effective cost?

For workloads with substantial repeated context (long system prompts, RAG with stable retrievals, multi-turn agentic chains), prompt caching cuts effective per-call cost by 50 to 90 percent. The mechanism is straightforward: the platform stores the KV cache for the prefix portion of the prompt and skips the compute on subsequent calls that share the prefix. Fireworks, Together, Baseten, and the hyperscalers ship mature implementations with configurable TTLs and per-prefix granularity. Modal, RunPod, and Replicate leave caching mostly to the application layer. For any agentic workload, evaluate caching maturity before locking in a platform; it is the largest single cost lever available.

Should I self-host with vLLM or SGLang instead?

Sometimes. Self-hosting on rented GPUs (from a neocloud like CoreWeave or Lambda) with vLLM or SGLang is competitive on cost above roughly 200 to 500 RPS of sustained traffic, depending on the model and the topology. Below that, the per-token economics on optimised OSS hosts beat self-hosting once the operational overhead is loaded in. Above it, self-hosting wins on cost and on the ability to tune the serving stack to the workload. The cross-over point is moving downward as serving frameworks mature, so re-evaluate annually. Note that self-hosting still requires a compute relationship; see the companion compute and neocloud guide.
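The cross-over point can be estimated with back-of-envelope arithmetic before any benchmarking. The function below is a sketch under strong assumptions: a fixed hourly self-hosting cost (GPUs plus operational overhead), a flat hosted per-token price, and a fleet that can actually sustain the break-even request rate. All numbers in the example are placeholders chosen to keep the arithmetic simple, not measurements.

```python
def break_even_rps(
    self_host_cost_per_hour: float,
    tokens_per_request: float,
    hosted_price_per_million: float,
) -> float:
    """Sustained requests per second at which hosted spend equals self-hosted spend.

    self_host_cost_per_hour: GPU rental plus operational overhead, in dollars.
    tokens_per_request: average billable tokens (input plus output) per call.
    hosted_price_per_million: blended hosted price in dollars per 1M tokens.
    """
    if min(self_host_cost_per_hour, tokens_per_request, hosted_price_per_million) <= 0:
        raise ValueError("all inputs must be positive")
    hosted_cost_per_request = tokens_per_request * hosted_price_per_million / 1_000_000
    hosted_cost_per_hour_per_rps = hosted_cost_per_request * 3600
    return self_host_cost_per_hour / hosted_cost_per_hour_per_rps


# Placeholder inputs: $180/hour all-in for the self-hosted fleet,
# 1,000 tokens per request, $0.20 per million tokens on the hosted platform.
rps = break_even_rps(180.0, 1_000, 0.20)
print(f"break-even at roughly {rps:.0f} requests per second")
```

Below the computed rate the hosted platform is cheaper under these assumptions, and above it self-hosting is. The estimate only holds if the self-hosted fleet can serve that rate at the required latency, so validate it against measured throughput.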

How long is the procurement cycle?

Self-serve on-demand access takes minutes on Fireworks, Together, Modal, Baseten, Replicate, RunPod, OpenRouter, and Vercel AI Gateway. Enterprise contracts at any of those vendors run 2 to 8 weeks. Anyscale and the hyperscalers run 6 to 16 weeks on enterprise-grade commitments. For most teams, the right path is to start self-serve on the appropriate lane, validate throughput and latency on real traffic, and convert to a committed contract once the workload is sized.
