Inference is where AI products actually live, and where the unit economics get decided. Ten platforms split across five lanes, a 90-percent decline in OSS-model cost in two years, and a procurement model that looks nothing like the compute layer underneath it. This is the buyer-side guide: how the lanes split, which platform wins which lane, and the levers that actually move effective cost.
Key takeaways
- Five lanes — Optimised OSS hosts (Fireworks, Together) for the fastest path on open weights; serverless GPU (Modal, Baseten, Replicate, RunPod) for bring-your-own-model; aggregators (OpenRouter, Vercel AI Gateway) for multi-model routing; distributed frameworks (Anyscale) for complex pipelines; hyperscaler-managed for enterprise procurement.
- What changes in 2026 — OSS models (Llama 3.x, Qwen 2.5, DeepSeek V3, Mistral Large 2) closed the quality gap with closed models for most production use cases. The economics shifted toward OSS-on-optimised-hosts as the default for cost-sensitive workloads.
- What buyers underweight — Tail latency and the variance profile. P50 looks fine almost everywhere; P99 separates the mature platforms from the rest. Caching strategy, batch scheduling, and continuous-batching maturity show up at the tail.
- What buyers overweight — Per-token list price. Effective cost depends on prompt-cache hit rate, reservation discounts, fine-tuned-adapter pricing, and the network egress that the architecture forces. The headline rate explains under half of real spend.
The five lanes
Inference platforms marketed against each other often serve different jobs. The lane split is the first cut.
1. Optimised OSS hosts
Fireworks AI and Together AI. The fastest path to production on open-weights models, with deeply optimised serving stacks (continuous batching, speculative decoding, prefix caching, fused kernels). Fireworks is the speed leader on latency-sensitive workloads; Together is the catalogue and full-stack leader. Pick this lane whenever the model is OSS and either latency or breadth matters.
2. Serverless GPU
Modal Labs, Baseten, Replicate, and RunPod. Bring-your-own-model platforms with per-second billing and container-based packaging. The right shape when the workload is not pure LLM inference, when the model is custom-architecture, or when the team needs control over the serving runtime. Modal is the container-native default. Baseten is the framework-led enterprise-friendly pick. Replicate is strongest for discovery and prototyping. RunPod is the cost-optimised lane.
3. Aggregators and routers
OpenRouter and Vercel AI Gateway. A layer above the serving providers that exposes a unified API, routes across underlying vendors, and centralises observability. No longer optional for any product running a multi-model strategy. OpenRouter is the broad default; Vercel AI Gateway is the right pick inside the Vercel ecosystem.
4. Distributed-inference frameworks
Anyscale, built on Ray Serve by the same team that created Ray. Fits when the inference workload is not a single model call but a graph with preprocessing, multi-model fan-out, or shared state that needs Ray's distributed primitives. Sales-led commercial posture and a steeper learning curve than the managed serverless platforms, in exchange for flexibility the managed platforms cannot match.
5. Hyperscaler-managed
AWS Bedrock, Azure OpenAI, and GCP Vertex AI. The procurement-friendly default and the only path to certain closed models under enterprise contract. List per-token pricing runs higher than the optimised OSS hosts. The gap closes once enterprise discounts and the savings from staying inside an existing compliance perimeter are loaded in.
Caching is the biggest cost lever in the stack
For any workload with substantial repeated context (long system prompts, RAG with stable retrievals, multi-turn agentic chains), prompt caching cuts effective per-call cost by 50 to 90 percent. The mechanism is simple: the platform stores the KV cache for the prefix portion of the prompt and skips the compute on subsequent calls that share that prefix. Per-prefix caching with configurable TTLs is mature on Fireworks, Together, Baseten, and the hyperscalers. Modal, RunPod, and Replicate leave caching mostly to the application layer.
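The mechanism can be sketched as a toy model. This is a minimal illustration, not any vendor's implementation: `PrefixCache`, whitespace "tokens", and the 300-second default TTL are all assumptions made for the example, and a real serving stack would hold KV tensors rather than timestamps.

```python
import hashlib
import time
from typing import Dict, Optional


class PrefixCache:
    """Toy stand-in for a per-prefix KV cache with a TTL (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._expiry: Dict[str, float] = {}  # prefix hash -> time the entry expires

    def tokens_to_compute(self, prefix: str, suffix: str, now: Optional[float] = None) -> int:
        """Return how many whitespace-delimited tokens still need prefill compute."""
        now = time.monotonic() if now is None else now
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        prefix_tokens = len(prefix.split())
        suffix_tokens = len(suffix.split())
        hit = key in self._expiry and self._expiry[key] > now
        # Store or refresh the entry so the next call sharing this prefix can hit.
        self._expiry[key] = now + self.ttl_seconds
        # On a hit, only the non-shared suffix is computed.
        return suffix_tokens if hit else prefix_tokens + suffix_tokens


cache = PrefixCache(ttl_seconds=10)
system_prompt = "You are a support agent for Acme"
first = cache.tokens_to_compute(system_prompt, "Where is my order", now=0.0)
second = cache.tokens_to_compute(system_prompt, "Cancel it", now=5.0)
print(first, second)
```

The second call pays only for its own suffix because the shared prefix is still inside its TTL; once the TTL lapses, the full prompt is computed again.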
For any agentic workload (long system prompts, persistent tool definitions, long-running conversation context), evaluate caching maturity before locking in a platform. The headline per-token rate is misleading when caching can move the effective rate by up to an order of magnitude.
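A rough way to see how far caching moves the effective rate is to blend the cached and uncached cost by hit rate. The sketch below assumes a simple pricing shape (cached prefix tokens billed at a discount, everything else at list price); the function name, the $1.00 per million tokens price, and the 90 percent discount are hypothetical inputs, not any platform's published terms.

```python
def effective_input_cost(
    prefix_tokens: int,
    unique_tokens: int,
    price_per_mtok: float,
    hit_rate: float,
    cached_discount: float,
) -> float:
    """Expected input cost in dollars for one call.

    cached_discount is the fraction taken off list price for cached prefix
    tokens (0.9 means cached tokens cost 10 percent of list). Hypothetical model.
    """
    if not 0.0 <= hit_rate <= 1.0 or not 0.0 <= cached_discount <= 1.0:
        raise ValueError("hit_rate and cached_discount must be between 0 and 1")
    # Cost when the prefix misses the cache: every token at list price.
    miss_cost = (prefix_tokens + unique_tokens) * price_per_mtok / 1e6
    # Cost when the prefix hits: prefix tokens at the discounted rate.
    cached_price = price_per_mtok * (1.0 - cached_discount)
    hit_cost = (prefix_tokens * cached_price + unique_tokens * price_per_mtok) / 1e6
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


# Illustrative inputs only: 8,000-token shared prefix, 500 unique tokens per call.
list_cost = effective_input_cost(8000, 500, 1.00, hit_rate=0.0, cached_discount=0.9)
cached_cost = effective_input_cost(8000, 500, 1.00, hit_rate=0.9, cached_discount=0.9)
saving = 1.0 - cached_cost / list_cost
print(f"list ${list_cost:.5f}, with caching ${cached_cost:.5f}, saving {saving:.0%}")
```

The useful property of writing it down this way is that hit rate and discount become explicit questions to put to each vendor during the pilot.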
LoRA hot-swapping is the 2026 differentiator
Multi-tenant fine-tuned-model serving is the inference economics question for any product running customer-specific or use-case-specific fine-tunes. Vendors that treat LoRA hot-swapping as first-class (Fireworks, Together, Anyscale) serve many adapters on the same base-model GPU memory at near-base-model economics. Vendors that lack it force each fine-tune onto a dedicated endpoint, which typically multiplies the cost per unit of throughput by 5 to 20x, depending on how well each endpoint is utilised.
If the roadmap involves serving fine-tuned models at scale, this is the single criterion that will dominate the platform decision two years into the contract. Ask explicitly about multi-LoRA throughput, adapter swap cost, and per-adapter pricing before signing.
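The multiplier follows from utilisation arithmetic. A minimal sketch, assuming each fine-tune needs less than one GPU of traffic, adapter memory overhead is negligible, and a shared pool is run at a target utilisation; `dedicated_vs_shared` and the example numbers are hypothetical.

```python
from typing import Tuple


def dedicated_vs_shared(
    adapters: int,
    per_adapter_util_pct: int,
    target_util_pct: int = 80,
) -> Tuple[int, int, float]:
    """Return (dedicated_gpus, shared_gpus, cost_multiplier).

    Dedicated: one GPU per fine-tune, however idle it sits.
    Shared: all adapters pooled on base-model GPUs via multi-LoRA serving.
    Utilisation is given in whole percent to keep the arithmetic exact.
    """
    if adapters <= 0 or per_adapter_util_pct <= 0 or target_util_pct <= 0:
        raise ValueError("all inputs must be positive")
    dedicated_gpus = adapters
    total_load_pct = adapters * per_adapter_util_pct
    # Ceiling division: GPUs needed to carry the pooled load at the target utilisation.
    shared_gpus = max(1, -(-total_load_pct // target_util_pct))
    return dedicated_gpus, shared_gpus, dedicated_gpus / shared_gpus


# Illustrative: 40 customer fine-tunes, each using about 10 percent of a GPU.
dedicated, shared, multiplier = dedicated_vs_shared(40, 10, target_util_pct=80)
print(dedicated, shared, multiplier)
```

Lower per-adapter utilisation pushes the multiplier up, which is why the long tail of small customer fine-tunes is where dedicated endpoints hurt most.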
Scored comparison
The scoring rubric: lane positioning, strongest capability, OSS catalogue depth, closed-model access, LoRA hot-swap support, prompt caching, continuous batching and speculative decoding, dedicated endpoint option, pricing model, multi-region, compliance footprint, and observability. Twelve axes across ten platforms.
| Feature | Fireworks AI | Together AI | Modal Labs | Baseten | Replicate | RunPod | OpenRouter | Anyscale | Vercel AI Gateway | Hyperscalers (Bedrock / Azure OpenAI / Vertex) |
|---|---|---|---|---|---|---|---|---|---|---|
| Lane and positioning | ||||||||||
| Primary lane | Optimised OSS host; speed leader | Optimised OSS host; full-stack | Serverless GPU; container-native | Serverless GPU; framework-led (Truss) | Serverless GPU; model marketplace | Serverless GPU; cheapest serverless | Aggregator and router | Distributed-inference framework (Ray) | Routing + caching + observability layer | Hyperscaler-managed; enterprise procurement |
| Strongest at | Lowest P99 latency on OSS models; FireOptimizer; FireAttention | Broadest OSS model catalogue; training-plus-inference from one vendor | Custom containers; flexible runtime; scientific compute | Opinionated serving framework; enterprise-friendly contracts | Discovery, prototyping; pay-per-second on a wide catalogue | Lowest serverless cost; cold-start ergonomics | Unified API across 200+ models; price-aware routing | Complex pipelines; multi-model graphs; Ray-native shops | Edge-cached routing across providers; observability | Enterprise compliance; integrated stack; closed-model access |
| Model and adapter coverage | ||||||||||
| OSS model catalogue | Curated, deeply-optimised set (Llama, Qwen, DeepSeek, Mixtral) | Broadest catalogue (200+ OSS models) | Bring-your-own (any container) | Bring-your-own (Truss); curated catalogue available | Curated catalogue; community models | Bring-your-own; templates for major models | Routes to OSS-host providers under the hood | Bring-your-own; Ray Serve abstraction | Aggregates across providers | Curated; Llama / Mistral / Cohere where available |
| Closed-model access | Not the focus | Not the focus | Bring-your-own only | Bring-your-own only | OSS catalogue only | Bring-your-own only | OpenAI, Anthropic, Google, xAI, more via aggregation | Bring-your-own only | Aggregates across providers including closed | Bedrock (Claude, Cohere); Azure OpenAI; Vertex (Gemini) |
| LoRA / fine-tuned-adapter hot-swap | First-class; multi-LoRA serving at base-model cost | First-class; multi-LoRA | Supported via container | Supported via Truss configuration | Per-model; limited multi-LoRA | Supported via container | Not applicable (aggregator) | Supported via Ray Serve | Not applicable (router) | Supported (Bedrock custom models; Azure fine-tunes) |
| Performance and economics | ||||||||||
| Prompt caching | Mature; per-prefix, configurable TTL | Mature | Application-managed | Built-in; configurable | Per-model; limited | Application-managed | Routes to underlying-provider cache when present | Ray-managed; application-led | Edge-cached at the gateway | Bedrock prompt caching; Azure OpenAI cache |
| Continuous batching / speculative decoding | Both; deep optimisation work | Both | Depends on user container | Continuous batching | Per-model | Depends on user container | Inherits from underlying provider | Configurable via Ray Serve | Inherits from underlying provider | Provider-managed |
| Dedicated endpoint option | Available; per-GPU pricing | Available | Native model | Native model | Available | Native model | Not applicable | Native model | Not applicable | Provisioned throughput / PTUs |
| Pricing model | Public per-token + dedicated | Public per-token + dedicated | Public per-second GPU | Public + sales-led for enterprise | Public per-second | Public per-second | Pass-through with markup; public | Sales-led | Public; usage-based | Public + Savings Plans / PTUs |
| Operations and procurement | ||||||||||
| Multi-region | Multi-region | Multi-region | Multi-region | Multi-region | Multi-region | Multi-region | Routes by region | Multi-region | Edge network | Global regions; sovereign tiers |
| Compliance footprint | SOC 2, HIPAA | SOC 2, HIPAA | SOC 2 | SOC 2, HIPAA | SOC 2 | SOC 2 | SOC 2 | SOC 2, HIPAA | SOC 2 | FedRAMP / IL5 / sovereign; full enterprise stack |
| Observability | Built-in; per-request traces | Built-in | Built-in | Built-in; deep | Built-in | Basic | Strong; cross-provider | Ray Dashboard + integrations | Strong; cross-provider | CloudWatch / Monitor / Cloud Operations |
The verdict by lane
Same data, organised by lane and recommendation. Most production AI products end up with two relationships: an optimised OSS host or serverless GPU platform for the bulk of inference, and an aggregator or hyperscaler for routing flexibility and closed-model access.
Recommended for fastest OSS-model production
- Fireworks AI. The default when latency and tail-latency matter. FireOptimizer and FireAttention deliver the lowest P99 on Llama, Qwen, and DeepSeek across the named vendors, with mature continuous batching, speculative decoding, and per-prefix caching. First-class multi-LoRA serving means fine-tunes run at near-base-model economics. Tax: catalogue is curated rather than exhaustive; bring-your-own-model is supported but not the focus.
- Together AI. The broadest OSS catalogue in the category and the only vendor that combines compute, fine-tuning, and dedicated inference under one contract. Right when the lifecycle should not be split across vendors. Tax: not always the lowest P99 on the most-optimised models, where Fireworks pulls ahead.
Recommended for bring-your-own-model serving
- Modal Labs. The container-native pick. Any model that runs in Python or a custom container, with developer ergonomics that match how research teams already work. Strong on scientific compute and non-LLM workloads (image, audio, video). Tax: less opinionated than Baseten on serving framework; some teams want more scaffolding.
- Baseten. The framework-led pick. Truss is opinionated about how a model should be packaged, which translates into faster paths to production and stronger SLAs once it is there. Strongest enterprise-friendly commercial posture in the serverless-GPU lane. Tax: framework-led means the team has to learn Truss; for ad-hoc research workloads it can feel heavy.
- Replicate. The discovery and prototyping pick. The catalogue is wide, the pay-per-second model is genuinely friction-free, and the Cog runtime is straightforward to package against. Right for early-stage product work where the model lineup is still in flux. Tax: harder to push to production-grade SLAs than Modal or Baseten without moving to dedicated endpoints.
- RunPod. The cost-optimised pick in the serverless-GPU lane. AMD MI300X alongside NVIDIA, fast cold-starts, and per-second pricing that ends up cheapest on suitable workloads. Tax: less mature observability and SLAs than Modal or Baseten; right for cost-sensitive deployments and research, not enterprise-grade production.
Recommended for multi-model routing
- OpenRouter. The default aggregator. Unified API across 200+ models (OSS and closed), price-aware routing, and a markup that is genuinely modest. Right for any product that wants to switch models without rewriting integration code, or that benefits from price-routing across providers. Tax: pass-through pricing means the underlying-provider cache profile matters; some optimisations are not portable through the aggregator.
- Vercel AI Gateway. The Vercel-native pick. Edge-cached routing, observability across providers, and integration with the broader Vercel platform. Right for teams already on Vercel where the gateway becomes a natural extension of the deployment surface. Tax: tightly coupled to the Vercel ecosystem; less of a fit outside it.
Recommended for distributed and enterprise
- Anyscale. The Ray-native pick. Right when the inference workload involves complex multi-model graphs, heavy preprocessing, or shared state that Ray handles well. Distributed-inference framework rather than a managed endpoint product. Tax: sales-led commercial posture and a steeper learning curve than the managed serverless platforms; the right tool mainly when the team already runs on Ray.
- Hyperscalers (Bedrock, Azure OpenAI, Vertex). The procurement-friendly default. Bedrock surfaces Claude, Cohere, and Llama; Azure OpenAI is the canonical path to GPT-4-class models under an enterprise contract; Vertex covers Gemini and a curated catalogue. Right when existing master agreements, compliance perimeter, or closed-model access require it. Tax: per-token list price runs higher than the optimised OSS hosts; the gap closes once enterprise discounts and integration savings are loaded in.
The six-step procurement playbook
What separates working inference procurement from the version most teams settle for.
- Profile the workload before the first sales call. Token shape (input length, output length, ratio), QPS profile, P99 target, expected fine-tune count, caching opportunity. Without these, every vendor benchmarks against a workload that flatters their stack.
- Shortlist three platforms per lane. Three forces real differentiation and preserves negotiating leverage. Vendors in different lanes are not really competing; comparing one optimised OSS host against one serverless GPU platform against one hyperscaler is a category error.
- Benchmark on production-shaped traffic. Run a paid pilot with realistic prompts, realistic concurrency, and realistic cache opportunity. The metrics that matter are P50 and P99 latency, sustained throughput, effective per-call cost net of caching, and tail behaviour under load. Sales decks are not predictive.
- Validate the caching and LoRA story. Ask for the prompt-caching configuration, the TTL behaviour, the cross-region cache posture, and the LoRA hot-swap throughput. Vendors who answer cleanly have mature serving infrastructure. Vendors who deflect are usually serving fine-tunes on dedicated endpoints under the hood.
- Negotiate around effective cost, not list price. Per-token rate, cache hit-rate behaviour, fine-tune pricing, dedicated-endpoint floor, reservation discounts, and the egress and routing overhead. The headline rate explains under half of real spend.
- Build the multi-platform strategy explicitly. An optimised OSS host or serverless GPU platform for the bulk of inference; an aggregator for routing flexibility and closed-model access; a hyperscaler relationship for the parts of the product that need it. Document which workload lives where and why.
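For the benchmarking step, the P50 and P99 figures can be computed directly from pilot latency samples rather than taken from a dashboard. A minimal sketch using the nearest-rank method; the function name and the sample latencies are made up for illustration.

```python
from typing import Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer form of ceil(p * n / 100), which avoids float rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Illustrative pilot run: 98 fast calls and 2 slow ones.
latencies_ms = [200.0] * 98 + [3000.0] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50 {p50:.0f} ms, P99 {p99:.0f} ms")
```

In this made-up run the median hides the two slow calls entirely, which is the pattern the tail-latency takeaway describes.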
When to combine platforms
Production AI products converge on multi-platform strategies. The combinations that work in practice:
- Fireworks or Together for the bulk + OpenRouter for routing and closed-model access. The optimised-OSS-plus-aggregator pattern. Maximum performance on the OSS models that dominate inference spend, with a routing layer that handles closed-model calls and price-aware fallback.
- Modal or Baseten for custom models + Fireworks for LLM inference. The bring-your-own-plus-optimised pattern. Right when the product runs both custom models (image, audio, custom-architecture) and standard LLM inference; trying to force both onto a single platform usually compromises one of them.
- Anyscale for the inference graph + Fireworks or Together as the LLM engine inside the graph. The framework-plus-engine pattern. Right for complex pipelines where Ray's distributed primitives matter and LLM inference is one node in a larger graph.
- Hyperscaler for the regulated surface + an optimised OSS host for the rest. The compliance-plus-economics pattern. Closed-model calls and regulated workloads on Bedrock or Azure OpenAI; bulk OSS inference on Fireworks or Together where the economics are 3 to 10x better.
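One way to document which workload lives where is to encode the placement rule next to the workload list. The sketch below mirrors the combinations above; the `Workload` fields, the lane labels, and the precedence order (regulated first, then closed-model, then custom architecture) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Workload:
    name: str
    regulated: bool = False
    needs_closed_model: bool = False
    custom_architecture: bool = False


def place(workload: Workload) -> str:
    """Map a workload to a lane; earlier checks take precedence (assumed ordering)."""
    if workload.regulated:
        return "hyperscaler"
    if workload.needs_closed_model:
        return "aggregator"
    if workload.custom_architecture:
        return "serverless-gpu"
    return "optimised-oss-host"


workloads: List[Workload] = [
    Workload("support-chat"),
    Workload("claims-triage", regulated=True),
    Workload("image-upscaler", custom_architecture=True),
    Workload("fallback-reasoning", needs_closed_model=True),
]
placement: Dict[str, str] = {w.name: place(w) for w in workloads}
print(placement)
```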
Frequently asked questions
What is an AI inference platform?
A service that serves AI model predictions over an API. The 2026 category covers five lanes: optimised OSS hosts that deliver maximum speed on open-weights models (Fireworks, Together); serverless GPU platforms that host customer-supplied models (Modal, Baseten, Replicate, RunPod); aggregators and routers that abstract over multiple providers (OpenRouter, Vercel AI Gateway); distributed-inference frameworks for complex pipelines (Anyscale); and hyperscaler-managed services that ship under enterprise contracts (Bedrock, Azure OpenAI, Vertex).
How do Fireworks AI and Together AI compare?
Both are optimised OSS hosts. Fireworks is the speed leader, with FireOptimizer and FireAttention delivering the lowest P99 on Llama, Qwen, and DeepSeek across the category, plus mature continuous batching and speculative decoding. Together is broader, with a 200-plus model catalogue and a full-stack story (compute, fine-tuning, dedicated inference from one vendor). If latency is the binding constraint and the model lineup is small and stable, Fireworks. If catalogue breadth matters or the lifecycle is consolidated under one vendor, Together.
When does a serverless GPU platform beat an optimised OSS host?
When the model is not on the optimised catalogue, or when the workload is not pure LLM inference. Modal, Baseten, Replicate, and RunPod accept any model packaged in a container, which is the right shape for image, audio, video, and custom-architecture workloads where the optimised OSS hosts are not the focus. The tradeoff is per-token economics: the optimised hosts have invested heavily in inference-specific optimisations that a generic serverless platform does not match on like-for-like LLM workloads.
Is OpenRouter cheaper than going direct?
Roughly the same in most cases. OpenRouter passes through underlying provider pricing with a modest markup, and price-aware routing can save money on workloads where the cheapest model that meets quality varies request-to-request. The savings come from routing flexibility, not from arbitrage. The real value is the unified API surface and the ability to switch models without rewriting integration code, which is increasingly important as multi-model strategies become standard.
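Price-aware routing reduces to picking the cheapest candidate that clears a quality bar for the request. This is a sketch of the idea only, not OpenRouter's routing logic; the model names, prices, and quality scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    price_per_mtok: float  # dollars per million tokens (made-up values below)
    quality: float  # score from your own evaluation set, 0 to 1


def cheapest_meeting_quality(candidates: Iterable[Candidate], min_quality: float) -> Candidate:
    """Return the lowest-priced candidate whose quality meets the threshold."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        raise ValueError("no candidate meets the quality threshold")
    return min(eligible, key=lambda c: c.price_per_mtok)


pool = [
    Candidate("small-oss-model", 0.20, 0.70),
    Candidate("large-oss-model", 0.90, 0.85),
    Candidate("closed-frontier-model", 5.00, 0.95),
]
easy = cheapest_meeting_quality(pool, min_quality=0.65)
hard = cheapest_meeting_quality(pool, min_quality=0.90)
print(easy.name, hard.name)
```

The saving comes from the easy requests landing on the cheap model; requests that need the top tier still pay for it.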
What is LoRA hot-swapping and why does it matter?
Low-Rank Adaptation (LoRA) fine-tunes are small adapter weights that modify a base model's behaviour without retraining the full model. Hot-swapping means the serving stack can route requests to different LoRA adapters on top of the same base-model GPU memory, so many fine-tunes can be served at near-base-model economics rather than at dedicated-endpoint prices. Fireworks, Together, and Anyscale treat this as first-class. Vendors that lack it force every fine-tune onto a dedicated endpoint, which can multiply cost by 5 to 20x for the same throughput. For any product that serves multiple customer-specific or use-case-specific fine-tunes, LoRA hot-swapping is the difference between viable and unviable economics.
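The reason adapters are cheap to keep resident is their size. LoRA replaces an update to a d-by-k weight matrix with the product of a d-by-r and an r-by-k matrix, so the adapter holds r × (d + k) parameters instead of d × k. A minimal sketch of that arithmetic; the 4096-wide layer and rank 16 are illustrative choices, not a specific model's configuration.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full_matrix_params, adapter_params) for one d-by-k weight matrix."""
    if min(d, k, r) <= 0:
        raise ValueError("dimensions and rank must be positive")
    full = d * k
    adapter = r * (d + k)  # B is d-by-r, A is r-by-k
    return full, adapter


full, adapter = lora_param_counts(d=4096, k=4096, r=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {full // adapter}x smaller")
```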
How does prompt caching change effective cost?
For workloads with substantial repeated context (long system prompts, RAG with stable retrievals, multi-turn agentic chains), prompt caching cuts effective per-call cost by 50 to 90 percent. The mechanism is straightforward: the platform stores the KV cache for the prefix portion of the prompt and skips the compute on subsequent calls that share the prefix. Fireworks, Together, Baseten, and the hyperscalers ship mature implementations with configurable TTLs and per-prefix granularity. Modal, RunPod, and Replicate leave caching mostly to the application layer. For any agentic workload, evaluate caching maturity before locking in a platform; it is the largest single cost lever available.
Should I self-host with vLLM or SGLang instead?
Sometimes. Self-hosting on rented GPUs (from a neocloud like CoreWeave or Lambda) with vLLM or SGLang is competitive on cost above roughly 200 to 500 RPS of sustained traffic, depending on the model and the topology. Below that, the per-token economics on optimised OSS hosts beat self-hosting once the operational overhead is loaded in. Above it, self-hosting wins on cost and on the ability to tune the serving stack to the workload. The cross-over point is moving downward as serving frameworks mature, so re-evaluate annually. Note that self-hosting still requires a compute relationship; see the companion compute and neocloud guide.
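The cross-over can be estimated with a simple monthly cost model. Every input below is a placeholder to be replaced with measured values: the per-token price, tokens per request, per-GPU throughput, GPU hourly rate, and the fixed operational overhead are all hypothetical, and the model ignores caching, burst headroom, and reservation discounts.

```python
from typing import Optional

SECONDS_PER_MONTH = 30 * 24 * 3600
HOURS_PER_MONTH = 30 * 24


def hosted_monthly(rps: int, tokens_per_request: int, price_per_mtok: float) -> float:
    """Monthly spend on a per-token host at a sustained request rate."""
    return rps * tokens_per_request * SECONDS_PER_MONTH * price_per_mtok / 1e6


def self_host_monthly(rps: int, rps_per_gpu: int, gpu_hourly: float, ops_overhead: float) -> float:
    """Monthly spend on rented GPUs plus a fixed operations cost."""
    gpus = max(1, -(-rps // rps_per_gpu))  # ceiling division
    return gpus * gpu_hourly * HOURS_PER_MONTH + ops_overhead


def break_even_rps(max_rps: int = 2000, **kw: float) -> Optional[int]:
    """Lowest sustained RPS at which self-hosting is no more expensive, if any."""
    for rps in range(1, max_rps + 1):
        hosted = hosted_monthly(rps, int(kw["tokens_per_request"]), kw["price_per_mtok"])
        owned = self_host_monthly(rps, int(kw["rps_per_gpu"]), kw["gpu_hourly"], kw["ops_overhead"])
        if owned <= hosted:
            return rps
    return None


# Placeholder inputs for illustration only.
crossover = break_even_rps(
    tokens_per_request=1000,
    price_per_mtok=0.20,
    rps_per_gpu=25,
    gpu_hourly=2.50,
    ops_overhead=100_000.0,
)
print(crossover)
```

The fixed overhead term dominates where the cross-over lands, so it deserves the most scrutiny when filling in real numbers.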
How long is the procurement cycle?
Self-serve, on-demand access takes minutes on Fireworks, Together, Modal, Baseten, Replicate, RunPod, OpenRouter, and Vercel AI Gateway. Enterprise contracts with any of those vendors take 2 to 8 weeks. Anyscale and the hyperscalers take 6 to 16 weeks for enterprise-grade commitments. For most teams, the right path is to start self-serve on the appropriate lane, validate throughput and latency on real traffic, and convert to a committed contract once the workload is sized.