AI Compute and Neocloud Providers 2026: Vendor Comparison

Ten AI compute providers compared on chip access, power footprint, contract terms, and procurement fit: CoreWeave, Crusoe, Lambda, Nebius, Together, Fluidstack, Voltage Park, RunPod, Vast.ai, and a hyperscaler baseline.


Compute is where most of the money in an AI programme actually goes, but the procurement model is closer to building a power plant than to buying SaaS. The ten providers here split across five lanes, their headline prices are not directly comparable, and the binding constraint (energy) has nothing to do with GPUs. This is the buyer-side guide: how the lanes split, which vendor wins which lane, and the questions to ask before signing a reserved-capacity contract.

Key takeaways

  • Five lanes — Hyperscaler-managed (AWS, Azure, GCP) for procurement-friendly enterprise; tier-one neocloud (CoreWeave, Crusoe, Lambda, Nebius) for scale and price; full-stack compute-plus-inference (Together) for one-vendor pipelines; spot and marketplace (RunPod, Vast.ai) for research; specialty (Voltage Park, Fluidstack) for niche fit.
  • What changes in 2026 — Energy and power-purchase agreements are the real constraint. CoreWeave's IPO and multi-year Microsoft anchor contract proved the model; the next eighteen months are a land grab for B200 and GB200 supply.
  • What buyers underweight — Networking topology and storage tier. InfiniBand fabric quality, NVLink domain size, and parallel-filesystem throughput often matter more than per-GPU hourly rate on training runs above 256 GPUs.
  • What buyers overweight — Hourly rate. Real cost includes egress, storage, networking, idle premiums on reserved capacity, and the procurement-cycle overhead. Sticker shock between hyperscaler and neocloud narrows by 30 to 50 percent once those are loaded in.

  • 10 providers in this comparison
  • 5 distinct lanes (hyperscaler, tier-one neocloud, full-stack, spot, specialty)
  • ~$50B estimated 2025 neocloud capex across the named operators (company disclosures and industry estimates)
  • 2–3x hyperscaler list price for an H100 GPU-hour relative to the tier-one neocloud rate

The five lanes

The lane split is the first cut. Pick the lane and two-thirds of the comparison work disappears.

1. Hyperscaler-managed

AWS, Azure, and GCP. The procurement-friendly default if you already have a master agreement, a sovereign-region requirement, or a compliance perimeter that has to include compute. List price for current-generation GPUs runs 2 to 3x the tier-one neocloud rate. Enterprise discounts, Savings Plans, and integration savings close part of that gap. Pick this lane when the chip-level price-performance is not the only criterion and procurement reality dominates the decision.

2. Tier-one neocloud

CoreWeave, Crusoe, Lambda Labs, Nebius. Purpose-built GPU fleets, non-blocking InfiniBand fabric, parallel filesystems, and a commercial posture that puts reserved capacity ahead of on-demand. The category emerged to serve the frontier labs and was validated when CoreWeave's shares held up after its IPO. Any training programme above a few hundred GPUs with a procurement team and a real reservation budget belongs here.

3. Full-stack compute plus inference

Together AI sits on its own. The unusual shape is compute plus training plus fine-tuning plus dedicated inference under one vendor, with the broadest OSS-model coverage in the category. Use it when the team wants one relationship across the lifecycle instead of three or four separate ones.

4. Spot and marketplace

RunPod and Vast.ai. Per-second billing, marketplace price discovery, no minimum commitments, and the cheapest hourly rate available. Fine for research, ablations, and bursty inference. A bad idea for any training run that has to finish on a deadline. Treat them as the bottom rung of a tiered strategy, not the layer your production work depends on.

5. Specialty

Fluidstack and Voltage Park. Smaller footprints, narrower fit, real differentiation in their lanes. Fluidstack for mid-scale European deployments where geography matters and the tier-ones are sized too large. Voltage Park for research and non-profit training capacity allocated through a grant model rather than a commercial cycle.

Energy, not silicon, is the constraint

The 2026 story is power, not chips. A modern training campus draws 50 to 200 MW at full utilisation; a single GB200 NVL72 rack pulls around 120kW. Vendors with multi-year power-purchase agreements, behind-the-meter generation, or pre-cleared utility interconnects have inventory they can actually sell. Vendors without those are stuck waiting on interconnect studies that run 18 to 36 months.

That is why Crusoe scores so well on the energy axis (greenfield siting on stranded power, early-footprint flared-gas deployments) and why CoreWeave's pre-2024 PPA position translated directly into the anchor-tenant contracts that funded the IPO. Some of the late-2024 entrants will struggle to deliver on 2026 commitments for the same reason: chips can be on a truck inside a quarter, but a substation cannot.
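A back-of-envelope sketch makes the power constraint concrete. It converts a campus power budget into GB200 NVL72 racks using the ~120kW per-rack figure above. The PUE of 1.2 and the choice to spend every remaining kilowatt on GPU racks are simplifying assumptions, not vendor figures; real sites also reserve power for networking, storage, and redundancy.

```python
# Back-of-envelope sketch: NVL72 racks a campus power budget can host.
# Assumptions (not vendor figures): ~120 kW IT draw per rack (from the text),
# PUE of 1.2, and all remaining IT power spent on GPU racks.

GPUS_PER_NVL72_RACK = 72    # 72 Blackwell GPUs per GB200 NVL72 rack
RACK_IT_POWER_KW = 120.0    # approximate per-rack draw quoted in the text


def racks_for_campus(campus_mw: float, pue: float = 1.2) -> int:
    """Return how many whole NVL72 racks fit in a facility power budget.

    PUE = total facility power / IT power, so dividing the campus
    budget by PUE gives the power left for IT equipment.
    """
    it_power_kw = campus_mw * 1000.0 / pue
    return int(it_power_kw // RACK_IT_POWER_KW)


for mw in (50, 100, 200):
    racks = racks_for_campus(mw)
    gpus = racks * GPUS_PER_NVL72_RACK
    print(f"{mw} MW campus -> {racks} racks, {gpus} GPUs")
```

At the assumed PUE of 1.2, a 100 MW campus powers about 694 racks, just under 50,000 GPUs, and every megawatt that cannot be delivered removes roughly seven racks.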

Chip generations: what to ask for and when

Four generations matter in 2026: H100, H200, B200, and GB200. The procurement criteria differ across them.

H100 is commodity-priced and broadly available. Fine for most production training and inference where the model fits in 80GB of HBM and the throughput is enough. The hourly rate has fallen from $8 in early 2024 to under $2 across the tier-one neoclouds in early 2026, and is still falling.

H200 raises HBM capacity to 141GB of HBM3e (the H100 carries 80GB), which matters for larger models and for inference batch sizes that push memory. Available across most of the named providers; still rationed at the largest reserved commitments.
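Whether a model "fits in 80GB" can be estimated quickly for the weights alone. This sketch assumes dense weights at 2 bytes per parameter for BF16 and 1 byte for FP8, and holds back an assumed 20 percent of HBM for KV cache, activations, and runtime overhead. The real headroom depends on context length, batch size, and serving stack, so treat the result as a first filter rather than a capacity plan.

```python
# First-pass check: do a dense model's weights fit on one GPU?
# Assumptions: weights only, 2 bytes/param (BF16) or 1 byte/param (FP8),
# and a placeholder 20% of HBM held back for KV cache and runtime overhead.

HBM_GB = {"H100": 80, "H200": 141}
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weights_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * BYTES_PER_PARAM[dtype]


def fits_on_one_gpu(params_billion: float, dtype: str, gpu: str,
                    headroom: float = 0.2) -> bool:
    """True if the weights fit in the HBM left after reserving `headroom`."""
    usable_gb = HBM_GB[gpu] * (1 - headroom)
    return weights_gb(params_billion, dtype) <= usable_gb


for params, dtype in [(13, "bf16"), (34, "bf16"), (70, "fp8"), (70, "bf16")]:
    verdict = {gpu: fits_on_one_gpu(params, dtype, gpu) for gpu in HBM_GB}
    print(f"{params}B {dtype}: {weights_gb(params, dtype):.0f} GB -> {verdict}")
```

Under these assumptions a 70B-parameter model in FP8 (about 70GB of weights) fits on a single H200 but not on a single H100, while the same model in BF16 (about 140GB) needs more than one GPU of either kind.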

B200 is the Blackwell-architecture successor with significantly higher training throughput and inference efficiency. Supply through 2026 is anchored to a handful of customers, with CoreWeave taking early allocations and the hyperscalers following on their own ramps (Azure ND GB200 v6, AWS P6, GCP A4).

GB200 NVL72 is the rack-scale product (72 Blackwell GPUs and 36 Grace CPUs in a single rack, joined in one NVLink domain) and is the system behind the largest 2026 training builds. Reserved capacity only, multi-quarter lead times, and a procurement profile that looks more like buying a data centre than like buying compute.

Scored comparison

The scoring rubric covers twelve axes across ten providers: primary lane, strongest capability, chip availability (H100, H200, B200/GB200, and AMD), network fabric, parallel filesystem options, multi-region footprint, pricing model, minimum engagement, and compliance footprint.

Lane and positioning

Primary lane
  • CoreWeave: Tier-one neocloud; frontier-lab scale
  • Crusoe: Tier-one neocloud; energy-integrated
  • Lambda Labs: Tier-one neocloud; developer-first
  • Nebius: Tier-one neocloud; geographic-diversified
  • Together AI: Full-stack; compute plus training plus inference
  • Fluidstack: Specialty; mid-scale neocloud
  • Voltage Park: Specialty; mission-driven, grant-allocated training capacity
  • RunPod: Spot and marketplace; serverless GPU
  • Vast.ai: Pure marketplace; lowest-cost spot
  • Hyperscalers (AWS / Azure / GCP): Hyperscaler-managed; enterprise procurement

Strongest at
  • CoreWeave: Frontier training runs; large reserved contracts
  • Crusoe: Greenfield power; flared-gas and stranded-energy siting
  • Lambda Labs: On-demand and small reserved; developer ergonomics
  • Nebius: EU + US footprint; AMS-listed governance
  • Together AI: OSS-model fine-tuning; one-vendor pipelines
  • Fluidstack: Mid-scale European deployments
  • Voltage Park: Heavily discounted training for research and non-profit
  • RunPod: Serverless inference; per-second billing
  • Vast.ai: Cheapest H100 hour on the market; research workloads
  • Hyperscalers (AWS / Azure / GCP): Enterprise compliance, integrated storage, sovereign regions

Chip access and inventory

H100 availability (May 2026)
  • CoreWeave: Abundant; reserved and on-demand
  • Crusoe: Abundant; reserved
  • Lambda Labs: Abundant; on-demand and reserved
  • Nebius: Abundant; multi-region
  • Together AI: Abundant; pooled across underlying vendors
  • Fluidstack: Available; on-demand and reserved
  • Voltage Park: Available; allocated by research grant model
  • RunPod: Available; spot and on-demand
  • Vast.ai: Cheapest market price; mixed inventory quality
  • Hyperscalers (AWS / Azure / GCP): Available; on-demand and reserved instances

H200 availability
  • CoreWeave: Available; allocations to anchor tenants
  • Crusoe: Available; ramping
  • Lambda Labs: Limited; waitlist for non-reserved
  • Nebius: Available; multi-region ramp
  • Together AI: Limited; via underlying partners
  • Fluidstack: Limited
  • Voltage Park: Not the focus
  • RunPod: Limited spot pools
  • Vast.ai: Some marketplace listings
  • Hyperscalers (AWS / Azure / GCP): GA on Azure ND H200 v5; AWS P5e; GCP A3 Ultra

B200 / GB200 availability
  • CoreWeave: Early allocations; anchor-tenant priority
  • Crusoe: Coming online late 2026
  • Lambda Labs: Reservations open; allocation deferred
  • Nebius: Reservations open
  • Together AI: Not yet
  • Fluidstack: Not yet
  • Voltage Park: Not yet
  • RunPod: Not yet
  • Vast.ai: Not yet
  • Hyperscalers (AWS / Azure / GCP): Azure ND GB200 v6; AWS P6; GCP A4 ramp through 2026

AMD MI300X / MI325X availability
  • CoreWeave: Limited
  • Crusoe: Limited
  • Lambda Labs: Available
  • Nebius: Limited
  • Together AI: Via partner network
  • Fluidstack: Limited
  • Voltage Park: No
  • RunPod: MI300X
  • Vast.ai: Sparse listings
  • Hyperscalers (AWS / Azure / GCP): Azure ND MI300X v5; GCP HPC pools

Network and storage

Network fabric
  • CoreWeave: Full non-blocking 3.2Tbps NDR InfiniBand
  • Crusoe: Full non-blocking NDR InfiniBand
  • Lambda Labs: Non-blocking NDR InfiniBand on reserved clusters
  • Nebius: Full non-blocking NDR InfiniBand
  • Together AI: Inherits from underlying partners
  • Fluidstack: Non-blocking NDR InfiniBand
  • Voltage Park: Non-blocking NDR InfiniBand
  • RunPod: Per-pod; not cross-pod
  • Vast.ai: Ethernet only
  • Hyperscalers (AWS / Azure / GCP): EFA / IB / NVLink Switch by SKU

Parallel filesystem options
  • CoreWeave: VAST Data, Lustre, WEKA
  • Crusoe: VAST Data, WEKA
  • Lambda Labs: Lustre, WEKA
  • Nebius: Lustre, NVMe pools
  • Together AI: Inherits from underlying partners
  • Fluidstack: WEKA, NVMe
  • Voltage Park: Lustre
  • RunPod: S3-compatible only
  • Vast.ai: Local NVMe + S3-compatible
  • Hyperscalers (AWS / Azure / GCP): FSx for Lustre / Azure Managed Lustre / GCS

Multi-region
  • CoreWeave: US + EU primary, expanding
  • Crusoe: Multi-region US, EU coming
  • Lambda Labs: US primary
  • Nebius: EU primary; US and Israel
  • Together AI: US + EU via partners
  • Fluidstack: US + EU + UK
  • Voltage Park: US single-region
  • RunPod: Multi-region
  • Vast.ai: Global but uneven
  • Hyperscalers (AWS / Azure / GCP): Global region map; sovereign regions for EU and AU

Procurement and contracts

Pricing model
  • CoreWeave: Sales-led; reserved-capacity-first
  • Crusoe: Sales-led
  • Lambda Labs: Public on-demand + sales-led reserved
  • Nebius: Sales-led + listed on-demand
  • Together AI: Public per-token and per-GPU-hour
  • Fluidstack: Sales-led
  • Voltage Park: Grant / non-profit allocation
  • RunPod: Public per-second pricing
  • Vast.ai: Public marketplace bidding
  • Hyperscalers (AWS / Azure / GCP): Public on-demand; sales-led for reserved / Savings Plans

Minimum engagement
  • CoreWeave: Mid-market to frontier
  • Crusoe: Mid-market to enterprise
  • Lambda Labs: Self-serve to enterprise
  • Nebius: Self-serve to enterprise
  • Together AI: Self-serve to enterprise
  • Fluidstack: Mid-market
  • Voltage Park: Application-based
  • RunPod: Self-serve, no minimum
  • Vast.ai: Self-serve, no minimum
  • Hyperscalers (AWS / Azure / GCP): Self-serve to global enterprise

Compliance footprint
  • CoreWeave: SOC 2, HIPAA, ISO 27001
  • Crusoe: SOC 2, ISO 27001
  • Lambda Labs: SOC 2
  • Nebius: SOC 2, ISO 27001, EU GDPR posture
  • Together AI: SOC 2, HIPAA
  • Fluidstack: SOC 2
  • Voltage Park: Limited
  • RunPod: SOC 2
  • Vast.ai: Marketplace; per-host varies
  • Hyperscalers (AWS / Azure / GCP): FedRAMP / IL5 / sovereign; full enterprise stack
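For training runs above 256 GPUs, the network and storage rows do most of the screening work. A minimal sketch, condensing those two rows of the comparison into flags: True where a non-blocking training fabric or a parallel filesystem is listed, False where only Ethernet, per-pod fabric, or S3-style storage is listed, and None where the answer depends on the SKU or an underlying partner. The flags are a simplification of the table, not vendor statements; the None entries are the ones to take into the topology questions in the playbook.

```python
from typing import Dict, List, Optional, Tuple

# (non-blocking training fabric, parallel filesystem) condensed from the table.
# None = "depends on SKU or underlying partner": ask the vendor.
Flags = Tuple[Optional[bool], Optional[bool]]

FABRIC_AND_STORAGE: Dict[str, Flags] = {
    "CoreWeave": (True, True),
    "Crusoe": (True, True),
    "Lambda Labs": (True, True),        # non-blocking NDR on reserved clusters
    "Nebius": (True, True),
    "Together AI": (None, None),        # inherits from underlying partners
    "Fluidstack": (True, True),
    "Voltage Park": (True, True),
    "RunPod": (False, False),           # per-pod fabric; S3-compatible storage
    "Vast.ai": (False, False),          # Ethernet only; local NVMe + S3
    "Hyperscalers (AWS / Azure / GCP)": (None, True),  # fabric varies by SKU
}


def screen_for_large_training(
    table: Dict[str, Flags],
) -> Tuple[List[str], List[str], List[str]]:
    """Split providers into (confirmed, verify, excluded) for >256-GPU runs."""
    confirmed: List[str] = []
    verify: List[str] = []
    excluded: List[str] = []
    for provider, flags in table.items():
        if any(flag is False for flag in flags):
            excluded.append(provider)      # a known gap rules the provider out
        elif any(flag is None for flag in flags):
            verify.append(provider)        # unknowns need a topology answer
        else:
            confirmed.append(provider)
    return confirmed, verify, excluded


confirmed, verify, excluded = screen_for_large_training(FABRIC_AND_STORAGE)
print("confirmed:", confirmed)
print("verify:", verify)
print("excluded:", excluded)
```

With these flags six providers pass outright, Together AI and the hyperscalers land in the verify bucket, and RunPod and Vast.ai are screened out for this job class while remaining reasonable options for research work.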

The verdict by lane

Same data, organised by lane and recommendation. Most production AI programmes end up with two relationships: a tier-one neocloud for training reservations, and a hyperscaler for the parts of the pipeline that have to sit inside the enterprise compliance perimeter. A third relationship (marketplace for research) is common but optional.

Recommended for frontier training and large reserved contracts

  • CoreWeave. The default for any project at frontier-lab scale. Multi-year anchor-tenant contracts with Microsoft and OpenAI validated the model and the IPO funded the next round of build. Best-in-class InfiniBand fabric, the deepest H100 pool in the category, and early B200 allocations. Tax: reserved-capacity-first commercial posture; on-demand exists but is not where the value is.
  • Crusoe. The energy-arbitrage play. Siting strategy has favoured stranded power and behind-the-meter generation across the early footprint, and the AI-era builds layer in long-dated power contracts on top. The mix delivers lower marginal cost and a lower (better) PUE than most peers. Strong on-prem-style cluster ergonomics. Tax: smaller footprint than CoreWeave; commercial team is leaner; reserved capacity is the route in.
  • Nebius. The geographic-diversified pick. Strong EU presence with US and Israel build-outs, AMS-listed governance, and a multi-region story that matters for data-residency-bound workloads. Engineering culture inherited from Yandex's research compute. Tax: smaller US footprint than CoreWeave or Lambda; some procurement teams need extra cycles to underwrite the parent entity.

Recommended for mid-scale and developer-driven teams

  • Lambda Labs. The developer-first pick. Strong on-demand pricing, a clean dashboard, and the shortest path from sign-up to a running training job for teams working at 8 to 256 GPUs. Public catalogue, transparent pricing, and the deepest community among the named providers. Tax: US-primary footprint; reserved capacity for the very largest runs goes to the tier-one neoclouds first.
  • Together AI. The full-stack pick. Compute plus model hosting plus fine-tuning plus dedicated inference endpoints from one vendor, with the strongest OSS-model coverage. Right when the training and serving pipelines should not be split across two vendors. Tax: the compute layer is pooled across underlying partners, which means networking topology varies; ask explicitly about the cluster you will get.
  • Fluidstack. The mid-scale European option. UK and EU footprint, mid-market commercial posture, and a track record of delivering 32 to 256 GPU clusters with non-blocking InfiniBand at reasonable lead times. Tax: smaller than the tier-one neoclouds; capacity at the very high end can be paced.

Recommended for spot, research, and bursty workloads

  • RunPod. The serverless option with the cleanest developer ergonomics in this lane. Per-second billing, AMD MI300X access alongside NVIDIA, fast cold-starts, and an inference path that scales to zero. Right for research, ablations, and bursty inference where reserved capacity would sit idle. Tax: not the right shape for week-long training runs that need a stable cluster.
  • Vast.ai. The cheapest H100 hour on the market, full stop. Pure marketplace model surfaces consumer GPUs alongside datacentre listings; the price discovery is genuine and the savings are real for the right workload. Tax: per-host variance is high, networking is Ethernet-only, and the marketplace shape means production-grade SLAs are not the product.
  • Voltage Park. The mission-driven research pick. Funded by Jed McCaleb's Navigation Fund and structured to allocate heavily discounted training capacity through an application-and-grant process rather than a commercial sales cycle. Right for academic and high-impact research workloads. Tax: not a commercial procurement route; expect an application and review cycle rather than an SOW.

Recommended for enterprise procurement and compliance

  • Hyperscalers (AWS, Azure, GCP). The procurement-friendly default. Existing master agreements, sovereign-region coverage, FedRAMP / IL5 footprint, integrated storage and networking, and a single bill that covers compute alongside everything else. Right when the organisation cannot or will not stand up a second cloud-vendor relationship. Tax: list price for current-generation GPUs runs 2 to 3x the tier-one neocloud rate; some of that gap closes once enterprise discounts, Savings Plans, and integration costs are loaded in.

The six-step procurement playbook

The mechanics that separate working procurement from the deck-led version most teams settle for.

  1. Specify the workload before the first sales call. Model class and size, training horizon, batch shape, parallelism strategy, expected utilisation. Without these, every vendor anchors the conversation to a generic cluster spec and the negotiation runs on the wrong dimensions.
  2. Shortlist three providers per lane. Not five and not one. One leaves no negotiating leverage and five spreads the benchmarking effort too thin; three is enough to force real differentiation while preserving leverage on the production contract.
  3. Benchmark on a representative job. Run a short, identical, paid benchmark across the shortlist. The metrics that matter are tokens-per-second per GPU, all-reduce bandwidth at the parallelism shape you actually use, and storage-side throughput from a realistic training-data layout. Sales decks are not predictive of production behaviour.
  4. Validate the network topology. Ask for the InfiniBand topology diagram, the NVLink-domain layout, and a real-world all-reduce result at the cluster size you plan to use. Vendors who answer cleanly have mature infrastructure. Vendors who deflect inherit topology from underlying partners and cannot tell you what you will actually get.
  5. Negotiate around the right axes. Hourly rate is one. The others are storage, networking egress, idle premium on reserved capacity, ramp schedule, and the off-ramp clause for the back end of the contract. The off-ramp is the most-skipped axis and the one that hurts most when it is missing.
  6. Build the tiered strategy explicitly. A tier-one neocloud reservation for training, a hyperscaler relationship for the compliance-bound parts of the pipeline, and a marketplace or spot account for research and ablations. Document which workload goes where and why, so the next planning cycle does not relitigate the same decisions.
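Steps 3 and 4 come down to normalising a few measurements so the shortlist can be compared on the same footing. The sketch below assumes the same job has already been run on each shortlisted cluster, recording tokens per second per GPU and one all-reduce timing. It converts the timing to bus bandwidth using the convention from NVIDIA's nccl-tests (bus bandwidth = algorithm bandwidth × 2(n−1)/n for all-reduce) and turns hourly rate plus throughput into cost per million tokens. The provider labels and every number in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    provider: str                # hypothetical label
    gpu_hour_usd: float          # quoted rate per GPU-hour
    tokens_per_s_per_gpu: float  # measured on your own representative job
    allreduce_bytes: float       # message size used in the all-reduce test
    allreduce_seconds: float     # measured time for one all-reduce
    n_ranks: int                 # GPUs participating in the all-reduce


def bus_bandwidth_gb_s(r: BenchmarkResult) -> float:
    """All-reduce bus bandwidth in GB/s (nccl-tests convention).

    algbw = bytes / seconds; busbw = algbw * 2 * (n - 1) / n, which tracks
    link utilisation independently of the number of ranks.
    """
    algbw = r.allreduce_bytes / r.allreduce_seconds
    return algbw * 2 * (r.n_ranks - 1) / r.n_ranks / 1e9


def usd_per_million_tokens(r: BenchmarkResult) -> float:
    """Cost of processing one million tokens at the measured throughput."""
    tokens_per_gpu_hour = r.tokens_per_s_per_gpu * 3600
    return r.gpu_hour_usd / tokens_per_gpu_hour * 1e6


results = [
    BenchmarkResult("provider-a", 2.00, 3000, 1e9, 0.020, 256),
    BenchmarkResult("provider-b", 2.40, 4000, 1e9, 0.014, 256),
]
for r in sorted(results, key=usd_per_million_tokens):
    print(f"{r.provider}: ${usd_per_million_tokens(r):.3f} per 1M tokens, "
          f"{bus_bandwidth_gb_s(r):.0f} GB/s all-reduce bus bandwidth")
```

In the hypothetical example, provider-b charges 20 percent more per GPU-hour but delivers a third more tokens per second, so it comes out cheaper per million tokens (about $0.167 versus $0.185). That is the hourly-rate trap from the key takeaways expressed as arithmetic.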

When to combine providers

Production AI programmes converge on multi-provider strategies. The combinations that work in practice:

  • CoreWeave or Nebius for training reservations + AWS or Azure for the enterprise envelope. The frontier-plus-enterprise pattern. Reserved neocloud capacity for the training run, hyperscaler footprint for storage, identity, and the compliance-bound surfaces.
  • Lambda or Crusoe for mid-scale training + RunPod for research and bursty inference. Mid-scale-plus-spot. Right for teams running 64 to 512 GPU training jobs without a frontier-scale reservation, with research and ablation work routed to per-second billing.
  • Together AI for end-to-end OSS-model work. The one-vendor pattern. Compute plus fine-tuning plus dedicated inference endpoints, with the strongest OSS-model coverage among the named providers. Right when splitting the pipeline across vendors costs more in integration than it saves on per-GPU-hour.
  • Voltage Park for grant-allocated research + a commercial provider for the production work. Research-plus-production. Common shape in academic labs and AI-for-science programmes where part of the work qualifies for grant-allocated capacity and part of it does not.


Frequently asked questions

What is a neocloud?

A purpose-built cloud provider optimised for AI workloads, primarily large-scale GPU training and high-throughput inference. The defining traits in 2026 are NVIDIA-aligned fleets at scale, full non-blocking InfiniBand fabric, parallel filesystems for training data, and multi-year reserved-capacity contracts with anchor tenants. CoreWeave, Crusoe, Lambda Labs, and Nebius are the canonical examples. The neocloud category emerged from the gap between hyperscaler list prices and the unit economics that frontier-lab training requires, and was validated by CoreWeave's 2025 IPO.

How do CoreWeave and Lambda Labs compare?

Both are tier-one neoclouds but they serve different jobs. CoreWeave is reserved-capacity-first, with multi-year anchor-tenant contracts (Microsoft, OpenAI) shaping the commercial posture and an InfiniBand fabric built for frontier training runs. Lambda is developer-first, with public catalogue pricing, clean on-demand ergonomics, and the deepest community of practitioners running 8 to 256 GPU jobs. If the project is frontier-scale with a procurement team and a reserved budget, CoreWeave. If the project is developer-led, mid-scale, and benefits from self-serve, Lambda.

Is a neocloud cheaper than a hyperscaler?

On per-GPU-hour list price for current-generation NVIDIA chips, hyperscaler list rates run roughly 2 to 3x the neocloud rate, which implies list-price savings of about 50 to 67 percent. The real comparison loads in storage, networking egress, integration cost, procurement-cycle overhead, and the enterprise discounts that hyperscalers offer on large commitments. Net of all of those, the savings on a fully-loaded multi-year contract typically end up in the 30 to 60 percent range, but the gap is still material on any workload where compute dominates total cost.
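The narrowing from list-price savings to fully-loaded savings is easiest to see with numbers plugged in. Every figure in this sketch is hypothetical: a 3x list-price ratio picked from the 2 to 3x range above, a 30 percent enterprise discount on the hyperscaler side, and illustrative storage, egress, and integration lines on the neocloud side. Substitute real quotes; the structure is the point, not the totals.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    list_gpu_hour_usd: float   # list rate per GPU-hour
    discount: float            # negotiated discount on the rate, 0 to 1
    storage_usd: float         # storage over the contract term
    egress_usd: float          # networking egress over the term
    overhead_usd: float        # integration and procurement-cycle cost


def loaded_cost(offer: Offer, gpu_hours: float) -> float:
    """Total contract cost: discounted compute plus the non-compute lines."""
    compute = gpu_hours * offer.list_gpu_hour_usd * (1 - offer.discount)
    return compute + offer.storage_usd + offer.egress_usd + offer.overhead_usd


def saving(cheaper: float, baseline: float) -> float:
    """Fractional saving of `cheaper` relative to `baseline`."""
    return 1 - cheaper / baseline


# All inputs are hypothetical placeholders, not real quotes.
GPU_HOURS = 1_000_000
hyperscaler = Offer("hyperscaler", 6.00, 0.30, 200_000, 0, 0)  # data already in-cloud
neocloud = Offer("neocloud", 2.00, 0.00, 250_000, 250_000, 200_000)

list_saving = saving(neocloud.list_gpu_hour_usd, hyperscaler.list_gpu_hour_usd)
net_saving = saving(loaded_cost(neocloud, GPU_HOURS),
                    loaded_cost(hyperscaler, GPU_HOURS))
print(f"list-price saving: {list_saving:.0%}")
print(f"fully-loaded saving: {net_saving:.0%}")
```

With these placeholder inputs the list-price saving of about 67 percent shrinks to roughly 39 percent once the discount and the other cost lines are loaded in, which is the shape of the effect described above.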

What is the difference between an H100, an H200, a B200, and a GB200?

H100 is NVIDIA's previous-generation (Hopper) training and inference GPU, shipping in volume since late 2022 and now effectively commodity-priced across the neocloud market. H200 is a memory-upgraded H100 with 141GB of HBM3e, shipping since mid-2024, still rationed at the highest tiers. B200 is the Blackwell-architecture successor with significantly higher training throughput and inference efficiency; supply through 2026 is anchored to a handful of customers. GB200 pairs Grace CPUs with Blackwell GPUs; in its rack-scale NVL72 form (72 Blackwell GPUs and 36 Grace CPUs in one NVLink domain) it is the system behind the largest 2026 training builds. Procurement priority should match model size and training horizon: H100 for most production work, H200 for inference of larger models, B200 / GB200 reserved capacity for frontier training programmes that need it.

Why is energy the binding constraint?

The gating factor on 2026 capacity is not GPU supply, it is the megawatts to run them. A modern training campus draws 50 to 200 MW at full utilisation; a single GB200 NVL72 rack draws around 120kW. Power-purchase agreements, grid interconnects, and substation capacity are now multi-year projects that have to be in motion years before the chips arrive. Vendors who locked up power in 2023 and 2024 own 2026's capacity; vendors who tried to lock it up in late 2025 are queued behind utility interconnect studies. Energy strategy is the real moat in this market, which is why Crusoe (stranded gas, behind-the-meter renewables) and CoreWeave (long-dated PPAs) score so highly on it.

When should I use a marketplace provider like Vast.ai or RunPod?

For research, ablations, bursty inference, and any workload where the cost-per-hour matters more than the cluster-quality guarantees. Both deliver real savings on suitable workloads. Neither is the right shape for a week-long training run that has to finish on a deadline, because the underlying inventory shifts and the SLA posture does not match production training requirements. Use them as the bottom rung of a tiered compute strategy: marketplace for experimentation, tier-one neocloud reservations for training, hyperscaler for the parts of the pipeline that have to live inside the enterprise compliance perimeter.
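The tiered strategy can be written down as explicit routing rules, which also serves the playbook's advice to document which workload goes where. This is a hypothetical rule set: the three workload attributes and the order of the checks are assumptions chosen to mirror the answer above, and a real policy would add cost ceilings, data-residency rules, and capacity checks.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    compliance_bound: bool    # must stay inside the enterprise perimeter
    hard_deadline: bool       # has to finish by a fixed date
    long_training_run: bool   # multi-day job needing a stable cluster


def route(w: Workload) -> str:
    """Assign a workload to a compute tier (hypothetical rule set)."""
    if w.compliance_bound:
        return "hyperscaler"
    if w.hard_deadline or w.long_training_run:
        return "tier-one neocloud reservation"
    return "marketplace / spot"


workloads = [
    Workload("ablation sweep", False, False, False),
    Workload("pretraining run", False, True, True),
    Workload("fine-tune on regulated customer data", True, True, False),
]
for w in workloads:
    print(f"{w.name}: {route(w)}")
```

Checking compliance first is a deliberate choice: it constrains where the data may live regardless of price, so it takes precedence over any cost-driven rule.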

How do I evaluate networking topology before signing?

Ask three specific questions. First, what is the fabric (NVIDIA NDR InfiniBand, Spectrum-X Ethernet, AWS EFA, GCP Jupiter, Azure HPC Ethernet) and is it non-blocking inside a pod? Second, what is the NVLink-domain size (eight GPUs for standard HGX, seventy-two for NVL72) and how does cross-domain traffic route? Third, what is the storage-side throughput (parallel filesystem, NVMe pool, S3-compatible) and what is the peak read bandwidth from a 128-GPU job? Vendors who answer cleanly across all three are the ones with mature training infrastructure. Vendors who deflect on any of the three are usually inheriting topology from an underlying partner.
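The third question is easier to put to a vendor with a number attached. For text-model training, checkpoint save and restore are often the heaviest storage events (data-heavy workloads such as video can flip that), so this sketch estimates checkpoint size and the aggregate bandwidth needed to move it within a target window. The 14 bytes per parameter is an assumption for mixed-precision Adam (BF16 weights, FP32 master weights, two FP32 moments); the 70B model and 60-second window are hypothetical.

```python
# Sizing sketch for checkpoint traffic. Assumption: 14 bytes per parameter,
# i.e. BF16 weights (2) + FP32 master weights (4) + two FP32 Adam moments (4 + 4).
# Adjust for your optimizer, sharding, and checkpoint format.


def checkpoint_gb(params_billion: float, bytes_per_param: float = 14.0) -> float:
    """Checkpoint size in GB (1e9 bytes)."""
    return params_billion * bytes_per_param


def required_gb_per_s(size_gb: float, target_seconds: float) -> float:
    """Aggregate filesystem bandwidth needed to move size_gb in target_seconds."""
    return size_gb / target_seconds


def per_gpu_share_gb_per_s(aggregate_gb_per_s: float, n_gpus: int) -> float:
    """Average bandwidth per GPU if the I/O is spread evenly across ranks."""
    return aggregate_gb_per_s / n_gpus


size = checkpoint_gb(70)               # hypothetical 70B-parameter model
need = required_gb_per_s(size, 60)     # save or restore within 60 seconds
print(f"checkpoint ~{size:.0f} GB, needs ~{need:.1f} GB/s aggregate, "
      f"~{per_gpu_share_gb_per_s(need, 128):.2f} GB/s per GPU on 128 GPUs")
```

Under the default assumption a 70B-parameter checkpoint is roughly 980GB, and writing or restoring it in 60 seconds needs about 16GB/s of aggregate filesystem bandwidth, a concrete figure to set against the vendor's answer on peak read bandwidth.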

How long is the procurement cycle?

Self-serve on-demand is minutes for Lambda, RunPod, Vast.ai, and the hyperscaler short-form SKUs. Reserved-capacity contracts at tier-one neoclouds run 4 to 12 weeks of negotiation for a first contract, faster on renewals. Hyperscaler enterprise commitments run 6 to 16 weeks. Multi-year frontier contracts with custom power, custom networking, or custom data-centre build run 6 to 18 months. Procurement-cycle length is itself a procurement criterion; pick the lane whose cycle matches the project timeline.
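Matching cycle length to the project timeline is a simple comparison once the ranges are written down. The ranges below come from the answer above, with 6 to 18 months approximated as 26 to 78 weeks; defaulting to the slow end of each range is an assumption that favours schedule safety.

```python
from typing import List

# Negotiation-cycle ranges in weeks, from the answer above.
# 6 to 18 months is approximated as 26 to 78 weeks.
CYCLE_WEEKS = {
    "self-serve on-demand": (0, 0),  # minutes; rounded to zero weeks
    "tier-one neocloud reserved (first contract)": (4, 12),
    "hyperscaler enterprise commitment": (6, 16),
    "multi-year frontier contract": (26, 78),
}


def lanes_that_fit(weeks_available: float, worst_case: bool = True) -> List[str]:
    """Lanes whose negotiation cycle completes within weeks_available.

    worst_case=True compares against the slow end of each range;
    worst_case=False compares against the fast end.
    """
    end = 1 if worst_case else 0
    return [lane for lane, span in CYCLE_WEEKS.items()
            if span[end] <= weeks_available]


print(lanes_that_fit(10))
print(lanes_that_fit(10, worst_case=False))
```

With ten weeks until compute is needed, only self-serve on-demand fits on a worst-case basis; a tier-one reservation or a hyperscaler commitment fits only if the negotiation lands near the fast end of its range.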

