Oracle Cloud GPU Pricing in 2026: H100 vs H200 vs B200 Per-Hour Cost

Oracle Cloud prices H100, H200, and B200 GPUs at different per-hour rates, but the cheapest choice depends on your model size and utilization, not on which chip is newest.

Jun 23, 2026 · 3 min read

Key takeaways

OCI lists distinct per-hour rates for H100, H200, and B200, rising with each generation.
The newest GPU is not automatically the cheapest per token for your workload.
Cost per token = effective hourly rate / (tokens per second x 3,600 x utilization), with utilization expressed as a fraction between 0 and 1.
Memory-bound large models often justify H200 or B200; smaller models can be cheapest on H100.
Always compare OCI quotes against other clouds on a normalized, same-commitment basis.

How is Oracle Cloud GPU pricing structured?

OCI prices each accelerator generation separately, and the per-hour rate generally climbs from H100 to H200 to B200 as memory and throughput increase. On top of the chip, your effective rate depends on region, instance shape, and whether you buy on-demand or commit to capacity. The headline number is a starting point, not the cost that hits your margin.

Is the newest GPU always the cheapest per token?

No, and this is the part teams get wrong. A B200 costs more per hour than an H100, but it can also produce far more tokens per hour. Whether it is cheaper per token depends entirely on your model and utilization.

The deciding metric: cost per token equals the effective hourly rate divided by (tokens per second x 3,600 x utilization). The 3,600 converts tokens per second into tokens per hour, and utilization is the fraction of each paid hour the GPU spends serving real traffic. Run it for each generation on your actual model. For a large, memory-hungry model, the newer chip's throughput can win on cost per token despite the higher hourly rate. For a small model that already runs comfortably, an older, cheaper-per-hour H100 can be the lowest cost per token.

Pick the GPU generation by model fit, not by recency. The right question is which chip is cheapest per token for this model, not which chip is newest.
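To make the formula in this section concrete, here is a minimal Python sketch. The function names and the `older_gpu` / `newer_gpu` labels are hypothetical, and every rate and throughput figure is a placeholder chosen for illustration, not an OCI price or a measured benchmark. Substitute your own quotes and your own measured tokens per second.

```python
def cost_per_token(hourly_rate: float, tokens_per_second: float, utilization: float) -> float:
    """Effective hourly rate / (tokens per second x 3,600 x utilization)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be a fraction in (0, 1]")
    useful_tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / useful_tokens_per_hour


def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float, utilization: float) -> float:
    """Same metric, scaled to the unit most price sheets use."""
    return cost_per_token(hourly_rate, tokens_per_second, utilization) * 1_000_000


def cheapest(candidates: dict, utilization: float) -> str:
    """Return the name of the candidate with the lowest cost per token."""
    return min(
        candidates,
        key=lambda name: cost_per_token(*candidates[name], utilization),
    )


# (hourly_rate_usd, tokens_per_second): PLACEHOLDER values, not real prices.
# Large model: the newer chip is assumed to deliver 2.5x the throughput.
large_model = {"older_gpu": (2.0, 1000.0), "newer_gpu": (4.0, 2500.0)}
# Small model: the newer chip is assumed to deliver only 1.5x the throughput.
small_model = {"older_gpu": (2.0, 1000.0), "newer_gpu": (4.0, 1500.0)}

for label, candidates in (("large model", large_model), ("small model", small_model)):
    print(label, "->", cheapest(candidates, utilization=0.5))
```

With these made-up inputs the newer chip wins for the large model and the older chip wins for the small one, even though the hourly rates are identical in both cases. Only the throughput ratio changed.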

How do you compare OCI against other clouds fairly?

Normalize before you compare: hold GPU count, commitment level (on-demand vs reserved), and region constant, then convert each quote to cost per million tokens at the throughput you actually expect. A provider with a higher hourly rate but better availability or networking can still come out ahead once utilization is factored in. Comparing raw hourly stickers across providers is how teams pick the wrong option.
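One way to enforce that normalization is to refuse to rank quotes that are not on the same basis. The sketch below assumes a hypothetical `Quote` record and `rank_quotes` helper. The provider names, rates, throughput, and utilization values are placeholders, and `region` is assumed to be a label you have already mapped to a common name across providers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    provider: str
    gpu_count: int
    commitment: str           # e.g. "on-demand" or "reserved"
    region: str               # normalized label, e.g. "eu-west"
    hourly_rate: float        # total for all GPUs in the quote
    tokens_per_second: float  # expected aggregate throughput
    utilization: float        # expected fraction of paid time serving traffic

    def cost_per_million_tokens(self) -> float:
        tokens_per_hour = self.tokens_per_second * 3600 * self.utilization
        return self.hourly_rate / tokens_per_hour * 1_000_000


def rank_quotes(quotes: list) -> list:
    """Sort quotes by cost per million tokens, cheapest first.

    Raises ValueError if the quotes differ in GPU count, commitment,
    or region, because the comparison would not be like for like.
    """
    basis = {(q.gpu_count, q.commitment, q.region) for q in quotes}
    if len(basis) > 1:
        raise ValueError(f"quotes are not on the same basis: {sorted(basis)}")
    return sorted(quotes, key=lambda q: q.cost_per_million_tokens())


# PLACEHOLDER quotes: provider_y has the higher sticker price but is
# assumed to sustain higher utilization.
quotes = [
    Quote("provider_x", 8, "on-demand", "eu-west", 16.0, 8000.0, 0.5),
    Quote("provider_y", 8, "on-demand", "eu-west", 18.0, 8000.0, 0.7),
]

for q in rank_quotes(quotes):
    print(f"{q.provider}: ${q.cost_per_million_tokens():.3f} per million tokens")
```

In this illustration the quote with the higher hourly rate ranks first, because the assumed utilization gap outweighs the price gap.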

When does committing to OCI capacity pay off?

Committing pays off when your inference load is steady and predictable enough to keep the GPUs busy. Reserved capacity lowers the effective hourly rate versus on-demand, but you pay for reserved hours whether or not the GPUs are working, so the discount only helps if utilization is high. For bursty or experimental workloads, on-demand or a per-token API usually wins on flexibility and cost.
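A rough way to find the tipping point is a break-even utilization. The sketch below rests on a simplifying assumption that is not from any OCI contract: on-demand is billed only for the hours you actually run, while reserved capacity is billed for every hour of the month. The function names and the two rates are hypothetical placeholders.

```python
def break_even_utilization(on_demand_rate: float, reserved_rate: float) -> float:
    """Fraction of hours you must need the GPU for reserved to cost the same.

    Assumes on-demand is paid only for busy hours and reserved for all hours,
    so break-even is where on_demand_rate * u == reserved_rate.
    """
    if on_demand_rate <= 0 or reserved_rate <= 0:
        raise ValueError("rates must be positive")
    return reserved_rate / on_demand_rate


def monthly_cost(on_demand_rate: float, reserved_rate: float,
                 utilization: float, hours_in_month: float = 730.0) -> dict:
    """Monthly bill under each option for a given fraction of busy hours."""
    busy_hours = hours_in_month * utilization
    return {
        "on_demand": on_demand_rate * busy_hours,
        "reserved": reserved_rate * hours_in_month,
    }


def cheaper_option(on_demand_rate: float, reserved_rate: float, utilization: float) -> str:
    costs = monthly_cost(on_demand_rate, reserved_rate, utilization)
    return min(costs, key=costs.get)


# PLACEHOLDER rates: $10/hour on-demand vs $6/hour reserved.
print(break_even_utilization(10.0, 6.0))  # share of hours needed to break even
print(cheaper_option(10.0, 6.0, utilization=0.8))
print(cheaper_option(10.0, 6.0, utilization=0.3))
```

If you cannot actually shut on-demand instances down when idle, the assumption breaks and the break-even point moves in favor of reserving.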

The takeaway: rank OCI's H100, H200, and B200 by cost per token for your specific model, not by hourly sticker or chip generation. You can model GPU-hour rates, throughput, and per-token cost side by side in Calcaas.

Frequently asked questions

How much do H100, H200, and B200 cost per hour on Oracle Cloud?

OCI prices each generation separately, with per-hour rates generally rising from H100 to H200 to B200, and varying by region and commitment. Treat the published rate as a starting point and convert it to cost per token before deciding.

Is a B200 cheaper than an H100 for inference?

Per hour, no; per token, it can be. The B200 produces more tokens per hour, so for large models it can win on cost per token despite the higher rate. For small models, a cheaper-per-hour H100 is often the lowest cost per token.

How do I compare Oracle Cloud GPU pricing to other providers?

Normalize the comparison: same GPU count, commitment, and region, then convert each quote to cost per million tokens at your expected throughput. Comparing raw hourly rates across clouds is misleading.

Should I reserve OCI GPU capacity or use on-demand?

Reserve when your inference load is steady and keeps the GPUs busy, since reserved capacity lowers the effective hourly rate. For bursty or experimental workloads, on-demand or a per-token API is usually more cost-effective.
