I have spent the last 20 years focusing on system-level computational performance optimization. We benchmarked the latest GPUs for AI inference, NVIDIA’s H100, H200, and B200 and AMD’s MI300X, to analyze concurrency scaling. Using the vLLM framework with the gpt-oss-20b model, we tested how these GPUs handle 1 to 512 concurrent requests. By measuring system output throughput, per-query output speed, and end-to-end latency, we share findings to help you understand GPU performance for AI workloads.
Concurrency benchmark results
System output throughput vs concurrency
This chart shows the total number of output tokens generated per second by the system at each concurrency level.
Output speed per query vs concurrency
This metric illustrates how fast an individual query is processed (in tokens per second) as the system gets busier. It is calculated based on the end-to-end latency for a 1,000-token output.
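For example, the conversion from end-to-end latency to per-query output speed is a simple ratio (a minimal sketch; the 20-second latency below is illustrative, not a measured value):

```python
def output_speed(output_tokens: int, e2e_latency_ms: float) -> float:
    """Per-query output speed in tokens/second, derived from end-to-end latency."""
    return output_tokens / (e2e_latency_ms / 1000.0)

# A 1,000-token response that completes in 20,000 ms runs at 50 tokens/second.
print(output_speed(1000, 20000))  # 50.0
```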
End-to-End latency vs concurrency
This chart displays the average time (in milliseconds) it takes to complete a request from start to finish at different concurrency levels.
Tokens per second per dollar vs. Concurrency
This chart evaluates the cost-efficiency of each GPU by measuring how many tokens are generated per second for every dollar spent on hourly rental. This metric is crucial for understanding the return on investment for each hardware option, especially for budget-conscious deployments.
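The metric itself is a straightforward ratio of throughput to rental cost; as a sketch (the throughput and hourly price below are illustrative, not the benchmark's measured values):

```python
def tokens_per_second_per_dollar(system_tps: float, hourly_rate_usd: float) -> float:
    """Cost efficiency: output tokens per second for each dollar of hourly rent."""
    return system_tps / hourly_rate_usd

# Illustrative: a GPU producing 4,000 tok/s at $4.00/hour
# yields 1,000 tok/s per dollar.
print(tokens_per_second_per_dollar(4000, 4.0))  # 1000.0
```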
Note: Pricing is based on on-demand hourly rates from the Runpod cloud platform as of March 2026. Prices are subject to change and may vary based on availability and instance type.
You can read more about our concurrency benchmark methodology.
What is concurrency?
Concurrency refers to a GPU’s ability to process multiple requests simultaneously, a key factor for AI workloads such as large language model inference. In our performance evaluation, concurrency levels represent the number of simultaneous requests (from 1 to 512) sent to the GPU during test runs. Higher concurrency tests the GPU’s capacity to manage parallel tasks without degrading performance, balancing throughput and latency.
Understanding concurrency helps users determine the right GPU for workloads with varying demand or batch processing needs. When running graphics tests or GPU benchmark suites, concurrency performance can significantly differ between GPUs, making it essential for consumers and buyers to compare test results across different system configurations and price points.
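To make the client side of a concurrency test concrete, here is a minimal sketch of how a fixed concurrency level can be enforced when issuing requests. The HTTP call to the inference server is stubbed out with a short sleep, so the numbers are purely illustrative:

```python
import asyncio

async def send_request(i: int, sem: asyncio.Semaphore) -> int:
    # Placeholder for an HTTP call to the inference server;
    # here we only simulate request latency.
    async with sem:
        await asyncio.sleep(0.01)
        return i

async def run(concurrency: int, total: int) -> int:
    # A semaphore caps how many requests are in flight at once,
    # which is how concurrency levels (1..512) are enforced client-side.
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(*(send_request(i, sem) for i in range(total)))
    return len(results)

print(asyncio.run(run(concurrency=8, total=32)))  # 32
```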
What is vLLM?
vLLM is a fast and easy-to-use open-source library for large language model (LLM) inference and serving, supported by a community of contributors. It handles both cloud and self-hosted LLM deployments by managing memory, processing concurrent requests, and serving models like gpt-oss-20b efficiently. For self-hosted LLMs, vLLM simplifies deployment with features like PagedAttention [1] for memory management, continuous batching, and support for both NVIDIA and AMD GPUs, enabling multiple concurrent requests on local hardware.
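As a quick illustration, serving a model behind vLLM's OpenAI-compatible API takes two commands; the version pin and port are assumptions here, so adjust them for your deployment:

```shell
# Install vLLM, then serve gpt-oss-20b behind an OpenAI-compatible API.
# Version pin and port are illustrative assumptions.
pip install "vllm==0.11.0"
vllm serve openai/gpt-oss-20b --port 8000
```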
Concurrency benchmark methodology
We tested the latest high-performance GPU architectures from both NVIDIA and AMD to evaluate their concurrency scaling capabilities for AI inference workloads. Our benchmark tested the NVIDIA H100, H200, and B200 GPUs alongside AMD’s MI300X, running the OpenAI gpt-oss-20b model via vLLM under varying concurrent load conditions. Through measurement of throughput metrics, latency distributions, and resource utilization patterns, this analysis aims to provide insights for AI inference deployments.
Test infrastructure
We deployed our tests on Runpod’s cloud infrastructure, using the latest NVIDIA and AMD GPU architectures with the vLLM framework.
- GPU platform: Runpod cloud infrastructure (H100, H200, B200, and MI300X)
- Model: OpenAI GPT-OSS-20B via vLLM framework
Software environment
NVIDIA GPUs (H100, H200, B200):
- RunPod template: `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`
- vLLM installation: `vllm[flashinfer]==0.11.0`
AMD GPU (MI300X):
- Docker image: `rocm/vllm-dev:open-mi300-08052025`
vLLM server configuration
Different vLLM settings were used to optimize performance for each hardware architecture.
- For NVIDIA H100, H200, and B200 GPUs, the server was launched with the following command:
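A typical launch for these GPUs looks like the following sketch; every flag shown is an illustrative assumption, not the benchmark's verbatim configuration:

```shell
# Illustrative vLLM launch for H100/H200/B200; flags are assumptions.
vllm serve openai/gpt-oss-20b \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```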
- For the AMD MI300X GPU, a ROCm-optimized vLLM build was used with specific settings for the architecture:
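A representative ROCm setup looks like the sketch below; the Docker image tag comes from the article, while the device flags and launch options are illustrative assumptions:

```shell
# Illustrative ROCm container setup for the MI300X.
# Image tag is from the article; other flags are assumptions.
docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  rocm/vllm-dev:open-mi300-08052025

# Inside the container:
vllm serve openai/gpt-oss-20b --port 8000
```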
Note: This benchmark was conducted using vLLM v0.11.0. Newer vLLM releases introduce architectural changes that may produce different throughput results.
Benchmark configuration
Each GPU was tested across 9 different concurrency levels with standardized parameters to ensure consistent results.
- Concurrency levels: 1, 4, 8, 16, 32, 64, 128, 256, 512 concurrent requests
- Test duration: 180-second measurement phase with 30-second ramp-up and cool-down
- Request size: 1,000 input tokens and 1,000 output tokens per request
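The phase structure above can be sketched as follows; `fire_request` stands in for a real blocking request to the inference server and is an assumption of this sketch:

```python
import time

def run_phase(duration_s: float, fire_request) -> int:
    """Issue requests for `duration_s` seconds and count completions."""
    done, deadline = 0, time.monotonic() + duration_s
    while time.monotonic() < deadline:
        fire_request()
        done += 1
    return done

def benchmark(fire_request, ramp_s: float = 30.0, measure_s: float = 180.0) -> float:
    run_phase(ramp_s, fire_request)                  # ramp-up, discarded
    completed = run_phase(measure_s, fire_request)   # measured window
    run_phase(ramp_s, fire_request)                  # cool-down, discarded
    return completed / measure_s                     # requests per second
```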
Note on result validation: Before recording the final metrics, we ran numerous tests to determine the optimal configuration for each GPU. Once identified, the benchmark was run three consecutive times to verify stability. The throughput results were consistent across these runs, with a variance of less than 0.1%. The figures reported in this analysis are taken from the final run of these three consecutive executions.
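The stability check can be expressed as a simple spread calculation; the throughput numbers below are illustrative, not the benchmark's measured values:

```python
def relative_spread(runs: list[float]) -> float:
    """Maximum deviation from the mean, as a fraction of the mean."""
    mean = sum(runs) / len(runs)
    return max(abs(r - mean) for r in runs) / mean

# Three consecutive throughput runs (illustrative numbers):
runs = [4001.0, 4000.0, 3999.5]
print(relative_spread(runs) < 0.001)  # True: variance under 0.1%
```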
Key metrics
We tracked performance across multiple dimensions to provide a comprehensive view of GPU capabilities under load.
- Throughput: System output tokens per second, successful requests per second, and individual request token generation speed
- Latency: Time to First Token (TTFT), end-to-end latency with P50/P95/P99 percentiles, average latency per request
- Reliability: Success rate percentage, timeout vs. other error classification
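As an illustration of how the latency percentiles are derived, here is a minimal nearest-rank percentile over a batch of latency samples (the sample values are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of latency samples (p in [0, 100])."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = list(range(1, 101))   # illustrative: 1..100 ms
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99
```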
Software stack considerations
Performance is not solely a function of hardware. Frameworks like vLLM have more mature, highly optimized support for NVIDIA’s CUDA ecosystem compared to AMD’s ROCm. Performance differences observed in MI300X results may partly reflect the current state of software optimization rather than the hardware’s theoretical potential.
Next-generation hardware roadmap
The GPUs tested in this benchmark, the B200, H200, H100, and MI300X, represent the current generation of AI inference hardware. Both NVIDIA and AMD have announced their successors, which is relevant context for teams planning infrastructure investments for 2026 and beyond.
On the NVIDIA side, Jensen Huang announced at CES 2026 that the Vera Rubin NVL72 platform has entered full production, with the first systems expected to ship in the second half of 2026 [2]. According to NVIDIA, the Rubin GPU delivers approximately 50 PFLOPs of FP4 inference performance, roughly five times that of Blackwell-based systems like the B200 benchmarked here [3].
On the AMD side, the Instinct MI400, based on the CDNA 5 architecture, is planned for 2026 and is expected to roughly double MI350 compute performance while introducing 432 GB of HBM4 memory [4]. AMD has also announced that Meta will deploy custom MI450-based Instinct servers at up to 6 gigawatts of capacity, with shipments beginning in the second half of 2026 [5]. Oracle will additionally offer a publicly available AI supercluster powered by approximately 50,000 MI450-series GPUs starting in Q3 2026 [6].
For teams evaluating the GPUs in this benchmark for near-term deployments, the B200 and MI300X remain the highest-performing options currently available. For longer planning horizons, the 2026 roadmap suggests a significant step-change in both throughput and cost efficiency from both vendors.
Conclusion
The B200 leads in throughput and scales well for batch inference. The MI300X offers the fastest response times at low concurrency, making it a better fit for real-time applications like chatbots. The H100 and H200 sit in between, covering general-purpose workloads without excelling in either dimension.
The core trade-off holds across all hardware: higher concurrency increases system throughput but raises per-request latency. Choose based on whether your workload prioritizes volume or response time.
Further reading
Explore other AI hardware research, such as:
- Top 20 AI Chip Makers: NVIDIA & Its Competitors
- Cloud GPUs for Deep Learning: Availability & Price / Performance
- Best 10 Serverless GPU Clouds & 14 Cost-Effective GPUs
- Multi-GPU Benchmark