I have spent the last 20 years focusing on system-level computational performance optimization. We benchmarked the latest GPUs for AI inference, NVIDIA’s H100, H200, and B200 and AMD’s MI300X, to analyze concurrency scaling. Using the vLLM framework with the gpt-oss-20b model, we tested how these GPUs handle 1 to 512 concurrent requests. By measuring system output throughput, per-query output speed, and end-to-end latency, we share findings to help you understand GPU performance for AI workloads.
Concurrency benchmark results
System output throughput vs concurrency
This chart shows the total number of output tokens generated per second by the system at each concurrency level.
Output speed per query vs concurrency
This metric illustrates how fast an individual query is processed (in tokens per second) as the system gets busier. It is calculated based on the end-to-end latency for a 1,000-token output.
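For example, the conversion from end-to-end latency to per-query output speed is a simple ratio (a minimal sketch; the 20-second latency below is illustrative, not a measured value):

```python
def output_speed(output_tokens: int, e2e_latency_ms: float) -> float:
    """Per-query output speed in tokens/second, derived from end-to-end latency."""
    return output_tokens / (e2e_latency_ms / 1000.0)

# A 1,000-token response that completes in 20,000 ms runs at 50 tokens/second.
print(output_speed(1000, 20000))  # 50.0
```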
End-to-End latency vs concurrency
This chart displays the average time (in milliseconds) it takes to complete a request from start to finish at different concurrency levels.
Tokens per second per dollar vs. Concurrency
This chart evaluates the cost-efficiency of each GPU by measuring how many tokens are generated per second for every dollar spent on hourly rental. This metric is crucial for understanding the return on investment for each hardware option, especially for budget-conscious deployments.
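The metric itself is a straightforward ratio of throughput to rental cost; as a sketch (the throughput and hourly price below are illustrative, not the benchmark's measured values):

```python
def tokens_per_second_per_dollar(system_tps: float, hourly_rate_usd: float) -> float:
    """Cost efficiency: output tokens per second for each dollar of hourly rent."""
    return system_tps / hourly_rate_usd

# Illustrative: a GPU producing 4,000 tok/s at $4.00/hour
# yields 1,000 tok/s per dollar.
print(tokens_per_second_per_dollar(4000, 4.0))  # 1000.0
```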
Note: Pricing is based on on-demand hourly rates from the Runpod cloud platform as of March 2026. Prices are subject to change and may vary based on availability and instance type.
You can read more about our concurrency benchmark methodology.
What is concurrency?
Concurrency refers to a GPU’s ability to process multiple requests simultaneously, a key factor for AI workloads such as large language model inference. In our performance evaluation, concurrency levels represent the number of simultaneous requests (from 1 to 512) sent to the GPU during test runs. Higher concurrency tests the GPU’s capacity to manage parallel tasks without degrading performance, balancing throughput and latency.
Understanding concurrency helps users determine the right GPU for workloads with varying demand or batch processing needs. When running graphics tests or GPU benchmark suites, concurrency performance can significantly differ between GPUs, making it essential for consumers and buyers to compare test results across different system configurations and price points.
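To make the client side of a concurrency test concrete, here is a minimal sketch of how a fixed concurrency level can be enforced when issuing requests. The HTTP call to the inference server is stubbed out with a short sleep, so the numbers are purely illustrative:

```python
import asyncio

async def send_request(i: int, sem: asyncio.Semaphore) -> int:
    # Placeholder for an HTTP call to the inference server;
    # here we only simulate request latency.
    async with sem:
        await asyncio.sleep(0.01)
        return i

async def run(concurrency: int, total: int) -> int:
    # A semaphore caps how many requests are in flight at once,
    # which is how concurrency levels (1..512) are enforced client-side.
    sem = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(*(send_request(i, sem) for i in range(total)))
    return len(results)

print(asyncio.run(run(concurrency=8, total=32)))  # 32
```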
What is vLLM?
vLLM is a fast and easy-to-use open-source library for large language model (LLM) inference and serving, supported by a community of contributors. It handles both cloud and self-hosted LLM deployments by managing memory, processing concurrent requests, and serving models like gpt-oss-20b efficiently. For self-hosted LLMs, vLLM simplifies deployment with features like PagedAttention [1] for memory management, continuous batching, and support for both NVIDIA and AMD GPUs, enabling multiple concurrent requests on local hardware.
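As a quick illustration, serving a model behind vLLM's OpenAI-compatible API takes two commands; the version pin and port are assumptions here, so adjust them for your deployment:

```shell
# Install vLLM, then serve gpt-oss-20b behind an OpenAI-compatible API.
# Version pin and port are illustrative assumptions.
pip install "vllm==0.11.0"
vllm serve openai/gpt-oss-20b --port 8000
```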
Concurrency benchmark methodology
We tested the latest high-performance GPU architectures from both NVIDIA and AMD to evaluate their concurrency scaling capabilities for AI inference workloads. Our benchmark tested the NVIDIA H100, H200, and B200 GPUs alongside AMD’s MI300X, running the OpenAI gpt-oss-20b model via vLLM under varying concurrent load conditions. Through measurement of throughput metrics, latency distributions, and resource utilization patterns, this analysis aims to provide insights for AI inference deployments.
Test infrastructure
We deployed our tests on Runpod’s cloud infrastructure, using the latest NVIDIA and AMD GPU architectures with the vLLM framework.
- GPU platform: Runpod cloud infrastructure (H100, H200, B200, and MI300X)
- Model: OpenAI GPT-OSS-20B via vLLM framework
Software environment
NVIDIA GPUs (H100, H200, B200):
- RunPod template: `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`
- vLLM installation: `vllm[flashinfer]==0.11.0`
AMD GPU (MI300X):
- Docker image: `rocm/vllm-dev:open-mi300-08052025`
vLLM server configuration
Different vLLM settings were used to optimize performance for each hardware architecture.
- For NVIDIA H100, H200, and B200 GPUs, the server was launched with the following command:
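A typical launch for these GPUs looks like the following sketch; every flag shown is an illustrative assumption, not the benchmark's verbatim configuration:

```shell
# Illustrative vLLM launch for H100/H200/B200; flags are assumptions.
vllm serve openai/gpt-oss-20b \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```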
- For the AMD MI300X GPU, a ROCm-optimized vLLM build was used with specific settings for the architecture:
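A representative ROCm setup looks like the sketch below; the Docker image tag comes from the article, while the device flags and launch options are illustrative assumptions:

```shell
# Illustrative ROCm container setup for the MI300X.
# Image tag is from the article; other flags are assumptions.
docker run -it --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  rocm/vllm-dev:open-mi300-08052025

# Inside the container:
vllm serve openai/gpt-oss-20b --port 8000
```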
Note: This benchmark was conducted using vLLM v0.11.0. Newer vLLM releases introduce architectural changes that may produce different throughput results.
Benchmark configuration
Each GPU was tested across 9 different concurrency levels with standardized parameters to ensure consistent results.
- Concurrency levels: 1, 4, 8, 16, 32, 64, 128, 256, 512 concurrent requests
- Test duration: 180-second measurement phase with 30-second ramp-up and cool-down
- Request size: 1,000 input tokens and 1,000 output tokens per request
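The phase structure above can be sketched as follows; `fire_request` stands in for a real blocking request to the inference server and is an assumption of this sketch:

```python
import time

def run_phase(duration_s: float, fire_request) -> int:
    """Issue requests for `duration_s` seconds and count completions."""
    done, deadline = 0, time.monotonic() + duration_s
    while time.monotonic() < deadline:
        fire_request()
        done += 1
    return done

def benchmark(fire_request, ramp_s: float = 30.0, measure_s: float = 180.0) -> float:
    run_phase(ramp_s, fire_request)                  # ramp-up, discarded
    completed = run_phase(measure_s, fire_request)   # measured window
    run_phase(ramp_s, fire_request)                  # cool-down, discarded
    return completed / measure_s                     # requests per second
```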
Note on result validation: Before recording the final metrics, we ran numerous tests to determine the optimal configuration for each GPU. Once identified, the benchmark was run three consecutive times to verify stability. The throughput results were consistent across these runs, with a variance of less than 0.1%. The figures reported in this analysis are taken from the final run of these three consecutive executions.
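The stability check can be expressed as a simple spread calculation; the throughput numbers below are illustrative, not the benchmark's measured values:

```python
def relative_spread(runs: list[float]) -> float:
    """Maximum deviation from the mean, as a fraction of the mean."""
    mean = sum(runs) / len(runs)
    return max(abs(r - mean) for r in runs) / mean

# Three consecutive throughput runs (illustrative numbers):
runs = [4001.0, 4000.0, 3999.5]
print(relative_spread(runs) < 0.001)  # True: variance under 0.1%
```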
Key metrics
We tracked performance across multiple dimensions to provide a comprehensive view of GPU capabilities under load.
- Throughput: System output tokens per second, successful requests per second, and individual request token generation speed
- Latency: Time to First Token (TTFT), end-to-end latency with P50/P95/P99 percentiles, average latency per request
- Reliability: Success rate percentage, timeout vs. other error classification
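As an illustration of how the latency percentiles are derived, here is a minimal nearest-rank percentile over a batch of latency samples (the sample values are illustrative):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of latency samples (p in [0, 100])."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[rank]

latencies_ms = list(range(1, 101))   # illustrative: 1..100 ms
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99
```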
Software stack considerations
Performance is not solely a function of hardware. Frameworks like vLLM have more mature, highly optimized support for NVIDIA’s CUDA ecosystem compared to AMD’s ROCm. Performance differences observed in MI300X results may partly reflect the current state of software optimization rather than the hardware’s theoretical potential.
Next-generation hardware roadmap
The GPUs tested in this benchmark, the B200, H200, H100, and MI300X, represent the current generation of AI inference hardware. Both NVIDIA and AMD have announced their successors, which is relevant context for teams planning infrastructure investments for 2026 and beyond.
On the NVIDIA side, Jensen Huang announced at CES 2026 that the Vera Rubin NVL72 platform has entered full production, with the first systems expected to ship in the second half of 2026 [2]. According to NVIDIA, the Rubin GPU delivers approximately 50 PFLOPs of FP4 inference performance, roughly five times that of Blackwell-based systems like the B200 benchmarked here [3].
On the AMD side, the Instinct MI400, based on the CDNA 5 architecture, is planned for 2026 and is expected to roughly double MI350 compute performance while introducing 432 GB of HBM4 memory [4]. AMD has also announced that Meta will deploy custom MI450-based Instinct servers at up to 6 gigawatts of capacity, with shipments beginning in the second half of 2026 [5]. Oracle will additionally offer a publicly available AI supercluster powered by approximately 50,000 MI450-series GPUs starting in Q3 2026 [6].
For teams evaluating the GPUs in this benchmark for near-term deployments, the B200 and MI300X remain the highest-performing options currently available. For longer planning horizons, the 2026 roadmap suggests a significant step-change in both throughput and cost efficiency from both vendors.
Conclusion
The B200 leads in throughput and scales well for batch inference. The MI300X offers the fastest response times at low concurrency, making it a better fit for real-time applications like chatbots. The H100 and H200 sit in between, covering general-purpose workloads without excelling in either dimension.
The core trade-off holds across all hardware: higher concurrency increases system throughput but raises per-request latency. Choose based on whether your workload prioritizes volume or response time.
Further reading
Explore other AI hardware research, such as:
- Top 20 AI Chip Makers: NVIDIA & Its Competitors
- Cloud GPUs for Deep Learning: Availability & Price / Performance
- Best 10 Serverless GPU Clouds & 14 Cost-Effective GPUs
- Multi-GPU Benchmark