
LLM Quantization: BF16 vs FP8 vs INT4

Ekrem Sarı
updated on Mar 17, 2026

We benchmarked Qwen3-32B at four precision levels (BF16, FP8, GPTQ-Int8, GPTQ-Int4) on a single NVIDIA H100 80GB GPU. Each configuration was evaluated on two benchmarks (~12.2K questions) covering knowledge and code generation, plus 2,000+ inference runs to measure throughput. Int4 is 2.7x faster than BF16 and loses less than 2 points on MMLU-Pro, but code generation (HumanEval) drops 8 points.

Quantization benchmark results


MMLU-Pro tests broad reasoning across 14 domains (~12K questions, 5-shot). This is the harder version of MMLU with 10-choice questions instead of 4.

HumanEval tests code generation (164 problems, pass@1). The model writes Python functions that run against unit tests. This is the only benchmark where the output is executed, not just scored.

Throughput is output tokens per second at batch size 1. 

Model size is GPU memory consumed by weights alone, measured after loading.

MMLU-Pro breakdown by category

Engineering and law show the largest drops at Int4. Math stays stable across all precisions.

Memory capacity and concurrency

GPU monitoring tools like nvidia-smi report near-full utilization regardless of model size because vLLM pre-allocates all available memory. The real question is how that memory splits between model weights and KV cache, because KV cache determines how many users you can serve concurrently.

Max users is the memory-bound ceiling before OOM: total token capacity divided by context length per user. This is the theoretical maximum. In practice, scheduling overhead reduces it slightly.

This has direct implications for reasoning models. DeepSeek-R1 and Qwen-QwQ generate thousands of internal “thought” tokens (often 2K-5K) before producing a final answer. On BF16, a single reasoning request could consume the entire 17K token capacity, blocking a second user. On Int4, the 193K capacity fits multiple concurrent reasoning sessions.
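The token capacities above can be sanity-checked with back-of-envelope arithmetic. This sketch assumes Qwen3-32B's published architecture (64 layers, 8 KV heads under grouped-query attention, head dimension 128) and a 16-bit KV cache; the resulting figures land close to the 17K and 193K capacities quoted above.

```python
# KV-cache sizing sketch; architecture numbers are assumptions taken from
# the Qwen3-32B model card, not measured from the benchmark itself.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2  # K+V, 2 bytes each

def max_users(free_gib: float, context_len: int = 4096) -> tuple[int, int]:
    """Return (token capacity, memory-bound concurrent users at context_len)."""
    capacity = int(free_gib * 2**30 / BYTES_PER_TOKEN)
    return capacity, capacity // context_len
```

With the free-memory figures from this benchmark, `max_users(4.4)` gives roughly 18K tokens and 4 users for BF16, while `max_users(47.3)` gives roughly 194K tokens and 47 users for Int4.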

Key findings

FP8 accuracy loss is negligible

FP8 scores 69.64% on MMLU-Pro vs 70.24% for BF16, a 0.6 point difference across 12,000 questions. On HumanEval, both FP8 and BF16 score identically at 39.02%. FP8 gives you 1.5x throughput and cuts your model size in half for a 0.6 point cost.

GPTQ-Int8 scores 70.32% on MMLU-Pro but drops 1.8 points on HumanEval (37.20%). If code generation matters, FP8 is the safer pick.

Int4 degrades code generation more than knowledge

MMLU-Pro drops 1.6 points at Int4 (70.24% to 68.66%). HumanEval drops 8 points (39.02% to 31.10%). Code generation requires precise token predictions where small weight errors compound across function bodies.

The real win is concurrency, not speed

Int4 is 2.7x faster than BF16. But the larger effect is on memory. BF16 leaves only 4.4 GB for KV cache, enough for about 4 concurrent users at 4K context. Int4 frees up 47.3 GB, enough for 47 users, a 12x increase in serving capacity from the same GPU.

Math scores hold across all precisions

Math scores barely move: 81.87% at BF16, 81.87% at FP8, 81.87% at Int8, 80.24% at Int4. Engineering (49.64% to 43.45%) and law (43.05% to 40.60%) are more sensitive.

Cost per token

Using H100 SXM pricing on RunPod ($2.69/hour) at batch size 1:

These numbers reflect single-user, real-time generation. Batch processing drops the cost further.
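The cost arithmetic is simple: hourly GPU price divided by tokens generated per hour. A minimal sketch, where the example throughputs are assumptions consistent with the article's 2.7x speedup and quoted cost figures rather than reported measurements:

```python
# Cost per 1M output tokens at batch size 1, single user.
HOURLY_USD = 2.69  # H100 SXM on RunPod, per the article

def cost_per_million(tokens_per_sec: float) -> float:
    """USD per 1M generated tokens at the given sustained throughput."""
    return HOURLY_USD / 3600 / tokens_per_sec * 1e6
```

At an assumed ~26 tok/s (BF16) this yields about $28.7 per 1M tokens; at ~70 tok/s (Int4), about $10.7, matching the conclusion's figures.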

LLM quantization benchmark methodology

Environment

  • GPU: Single NVIDIA H100 80GB HBM3 (SXM) via RunPod ($2.69/hr)
  • Software: vLLM 0.17.0, lm-evaluation-harness 0.4.11, PyTorch 2.8.0, CUDA 12.8, Python 3.11
  • Model: Qwen3-32B (post-trained/instruction-tuned) from HuggingFace. No fine-tuning applied.

Accuracy evaluation

  • All evaluations run via lm-evaluation-harness with batch_size="auto".
  • Each task runs in a separate subprocess. Model loaded fresh each time, GPU fully cleaned between tasks. This prevents OOM from memory fragmentation.
  • HumanEval runs with HF_ALLOW_CODE_EVAL=1 (code execution enabled).
  • MMLU-Pro results include per-category breakdown (biology, math, physics, law, etc.).
  • Qwen3’s thinking mode was not active during evaluations. lm-evaluation-harness sends raw formatted prompts without applying the model’s chat template (apply_chat_template=False by default), so the <think> token is never injected.
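The setup above maps onto a single lm-evaluation-harness invocation per precision. A hedged sketch, where the flags follow lm-eval's CLI conventions and only `Qwen/Qwen3-32B` is a verified checkpoint ID:

```shell
# One precision level; quantized runs would swap in the GPTQ checkpoint.
HF_ALLOW_CODE_EVAL=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen3-32B,gpu_memory_utilization=0.90,max_model_len=4096 \
  --tasks mmlu_pro,humaneval \
  --batch_size auto \
  --confirm_run_unsafe_code
```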

Performance evaluation

  • 5 rotating prompts across domains (science, coding, general knowledge)
  • 10 warmup iterations (not measured), then 500 measured iterations
  • Fixed output: max_tokens=256, temperature=0.7, top_p=0.9, batch_size=1
  • Metrics: throughput (tokens/sec), GPU memory usage (GB)
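The measurement loop above can be sketched as a small harness: warmup iterations run but are discarded, then throughput is averaged over the measured iterations. `generate` here is a hypothetical stand-in for a vLLM call returning the number of tokens produced.

```python
import time

def measure_throughput(generate, prompts, warmup=10, iters=500, max_tokens=256):
    """Rotate through prompts; warmup runs are executed but not timed."""
    for i in range(warmup):
        generate(prompts[i % len(prompts)])
    total_tokens = 0
    start = time.perf_counter()
    for i in range(iters):
        # Cap counted tokens at max_tokens, mirroring the fixed output length.
        total_tokens += min(generate(prompts[i % len(prompts)]), max_tokens)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed  # tokens per second
```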

vLLM configuration per precision

All precisions use gpu_memory_utilization=0.90, max_model_len=4096.
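As a sketch, the per-precision engine arguments differ only in the checkpoint (vLLM detects GPTQ and FP8 quantization from the checkpoint config). Only `Qwen/Qwen3-32B` is a verified model ID; the quantized entries are placeholders for the checkpoints the benchmark used.

```python
# Shared settings held constant across all four runs.
COMMON = {"gpu_memory_utilization": 0.90, "max_model_len": 4096}

CONFIGS = {
    "bf16": {"model": "Qwen/Qwen3-32B", "dtype": "bfloat16", **COMMON},
    "fp8": {"model": "<official-fp8-checkpoint>", **COMMON},
    "gptq-int8": {"model": "<gptq-int8-checkpoint>", **COMMON},
    "gptq-int4": {"model": "<gptq-int4-checkpoint>", **COMMON},
}
```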

Split-process architecture

Each benchmark runs as two separate processes to prevent OOM:

  1. Step 1: Load model, warmup, benchmark throughput, save to temp file, exit.
  2. Cleanup: Force-kill vLLM and Ray processes, wait 10 seconds.
  3. Step 2: Load model fresh, run each eval task in a separate subprocess, merge with step 1 metrics, save final JSON.
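The steps above can be sketched with `subprocess`: each step runs in a fresh interpreter so the OS reclaims all GPU memory when it exits. The step bodies here are illustrative print stubs, not the real benchmark code.

```python
import subprocess
import sys
import time

def run_step(code: str) -> subprocess.CompletedProcess:
    """Run one benchmark step in a fresh Python process; all GPU memory
    held by that process is released when it exits."""
    return subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, check=True)

step1 = run_step("print('throughput_tok_s=26.0')")  # load, warmup, benchmark, exit
time.sleep(1)  # stand-in for the 10 s cleanup wait after killing vLLM/Ray
step2 = run_step("print('mmlu_pro=70.24')")  # fresh load, run eval tasks
```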

Controlled variables

To eliminate external factors, the following parameters were fixed across all runs:

Test prompts

The 5 test prompts:

  1. “Explain the theory of relativity in simple terms.” (Science/Abstract)
  2. “Write a Python function to find the longest palindromic substring.” (Coding)
  3. “What are the main causes of climate change and their effects?” (Complex Reasoning)
  4. “Describe the process of photosynthesis step by step.” (Process Description)
  5. “How does a neural network learn from data?” (Technical Explanation)

Data verification: vLLM runtime telemetry

The memory and concurrency figures in this article were derived directly from vLLM engine initialization logs during benchmark execution.

BF16 initialization:

GPTQ-Int4 initialization:

Limitations

All tests use batch size 1, where decoding is memory-bandwidth bound and weight quantization helps most. At larger batch sizes, weight loads are amortized across requests and the throughput gap between Int4 and BF16 typically narrows, though Int4's KV-cache capacity advantage remains.

Results are specific to the H100 SXM. Older GPUs (A100, A10) lack native FP8 support. Consumer GPUs (RTX 4090) have different memory bandwidth characteristics.

The GPTQ models (JunHowie) are community-provided quantizations. Official releases may use different calibration datasets or parameters, which can affect accuracy.

We tested GPTQ only. Other quantization methods (AWQ, BitsAndBytes NF4, GGUF, HQQ) might offer different trade-offs.

Conclusion

For Qwen3-32B on an H100, FP8 is the default choice. You get 1.5x the throughput, half the memory footprint, and a 0.6 point accuracy cost.

Int4 makes sense when you need maximum throughput or concurrency: 2.7x speed, 12x concurrency, at the cost of 1.6 points on MMLU-Pro and 8 points on HumanEval.

Int8 sits in the middle and does not offer a clear advantage over FP8 in this setup. The throughput gain over FP8 is small (43.3 vs 37.9 tok/s) and the accuracy is comparable. FP8 is simpler because it is officially provided by the model authors and does not require a third-party quantized checkpoint.

The biggest practical impact of quantization is not speed, it is concurrency. BF16 can serve 4 users at 4K context on a single H100. Int4 can serve 47. At $2.69/hr, that brings cost per 1M tokens from $28.73 down to $10.69.

Ekrem Sarı
AI Researcher
Ekrem is an AI Researcher at AIMultiple, focusing on intelligent automation, GPUs, AI Agents, and RAG frameworks.

Researched by
Sıla Ermut
Industry Analyst
Sıla Ermut is an industry analyst at AIMultiple focused on email marketing and sales videos. She previously worked as a recruiter in project management and consulting firms. Sıla holds a Master of Science degree in Social Psychology and a Bachelor of Arts degree in International Relations.
