Guide

Which GPU & how much VRAM for your LLM?

How much VRAM does an LLM really need — and does Llama 70B fit on a single GPU? This guide shows the VRAM footprint of 7B to 70B models at FP16 and 4-bit quantization, explains quantization in plain terms, and tells you exactly what fits on an RTX 4090 (24 GB) or RTX 5090 (32 GB) .

Rent a GPU server

The rough rule is simple: VRAM needed ≈ number of parameters × bytes per parameter. FP16 uses 2 bytes per parameter, 8-bit uses 1 byte, and 4-bit around 0.5 bytes. So a 7B model needs roughly 14 GB just for the weights in FP16. On top of that comes overhead — more on that below. This table gives you the ballpark so you pick the right GPU before you even download the model.

VRAM footprint by model size

These figures are approximations for the raw model weights (not measured benchmarks). In practice the KV cache and context length add on top — plan a 15–25% buffer.

Approximate VRAM per model and quantization (weights only)
ModelFP16 (~2 B/param)4-bit (~0.5 B/param)Fits 4090 (24 GB)Fits 5090 (32 GB)
7B~14 GB~5 GBYes (even FP16)Yes (even FP16)
13B~26 GB~8 GB4-bit onlyFP16 tight · 4-bit easy
34B~68 GB~20 GB4-bit only (tight)4-bit comfortable
70B~140 GB~40 GBNoNo (multi-GPU/offload)

Quantization in plain terms

Quantization stores the weights with fewer bits per number. Instead of 16 bits (FP16), 4-bit quantization uses only about a quarter of the memory — shrinking a 70B model from ~140 GB to ~40 GB. With modern methods like Q4_K_M, AWQ or GPTQ, the quality loss is small for most tasks and often barely noticeable. In practice you load such models with Ollama , vLLM or TGI in GGUF, AWQ or GPTQ format. For inference , 4-bit is usually the best trade-off between quality and VRAM; for training, higher precision still matters.

Does it fit a 4090 or 5090?

  • 7B & 13B: run comfortably on the RTX 4090 (24 GB) — though 13B should be quantized.
  • 34B: tight in 4-bit on the 4090, comfortable on the RTX 5090 (32 GB) .
  • 70B: does not fit either card alone in practical quantization — you'll need multi-GPU, CPU offload or a datacenter alternative.
  • Long context / large batch: costs extra VRAM — size up a tier when in doubt.

A rule of thumb to remember

VRAM (GB) ≈ parameters (billions) × bytes per parameter + 15–25% overhead. Bytes per parameter: FP16 = 2, 8-bit = 1, 4-bit = 0.5. Do the quick math once, add the buffer for KV cache and context, and you instantly know whether your model runs on 24 GB, 32 GB or only across multiple GPUs. Unsure about your specific model? Tell us the model and context length and we'll recommend the right card.

Frequently asked questions