Guide

Which GPU & how much VRAM for your LLM?

Q: Does Llama 70B fit on an RTX 4090 or 5090?

Not in practical quantization on a single card: 70B needs roughly 40 GB in 4-bit, more than the 4090's 24 GB or the 5090's 32 GB. You'll need multi-GPU, CPU offload or a datacenter GPU — and we'll advise honestly.

Q: How much VRAM does a 7B model need?

Around 14 GB just for the weights in FP16, or about 5 GB in 4-bit quantization. Both run comfortably on the RTX 4090 with 24 GB — plus a buffer for context and KV cache.

Q: Does only VRAM matter, or system RAM too?

For fast GPU inference, VRAM matters most. With CPU offloading you can move parts of the model into system RAM — extending capacity but costing significant speed. Our GPU servers ship with 96–128 GB of RAM for exactly that.

How much VRAM does an LLM really need — and does Llama 70B fit on a single GPU? This guide shows the VRAM footprint of 7B to 70B models at FP16 and 4-bit quantization, explains quantization in plain terms, and tells you exactly what fits on an RTX 4090 (24 GB) or RTX 5090 (32 GB) .

Rent a GPU server

The rough rule is simple: VRAM needed ≈ number of parameters × bytes per parameter. FP16 uses 2 bytes per parameter, 8-bit uses 1 byte, and 4-bit around 0.5 bytes. So a 7B model needs roughly 14 GB just for the weights in FP16. On top of that comes overhead — more on that below. This table gives you the ballpark so you pick the right GPU before you even download the model.

VRAM footprint by model size

These figures are approximations for the raw model weights (not measured benchmarks). In practice the KV cache and context length add on top — plan a 15–25% buffer.

Approximate VRAM per model and quantization (weights only)
Model	FP16 (~2 B/param)	4-bit (~0.5 B/param)	Fits 4090 (24 GB)	Fits 5090 (32 GB)
7B	~14 GB	~5 GB	Yes (even FP16)	Yes (even FP16)
13B	~26 GB	~8 GB	4-bit only	FP16 tight · 4-bit easy
34B	~68 GB	~20 GB	4-bit only (tight)	4-bit comfortable
70B	~140 GB	~40 GB	No	No (multi-GPU/offload)

Quantization in plain terms

Quantization stores the weights with fewer bits per number. Instead of 16 bits (FP16), 4-bit quantization uses only about a quarter of the memory — shrinking a 70B model from ~140 GB to ~40 GB. With modern methods like Q4_K_M, AWQ or GPTQ, the quality loss is small for most tasks and often barely noticeable. In practice you load such models with Ollama , vLLM or TGI in GGUF, AWQ or GPTQ format. For inference , 4-bit is usually the best trade-off between quality and VRAM; for training, higher precision still matters.

Does it fit a 4090 or 5090?

7B & 13B: run comfortably on the RTX 4090 (24 GB) — though 13B should be quantized.
34B: tight in 4-bit on the 4090, comfortable on the RTX 5090 (32 GB) .
70B: does not fit either card alone in practical quantization — you'll need multi-GPU, CPU offload or a datacenter alternative.
Long context / large batch: costs extra VRAM — size up a tier when in doubt.

Find the right GPU server

A rule of thumb to remember

VRAM (GB) ≈ parameters (billions) × bytes per parameter + 15–25% overhead. Bytes per parameter: FP16 = 2, 8-bit = 1, 4-bit = 0.5. Do the quick math once, add the buffer for KV cache and context, and you instantly know whether your model runs on 24 GB, 32 GB or only across multiple GPUs. Unsure about your specific model? Tell us the model and context length and we'll recommend the right card.

Frequently asked questions

Does Llama 70B fit on an RTX 4090 or 5090?

How much VRAM does a 7B model need?

Does 4-bit quantization cost much quality?

Does only VRAM matter, or system RAM too?