Ollama Server Hosting — a Dedicated GPU for Self-Hosted LLMs
Rent a dedicated GPU server for Ollama and self-hosted LLMs — a whole RTX 4090 or RTX 5090 in Frankfurt, with no cold start, no preemption , and a private, GDPR-compliant endpoint. A fixed monthly price instead of per-token billing.
Request a GPU server for OllamaOllama makes self-hosting LLMs effortless — one command and a model runs locally. The catch: without your own GPU you quickly end up on serverless endpoints with cold starts, shared cloud GPUs with preemption, or token APIs that send your prompts to third parties outside the EU. A dedicated GPU server flips that around — your Ollama runs around the clock on a whole card that's yours alone, behind a fixed, private endpoint in Frankfurt.
Why a dedicated server instead of a cloud GPU?
Shared and hourly cloud GPUs sound cheap, but for an inference service they bring three problems: the first request after idle waits on a cold start, preemption can kill live sessions, and per-hour billing makes cost unpredictable. With Bthorio you get dedicated bare metal instead — the GPU stays warm, your model stays resident in VRAM , and the price is fixed. No warm-up, no interruption, no surprise on the invoice.
Which LLM fits in which VRAM?
The decisive factor in LLM hosting is available VRAM . As a rough rule of thumb, a 4-bit quantized model needs about half as many gigabytes of VRAM as it has billions of parameters — plus headroom for context and KV cache. The table below shows which popular Ollama models sit comfortably on a 24 GB RTX 4090 or a 32 GB RTX 5090.
| Model | Parameters | Quantization | VRAM needed (approx.) | Fits on |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | Q4_K_M | ~6 GB | RTX 4090 & 5090 |
| Mistral / Gemma 2 9B | 7–9B | Q4_K_M | ~6–7 GB | RTX 4090 & 5090 |
| Qwen 2.5 14B | 14B | Q4_K_M | ~10 GB | RTX 4090 & 5090 |
| Qwen 2.5 / Llama 3.3 32B | 32B | Q4_K_M | ~20 GB | RTX 4090 (tight) & 5090 |
| Mixtral 8x7B (MoE) | 47B | Q4_K_M | ~28 GB | RTX 5090 |
| Llama 3.1 70B | 70B | Q4_K_M | ~40 GB+ | Multi-GPU / datacenter |
These figures are guide values for the usual Q4 quantizations including some context overhead; long context windows and larger batch sizes push the requirement up noticeably. If you want to go deeper, our guide Which GPU/VRAM for which LLM? covers quantization levels and memory footprint in detail.
RTX 4090 or RTX 5090 for Ollama?
For most Ollama setups — 7B to 14B models, one to a few concurrent users — the RTX 4090 with 24 GB from €399/month is the most economical choice. As soon as you need 30B-plus models without aggressive quantization, longer contexts or more parallel throughput, the RTX 5090 with 32 GB and Blackwell architecture come into their own. You'll find both on our dedicated GPU server overview.
- A private chat assistant for your team or customers — no prompt leaves the EU
- A RAG backend with local embeddings and your own vector store
- A coding assistant (e.g. via Continue.dev) over the OpenAI-compatible Ollama API
- Batch processing: summarizing, classifying and extracting across large document sets
A fixed monthly price instead of token costs
If you process a lot of tokens, commercial APIs charge per request — and the total scales with usage. Your own server costs the same every month, no matter how many millions of tokens flow through it. For continuously loaded assistants, batch jobs or RAG backends the math tips quickly in favour of self-hosting — and your data stays entirely under your control.
Up and running in minutes
You get root access to bare metal and choose the operating system, drivers and CUDA version freely. Ollama installs with a single command; from there you pull your model and expose the endpoint behind a reverse proxy, TLS and authentication. For a full step-by-step walkthrough see our guide to installing Ollama .