How to Install Ollama and Self-Host Your Own LLM
This tutorial walks you through installing Ollama on a dedicated GPU server step by step — from the NVIDIA driver to your first model to a locked-down API. The result: a private LLM endpoint in Frankfurt, GDPR-compliant and free of per-token billing.
Rent an Ollama serverOllama is the fastest way to run a local large language model in production: it bundles model download, quantization and an OpenAI-compatible API into a single binary. The catch is that inference crawls without a GPU. A dedicated server with an RTX 4090 or RTX 5090 supplies the compute you need — no preemption , at a fixed monthly cost.
Prerequisites
You need root access to a Linux server (Ubuntu 22.04 or 24.04 keeps things simple), an NVIDIA GPU with enough VRAM for your target model, and ideally a domain for the TLS certificate. To size VRAM per model, see our guide which GPU and how much VRAM for which LLM — as a rough rule, 7B to 13B models fit comfortably in 24 GB, while 70B needs quantization.
How to set up an Ollama server — step by step
- Provision the server: rent a GPU server for AI with an RTX 4090 or 5090 and install a recent Ubuntu. Root access is required so you can set up drivers and services freely.
- Install the NVIDIA driver and CUDA: apply the matching NVIDIA driver and confirm the card is detected with the nvidia-smi command. Ollama uses CUDA for GPU acceleration; without a working driver it falls back to the slow CPU.
- Install Ollama: the official install script sets up the binary and a systemd service, so the daemon starts automatically on boot and runs in the background.
- Pull and run a model: ollama pull <model> downloads a model (for example an 8B model or a quantized 70B), and ollama run <model> opens your first interactive prompt for testing.
- Test the API: Ollama exposes an OpenAI-compatible endpoint on port 11434. Verify it locally with a curl request before you expose it to the outside.
- Add a reverse proxy and TLS: put Nginx or Caddy in front, terminate HTTPS with a Let's Encrypt certificate, and forward only authenticated requests to port 11434.
- Harden access: bind Ollama to localhost, enforce an API token or basic auth at the proxy, and restrict the port with a firewall. An LLM endpoint should never sit unprotected on the open internet.
Cost: your own server vs. a cloud API
Whether your own Ollama server pays off depends on volume. Token APIs are cheaper for low, sporadic usage; once load is steady, the math tips toward a fixed monthly price — and your prompts never leave your server. Our guide self-hosted LLM vs. API works out the break-even in detail.
| Criterion | Own GPU server (Bthorio) | Cloud token API |
|---|---|---|
| Billing | Fixed monthly price | Per token / request |
| Cost at high load | Constant, predictable | Grows linearly with usage |
| Data privacy | Prompts stay in the EU | Data goes to the provider |
| Model choice | Free (any open model) | Only offered models |
| Latency | Constant, no cold start | Varies with provider load |
Day-to-day operation
Once the endpoint is secured, the focus shifts to everyday operation: keeping models current, watching utilization and holding response times steady. Ollama loads models on demand and keeps them in VRAM; if you serve several models, watch memory especially closely.
- Watch utilization: check GPU memory and load regularly with nvidia-smi to catch bottlenecks early.
- Maintain models: pull new versions deliberately and remove unused ones to free VRAM and disk space.
- Meter context length: very long contexts cost VRAM and time — size them as large as needed, not as large as possible.
- Connect clients: thanks to the OpenAI-compatible API, many tools talk to the endpoint directly; store the API token centrally rather than in each client.
- Reboot behaviour: because Ollama runs as a systemd service, the endpoint comes back up on its own after a reboot — test that deliberately once.