Tutorial

How to Install Ollama and Self-Host Your Own LLM

This tutorial walks you through installing Ollama on a dedicated GPU server step by step — from the NVIDIA driver to your first model to a locked-down API. The result: a private LLM endpoint in Frankfurt, GDPR-compliant and free of per-token billing.

Rent an Ollama server

Ollama is the fastest way to run a local large language model in production: it bundles model download, quantization and an OpenAI-compatible API into a single binary. The catch is that inference crawls without a GPU. A dedicated server with an RTX 4090 or RTX 5090 supplies the compute you need — no preemption , at a fixed monthly cost.

Prerequisites

You need root access to a Linux server (Ubuntu 22.04 or 24.04 keeps things simple), an NVIDIA GPU with enough VRAM for your target model, and ideally a domain for the TLS certificate. To size VRAM per model, see our guide which GPU and how much VRAM for which LLM — as a rough rule, 7B to 13B models fit comfortably in 24 GB, while 70B needs quantization.

How to set up an Ollama server — step by step

  1. Provision the server: rent a GPU server for AI with an RTX 4090 or 5090 and install a recent Ubuntu. Root access is required so you can set up drivers and services freely.
  2. Install the NVIDIA driver and CUDA: apply the matching NVIDIA driver and confirm the card is detected with the nvidia-smi command. Ollama uses CUDA for GPU acceleration; without a working driver it falls back to the slow CPU.
  3. Install Ollama: the official install script sets up the binary and a systemd service, so the daemon starts automatically on boot and runs in the background.
  4. Pull and run a model: ollama pull <model> downloads a model (for example an 8B model or a quantized 70B), and ollama run <model> opens your first interactive prompt for testing.
  5. Test the API: Ollama exposes an OpenAI-compatible endpoint on port 11434. Verify it locally with a curl request before you expose it to the outside.
  6. Add a reverse proxy and TLS: put Nginx or Caddy in front, terminate HTTPS with a Let's Encrypt certificate, and forward only authenticated requests to port 11434.
  7. Harden access: bind Ollama to localhost, enforce an API token or basic auth at the proxy, and restrict the port with a firewall. An LLM endpoint should never sit unprotected on the open internet.

Cost: your own server vs. a cloud API

Whether your own Ollama server pays off depends on volume. Token APIs are cheaper for low, sporadic usage; once load is steady, the math tips toward a fixed monthly price — and your prompts never leave your server. Our guide self-hosted LLM vs. API works out the break-even in detail.

Ollama server vs. token API
CriterionOwn GPU server (Bthorio)Cloud token API
BillingFixed monthly pricePer token / request
Cost at high loadConstant, predictableGrows linearly with usage
Data privacyPrompts stay in the EUData goes to the provider
Model choiceFree (any open model)Only offered models
LatencyConstant, no cold startVaries with provider load

Day-to-day operation

Once the endpoint is secured, the focus shifts to everyday operation: keeping models current, watching utilization and holding response times steady. Ollama loads models on demand and keeps them in VRAM; if you serve several models, watch memory especially closely.

  • Watch utilization: check GPU memory and load regularly with nvidia-smi to catch bottlenecks early.
  • Maintain models: pull new versions deliberately and remove unused ones to free VRAM and disk space.
  • Meter context length: very long contexts cost VRAM and time — size them as large as needed, not as large as possible.
  • Connect clients: thanks to the OpenAI-compatible API, many tools talk to the endpoint directly; store the API token centrally rather than in each client.
  • Reboot behaviour: because Ollama runs as a systemd service, the endpoint comes back up on its own after a reboot — test that deliberately once.

Frequently asked questions