1. Prepare Proxmox host (one-time)

  1. Enable IOMMU in the BIOS/UEFI (VT-d / AMD-Vi); the kernel side is sketched below.
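
    • On the kernel side this typically means intel_iommu=on (Intel) or amd_iommu=on, plus iommu=pt. A sketch for a GRUB-booted PVE host (ZFS/systemd-boot installs edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

      # /etc/default/grub
      GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

      update-grub   # then reboot the host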

  2. On PVE host shell:

    • Verify IOMMU is active after reboot:

      dmesg | grep -e DMAR -e IOMMU
      

      You should see lines like "Intel(R) Virtualization Technology for Directed I/O" and "Queued invalidation enabled", matching your current output. forum.proxmox

  3. Install the NVIDIA driver + CUDA on the host (you already have this; noted here for rebuilds):

    • Add the NVIDIA repo or use the Debian non-free packages (see the sketch below).

    • After install and reboot:

      nvidia-smi
      ls -l /dev/nvidia*
      

      Confirm that both RTX 3060s show up (GPU 0 and 1) and that the device nodes /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl, /dev/nvidia-uvm*, /dev/nvidia-modeset, and /dev/nvidia-caps exist. reddit
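
    • A minimal sketch of the Debian-packaged route (assumes the contrib and non-free components are enabled in APT; package names differ if you use NVIDIA's own repo):

      apt update
      apt install pve-headers-$(uname -r)   # kernel headers for the DKMS build
      apt install nvidia-driver nvidia-smi  # Debian non-free driver packages
      reboot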

Host is now GPU-ready.


2. Create the Ollama LXC (privileged Ubuntu)

  1. In Proxmox UI → Create CT:

    • CT ID: 122
    • Hostname: ollama
    • Template: Ubuntu 22.04/24.04.
    • Uncheck “Unprivileged container” (must be privileged).
    • Disk: 200 GB (as you have).
    • CPU: 8 cores.
    • RAM: 8 GB (or more if you like).
    • Network: vmbr0, static IP 172.16.0.6/24, VLAN tag 90, GW 172.16.0.1 (your current settings).
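
    • Or create it from the host shell instead (a sketch; the template filename is illustrative, use whatever is in your template storage):

      pct create 122 local:vztmpl/ubuntu-24.04-standard_24.04-2_amd64.tar.zst \
        --hostname ollama --ostype ubuntu \
        --cores 8 --memory 8192 --swap 8192 \
        --rootfs local-lvm:200 \
        --net0 name=eth0,bridge=vmbr0,ip=172.16.0.6/24,gw=172.16.0.1,tag=90,firewall=1 \
        --unprivileged 0
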
  2. Before starting it, edit config on host:

    nano /etc/pve/lxc/122.conf
    

    Make it look like (adapting your net/rootfs values):

    arch: amd64
    cores: 8
    hostname: ollama
    memory: 8192
    net0: name=eth0,bridge=vmbr0,firewall=1,gw=172.16.0.1,hwaddr=BC:24:11:C9:1B:EE,ip=172.16.0.6/24,tag=90
    onboot: 0
    ostype: ubuntu
    rootfs: local-lvm:vm-122-disk-0,size=200G
    startup: order=2
    swap: 8192
    tags:
    unprivileged: 0
    
    # GPU / NVIDIA
    lxc.cgroup2.devices.allow: c 195:* rwm
    lxc.cgroup2.devices.allow: c 238:* rwm
    lxc.cgroup2.devices.allow: c 235:* rwm
    
    lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia1 dev/nvidia1 none bind,optional,create=file
    lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-caps dev/nvidia-caps none bind,optional,create=dir
    
    lxc.environment: NVIDIA_VISIBLE_DEVICES=all
    lxc.environment: NVIDIA_DRIVER_CAPABILITIES=compute,utility
    lxc.environment: CUDA_DEVICE_ORDER=PCI_BUS_ID
    

    This gives the CT full access to both GPUs and the driver device nodes, following current Proxmox+NVIDIA LXC recommendations. Note that 195 is the fixed major number for the nvidia devices, but the nvidia-uvm majors (238/235 here) are assigned dynamically; verify yours with ls -l /dev/nvidia* on the host and adjust the allow lines to match. forum.proxmox

  3. Start the CT:

    pct start 122
    

3. Verify GPUs inside the LXC

Inside CT 122 console:

ls -l /dev/nvidia*
nvidia-smi

You should see:

  • nvidia0, nvidia1, nvidiactl, nvidia-uvm, nvidia-modeset, nvidia-caps.
  • nvidia-smi showing both RTX 3060s, identical to host output. youtube

If that's true, passthrough is correct.
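
A quick per-card sanity check (standard nvidia-smi query flags):

nvidia-smi --query-gpu=index,name,memory.total --format=csv

Both RTX 3060s should be listed, each with roughly 12 GiB.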


4. Install Ollama directly in the LXC

  1. Inside CT (Ubuntu):

    curl -fsSL https://ollama.com/install.sh | sh
    

    Or follow the Ollama Linux docs for a manual install (see the sketch below). docs.ollama
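
    A manual-install sketch (the tarball URL follows the pattern from Ollama's Linux docs; check there for the current one):

    curl -L https://ollama.com/download/ollama-linux-amd64.tgz | sudo tar -C /usr -xz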

  2. Verify:

    ollama --version
    

    The installer also sets up a systemd service, ollama.service, that runs ollama serve. docs.ollama


5. Configure systemd so Ollama uses both GPUs and listens on all IPs

  1. Create/modify systemd override:

    sudo systemctl edit ollama.service
    

    Put:

    [Service]
    Environment=NVIDIA_VISIBLE_DEVICES=0,1
    Environment=OLLAMA_FLASH_ATTENTION=1
    Environment=OLLAMA_GPU_OVERHEAD=1024
    Environment=OLLAMA_HOST=0.0.0.0:11434
    Environment=OLLAMA_ORIGINS=*
    Restart=always
    RestartSec=10
    

    This follows the upstream-recommended pattern (environment variables live in the override; the main unit just runs ollama serve). github

  2. Apply and restart:

    sudo systemctl daemon-reload
    sudo systemctl restart ollama
    sudo systemctl status ollama
    

    Ensure it's active (running).

  3. Confirm it's listening:

    ss -lntp | grep 11434 || netstat -lntp | grep 11434
    

    You should see 0.0.0.0:11434 bound by the ollama process. thinkinprompt
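
    As a further check, the HTTP API should answer from another host on the VLAN (both endpoints are part of Ollama's standard API):

    curl http://172.16.0.6:11434/api/version
    curl http://172.16.0.6:11434/api/tags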


6. Confirm both GPUs are actually used

  1. In CT 122, terminal A:

    nvidia-smi -l 1
    
  2. In terminal B:

    ollama pull llama3.1:8b
    ollama run llama3.1:8b
    
  3. Watch nvidia-smi during generation: you should see processes on GPU 0 and GPU 1 once a model spills past a single card or multiple requests run in parallel. An 8B Q4 model fits on one 12 GB card, so it may load onto a single GPU; with big models (e.g. a 70B Q4 quant), utilization on both cards is much clearer. perplexity
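
A more compact way to watch both cards (standard nvidia-smi query flags):

watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader'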


7. Connect Open WebUI (in its own CT/VM)

In your existing openweb-ui container/VM:

  • If using the Docker CUDA image (recommended):

    docker run -d \
      --name open-webui \
      -p 3000:8080 \
      --gpus all \
      --security-opt apparmor=unconfined \
      -e OLLAMA_BASE_URL=http://172.16.0.6:11434 \
      -v open-webui:/app/backend/data \
      --restart always \
      ghcr.io/open-webui/open-webui:cuda
    

    This uses the GPU on the host where Open WebUI runs and calls Ollama over the network at 172.16.0.6:11434. perplexity
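
  • Equivalent Compose sketch (same image and settings; the GPU reservation follows the Compose spec):

    services:
      open-webui:
        image: ghcr.io/open-webui/open-webui:cuda
        ports:
          - "3000:8080"
        environment:
          - OLLAMA_BASE_URL=http://172.16.0.6:11434
        volumes:
          - open-webui:/app/backend/data
        security_opt:
          - apparmor=unconfined
        restart: always
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
    volumes:
      open-webui: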

Test by opening Open WebUI in the browser, selecting an Ollama model, and sending a message while watching nvidia-smi in CT 122.


8. Quick checklist for rebuild

  • Host:
    • IOMMU enabled (dmesg | grep -e DMAR -e IOMMU shows VT-d active). forum.proxmox
    • NVIDIA driver + CUDA OK (nvidia-smi shows both 3060s). reddit
  • LXC 122:
    • Privileged (unprivileged: 0).
    • /etc/pve/lxc/122.conf has lxc.cgroup2.devices.allow for 195, 238, 235 and mount entries for all /dev/nvidia* and /dev/nvidia-caps. perplexity
    • Inside CT, ls -l /dev/nvidia* + nvidia-smi show both GPUs. perplexity
  • Ollama:
    • Installed directly in CT.
    • systemctl edit ollama.service with env vars for both GPUs + host/origins.
    • systemctl status ollama → running, ollama serve bound to 0.0.0.0:11434. ollama.readthedocs
  • Open WebUI:
    • Points OLLAMA_BASE_URL to http://172.16.0.6:11434. perplexity

If you want, I can turn this into a bash script plus a couple of template files so the next rebuild is mostly copy-paste.
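
As a head start, a minimal host-side sanity check might look like this (a sketch; assumes CT ID 122 and the IP above):

#!/usr/bin/env bash
# Quick rebuild sanity check -- run on the PVE host.
set -euo pipefail

CTID=122

echo "== IOMMU =="
dmesg | grep -e DMAR -e IOMMU | head -n 5

echo "== Host GPUs =="
nvidia-smi -L
ls -l /dev/nvidia*

echo "== GPUs inside CT ${CTID} =="
pct exec "${CTID}" -- nvidia-smi -L

echo "== Ollama API =="
curl -fsS http://172.16.0.6:11434/api/version && echo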