Running a private LLM on a CloudFerro virtual machine

Introduction

Running a Large Language Model (LLM) on your own virtual machine with a high-performance GPU offers several advantages:

  • Privacy and Security: You maintain control over your data, reducing the risk of exposure associated with third-party platforms.
  • Performance Optimization: You can optimize and configure your environment specifically for your workload, potentially achieving lower latency and faster processing times.
  • Customization: You have the flexibility to adjust system resources and settings to tailor the performance and capabilities of the LLM to your specific needs.
  • Cost Efficiency: By controlling the computing resources, you can manage costs more effectively, especially if you have fluctuating demands or take advantage of SPOT instances. Additionally, a VM with an LLM shared through an API between team members removes the need to equip each of them with a local GPU capable of running an LLM.
  • Scalability: You can scale your resources up or down based on demand, allowing you to handle varying workloads efficiently.
  • Reduced Dependency: Operating your own LLM reduces reliance on third-party infrastructure (in this case you would depend only on an independent cloud provider operating in Europe under EU law), giving you greater independence and control over its operation and maintenance.
  • Access to Advanced Features: A cloud operator can provide high-performance GPUs that are difficult for smaller companies to purchase, so you can test and leverage advanced features and capabilities of LLMs that require significant computational power.
  • Continuous Availability: You achieve high availability and reliability, as the virtual machine can be configured to meet uptime requirements without interruptions often associated with shared platforms.

What will you learn from this document?

  • You will learn how to run a private Large Language Model (LLM) on a CloudFerro virtual machine using the self-hosted Ollama platform.
  • You will start by creating a VM on the CREODIAS platform, selecting the appropriate GPU and AI-related options.
  • Once you set up SSH access, you will verify the GPU visibility to ensure the NVIDIA drivers load correctly.
  • You will then proceed with the Ollama installation and verify its ability to recognize the GPU.
  • Furthermore, you will be guided on downloading and testing small LLM models from the Ollama Library.
  • Next, you will get details on advanced configurations, including how to expose the Ollama API for network access and how to set up a reverse proxy with SSL certificates and Basic Authentication for added security.
  • Additionally, you will address potential security considerations when you expose the API, either within a cloud tenant or publicly.

Manual procedure

VM creation

To create the VM, please follow this document:  

How to create a new Linux VM in OpenStack Dashboard Horizon on CREODIAS

Please note that when performing the two steps below, you must choose the GPU and AI-related options.

1. When selecting a source image, please use one of the *_NVIDIA_AI images (two Ubuntu and one CentOS are available).

2. An instance must be created with one of the following flavors:  

  (as available at the end of March 2025)

  • WAW3-1
    • vm.a6000.1 (1/8 of shared A6000 card)
    • vm.a6000.2 (1/4 of shared A6000 card)
    • vm.a6000.4 (1/2 of shared A6000 card)
    • vm.a6000.8 (full shared A6000 card)
  • WAW3-2
    Standard instances
    • gpu.h100 (one H100 card available)
    • gpu.h100x2 (two H100 cards available)
    • gpu.h100x4 (four H100 cards available)
    • gpu.l40sx2 (two L40s cards available)
    • gpu.l40sx8 (eight L40s cards available)
    • vm.l40s.1 (1/8 of shared L40s card)
    • vm.l40s.2 (1/4 of shared L40s card)
    • vm.l40s.4 (1/2 of shared L40s card)
    • vm.l40s.8 (full shared L40s card)

    Spot instances
    • spot.vm.l40s.1 (1/8 of shared L40s card)
    • spot.vm.l40s.2 (1/4 of shared L40s card)
    • spot.vm.l40s.4 (1/2 of shared L40s card)
    • spot.vm.l40s.8 (full shared L40s card)
  • FRA1-2
    • vm.l40s.2 (1/4 of shared L40s card)
    • vm.l40s.8 (full shared L40s card)
  • WAW4-1
    • New GPU flavors for H100 and L40s NVIDIA GPUs will be available soon (~ end of April 2025).

This tutorial was prepared using the "vm.a6000.8" flavor and the "Ubuntu 22.04 NVIDIA_AI" image in the WAW3-1 region.

Accessing VM with SSH

To configure the newly created instance, we will access it using SSH.

Depending on the operating system you use on your local computer, choose one of the documents below:

GPU check

The first step is to check whether the GPU is visible to the system and whether the NVIDIA drivers are loaded properly.

You should be able to run the command:

nvidia-smi

You should get output similar to the following:

Fri Mar 21 17:28:32 2025 
+---------------------------------------------------------------------------------------+ 
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     | 
|-----------------------------------------+----------------------+----------------------+ 
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. | 
|                                         |                      |               MIG M. | 
|=========================================+======================+======================| 
|   0  NVIDIA RTXA6000-48Q            On  | 00000000:00:05.0 Off |                    0 | 
| N/A   N/A    P8              N/A /  N/A |      0MiB / 49152MiB |      0%      Default | 
|                                         |                      |             Disabled | 
+-----------------------------------------+----------------------+----------------------+ 
 
+---------------------------------------------------------------------------------------+ 
| Processes:                                                                            | 
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory | 
|        ID   ID                                                             Usage      | 
|=======================================================================================| 
|  No running processes found                                                           | 
+---------------------------------------------------------------------------------------+ 

Please note that GPU memory usage is 0 MiB out of the amount available for the selected flavor, because the GPU is not in use yet.

Ollama installation

According to the official instructions on the [Ollama download page for Linux](https://ollama.com/download/linux), it is enough to run a single installation script:

curl -fsSL https://ollama.com/install.sh | sh

You should see the following output, with the last message confirming that Ollama detected the NVIDIA GPU.

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

Please note that this installation script not only downloads and installs the packages, but also starts the Ollama web service locally.
If you execute the command:

systemctl status ollama

you will get this output:

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2025-03-21 19:35:50 UTC; 2 days ago
   Main PID: 110150 (ollama)
      Tasks: 22 (limit: 135297)
     Memory: 1.7G
        CPU: 33.690s
     CGroup: /system.slice/ollama.service
             └─110150 /usr/local/bin/ollama serve

Mar 21 20:57:45 llm-tests ollama[110150]: llama_init_from_model: graph splits = 2
Mar 21 20:57:45 llm-tests ollama[110150]: key clip.use_silu not found in file
Mar 21 20:57:45 llm-tests ollama[110150]: key clip.vision.image_grid_pinpoints not found in file
Mar 21 20:57:45 llm-tests ollama[110150]: key clip.vision.feature_layer not found in file
Mar 21 20:57:45 llm-tests ollama[110150]: key clip.vision.mm_patch_merge_type not found in file
Mar 21 20:57:45 llm-tests ollama[110150]: key clip.vision.image_crop_resolution not found in file
Mar 21 20:57:45 llm-tests ollama[110150]: time=2025-03-21T20:57:45.432Z level=INFO source=server.go:619 msg="llama runner started in 1.01 seconds"
Mar 21 20:57:46 llm-tests ollama[110150]: [GIN] 2025/03/21 - 20:57:46 | 200 |  2.032983756s |       127.0.0.1 | POST     "/api/generate"
Mar 23 19:36:29 llm-tests ollama[110150]: [GIN] 2025/03/23 - 19:36:29 | 200 |       59.41µs |       127.0.0.1 | HEAD     "/"
Mar 23 19:36:29 llm-tests ollama[110150]: [GIN] 2025/03/23 - 19:36:29 | 200 |     538.938µs |       127.0.0.1 | GET      "/api/tags"

To test the Ollama installation, we will download two small models from the Ollama Library.

ollama pull llama3.2:1b
ollama pull moondream

Each of them should give a similar output:

pulling manifest
pulling 74701a8c35f6... 100% ▕█████████████████████████████████████▏ 1.3 GB
pulling 966de95ca8a6... 100% ▕█████████████████████████████████████▏ 1.4 KB
pulling fcc5a6bec9da... 100% ▕█████████████████████████████████████▏ 7.7 KB
pulling a70ff7e570d9... 100% ▕█████████████████████████████████████▏ 6.0 KB
pulling 4f659a1e86d7... 100% ▕█████████████████████████████████████▏  485 B
verifying sha256 digest
writing manifest
success

Verify that they are visible:

ollama list

You should see them on the list:

NAME                ID              SIZE      MODIFIED
moondream:latest    55fc3abd3867    1.7 GB    47 hours ago
llama3.2:1b         baf6a787fdff    1.3 GB    2 days ago

Please test them by executing one or both of the commands below.
Remember that to exit the chat, you need to use the /bye command.

ollama run moondream

Or

ollama run llama3.2:1b
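
You can also pass a prompt directly on the command line to get a one-shot answer without entering the interactive chat, for example:

ollama run llama3.2:1b "Why is the sky blue?"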

Now, execute the command again:

nvidia-smi

You should see output similar to this:

Fri Mar 21 20:58:40 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTXA6000-48Q            On  | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P8              N/A /  N/A |   6497MiB / 49152MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1514      C   /usr/local/bin/ollama                      4099MiB |
|    0   N/A  N/A      1568      C   /usr/local/bin/ollama                      2395MiB |
+---------------------------------------------------------------------------------------+

It shows the Ollama processes on the list, with total memory consumption being the sum of the loaded models' sizes.

As mentioned before, the Linux service with the Ollama API server should already be running in the background.
You may test it with the following Curl request:

curl http://localhost:11434/api/generate -d '{
  "model": "moondream",
  "prompt": "Why milk is white?"
}'

You will receive a stream of JSON response messages containing the model's answer:

{"model":"moondream","created_at":"2025-03-23T19:50:31.694190903Z","response":"\n","done":false}
{"model":"moondream","created_at":"2025-03-23T19:50:31.701052938Z","response":"Mil","done":false}
{"model":"moondream","created_at":"2025-03-23T19:50:31.704855264Z","response":"k","done":false}
{"model":"moondream","created_at":"2025-03-23T19:50:31.70867345Z","response":" is","done":false}
{"model":"moondream","created_at":"2025-03-23T19:50:31.712496186Z","response":" white","done":false}
{"model":"moondream","created_at":"2025-03-23T19:50:31.716349912Z","response":" because","done":false}
...
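
The response is streamed token by token. If you prefer to receive a single JSON object instead, the Ollama API also accepts a "stream": false field in the request body, for example:

curl http://localhost:11434/api/generate -d '{
  "model": "moondream",
  "prompt": "Why milk is white?",
  "stream": false
}'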

Bigger size models

So far, to keep this tutorial fluent, we have used small models of about 1 GB in size.
If you have a GPU with more memory, you may run tests using a bigger model. Let's try Llama 3.3, which is 42 GB in size.
When you type the name of a model into the search box of the Ollama Library, you get a list of models with this text in their names. Copy the model tag and use it locally.

You may download the model and then run it with a single command:

ollama run llama3.3:latest

Or only download the model for further usage:

ollama pull llama3.3:latest

Tag "llama3.3:latest" should be also used in Curl query when communicating with API.

Additional setup if necessary

If you execute the command:

ollama serve --help

You will see a list of environment variables that allow you to tune the configuration according to your requirements and the hardware used.
In the next section, we will set one of them.

Start ollama
Usage:
  ollama serve [flags]

Aliases:
  serve, start

Flags:
  -h, --help   help for serve

Environment Variables:
      OLLAMA_DEBUG               Show additional debug information (e.g. OLLAMA_DEBUG=1)
      OLLAMA_HOST                IP Address for the ollama server (default 127.0.0.1:11434)
      OLLAMA_KEEP_ALIVE          The duration that models stay loaded in memory (default "5m")
      OLLAMA_MAX_LOADED_MODELS   Maximum number of loaded models per GPU
      OLLAMA_MAX_QUEUE           Maximum number of queued requests
      OLLAMA_MODELS              The path to the models directory
      OLLAMA_NUM_PARALLEL        Maximum number of parallel requests
      OLLAMA_NOPRUNE             Do not prune model blobs on startup
      OLLAMA_ORIGINS             A comma separated list of allowed origins
      OLLAMA_SCHED_SPREAD        Always schedule model across all GPUs

      OLLAMA_FLASH_ATTENTION     Enabled flash attention
      OLLAMA_KV_CACHE_TYPE       Quantization type for the K/V cache (default: f16)
      OLLAMA_LLM_LIBRARY         Set LLM library to bypass autodetection
      OLLAMA_GPU_OVERHEAD        Reserve a portion of VRAM per GPU (bytes)
      OLLAMA_LOAD_TIMEOUT        How long to allow model loads to stall before giving up (default "5m")
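
As an illustration, these variables can also be set for the systemd service through a drop-in override, as an alternative to editing the unit file directly (which is what the next section does). A minimal sketch, assuming you want models to stay loaded for 30 minutes (the 30m value is only an example):

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf > /dev/null <<'EOF'
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama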

Exposing Ollama API for other hosts in network - Internal use

Edit the file with the Ollama service configuration (if necessary, replace vim with your editor of choice).

sudo vim /etc/systemd/system/ollama.service

By default, Ollama is exposed on localhost and port 11434, so it cannot be accessed from other hosts in the project. To change this default behavior, we add the following line to the [Service] section, setting Ollama to expose the API on all interfaces and on a lower-range port. For this article, we choose port 8765.

Environment="OLLAMA_HOST=0.0.0.0:8765"

The updated file would look like this:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0:8765"
Environment="PATH=/opt/miniconda3/condabin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"

[Install]
WantedBy=default.target

After this change, we have to reload the systemd configuration and restart the service.

sudo systemctl daemon-reload
sudo systemctl restart ollama.service

And check if it is running properly.

systemctl status ollama.service
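
You can also confirm locally that the API now answers on the new port, for example by listing the installed models with the /api/tags endpoint:

curl http://localhost:8765/api/tags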

We can now go to another VM in the same network and execute a similar Curl request, modified only by changing the IP address and the port:

curl http://LLM_TEST_VM_IP:8765/api/generate -d '{
  "model": "moondream",
  "prompt": "Why milk is white?"
}'

Important remark:

If we expose the API directly in this way on a different port, the ollama command will no longer work. The message will be:

Error: could not connect to ollama app, is it running?

This is because the command uses the same API and tries to access it on the default port 11434.
We have to execute the command:

export OLLAMA_HOST=0.0.0.0:8765

Or even add it to the ~/.bashrc file to make the change permanent.
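
For example:

echo 'export OLLAMA_HOST=0.0.0.0:8765' >> ~/.bashrc
source ~/.bashrc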

API security

You have to consider one important thing: the Ollama API is now exposed not only to a single network but also to all hosts in other networks in your project.
If this is not acceptable, you should consider some security settings.

Exposing Ollama API

In this case, we will leave the default API settings (localhost and port 11434). Instead, we add a reverse proxy that exposes the API on another port and, optionally, adds some authorization.

sudo apt install nginx
sudo apt install apache2-utils

Set the Basic Authentication password. You will be asked to type the password twice.

cd /etc
sudo htpasswd -c .htpasswd ollama
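
If you later want to add more users to the same file, omit the -c flag so that existing entries are preserved (the user name below is just a placeholder):

sudo htpasswd /etc/.htpasswd another_user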

Exposing in cloud tenant

A simple NGINX configuration with Basic Authentication, but HTTP only.
For internal usage only!
It is strongly discouraged to use this when exposing the API to the public Internet.

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
  worker_connections 768;
}
http {
  server {
    listen 8765;
    # Basic authentication setup
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/.htpasswd; # File containing usernames and hashed passwords
    location / {
      proxy_pass http://127.0.0.1:11434;
    }
  }
}
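
After saving the configuration (assuming it replaces /etc/nginx/nginx.conf, as the automated setup later in this article does), check the syntax and restart NGINX:

sudo nginx -t
sudo systemctl restart nginx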

Test Curl request:

curl -u "ollama:YOUR_PASSWORD" http://10.0.0.148:8765/api/generate -d '{
  "model": "llama3.3:latest",
  "prompt": "Who is Peter Watts?"
}'

Exposing API with encryption

Assign a public IP to your machine with Ollama using this guide:
How to Add or Remove Floating IP’s to your VM on CREODIAS

Obtain an SSL certificate for this IP or domain name and put it into two files on the VM:

  • /etc/ssl/certs/YOUR_CERT_NAME.crt
  • /etc/ssl/private/YOUR_CERT_NAME.key

Alternatively, generate a self-signed certificate.
It is sufficient for personal or small-team usage, but not if you want to expose the API to customers or business partners.

sudo openssl req -x509 -nodes -days 365 -newkey rsa:4096 -keyout /etc/ssl/private/YOUR_CERT_NAME.key -out /etc/ssl/certs/YOUR_CERT_NAME.crt -subj "/C=PL/ST=Mazowieckie/L=Warsaw/O=CloudFerro/OU=Tech/CN=OllamaTest"

A simple NGINX configuration with Basic Authentication and HTTPS:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
        worker_connections 768;
        # multi_accept on;
}
http {
  server {
    listen 8765 ssl;
    server_name testing-ollama;
    # Path to SSL certificates
    ssl_certificate /etc/ssl/certs/YOUR_CERT_NAME.crt;
    ssl_certificate_key /etc/ssl/private/YOUR_CERT_NAME.key;
    # Basic authentication setup
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/.htpasswd; # File containing usernames and hashed passwords
    location / {
      proxy_pass http://127.0.0.1:11434;
    }
  }
}

A test Curl request, accepting the self-signed certificate with the -k option:

curl -k -u "ollama:YOUR_PASSWORD" https://YOUR_IP_OR_DOMAIN:8765/api/generate -d '{
  "model": "llama3.3:latest",
  "prompt": "Who is Peter Watts?"
}'

Automated workflow with Terraform

Prerequisites / Preparation

Before you start, please read the documents:

Step 1 - Select or Create a Project

You may use the default project in your tenant (usually named "cloud_aaaaa_bb") or create a new one by following this document: https://creodias.docs.cloudferro.com/en/latest/openstackcli/How-To-Create-and-Configure-New-Project-on-Creodias-Cloud.html

Step 2 - Install Terraform

There are various ways to install Terraform, some of them are described in the documentation mentioned in the "Preparation" chapter.

If you are using Ubuntu 22.04 LTS or newer and you do not need the latest Terraform release (for the Terraform OpenStack provider, it is not necessary), the easiest way is to use Snap.

First, install Snap:

sudo apt install snapd

Then install Terraform:

sudo snap install terraform --classic
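
You can verify the installation with:

terraform version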

Step 3 - Allowing Access to Project from Terraform

Now create Application Credentials.
Please follow the document: "How to Generate or Use Application Credentials via CLI on CREODIAS": https://creodias.docs.cloudferro.com/en/latest/cloud/How-to-generate-or-use-Application-Credentials-via-CLI-on-Creodias.html

When you have them ready, save them in a secure location (e.g., a password manager) and fill in the variables in the "llm_vm.tfvars" file.

Step 4 - Prepare Configuration Files

As Terraform operates on the entire directory and automatically merges all "*.tf" files into one codebase, we may split our Terraform code into a few files to manage the code more easily.

  • main.tf
  • variables.tf
  • resources.tf
  • locals.tf

Additionally, we need three other files:

  • llm_vm_user_data.yaml
  • llm_api_nginx.conf
  • llm_vm.tfvars

File 1 - main.tf

In this file, we keep the main definitions for Terraform and the OpenStack provider.

terraform {
  required_version = ">= 0.14.0"
  required_providers {
    openstack = {
      source  = "terraform-provider-openstack/openstack"
      version = "~> 1.51.1"
    }
  }
}

provider "openstack" {
  auth_url    = var.auth_url
  region      = var.region
  user_name =  "${var.os_user_name}"
  application_credential_id = "${var.os_application_credential_id}"
  application_credential_secret = "${var.os_application_credential_secret}"
}

File 2 - variables.tf

In this file, we will keep variable definitions.

variable os_user_name {
  type = string
}

variable tenant_project_name {
  type = string
}

variable os_application_credential_id {
  type = string
}

variable os_application_credential_secret {
  type = string
}

variable auth_url {
  type = string
  default = "https://keystone.cloudferro.com:5000"
}

variable region {
  type = string
  validation {
    condition = contains(["WAW3-1", "WAW3-2", "FRA1", "FRA1-2", "WAW4-1"], var.region)
    error_message = "Proper region names are: WAW3-1, WAW3-2, FRA1, FRA1-2, WAW4-1"
  }
}

#Our friendly name for entire environment.
variable env_id {
  type = string
}

# Key-pair created in previous steps 
variable env_keypair {
  type = string
}


variable internal_network {
  type = string
  default = "192.168.12.0"
  validation {
    condition = can(regex("^(10\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])|192\\.168\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]))$", var.internal_network))
    error_message = "Provide proper network address for class 10.a.b.c or 192.168.a.b"
  }
}

variable internal_netmask {
  type = string
  default = "/24"
  validation {
    condition = can(regex("^\\/(1[6-9]|2[0-4])$", var.internal_netmask))
    error_message = "Please use mask size from /16 to /24."
  }
}

variable external_network {
  type = string
  default = "10.8.0.0"
  validation {
    condition = can(regex("^(10\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])|192\\.168\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]))$", var.external_network))
    error_message = "Provide proper network address for class 10.a.b.c or 192.168.a.b"
  }
}

variable llm_image {
  type = string
  default = "Ubuntu 22.04 NVIDIA_AI"
}

variable llm_flavor {
  type = string
}

variable llm_api_port {
  type = number
  default = 8765
}

variable llm_tag {
  type = string
}

variable cert_data {
  type = string
  default = "/C=colar_system/ST=earth/L=europe/O=good_people/OU=smart_people/CN=OllamaTest"
}

File 3 - resources.tf

This is the most significant file where definitions of all entities and resources are stored.

resource "random_password" "ollama_api_pass" {
  length           = 24
  special          = true
  min_upper        = 8
  min_lower        = 8
  min_numeric      = 6
  min_special      = 2
  override_special = "-"
  keepers = {
    tenant = var.tenant_project_name
  }
}

output "ollama_api_pass_output" {
  value = random_password.ollama_api_pass.result
  sensitive = true
}

data "openstack_networking_network_v2" "external_network" {
  name = "external"
}

resource "openstack_networking_router_v2" "external_router" {
  name = "${var.env_id}-router"
  external_network_id = data.openstack_networking_network_v2.external_network.id
}

resource "openstack_networking_network_v2" "env_net" {
  name = "${var.env_id}-net"
}

resource "openstack_networking_subnet_v2" "env_net_subnet" {
  name            = "${var.env_id}-net-subnet"
  network_id      = openstack_networking_network_v2.env_net.id
  cidr            = "${var.internal_network}${var.internal_netmask}"
  gateway_ip      = cidrhost("${var.internal_network}${var.internal_netmask}", 1)
  ip_version      = 4
  enable_dhcp     = true
}

resource "openstack_networking_router_interface_v2" "router_interface_external" {
  router_id = openstack_networking_router_v2.external_router.id
  subnet_id = openstack_networking_subnet_v2.env_net_subnet.id
}

resource "openstack_networking_floatingip_v2" "llm_public_ip" {
  pool = "external"
}

resource "openstack_networking_secgroup_v2" "sg_llm_api" {
  name        = "${var.env_id}-sg-llm-api"
  description = "Ollama API"
}

resource "openstack_networking_secgroup_rule_v2" "sg_llm_api_rule_1" {
  direction         = "ingress"
  ethertype         = "IPv4"
  protocol          = "tcp"
  port_range_min    = var.llm_api_port
  port_range_max    = var.llm_api_port
  remote_ip_prefix  = "0.0.0.0/0"
  security_group_id = openstack_networking_secgroup_v2.sg_llm_api.id
}

resource "openstack_compute_instance_v2" "llm_server" {
  name              = "${var.env_id}-server"
  image_name        = var.llm_image
  flavor_name       = var.llm_flavor
  security_groups   = [
    "default",
    "allow_ping_ssh_icmp_rdp",
    openstack_networking_secgroup_v2.sg_llm_api.name
    ]
  key_pair          = var.env_keypair
  depends_on        = [
    openstack_networking_subnet_v2.env_net_subnet
    ]
  user_data = local.llm_vm_user_data
  network {
    uuid = openstack_networking_network_v2.env_net.id
    fixed_ip_v4 = cidrhost("${var.internal_network}${var.internal_netmask}", 3)
  }
}

resource "openstack_compute_floatingip_associate_v2" "llm_ip_associate" {
  floating_ip = openstack_networking_floatingip_v2.llm_public_ip.address
  instance_id = openstack_compute_instance_v2.llm_server.id
}

File 4 - locals.tf

In this file, we keep all values computed from other input data (variables, templates, etc.).

locals {
  nginx_config = "${templatefile("./llm_api_nginx.conf",
    {
      ollama_api_port = "${var.llm_api_port}"
    }
  )}"
  llm_vm_user_data = "${templatefile("./llm_vm_user_data.yaml",
    {
      llm_tag = "${var.llm_tag}"
      cert_data = "${var.cert_data}"
      ollama_api_pass = "${random_password.ollama_api_pass.result}"
      nginx_config_content = "${indent(6, local.nginx_config)}"
    }
  )}"
}
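
The llm_api_nginx.conf template referenced above is not listed as a separate file in this article. A minimal sketch of what it could contain, based on the HTTPS proxy configuration from the manual section, with the listen port injected as ${ollama_api_port} and the certificate paths matching those generated by the user-data script below:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
  worker_connections 768;
}
http {
  server {
    listen ${ollama_api_port} ssl;
    # Self-signed certificate created by the user-data script
    ssl_certificate /etc/ssl/certs/ollama_api.crt;
    ssl_certificate_key /etc/ssl/private/ollama_api.key;
    # Basic authentication setup
    auth_basic "Restricted Area";
    auth_basic_user_file /etc/.htpasswd;
    location / {
      proxy_pass http://127.0.0.1:11434;
    }
  }
}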

File 5 - llm_vm_user_data.yaml

This is a template of the user data that will be injected into our instance hosting Ollama.

#cloud-config
package_update: true
package_upgrade: true
packages:
  - vim
  - openssh-server
  - nginx
  - apache2-utils
write_files:
  - path: /etc/nginx/nginx.conf
    permissions: '0700'
    content: |
      ${nginx_config_content}
  - path: /run/scripts/prepare_llm_vm
    permissions: '0700'
    defer: true
    content: |
      #!/bin/bash
      curl -fsSL https://ollama.com/install.sh | sh 
      sleep 5s
      systemctl enable ollama.service
      systemctl start ollama.service
      sleep 5s
      export HOME=/root
      ollama pull ${llm_tag}
      sudo openssl req -x509 -nodes -days 365 -newkey rsa:4096 -keyout /etc/ssl/private/ollama_api.key -out /etc/ssl/certs/ollama_api.crt -subj "${cert_data}"
      sudo htpasswd -b -c /etc/.htpasswd ollama ${ollama_api_pass}
      systemctl enable nginx
      systemctl start nginx
      echo 'Ollama ready!' > /var/log/ollama_ready.log
runcmd:
  - ["/bin/bash", "/run/scripts/prepare_llm_vm"]

File 6 - llm_vm.tfvars

In this file, we will provide values for Terraform variables:

  • os_user_name - Enter your username used to authenticate in CREODIAS here.
  • tenant_project_name - Name of the project selected or created in step 1.
  • os_application_credential_id
  • os_application_credential_secret
  • region - CloudFerro Cloud region name. Allowed values are: WAW3-1, WAW3-2, FRA1-2, WAW4-1.
  • env_id - Name that will prefix all resources created in OpenStack.
  • env_keypair - Keypair available in OpenStack. You will use it to log in to the LLM machine via SSH if necessary, for example to use a model directly with the ollama run MODEL_TAG command.
  • internal_network - Network class for our environment. Any of 10.a.b.c or 192.168.b.c.
  • internal_netmask - Network mask. Allowed values: /24, /16.
  • llm_flavor - VM flavor for our Ollama host.
  • llm_image - Operating system image to be deployed on our instance.
  • llm_tag - Tag from the Ollama Library of the model that we want to download automatically during provisioning.
  • cert_data - Values for the self-signed certificate.

Some of the included data, such as credentials, are sensitive. So if you save this in a Git repository, it is strongly recommended to add the file pattern "*.tfvars" to ".gitignore".

You may also add to this file the variable "external_network".

Do not forget to fill in or update the variable values in the content below.

os_user_name = "user@domain"
tenant_project_name = "cloud_aaaaa_b"
os_application_credential_id = "enter_ac_id_here"
os_application_credential_secret = "enter_ac_secret_here"
region = ""
env_id = ""
env_keypair = ""
internal_network = "192.168.1.0"
internal_netmask = "/24"
llm_flavor = "vm.a6000.8"
llm_image = "Ubuntu 22.04 NVIDIA_AI"
llm_tag="llama3.2:1b"
cert_data = "/C=PL/ST=Mazowieckie/L=Warsaw/O=CloudFerro/OU=Tech/CN=OllamaTest"

Step 5 - Activate Terraform Workspace

A very useful Terraform functionality is workspaces. Using workspaces, you may manage multiple environments with the same code.

Create and enter a directory for our project by executing commands:

mkdir tf_llm
cd tf_llm

To initialize Terraform, execute:

terraform init

Then, check workspaces:

terraform workspace list

As the output of the command above, you should see something like this:

* default

As we want to use a dedicated workspace for our environment, we must create it. To do this, please execute the command:

terraform workspace new llm_vm

Terraform will create a new workspace and switch to it.

Step 6 - Validate Configuration

To ensure the prepared configuration is valid, do two things.

First, execute the command:

terraform validate

Then execute Terraform plan:

terraform plan -var-file=llm_vm.tfvars

As output, you should get a list of messages describing the resources that would be created.

Step 7 - Provisioning of Resources

To provision all the resources, execute the command:

terraform apply -var-file=llm_vm.tfvars

As with the plan command, you should get a list of messages describing the resources that will be created, this time ending with a question asking whether you want to apply the changes.
You must answer with the full word "yes".

You will see a sequence of messages about the status of provisioning.

Please remember that when the above sequence finishes successfully, the Ollama host is still not ready!
A script configuring Ollama and downloading the selected model is still running on the instance.
The process may take several minutes.
We recommend waiting about 5 minutes.
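
Because the user-data script writes a marker file when it finishes, you can check readiness over SSH (using your key pair and the public IP obtained in the next step); if the file does not exist yet, provisioning is still in progress:

ssh -i ENV_KEY_PAIR eouser@LLM_VM_PUBLIC_IP 'cat /var/log/ollama_ready.log'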

Step 8 - Obtaining the VM IP and the Basic Authentication password

To obtain a public IP address of the created instance, use the following command:

terraform state show openstack_networking_floatingip_v2.llm_public_ip

The public IP of the host will be shown in the "address" field.

The Basic Authentication password may be displayed with the command:

terraform output -json

The password text will be under the "value" key.
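
Alternatively, on recent Terraform versions you can print just the password value (the output is marked as sensitive, so it is masked in the plain output listing):

terraform output -raw ollama_api_pass_output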

Step 9 - Testing

You may use the LLM directly after accessing the created instance with SSH.

ssh -i ENV_KEY_PAIR eouser@LLM_VM_PUBLIC_IP

Then:

ollama run llama3.2:1b

If the instance is accessed from an application via the API, the API test may be done using a Curl request similar to the previous one:

curl -k -u "ollama:GENERATED_PASSWORD" https://PUBLIC_IP:8765/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Who is Peter Watts?"
}'

Step 10 - Removing resources when they are not needed

As a GPU instance is relatively expensive, we may remove it completely when it is not needed. By executing the command below, you remove only the VM instance; the rest of the resources are not affected.

terraform destroy -var-file=llm_vm.tfvars -target=openstack_compute_instance_v2.llm_server

You may recreate it simply by running:

terraform apply -var-file=llm_vm.tfvars

Step 11 - Usage

That's all! You may use the created virtual machine with GPU and LLM of your choice.

Happy prompting with your own AI 🙂