Running an AI model on vLLM, Docker + NGC

This guide will help you deploy a modern large language model (LLM) for inference in the cloud using vLLM, a high-performance framework for text generation. To minimize setup time, we'll use a prebuilt image from NVIDIA NGC that already has the NVIDIA drivers, CUDA, and Docker installed, together with the official vLLM container image, which bundles PyTorch and vLLM.

Virtual Machine Setup

To get started, select the following image:
https://immers.cloud/image/view/?id=24611

It includes everything required to run an LLM via vLLM:

  • Ubuntu 24.04.
  • NVIDIA drivers.
  • CUDA 13.
  • NVIDIA Container Toolkit and Docker.

Choose a GPU based on your model’s requirements — focus on video memory (VRAM) capacity and support for formats like FP8 or INT4. This information is typically found in the model’s documentation or in the platform’s model catalog.
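As a rough rule of thumb (an estimate, not a guarantee), the weights alone need about "number of parameters × bytes per parameter" of VRAM: a 4B-parameter model in BF16 (2 bytes per parameter) takes roughly 8 GB, while the same model quantized to INT4 (about 0.5 bytes per parameter) takes roughly 2 GB. Leave extra headroom for the KV cache and activations, which grow with context length and the number of concurrent requests.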

Configure the remaining virtual machine parameters (CPU, RAM, disk type) according to the recommendations shown in the screenshot below.

Screenshot 1

Starting the Container After SSH Connection

After connecting to the virtual machine via SSH, set the required environment variables using the export command before launching the container. These variables are necessary for running vLLM inside the Docker container:

  • HF_TOKEN — required for accessing private or gated models on Hugging Face (e.g., Gemma-3).
  • HF_MODEL_PATH — the model identifier from Hugging Face Hub (also used in the launch command below).
  • GPU_AMOUNT — number of GPUs to use for tensor or pipeline parallelism.

These variables enable secure access to model weights and configure multi-GPU inference correctly.
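For example, to serve google/gemma-3-4b-it on a single GPU, the variables could be set like this (a minimal sketch; substitute your own token, model, and GPU count):

export HF_TOKEN=hf_xxxxxxxxxxxxxxxx          # your Hugging Face access token
export HF_MODEL_PATH=google/gemma-3-4b-it    # model identifier on Hugging Face Hub
export GPU_AMOUNT=1                          # number of GPUs to use for parallelism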

Screenshot 2

Launch command:

sudo docker run --gpus "all" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  -d \
  vllm/vllm-openai:v0.11.0 \
  --model $HF_MODEL_PATH \
  --trust-remote-code \
  --tensor-parallel-size $GPU_AMOUNT
  # or --pipeline-parallel-size $GPU_AMOUNT

Docker pulls the vLLM image, the container downloads the model weights from Hugging Face, and the vLLM server starts serving the model.

Screenshot 3

Check the running container:

sudo docker ps -a

Screenshot 4

After the Docker image has been downloaded and the container has started, verify that the vLLM server is running by checking the container logs (use the container ID shown in the output of docker ps -a):

sudo docker logs {ID of the container}

Screenshot 5
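To watch the startup in real time, you can follow the log output instead (same placeholder container ID); the server is ready once the logs show the OpenAI-compatible API listening on port 8000:

sudo docker logs -f {ID of the container}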

Endpoint availability check and testing

To list available models (the one you launched should appear in the response), run:

curl http://127.0.0.1:8000/v1/models

Screenshot 6

The model we launched, google/gemma-3-4b-it, appears in the list along with its parameters, including a maximum context length of 131,072 tokens.
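The raw JSON is easier to read if you pretty-print it, for example with Python's built-in formatter:

curl -s http://127.0.0.1:8000/v1/models | python3 -m json.tool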

IMPORTANT! To access the virtual machine (and the model) from a device OUTSIDE the local network, you must use the virtual machine's public network address. It is shown in the list of virtual machines:

Screenshot 7

In this case, the virtual machine’s address is 195.209.214.129. To make requests from outside the local network, construct the URL as follows:
http://{server-ip-address}:8000/v1

Simply replace the local address with the public one. For the previous command, it will look like this:

curl http://195.209.214.129:8000/v1/models

Let’s test inference by asking the model a simple question:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

Screenshot 8

The model responds with "This is a test" and offers to help with any further questions, which means everything is working as expected.
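The reply is a standard OpenAI-style chat completion object, so the generated text is in choices[0].message.content. If you only need the text itself, you can extract it on the command line; a minimal sketch using the same request:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"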

Best practices

If your virtual machine is exposed to the public network, always secure the vLLM endpoint with an access key. This will prevent unauthorized external requests to your model.

sudo docker run --gpus "all" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  -d \
  vllm/vllm-openai:v0.11.0 \
  --model $HF_MODEL_PATH \
  --trust-remote-code \
  --tensor-parallel-size $GPU_AMOUNT \
  --api-key your-secret-key

The key must then be passed in the Authorization header when interacting with the model:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

With this approach, only those who possess the key will have access to the model — keep it secret!
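For example, to check the protected endpoint from an external machine, combine the public address and the key shown above:

curl http://195.209.214.129:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"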

 

Updated: 09.02.2026