Running an AI model on vLLM, Docker + NGC

This guide will help you deploy a modern large language model (LLM) for inference in the cloud using vLLM, a high-performance framework for text generation. To minimize setup time, we'll use a prebuilt image from NVIDIA NGC that already has the NVIDIA drivers, CUDA, and Docker installed, together with the official vLLM container image, which bundles PyTorch and vLLM.

Virtual Machine Setup

To get started, select the following image:
https://immers.cloud/image/view/?id=24611

It includes everything required to run an LLM via vLLM:

  • Ubuntu 24.04.
  • NVIDIA drivers.
  • CUDA 13.
  • NVIDIA Container Toolkit and Docker.

Choose a GPU based on your model’s requirements — focus on video memory (VRAM) capacity and support for formats like FP8 or INT4. This information is typically found in the model’s documentation or in the platform’s model catalog.
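As a rough rule of thumb (an estimate, not a guarantee), the weights alone need about "number of parameters × bytes per parameter" of VRAM: a 4B-parameter model in BF16 (2 bytes per parameter) takes roughly 8 GB, while the same model quantized to INT4 (about 0.5 bytes per parameter) takes roughly 2 GB. Leave extra headroom for the KV cache and activations, which grow with context length and the number of concurrent requests.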

Configure the remaining virtual machine parameters (CPU, RAM, disk type) according to the recommendations shown in the screenshot below.

Screenshot 1

Starting the Container After SSH Connection

After connecting to the virtual machine via SSH, set the required environment variables using the export command before launching the container. These variables are necessary for running vLLM inside the Docker container:

  • HF_TOKEN — required for accessing private or gated models on Hugging Face (e.g., Gemma-3).
  • HF_MODEL_PATH — the model identifier from Hugging Face Hub (also used in the launch command below).
  • GPU_AMOUNT — number of GPUs to use for tensor or pipeline parallelism.

These variables enable secure access to model weights and configure multi-GPU inference correctly.
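For example, to serve google/gemma-3-4b-it on a single GPU, the variables could be set like this (a minimal sketch; substitute your own token, model, and GPU count):

export HF_TOKEN=hf_xxxxxxxxxxxxxxxx          # your Hugging Face access token
export HF_MODEL_PATH=google/gemma-3-4b-it    # model identifier on Hugging Face Hub
export GPU_AMOUNT=1                          # number of GPUs to use for parallelism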

Screenshot 2

Launch command:

sudo docker run --gpus "all" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  -d \
  vllm/vllm-openai:v0.11.0 \
  --model $HF_MODEL_PATH \
  --trust-remote-code \
  --tensor-parallel-size $GPU_AMOUNT
  # or --pipeline-parallel-size $GPU_AMOUNT

Docker pulls the vLLM image, the container downloads the model weights from Hugging Face, and the vLLM server starts serving the model.

Screenshot 3

Check the running container:

sudo docker ps -a

Screenshot 4

After the Docker image has been downloaded and the container has started, verify that the vLLM server is running by checking the container logs (use the container ID shown in the output of docker ps -a):

sudo docker logs {ID of the container}

Screenshot 5
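To watch the startup in real time, you can follow the log output instead (same placeholder container ID); the server is ready once the logs show the OpenAI-compatible API listening on port 8000:

sudo docker logs -f {ID of the container}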

Endpoint availability check and testing

To list available models (the one you launched should appear in the response), run:

curl http://127.0.0.1:8000/v1/models

Screenshot 6

The model we launched, google/gemma-3-4b-it, appears in the list along with its parameters, including a maximum context length of 131,072 tokens.
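The raw JSON is easier to read if you pretty-print it, for example with Python's built-in formatter:

curl -s http://127.0.0.1:8000/v1/models | python3 -m json.tool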

IMPORTANT! To access the virtual machine (and the model) from a device OUTSIDE the local network, you must use the virtual machine's public network address. It is shown in the list of virtual machines:

Screenshot 7

In this case, the virtual machine’s address is 195.209.214.129. To make requests from outside the local network, construct the URL as follows:
http://{server-ip-address}:8000/v1

Simply replace the local address with the public one. For the previous command, it will look like this:

curl http://195.209.214.129:8000/v1/models

Let’s test inference by asking the model a simple question:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

Screenshot 8

The model responds with "This is a test" and offers to help with any further questions, which means everything is working as expected.
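The reply is a standard OpenAI-style chat completion object, so the generated text is in choices[0].message.content. If you only need the text itself, you can extract it on the command line; a minimal sketch using the same request:

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}' \
  | python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"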

Best practices

If your virtual machine is exposed to the public network, always secure the vLLM endpoint with an access key. This will prevent unauthorized external requests to your model.

sudo docker run --gpus "all" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  -d \
  vllm/vllm-openai:v0.11.0 \
  --model $HF_MODEL_PATH \
  --trust-remote-code \
  --tensor-parallel-size $GPU_AMOUNT \
  --api-key your-secret-key

The key must then be passed in the Authorization header when interacting with the model:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{"model": "google/gemma-3-4b-it", "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"}
  ], "temperature": 0, "max_tokens": 150}'

With this approach, only those who possess the key will have access to the model — keep it secret!
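For example, to check the protected endpoint from an external machine, combine the public address and the key shown above:

curl http://195.209.214.129:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"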

 

Updated: 09.02.2026