How much video memory (VRAM) is needed to run a model?

Selecting the right server configuration is a complex task that requires deep domain expertise. The choice depends on the model's characteristics (architecture, parameter count, attention mechanism, etc.), the inference requirements (context length, parallelism, etc.), and the server's hardware specifications. We offer a wide range of GPU configurations suitable for neural network inference. To simplify your choice, we prepare recommendations for each model and verify the functionality and performance of these configurations daily.
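To illustrate why context length, parallelism, and the attention mechanism matter, here is a minimal sketch of the commonly used key-value (KV) cache estimate for a transformer decoder. It is a rough approximation, not the method behind our recommendations: the function name kv_cache_gib and the model shape (32 layers, head size 128) are hypothetical values chosen only for illustration, and real inference engines add their own overhead.

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, context_len,
                 concurrent_requests=1, bytes_per_value=2):
    """Rough KV-cache size in GiB (assumes 16-bit cache values by default)."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_len * concurrent_requests * bytes_per_value)
    return total_bytes / 1024**3


# Hypothetical model shape, for illustration only.
full = kv_cache_gib(num_layers=32, num_kv_heads=32, head_dim=128, context_len=8192)
gqa = kv_cache_gib(num_layers=32, num_kv_heads=8, head_dim=128, context_len=8192)

print(f"32 KV heads: {full:.1f} GiB")  # 32 KV heads: 4.0 GiB
print(f" 8 KV heads: {gqa:.1f} GiB")   #  8 KV heads: 1.0 GiB
```

In this estimate the cache grows linearly with both context length and the number of concurrent requests, and a model that shares key-value heads across query heads needs proportionally less memory for the same context.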

In recent years, a trend has emerged: the more parameters an AI model has, the higher its quality tends to be. However, large models require expensive servers with powerful GPUs, which puts significant pressure on budgets. To reduce costs, quantization is used: the precision of model weights and activations is lowered from the standard 32-bit (FP32) to smaller formats. The most common options are:

  • 4-bit (maximum resource savings, possible quality trade-offs);
  • 8-bit (optimal balance between model size and quality);
  • 16-bit (half precision, FP16 or BF16; typically no noticeable quality loss).

Thus, quantization helps reduce infrastructure costs and makes powerful AI models usable even on a limited budget. That's why we prepare recommendations for each bit width.
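As a rough illustration of how bit width affects memory, the sketch below estimates the space taken by the weights alone: parameter count multiplied by bits per weight, divided by 8 to get bytes. The function name weight_memory_gb and the 7-billion-parameter model are hypothetical. The result is a lower bound, since it ignores the KV cache, activations, framework overhead, and the extra scaling data that real quantized formats store.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model, for illustration only.
estimates = {bits: weight_memory_gb(7e9, bits) for bits in (16, 8, 4)}

for bits, gb in estimates.items():
    print(f"{bits:>2}-bit: {gb:.1f} GB")
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

Halving the bit width halves the weight memory, which is why the same model can fit on a noticeably smaller GPU configuration at 8-bit or 4-bit than at 16-bit.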

To use these recommendations, simply go to the page of the desired model in the model catalog and scroll down to the section "Recommended server configurations for hosting". Depending on the model type, two usage options are available:

  1. If the model is an LLM or visual LLM, clicking "Launch" will open the private endpoint creation page.
  2. For other models (image, video, or audio generation), clicking "Launch" will create a regular server.

Updated: 18.06.2026