GigaChat3-702B-A36B-preview

GigaChat 3 Ultra Preview is a massive MoE model with 702 billion total parameters, of which only 36 billion are activated to generate each token. A key feature is Multi-head Latent Attention (MLA), a mechanism that compresses the Key-Value (KV) cache into a compact latent vector. For users, this means the ability to work with massive contexts and long documents without a prohibitive growth in GPU memory requirements. The model is trained with a Multi-Token Prediction (MTP) objective: unlike classical decoders that predict only the next token, it can predict several tokens in a single forward pass. This provides a native speculative decoding mechanism, accelerating response generation by up to 40% without quality loss and making it one of the fastest models in the "heavy" LLM class.
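For intuition, the sketch below shows the core MLA trick with toy dimensions: cache a small per-token latent instead of full per-head keys and values, and re-expand it at attention time. This is an illustrative approximation, not GigaChat's actual implementation, and all dimensions here are made up:

```python
# Toy MLA sketch: the KV cache stores a d_latent vector per token instead
# of 2 * n_heads * d_head values, cutting cache memory ~16x at these sizes.
import torch
import torch.nn as nn

class ToyMLA(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)        # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V
        self.q = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x, latent_cache):
        # Cache only the compressed latent for the new token.
        latent_cache.append(self.down(x))                  # (B, 1, d_latent)
        c = torch.cat(latent_cache, dim=1)                 # (B, T, d_latent)
        B, T, _ = c.shape
        k = self.up_k(c).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.up_v(c).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q = self.q(x).view(B, 1, self.n_heads, self.d_head).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, 1, -1)

mla = ToyMLA()
cache = []
x = torch.randn(1, 1, 4096)   # one decoded token
y = mla(x, cache)             # cache grows by 512 floats per token,
                              # vs 2 * 32 * 128 = 8192 for a plain KV cache
```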

Unlike many Russian models that are fine-tunes of foreign counterparts, GigaChat 3 was trained from scratch on a unique dataset of over 20 trillion tokens. The dataset includes a vast amount of Russian-language sources and covers languages rarely featured in model training, from Chinese and Arabic to Uzbek and Kazakh. This gives the model a deep understanding of the Russian language and Russian cultural context that foreign LLMs generally lack. The model leads both Russian and international benchmarks: MERA (industrial) 0.824, HumanEval+ (code) 0.8659, GSM8K (math) 0.9598, among others. In quality, GigaChat 3 Ultra confidently outperforms GigaChat 2 Max on all key benchmarks.

The model is optimized for on-premise and private-cloud installations in large enterprises where data security is critical, including air-gapped environments without internet access. It is well suited to complex analytics over large document collections (RAG), automation of L2/L3 technical support that requires deep contextual understanding, coding assistants, and code generation within closed corporate repositories. The model supports popular inference engines (vLLM, SGLang, LMDeploy, TensorRT-LLM) and runs in BF16 and FP8 modes for optimal performance.
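As a quick orientation, here is a minimal offline-inference sketch with vLLM's Python API. The Hugging Face repository ID and the parallel layout are assumptions for illustration; check the official model card before copying them:

```python
# Minimal vLLM sketch. The repo ID below is hypothetical; adjust
# tensor_parallel_size to your GPU count and memory budget.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ai-sage/GigaChat3-702B-A36B-preview",  # hypothetical repo ID
    tensor_parallel_size=8,       # split weights across 8 GPUs
    dtype="bfloat16",             # the card lists BF16 and FP8 modes
    max_model_len=131072,         # full 131K context
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize this contract clause: ..."], params)
print(outputs[0].outputs[0].text)
```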


Announce Date: 19.11.2025
Parameters: 715B
Experts: 256
Activated at inference: 36B
Context: 131K
Layers: 64
Attention Type: Multi-head Latent Attention
VRAM requirements: 344.5 GB with 4-bit quantization
Developer: Sber AI
Transformers Version: 4.53.2
License: MIT
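As a rough sanity check on the VRAM figure, quantized weight memory can be estimated as parameters × bits per weight / 8. A back-of-envelope sketch follows; the exact 344.5 GB above will depend on the specific quantization scheme and which tensors it packs:

```python
# Rough weight-memory estimate for a quantized model. Real deployments
# also budget for the KV cache (small here thanks to MLA), activations,
# and inference-engine buffers, so treat this as an approximation.
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """Decimal gigabytes occupied by the quantized weights alone."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"{weight_vram_gb(715e9, 4):.1f} GB")  # ~357.5 GB of raw 4-bit weights
```

The published 344.5 GB figure is in the same ballpark; exact numbers come down to the quantization scheme's packing details.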

Public endpoint

Use our pre-built public endpoints for free to test inference and explore GigaChat3-702B-A36B-preview capabilities. You can obtain an API access token on the token management page after registration and verification.
There are no public endpoints for this model yet.
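When an endpoint goes live, access is typically through an OpenAI-compatible API. The sketch below assumes that compatibility; the base URL and model name are placeholders:

```python
# Hypothetical client call. Substitute the real endpoint URL and the
# token from the token management page once a public endpoint is live.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-endpoint.cloud/v1",  # placeholder
    api_key="YOUR_API_TOKEN",
)
resp = client.chat.completions.create(
    model="GigaChat3-702B-A36B-preview",
    messages=[{"role": "user", "content": "Привет! Что ты умеешь?"}],
)
print(resp.choices[0].message.content)
```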

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters (see the sketch after this list).
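For the last scenario, here is a minimal sketch of serving a LoRA adapter with vLLM on a dedicated instance. The repository ID, adapter name, and path are hypothetical, and LoRA support for this particular architecture depends on the engine version:

```python
# Sketch of LoRA-enabled inference with vLLM; names and paths are
# placeholders for your own fine-tuned adapter.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="ai-sage/GigaChat3-702B-A36B-preview",  # hypothetical repo ID
    tensor_parallel_size=8,
    enable_lora=True,
)
out = llm.generate(
    "Classify this support ticket: ...",
    SamplingParams(max_tokens=128),
    lora_request=LoRARequest("support-adapter", 1, "/models/lora/support"),
)
print(out[0].outputs[0].text)
```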

Recommended configurations for hosting GigaChat3-702B-A36B-preview

Prices:

Name                           Context  Parallelism  vCPU  RAM, GB  Disk, GB  GPUs  Price, hour
teslaa100-6.44.512.480.nvlink  131,072  pipeline       44      512       480     6       $14.10
h200-3.32.512.480              131,072  pipeline       32      512       480     3       $14.36
teslaa100-8.44.512.480.nvlink  131,072  tensor         44      512       480     8       $18.35
h200-4.32.768.480              131,072  tensor         32      768       480     4       $19.23
Prices:

Name                Context  Parallelism  vCPU  RAM, GB  Disk, GB  GPUs  Price, hour
h200-6.52.896.960   131,072  pipeline       52      896       960     6       $28.39
h200-8.52.1024.960  131,072  tensor         52     1024       960     8       $37.37


Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.