GigaChat 3 Ultra Preview is a massive MoE model with 702 billion total parameters, of which only 36 billion are active when generating each token. A key feature is Multi-head Latent Attention (MLA), a mechanism that compresses the Key-Value (KV) cache into a compact latent vector. For users, this means the ability to work with massive contexts and long documents without a steep increase in GPU memory requirements. The model is trained with a Multi-Token Prediction (MTP) objective: unlike classical models that predict only the next token, it can predict several tokens in a single forward pass. This provides a native speculative decoding mechanism that accelerates response generation by up to 40% without quality loss, making it one of the fastest models in the "heavy" LLM class.
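To make the MLA idea concrete, here is a minimal, purely illustrative PyTorch sketch of latent KV compression. The class name, dimensions, and projection layout are assumptions chosen for clarity, not GigaChat's actual implementation; the point is only that the cache stores one small latent per token and re-expands it into full keys and values at attention time.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy MLA-style cache: store a small latent per token, not full K/V."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # expand to keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # expand to values
        self.cache = []  # only d_latent floats cached per token

    def append(self, hidden):
        # hidden: [batch, d_model] for the newly generated token
        self.cache.append(self.down(hidden))

    def expand(self):
        # Rebuild per-head keys/values for all cached tokens on demand.
        latent = torch.stack(self.cache, dim=1)        # [batch, seq, d_latent]
        b, s, _ = latent.shape
        k = self.up_k(latent).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(latent).view(b, s, self.n_heads, self.d_head)
        return k, v

cache = LatentKVCache()
for _ in range(16):                 # simulate 16 decoded tokens
    cache.append(torch.randn(1, 1024))
k, v = cache.expand()
print(k.shape, v.shape)             # torch.Size([1, 16, 8, 128]) twice
```

In this toy setup each cached token costs 128 floats instead of 2 × 1024 for a full K/V pair, a 16× reduction; the same principle is what lets an MLA model hold a 131,072-token context in far less GPU memory.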
Unlike many Russian models that are fine-tunes of foreign counterparts, GigaChat 3 was trained from scratch on a dataset of over 20 trillion tokens. This dataset includes a vast amount of Russian-language sources and adds languages rarely featured in model training, from Chinese and Arabic to Uzbek and Kazakh. This gives the model a deep understanding of the Russian language and Russian cultural context that foreign LLMs often lack. The model leads on both Russian and international benchmarks: MERA Industrial 0.824, HumanEval+ (code) 0.8659, GSM8K (math) 0.9598, among others. In terms of quality, GigaChat 3 Ultra confidently outperforms GigaChat 2 Max on all key benchmarks.
The model is optimized for on-premise and private cloud installations in large enterprises where data security is critical, including air-gapped environments without internet access. It is well suited for complex analytics over large document collections (RAG), automation of L2/L3 technical support that requires deep contextual understanding, programming assistants, and code generation inside closed corporate repositories. The model supports popular inference engines (vLLM, SGLang, LMDeploy, TensorRT-LLM) and runs in BF16 and FP8 modes for optimal performance.
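As a minimal sketch of what offline inference with vLLM could look like: the Hugging Face repo id below is an assumption (check the official GigaChat 3 model card for the exact name), while the arguments themselves (`tensor_parallel_size`, `dtype`, `max_model_len`) are standard vLLM options.

```python
from vllm import LLM, SamplingParams

# Sketch of offline inference with vLLM. The repo id is an assumption;
# substitute the id from the official GigaChat 3 model card.
llm = LLM(
    model="ai-sage/GigaChat3-702B-A36B-preview",  # assumed repo id
    tensor_parallel_size=8,     # shard weights across 8 GPUs
    dtype="bfloat16",           # BF16 mode; vLLM also supports quantization="fp8"
    max_model_len=131072,       # the 131,072-token context window
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Summarize the key obligations in the attached contract."],
    SamplingParams(temperature=0.3, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```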
| Model Name | Context | Type | GPU | TPS | Status | Link |
|---|---|---|---|---|---|---|
There are no public endpoints for this model yet.
Rent your own physically dedicated instance with hourly or long-term monthly billing.
We recommend the following configurations for deploying private instances:
| Context | Type | vCPU | RAM, MB | Disk, GB | GPU | Price | |
|---|---|---|---|---|---|---|---|
| 131,072 | pipeline | 44 | 524288 | 480 | 6 | $14.10 | Launch |
| 131,072 | pipeline | 32 | 524288 | 480 | 3 | $14.36 | Launch |
| 131,072 | tensor | 44 | 524288 | 480 | 8 | $18.35 | Launch |
| 131,072 | tensor | 32 | 786432 | 480 | 4 | $19.23 | Launch |
| Context | Type | vCPU | RAM, MB | Disk, GB | GPU | Price | |
|---|---|---|---|---|---|---|---|
| 131,072 | pipeline | 52 | 917504 | 960 | 6 | $28.39 | Launch |
| 131,072 | tensor | 52 | 1048576 | 960 | 8 | $37.37 | Launch |
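In the tables above, "pipeline" and "tensor" describe how the model is sharded across the GPUs of the instance. As a hedged sketch (same assumed repo id as above; the arguments are standard vLLM engine options), the two strategies map to different engine arguments:

```python
from vllm import LLM

MODEL = "ai-sage/GigaChat3-702B-A36B-preview"  # assumed repo id, as above

# In practice you would create only one of these per deployment.

# Pipeline parallelism: consecutive layer blocks are placed on different
# GPUs and requests flow through them in stages (cf. the 6-GPU rows).
llm_pp = LLM(model=MODEL, pipeline_parallel_size=6,
             max_model_len=131072, trust_remote_code=True)

# Tensor parallelism: each layer's weight matrices are split across all
# GPUs, which compute on the same tokens together (cf. the 8-GPU rows).
llm_tp = LLM(model=MODEL, tensor_parallel_size=8,
             max_model_len=131072, trust_remote_code=True)
```

Tensor sharding typically lowers per-token latency but requires fast interconnect between all GPUs; pipeline sharding tolerates a weaker interconnect and favors throughput-oriented batching, which is one reason both variants are offered.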
Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.