GigaChat3-10B-A1.8B

GigaChat3-10B-A1.8B is a showcase of efficient computation in LLMs. Of its 10 billion total parameters, only 1.8 billion are active during generation, which puts its speed on par with very small models while the Mixture-of-Experts architecture lets it store far more knowledge. Generation is further accelerated by Multi-Token Prediction (MTP), which emits several output tokens per step. The model also implements Multi-head Latent Attention (MLA), which compresses the Key-Value cache into a latent vector and cuts GPU memory requirements, enabling efficient, cost-effective operation over a long 256K-token context.
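
The effect of sparse expert routing is easy to see in code. Below is a minimal, generic top-k MoE sketch in PyTorch; it is not GigaChat's implementation, and the layer sizes and top-k value are illustrative assumptions (only the expert count of 64 is taken from the spec list below).

# Minimal sketch of Mixture-of-Experts routing (illustrative; not GigaChat's code).
# Only top_k of the 64 expert MLPs run per token, so per-token compute tracks
# the "active" parameters while total capacity scales with all experts.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # evaluate only the selected experts
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])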

The model underwent comprehensive training on 20 trillion tokens, covering ten languages beyond the usual English-centric mix (languages of former USSR countries, Chinese, Arabic) and a large block of synthetic data to ensure high-quality answers in mathematics, logic, and programming. This sets it apart from compact Llama or Gemma variants, which often struggle with Russian grammar or lack knowledge of Russian everyday and cultural context. GigaChat 3 Lightning (as this model is also known) instead produces coherent, fluent Russian and even understands colloquialisms.

Thanks to its low latency and high throughput, the model is well suited for fast conversational agents and first-line support chatbots, for use as a router model in agentic systems (classifying queries before passing them to a larger model), and for inference on limited resources (edge devices, modest servers). It deploys easily with popular frameworks (transformers, vLLM, SGLang) and ships in two versions, FP8 and bfloat16, letting users trade off performance against quality.
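
For local experimentation with the transformers framework mentioned above, a minimal loading sketch might look like the following. The repository id matches the endpoint table below; the trust_remote_code flag and the chat-template call are standard transformers patterns and may need adjusting to the model card's instructions.

# Minimal local-inference sketch with transformers (check the model card
# for exact loading flags; this is an assumed, not verified, recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-sage/GigaChat3-10B-A1.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # the bfloat16 variant; an FP8 build is also published
    device_map="auto",
    trust_remote_code=True,       # custom MoE/MLA code may require this
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say this is a test"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=150)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))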


Announce Date: 19.11.2025
Parameters: 11B
Experts: 64
Activated at inference: 1.8B
Context: 256K (262,144 tokens)
Layers: 26
Attention Type: Multi-head Latent Attention
VRAM requirements: 15.0 GB with 4-bit quantization
Developer: Sber AI
Transformers Version: 4.53.2
License: MIT
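
The 15.0 GB figure above is consistent with a quick back-of-envelope check. The numbers in the sketch below (effective bits per parameter, overhead attribution) are illustrative assumptions, not a vendor formula.

# Back-of-envelope VRAM check (illustrative assumptions, not a vendor formula).
total_params = 11e9            # total parameter count from the spec list above
bits_per_param = 4.5           # ~4-bit weights plus quantization scales/zero-points
weights_gb = total_params * bits_per_param / 8 / 1024**3
print(f"quantized weights: {weights_gb:.1f} GB")   # ≈ 5.8 GB
# The rest of the quoted 15.0 GB plausibly goes to the KV cache and activations:
# even with MLA's compressed latent cache, a 262,144-token context plus CUDA
# workspace leaves little headroom on a 16 GB card such as the Tesla T4.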

Public endpoint

Use our pre-built public endpoints for free to test inference and explore GigaChat3-10B-A1.8B's capabilities. You can obtain an API access token on the token management page after registration and verification.
Model Name                   Context  Type    GPU      TPS  Status     Link
ai-sage/GigaChat3-10B-A1.8B  262,144  Public  RTX4090  –    AVAILABLE  chat

API access to GigaChat3-10B-A1.8B endpoints

cURL:

curl https://chat.immers.cloud/v1/endpoints/gigachat3-10b-a1.8b/generate/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer USER_API_KEY" \
  -d '{
        "model": "GigaChat-3-10B-A1.8B",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Say this is a test"}
        ],
        "temperature": 0,
        "max_tokens": 150
      }'
PowerShell:

$response = Invoke-WebRequest https://chat.immers.cloud/v1/endpoints/gigachat3-10b-a1.8b/generate/chat/completions `
  -Method POST `
  -Headers @{
    "Authorization" = "Bearer USER_API_KEY"
    "Content-Type"  = "application/json"
  } `
  -Body (@{
    model    = "GigaChat-3-10B-A1.8B"
    messages = @(
      @{ role = "system"; content = "You are a helpful assistant." },
      @{ role = "user"; content = "Say this is a test" }
    )
  } | ConvertTo-Json)
($response.Content | ConvertFrom-Json).choices[0].message.content
Python:

# pip install openai --upgrade

from openai import OpenAI

client = OpenAI(
    api_key="USER_API_KEY",
    base_url="https://chat.immers.cloud/v1/endpoints/gigachat3-10b-a1.8b/generate/",
)

chat_response = client.chat.completions.create(
    model="GigaChat-3-10B-A1.8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say this is a test"},
    ],
)
print(chat_response.choices[0].message.content)
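
For latency-sensitive chat UIs such as the support bots mentioned above, the same OpenAI-compatible endpoint can stream tokens as they are generated, assuming it honors the standard stream flag. The snippet below reuses the client from the previous example.

# Streaming variant: tokens are printed as they arrive instead of waiting
# for the full completion (same client and endpoint as above).
stream = client.chat.completions.create(
    model="GigaChat-3-10B-A1.8B",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()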

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable the full context window for long sequences,
  • ensure top-tier security by processing data in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters (see the sketch after this list).
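
As one concrete example of the custom-weights scenario, an offline-inference sketch with vLLM and a LoRA adapter might look like the following. The adapter name and path are hypothetical, the flags follow vLLM's documented Python API, and LoRA support for this specific architecture should be confirmed against the vLLM release notes.

# Sketch: vLLM on a private instance with a LoRA adapter (hypothetical
# adapter path; model id as in the endpoint table above).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="ai-sage/GigaChat3-10B-A1.8B",
    max_model_len=262144,       # full context; reduce if GPU memory is tight
    enable_lora=True,
    trust_remote_code=True,     # custom MoE/MLA code may require this
)
params = SamplingParams(temperature=0, max_tokens=150)
outputs = llm.generate(
    ["Say this is a test"],
    params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),  # hypothetical path
)
print(outputs[0].outputs[0].text)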

Recommended configurations for hosting GigaChat3-10B-A1.8B

Prices:
Name                         Context  Parallelism  vCPU  RAM, MB  Disk, GB  GPUs  Price, hour
teslat4-1.16.16.160          262,144  –            16    16384    160       1     $0.33
teslaa2-1.16.32.160          262,144  –            16    32768    160       1     $0.38
teslaa10-1.16.32.160         262,144  –            16    32768    160       1     $0.53
rtx2080ti-2.12.64.160        262,144  tensor       12    65536    160       2     $0.69
rtx3090-1.16.24.160          262,144  –            16    24576    160       1     $0.88
rtx3080-2.16.32.160          262,144  tensor       16    32768    160       2     $0.97
rtx4090-1.16.32.160          262,144  –            16    32768    160       1     $1.15
teslav100-1.12.64.160        262,144  –            12    65536    160       1     $1.20
rtxa5000-2.16.64.160.nvlink  262,144  tensor       16    65536    160       2     $1.23
rtx5090-1.16.64.160          262,144  –            16    65536    160       1     $1.59
teslaa100-1.16.64.160        262,144  –            16    65536    160       1     $2.37
teslah100-1.16.64.160        262,144  –            16    65536    160       1     $3.83
h200-1.16.128.160            262,144  –            16    131072   160       1     $4.74
Prices:
Name                         Context  Parallelism  vCPU  RAM, MB  Disk, GB  GPUs  Price, hour
teslat4-2.16.32.160          262,144  tensor       16    32768    160       2     $0.54
teslaa2-2.16.32.160          262,144  tensor       16    32768    160       2     $0.57
rtx2080ti-3.12.24.120        262,144  pipeline     12    24576    120       3     $0.84
teslaa10-2.16.64.160         262,144  tensor       16    65536    160       2     $0.93
rtx2080ti-4.16.32.160        262,144  tensor       16    32768    160       4     $1.12
teslav100-1.12.64.160        262,144  –            12    65536    160       1     $1.20
rtxa5000-2.16.64.160.nvlink  262,144  tensor       16    65536    160       2     $1.23
rtx3080-3.16.64.160          262,144  pipeline     16    65536    160       3     $1.43
rtx5090-1.16.64.160          262,144  –            16    65536    160       1     $1.59
rtx3090-2.16.64.160          262,144  tensor       16    65536    160       2     $1.67
rtx3080-4.16.64.160          262,144  tensor       16    65536    160       4     $1.82
rtx4090-2.16.64.160          262,144  tensor       16    65536    160       2     $2.19
teslaa100-1.16.64.160        262,144  –            16    65536    160       1     $2.37
teslah100-1.16.64.160        262,144  –            16    65536    160       1     $3.83
h200-1.16.128.160            262,144  –            16    131072   160       1     $4.74
Prices:
Name                         Context  Parallelism  vCPU  RAM, MB  Disk, GB  GPUs  Price, hour
teslat4-3.32.64.160          262,144  pipeline     32    65536    160       3     $0.88
teslaa10-2.16.64.160         262,144  tensor       16    65536    160       2     $0.93
teslat4-4.16.64.160          262,144  tensor       16    65536    160       4     $0.96
teslaa2-3.32.128.160         262,144  pipeline     32    131072   160       3     $1.06
rtx2080ti-4.16.64.160        262,144  tensor       16    65536    160       4     $1.18
rtxa5000-2.16.64.160.nvlink  262,144  tensor       16    65536    160       2     $1.23
teslaa2-4.32.128.160         262,144  tensor       32    131072   160       4     $1.26
rtx3090-2.16.64.160          262,144  tensor       16    65536    160       2     $1.67
rtx4090-2.16.64.160          262,144  tensor       16    65536    160       2     $2.19
teslav100-2.16.64.240        262,144  tensor       16    65536    240       2     $2.22
teslaa100-1.16.64.160        262,144  –            16    65536    160       1     $2.37
rtx5090-2.16.64.160          262,144  tensor       16    65536    160       2     $2.93
teslah100-1.16.64.160        262,144  –            16    65536    160       1     $3.83
h200-1.16.128.160            262,144  –            16    131072   160       1     $4.74

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.