Qwen3.5-0.8B

Tags: reasoning, multimodal

The Qwen3.5-0.8B model is ultra-compact—the smallest in the Qwen 3.5 series—yet it retains all the technical innovations and advantages of the lineup. Its architecture is built on a hybrid approach, combining two key mechanisms: Gated DeltaNet and Gated Attention, arranged across 24 layers in a pattern of 6 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)). This allows the model to efficiently compress and process long sequences with lower computational costs compared to traditional Transformers. It supports Multi-Token Prediction (MTP) and comes with ready-made integrations for popular inference frameworks such as vLLM, SGLang, and Transformers.
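The repeating block structure described above can be sketched in plain Python (the layer names are illustrative labels for the schedule, not actual module names from the released code):

```python
# Build the 24-layer hybrid schedule: six repetitions of
# three (Gated DeltaNet -> FFN) blocks followed by one
# (Gated Attention -> FFN) block, as described above.
def hybrid_layout(num_groups=6, deltanet_per_group=3):
    layers = []
    for _ in range(num_groups):
        layers += ["gated_deltanet"] * deltanet_per_group
        layers += ["gated_attention"]
    return layers

layout = hybrid_layout()
print(len(layout))                      # 24 layers in total
print(layout.count("gated_attention"))  # 6 full-attention layers
```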

The uniqueness of Qwen3.5-0.8B lies in its ability to be truly multimodal while maintaining an extremely small size (0.8B parameters). Unlike its predecessor, Qwen3-0.6B, which was purely text-based, the new model integrates a vision encoder and was trained in its early stages on mixed multimodal data. This enables it not only to read text within images but also to understand complex visual scenes, diagrams, and even videos. The model supports 201 languages, a reasoning mode (thinking mode), improved instruction following, and a native context window of 262,144 tokens—a record for models of this size.
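Since the model accepts images alongside text, a request typically interleaves image and text content parts. Below is a minimal sketch of a multimodal chat message in the widely used OpenAI-compatible schema (the field names follow that convention, and the image URL is a placeholder):

```python
# Assemble a multimodal chat message: one image plus a text question.
# The "image_url"/"text" part types follow the OpenAI-compatible
# schema that inference servers such as vLLM and SGLang also accept.
message = {
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/diagram.png"}},
        {"type": "text",
         "text": "Summarize what this diagram shows."},
    ],
}

part_types = [part["type"] for part in message["content"]]
print(part_types)  # ['image_url', 'text']
```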

Thanks to its architecture and performance, Qwen3.5-0.8B opens up a wide range of possibilities for developers and researchers. Rapid prototyping and research: it is an ideal "sandbox" for testing ideas, prompt engineering, and experimenting with long contexts, with no need for expensive hardware.


Announce Date: 28.02.2026
Parameters: 874M
Context: 262K (262,144 tokens)
Layers: 24 (6 with full attention)
Attention Type: Linear Attention
Developer: Qwen
Transformers Version: 4.57.0.dev0
vLLM Version: 0.17.0
License: Apache 2.0
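One practical consequence of the 6-of-24 full-attention layout is a much smaller KV cache at long context, since only full-attention layers store per-token key/value states. The sketch below estimates the saving versus a hypothetical all-attention variant; the KV-head count and head dimension are illustrative assumptions, not published figures:

```python
# Rough KV-cache size estimate: per-token key/value states are kept
# only in full-attention layers, so 6 attention layers out of 24 cut
# the cache to a quarter of an all-attention design of the same size.
def kv_cache_bytes(tokens, attn_layers, kv_heads=4, head_dim=128, dtype_bytes=2):
    # factor of 2 for keys and values
    return tokens * attn_layers * kv_heads * head_dim * dtype_bytes * 2

full = kv_cache_bytes(262_144, attn_layers=24)    # all-attention baseline
hybrid = kv_cache_bytes(262_144, attn_layers=6)   # hybrid layout

print(full // hybrid)   # 4 -> 4x smaller cache
print(hybrid / 2**30)   # 3.0 GiB at full context (under these assumptions)
```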

Public endpoint

Use our pre-built public endpoints for free to test inference and explore Qwen3.5-0.8B capabilities. You can obtain an API access token on the token management page after registration and verification.
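Once an endpoint is available, a request is a standard chat-completions call. The sketch below only assembles the request body and headers; the model name is taken from this page, while the token value is a placeholder you replace with your own:

```python
# Sketch of a chat-completions request body for a hosted endpoint.
# Substitute the endpoint URL from the table above and your own API
# token from the token management page.
import json

payload = {
    "model": "Qwen3.5-0.8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
headers = {
    "Authorization": "Bearer <YOUR_API_TOKEN>",
    "Content-Type": "application/json",
}

body = json.dumps(payload)
print(json.loads(body)["model"])  # Qwen3.5-0.8B
```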
There are no public endpoints for this model yet.

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying a private instance when you need to:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.
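For self-hosted deployment, a single-GPU vLLM server can be started as follows. This is a sketch: the model identifier is assumed to follow the usual Hugging Face naming, and you may need to lower --max-model-len on GPUs with less memory:

```shell
# Serve the model with an OpenAI-compatible API on port 8000.
# Serving the full 262,144-token window requires enough GPU memory
# for the KV cache; reduce --max-model-len on smaller cards.
vllm serve Qwen/Qwen3.5-0.8B \
    --max-model-len 262144 \
    --port 8000
```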

Recommended server configurations for hosting Qwen3.5-0.8B

Prices:

Name | Context | GPUs | Price, hour | TPS
teslat4-1.16.16.160 | 262,144 | 1 | $0.33 | 3.594
rtx2080ti-1.10.16.500 | 262,144 | 1 | $0.38 | 2.104
teslaa2-1.16.32.160 | 262,144 | 1 | $0.38 | 3.594
teslaa10-1.16.32.160 | 262,144 | 1 | $0.53 | 5.980
rtx3080-1.16.32.160 | 262,144 | 1 | $0.57 | 1.806
rtx3090-1.16.24.160 | 262,144 | 1 | $0.83 | 5.980
rtx4090-1.16.32.160 | 262,144 | 1 | $1.02 | 5.980
teslav100-1.12.64.160 | 262,144 | 1 | $1.20 | 8.365
rtxa5000-2.16.64.160.nvlink (tensor parallel) | 262,144 | 2 | $1.23 | 12.307
rtx5090-1.16.64.160 | 262,144 | 1 | $1.59 | 8.365
teslaa100-1.16.64.160 | 262,144 | 1 | $2.37 | 22.676
h100-1.16.64.160 | 262,144 | 1 | $3.83 | 22.676
h100nvl-1.16.96.160 | 262,144 | 1 | $4.11 | 26.850
h200-1.16.128.160 | 262,144 | 1 | $4.74 | 40.862
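Assuming the TPS column is sustained generation throughput, the hourly price converts directly into an approximate cost per generated token. The sketch below compares the cheapest and the fastest configurations from the table above:

```python
# Convert hourly price and tokens-per-second throughput into an
# approximate cost per million generated tokens.
def cost_per_million(price_per_hour, tps):
    tokens_per_hour = tps * 3600
    return price_per_hour / tokens_per_hour * 1e6

print(round(cost_per_million(0.33, 3.594), 2))   # teslat4-1.16.16.160
print(round(cost_per_million(4.74, 40.862), 2))  # h200-1.16.128.160
```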

Related models

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.