Qwen3-VL-235B-A22B-Thinking

reasoning
multimodal

Qwen3-VL-235B-A22B-Thinking is the flagship multimodal model in the Qwen3 series, designed for deep understanding and grounded reasoning over text, images, and video. It offers a broad set of capabilities: object recognition, spatial and temporal localization, and advanced comprehension of complex documents and event dynamics. The model is built upon Qwen/Qwen3-235B-A22B-Thinking-2507.

At the core of its multimodal capabilities lies the Interleaved-MRoPE mechanism, which generates positional embeddings across time, width, and height, a property critical for high-quality video analysis. DeepStack combines features from different layers of the Vision Transformer (ViT), enhancing perceptual detail and improving image-text alignment accuracy. Text-Timestamp Alignment enables precise alignment of textual event representations with timestamps, which is essential for correct processing of video and event-based data.

The model supports a context window of up to 256,000 tokens, expandable to 1 million, allowing it to analyze large documents, books, and hours of video while fully preserving context and quickly navigating to relevant segments through indexing.
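The idea behind Interleaved-MRoPE can be illustrated with a minimal sketch: instead of a single 1-D position per token, every vision token receives a (time, height, width) coordinate triple. The function below is an illustrative assumption, not the actual Qwen3-VL implementation.

```python
# Illustrative sketch (NOT the actual Qwen3-VL code): assigning each video
# patch a (time, height, width) position triple, the kind of per-axis
# coordinate that multi-axis RoPE variants rotate embeddings by.

def mrope_positions(num_frames, grid_h, grid_w):
    """Return (t, h, w) position triples for a video patch grid."""
    positions = []
    for t in range(num_frames):          # temporal axis
        for h in range(grid_h):          # vertical patch index
            for w in range(grid_w):      # horizontal patch index
                positions.append((t, h, w))
    return positions

pos = mrope_positions(num_frames=2, grid_h=3, grid_w=4)
print(len(pos))         # 2 * 3 * 4 = 24 tokens
print(pos[0], pos[-1])  # (0, 0, 0) (1, 2, 3)
```

A real implementation would feed each axis into its own rotary-frequency bands; this sketch only shows how the three coordinates are laid out.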

Qwen3-VL-235B-A22B-Thinking outperforms most open models in multimodal understanding thanks to:

  • unified processing of text, images, and video;
  • advanced OCR in 32 languages, robust to distorted text, poor lighting, and challenging angles;
  • extraction of information from long, highly structured documents and parsing of textual layout;
  • 2D and 3D spatial localization for analyzing complex scenes;
  • and, not least, an enhanced reasoning module: the model can construct logical and causal reasoning chains, explain visual scenes, analyze object relationships, track temporal dynamics, and provide well-justified answers, making it an essential tool for engineering, mathematical, research, and agentic tasks.

Developers report that Qwen3-VL-235B-A22B-Thinking achieves top-tier performance on most benchmarks among reasoning models and significantly surpasses closed systems, especially in perception and multimodal reasoning over long contexts. Given this, the model is recommended for recognition and information extraction from documents (banking, legal, medical, historical, etc.). Qwen3-VL-235B-A22B-Thinking also excels at deep video analysis and other forms of sequential event processing: motion analysis, object tracking, detailed segmentation, and video clip annotation. Another strong point is mathematical reasoning: the model can not only solve geometry problems and extract numerical data from charts and diagrams, but also prove theorems and derive comprehensive business insights from visualizations. Programming deserves a separate mention. Code generation and analysis from visual inputs is precisely the domain where Qwen3-VL-235B-A22B-Thinking delivers outstanding results: to obtain visualization code, you no longer need to write a long, detailed description in chat of how the chart should look. Simply sketch it and show it to the model.
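The sketch-to-chart workflow can be wired up through any OpenAI-compatible chat endpoint by sending the sketch as an inline base64 image. The helper below only builds the request payload; the model name follows the card above, while the instruction text and PNG bytes are placeholders you would replace with your own.

```python
# Hedged sketch: building an OpenAI-style chat-completions payload that
# sends a sketched chart to Qwen3-VL-235B-A22B-Thinking and asks it to
# produce plotting code. Endpoint, token, and image bytes are up to you.
import base64
import json

def build_chart_to_code_request(image_bytes, instruction):
    """Return a chat-completions payload with an inline base64 PNG."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "Qwen3-VL-235B-A22B-Thinking",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    }

payload = build_chart_to_code_request(
    b"\x89PNG...",  # placeholder: your sketched chart as PNG bytes
    "Write matplotlib code that reproduces this chart.",
)
print(json.dumps(payload)[:80])
```

POST this payload to your endpoint's /v1/chat/completions route with your API token in the Authorization header.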


Announce Date: 23.09.2025
Parameters: 236B
Experts: 128
Activated at inference: 22B
Context: 263K
Layers: 94
Attention Type: Full Attention
VRAM requirements: 178.2 GB with 4-bit quantization
Developer: Qwen
Transformers Version: 4.57.0.dev0
License: Apache 2.0
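The quoted VRAM figure can be sanity-checked with back-of-envelope arithmetic. This is an approximation, not the vendor's exact formula: 4-bit weights account for roughly 118 GB, and the remainder of the 178.2 GB covers KV cache, activations, and runtime overhead.

```python
# Back-of-envelope VRAM estimate for the 4-bit quantized checkpoint.
# Approximation only: real deployments add KV cache, activations, and
# framework overhead on top of the raw weight footprint.
params = 236e9                      # total parameters (from the spec above)
bytes_per_param = 4 / 8             # 4-bit quantization = 0.5 bytes/param
weight_gb = params * bytes_per_param / 1e9
print(round(weight_gb, 1))          # → 118.0 GB for weights alone
# The remaining ~60 GB of the quoted 178.2 GB is cache and overhead.
```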

Public endpoint

Use our pre-built public endpoints for free to test inference and explore Qwen3-VL-235B-A22B-Thinking capabilities. You can obtain an API access token on the token management page after registration and verification.
Model Name Context Type GPU TPS Status Link
There are no public endpoints for this model yet.

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying private instances in the following scenarios:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.

Recommended configurations for hosting Qwen3-VL-235B-A22B-Thinking

Prices:

Name                            Context   vCPU   RAM, MB   Disk, GB   GPU   Price, hour
teslaa100-3.32.384.240          262,144   32     393216    240        3     $8.00
rtx5090-8.44.256.240            262,144   44     262144    240        8     $11.55
h200-2.24.256.240               262,144   24     262144    240        2     $13.89
teslah100-3.32.384.240          262,144   32     393216    240        3     $15.58

teslaa100-6.44.512.320.nvlink   262,144   44     524288    320        6     $15.36
h200-3.32.512.480               262,144   32     524288    480        3     $21.08

teslaa100-8.44.704.960.nvlink   262,144   44     720896    960        8     $20.48
h200-6.52.896.640               262,144   52     917504    640        6     $41.79

Related models

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.