Qwen3-VL-30B-A3B-Instruct is a medium-sized multimodal model of the Qwen3-VL series, demonstrating advanced capabilities in the field of image, video and text comprehension. The model is based on a Mixture of Experts (MoE) architecture with 30 billion parameters, of which only 3 billion are actively used, which ensures high performance with relatively low computing costs. The architecture includes 48 layers, 128 experts (8 active), GQA attention with 32 query heads and 4 for keys and values. The key difference from the previous VL versions were three architectural innovations. Interleaved-MRoPE provides full frequency allocation in time, latitude, and altitude coordinates through enhanced positional embeddings, which is critical for understanding long-term video sequences. DeepStack technology combines the multilevel features of Vision Transformer to capture fine-grained details and enhance image alignment with text. The Text-Timestamp Alignment system is superior to the traditional T-RoPE, providing accurate event timestamps for enhanced temporal video modeling. These architectural solutions allow the model not only to "see" images or videos, but also to truly understand the visual world and its dynamics.
The model is able to work as a visual agent, recognizing elements of computer and mobile interfaces, understanding their functions, invoking tools, and performing complex automation tasks. Advanced visual coding features allow you to generate Draw.io Diagrams, HTML, CSS, and JavaScript code are directly based on image and video analysis, which opens up new horizons for automating web development. Advanced spatial perception includes the assessment of object positions, viewpoints, and occlusions, providing a stronger 2D and 3D spatial understanding of scenes. The technical characteristics of the model are impressive: native support for the context of 256K tokens with the ability to expand to 1M, which allows you to process entire books and videos lasting hours with full memorization and indexing by seconds. Advanced OCR supports 32 languages, is resistant to low light, blur and tilt, works better with rare and ancient characters, as well as improved processing of the structure of long documents and entity extraction.
The Qwen3-VL-30B-A3B-Instruct opens up wide possibilities for practical applications in various fields. Interface automation is becoming a reality thanks to the model's ability to recognize and interact with GUI elements of desktop and mobile applications, which allows the creation of intelligent bots to automate routine tasks. Web development gets a powerful tool for generating code directly from visual layouts or descriptions, significantly speeding up the prototyping process. Document analysis with advanced OCR makes the model indispensable for processing multilingual documentation, scanned forms, invoices, and spreadsheets in the financial and commercial fields. Processing video content for up to several hours with accurate time indexing opens up opportunities for creating video surveillance analysis systems, educational content, and media analytics.
Model Name | Context | Type | GPU | TPS | Status | Link |
---|
There are no public endpoints for this model yet.
Rent your own physically dedicated instance with hourly or long-term monthly billing.
We recommend deploying private instances in the following scenarios:
Name | vCPU | RAM, MB | Disk, GB | GPU | |||
---|---|---|---|---|---|---|---|
262,144.0 |
16 | 65536 | 160 | 2 | $0.93 | Launch | |
262,144.0 |
16 | 65536 | 160 | 4 | $0.96 | Launch | |
262,144.0 |
16 | 65536 | 160 | 2 | $1.23 | Launch | |
262,144.0 |
32 | 131072 | 160 | 4 | $1.26 | Launch | |
262,144.0 |
16 | 65536 | 160 | 2 | $1.67 | Launch | |
262,144.0 |
16 | 65536 | 160 | 2 | $2.19 | Launch | |
262,144.0 |
16 | 65535 | 240 | 2 | $2.22 | Launch | |
262,144.0 |
16 | 65536 | 160 | 1 | $2.58 | Launch | |
262,144.0 |
16 | 65536 | 160 | 2 | $2.93 | Launch | |
262,144.0 |
16 | 65536 | 160 | 1 | $5.11 | Launch | |
262,144.0 |
16 | 131072 | 160 | 1 | $6.98 | Launch |
Name | vCPU | RAM, MB | Disk, GB | GPU | |||
---|---|---|---|---|---|---|---|
262,144.0 |
16 | 98304 | 160 | 3 | $1.34 | Launch | |
262,144.0 |
32 | 131072 | 160 | 6 | $1.65 | Launch | |
262,144.0 |
16 | 131072 | 160 | 4 | $2.34 | Launch | |
262,144.0 |
16 | 98304 | 160 | 3 | $2.45 | Launch | |
262,144.0 |
16 | 65536 | 160 | 1 | $2.58 | Launch | |
262,144.0 |
16 | 98304 | 160 | 3 | $3.23 | Launch | |
262,144.0 |
64 | 262144 | 320 | 3 | $3.89 | Launch | |
262,144.0 |
16 | 98304 | 160 | 3 | $4.34 | Launch | |
262,144.0 |
16 | 65536 | 160 | 1 | $5.11 | Launch | |
262,144.0 |
16 | 131072 | 160 | 1 | $6.98 | Launch |
Name | vCPU | RAM, MB | Disk, GB | GPU | |||
---|---|---|---|---|---|---|---|
262,144.0 |
24 | 196608 | 160 | 6 | $3.50 | Launch | |
262,144.0 |
32 | 98304 | 160 | 4 | $4.35 | Launch | |
262,144.0 |
24 | 98304 | 160 | 2 | $5.04 | Launch | |
262,144.0 |
16 | 131072 | 160 | 4 | $5.74 | Launch | |
262,144.0 |
44 | 262144 | 160 | 6 | $6.63 | Launch | |
262,144.0 |
16 | 131072 | 160 | 1 | $6.98 | Launch | |
262,144.0 |
24 | 262144 | 160 | 2 | $10.40 | Launch |
Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.