Qwen3.5-4B is a small model with 4 billion parameters, optimized for deployment on edge devices and mobile platforms. Its hybrid architecture comprises 32 layers, 8 of which use full attention, providing efficient sequence processing at minimal computational cost. Despite its compact size, the model retains all the technical innovations of the Qwen3.5 series, including native multimodality and a 262K-token context window, which lets it process long documents even on memory-constrained devices.
On benchmarks, the model posts results that surpass many models twice its size. On language tests such as MMLU-Pro (79.1) and GPQA Diamond (76.2), it outperforms Qwen3-Next-80B-A3B-Thinking in a number of scenarios. On the agentic TAU2-Bench (79.9), it performs on par with models 20 times larger, confirming its effectiveness at planning and tool use. Its multimodal capabilities are also strong: its MathVista (mini) score (85.1) is only slightly below that of the 9B model, and on CountBench (96.3) and MMBench (89.4) it ranks among the best. This makes it well suited to recognizing objects, scenes, and documents on memory-constrained devices.
What sets the model apart is that it brings the qualities of "large" AI to the edge. It is an ideal solution for mobile applications, drones, robots, and smart cameras that need fast, local analysis of visual and textual information without an internet connection. It compares favorably with other models in its class thanks to a rare combination of deep multimodal capabilities and agentic "reasoning" in such a compact format.
A compact model with 9 billion parameters, a 262K token context, and multimodal capabilities designed for efficiently solving a wide range of tasks under limited resources. It is perfectly suited for deployment on consumer hardware while being capable of delivering performance comparable to models 3–4 times its size.
A model with 122 billion parameters and a sparse MoE architecture that activates only 10B parameters per token, plus hybrid attention and native multimodality. It is ideal for tasks requiring reasoning, long-document analysis, and enterprise deployment with optimized resource requirements.
A dense model with 27 billion parameters and a 64-layer hybrid architecture, delivering memory efficiency, predictable behavior, and stable results in tasks requiring multimodal image analysis, programming, and logical reasoning.
A versatile model with 35 billion total parameters (3B active) that balances high performance with resource efficiency. It is well suited to production environments on accessible user hardware and excels at tasks requiring speed, multimodal support, reasoning, and long-context processing.
A video generation model capable of creating video from text (T2V), images (I2V), and video (V2V), designed for real-time use and long-duration applications. It can generate video sequences lasting several minutes at a frame rate of 19.5 frames per second (FPS) using a single H100 GPU. The uniqueness of the model lies in its avoidance of traditional anti-drift methods (e.g., self-forcing, error-banks) and standard acceleration techniques (KV-cache, causal masking), all while maintaining video quality and synchronization.
A hybrid model from the Qwen team that combines advanced multimodal capabilities with exceptional efficiency thanks to its Gated DeltaNet and sparse Mixture-of-Experts (MoE) architecture. Of its 397 billion total parameters, the model activates only 17 billion per token, delivering high performance across a wide range of tasks, from complex mathematical reasoning to multimodal understanding and agent development.
A model for image editing tasks that ensures high accuracy, quality, and consistency across various scenarios.
The flagship model of the series, achieving state-of-the-art (SOTA) performance in coding, agentic tool use, and practical real-world "office" tasks. Thanks to massive-scale Reinforcement Learning (RL) and the innovative Forge framework, M2.5 not only solves the most complex tasks but does so with exceptional accuracy and speed.
A foundational open-source model designed for solving complex tasks and long-running agent scenarios. With an MoE architecture of 754B parameters (40B active), sparse attention (DSA), innovative slime RL infrastructure, and a focus on practical utility, GLM-5 pushes AI interaction far beyond simple chat, transforming it into a full-fledged executive assistant.
An efficient MoE model with 80B parameters (3B active), specifically designed for programming-oriented agents. The model features highly efficient inference, an extended context length (262K tokens), and best-in-class handling of various tool call formats, making it a highly suitable choice for deploying intelligent developer assistants.
A foundation model designed for Image-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks, enabling simultaneous generation of high-fidelity video and synchronized audio. It addresses limitations of cascaded pipelines and proprietary systems by providing a fully open-source solution.
It is a native multimodal autoregressive model designed for image generation, supporting both text-to-image and image-to-image (TI2I) tasks. It features a unified architecture for multimodal understanding and generation, achieving performance comparable to leading closed-source models. The model includes two main variants: HunyuanImage-3.0 (text-to-image) and HunyuanImage-3.0-Instruct (enhanced with reasoning capabilities for intelligent prompt improvement and creative editing).
An innovative multimodal model for optical character recognition (OCR) that mimics human visual perception. Instead of standard line-by-line image scanning, its new DeepEncoder V2 uses a compact language model to dynamically reorder visual tokens, following the semantic logic of the document. This significantly improves the understanding of complex layouts, tables, and formulas while maintaining the high efficiency of the previous version.
This model is designed for image-to-video generation and falls under the "World Model" category. The project is licensed under Apache-2.0, ensuring open access to the code and models.
This is the base model of the Z-Image family, designed for high-quality image generation, broad style coverage, and precise alignment with text prompts. It is intended for professional use, creative tasks, and research, in contrast to the accelerated version Z-Image-Turbo.
A 30-billion parameter MoE model with only ~3.6B parameters activated per token, delivering record-breaking performance in its class with minimal resource requirements (~24 GB VRAM). The model leads in agent-based tasks and programming, supports a 200K context, and is optimized for easy local deployment.
It is a 4 billion parameter rectified flow transformer model designed for fast image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, enabling end-to-end inference in under a second. Optimized for real-time applications without compromising quality, it runs on consumer-grade GPUs such as NVIDIA RTX 3090/4070 with approximately 13GB VRAM.
It is a 9 billion parameter rectified flow transformer model designed for high-speed image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, achieving state-of-the-art quality with end-to-end inference in under half a second. The model leverages an 8 billion parameter Qwen3 text embedder and is step-distilled to 4 inference steps, enabling real-time performance while matching or exceeding the quality of models five times its size.
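Several of the descriptions above quote both a total and an active parameter count for sparse MoE models. As a rough illustration of what those pairs imply, the sketch below computes the share of parameters used per token. The figures are copied from the descriptions; the dictionary labels and the helper names `active_fraction` and `summarize` are illustrative assumptions, not official model names or any library's API.

```python
# (total, active) parameter counts in billions, as quoted in the descriptions above.
# The labels are descriptive placeholders, not official model names.
MOE_MODELS = {
    "122B total / 10B active": (122, 10),
    "35B total / 3B active": (35, 3),
    "397B total / 17B active": (397, 17),
    "754B total / 40B active": (754, 40),
    "80B total / 3B active": (80, 3),
    "30B total / 3.6B active": (30, 3.6),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters used per token, as a fraction in (0, 1]."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active <= total")
    return active_b / total_b


def summarize(models: dict) -> list:
    """Return (label, percent_active) pairs sorted from sparsest to densest."""
    rows = [
        (label, 100 * active_fraction(total, active))
        for label, (total, active) in models.items()
    ]
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    for label, pct in summarize(MOE_MODELS):
        print(f"{label}: {pct:.1f}% of parameters active per token")
```

The active share is mainly a proxy for per-token compute. Memory requirements generally still track the total parameter count, since all experts usually have to be loaded, so these ratios should not be read as VRAM estimates.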