The NVIDIA Nemotron 3 Super 120B (12B active) is a hybrid model based on a sparse Latent Mixture-of-Experts (MoE) and Mamba-2 architecture, optimized for building complex agentic systems and handling contexts of up to 1 million tokens. Thanks to its innovative architecture, which activates only 12 billion parameters per token, and its Multi-Token Prediction (MTP) mechanism, the model delivers high inference efficiency, combining response quality with performance and computational savings when processing long sequences.
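To make the "only 12 billion parameters active per token" idea concrete, here is a minimal, generic sketch of top-k expert routing in a sparse MoE layer. The dimensions, expert count, and routing details are illustrative assumptions and do not reproduce the actual Nemotron 3 Super implementation.

```python
# Generic sparse Mixture-of-Experts routing sketch (illustrative sizes only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)     # only top-k experts fire per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(SparseMoELayer()(tokens).shape)                      # torch.Size([4, 512])
```

Because each token passes through only top_k of the num_experts feed-forward blocks, the compute per token scales with the active parameters rather than the total parameter count.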
This is an updated version of the LTX-2 model, developed by Lightricks for synchronized video and audio generation within a single model. It is based on the DiT architecture and integrates key components of modern video generation systems. The model delivers improved audio and visual quality, as well as increased text prompt accuracy.
An ultra-compact multimodal model with 0.8 billion parameters, featuring a hybrid architecture of Gated DeltaNet and Gated Attention. It boasts a record-breaking context length of 262,144 tokens for its size, supports 201 languages, and offers two operational modes—standard and reasoning (thinking)—making it an ideal solution for prototyping, research, and fine-tuning for specific tasks.
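Switching between the standard and reasoning modes in models of this kind is usually exposed through the chat template. The sketch below assumes a Qwen3-style enable_thinking flag and uses an existing small Qwen3 checkpoint purely as a stand-in, since this model's exact switch is not specified here.

```python
# Hedged sketch: toggling a reasoning ("thinking") mode via the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # stand-in checkpoint

messages = [{"role": "user", "content": "Summarize this contract clause."}]

# Reasoning mode: the template prepends a thinking block before the answer.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Standard mode: the model answers directly, without the thinking block.
prompt_standard = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```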
A miniature model with 2 billion parameters designed for prototyping, research tasks, and experiments. Despite its minimal size, it retains the key features of the series: the thinking mode, multimodality, a 262K-token context, and a hybrid architecture, making it an excellent sandbox for studying the behavior of modern LLMs.
Qwen3.5-4B is a small model with 4 billion parameters, optimized for deployment on edge devices and mobile platforms. Its hybrid architecture comprises 32 layers, 8 of which use full attention, enabling efficient sequence processing with minimal computational cost. Despite its compact size, the model retains all the technical innovations of the Qwen3.5 series, including native multimodality and a 262K-token context window, allowing it to process long documents even on memory-constrained devices.
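As a rough illustration of the 32-layer hybrid stack described above, the sketch below interleaves 8 full-attention layers with 24 linear-attention (Gated DeltaNet-style) layers. The exact interleaving pattern is an assumption for illustration, not the published layout.

```python
# Hypothetical hybrid layer schedule: 8 full-attention layers out of 32 total.
NUM_LAYERS = 32
FULL_ATTENTION_EVERY = 4   # 32 layers / 8 full-attention layers

layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_EVERY == 0 else "gated_deltanet"
    for i in range(NUM_LAYERS)
]

print(layer_types.count("full_attention"))   # 8
print(layer_types.count("gated_deltanet"))   # 24
```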
On benchmarks, the model outperforms many models twice its size. In language tests such as MMLU-Pro (79.1) and GPQA Diamond (76.2), it surpasses Qwen3-Next-80B-A3B-Thinking in a number of scenarios. On the agentic TAU2-Bench (79.9), it performs on par with models 20 times larger, confirming its strength in planning and tool use. Its multimodal capabilities are also strong: its MathVista (mini) score (85.1) falls only slightly short of the 9B model, and on CountBench (96.3) and MMBench (89.4) it ranks among the best. This makes it well suited for object, scene, and document recognition on memory-constrained devices.
What makes the model unique is that it brings the qualities of "big" AI to the edge. It is an ideal solution for mobile apps, drones, robots, and smart cameras that need fast, local analysis of visual and textual information without an internet connection. It stands out from other models in its class through a rare combination of deep multimodal capabilities and agentic "thinking" in such a compact format.
A compact model with 9 billion parameters, a 262K token context, and multimodal capabilities designed for efficiently solving a wide range of tasks under limited resources. It is perfectly suited for deployment on consumer hardware while being capable of delivering performance comparable to models 3–4 times its size.
A model with 122 billion parameters and a sparse MoE architecture that activates only 10B parameters per token, plus hybrid attention and native multimodality. It is ideal for tasks requiring reasoning, long-document analysis, and enterprise deployment with optimized resource requirements.
A dense model with 27 billion parameters and 64 layers of hybrid architecture, delivering memory efficiency, maximum predictability, and stable results in tasks requiring multimodal image analysis, programming, and logical reasoning.
A versatile model with 35 billion total parameters (activating 3B), it perfectly balances high performance with resource efficiency. It is ideally suited for production environments on accessible user hardware and excels at tasks requiring speed, multimodal support, reasoning, and long-context processing.
A hybrid model from the Qwen team that combines advanced multimodal capabilities with exceptional efficiency thanks to the Gated DeltaNet and sparse Mixture-of-Experts (MoE) architecture. With a total of 397 billion parameters, the model activates only 17 billion per token, delivering high performance across a wide range of tasks—from complex mathematical reasoning to multimodal understanding and agent development.
A model for image editing tasks that ensures high accuracy, quality, and consistency across various scenarios.
The flagship model of the series, achieving State-of-the-Art (SOTA) performance in coding, agentic tool use, and real-world "office" tasks. Thanks to massive-scale Reinforcement Learning (RL) and the innovative Forge framework, the M2.5 not only solves the most complex tasks but does so with exceptional accuracy and speed.
A foundational open-source model designed for solving complex tasks and long-running agent scenarios. With an MoE architecture of 754B parameters (40B active), sparse attention (DSA), innovative slime RL infrastructure, and a focus on practical utility, GLM-5 pushes AI interaction far beyond simple chat, transforming it into a full-fledged executive assistant.
An efficient MoE model with 80B parameters (3B active), specifically designed for programming-oriented agents. The model features highly efficient inference, an extended context length (262K tokens), and best-in-class handling of various tool call formats, making it a highly suitable choice for deploying intelligent developer assistants.
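As one concrete example of the tool-call handling mentioned above, the sketch below shows a common OpenAI-style JSON function call that a developer agent built on this model might emit and parse. The tool name and argument schema are hypothetical, not taken from the model card.

```python
# Hedged illustration of one common tool-call format (OpenAI-style JSON).
import json

tool_call = {
    "name": "run_tests",                              # hypothetical developer tool
    "arguments": {"path": "tests/", "verbose": True},
}

# An agent loop would typically serialize the call, execute the named tool,
# and feed the JSON result back into the model's context for the next step.
request = json.dumps(tool_call)
print(request)
```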
A foundation model designed for Image-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks, enabling simultaneous generation of high-fidelity video and synchronized audio. It addresses limitations of cascaded pipelines and proprietary systems by providing a fully open-source solution.
A native multimodal autoregressive model designed for image generation, supporting both text-to-image and image-to-image (TI2I) tasks. It features a unified architecture for multimodal understanding and generation, achieving performance comparable to leading closed-source models. The model comes in two main variants: HunyuanImage-3.0 (text-to-image) and HunyuanImage-3.0-Instruct (enhanced with reasoning capabilities for intelligent prompt improvement and creative editing).
An innovative multimodal model for optical character recognition (OCR) that mimics human visual perception. Instead of standard line-by-line image scanning, its new DeepEncoder V2 uses a compact language model to dynamically reorder visual tokens, following the semantic logic of the document. This significantly improves the understanding of complex layouts, tables, and formulas while maintaining the high efficiency of the previous version.
This model is designed for image-to-video generation and falls under the "World Model" category. The project is licensed under Apache-2.0, ensuring open access to the code and models.
This is the base model of the Z-Image family, designed for high-quality image generation, broad style coverage, and precise alignment with text prompts. It is intended for professional use, creative tasks, and research, in contrast to the accelerated Z-Image-Turbo version.