This is an updated version of the LTX-2 model, developed by Lightricks for synchronized video and audio generation within a single model. It is based on the DiT architecture and integrates key components of modern video generation systems. The model delivers improved audio and visual quality and follows text prompts more accurately.
A hybrid model from the Qwen team that combines advanced multimodal capabilities with exceptional efficiency, thanks to its Gated DeltaNet and sparse Mixture-of-Experts (MoE) architecture. Of its 397 billion total parameters, the model activates only 17 billion per token, delivering high performance across a wide range of tasks, from complex mathematical reasoning to multimodal understanding and agent development.
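As a rough illustration of how sparse MoE keeps the active parameter count small, here is a minimal top-k routing sketch in PyTorch; the hidden size, expert count, and top-k value are arbitrary placeholders, not the model's actual configuration.

```python
# Minimal sparse-MoE routing sketch (illustrative only; sizes are hypothetical).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=1024, n_experts=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():             # run each selected expert once on its tokens
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

x = torch.randn(8, 1024)   # 8 tokens
y = SparseMoE()(x)         # only 2 of 64 experts are touched per token
```

The point of the sketch is that every token passes through only a small fraction of the experts, which is why the per-token compute tracks the "active" parameter count rather than the total.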
A model for image editing tasks that ensures high accuracy, quality, and consistency across various scenarios.
The flagship model of the series, achieving State-of-the-Art (SOTA) performance in coding, agentic tool use, and practical, real-world "office" tasks. Thanks to massive-scale Reinforcement Learning (RL) and the innovative Forge framework, M2.5 not only solves the most complex tasks but does so with exceptional accuracy and speed.
A foundational open-source model designed for solving complex tasks and long-running agent scenarios. With an MoE architecture of 754B parameters (40B active), sparse attention (DSA), the innovative slime RL infrastructure, and a focus on practical utility, GLM-5 pushes AI interaction far beyond simple chat, turning it into a full-fledged executive assistant.
An efficient MoE model with 80B parameters (3B active), designed specifically for programming-oriented agents. The model offers highly efficient inference, an extended context length (262K tokens), and best-in-class handling of various tool call formats, making it well suited for deploying intelligent developer assistants.
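To give a sense of what handling a tool call looks like on the application side, here is a minimal dispatch sketch; the JSON shape and the read_file tool are generic examples for illustration, not the model's actual wire format.

```python
# Minimal tool-call dispatch sketch (the call format below is a generic example).
import json

def read_file(path: str) -> str:
    # Hypothetical tool exposed to the agent.
    with open(path) as f:
        return f.read()

TOOLS = {"read_file": read_file}

def dispatch(tool_call_json: str) -> str:
    call = json.loads(tool_call_json)      # e.g. {"name": "...", "arguments": {...}}
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Usage example, assuming a README.md exists in the working directory.
print(dispatch('{"name": "read_file", "arguments": {"path": "README.md"}}'))
```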
A foundation model designed for Image-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks, enabling simultaneous generation of high-fidelity video and synchronized audio. It addresses limitations of cascaded pipelines and proprietary systems by providing a fully open-source solution.
It is a native multimodal autoregressive model designed for image generation, supporting both text-to-image and image-to-image (TI2I) tasks. It features a unified architecture for multimodal understanding and generation, achieving performance comparable to leading closed-source models. The model includes two main variants: HunyuanImage-3.0 (text-to-image) and HunyuanImage-3.0-Instruct (enhanced with reasoning capabilities for intelligent prompt improvement and creative editing).
An innovative multimodal model for optical character recognition (OCR) that mimics human visual perception. Instead of standard line-by-line image scanning, its new DeepEncoder V2 uses a compact language model to dynamically reorder visual tokens, following the semantic logic of the document. This significantly improves the understanding of complex layouts, tables, and formulas while maintaining the high efficiency of the previous version.
This model is designed for image-to-video generation and belongs to the "world model" category. The project is licensed under Apache-2.0, ensuring open access to the code and models.
This is the base model of the Z-Image family, designed for high-quality image generation, broad style coverage, and precise alignment with text prompts. It is intended for professional use, creative tasks, and research, in contrast to the accelerated version Z-Image-Turbo.
A 30-billion parameter MoE model with only ~3.6B parameters activated per token, delivering record-breaking performance in its class with minimal resource requirements (~24 GB VRAM). The model leads in agent-based tasks and programming, supports a 200K context, and is optimized for easy local deployment.
It is a 4 billion parameter rectified flow transformer model designed for fast image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, enabling end-to-end inference in under a second. Optimized for real-time applications without compromising quality, it runs on consumer-grade GPUs such as NVIDIA RTX 3090/4070 with approximately 13GB VRAM.
It is a 9 billion parameter rectified flow transformer model designed for high-speed image generation and editing. It unifies text-to-image generation and multi-reference image editing into a single compact architecture, achieving state-of-the-art quality with end-to-end inference in under half a second. The model leverages an 8 billion parameter Qwen3 text embedder and is step-distilled to 4 inference steps, enabling real-time performance while matching or exceeding the quality of models five times its size.
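Since the entry mentions step distillation down to 4 inference steps, here is a minimal sketch of few-step rectified-flow sampling via plain Euler integration of a learned velocity field; the velocity_model and the linear schedule are stand-ins for illustration, not the released checkpoint or its exact sampler.

```python
# Few-step rectified-flow sampling sketch (Euler integration of a velocity field).
import torch

@torch.no_grad()
def sample(velocity_model, shape, steps=4, device="cuda"):
    x = torch.randn(shape, device=device)               # start from pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = velocity_model(x, t)                         # predicted velocity dx/dt
        x = x + (t_next - t) * v                         # one Euler step toward t = 0 (data)
    return x
```

With only a handful of steps, each model evaluation has to cover a large portion of the noise-to-image trajectory, which is what the step distillation is trained to make possible.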
It is a text-to-image and image-to-image generation model built on a hybrid architecture that combines an autoregressive generator with a diffusion decoder. It excels at generating high-fidelity images with precise text rendering and semantic understanding, particularly in complex, information-dense scenarios.
An audio-visual foundation model built on the DiT architecture, developed for synchronized generation of video and audio within a single model. It incorporates key components of modern video generation systems, ships with open weights, and is optimized for local use.
An open-source model built on a Mixture-of-Experts architecture with 1 trillion parameters, of which 32 billion are activated per token. The developers have implemented a "visual agentic intelligence" paradigm within it: a combination of visual perception, reasoning, and autonomous agents. The model is multimodal, released in native INT4 quantization, and includes a unique Agent Swarm mechanism that orchestrates up to 100 sub-agents running in parallel. This improves quality and reduces the execution time for complex tasks by an average factor of 4.5.
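The Agent Swarm mechanism itself is not described in detail here, but the general fan-out pattern it relies on, one orchestrator running many sub-agents concurrently, can be sketched as follows; the task split and the sub-agent function are hypothetical placeholders rather than the model's actual interface.

```python
# Minimal fan-out orchestration sketch: split a task and run sub-agents concurrently.
import asyncio

async def run_subagent(subtask: str) -> str:
    await asyncio.sleep(0.1)                  # stand-in for a real sub-agent / model call
    return f"result for: {subtask}"

async def orchestrate(task: str, n_agents: int = 4) -> list[str]:
    subtasks = [f"{task} (part {i + 1})" for i in range(n_agents)]
    return await asyncio.gather(*(run_subagent(s) for s in subtasks))

print(asyncio.run(orchestrate("summarize the codebase")))
```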
It is the December 2025 update to Qwen-Image, a foundational text-to-image model. It is designed to generate high-quality images from textual prompts, with enhanced realism, detail rendering, and text integration.
A text-to-image generation model that improves on the earlier NextStep-1. It was developed to enhance image quality and address visualization issues found in previous versions.