MOVA-720p

A foundation model designed for Image-to-Video-Audio (IT2VA) and Text-to-Video-Audio (T2VA) tasks, enabling simultaneous generation of high-fidelity video and synchronized audio. It addresses limitations of cascaded pipelines and proprietary systems by providing a fully open-source solution.  

Key Features:  

  • Native Bimodal Generation: Generates video and audio in a single inference pass, ensuring precise synchronization (e.g., lip-sync and environment-aware sound effects).  
  • Accurate Lip Synchronization and Sound Effects: The authors report state-of-the-art results in multilingual speech synchronization and context-aware sound effect generation.
  • Mixture-of-Experts (MoE) Design: 32B total parameters, of which 18B are active during inference, for efficient deployment.
  • Open-Source Framework: Releases model weights, inference code, training pipelines, and LoRA fine-tuning scripts.  
  • Resolution: 720p
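
To make the MoE figures above concrete, the short sketch below computes the share of parameters that is active per inference step. It uses only the 32B and 18B figures quoted in the feature list; the function name is illustrative and not part of any MOVA code.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of MoE parameters used per inference step (hypothetical helper)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# Figures from the feature list: 32B total, 18B active.
fraction = active_fraction(32, 18)
print(f"active share: {fraction:.1%}")
```

Note that a smaller active share reduces compute per step, but all expert weights still have to be stored somewhere (GPU or offloaded to host memory).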

The model is part of a video generation pipeline that consists of the following components:

  • Text encoder: ~5.7B parameters
  • Audio VAE: ~372M parameters
  • Audio DiT: ~1.4B parameters
  • Video DiT: ~28.6B parameters
  • Video VAE: ~127M parameters
  • Dual-tower bridge: ~2.7B parameters

Total: ~38.8B parameters
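
The component figures above can be checked with a few lines of arithmetic. The sketch below sums the listed (rounded) counts and gives a rough estimate of weight memory. The 2 bytes per parameter value assumes bf16/fp16 storage, which the source does not state, and the estimate ignores activations and any quantization. Because the inputs are rounded, the computed sum can differ slightly from the stated total. As a side observation (an inference, not something the source confirms), the audio DiT, video DiT, and bridge together come to about 32.7B, close to the 32B headline figure.

```python
# Approximate parameter counts in billions, copied from the component list.
COMPONENTS_B = {
    "text encoder": 5.7,
    "audio VAE": 0.372,
    "audio DiT": 1.4,
    "video DiT": 28.6,
    "video VAE": 0.127,
    "dual-tower bridge": 2.7,
}


def total_params_b(components: dict) -> float:
    """Sum of component sizes, in billions of parameters."""
    return sum(components.values())


def weight_memory_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory in GiB; 2 bytes/param is an assumed bf16/fp16 format."""
    return params_b * 1e9 * bytes_per_param / 2**30


total = total_params_b(COMPONENTS_B)
dit_and_bridge = sum(
    COMPONENTS_B[name] for name in ("audio DiT", "video DiT", "dual-tower bridge")
)

print(f"pipeline total: ~{total:.1f}B parameters")
print(f"DiT towers + bridge: ~{dit_and_bridge:.1f}B parameters")
print(f"weights at 2 bytes/param: ~{weight_memory_gib(total):.0f} GiB")
```

Under these assumptions the full set of weights is well above 24 GB, which is consistent with the recommendation to use offloading on smaller GPUs.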


For local use, the authors recommend a GPU with at least 24 GB of memory to generate an 8-second video at 360p resolution (with offloading).


Announce Date: 29.01.2026
Parameters: 32B
Developer: OpenMOSS
Diffusers Version: 0.36.0
License: Apache 2.0

Public endpoint

Use our pre-built public endpoints for free to test inference and explore MOVA-720p capabilities. You can obtain an API access token on the token management page after registration and verification.
There are no public endpoints for this model yet.

Private server

Rent your own physically dedicated instance with hourly or long-term monthly billing.

We recommend deploying private instances in the following scenarios:

  • maximize endpoint performance,
  • enable full context for long sequences,
  • ensure top-tier security for data processing in an isolated, dedicated environment,
  • use custom weights, such as fine-tuned models or LoRA adapters.

Recommended server configurations for hosting MOVA-720p

Prices:
Name                     GPU   Price, hour
teslaa10-1.16.32.160     1     $0.53
rtx3090-1.16.32.160      1     $0.84
rtx4090-1.16.32.160      1     $1.02
teslav100-1.12.64.160    1     $1.20
rtx5090-1.16.64.160      1     $1.59
teslaa100-1.16.64.160    1     $2.37
h100-1.16.64.160         1     $3.83
h100nvl-1.16.96.160      1     $4.11
h200-1.16.128.160        1     $4.74

Prices:
Name                     GPU   Price, hour
teslat4-1.16.16.160      1     $0.33
teslaa2-1.16.32.160      1     $0.38
teslaa10-1.16.32.160     1     $0.53
rtx3090-1.16.24.160      1     $0.83
rtx4090-1.16.32.160      1     $1.02
teslav100-1.12.64.160    1     $1.20
rtx5090-1.16.64.160      1     $1.59
teslaa100-1.16.64.160    1     $2.37
h100-1.16.64.160         1     $3.83
h100nvl-1.16.96.160      1     $4.11
h200-1.16.128.160        1     $4.74

Prices:
Name                     GPU   Price, hour
teslat4-1.16.16.160      1     $0.33
rtx2080ti-1.10.16.500    1     $0.38
teslaa2-1.16.32.160      1     $0.38
teslaa10-1.16.32.160     1     $0.53
rtx3080-1.16.32.160      1     $0.57
rtx3090-1.16.24.160      1     $0.83
rtx4090-1.16.32.160      1     $1.02
teslav100-1.12.64.160    1     $1.20
rtx5090-1.16.64.160      1     $1.59
teslaa100-1.16.64.160    1     $2.37
h100-1.16.64.160         1     $3.83
h100nvl-1.16.96.160      1     $4.11
h200-1.16.128.160        1     $4.74
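
For budgeting, the hourly prices in the last table above can be turned into a simple cost estimate. The sketch below multiplies the listed hourly rate by the number of rented hours. It assumes plain hourly billing with no discounts; long-term monthly pricing may differ and is not given here. The function name is illustrative.

```python
# Hourly prices in USD, copied from the last price table above.
HOURLY_USD = {
    "teslat4-1.16.16.160": 0.33,
    "rtx2080ti-1.10.16.500": 0.38,
    "teslaa2-1.16.32.160": 0.38,
    "teslaa10-1.16.32.160": 0.53,
    "rtx3080-1.16.32.160": 0.57,
    "rtx3090-1.16.24.160": 0.83,
    "rtx4090-1.16.32.160": 1.02,
    "teslav100-1.12.64.160": 1.20,
    "rtx5090-1.16.64.160": 1.59,
    "teslaa100-1.16.64.160": 2.37,
    "h100-1.16.64.160": 3.83,
    "h100nvl-1.16.96.160": 4.11,
    "h200-1.16.128.160": 4.74,
}


def rental_cost(config: str, hours: float) -> float:
    """Estimated cost in USD, assuming flat hourly billing (an assumption)."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return round(HOURLY_USD[config] * hours, 2)


# Example: a 10-hour experiment on a single H100 configuration.
print(rental_cost("h100-1.16.64.160", 10))
```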

Need help?

Contact our dedicated neural networks support team at nn@immers.cloud or send your request to the sales department at sale@immers.cloud.