Sana Video: Efficient Video Generation with Block Linear Diffusion Transformer

Sana Video is a small diffusion model that generates videos at up to 720×1280 resolution and up to a minute in length. It produces high-resolution, high-quality videos with strong text-video alignment at fast speeds, and can be deployed on RTX 5090 GPUs. Its low training and inference cost is intended to make video generation more accessible.

What is Sana Video?

Sana Video is a diffusion model that extends the capabilities of Sana, a text-to-image framework, into video generation. The model uses a block linear diffusion transformer architecture with linear attention mechanisms to efficiently process the large number of tokens required for video generation. Two core designs enable efficient, effective, and long video generation: Linear DiT, which replaces vanilla attention with linear attention, and a constant-memory KV cache for Block Linear Attention that enables minute-long video generation with a fixed memory footprint.
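To make the constant-memory idea concrete, here is a minimal pure-Python sketch of linear attention with a running state. It is an illustration under stated assumptions, not Sana Video's actual implementation: the ReLU feature map, the class name LinearAttentionState, and the toy blocks are all hypothetical choices made for this example. The point it shows is that folding each block of keys and values into two running sums gives the same result as scoring a query against every past key, while the stored state stays the same size no matter how many blocks have been processed.

```python
def relu_features(vec):
    """Feature map phi(x) = ReLU(x), assumed here for illustration."""
    return [max(x, 0.0) for x in vec]


class LinearAttentionState:
    """Running sums that summarize all keys and values seen so far."""

    def __init__(self, dim_k, dim_v):
        self.kv_sum = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) v^T
        self.k_sum = [0.0] * dim_k                           # sum of phi(k)

    def update(self, keys, values):
        """Fold one block of (key, value) pairs into the state."""
        for key, value in zip(keys, values):
            for i, k_i in enumerate(relu_features(key)):
                self.k_sum[i] += k_i
                for j, v_j in enumerate(value):
                    self.kv_sum[i][j] += k_i * v_j

    def attend(self, query, eps=1e-6):
        """Attention output for one query, using only the running sums."""
        fq = relu_features(query)
        dim_v = len(self.kv_sum[0])
        numer = [sum(q_i * row[j] for q_i, row in zip(fq, self.kv_sum))
                 for j in range(dim_v)]
        denom = sum(q_i * k_i for q_i, k_i in zip(fq, self.k_sum)) + eps
        return [n / denom for n in numer]

    def num_floats(self):
        """Size of the cached state; independent of how many tokens were seen."""
        return len(self.kv_sum) * len(self.kv_sum[0]) + len(self.k_sum)


def quadratic_reference(query, keys, values, eps=1e-6):
    """Same result computed the slow way, scoring the query against every key."""
    fq = relu_features(query)
    weights = [sum(a * b for a, b in zip(fq, relu_features(k))) for k in keys]
    denom = sum(weights) + eps
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom
            for j in range(len(values[0]))]


# Two toy blocks of tokens (dim_k = 2, dim_v = 2); values are made up.
block_a = ([[1.0, 0.5], [0.2, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
block_b = ([[0.7, 0.1], [-0.3, 0.9]], [[0.5, 0.5], [2.0, -1.0]])

state = LinearAttentionState(dim_k=2, dim_v=2)
state.update(*block_a)
size_after_one_block = state.num_floats()
state.update(*block_b)
size_after_two_blocks = state.num_floats()

query = [0.4, 0.8]
fast = state.attend(query)
slow = quadratic_reference(query, block_a[0] + block_b[0], block_a[1] + block_b[1])
```

In this sketch the state holds dim_k × dim_v + dim_k numbers after one block and after two, whereas a conventional KV cache would grow with every token added. A real model would apply this per attention head on tensors and handle attention within the current block separately.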

Key Features

High-Resolution Generation: Produces videos at up to 720×1280 resolution and 16 frames per second.
Long-Duration Content: Generates videos up to one minute in length through efficient memory management.
Text-to-Video: Creates videos from natural language descriptions with strong text-video alignment.
Image-to-Video: Animates static images, bringing still frames to life with temporal dynamics.
Efficient Processing: Uses linear attention to reduce computational complexity and improve speed.
Fast Generation: Runs 16 times faster in measured latency than comparable small models while maintaining competitive quality.
Cost-Effective Training: Requires only 12 days on 64 H100 GPUs, about 1% of the training cost of MovieGen.
Hardware Accessibility: Deployable on RTX 5090 GPUs with NVFP4 precision for improved inference speed.
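As a quick sanity check on the training figure in the list above, the stated budget can be converted into GPU-days and GPU-hours. This is plain arithmetic on the two numbers given (12 days, 64 GPUs); it assumes all GPUs run for the full period and says nothing about price.

```python
TRAINING_DAYS = 12   # from the feature list
NUM_GPUS = 64        # H100 GPUs, from the feature list
HOURS_PER_DAY = 24

# Assumes every GPU is busy for the whole training run.
gpu_days = TRAINING_DAYS * NUM_GPUS
gpu_hours = gpu_days * HOURS_PER_DAY

print(f"{gpu_days} GPU-days, {gpu_hours} GPU-hours")
```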

Technical Architecture

Sana Video builds on the technical foundations of Sana: a deep compression autoencoder (DC-AE) that downsamples images by a factor of 32 along each spatial dimension, linear attention that scales efficiently to high resolutions, and efficient training strategies. The model extends these into the temporal dimension by adding block-wise autoregressive processing with a constant-memory KV cache, which enables long video generation.
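The effect of a 32 times autoencoder on sequence length can be sketched with a little arithmetic. The example below reads the 32 times figure as a per-side downsampling factor and compares it with an 8 times factor, which is a common choice in other latent diffusion models. The function names are hypothetical, and rounding partial latent rows up is an assumption about padding, not a documented detail of Sana Video.

```python
import math


def latent_grid(height, width, factor):
    """Latent rows and columns after downsampling each side by `factor`.

    Partial rows or columns are rounded up (a padding assumption).
    """
    return math.ceil(height / factor), math.ceil(width / factor)


def num_positions(height, width, factor):
    """Number of spatial latent positions for one image or frame."""
    rows, cols = latent_grid(height, width, factor)
    return rows * cols


# A 1024x1024 image under 8x and 32x per-side compression.
positions_8x = num_positions(1024, 1024, 8)    # 128 * 128
positions_32x = num_positions(1024, 1024, 32)  # 32 * 32
reduction = positions_8x // positions_32x

# One 720x1280 frame under 32x per-side compression.
frame_grid_720p = latent_grid(720, 1280, 32)
```

Moving from 8 times to 32 times per side cuts the number of latent positions by a factor of 16, which matters even more for video because every frame group adds another grid of tokens.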

Performance

Sana Video achieves performance competitive with state-of-the-art small diffusion models such as Wan 2.1-1.3B and SkyReel-V2-1.3B while being 16 times faster in measured latency. On an RTX 5090 GPU, NVFP4 precision cuts the time to generate a 5-second 720p video from 71 seconds to 29 seconds, a 2.4 times speedup.

Applications

Sana Video supports various applications including content creation, prototyping and previsualization, educational content generation, research and development, and world simulation. The model's efficiency and performance make it suitable for both creative and technical applications.

Note: This is an unofficial about page for Sana Video. For the most accurate information, please refer to official documentation.
