About InfinityStar

InfinityStar is a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis, developed by FoundationVision. Rather than treating spatial and temporal modeling as separate stages, it combines both within a single architecture.

What is InfinityStar?

InfinityStar generates high-resolution images and dynamic videos from text descriptions using a purely discrete approach that jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks, including text-to-image, text-to-video, image-to-video, and long-duration video synthesis, through straightforward temporal autoregression.
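The core idea behind temporal autoregression can be sketched as a toy next-token loop: a video is a sequence of frames, each frame is a grid of discrete tokens, and every new token is predicted conditioned on all tokens generated so far. The vocabulary size, frame size, and `predict_next` stand-in below are illustrative assumptions, not the actual InfinityStar model or API:

```python
# Toy sketch of spacetime autoregressive generation (NOT the real model).
# A video is a sequence of frames; each frame is a grid of discrete tokens
# drawn from a codebook. Tokens are generated one at a time, conditioned
# on the prompt and on everything generated so far.

import random

VOCAB_SIZE = 16          # hypothetical codebook size
TOKENS_PER_FRAME = 4     # hypothetical 2x2 token grid per frame

def predict_next(context, prompt):
    """Stand-in for the transformer: returns one discrete token.

    A real model would score all VOCAB_SIZE tokens given the prompt
    embedding and the full spacetime context, then sample from that
    distribution. Here we just draw from a deterministically seeded RNG.
    """
    rng = random.Random(hash((prompt, len(context))) % (2 ** 32))
    return rng.randrange(VOCAB_SIZE)

def generate_video(prompt, num_frames):
    """Temporal autoregression: frame t conditions on frames 0..t-1."""
    tokens = []                       # flat spacetime token sequence
    frames = []
    for _ in range(num_frames):
        frame = []
        for _ in range(TOKENS_PER_FRAME):
            tok = predict_next(tokens, prompt)
            tokens.append(tok)        # context grows across space AND time
            frame.append(tok)
        frames.append(frame)
    return frames

video = generate_video("a cat surfing", num_frames=3)
print(len(video), len(video[0]))      # 3 frames, 4 tokens per frame
```

Because earlier frames are ordinary prefix context, the same loop extends a clip indefinitely, which is how straightforward temporal autoregression supports long-duration synthesis.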

Key Achievements

  • Scores 83.74 on VBench, outperforming all autoregressive models by significant margins
  • Surpasses diffusion-based competitors like HunyuanVideo in benchmark performance
  • Generates 5-second 720p videos approximately 10 times faster than leading diffusion-based methods
  • First discrete autoregressive video generator capable of producing industrial-level 720p videos
  • Accepted as NeurIPS 2025 Oral presentation

Technical Architecture

InfinityStar is an 8-billion-parameter model that processes visual information in a unified manner. It uses Flan-T5-XL as its text encoder, encoding natural language prompts into embeddings that condition the visual generation process. The model architecture combines:

  • Unified spacetime modeling that processes spatial and temporal information together
  • Discrete autoregressive approach that treats visual content as sequences
  • FlexAttention mechanism for efficient training and inference
  • Support for multiple generation modes within a single architecture
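One way unified spacetime modeling and causal generation can coexist is a block-causal attention pattern: tokens attend bidirectionally within their own frame (full spatial attention) but only causally to earlier frames. The sketch below builds such a mask as plain Python lists; the frame layout and sizes are illustrative assumptions, not InfinityStar's actual attention mask:

```python
# Sketch of a block-causal spacetime attention mask: a query token may
# attend to any token in its own frame (bidirectional spatial attention)
# and to all tokens in earlier frames (causal temporal attention).

def spacetime_mask(num_frames, tokens_per_frame):
    """mask[q][k] is True where query token q may attend to key token k."""
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    # Query in frame i may see key in frame j iff j <= i.
    return [[frame_of[q] >= frame_of[k] for k in range(n)] for q in range(n)]

mask = spacetime_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In PyTorch's FlexAttention, a pattern like this would typically be expressed as a predicate over (query, key) indices rather than materialized densely, which is what makes training and inference with such structured masks efficient.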

Research and Development

InfinityStar was developed by FoundationVision and represents years of research into autoregressive modeling for visual content. The project combines insights from natural language processing with computer vision techniques, demonstrating that discrete autoregressive approaches can compete with and exceed the performance of continuous diffusion models in video generation tasks.

Open Source Commitment

The InfinityStar project is committed to open source development and research. The project includes:

  • Complete training code for reproducibility
  • Inference code for generating images and videos
  • Model checkpoints for both 480p and 720p resolutions
  • Web demo for interactive exploration
  • Comprehensive documentation and guides

All code and models are released to foster further research in efficient, high-quality video generation.

Applications

InfinityStar's capabilities make it suitable for various applications:

  • Content creation and social media
  • Film and animation production
  • Educational content development
  • Prototyping and design visualization
  • Research and development in visual generation

Note: This is an unofficial about page for InfinityStar. For the most accurate and up-to-date information, please refer to the official repository and research paper.