What is InfinityStar?

InfinityStar is a unified spacetime autoregressive framework designed for high-resolution image and dynamic video synthesis. It represents a significant advancement in visual generation technology, combining spatial and temporal modeling within a single architecture. The system uses a discrete approach that captures both spatial and temporal dependencies, enabling it to generate high-quality images and videos from text descriptions.

The framework builds on recent successes in autoregressive modeling, applying principles that have proven effective in both vision and language domains. By treating visual generation as a sequence prediction problem, InfinityStar can model complex visual patterns and generate coherent, high-resolution content. This approach allows the system to handle a range of generation tasks, including text-to-image generation, text-to-video synthesis, image-to-video animation, and long-duration video generation.

InfinityStar was developed by FoundationVision and accepted as a NeurIPS 2025 Oral presentation, indicating its significance in the research community. The project represents years of research into autoregressive modeling for visual content, combining insights from natural language processing with computer vision techniques. The system demonstrates that discrete autoregressive approaches can compete with and exceed the performance of continuous diffusion models in video generation tasks.

Overview of InfinityStar

Model Name: InfinityStar
Category: Unified Spacetime AutoRegressive Model
Function: High-resolution image and video generation
Model Size: 8 billion parameters (~35 GB checkpoint)
Text Encoder: Flan-T5-XL
Supported Resolutions: 480p and 720p
Video Duration: 5 seconds (720p); 5-10 seconds (480p)
Generation Speed: ~10x faster than diffusion-based methods
VBench Score: 83.74
Conference: NeurIPS 2025 Oral

Key Features of InfinityStar

  • Unified Spacetime Modeling

    InfinityStar combines spatial and temporal dependencies within a single architecture. This unified design allows the model to understand both how objects appear in space and how they change over time, creating more coherent and natural-looking videos.

  • Discrete Autoregressive Approach

    The system uses a purely discrete approach to visual generation, treating pixels or visual tokens as discrete units that are predicted sequentially. This method has proven effective in both language and vision tasks, providing a solid foundation for high-quality generation.

  • Multiple Generation Modes

    InfinityStar supports various generation tasks through straightforward temporal autoregression. Users can generate images from text, create videos from text descriptions, transform images into videos, and produce long-duration video content. The unified architecture makes these different modes work naturally together.

  • High Performance and Speed

    The model generates 5-second 720p videos approximately 10 times faster than leading diffusion-based methods. This speed advantage makes InfinityStar practical for real-world applications where quick generation is important. The system achieves this while maintaining high quality, scoring 83.74 on VBench.

  • Industrial-Level Quality

    InfinityStar is the first discrete autoregressive video generator capable of producing industrial-level 720p videos. The model generates content that meets professional standards for resolution, clarity, and temporal coherence, making it suitable for various commercial and creative applications.

  • Flexible Video Length

The 480p model supports variable-length video generation, producing videos of either 5 or 10 seconds. This flexibility allows users to generate content of different durations based on their specific needs, from short clips to longer sequences.

  • Image-to-Video Capabilities

    InfinityStar can transform static images into dynamic videos, bringing still images to life. This feature is particularly useful for creating animated content from photographs or artwork, enabling new creative possibilities for artists and content creators.

  • Video Continuation

    The system can extend existing videos by continuing the sequence, maintaining temporal coherence and visual consistency. This feature allows users to create longer videos from shorter clips, preserving the style and content of the original footage.

Try InfinityStar

Experience InfinityStar's capabilities through our interactive demo. Generate videos from text descriptions and explore the power of unified spacetime autoregressive modeling.

How InfinityStar Works

InfinityStar operates on the principle of autoregressive modeling, which means it generates visual content one piece at a time, with each piece depending on what came before. This approach is similar to how language models generate text, but applied to visual information.
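Concretely, autoregressive generation factorizes the probability of a token sequence conditioned on the text prompt c as p(x1, ..., xT | c) = p(x1 | c) · p(x2 | x1, c) · ... · p(xT | x1, ..., xT-1, c), so predicting each new token only requires the text and the tokens generated so far.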

The system first processes the input text using the Flan-T5-XL text encoder, which converts natural language descriptions into numerical representations that the model can understand. These representations capture the semantic meaning of the text, including objects, actions, scenes, and relationships.
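As a minimal sketch of this stage, assuming the standard Hugging Face checkpoint google/flan-t5-xl (InfinityStar's exact preprocessing may differ):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Only the encoder half of Flan-T5-XL is needed for text conditioning.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
encoder = T5EncoderModel.from_pretrained("google/flan-t5-xl").eval()

prompt = "A red fox running through fresh snow at sunset"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # One 2048-dimensional vector per input token; these condition the generator.
    text_embeddings = encoder(**inputs).last_hidden_state

print(text_embeddings.shape)  # torch.Size([1, seq_len, 2048])
```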

For image generation, InfinityStar predicts visual tokens sequentially, building the image up one token at a time. Each token represents a quantized piece of the image, and the model uses the context of all previously generated tokens to predict the next one. This shared context is what keeps every part of the image consistent with the rest.

For video generation, the process extends into the temporal dimension. The model generates frames one at a time, with each frame depending on both the previous frames and the text description. This ensures that videos maintain temporal coherence, meaning objects move naturally and consistently throughout the sequence.
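The overall control flow can be sketched as below. This is a toy illustration only: `model`, its call signature, and the token layout are placeholders, not the real InfinityStar API.

```python
import torch

BOS_ID = 0  # hypothetical start-of-sequence token id

def generate_video(model, text_embeddings, num_frames, tokens_per_frame):
    """Toy temporal-autoregressive loop: each new token is sampled
    conditioned on the text and on every previously generated token."""
    tokens = [BOS_ID]
    for _ in range(num_frames * tokens_per_frame):
        context = torch.tensor([tokens])              # (1, len(tokens))
        logits = model(context, text_embeddings)      # hypothetical signature
        probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over codebook
        tokens.append(torch.multinomial(probs, 1).item())
    body = tokens[1:]  # drop BOS, then regroup into per-frame token grids
    return [body[i * tokens_per_frame:(i + 1) * tokens_per_frame]
            for i in range(num_frames)]
```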

The unified spacetime architecture means that spatial and temporal information are processed together, not separately. This allows the model to understand how objects should appear in space and how they should change over time simultaneously, creating more natural and coherent videos.

The discrete approach means that InfinityStar works with quantized visual tokens rather than continuous pixel values. This quantization allows the model to learn more efficiently and generate content faster, while still maintaining high quality. The model has been trained on large datasets of images and videos, learning patterns and relationships that enable it to generate new content.
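Quantization itself can be illustrated with a minimal nearest-neighbor codebook lookup; InfinityStar's actual tokenizer is more sophisticated, but the principle is the same:

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each continuous feature vector to the id of its nearest codebook entry.

    features: (N, D) continuous encoder outputs
    codebook: (K, D) learned embedding table
    returns:  (N,) discrete token ids in [0, K)
    """
    distances = torch.cdist(features, codebook)  # (N, K) pairwise L2 distances
    return distances.argmin(dim=1)

# Example: 16 feature vectors quantized against a 1024-entry codebook.
codebook = torch.randn(1024, 64)
features = torch.randn(16, 64)
token_ids = quantize(features, codebook)
print(token_ids.shape)  # torch.Size([16])
```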

Applications and Use Cases

InfinityStar's capabilities make it suitable for a wide range of applications across different industries and creative fields. The system's speed and quality make it practical for both professional and personal use.

Content Creation

Content creators can use InfinityStar to generate video content from text descriptions, creating visual material for social media, marketing campaigns, or entertainment. The system's speed allows for rapid iteration and experimentation with different concepts and styles.

Film and Animation

Filmmakers and animators can use InfinityStar to create storyboards, visualize concepts, or generate background elements. The image-to-video capability allows static artwork to be animated, bringing illustrations and concept art to life.

Education and Training

Educational content creators can generate visual materials to illustrate concepts, create training videos, or develop interactive learning experiences. The text-to-video capability makes it easy to create visual content from written descriptions.

Prototyping and Design

Designers and developers can use InfinityStar to quickly visualize ideas, create mockups, or generate visual prototypes. The system's speed makes it ideal for rapid prototyping workflows where quick iteration is important.

Research and Development

Researchers can use InfinityStar to generate synthetic data, create visualizations, or explore new ideas in visual generation. The open-source nature of the project encourages further research and development in the field.

Technical Details

InfinityStar is built using modern deep learning techniques and requires specific computational resources. The model uses FlexAttention to speed up training, which requires PyTorch version 2.5.1 or higher. This attention mechanism allows the model to process information more efficiently during both training and inference.
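For reference, a minimal causal FlexAttention call looks like the following; this illustrates the PyTorch 2.5+ API in general, not InfinityStar's specific attention pattern:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def causal(b, h, q_idx, kv_idx):
    # Allow attention only to current and earlier positions.
    return q_idx >= kv_idx

B, H, S, D = 1, 8, 1024, 64  # batch, heads, sequence length, head dim
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# The block mask lets FlexAttention skip fully-masked blocks entirely,
# which is where much of the speedup comes from.
block_mask = create_block_mask(causal, B, H, S, S)
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```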

The model architecture consists of 8 billion parameters, making it a large-scale model that requires significant computational resources. The full model checkpoint is approximately 35 gigabytes, which includes all the learned weights and parameters necessary for generation.
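As a rough consistency check: 8 billion parameters stored as 32-bit floats occupy about 8 × 10^9 × 4 bytes ≈ 32 GB, which lines up with the reported ~35 GB checkpoint once embeddings and auxiliary buffers are included.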

For inference, the system supports two main modes: 720p video generation and 480p variable-length video generation. The 720p model is optimized for 5-second videos and provides the highest-quality output at that resolution. The 480p model offers more flexibility, supporting both 5- and 10-second generation, and is particularly effective for image-to-video and video-to-video tasks.

The training process involves organizing data, extracting features, and training the model on large datasets of images and videos. The system uses a comprehensive workflow that covers data preparation, feature extraction, and training scripts. This process requires substantial computational resources and time, but results in a model capable of generating high-quality visual content.

The model's performance is measured using VBench, a comprehensive benchmark for video generation. InfinityStar achieves a score of 83.74 on this benchmark, outperforming all other autoregressive models and even surpassing diffusion-based competitors. This performance demonstrates the effectiveness of the unified spacetime autoregressive approach.

Pros and Cons

Pros

  • Generates 720p videos approximately 10 times faster than diffusion-based methods
  • High-quality output scoring 83.74 on VBench
  • Unified architecture supports multiple generation tasks
  • First discrete autoregressive model for industrial-level 720p videos
  • Supports variable-length video generation at 480p
  • Open-source code and models available for research
  • Image-to-video and video continuation capabilities

Cons

  • Large model size requires significant storage space (~35GB)
  • Requires substantial computational resources for inference
  • 720p model limited to 5-second video generation
  • 480p model not specifically optimized for text-to-video tasks
  • Requires PyTorch 2.5.1 or higher for FlexAttention support
  • Training process computationally intensive

How to Use InfinityStar

Step 1: Installation

Install the required dependencies, including PyTorch 2.5.1 or higher for FlexAttention support. Install other Python packages using pip and the provided requirements file. Detailed installation instructions are available on the Installation page.
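A quick way to confirm the environment meets the PyTorch requirement before downloading anything:

```python
import torch

# FlexAttention ships with PyTorch 2.5+; fail early if the install is too old.
version = torch.__version__.split("+")[0]  # e.g. "2.5.1+cu121" -> "2.5.1"
major, minor = (int(x) for x in version.split(".")[:2])
assert (major, minor) >= (2, 5), f"PyTorch >= 2.5.1 required, found {torch.__version__}"
print(f"PyTorch {torch.__version__} OK")
```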

Step 2: Download Model Checkpoints

Download the appropriate model checkpoints for your use case. Choose between the 720p model for high-quality 5-second videos or the 480p model for variable-length generation. The checkpoints are available from the official repository.
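If the checkpoints are hosted on Hugging Face, the download could look like the sketch below; the repository id shown is a placeholder, so use the one published in the official README:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the id published by FoundationVision.
local_dir = snapshot_download(
    repo_id="FoundationVision/InfinityStar",   # hypothetical
    local_dir="checkpoints/infinitystar",
)
print(f"Checkpoints downloaded to {local_dir}")
```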

Step 3: Prepare Input

For text-to-video generation, prepare a text description of the video you want to create. For image-to-video generation, have an image file ready. For video continuation, prepare the initial video clip you want to extend.

Step 4: Run Inference

Use the appropriate inference script based on your needs. For 720p video generation, use the infer_video_720p.py script. For 480p variable-length generation, use the infer_video_480p.py script. Specify parameters such as text prompts, image paths, or video paths as needed.
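As an illustration only, a programmatic call might look like the sketch below. The script name comes from the repository, but every flag shown is a placeholder; check the project README for the actual arguments:

```python
import subprocess

# infer_video_720p.py is the real script name; the flags are hypothetical.
subprocess.run(
    [
        "python", "infer_video_720p.py",
        "--prompt", "A red fox running through fresh snow at sunset",
        "--save_dir", "outputs/",
    ],
    check=True,  # raise if the script exits with an error
)
```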

Step 5: Review and Export

Review the generated video output. The system will save the result to a specified location. You can then use the generated video in your projects or further process it as needed.

InfinityStar FAQs