Text-to-Video: Open Source vs. SaaS
A comprehensive overview of text-to-video algorithms: open-source models, online services, and the latest research advancements.
With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques (e.g., LoRA and DreamBooth), anyone can turn their imagination into high-quality images at an affordable cost. Consequently, there is strong demand for image-animation techniques that combine these generated still images with motion dynamics. This article looks at open-source and service-based approaches to solving this task.
Why does Text-to-Video matter?
Text-to-video technology holds enormous potential across various fields:
- Film and Animation: It could revolutionize how we create animated features and short films, allowing creators to generate preliminary scenes or entire sequences from script excerpts.
- Advertising and Marketing: Brands could leverage this technology to produce quick, cost-effective video content for marketing campaigns tailored to diverse and specific audiences.
- Gaming and Virtual Reality: Narrative construction in games and VR environments can become more dynamic, possibly generating real-time, story-specific scenes and backgrounds.
Open Source
MODELSCOPE APPROACH
ModelScope is a text-to-video synthesis model that evolved from a text-to-image synthesis model. Its text-to-video diffusion model consists of three sub-networks: a VQGAN, a text encoder, and a denoising UNet. In total, the model has about 1.7 billion parameters.
Fig. 1: Source: ModelScope Text-to-Video Technical Report
The model uses the 3D U-Net structure introduced in 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation and generates video through an iterative denoising process that starts from pure Gaussian noise. The following videos were generated using the zeroscope_v2_576w model, an improved version of ModelScope:
“An apple on a tree. Close up.”
“Sunset in the forest.”
“City landscape.”
“A tattoo artist making a tattoo. Close up”
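For reference, clips like these can be reproduced locally with the Hugging Face diffusers library, which provides a text-to-video pipeline compatible with the zeroscope_v2_576w checkpoint. The snippet below is a minimal sketch assuming a recent diffusers release and a CUDA GPU; the exact shape of the pipeline output can differ between versions.

# Minimal sketch: text-to-video with zeroscope_v2_576w via diffusers.
# Assumes a recent diffusers release, a CUDA GPU, and network access to
# download the cerspense/zeroscope_v2_576w checkpoint.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_slicing()  # lowers peak memory when decoding the frames

result = pipe(
    "An apple on a tree. Close up.",
    num_frames=24,  # roughly three seconds at 8 fps
    height=320,
    width=576,
)
frames = result.frames[0]  # recent diffusers versions return one frame list per prompt
export_to_video(frames, "apple.mp4")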
ZERO-SHOT TEXT-TO-VIDEO GENERATION
In the paper Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, researchers tackled text-to-video generation with an innovative and cost-effective approach. The method sidesteps the computationally intensive training and extensive video datasets that characterized previous attempts in this area. Instead, it cleverly repurposes existing text-to-image synthesis models, such as Stable Diffusion, and adapts them for video generation.
Fig. 2: Source: Text2Video-Zero Project Page
The core of this approach involves two significant modifications:
- Enriching Frame Latent Codes with Motion Dynamics: The technique enriches the latent codes of the generated frames with motion dynamics, so the global scene and background remain consistent over time while the essence of motion is still captured.
- Reprogramming Frame-Level Self-Attention: A novel cross-frame attention mechanism is introduced. Each frame references the first frame, preserving the foreground objects' context, appearance, and identity across the video sequence.
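The cross-frame attention mechanism can be illustrated with a few lines of PyTorch. This is a simplified sketch of the idea rather than the authors' implementation: the keys and values of every frame are replaced with those of the first frame, so each frame keeps attending back to the initial appearance.

import torch

def cross_frame_attention(q, k, v):
    """Simplified cross-frame attention (sketch, not the paper's exact code).

    q, k, v: tensors of shape (frames, tokens, dim) holding the per-frame
    query/key/value projections of a self-attention layer. Instead of each
    frame attending to itself, every frame attends to the keys/values of the
    FIRST frame, which anchors the appearance and identity of foreground
    objects across the whole clip.
    """
    k0 = k[:1].expand_as(k)  # broadcast frame-0 keys to all frames
    v0 = v[:1].expand_as(v)  # broadcast frame-0 values to all frames
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k0.transpose(-2, -1) * scale, dim=-1)
    return attn @ v0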
This method achieves low overhead and ensures high-quality output, with remarkable consistency in the generated videos. Furthermore, the versatility of this approach extends beyond just text-to-video synthesis. It shows promise in other applications, such as conditional and content-specialized video generation, and even in Video Instruct-Pix2Pix, a form of instruction-guided video editing.
These advancements suggest a significant leap forward in AI-driven video generation, offering both efficiency and quality in producing video content from textual descriptions.
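For practical use, the approach is available in the Hugging Face diffusers library as TextToVideoZeroPipeline. A minimal usage sketch, assuming a recent diffusers version and the Stable Diffusion 1.5 base checkpoint, might look like this:

# Sketch: zero-shot text-to-video on top of a plain Stable Diffusion checkpoint.
# Assumes diffusers with TextToVideoZeroPipeline, plus imageio for writing MP4.
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = pipe(prompt="A little Darth Vader standing in flowers with folded arms").images
frames = [(f * 255).astype("uint8") for f in frames]  # float [0, 1] arrays -> uint8
imageio.mimsave("vader.mp4", frames, fps=4)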
“A little Darth Vader standing in flowers with folded arms.”
“The sea near the coastline.”
Video generation using pose reference (ControlNet+StableDiffusionXL)
Motion reference video
“A man in a black suit dances with a disco ball.”
Prompted video editing based on the Instruct-Pix2Pix model
Pre-generated video
“Make it look like it's sunset time.”
“Make the sand red.”
“Make it in the style of a cartoon.”
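A naive way to approximate this kind of prompted editing is to run InstructPix2Pix frame by frame over the pre-generated video. The sketch below uses the public timbrooks/instruct-pix2pix checkpoint via diffusers; note that this is a simplistic per-frame loop, not the temporally consistent Video Instruct-Pix2Pix method mentioned above, so some flicker between frames is expected.

# Sketch: naive per-frame instruction-guided editing of an existing video.
# Assumes diffusers, imageio with the ffmpeg plugin, and a CUDA GPU.
import imageio
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

reader = imageio.get_reader("input.mp4")
fps = reader.get_meta_data().get("fps", 24)

edited = []
for raw_frame in reader:
    image = Image.fromarray(raw_frame).resize((512, 512))
    out = pipe(
        "Make it look like it's sunset time.",
        image=image,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # how closely to stick to the input frame
    ).images[0]
    edited.append(np.asarray(out))

imageio.mimsave("edited.mp4", edited, fps=fps)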
MOTION MODULE INJECTION (AnimateDiff)
In this project, the researchers propose an effective framework that animates most existing personalized text-to-image models once and for all, saving the effort of model-specific tuning. At its core, the framework appends a newly initialized motion modeling module to the frozen base text-to-image model and trains it on video clips to distill a reasonable motion prior. Once trained, simply injecting this motion module turns all personalized models derived from the same base into text-driven models that produce diverse and personalized animated clips.
Fig. 3: Source: AnimateDiff Project Page
Here are examples that use RCNZ Cartoon 3d and ToonYou as frozen base models in the AnimateDiff approach.
“A little Darth Vader standing in flowers with folded arms.” (RCNZ Cartoon 3d)
“A girl with black hair on a windy day in a bamboo forest.” (ToonYou)
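In code, this "train once, inject everywhere" idea is exposed, for example, in the diffusers library through a MotionAdapter that can be plugged into a personalized Stable Diffusion 1.5 checkpoint. The sketch below makes a few assumptions: the motion adapter name matches the published AnimateDiff weights, and the base model path is a placeholder for whichever personalized checkpoint (such as a ToonYou-style model) you want to animate.

# Sketch: injecting a pretrained motion module into a personalized SD 1.5 base.
# Assumes a recent diffusers release; the base model path is a placeholder.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "path/to/personalized-sd15-checkpoint",  # placeholder: any SD 1.5 personalization
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

frames = pipe(
    "A girl with black hair on a windy day in a bamboo forest",
    num_frames=16,
    guidance_scale=7.5,
).frames[0]
export_to_gif(frames, "animation.gif")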
STABLE DIFFUSION WEBUI (+Deforum)
Deforum is an extension for Stable Diffusion WebUI made solely for AI animations. It's a powerful tool that lets you create 2D, 3D, and interpolation animations, or even apply an art style to your videos.
This extension uses Stable Diffusion’s image-to-image function to generate a series of images and stitches them together to create a video. Since the change between frames is small, it creates the perception of a continuous animation.
Deforum provides a very flexible configuration for video generation and editing. You can control “camera” movement and rotation and define these parameters as formulas that depend on time. In addition, you can use different model checkpoints for particular parts of your video, apply ControlNet to get precisely what you want, or even provide a video or an image as additional input to serve as the base for generating a video from your prompt.
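The core mechanism — feeding each generated frame back into img2img with a small denoising strength — can be sketched in a few lines with diffusers. This is a bare-bones illustration of the frame-chaining idea only, not Deforum itself, which adds camera motion, 3D warping, parameter schedules, and much more.

# Sketch of Deforum-style frame chaining with img2img (not Deforum itself).
# Assumes diffusers, imageio, and a CUDA GPU.
import imageio
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "The Great Wave of Kanagawa by Katsushika Hokusai"

# Seed frame: with strength=1.0 the gray canvas is fully replaced by the prompt.
frame = pipe(
    prompt, image=Image.new("RGB", (512, 512), (128, 128, 128)), strength=1.0
).images[0]

frames = [frame]
for _ in range(59):  # roughly 60 frames in total
    frame = pipe(
        prompt,
        image=frame,        # the previous frame becomes the next input
        strength=0.45,      # small change per step -> perceived continuity
        guidance_scale=7.0,
    ).images[0]
    frames.append(frame)

imageio.mimsave("deforum_like.mp4", [np.asarray(f) for f in frames], fps=12)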
“The Great Wave of Kanagawa by Katsushika Hokusai”
"Drawing of black flowers on a brick wall of a house, high quality”
"A man looks at the viewer and smokes a cigarette, high quality” + ”Drawing of black flowers on a brick wall of a house”
“A girl with flowers in hair, nvinkpunk”
With Deforum, you can change the prompt while creating a video, which allows you to control the scene and the objects in it. For example:
{
"0": "Drawing of the lower left quarter of the front wheel of a bicycle",
"30": "Blueprint of a bicycle",
"60": "Blueprint of a motorcycle"
}
Here, "0", "30", and "60" are the frame numbers from which the corresponding prompt will be applied. We also enabled translation along the X and Y axes to make the camera's perception move. The result is shown below.
Example of using different prompts on different frames
WARP FUSION
Stable Warp Fusion is an advanced, alpha-version software designed for image and video diffusion tasks. It introduces innovative features such as alpha-masked diffusion and inverse alpha-mask diffusion, allowing for more precise control over the diffusion process by utilizing alpha masks to dictate which areas of the frame are to be diffused or remain fixed.
The resulting video is very similar to Deforum's output. You only need to set the path to the input video and the path to a Stable Diffusion checkpoint.
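The alpha-masking idea itself is simple to picture: a per-pixel mask decides how much of the diffused (stylized) frame versus the original frame ends up in the output. The toy snippet below illustrates only this compositing step, not Warp Fusion's actual pipeline, which also warps frames with optical flow between diffusion steps.

# Toy illustration of alpha-masked blending (not Warp Fusion's implementation).
import numpy as np

def alpha_masked_blend(original, stylized, alpha):
    """Blend a stylized frame with the original using an alpha mask.

    original, stylized: float arrays of shape (H, W, 3) in [0, 1].
    alpha: float array of shape (H, W, 1) in [0, 1]; 1 means fully diffused,
    0 means the original pixel stays untouched (the "fixed" region).
    """
    return alpha * stylized + (1.0 - alpha) * original

def inverse_alpha_masked_blend(original, stylized, alpha):
    # Inverse alpha-mask diffusion simply flips which regions get diffused.
    return alpha_masked_blend(original, stylized, 1.0 - alpha)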
Input video (Runway Gen-2): “A brown-haired man standing at the center of Times Square and looking into the camera. Front view.”
Output video: “Beautiful, highly detailed wooden sculpture, forest in the background.”
Services
STABLE DIFFUSION API
StableDiffusionAPI is a service that provides plenty of tools, from interior design to voice cloning, but its main feature is Stable Diffusion text-to-image generation. Here, however, we are interested in its text-to-video API, which also uses Stable Diffusion under the hood.
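Access is through a simple REST interface. The snippet below is only a schematic sketch: the endpoint URL and JSON field names are placeholders reflecting the general shape of such APIs, so consult the service's documentation for the exact request format.

# Schematic sketch of calling a hosted text-to-video REST API.
# NOTE: the endpoint URL and field names are hypothetical placeholders,
# not the service's documented schema.
import requests

payload = {
    "key": "your-api-key",  # placeholder
    "prompt": "A Darth Vader looking towards the viewer and standing in flowers.",
    "negative_prompt": "low quality, blurry",
    "seconds": 3,
}

response = requests.post(
    "https://stablediffusionapi.com/api/v5/text2video",  # placeholder endpoint
    json=payload,
    timeout=300,
)
response.raise_for_status()
print(response.json())  # typically contains a link to the rendered video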
Some examples are shown below.
“A Darth Vader looking towards the viewer and standing in flowers with folded arms.”
"Create a magical, fantasy-themed video of an enchanted forest.”
PIKA
Pika is an AI company founded by Stanford Ph.D. students that focuses on video processing and generation. Recently, the developers released the Lip Sync feature, a text-to-audio tool that can dub videos generated by the Pika text-to-video pipeline.
There is no detailed description of the text-to-video algorithm yet, but it may be revealed in the future, along with a publicly accessible API. If you need help with the service, you can contact the developers directly on their Discord server.
“A Darth Vader looking towards the viewer and standing in flowers with folded arms.”
"Create a magical, fantasy-themed video of an enchanted forest"
RUNWAY GEN-2
Runway's Gen-2 AI video tool is a trailblazer in the digital creative space, offering an unprecedented ability to transform text prompts into dynamic, high-quality videos. This tool signifies a leap in AI-driven content creation, enabling users, irrespective of their expertise level, to bring their imaginative ideas to life in video format.
Runway, an applied AI research company, drives Gen-2's development. Their focus is on making advanced content creation accessible to everyone, leveraging cutting-edge AI in computer graphics and machine learning.
Gen-2 operates by generating a series of images based on the provided prompts, which are then seamlessly stitched into a fluid video. It intelligently balances the content and structural aspects of the video to ensure a natural and high-quality output.
“A Darth Vader looking towards the viewer and standing in flowers with folded arms”
"Create a magical, fantasy-themed video of an enchanted forest”
Research you might like
In this part, we cover models and approaches that are not publicly available yet but show impressive results. The research comes from the largest AI labs, e.g., OpenAI, Google, Nvidia, and Meta.
Sora is one of the most recent research efforts in the field of video generation models. Inspired by the text tokens used in LLMs, researchers at OpenAI represent video as visual patches, previously used in visual recognition models. They found that these patches are highly scalable and effective for training generative models.
Sora is a diffusion transformer, so the scaling behavior of LLMs also applies to this model. The researchers fixed seeds and inputs during training and compared the results: as the model is scaled up, output quality increases as well.
The model can create videos at resolutions ranging from 1920x1080 to 1080x1920, making it easier to create content for different devices. It can also simulate real-world and virtual environments, such as Minecraft worlds, reproducing their dynamics.
You can find more details about the technical aspects of the model in OpenAI's technical report.
However, Sora has difficulty modeling physics, creating videos with many entities in a single frame, and simulating complex motions.
VideoPoet is a simple method for turning LLMs into strong video generators. It uses the MAGVIT V2 video tokenizer and the SoundStream audio tokenizer to turn images, videos, and audio clips into discrete codes compatible with language models.
The autoregressive language model predicts the next video or audio token in a sequence, learning from various modalities like video, image, audio, and text.
Multimodal generative learning tasks, like text-to-video and image-to-video, are added to the LLM training, enhancing capabilities. VideoPoet excels in synthesizing and editing videos with consistent timing, showing high-quality video generation with diverse motions. The model supports square or portrait video orientations for short-form content and can generate audio from a video input.
Emu Video is a simple method for text-to-video generation based on diffusion models, factorizing the generation into two steps:
- First, generating an image conditioned on a text prompt
- Then generating a video conditioned on the prompt and the generated image
Emu Edit is another model built on the Emu architecture, designed for instruction-guided image editing. It covers a diverse range of tasks, such as region-based editing and free-form editing. To handle these tasks, the authors introduced learned task embeddings, which increase the accuracy of executing the editing instruction.
Imagen Video is a text-to-video generation technology based on a cascade of video diffusion models.
It begins by encoding text prompts into textual embeddings using a T5 text encoder. This encoded information is then processed by a base Video Diffusion Model, which initially generates a low-resolution, short video. This initial output is progressively enhanced through multiple Temporal Super-Resolution (TSR) and Spatial Super-Resolution (SSR) models, resulting in a high-definition video with improved frame rate and resolution. The system employs the Video U-Net architecture, integrating both temporal self-attention and temporal convolutions to effectively capture spatial details and temporal dynamics, allowing for the modeling of long-term temporal sequences.
Imagen Video was not released because the research team wanted to detect and filter out violent content, social biases, and stereotypes that arise from problematic training data.
VideoLDM is a research project from the Nvidia Toronto AI Lab. The approach uses Latent Diffusion Models (LDMs), which enable high-quality image synthesis while avoiding excessive compute demands by training the diffusion model in a compressed, lower-dimensional latent space.
The team reports state-of-the-art performance in generating driving videos; the examples on their project page are genuinely impressive.
PYoCo is a large-scale text-to-video diffusion model finetuned from a state-of-the-art image generation model, eDiff-I, with a novel video noise prior, combined with several design choices from prior work, including temporal attention, joint image-video finetuning, a cascaded generation architecture, and an ensemble of expert denoisers.
The model achieved new state-of-the-art results on the small-scale unconditional generation benchmark with a 10x smaller model and 14x less training time.
REAL-TIME TEXT2VIDEO VIA CLIP-GUIDED, PIXEL-LEVEL OPTIMIZATION
The research on video generation was conducted by Peter Schaldenbrand, Zhixuan Liu, and Jean Oh from Carnegie Mellon University. Their approach involves generating video frames one by one and using a CLIP image-text encoder to guide the optimization process. This method differs from traditional techniques that rely on complex image generator models, as it calculates the CLIP loss directly at the pixel level. This allows for faster video generation, making it suitable for real-time systems. You can try to run this model in their Google Colab notebook.
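The core idea — optimizing raw pixels against a CLIP text embedding, with no image generator network in the loop — can be reproduced in a few lines. The sketch below is a bare-bones version using the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers; the authors' system adds regularizers and temporal-consistency terms that are omitted here.

# Bare-bones sketch of CLIP-guided, pixel-level optimization of a single frame.
# Assumes transformers and torch; the authors' full method adds extra losses.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["a sunset over the ocean"], return_tensors="pt").to(device)
with torch.no_grad():
    text_emb = F.normalize(model.get_text_features(**tokens), dim=-1)

# The "generator" is just the frame's pixels themselves.
frame = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([frame], lr=0.05)

# CLIP's preprocessing, kept differentiable so gradients reach the pixels.
mean = torch.tensor([0.4815, 0.4578, 0.4082], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.2686, 0.2613, 0.2758], device=device).view(1, 3, 1, 1)

for step in range(200):
    pixels = (frame.clamp(0, 1) - mean) / std
    img_emb = F.normalize(model.get_image_features(pixel_values=pixels), dim=-1)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity to the prompt
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()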
Make Pixels Dance is a method for high-dynamic video generation. In most text-to-video models, you may notice very little movement in the resulting video; in the worst cases, the generated video looks like a static image. To address this, Make Pixels Dance incorporates image instructions for the first and last frames alongside the text instruction for video generation. This allows the user to generate videos with complex scenes and movement. The researchers are still improving the model before making it available for demos.
Conclusions
Advances in text-to-video technology represent a significant step forward in AI-driven content creation. These technologies offer a combination of efficiency, quality, and creativity, allowing users to bring complex and imaginative ideas to life in video format. Open-source communities and commercial services both contribute significantly to this field, providing tools and platforms tailored to the needs of different applications and users.
As technology advances, we can expect more sophisticated and user-friendly solutions to democratize video creation further and expand its use in various fields. The future of Text-to-Video technology is bright and promises to transform how video content is created and consumed.