Microsoft is pushing into the next wave of AI video generation with the launch of DragNUWA, a new video generation model.
The model uses text, images, and trajectories (paths) as its three control factors, aiming for video generation that can be precisely controlled in semantic, spatial, and temporal terms.
AI companies are racing to master AI-powered video generation, and in recent months several players in this space have released models that generate videos from text and image prompts.
The DragNUWA model lets users directly manipulate the background or objects in an image and seamlessly converts those actions into camera movements or object motion in the resulting video.
This path-based generation joins the two familiar input methods, text prompts and image prompts.
Users can drag objects or entire video frames along specific trajectories, which offers a simple way to produce videos with fine-grained semantic, spatial, and temporal control while maintaining high-quality output.
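To make the path input concrete, the sketch below shows one plausible way to represent a drag trajectory and bundle it with a text prompt and a starting image. The data structures and field names here are illustrative assumptions for this article, not DragNUWA's actual interface.

```python
# Hypothetical sketch of a trajectory-conditioned generation request.
# The types and keys below are illustrative assumptions, not DragNUWA's real API.
from dataclasses import dataclass

@dataclass
class DragPoint:
    x: float  # horizontal position, normalized to [0, 1]
    y: float  # vertical position, normalized to [0, 1]

# A drag path is an ordered list of points that the selected object
# (or the camera) should follow across the generated frames.
pan_right = [DragPoint(x=0.2 + 0.1 * t, y=0.5) for t in range(6)]

request = {
    "text": "a sailboat drifting across a calm lake at sunset",  # semantics
    "image": "first_frame.png",    # appearance and layout of the scene
    "trajectories": [pan_right],   # motion: where the drag should move things
}
```

Dragging the whole frame along such a path would read as a camera pan, while dragging a single object moves only that object; the three inputs together pin down what appears, where it sits, and how it moves.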
Microsoft has released the model's weights and a demo of the project for the community to try out.
AI video generation typically relies on text-, image-, or path-based inputs, and each alone makes it difficult to control the output precisely.
Text and images together cannot convey the intricate motion details of a video, images and paths may not adequately represent objects that have yet to appear, and text and paths can be ambiguous when expressing abstract concepts.
In August 2023, the Microsoft AI team proposed DragNUWA to address this problem: an open-domain diffusion model that combines all three factors.
Users can therefore specify precise text, images, and paths in the input to control aspects of the resulting video such as camera movement, including zoom effects, and the motion of objects.
Each factor covers what the others miss: paths supply motion detail, text describes objects that will appear later, and images distinguish one object from another.
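For intuition about how three such signals could steer a single diffusion model, here is a generic sketch of multi-condition fusion: each input is embedded, and the combined embedding conditions every denoising step. The encoders and fusion scheme are simplified assumptions for illustration and do not reproduce DragNUWA's actual architecture.

```python
# Generic sketch of multi-condition fusion for a diffusion denoiser.
# All encoders here are random stand-ins, assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str) -> np.ndarray:
    # Stand-in for a text encoder (e.g., a CLIP-style embedding).
    return rng.standard_normal(64)

def encode_image(image: np.ndarray) -> np.ndarray:
    # Stand-in for an image encoder over the first frame.
    return rng.standard_normal(64)

def encode_trajectory(points: list[tuple[float, float]]) -> np.ndarray:
    # Stand-in for a trajectory encoder over the drag points.
    return rng.standard_normal(64)

def denoise_step(latent: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    # Placeholder denoiser: a real model would predict noise from
    # (latent, t) given `cond`; here we just nudge the latent toward it.
    return latent - 0.1 * (latent - cond)

# Each modality contributes the aspect the others lack:
# text -> semantics, image -> appearance, trajectory -> motion.
cond = (
    encode_text("a sailboat drifting across a lake")
    + encode_image(np.zeros((256, 256, 3)))
    + encode_trajectory([(0.2, 0.5), (0.4, 0.5), (0.6, 0.5)])
)

latent = rng.standard_normal(64)
for t in reversed(range(10)):
    latent = denoise_step(latent, cond, t)
```

The key design idea this sketch captures is that the conditioning signal is a fusion of all three inputs, so no single modality has to carry semantics, appearance, and motion on its own.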
In tests, Microsoft says the model can accurately move both the camera and individual objects along a variety of drag trajectories.