Not to be outdone by Meta’s Make-A-Video, Google today detailed its work on Imagen Video, an AI system capable of generating video clips from a text prompt (e.g. “a teddy bear doing the dishes”). While the results aren’t perfect – the looping clips the system generates tend to have artifacts and noise – Google says Imagen Video is a step towards a system with a “high degree of controllability” and knowledge of the world, including the ability to generate sequences in a range of art styles.
As my colleague Devin Coldewey noted in his article on Make-A-Video, video synthesis systems are nothing new. Earlier this year, a group of researchers from Tsinghua University and the Beijing Academy of Artificial Intelligence released CogVideo, which can translate text into reasonably high-fidelity short clips. But Imagen Video appears to be a significant leap over the previous state of the art, showing an ability to animate captions that existing systems would struggle to understand.
“It’s definitely an improvement,” Matthew Guzdial, an assistant professor at the University of Alberta who studies AI and machine learning, told TechCrunch via email. “As you can see in the sample videos, even though the communications team selects the best outputs, there’s still some weird fuzziness and artifacting. So this definitely won’t be used directly in animation or television. But this, or something like it, could definitely be integrated into tools to help speed things up.”
Imagen Video is based on Google’s Imagen, an image generation system comparable to OpenAI’s DALL-E 2 and Stable Diffusion. Imagen is what is known as a “diffusion” model, generating new data (e.g. videos) by learning how to “destroy” and “recover” many existing data samples. As it is fed more existing samples, the model gets better at recovering the data it had previously destroyed, which is what lets it create new works.
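To make the “destroy” and “recover” idea concrete, here is a minimal sketch of the noising arithmetic used in common diffusion formulations. It is an illustration under stated assumptions, not Google’s implementation: the function names, the toy four-value sample and the `alpha_bar` value are all hypothetical, and the true noise stands in for the estimate a trained network would normally supply.

```python
import math
import random


def destroy(sample, alpha_bar, rng):
    """Blend a clean sample with Gaussian noise (the 'destroy' direction).

    alpha_bar is in (0, 1]: values near 1 keep most of the signal,
    values near 0 leave almost pure noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in sample]
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    noisy = [keep * x + mix * n for x, n in zip(sample, noise)]
    return noisy, noise


def recover(noisy, noise_estimate, alpha_bar):
    """Invert the blend given an estimate of the noise (the 'recover' direction)."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [(y - mix * n) / keep for y, n in zip(noisy, noise_estimate)]


rng = random.Random(0)          # fixed seed so the run is repeatable
sample = [0.2, -0.5, 0.9, 0.1]  # hypothetical stand-in for pixel values

noisy, noise = destroy(sample, alpha_bar=0.5, rng=rng)

# Assumption: a trained model would predict the noise from `noisy`;
# here the true noise is passed in, so recovery is exact up to rounding.
restored = recover(noisy, noise, alpha_bar=0.5)
```

In a real system the quality of the result depends entirely on how well the network estimates the noise, which is the part that training on many samples improves.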
As the Google research team behind Imagen Video explains in a paper, the system takes a text description and generates a 16-frame video at three frames per second and 24-by-48-pixel resolution. The system then upscales the clip and “predicts” additional frames, producing a final video of 128 frames at 24 frames per second and 720p (1280×768).
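The figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in the article; the `ClipSpec` class is a hypothetical helper, and reading “24 x 48” as height by width is an assumption.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical container for the clip figures quoted in the article."""
    frames: int
    fps: int
    height: int
    width: int

    @property
    def duration(self) -> Fraction:
        # Clip length in seconds, kept exact as a fraction.
        return Fraction(self.frames, self.fps)

    @property
    def pixels(self) -> int:
        # Total pixels across every frame of the clip.
        return self.frames * self.height * self.width


# Assumption: "24 x 48" is height x width, matching the landscape 1280x768 output.
base = ClipSpec(frames=16, fps=3, height=24, width=48)
final = ClipSpec(frames=128, fps=24, height=768, width=1280)

print("base duration (s): ", float(base.duration))
print("final duration (s):", float(final.duration))
print("frame multiplier:  ", final.frames // base.frames)
print("pixel multiplier:  ", final.pixels / base.pixels)
```

Both clips run for the same 16/3 seconds, a little over five seconds, so the extra frames make motion smoother rather than making the clip longer.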
Google says Imagen Video was trained on 14 million video-text pairs and 60 million image-text pairs as well as the publicly available LAION-400M image-text dataset, which allowed it to generalize to a range of aesthetics. (Not coincidentally, part of LAION was used to train Stable Diffusion.) In experiments, the researchers found that Imagen Video could create videos in the style of Van Gogh paintings and of watercolors. Perhaps most impressively, they claim that Imagen Video demonstrated an understanding of depth and three-dimensionality, allowing it to create videos like drone flyovers that rotate around objects and capture them from different angles without distorting them.
In a major improvement over the image generation systems available today, Imagen Video can also render text correctly. While Stable Diffusion and DALL-E 2 struggle to translate prompts such as “a logo for ‘Diffusion’” into legible characters, Imagen Video renders them without a problem, at least judging by the paper.
This does not mean that Imagen Video is without limitations. As is the case with Make-A-Video, even the hand-picked clips from Imagen Video are jittery and distorted in places, as Guzdial alluded to, with objects blending together in physically unnatural, and impossible, ways.
“Overall, the text-to-video problem is still unresolved, and it’s unlikely we’ll reach anything like DALL-E 2 or Midjourney in quality any time soon,” Guzdial continued.
To improve on this, the Imagen Video team plans to join forces with the researchers behind Phenaki, another Google video synthesis system debuted today, which can turn long, detailed prompts into videos longer than two minutes, albeit at lower quality.
It’s worth pulling back the curtain on Phenaki a bit to see where a collaboration between the teams could lead. While Imagen Video focuses on quality, Phenaki favors consistency and length. The system can turn one-paragraph prompts into movies of arbitrary length, from a scene of a person riding a motorcycle to an alien spacecraft flying over a futuristic city. Phenaki-generated clips suffer from the same issues as Imagen Video’s, but I find it remarkable how closely they follow the long and nuanced text descriptions that drive them.
For example, here is a prompt sent to Phenaki:
A lot of traffic in the futuristic city. An alien spaceship is coming to the futuristic city. The camera goes inside the alien spaceship. The camera moves forward to show an astronaut in the blue room. The astronaut is typing on the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves past the astronaut and looks at the screen. The screen behind the astronaut shows fish swimming in the sea. Zoom in on the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom to a futuristic skyscraper. The camera zooms in on one of the many windows. We are in an office with empty desks. A lion runs over the desks. The camera zooms in on the lion’s face inside the office. Zoom out to the lion wearing a dark suit in an office. The lion wearing the suit looks at the camera and smiles. The camera slowly zooms out of the skyscraper. Timelapse of sunset in modern city.
And here is the generated video:
Back to Imagen Video: the researchers also note that the data used to train the system contained problematic content, which could cause Imagen Video to produce graphically violent or sexually explicit clips. Google says it won’t release the Imagen Video model or source code “until those concerns are alleviated” and, unlike Meta, it won’t provide any sort of sign-up form to register interest.
Yet, with text-to-video technology advancing at a rapid pace, it may not be long before an open-source model emerges, both stimulating human creativity and presenting an intractable challenge when it comes to deepfakes, copyright and disinformation.