Hands-on with Runway's Gen-2, from the AI company behind the tech used in "Everything Everywhere All at Once": movie-quality video generation is still a long way off

By Kyle Wiggers

Source: TechCrunch

Image source: Generated by Unbounded AI tool

In a recent interview with Collider, Joe Russo, director of Marvel films such as Avengers: Endgame, predicted that within two years AI will be able to create a fully fledged movie. I'd call that a fairly optimistic estimate. But we're getting closer.

This week, Google-backed AI startup Runway, which helped develop the AI image generator Stable Diffusion, released Gen-2, a model that generates video from text prompts or existing images. (Gen-2 was previously available only through a limited waitlist.) A follow-up to the Gen-1 model Runway launched in February, Gen-2 is one of the first commercially available text-to-video models.

"Commercially available" is an important distinction. Text-to-video, the logical next logical frontier for generative AI after images and text, is becoming a bigger area of focus, especially among the tech giants, some of which have demonstrated text-to-video over the past year Model. But these models are still in the research phase and inaccessible to all but a handful of data scientists and engineers.

Of course, first doesn't mean better.

Out of personal curiosity and as a service to you, dear reader, I ran a few prompts through Gen-2 to see what the model can -- and can't -- accomplish. (Runway currently offers roughly 100 seconds of free video generation.) There wasn't much method to my madness, but I tried to capture a range of angles, types and styles that a professional or amateur director might want to see on the big screen, or on a laptop.

Gen-2's limitations became apparent immediately: the model generates videos just four seconds long, at a frame rate so low that in places they stutter like slideshows.

It's unclear whether this is a technical limitation or Runway's attempt to save on compute costs, but either way it makes Gen-2 a rather unattractive proposition for editors hoping to avoid post-production work.

Beyond the frame rate issues, I found that Gen-2-generated clips tend to share a certain graininess or blurriness, as if some old-fashioned Instagram filter had been applied. Other artifacts crop up elsewhere, such as pixelation around objects when the "camera" (for lack of a better word) circles them or zooms in on them quickly.

Like many generative models, Gen-2 isn't particularly consistent when it comes to physics or anatomy. In videos that could pass for a surrealist's work, people's arms and legs fused together and then separated, objects melted into the floor and disappeared, and shadows warped. And, true to form, human faces can look doll-like, with glassy, emotionless eyes and pasty skin reminiscent of cheap plastic.

Then there's the matter of content. Gen-2 seems to have a hard time understanding nuance; it clings to certain descriptors in a prompt while ignoring others, seemingly at random.

I tried one prompt -- "a video of an underwater utopia, shot with an old camera, in a 'found footage' film style" -- but Gen-2 generated no such utopia, only something resembling a first-person dive video across an anonymous coral reef. Among my other prompts, Gen-2 also failed to produce a zooming shot for a prompt that specifically asked for a "slow zoom," nor did it quite grasp what an average astronaut would look like.

Are these issues related to the Gen-2 training dataset? Maybe.

Gen-2, like Stable Diffusion, is a diffusion model, meaning it learns to gradually subtract noise from a starting image made entirely of noise, moving step by step toward an image that matches the prompt. Diffusion models learn by training on millions to billions of examples; in the academic paper detailing Gen-2's architecture, Runway says the model was trained on an internal dataset of 240 million images and 6.4 million video clips.
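To make that denoising idea a little more concrete, here's a minimal sketch of a reverse-diffusion sampling loop in Python. It isn't Runway's code: the `denoiser` network, the `prompt_embedding` conditioning and the crude noise schedule are all assumptions standing in for the learned components a real system like Gen-2 would use.

```python
import torch

def sample(denoiser, prompt_embedding, steps: int = 50,
           shape=(1, 3, 64, 64)) -> torch.Tensor:
    """Toy reverse-diffusion loop (a sketch, not Gen-2's actual sampler).

    `denoiser(x, t, cond)` is a hypothetical trained network that predicts
    the noise present in image `x` at timestep `t`, conditioned on a prompt.
    """
    x = torch.randn(shape)  # start from an image made entirely of noise
    for t in reversed(range(steps)):
        # Ask the network how much noise it thinks is in the current image.
        predicted_noise = denoiser(x, torch.tensor([t]), prompt_embedding)
        # Subtract a fraction of that noise, stepping toward the prompt.
        # Real samplers (e.g. DDPM or DDIM) use derived coefficients rather
        # than this simple 1/steps blend.
        x = x - predicted_noise / steps
    return x
```

A video diffusion model extends the same idea across a stack of frames, denoising space and time together, which is part of why temporal consistency (the stutter and flicker noted above) is so hard to get right.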

Variety of examples is key. If the dataset doesn't contain many clips of animation, then the model, lacking reference points, won't be able to generate animation of reasonable quality. (Of course, animation is a broad field, so even if the dataset did include clips of anime and hand-drawn animation, the model wouldn't necessarily generalize well to all types of animation.)

On the plus side, Gen-2 passes a superficial bias test. Generative AI models like DALL-E 2 have been found to reinforce social biases, producing images of authority figures -- a "CEO" or "director" -- that depict mostly white men. Gen-2 was a touch more diverse in the content it generated, at least in my tests.

Given the prompt "a video of a CEO walking into a conference room," Gen-2 generated videos of men and women (though more men than women) seated around conference tables. For the prompt "a video of a doctor working in an office," meanwhile, Gen-2 output an Asian woman doctor behind a desk.

Still, any prompt that included the word "nurse" turned up less promising results, consistently showing young white women. The same goes for the phrase "waiter." Clearly, Gen-2 still has work to do.

The takeaway from all of this, for me, is that Gen-2 is more of a novelty or toy than a genuinely useful tool in any video workflow. Could these outputs be edited into something more coherent? Maybe. But depending on the video, that might be more work than shooting the footage in the first place.

That's not to dismiss the technology. What Runway has done is impressive, effectively beating the tech giants to the text-to-video punch. And I'm sure some users will find uses for Gen-2 that don't require realism or much customizability. (Runway CEO Cristóbal Valenzuela recently told Bloomberg that he sees Gen-2 as a tool for artists and designers to aid their creative process.)

I tried that myself. Gen-2 does understand a range of styles, such as anime and claymation, which lend themselves to lower frame rates. With a little tweaking and editing, it wouldn't be impossible to string several clips together into a narrative piece.

To fend off deepfakes, Runway says it's using a combination of AI and human moderation to prevent users from generating videos that include pornography or violence, or that violate copyrights. I can confirm Gen-2 has a content filter -- an overzealous one, in fact. These aren't foolproof methods, though, so we'll have to see how well they work in practice.

But at least for now, filmmakers, animators, CGI artists and ethicists can rest easy. It will take at least a few iterations before Runway's technology comes close to producing cinematic-quality video -- assuming it ever gets there.
