Generating a video given the first several static frames is challenging because the model must anticipate plausible future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or to infill between the head and tail is also crucial, but both have rarely been explored for video completion. Since there could be different outcomes from the hints of just a few frames, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
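To make the three masking conditions concrete, here is a minimal sketch of how a per-frame mask could be built for each TVC case. It is an illustration only and not the MMVG implementation: the function name `tvc_frame_mask`, the `num_known` parameter, and the frame-level granularity are assumptions (the abstract states that masking operates on discretized visual tokens, which this sketch does not model).

```python
from typing import List


def tvc_frame_mask(num_frames: int, task: str, num_known: int = 1) -> List[bool]:
    """Return one flag per frame: True = masked (to generate), False = given.

    Hypothetical helper for illustration; names and granularity are assumptions.
    """
    if not 0 < num_known < num_frames:
        raise ValueError("num_known must be between 1 and num_frames - 1")

    masked = [True] * num_frames  # start with every frame masked

    if task == "prediction":
        # The first frames are given; later frames are generated.
        for t in range(num_known):
            masked[t] = False
    elif task == "rewind":
        # The last frames are given; earlier frames are generated.
        for t in range(num_frames - num_known, num_frames):
            masked[t] = False
    elif task == "infilling":
        # Head and tail are both given; the middle is generated.
        if 2 * num_known >= num_frames:
            raise ValueError("no frames left to infill")
        for t in range(num_known):
            masked[t] = False
            masked[num_frames - 1 - t] = False
    else:
        raise ValueError(f"unknown task: {task!r}")

    return masked


if __name__ == "__main__":
    for name in ("prediction", "rewind", "infilling"):
        print(name, tvc_frame_mask(5, name))
```

Under these assumptions a single model only needs a different mask at inference time to switch between the three cases, which mirrors the claim that one MMVG model covers prediction, rewind, and infilling.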
Kitchen (128²) | |||||
---|---|---|---|---|---|
First Frame | w/o Text | Text | TATS | Ours | GT |
Flintstones (128²) | |||||
---|---|---|---|---|---|
First Frame | w/o Text | Text | TATS | Ours | GT |
Fred is speaking. |
|||||
living room, talking. |
|||||
and talking about something. |
|||||
to Barney and turning his head. |
|||||
the living room and nodding. |
|||||
across a room, talking to himself. |
|||||
a window blind. |
|||||
is looking behind. |
|||||
with one hand while talking. |
|||||
car talking and laughing. |
MUGEN (128²) | |||||
---|---|---|---|---|---|
First Frame | w/o Text | Text | TATS | Ours | GT |
the right across a platform. It picks up a gem and a coin before crushing a worm. |
|||||
a gear and then collects a coin. |
|||||
then collects three coins and a gem. |
|||||
Kitchen (128²) | |||
---|---|---|---|
Last Frame | Text | Ours | GT |
Flintstones (128²) | |||
---|---|---|---|
Last Frame | Text | Ours | GT |
living room reading the paper. |
|||
the car and talking. |
|||
at the window. |
|||
the fence to have a glimpse. |
|||
in the kitchen. She turns her head then she speaks. |
MUGEN (128²) | |||
---|---|---|---|
Last Frame | Text | Ours | GT |
gets the coins. |
|||
left to right, collects a coin, and jumps over a worm. Then it jumps up. |
|||
a box and then towards a coin. |
|||
and gets the coins. |
Kitchen (128²) | ||||
---|---|---|---|---|
First Frame | Last Frame | Text | Ours | GT |
Flintstones (128²) | ||||
---|---|---|---|---|
First Frame | Last Frame | Text | Ours | GT |
to sing in a living room. |
||||
and holds a finger over her lips. |
||||
at Betty who is speaking then she turns her head. |
||||
are dancing across a room. |
MUGEN (128²) | ||||
---|---|---|---|---|
First Frame | Last Frame | Text | Ours | GT |
left. Then collects a coin and a gem. |
||||
the stage. It runs from left to right and jumps on a worm. |
||||
down the ladder and jumps up. It collects a gem. |
||||
ladder. It jumps onto a stack of boxes, drops down, and is killed by a worm. |
||||
Kitchen (128²) | ||||||
---|---|---|---|---|---|---|
K Frames | Text | K=1 | K=2 | K=3 | K=4 | GT |
Flintstones (128²) | ||||||
---|---|---|---|---|---|---|
K Frames | Text | K=1 | K=2 | K=3 | K=4 | GT |
are peeking their heads into a room. |
||||||
at Wilma while he talks to her. |
MUGEN (128²) | ||||||
---|---|---|---|---|---|---|
K Frames | Text | K=1 | K=2 | K=3 | K=4 | GT |
it collects a coin. |
||||||
a platform, and collects coins. |
||||||
the left and collects coins. |
Flintstones (128²) | ||||
---|---|---|---|---|
First Frame | Text 1 | Output 1 | Text 2 | Output 2 |
a bone beside his head. |
bone up and down beside his head. |
|||
are walking through a room. |
are speaking in a room. |
MUGEN (128²) | ||||
---|---|---|---|---|
First Frame | Text 1 | Output 1 | Text 2 | Output 2 |
to left and collects a gem. |
||||
the right. It jumps down the ground. |
the right. It jumps landing on a face. |
|||
MUGEN (128²) | ||||
---|---|---|---|---|
Last Frame | Text 1 | Output 1 | Text 2 | Output 2 |
MUGEN (128²) | |||||
---|---|---|---|---|---|
First Frame | Last Frame | Text 1 | Output 1 | Text 2 | Output 2 |
a ladder, collects coins, and then jumps down. |
a ladder, collects coins, and then jumps up and down. |
||||
left to right. |
left to right. |
||||
off the platform. |
UCF-101 (128²) | ||||
---|---|---|---|---|
BAIR (64²) | ||||
---|---|---|---|---|
UCF-101 (128²) | ||||
---|---|---|---|---|
Typing | Writing on Board | Knitting | Pull Up | Mixing |
Front Crawl | Skiing | Yo Yo | Playing Guitar | Surfing |
WebVid (384²) | ||||
---|---|---|---|---|
wash hand | cut chicken with knife |
boy writes in notebook |
sailboat on horizon |
cloudscape time-lapse |
type on laptop keyboard |
downtown city with traffic car | cook goulash soup |
wind turbine | flame and wood |
pour coffee into cup |
walk outdoors on beach |
high-speed railway link | busy city of times square |
young team busy discusses |
green sea turtle swims and relaxes | man trains on exercise machine | flags flatter on strong wind | beautiful girl swings | river flows under old stone bridge |
rotates apple lollipop | sunset on the baltic sea | woman unveils curtain | elephant shakes his head | hand of man playing guitar |
designer mixes paint | underwater of coral reef | drive on empty road |
child rides a bike |
shave his face in bathroom |
TGIF (384²) | ||||
---|---|---|---|---|
shove food in his mouth | tilt her head | hug each other | surf in the ocean | sing a song |
perform in front of audiences | skateboard down the hill | move his lips | talk and smile | stick her tongue out |
kiss each other | ice skating | snowboard on a mountain | walk | palm tree is blowing |
cut a mini pizza | swing his club | slap their hands together | flying above the clouds | run along a track |