Multi-subject consistency in AI videos has always been a challenge. While AI can easily generate models and clothing from scratch, if a client specifically requests Elon Musk as a spokesperson wearing a furry coat, AI might struggle to deliver.

Pika’s recently updated 2.0 model offers an interesting solution – users can upload multiple images, and Pika will precisely reference elements from these images to generate videos.

By combining photos of people, products, and scenes, a basic commercial can be created where everything looks just like it does in the reference photos.

Does this mean AI video consistency is solved and advertising professionals should be worried? Not quite. After testing it, we found that while Pika offers plenty of entertainment value, it isn't yet practical enough for professional use.

Musk and Altman Watch Movies, Classical Paintings Eat Fries: Crossovers Made Simple

Pika’s multi-image reference feature is called “Scene Ingredients.”

It’s simple to use:

1. Click “+” to upload images (up to six).
2. Write a short prompt in the text box.

Let’s try it out – having the feuding Musk and Altman make peace by watching a movie together.

▲ Prompt: Two people sitting in a dark theater. They’re holding popcorn, eating while fully absorbed in what they’re watching. Their wide-eyed expressions convey eager anticipation or fascination, as if completely immersed in the unfolding drama. The surroundings suggest a crowded venue, but the focus remains on their reactions.

Just uploading photos of the two is enough, as the theater setting can be described in the prompt.

Pika handles Musk fairly consistently. However, Altman looks rather unnatural, with exaggerated eating motions and unnaturally large eyes.

One interesting aspect of Pika is that materials can be “reused.”

So, we can let Musk and Altman try their hand at modeling. By uploading just one clothing image and using prompts, we can have them wear matching green coats for a fashion shoot.

▲ Prompt: Two men taking a selfie together in a magnificent winter landscape. Both wearing identical green long coats. Full-body shot showing them from head to toe. They strike professional model-like poses with confident smiles. Cinematic lighting highlights their faces and the luxurious texture of their coats. High-end fashion photography style, professional camera quality, fashion magazine aesthetics.

We used existing photos of both men, while the green coat and snowy background were generated separately by AI. The “AIGC” text on the clothing serves as an additional test for Pika.

The result maintains good consistency in the scene and coat, with the “AIGC” text still somewhat recognizable, and the models’ poses following the prompt.

However, the biggest issue is facial likeness: the faces in the video bear little resemblance to the reference photos.

Undeterred, we continue with more outfit changes.

This time, we feature Zuckerberg, first generating an AI image of a shirt with “I was human” written on it, referencing the classic robot meme.

Then, we find a photo of Zuckerberg and a ukulele image to create a musical scene.

▲ Prompt: A man in a black T-shirt stands in a cozy room, playing a ukulele. The shot begins at medium distance showing his full body, gradually moving closer to focus on the text on his T-shirt.

Pika follows the prompt well, with smooth camera movement and cleanly rendered shirt text, but the right hand, especially the thumb, isn't quite right.

Compared to Google's Veo and OpenAI's Sora, Pika's base model isn't top-tier: fix one problem and another glitch appears.

After trying realistic styles, let’s experiment with anime aesthetics. To create a crossover between Gintoki Sakata and Naruto Uzumaki, I specifically chose two images with blue sky and white cloud backgrounds.

▲ Prompt: Anime-style scene focusing on two young men’s faces against a backdrop of blue sky and white clouds. They chat while making eye contact, maintaining their original anime art styles.

The background blends naturally, and the frontal expressions work well, with hair and clothes flowing appropriately. However, the head-turning motion is terrifying: Gintoki is supposed to have his trademark dead-fish eyes, not literally rolled-back whites!

Having broken the dimensional barrier, we can also have classic paintings interact across centuries – the Mona Lisa and Girl with a Pearl Earring eating fries at McDonald’s.

▲ Prompt: Mona Lisa and Girl with a Pearl Earring dining at McDonald’s. They sit across from each other with fries on the table. They chat while enjoying their fries, captured from the side by the camera, occasionally glancing at the lens, creating a casual and friendly atmosphere.

The result is hard to describe: Da Vinci might roll over in his grave at this Mona Lisa. Both figures look pasted into the video, their heads moving bizarrely.

Sometimes, keeping things simple yields unexpectedly better results.

▲ Prompt: Close-up shot of bubbles appearing on a pond’s surface, then a coffee cup emerging from the water.

By uploading just a Starbucks image and one of Monet’s water lilies, we can create a coffee cup “emerging like a lotus from clear waters.”

Competing with Chinese Models: Lowering the Barrier to AI Video Control

To some extent, Pika has improved video controllability. But let’s not oversell it – in practice, Pika maintains good consistency with scenes, clothing, and objects, but faces tend to distort regardless of art style.

Meanwhile, Pika’s basic capabilities still need improvement, with issues in movements like eating and playing instruments. Could these problems be solved by generating more variations?

In a word: it's too expensive.

Pika 2.0 is currently only available to Pro and Fancy users, costing at least $35 per month for a subscription, with no free trial credits.

Moreover, Pro users only get 2,000 credits per month, and each video made with Scene Ingredients costs 100 credits, which works out to just 20 videos a month.

▲ Vidu interface

In fact, the Chinese AI video model Vidu implemented "multi-image reference" functionality before Pika. Better yet, it offers free trial credits.

I ran Pika's test cases through Vidu. For the Mona Lisa and Girl with a Pearl Earring eating fries, both figures still look dated, but the Mona Lisa's likeness is better than in Pika's version.

For Musk and Altman watching a movie, Musk’s face is about 70-80% accurate, while Altman’s remains disastrous.

With Gintoki and Naruto together, Vidu surprisingly generates side profiles from frontal faces, though the art style differs from the originals.

However, Vidu falls short of Pika in one respect: it allows a maximum of three uploaded images. So for the Musk and Altman fashion shoot, I uploaded only their photos and the green coat, with no background image.

Neither figure looks much like the real person. Clearly, facial stability remains a challenge.

Comparing Vidu's and Pika's results is inevitably subjective, and we pitted Pika's Pro version against Vidu's free tier, which no doubt accounts for some of the difference.

But Pika and Vidu share a similar approach – generating relatively stable objects using just a few image references and simple prompts.

In AI video generation, the most reliable way to maintain subject consistency is still LoRA: fine-tuning a model on a set of subject-specific images. Given enough samples and training, the model gradually learns the character's visual features.
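To make the LoRA idea concrete, here is a minimal sketch using the Hugging Face peft library. The ToyAttention block merely stands in for one attention layer of a real video diffusion backbone, and the to_q/to_k/to_v names are assumptions borrowed from diffusers conventions; none of this is Pika's or Vidu's actual implementation.

```python
# Minimal, illustrative LoRA sketch -- not Pika's or Vidu's pipeline.
# The toy attention block stands in for a pretrained diffusion backbone.
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model


class ToyAttention(nn.Module):
    """Stand-in for one attention block of a pretrained video model."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5), dim=-1)
        return attn @ v


model = ToyAttention()
config = LoraConfig(
    r=8,                # rank of the low-rank update: small = few new weights
    lora_alpha=16,      # scaling factor applied to the update
    target_modules=["to_q", "to_k", "to_v"],  # only these layers get adapters
)

# Freezes the base weights and injects trainable low-rank adapters;
# training on a few dozen subject images then teaches only the adapters
# that subject's visual features.
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()  # on a full-size model, adapters are typically <1% of weights
```

The key point is what gets trained: the base weights stay frozen, and only the small adapter matrices learn the subject, which is why a few dozen reference images can be enough.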

However, to make AI video more accessible and commercially viable, the entry barrier needs to be lowered. At least, Vidu and Pika show us the possibilities.

Going Viral with AI Short Videos: No Turning Back on the Path of Innovation

Just days after Pika’s 2.0 model release, overseas users have gone wild with it.

Using their own photos to generate videos in different scenes creates an “Everything Everywhere All at Once” effect.

▲ Image source: X@EladRichardson

One-click AI clothing try-ons transition seamlessly between models and outfits without changing the scene, saving production costs.

▲ Image source: X@martgent

Playing with Pika feels like using QQ Show (Tencent's old avatar dress-up service) or The Sims: we decide how to dress the characters in the video.

Helping Musk "achieve his dream" is easy: first, use other AI tools to generate a "Colonize Mars" T-shirt and a red "MAGA" cap.

Then upload these images, Mars scenes, Musk’s photo, his Optimus humanoid robot, and his beloved Doge meme dog to Pika.

▲ Prompt: A man stands on Mars’ surface wearing a black T-shirt and red cap. A dog sits on his left, and a robot stands on his right. The shot begins with a wide angle capturing the full bodies of the man, dog, and robot. As the camera smoothly zooms in, the man waves cheerfully at the camera, his expression radiating joy and adventure.

The result shows a sunny, cheerful guy flanked by his companions, endearing but not much like Musk.

Likeness aside, as long as you think creatively, the possibilities are endless.

Using our own and celebrities’ photos, we can live out fan fantasies. Uploading hats, clothes, and instruments lets us style ourselves from head to toe. Combining scenes, products, and models creates a basic commercial…

Photos + AI images + Pika 2.0 + prompts can generate plenty of entertaining scenes. This workflow also sidesteps some video model limitations: text rendering, for example, can be handled at the image generation stage instead.

Rather than competing head-on with Google’s model capabilities or rivals like Runway chasing Hollywood dreams, Pika has found its own way to excel.

In fact, Pika has always excelled at creating viral content. Their previous series of AI effects called Pikaffect went viral on Xiaohongshu and TikTok, helping Pika reach over 11 million users.

▲ AI Squeeze. Image source: Pika

By Kaiho
