Head to head: CogVideoX-5B vs Luma Ray 3.2 Image to Video

This matchup wasn’t especially close. CogVideoX-5B can sell atmosphere, but Luma Ray 3.2 Image to Video is the model that actually follows the shot list, preserves continuity, and lands the prompt’s specific beats.



CogVideoX-5B’s best argument is mood. In the foggy pier scene, it delivers the damp dawn atmosphere and even gets in the green harbor lamp, which shows it can latch onto evocative visual anchors. But that only gets you so far when the prompt is asking for a very particular action and camera move. Luma Ray 3.2 Image to Video is the one that makes the whippet in the mustard raincoat pause to sniff and lets the camera progress from behind toward the dog’s side in a way that reads like an intentional shot, not just a nice-looking clip.

The gap widens in the night market test. Luma Ray 3.2 Image to Video keeps more of the prompt alive at once: the patchwork-sweater goats, the woman with indigo baskets, the child chasing a red balloon, and the tabby cat perched on the sack of limes all show up within a coherent wet-stone bazaar. CogVideoX-5B again brings appealing surface qualities — warm light, copper-kettle texture, decent atmosphere — but it drops too many required elements and struggles to maintain temporal clarity as subjects cross and the camera advances.

That’s the real story of this head-to-head: CogVideoX-5B is better at vibes than obedience. It can produce attractive frames and occasionally nail a memorable detail, but across both tasks it behaves like a model that treats prompts as inspiration. Luma Ray 3.2 Image to Video behaves like a model that understands it has assignments to complete.

The aggregate score reflects exactly what the clips show: 17.0 for Luma Ray 3.2 Image to Video versus 13.9 for CogVideoX-5B. That margin feels earned. Final call: Luma Ray 3.2 Image to Video wins because it is more reliable on action, camera choreography, and multi-element scene continuity — not just prettier when it gets lucky.

How they were tested

We ran 2 fresh video tasks, generated on the fly for this matchup so neither model could have seen them in advance, and had gpt-5.4 score each output. In aggregate, CogVideoX-5B scored 13.9 to Luma Ray 3.2 Image to Video's 17.0.

1. Foggy pier calm

A short continuous shot on a narrow cedar pier at Blue Lantern Marsh just before sunrise: a silver-gray whippet in a mustard raincoat trots ahead, pauses to sniff the wet planks, then resumes at an unhurried pace while thin fog drifts past and tiny ripples catch the first cold blue light; the camera follows in a fluid gliding gimbal move from low behind, slowly arcing to the dog’s side as a distant green harbor lamp fades and the sky warms from slate to pale apricot, building a serene, quietly hopeful mood through motion, pacing, and evolving light, 16:9

Winner: Luma Ray 3.2 Image to Video. It better matches the prompt’s action and cinematography: the whippet in a mustard raincoat clearly pauses to sniff, and the camera progresses from behind toward the dog’s side with pleasing dawn light and fog. CogVideoX-5B has strong foggy-pier mood and includes the green harbor lamp, but its dog motion and camera movement are less aligned with the described sniffing beat and side arc.

2. Night market goats

One continuous shot through the cramped aisle of the San Telmo Lantern Bazaar at night: seven small goats in bright patchwork sweaters weave independently among hanging copper kettles, a woman with a stack of indigo baskets sidesteps left, two children hurry past chasing a red balloon, and a sleepy tabby cat springs onto a sack of limes as each subject keeps its own plausible path without colliding; the camera makes a slow forward gimbal glide at waist height, subtly panning to track the densest crossings under swaying amber bulbs and splashes of neon magenta reflected on wet stone, creating a lively, bustling mood, 16:9

Winner: Luma Ray 3.2 Image to Video. It matches more of the prompt’s key beats and holds continuity better: patchwork-sweater goats, the woman with indigo baskets, a child chasing a red balloon, and the tabby cat on the sack of limes all appear in a coherent wet-stone bazaar setting. CogVideoX-5B has strong warm lighting and copper-kettle atmosphere, but it misses several specified actions and subjects, and its temporal consistency is weaker around the crossings and the camera's forward progression.


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
