Head to head: CogVideoX-5B vs Happy Horse 1.1 Image to Video
One model showed up; the other barely produced viewable video. Across both prompt-following tests, Happy Horse 1.1 Image to Video delivered recognizable scenes, action, and mood, while CogVideoX-5B collapsed into unusable output.
This wasn’t a close contest. The aggregate score says it plainly: **16.9 to 0.5 in favor of Happy Horse 1.1 Image to Video**. More importantly, the footage backs it up. Happy Horse produced actual scenes with readable action; CogVideoX-5B mostly produced absence.
In **Saffron Bowl Spill**, CogVideoX-5B failed at the most basic level: instead of a cat knocking over a bowl of batter in a warm dawn kitchen, it looked like a uniform beige screen with no discernible scene or motion. Happy Horse, by contrast, actually staged the prompt: the cat, the bowl, the spill, the flour puff, the spoon, the blueberries, and the warm kitchen light all register on screen, with a coherent sequence of events. Some motion is a touch stylized, but stylized beats nonexistent every time.
The gap was just as stark in **Hallway Lantern Return**. CogVideoX-5B again effectively no-showed, rendering frames that were essentially black. Happy Horse delivered the blackout hallway, the child with the lantern, the toy fire engine, the grandfather, and the hanging plant, while preserving the prompt’s reassuring emotional tone. That matters: this wasn’t just object recall, but scene construction, lighting control, and temporal coherence.
What sinks CogVideoX-5B here is not subtle artifacting or slightly weak motion physics. It’s total prompt failure. You can forgive a model for imperfect continuity; you cannot forgive it for not generating the scene at all. Happy Horse 1.1 Image to Video may not be flawless, but it consistently clears the threshold that makes a video model usable.
**Final call: Happy Horse 1.1 Image to Video wins decisively. CogVideoX-5B didn’t lose on polish; it lost on basic visibility and prompt execution.**
Saffron Bowl Spill
In a sunlit apartment kitchen at blue-gold dawn, a tabby cat’s tail clips a ceramic mixing bowl and a ribbon of saffron-tinted batter sloshes over the rim, droplets arcing onto a wooden counter while a thin veil of flour puffs up and hangs in the air; the camera makes a slow lateral dolly from the espresso machine toward the spill, staying close to counter height as the liquid folds, splashes, and runs naturally around a scattered teaspoon and three blueberries, warm window light catching every particle, with a cozy but slightly chaotic mood, 16:9
Model A (CogVideoX-5B) renders what appears to be a uniform beige screen with no discernible scene or motion, failing the prompt entirely. Model B (Happy Horse 1.1 Image to Video) clearly depicts the cat, bowl, batter spill, flour puff, spoon, blueberries, warm dawn kitchen lighting, and coherent temporal progression, though some object motion and continuity are slightly stylized rather than fully natural.
Hallway Lantern Return
One continuous shot inside a narrow family hallway during a summer blackout: starting low behind a child in mismatched socks carrying a small battery lantern, the camera slowly tracks backward in front of her as she walks from the dim laundry nook toward the living room, pausing to nudge a rolling toy fire engine with her foot while her grandfather in the background reaches to steady a swaying hanging plant; soft amber lantern light and faint moonlight from a side window shape the scene, shadows shifting across framed drawings as the moment evolves with a hushed, reassuring mood, 16:9
Model A (CogVideoX-5B) appears essentially black across all sampled frames and fails to depict the prompt. Model B (Happy Horse 1.1 Image to Video) clearly matches the hallway blackout scene with the child, lantern, toy fire engine, grandfather, and hanging plant, while maintaining coherent motion, lighting, and a reassuring mood.
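Both failures described above (a uniform beige screen, frames that are essentially black) are the kind of output that could in principle be flagged automatically before any judging happens. The following is a minimal sketch of one way to do that, not the method used for this matchup: it assumes each sampled frame is already available as a flat list of `(r, g, b)` tuples, and the names `is_blank_frame`, `blank_fraction`, and the `max_std` threshold are hypothetical choices made for illustration.

```python
from statistics import pstdev


def is_blank_frame(pixels, max_std=4.0):
    """Return True when a frame is close to a single flat color.

    pixels: flat list of (r, g, b) tuples with 0-255 channel values.
    max_std: assumed threshold; a frame is "blank" when the population
             standard deviation of every channel is at or below it.
    """
    if not pixels:
        return True  # an empty frame carries no scene at all
    for channel in range(3):
        values = [p[channel] for p in pixels]
        if pstdev(values) > max_std:
            return False  # this channel varies, so something is on screen
    return True


def blank_fraction(frames, max_std=4.0):
    """Fraction of sampled frames that are flat (beige, black, or any color)."""
    if not frames:
        raise ValueError("no frames sampled")
    blank = sum(1 for frame in frames if is_blank_frame(frame, max_std))
    return blank / len(frames)


# Toy frames: two flat ones and one with strong contrast.
beige = [(222, 200, 170)] * 64
black = [(0, 0, 0)] * 64
scene = [(0, 0, 0)] * 32 + [(255, 255, 255)] * 32

print(is_blank_frame(beige), is_blank_frame(black), is_blank_frame(scene))
print(blank_fraction([beige, black, scene]))
```

Because the check looks at variation rather than brightness, it treats a beige screen and a black screen the same way. A clip whose blank fraction is at or near 1.0 would correspond to the "no discernible scene" verdicts above; the threshold would need tuning on real footage, since very dark but valid scenes also have low variation.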
Matchup powered by OpenRouter.