Head to head: CogVideoX-5B vs Happy Horse
This one wasn’t especially close. Happy Horse wins by being the more obedient, more cinematically precise model on both prompts, while CogVideoX-5B settles too often for the general vibe instead of the actual shot.
By RuntimeWire

Happy Horse takes the matchup on aggregate, 17.3 to 14.0, and the reason is straightforward: it listens better. Across both tests, it doesn’t just reproduce mood; it reproduces the requested camera grammar, staging, and scene cues with much more discipline.
In Courier launch, that gap is obvious. Happy Horse delivers the prompt’s low, lateral curbside tracking feel and packs in the details that make the shot read correctly: orange crates, manhole steam, wet blue-hour reflections, and flyers whipping behind the rider. CogVideoX-5B gets some of the energy right — there’s speed and a convincing wet-street atmosphere — but it defaults to a more generic trailing angle and leaves too many prompt-specific elements on the table.
The same pattern holds in Violinist follow. Happy Horse better understands the assignment as a backward tracking shot at chest height, with the violinist centered and actively playing while commuters and a red tram animate the wet dawn plaza around her. CogVideoX-5B again produces decent atmosphere and continuity, but it too often frames her from behind. A camera that tracks backward ahead of its subject should be looking at her face, so the rear view undercuts the composition the prompt implies and makes the result feel less intentional.
What hurts CogVideoX-5B here is not baseline image quality; it’s shot obedience. It can generate a plausible scene, but Happy Horse is the one that consistently lands the requested perspective, movement, and environmental specifics. In a head-to-head built on prompt fidelity rather than vague prettiness, that’s the difference between competitive and convincing.
Final call: Happy Horse wins, clearly. It is the more exact, more controllable video model in this matchup, and it earned both task victories on specifics rather than style alone.
How they were tested
We ran two fresh video tasks, generated on the fly for this matchup so neither model could prepare in advance, and had gpt-5.4 score each one. On aggregate, CogVideoX-5B scored 14.0 to Happy Horse’s 17.3.
1. Courier launch
A single continuous 16:9 shot of a bike courier in a teal rain shell exploding out of a narrow alley onto a glistening city street at blue hour, standing on the pedals and weaving past orange delivery crates and a puffing manhole vent, newspaper flyers whipping in the slipstream; the camera sprints low alongside from curb height in a fast lateral tracking move, briefly surging ahead then falling back as wet asphalt reflections streak and the spinning wheels throw fine spray, with natural motion blur emphasizing speed, cool neon storefront light mixed with sodium streetlamps, and a tense, electric, high-adrenaline mood.
Winner: Happy Horse. Model B (Happy Horse) matches the prompt more closely, with a low lateral tracking feel, visible orange crates, manhole steam, wet blue-hour street reflections, and flyers whipping behind the rider, while maintaining stronger realism and composition. Model A (CogVideoX-5B) has good speed and wet-street mood, but the camera angle is mostly trailing rather than curb-height lateral, and it misses several key prompt elements.
2. Violinist follow
A single continuous 16:9 shot following a street violinist in a mustard coat walking briskly through a crowded tram plaza at first light, bow arm still moving as she plays while commuters in charcoal suits stream around her and a red tram glides behind; the camera tracks backward smoothly at chest height, matching her pace to keep her centered and sharply in focus as pigeons burst up, kiosk signs and wet cobblestones drift past, and the background parallax slides continuously, lit by pale dawn sun and cool storefront fluorescents for an intimate, observant, energetic city-documentary mood.
Winner: Happy Horse. Model B (Happy Horse) better matches the prompt, with a backward tracking shot at chest height and the violinist centered and playing while commuters and a red tram move around her in a wet dawn plaza. Model A (CogVideoX-5B) has strong atmosphere and continuity, but it shows her mostly from behind rather than from the front, as a backward-tracking follow shot implies, and it feels less specific in camera behavior and scene detail.
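For readers who want to check the tally themselves, here is a minimal sketch of how the judge's anonymized verdicts roll up into the final call. It assumes the label mapping implied by the verdicts above (Model A is CogVideoX-5B, Model B is Happy Horse). The names `LABELS`, `task_winners`, and `tally_wins` are hypothetical and for illustration only, and the aggregate scores are simply the published figures, since the write-up does not describe how they were computed.

```python
from collections import Counter

# Assumed mapping, inferred from the verdicts above (not stated explicitly).
LABELS = {"Model A": "CogVideoX-5B", "Model B": "Happy Horse"}

# Winning label for each task, as reported in the judge notes.
task_winners = {"Courier launch": "Model B", "Violinist follow": "Model B"}

# Published aggregate scores; the scoring formula itself is not described.
aggregate = {"CogVideoX-5B": 14.0, "Happy Horse": 17.3}


def tally_wins(winners, labels):
    """Count task wins per model after resolving the anonymized labels."""
    return Counter(labels[label] for label in winners.values())


wins = tally_wins(task_winners, LABELS)
leader = max(aggregate, key=aggregate.get)
# Round to one decimal to avoid floating-point noise in the difference.
margin = round(aggregate[leader] - min(aggregate.values()), 1)

print(f"Task wins: {dict(wins)}")
print(f"Aggregate leader: {leader} by {margin} points")
```

Keeping the labels anonymous until the tally is a common way to stop a judge from favoring a known model name; whether that was the motivation here is an assumption.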
See every prompt and the full side-by-side outputs in the interactive Head-to-Head.