Happy Horse Trumps Veo 2 in Video Showdown

In a decisive victory, Happy Horse outshone Veo 2 in image-to-video tasks, showcasing superior prompt adherence and video quality.

An allegorical showdown between a 'Happy Horse' and 'Veo 2' represented as subjects of scientific study (Vintage scientific illustration – engraved plate from a 19th-century journal, sepia ink on cream paper)

In our head-to-head comparison, Happy Horse beat Veo 2 (Image to Video) by an aggregate score of 16.0 to 10.0, winning both tasks. The margin came largely from prompt adherence. In the 'Neon Violin Run' task, Happy Horse rendered a woman in a mustard trench coat playing a scarlet electric violin, as the prompt specified, while Veo 2 depicted a man playing a violin. The judge also credited Happy Horse with smoother motion and better temporal consistency, citing its more realistic tracking shot.

Veo 2's output was visually appealing at times but fell short on detail and prompt adherence. In the 'Monsoon Tram Circle' task, it captured the chaotic scene less accurately than Happy Horse, which kept a clear focus on the bicycle courier and the monsoon environment. On these two tasks, Happy Horse was the stronger choice for image-to-video generation in both prompt fidelity and overall video quality.

How they were tested

We ran two fresh video tasks, generated on the fly for this matchup so neither model could prepare in advance, and had Llama-4-Maverick-17B-128E-Instruct-FP8 score each output. Happy Horse scored 16.0 in aggregate; Veo 2 (Image to Video) scored 10.0.

1. Neon Violin Run

A short continuous 16:9 cinematic shot in a real-world scene: on the rain-slick service lane behind the Taelor Street produce market, a woman in a mustard trench coat runs while playing a scarlet electric violin, her bow arm flashing as she splashes through shallow puddles and glances over her shoulder; the camera starts low beside a toppled crate of green pears, then performs a smooth fast lateral tracking move that arcs into a tight front three-quarter follow as she accelerates under flickering magenta pharmacy signage, with cold blue dawn light mixing with wet neon reflections, steam drifting from a vent, and a tense but exhilarating mood.

Winner: Happy Horse. Its video adheres more closely to the prompt, showing a woman in a mustard trench coat running while playing a scarlet electric violin in a real-world scene. It also shows better motion quality, temporal consistency, and visual fidelity, with a smoother and more realistic tracking shot. Veo 2's video features a man playing a violin, which does not match the prompt's description of a woman playing a scarlet electric violin.

2. Monsoon Tram Circle

A short continuous 16:9 shot in a lively dynamic environment: at the Belsar Roundabout outside the old Orin Textile Exchange during a sudden monsoon burst, a bicycle courier in a teal helmet pedals hard between honking amber trams, umbrella-carrying commuters, barking street dogs, and a vendor trying to save spinning paper pinwheels from the rain; the camera executes a steady elevated clockwise orbit from a second-floor awning, gradually descending and pushing closer to the courier as he swerves around a stalled delivery van, while sheets of warm rain glow in late-afternoon gold, brake lights smear across puddles, laundry snaps on overhead lines, and the whole scene feels chaotic, humid, and electric.

Winner: Happy Horse. Its video more accurately captures the dynamic, chaotic scene described in the prompt, with a clear focus on the bicycle courier and the monsoon environment. Veo 2's video, while visually appealing, shows less detail and adheres less closely to the prompt.


See every prompt and the full side-by-side outputs in the interactive Head-to-Head.
