LTX-Video 2.3 vs LTX-Video 2.0 — What Actually Changed

2026-05-18 · 6 min read · LoreMotion Team

Side-by-side comparison of LTX-Video 2.3 and LTX-Video 2.0 from Lightricks. Audio, distilled inference, quality improvements, and whether the upgrade is worth it.

LTX-Video 2.3 dropped in early 2026, roughly four months after the LTX-Video 2.0 release. The changelog promises three big things: native audio generation, distilled inference (faster sampling), and improved temporal coherence at longer clip lengths. We ran both versions head-to-head on thousands of clips; this post explains what actually changed, what didn't, and whether the upgrade is worth it for self-hosters.

LoreMotion runs LTX-Video 2.3 in production today. We migrated from 2.0 about six weeks after 2.3 shipped. Here's the full evaluation we did before pulling the trigger.

TL;DR

Distilled inference

The biggest under-the-hood change in 2.3 is a properly distilled sampler. LTX-Video 2.0 used a 50-step diffusion process by default; 2.3's distilled variant produces equivalent quality in 16–20 steps.

Real-world impact on an RTX 3090 with Wan2GP profile=4:

Version                      5s clip @ 720p    8s clip @ 720p
LTX-Video 2.0                180s              290s
LTX-Video 2.3 (distilled)    72s               115s

That's a 2.5x speedup with no measurable quality loss in blind comparison. For LoreMotion this single change meant a 3090 went from ~480 clips/day to ~1,200 clips/day. The infrastructure savings were significant.
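The speedup and clips/day figures follow directly from the timings in the table. Here is a minimal sketch of that arithmetic, assuming one GPU generating 5-second clips back to back for 24 hours with no idle time or queueing overhead. The function names are ours, for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def clips_per_day(seconds_per_clip: float) -> float:
    """Upper bound on daily output for one GPU running with no idle time."""
    return SECONDS_PER_DAY / seconds_per_clip

def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new timing is than the old one."""
    return old_seconds / new_seconds

# Timings from the RTX 3090 table (seconds per clip at 720p).
ltx20_5s, ltx23_5s = 180, 72
ltx20_8s, ltx23_8s = 290, 115

print(clips_per_day(ltx20_5s))                # 480.0
print(clips_per_day(ltx23_5s))                # 1200.0
print(speedup(ltx20_5s, ltx23_5s))            # 2.5
print(round(speedup(ltx20_8s, ltx23_8s), 2))  # 2.52
```

Real throughput will land somewhat below these ceilings once model loading, uploads, and failed generations are counted.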

The non-distilled "high quality" sampler is still available in 2.3 if you want pixel-perfect output and don't mind the wait. We tested both and could not reliably tell distilled from non-distilled output in blind ranking. Use distilled.
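If you want to repeat this kind of check on your own clips, one way to put a number on "can't reliably tell apart" is an exact binomial test against chance. The sketch below assumes each trial shows a rater one distilled and one non-distilled clip and asks which is which. The tally is hypothetical and only illustrates the calculation.

```python
from math import comb

def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """P(X >= correct) for X ~ Binomial(trials, chance).

    A large value means the observed hit rate is easy to reach by guessing.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical tally (not measured data): 27 correct guesses in 50 trials.
p = p_value_at_least(27, 50)
print(f"p = {p:.3f}")
```

A p-value well above 0.05 is consistent with raters guessing. A small one suggests they can see a difference.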

Native audio

LTX-Video 2.3 generates synced ambient audio jointly with the video. This was the headline feature of the release and it works better than we expected.

What the audio includes:

What audio doesn't include:

For social media use, the audio is genuinely good. A LoreMotion clip with native LTX 2.3 audio can be posted directly to TikTok with no audio pass and it sounds natural. That's a workflow win.

For commercial post-production, you'll still want to replace the audio with proper foley + score, but having a reference track of what the model "imagined" the scene sounded like is useful for the audio engineer.

Quality improvements

Beyond speed and audio, what actually got better in the video itself?

We ran the same 50 prompts through 2.0 and 2.3 (both at int8 via Wan2GP, same seed where possible) and had five people rank outputs blind. Results:

The biggest jump is on longer clips. LTX-Video 2.0 had a known issue with quality degradation past the 5-second mark — subjects would drift, lighting would shift, motion would lose coherence. LTX-Video 2.3 holds together meaningfully better at 8 seconds. We rarely saw the dramatic mid-clip identity drift that plagued 2.0 outputs.

Multi-subject handling also improved noticeably. 2.0 frequently merged secondary subjects into the background or duplicated faces. 2.3 still struggles past three subjects but two-subject scenes are handled cleanly.

Single-subject scenes improved less. If your use case is mostly single-character portraits or product shots, the visible quality jump from 2.0 to 2.3 is modest.
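For anyone repeating the blind comparison described above, the sketch below shows one way to blind and tally it. It assumes each rater sees two unlabeled outputs per prompt ("A" and "B") and picks the better one. The prompt names and votes are placeholders that show the data flow, not our results.

```python
import random
from collections import Counter

def blind_pairs(prompts, seed=0):
    """Randomly assign each prompt's two outputs to slots 'A' and 'B'."""
    rng = random.Random(seed)  # seeded so the answer key is reproducible
    key = {}
    for prompt in prompts:
        versions = ["2.0", "2.3"]
        rng.shuffle(versions)
        key[prompt] = {"A": versions[0], "B": versions[1]}
    return key

def tally(votes, key):
    """votes: iterable of (prompt, slot) picks; returns wins per version."""
    return Counter(key[prompt][slot] for prompt, slot in votes)

# Placeholder prompts and votes, purely illustrative.
prompts = ["prompt-1", "prompt-2", "prompt-3"]
key = blind_pairs(prompts, seed=42)
votes = [("prompt-1", "A"), ("prompt-2", "B"), ("prompt-3", "A")]
wins = tally(votes, key)
print(dict(wins))
```

Keep the answer key away from raters until all votes are in, and shuffle per prompt so that slot position never hints at the version.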

Image-to-video specifics

LTX-Video 2.0's i2v mode was usable but had two known weaknesses: it sometimes ignored the reference image's colour palette, and it tended to invent dynamic motion even when the reference suggested stillness.

LTX-Video 2.3's i2v:

If you previously dismissed 2.0's i2v as unreliable, give 2.3 another chance — particularly for animating product photography where colour accuracy matters.

What didn't change

To set expectations correctly, some longstanding LTX limitations persist in 2.3:

These are the limitations to plan around when building production workflows. LTX-Video 3 will presumably address some of them; until that release, work within these constraints.

VRAM and hardware

The practical change here is zero. Both 2.0 and 2.3 run on the same hardware via Wan2GP profile=4 (int8 + block-swap) on a 24 GB consumer GPU. Cold-start time is slightly longer for 2.3 because the model is marginally larger, while inference is dramatically faster thanks to the distilled sampler.

If you have an RTX 3090 / 4090 setup running 2.0, the upgrade requires no hardware changes — just update Wan2GP and re-pull the LTX-Video 2.3 weights.

Inference code changes

The good news: most code is backward-compatible. The Wan2GP integration is essentially drop-in if you update both Wan2GP and the LTX weights.

New flags in 2.3 worth knowing about:

If you're scripting LTX inference, audit your config to make sure you're explicitly enabling audio if you want it — some default templates don't.
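A config audit can be as simple as checking that the settings you care about are set explicitly instead of inherited from a template default. The sketch below assumes your config is loaded as a plain dictionary. Every key name in it, including `"audio"`, is a hypothetical placeholder and not a confirmed Wan2GP or LTX-Video option, so substitute the real names from your own setup.

```python
def missing_explicit_settings(config: dict, required: set) -> list:
    """Return the required keys that the config does not set explicitly."""
    return sorted(key for key in required if key not in config)

# "audio" is a placeholder key name, NOT a confirmed Wan2GP/LTX option.
REQUIRED = {"audio"}

# Illustrative template; these keys are also hypothetical.
config = {"steps": 20, "resolution": "720p"}

print(missing_explicit_settings(config, REQUIRED))  # ['audio']
```

Failing the job when this list is non-empty is cheaper than discovering silent clips after a batch has finished.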

Upgrade decision

For most LTX-Video 2.0 self-hosters the upgrade calculus is straightforward:

There's no good reason to stay on 2.0 unless you've heavily tuned a specific pipeline around it and don't want to retest. The speed alone justifies the upgrade for any production user.

LoreMotion users see only 2.3 — we migrated all production traffic in February 2026 and all new clips are 2.3 generations. If you want to compare 2.3 quality without setting up local hardware, generate a free clip at LoreMotion — the first one needs no signup, no watermark.

Looking forward

A few things to watch on the LTX roadmap:

For now, LTX-Video 2.3 is the strongest open-source video model that runs on a single 24 GB GPU, and it's a clear improvement over 2.0. The audio support alone makes it the right pick for anyone whose output ends up on social platforms.