LTX-Video 2.3 vs LTX-Video 2.0 — What Actually Changed

2026-05-18 · 6 min read · LoreMotion Team

Side-by-side comparison of LTX-Video 2.3 and LTX-Video 2.0 from Lightricks. Audio, distilled inference, quality improvements, and whether the upgrade is worth it.

LTX-Video 2.3 dropped in early 2026, roughly four months after the LTX-Video 2.0 release. The changelog promises three big things: native audio generation, distilled inference (faster sampling), and improved temporal coherence at longer clip lengths. We ran both versions head-to-head on thousands of clips; this post explains what actually changed, what didn't, and whether the upgrade is worth it for self-hosters.

LoreMotion runs LTX-Video 2.3 in production today. We migrated from 2.0 about six weeks after 2.3 shipped. Here's the full evaluation we did before pulling the trigger.

TL;DR

Speed: ~2.5x faster than 2.0 at equivalent quality thanks to a distilled sampler.
Audio: New in 2.3, joint generation of synced ambient audio. Big quality-of-life win.
Quality: Modest improvement, mostly in temporal coherence and subject consistency. Not a generational leap.
VRAM: Same — both fit on 24 GB via Wan2GP profile=4.
Backward compatibility: Inference code mostly drop-in; some new config flags for audio.
Upgrade verdict: Yes, upgrade. The speed + audio improvements alone justify it.

Distilled inference

The biggest under-the-hood change in 2.3 is a properly distilled sampler. LTX-Video 2.0 used a 50-step diffusion process by default; 2.3's distilled variant produces equivalent quality in 16–20 steps.

Real-world impact on an RTX 3090 with Wan2GP profile=4:

Version	5s clip @ 720p	8s clip @ 720p
LTX-Video 2.0	180s	290s
LTX-Video 2.3 (distilled)	72s	115s

That's a 2.5x speedup with no measurable quality loss in blind comparison. For LoreMotion this single change meant a 3090 went from ~480 clips/day to ~1,200 clips/day. The infrastructure savings were significant.
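The throughput figures follow directly from the per-clip timings in the table. Here is a minimal sketch of the arithmetic, assuming a single GPU generating clips back-to-back for 24 hours with no idle time or cold starts (the helper names are ours, for illustration only):

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Per-clip generation times in seconds, taken from the table above
# (RTX 3090, Wan2GP profile=4, 720p).
TIMINGS = {
    "2.0": {"5s": 180, "8s": 290},
    "2.3-distilled": {"5s": 72, "8s": 115},
}


def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new version is for the same clip."""
    return old_seconds / new_seconds


def clips_per_day(seconds_per_clip: float) -> int:
    """Upper bound on daily output for one GPU running nonstop."""
    return int(SECONDS_PER_DAY // seconds_per_clip)


for length in ("5s", "8s"):
    old = TIMINGS["2.0"][length]
    new = TIMINGS["2.3-distilled"][length]
    print(f"{length}: {speedup(old, new):.2f}x faster, "
          f"{clips_per_day(old)} -> {clips_per_day(new)} clips/day")
```

For 5-second clips this reproduces the ~480 and ~1,200 clips/day figures quoted above. Real-world throughput will be somewhat lower once queueing gaps and restarts are factored in.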

The non-distilled "high quality" sampler is still available in 2.3 if you want the last bit of quality and don't mind the wait. In our blind rankings we could not reliably tell distilled from non-distilled output. Use distilled.

Native audio

LTX-Video 2.3 generates synced ambient audio jointly with the video. This was the headline feature of the release and it works better than we expected.

What the audio includes:

Footsteps that match on-screen walking, including pace and surface (concrete vs grass vs metal grating)
Vehicle sounds for cars, motorcycles, trains — pitch shifts plausibly with acceleration
Ambient nature (wind, water, birds) for outdoor scenes
Room tone for interiors — quiet HVAC hum, distant traffic, refrigerator buzz
Action sounds — door slams, glass breaking, fabric rustling

What audio doesn't include:

Speech. The model doesn't generate language. Mouths that move just don't sync to anything.
Music. The model doesn't generate melodic content.
Foley-quality polish. Sounds are plausible but not as crisp as a Pro Tools session.

For social media use, the audio is genuinely good. A LoreMotion clip with native LTX 2.3 audio can be posted directly to TikTok with no audio pass and it sounds natural. That's a workflow win.

For commercial post-production, you'll still want to replace the audio with proper foley + score, but having a reference track of what the model "imagined" the scene sounded like is useful for the audio engineer.

Quality improvements

Beyond speed and audio, what actually got better in the video itself?

We ran the same 50 prompts through 2.0 and 2.3 (both at int8 via Wan2GP, same seed where possible) and had five people rank outputs blind. Results:

Single-subject scenes: 2.3 won 42%, 2.0 won 28%, 30% tied.
Multi-subject scenes: 2.3 won 51%, 2.0 won 22%, 27% tied.
Long clips (8s): 2.3 won 65%, 2.0 won 18%, 17% tied.
Image-to-video: 2.3 won 38%, 2.0 won 31%, 31% tied.
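One way to read these numbers is the net preference margin: the share of comparisons 2.3 won minus the share 2.0 won. The sketch below uses only the percentages listed above; the `net_margin` helper is our own naming and not part of any evaluation toolkit.

```python
# Blind-ranking results as (2.3 wins %, 2.0 wins %, ties %), from the list above.
RESULTS = {
    "single-subject": (42, 28, 30),
    "multi-subject": (51, 22, 27),
    "long clips (8s)": (65, 18, 17),
    "image-to-video": (38, 31, 31),
}


def net_margin(new_wins: int, old_wins: int, ties: int) -> int:
    """Percentage-point lead of 2.3 over 2.0; ties count for neither side."""
    if new_wins + old_wins + ties != 100:
        raise ValueError("shares should sum to 100%")
    return new_wins - old_wins


# Sort categories from the largest to the smallest lead for 2.3.
ranked = sorted(RESULTS, key=lambda name: net_margin(*RESULTS[name]), reverse=True)
for name in ranked:
    print(f"{name}: +{net_margin(*RESULTS[name])} points for 2.3")
```

With five raters and 50 prompts these margins are indicative rather than statistically rigorous, but the ordering is clear: long clips first, then multi-subject, single-subject, and image-to-video last.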

The biggest jump is on longer clips. LTX-Video 2.0 had a known issue with quality degradation past the 5-second mark — subjects would drift, lighting would shift, motion would lose coherence. LTX-Video 2.3 holds together meaningfully better at 8 seconds. We rarely saw the dramatic mid-clip identity drift that plagued 2.0 outputs.

Multi-subject handling also improved noticeably. 2.0 frequently merged secondary subjects into the background or duplicated faces. 2.3 still struggles past three subjects but two-subject scenes are handled cleanly.

Single-subject scenes improved less. If your use case is mostly single-character portraits or product shots, the visible quality jump from 2.0 to 2.3 is modest.

Image-to-video specifics

LTX-Video 2.0's i2v mode was usable but had two known weaknesses: it sometimes ignored the reference image's colour palette, and it tended to invent dynamic motion even when the reference suggested stillness.

LTX-Video 2.3's i2v:

Colour preservation: Materially better. The output now consistently matches the reference image's grading, white balance, and saturation.
Motion appropriateness: Slightly more conservative. Less likely to invent dramatic motion from a still subject.
First-frame fidelity: Roughly equal. Both produce a first frame that closely matches the input image.

If you previously dismissed 2.0's i2v as unreliable, give 2.3 another chance — particularly for animating product photography where colour accuracy matters.

What didn't change

To set expectations correctly, some longstanding LTX limitations persist in 2.3:

Text rendering is still gibberish. Don't put readable text in prompts.
Hand anatomy still fails roughly half the time. Fingers fuse, palms warp, sometimes you get six fingers.
Compositional / spatial prompts ("X to the left of Y") are still followed only ~45–50% of the time, a lower rate than Wan 2.1 or HunyuanVideo.
Maximum clip length is still 8 seconds. The 30-second clips that closed APIs offer aren't here yet.
Native resolution is 720p. There's no 1080p or 4K mode; you'll upscale in post.

These are the limitations to plan around when building production workflows. LTX-Video 3 will presumably address some of them; until that release, work within these constraints.

VRAM and hardware

The practical change here is zero. Both 2.0 and 2.3 run on the same hardware via Wan2GP profile=4 (int8 + block-swap) on a 24 GB consumer GPU. Cold-start time is slightly longer for 2.3 because the model is marginally larger, but inference is dramatically faster thanks to the distilled sampler.

If you have an RTX 3090 / 4090 setup running 2.0, the upgrade requires no hardware changes — just update Wan2GP and re-pull the LTX-Video 2.3 weights.

Inference code changes

The good news: most code is backward-compatible. The Wan2GP integration is essentially drop-in if you update both Wan2GP and the LTX weights.

New flags in 2.3 worth knowing about:

--audio-enable / --audio-disable: toggle audio generation. Off by default in some configs; check your Wan2GP version.
--audio-strength: 0.0–1.0, controls how prominent the ambient audio is in the mix. Default 0.7 works well.
--sampler distilled / full: choose distilled (fast, default) or full (slower, slightly higher quality).

If you're scripting LTX inference, audit your config to make sure you're explicitly enabling audio if you want it — some default templates don't.
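If you build the command line from a script, it helps to set every 2.3 flag explicitly rather than rely on template defaults. The following is a hedged sketch: the flag names and the 0.0–1.0 range come from the list above, but `build_ltx_flags` is a hypothetical helper of ours, and the entry-point script you append these flags to depends on your Wan2GP install. Verify the flags against your version before relying on it.

```python
def build_ltx_flags(audio: bool = True,
                    audio_strength: float = 0.7,
                    sampler: str = "distilled") -> list[str]:
    """Return the 2.3-specific flags as an argv fragment.

    Hypothetical helper: flag names are as described in this post and
    should be checked against your installed Wan2GP version.
    """
    if sampler not in ("distilled", "full"):
        raise ValueError(f"unknown sampler: {sampler!r}")
    if not 0.0 <= audio_strength <= 1.0:
        raise ValueError("audio_strength must be between 0.0 and 1.0")

    flags = ["--sampler", sampler]
    if audio:
        # Enable audio explicitly; some default templates leave it off.
        flags += ["--audio-enable", "--audio-strength", str(audio_strength)]
    else:
        flags.append("--audio-disable")
    return flags


print(build_ltx_flags())
print(build_ltx_flags(audio=False, sampler="full"))
```

Returning a list instead of a joined string means the result can be appended directly to the argument list passed to `subprocess.run`, with no shell quoting to worry about.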

Upgrade decision

For most LTX-Video 2.0 self-hosters the upgrade calculus is straightforward:

You care about speed: Upgrade. 2.5x throughput improvement is a big deal.
You care about audio: Upgrade. The new audio mode is genuinely useful.
You generate longer clips (>5s): Upgrade. Coherence improvements are real here.
You only do short single-subject clips and want stability: You can stay on 2.0 if you want. Quality difference is minor.

There's no good reason to stay on 2.0 unless you've heavily tuned a specific pipeline around it and don't want to retest. The speed alone justifies the upgrade for any production user.

LoreMotion users see only 2.3 — we migrated all production traffic in February 2026 and all new clips are 2.3 generations. If you want to compare 2.3 quality without setting up local hardware, generate a free clip at LoreMotion — the first one needs no signup, no watermark.

Looking forward

A few things to watch on the LTX roadmap:

LTX-Video 3 is in active development at Lightricks. Public hints suggest 30-second native clip length, 1080p output, and speech generation (lip-sync). No public release date as of this writing.
Open weights for the audio adapter would be useful for the community to fine-tune for specific sound categories (sports, ASMR, music videos). Not currently available.
ControlNet-style conditioning (depth, pose, motion brush) is rumoured but not announced.

For now, LTX-Video 2.3 is the strongest open-source video model we have run on a single 24 GB GPU, and it's a clear improvement over 2.0. The audio support alone makes it the right pick for anyone whose output ends up on social platforms.