LTX-Video 2.3 vs Wan 2.1 — Which Open AI Video Model Should You Self-Host?

2026-05-20 · 6 min read · LoreMotion Team

Hands-on comparison of LTX-Video 2.3 and Wan 2.1. VRAM, generation speed, quality, prompt adherence, licensing, and a clear recommendation for which model to run.

LTX-Video 2.3 and Wan 2.1 are the two best open-source AI video models that actually fit on consumer hardware in 2026. HunyuanVideo is technically better but requires 80 GB VRAM for usable quality, putting it out of reach without cloud H100 rental. So if you want to self-host on an RTX 3090 or 4090, your real choice is between LTX-Video 2.3 (Lightricks) and Wan 2.1 (Alibaba).

We've run both extensively. LoreMotion's production service uses LTX-Video 2.3. Wan 2.1 was the model we evaluated as a potential replacement in early 2026 and decided against. This post explains exactly why, and when you should pick Wan instead.

TL;DR

Detail below.

Architecture

LTX-Video 2.3: 22B-parameter diffusion transformer, distilled from a larger teacher model. Native support for text-to-video (t2v), image-to-video (i2v), and synced ambient audio generation. Output up to 8 seconds at 720p.

Wan 2.1: 14B-parameter DiT (Diffusion Transformer), with a smaller 1.3B "lite" variant for low-VRAM setups. Supports t2v and i2v. Output up to 5 seconds at 720p natively. No audio generation.

LTX has more parameters; Wan has Alibaba's substantial training infrastructure and a curated 100M+ clip dataset. The architectures are conceptually similar: both are DiTs with a text encoder, a transformer backbone, and a VAE. The differences lie in training data, scale, and whether audio is trained jointly with video.

VRAM and hardware

LTX-Video 2.3:

Wan 2.1 (14B flagship):

Wan 2.1 fits cleanly on 24 GB without the int8 + block-swap dance. LTX needs the quantisation framework to fit, and even with it the model uses more aggressive memory management. In practice this means:

If you're starting and stopping the model frequently (rare in production, common in development), Wan's cleaner memory footprint matters. For long-running inference workers it's a wash.

Generation speed

Wall-clock for a 5-second 720p clip on the same RTX 3090:

Model | Time
LTX-Video 2.3 (Wan2GP int8) | 72s
Wan 2.1 14B (bf16) | 110s
Wan 2.1 1.3B (bf16) | 55s

LTX takes about 35% less time per clip than Wan 14B (72s vs 110s). The 1.3B Wan variant is faster than both but produces visibly lower-quality output, which makes it useful for prototyping but not for production.

On an RTX 4090 the absolute gap narrows (24s instead of 38s) but LTX still leads: 44s vs 68s, which is again roughly 35% less time per clip.
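
If you want to reproduce wall-clock numbers like these on your own GPU, a small harness along the following lines is one way to do it. This is a minimal sketch, not the harness we used: `time_generation` and `fake_generate` are hypothetical names, and `fake_generate` only sleeps, so you would swap in your own model call.

```python
import statistics
import time
from typing import Callable


def time_generation(generate: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds of generate() over `runs` timed calls."""
    for _ in range(warmup):
        generate()  # warm-up calls absorb one-off costs such as weight loading
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_generate() -> None:
    """Hypothetical stand-in for a real model call; it sleeps instead of rendering."""
    time.sleep(0.01)


median_s = time_generation(fake_generate, runs=3)
print(f"median wall-clock: {median_s:.3f}s")
```

Reporting the median rather than the mean keeps a single slow run (a cold cache, a background process) from skewing the comparison.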

Output quality (subjective)

We ran 100 prompts through each model and had five people rank outputs blind. Headline result: LTX-Video 2.3 won 51% of pairs, Wan 2.1 won 35%, 14% were rated equal.
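
To read the headline split another way, the sketch below tallies pairwise judgements and also reports the share of decided (non-tied) pairs. The list of 100 judgements is synthetic, rebuilt from the published percentages purely for illustration; `summarise_pairs` and the `"ltx"` / `"wan"` / `"tie"` labels are assumptions, not our actual tooling.

```python
from collections import Counter


def summarise_pairs(judgements: list[str]) -> dict[str, float]:
    """Summarise blind pairwise judgements labelled 'ltx', 'wan', or 'tie'."""
    counts = Counter(judgements)
    total = len(judgements)
    decided = counts["ltx"] + counts["wan"]  # pairs where a rater picked a winner
    return {
        "ltx_share": counts["ltx"] / total,
        "wan_share": counts["wan"] / total,
        "tie_share": counts["tie"] / total,
        "ltx_share_of_decided": counts["ltx"] / decided,
    }


# Synthetic judgements matching the 51% / 35% / 14% split reported above.
judgements = ["ltx"] * 51 + ["wan"] * 35 + ["tie"] * 14
summary = summarise_pairs(judgements)
print(f"LTX share of decided pairs: {summary['ltx_share_of_decided']:.1%}")
```

Setting ties aside, LTX takes 51 of 86 decided pairs, or about 59%.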

A breakdown by category:

The composition-heavy categories are where Wan 2.1 genuinely surprises. If your prompts are spatial ("a wide shot of three figures arranged in a triangle"), Wan executes them more reliably than LTX.

Prompt adherence specifics

Subject count is a useful diagnostic. We prompted each model for "two children", "three children", and "five children" 20 times each and counted the clear, distinct figures in the output:

Prompt | LTX correct count | Wan correct count
Two children | 18/20 | 19/20
Three children | 13/20 | 16/20
Five children | 4/20 | 7/20

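Wan is better at count adherence, and the gap opens up as the count rises: roughly level at two subjects, then three more correct outputs out of 20 at both three and five subjects. This matters for product photography (multi-pack shots), group scenes, and anything where the prompt specifies an exact number.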
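
The table is easier to compare as rates. The sketch below converts the counts above into per-prompt accuracy and a pooled total; the variable and function names are illustrative only.

```python
# Correct-count results from the table above (out of 20 generations each).
TRIALS = 20
results = {
    "two children": {"ltx": 18, "wan": 19},
    "three children": {"ltx": 13, "wan": 16},
    "five children": {"ltx": 4, "wan": 7},
}


def accuracy_rows(results: dict, trials: int) -> list[tuple[str, float, float, float]]:
    """Return (prompt, ltx_rate, wan_rate, wan_minus_ltx) for each prompt."""
    rows = []
    for prompt, hits in results.items():
        gap = (hits["wan"] - hits["ltx"]) / trials
        rows.append((prompt, hits["ltx"] / trials, hits["wan"] / trials, gap))
    return rows


rows = accuracy_rows(results, TRIALS)
for prompt, ltx, wan, gap in rows:
    print(f"{prompt:<15} LTX {ltx:.0%}  Wan {wan:.0%}  gap {gap:+.0%}")

ltx_total = sum(hits["ltx"] for hits in results.values())
wan_total = sum(hits["wan"] for hits in results.values())
print(f"pooled: LTX {ltx_total}/{3 * TRIALS}, Wan {wan_total}/{3 * TRIALS}")
```

Pooled across the three prompts, that is 35/60 for LTX and 42/60 for Wan.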

Audio

LTX-Video 2.3: Native synced ambient audio. Footsteps, wind, water, room tone, vehicle sounds all generated jointly with video and matched to on-screen action.

Wan 2.1: Silent. You'll need to add audio in post-production or via a separate model.

For social media use, this is a real differentiator. LTX clips are usable as-posted on TikTok / Reels / Shorts. Wan clips need an audio pass first.

Licensing

LTX-Video 2.3: OpenRAIL-M licence. Commercial use allowed but with explicit content restrictions (sexual content, harassment, weapons, etc.). Most commercial work is fine. Edge cases (adult production, certain political imagery, some controversial historical content) are not.

Wan 2.1: Apache 2.0. Genuinely permissive. No content restrictions in the licence itself. Use it for whatever, including content categories LTX prohibits.

For commercial client work where the licence chain matters (you may need to indemnify clients), Wan 2.1's Apache 2.0 is materially cleaner than LTX's OpenRAIL-M. This is the single strongest argument for choosing Wan over LTX for production use.

Image-to-video

Both models support i2v mode. We tested with the same 30 reference images:

For animating product photography where the product must look identical to the source image, Wan's stricter preservation is helpful. For animating creative artwork where you want the AI to add interpretation, LTX's more dynamic motion is better.

Why LoreMotion uses LTX-Video 2.3

The decision came down to four factors:

  1. Speed. 72s vs 110s per clip is a real cost difference at scale. Running 24h a day on a 3090, LTX manages 50 clips/h (about 1,200 a day) against roughly 33 clips/h (about 785 a day) for Wan 14B, which is ~53% more clips per day.
  2. Audio. LoreMotion users post to social platforms. Audio matters.
  3. Free tier feasibility. The speed advantage means we can offer LTX free at sustainable per-clip cost. We couldn't offer Wan free at the same generosity.
  4. i2v dynamism. Our users tend to want their animations to have energy. LTX's more dynamic motion fits the use case.
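
The throughput arithmetic behind the speed factor can be worked through from the 72s and 110s RTX 3090 timings. The sketch assumes a single worker generating clips back to back with no idle time, which is a simplification.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def clips_per_day(seconds_per_clip: float) -> float:
    """Clips one always-busy worker can produce in 24 hours."""
    return SECONDS_PER_DAY / seconds_per_clip


ltx_daily = clips_per_day(72)    # LTX-Video 2.3 on an RTX 3090
wan_daily = clips_per_day(110)   # Wan 2.1 14B on an RTX 3090
advantage = ltx_daily / wan_daily - 1

print(f"LTX: {ltx_daily:.0f} clips/day, Wan 14B: {wan_daily:.0f} clips/day, "
      f"LTX advantage: {advantage:.0%}")
```

Note the asymmetry: taking about 35% less time per clip translates into about 53% more clips per day, because throughput scales with the inverse of clip time.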

If we were running a B2B product where licence cleanliness or spatial prompt adherence mattered more than speed, we'd seriously consider Wan 2.1 instead.

Which one to pick

Both models are good. The gap between them is real but narrow — much narrower than the gap between either of them and HunyuanVideo (which is better but needs 80 GB). Pick based on the specific factor that matters most for your use case.
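
One way to make "pick based on the factor that matters most" concrete is to list your needs and see which model they point to. The sketch below encodes only the trade-offs described in this post; the need labels and the `recommend` function are made up for illustration, and the unweighted tally is deliberately crude.

```python
# Factors where each model came out ahead in this comparison (labels are illustrative).
LTX_STRENGTHS = {"speed", "audio", "dynamic_i2v"}
WAN_STRENGTHS = {
    "permissive_licence",
    "exact_counts",
    "spatial_composition",
    "strict_i2v_preservation",
}


def recommend(needs: set[str]) -> str:
    """Return the model favoured by more of the listed needs."""
    unknown = needs - LTX_STRENGTHS - WAN_STRENGTHS
    if unknown:
        raise ValueError(f"unknown needs: {sorted(unknown)}")
    ltx_votes = len(needs & LTX_STRENGTHS)
    wan_votes = len(needs & WAN_STRENGTHS)
    if ltx_votes > wan_votes:
        return "LTX-Video 2.3"
    if wan_votes > ltx_votes:
        return "Wan 2.1"
    return "either"


print(recommend({"audio", "speed"}))
print(recommend({"permissive_licence", "exact_counts"}))
```

If one factor is a hard requirement (for example, an Apache 2.0 licence for client work), treat it as a filter first rather than as one vote among several.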

You can try LTX-Video 2.3 in your browser with no signup at loremotion.com/generate. For Wan 2.1, the cleanest self-hosting path is the official Wan-Video/Wan2.1 repository on GitHub plus a 24 GB GPU on RunPod Community.