The Future of Open-Source AI Video — What's Coming After LTX-Video 2.3, Wan, and Hunyuan

2026-05-28 · 7 min read · LoreMotion Team

A grounded look at the next 12 months of open-source AI video models. What's in training, who's shipping, and which architectures will actually matter for self-hosters and indie developers.

Open-source AI video moved faster in the last 12 months than anyone predicted. A year ago, the best you could run on a single 24GB card was a 2-second 480p clip from a stitched-together pipeline. Today LTX-Video 2.3 produces 5-second 720p output in 70 seconds on the same hardware, and Wan 2.1 cracks 1080p with the right offloading.

So what's next? We talk to model authors, watch the research literature, and run early checkpoints in our own infrastructure. Here's our honest read on the next 12 months, separated into "almost certainly happening," "probably happening," and "do not bet the company on this."

Almost certainly happening (next 6 months)

LTX-Video 3.0

Lightricks has been telegraphing the next major version for a while. From conversations at conferences and what's visible in their public commits, expect:

Native 1080p output without upscaling tricks. The current 720p ceiling is an artefact of training data, not architecture.
10–15 second clips with maintained motion coherence. Their internal demos at NeurIPS already showed 12-second outputs without the "personality drift" current models suffer from past ~6 seconds.
A 28B-parameter variant alongside the current 22B. It trades more VRAM for substantially better complex-scene handling.
Improved text encoder — likely replacing T5-XXL with a Gemma-based encoder, similar to what Imagen 3 did. Better prompt understanding, especially for negation and spatial relationships.

The hard question for self-hosters: will it still fit on 24GB? Probably yes for the 22B variant with int8 + block-swap, no for the 28B. We're already speccing cards with 32GB or more (RTX 5090, RTX 6000 Ada) for our next worker generation.
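To make the 24GB question concrete, here is a back-of-envelope sketch of the arithmetic. The overhead allowance and the fraction of weights kept resident under block-swap are illustrative assumptions, not measured values, and the function names are our own for this example.

```typescript
// Rough VRAM estimate. All overhead figures below are assumptions for
// illustration, not measurements from any real model.
type Precision = "fp16" | "int8" | "int4";

const BYTES_PER_PARAM: Record<Precision, number> = {
  fp16: 2,
  int8: 1,
  int4: 0.5,
};

/** Weight footprint in GB (1 GB = 1e9 bytes) for a model with `paramsBillions` parameters. */
function weightGB(paramsBillions: number, precision: Precision): number {
  return paramsBillions * BYTES_PER_PARAM[precision];
}

/**
 * Whether a model fits on a card.
 * residentFraction: share of weights kept on the GPU (1.0 means no block-swap;
 *                   the rest is assumed to sit in system RAM).
 * overheadGB:       assumed allowance for activations, VAE, and text encoder.
 */
function fitsOnCard(
  paramsBillions: number,
  precision: Precision,
  cardGB: number,
  residentFraction: number,
  overheadGB: number,
): boolean {
  const resident = weightGB(paramsBillions, precision) * residentFraction;
  return resident + overheadGB <= cardGB;
}

// Assumed: 6 GB overhead, 75% of weights resident when block-swapping.
console.log(fitsOnCard(22, "int8", 24, 1.0, 6));  // false: 22 + 6 = 28 GB
console.log(fitsOnCard(22, "int8", 24, 0.75, 6)); // true: 16.5 + 6 = 22.5 GB
console.log(fitsOnCard(28, "int8", 24, 0.75, 6)); // false: 21 + 6 = 27 GB
console.log(fitsOnCard(28, "int8", 32, 0.75, 6)); // true: 27 GB fits in 32 GB
```

Swapping more blocks out would let a larger model fit on paper, but every swapped block costs transfer time per step, so the resident fraction is a speed trade-off as much as a memory one.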

Hunyuan Video 2

Tencent's first version was uneven: beautiful aesthetic, terrible motion. The team has been very open about what's coming in version 2:

Motion-specific training data at scale, addressing the limp-physics problem.
Image-to-video and start-end-frame interpolation as first-class features (currently bolt-on).
A meaningful drop in VRAM requirements — they claim 24GB for 720p output, down from 40GB+ in v1.

Hunyuan 2 will be the one to watch for cinematic quality on consumer hardware, if Tencent ships on its stated schedule (sometime in Q3 2026).

A "small fast" model worth taking seriously

CogVideoX-5B is the closest thing today: fast, small, runs anywhere, but quality is well behind the frontier. Several teams are working on the equivalent of "SDXL-Turbo for video" — a distilled model that produces 80% of the quality at 10% of the cost.

We've seen one promising early checkpoint (we won't say from whom) that generates a 3-second 480p clip in under 5 seconds on an RTX 3090. If something like this ships, it changes the economics of free-tier AI video tools entirely.

Probably happening (6–12 months)

Multi-modal conditioning becomes standard

Today most open models accept text and (sometimes) a single starting image. Real production video pipelines need more — pose conditioning, depth maps, segmentation masks, multiple reference images, audio-driven generation. The closed models (Veo, Sora, Kling) all have some of these. The open ecosystem will catch up in 2026, probably via the ControlNet-style pattern that worked so well for image generation.

The risk: VRAM. Every additional conditioning input costs memory. We could end up with capable open models that require A100-class hardware to actually use the full feature set.

Longer-clip generation through latent extension

The fundamental limit on clip length right now isn't compute, it's training. Models are trained on 5-second clips and lose coherence past that. Several research papers in late 2025 showed promising approaches to "extending" outputs at inference time — generating clip N+1 conditioned on the last frames of clip N, with techniques to maintain identity.

If one of these techniques makes it into a mainstream open model, the practical maximum jumps from 5–10 seconds to 30+. That unlocks entirely different use cases (short ads, music video segments, explainer clips).
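The chaining idea described above can be sketched as a simple loop. This is a minimal illustration only: `Frame`, `Clip`, `GenerateClip`, and `extendClip` are hypothetical names rather than any real model's API, and the identity-preservation techniques from the papers are not represented here.

```typescript
// Sketch only: these types stand in for whatever a real pipeline would use.
type Frame = { index: number };
type Clip = Frame[];

// Hypothetical generator: returns only NEW frames, conditioned on prior frames.
type GenerateClip = (prompt: string, conditioning: Frame[]) => Clip;

/** Generate `segments` clips in sequence, feeding the last `overlap` frames forward. */
function extendClip(
  generate: GenerateClip,
  prompt: string,
  segments: number,
  overlap: number,
): Clip {
  const result: Frame[] = [];
  let conditioning: Frame[] = [];
  for (let i = 0; i < segments; i++) {
    const clip = generate(prompt, conditioning);
    result.push(...clip);
    // slice(-0) would return the whole clip, so handle overlap = 0 explicitly.
    conditioning = overlap > 0 ? clip.slice(-overlap) : [];
  }
  return result;
}

// Stand-in generator for demonstration: emits 5 numbered frames that
// continue from the last conditioning frame.
const fakeGenerate: GenerateClip = (_prompt, conditioning) => {
  const last = conditioning[conditioning.length - 1];
  const start = last ? last.index + 1 : 0;
  return Array.from({ length: 5 }, (_, i) => ({ index: start + i }));
};

const extended = extendClip(fakeGenerate, "a fox running through snow", 3, 2);
console.log(extended.length); // 15
```

The loop itself is trivial; the hard part, which this sketch leaves out, is keeping subject identity and motion stable across the segment boundaries.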

Audio-aware video generation

Sora 2 and Veo 3 both generate synchronised audio (voices, ambient sound, music). No open model does this today. There are independent efforts to bolt audio generation onto existing video pipelines, but the results are nowhere near those of integrated systems.

Realistically, we expect one of the big Chinese labs (Tencent, Alibaba, ByteDance) to ship the first open audio-video integrated model in 2026. They have the data, the compute, and a willingness to open-source flagship work, something US labs have stepped back from.

Do not bet the company on this (12+ months out)

True real-time generation

People keep predicting this — "we'll have real-time text-to-video soon!" — and the timeline keeps slipping. The fundamental issue is that diffusion is iterative; you need 20–50 denoising steps for current architectures to produce quality output. Each step on a frontier model takes 100ms+ on the best hardware. You don't get to real-time without either (a) drastically fewer steps via consistency models / step distillation, or (b) a non-diffusion architecture entirely.

Both are active research directions. Neither is close to mainstream. We'd guess 2027 at the earliest for real-time text-to-video on consumer hardware, and that's optimistic.
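For a sense of scale, take the figure from the start of this post: a 5-second clip in about 70 seconds on a 24GB card. The sketch below works out how far that is from real time. The 40-step starting point is an assumed value inside the 20–50 range, and the step budget assumes generation time scales linearly with step count, which ignores fixed costs such as text encoding and VAE decoding.

```typescript
/** How many times slower than real time generation is (1.0 = real time). */
function realTimeFactor(generationSeconds: number, clipSeconds: number): number {
  return generationSeconds / clipSeconds;
}

/**
 * Steps left after cutting generation time by `factor`, assuming time is
 * linear in step count (a simplification, see the note above).
 */
function stepBudget(currentSteps: number, factor: number): number {
  return Math.floor(currentSteps / factor);
}

const slowdown = realTimeFactor(70, 5);       // 14x slower than real time
const stepsLeft = stepBudget(40, slowdown);   // assumed 40-step baseline
console.log(slowdown, stepsLeft);             // 14 2
```

Under those assumptions, step reduction alone would have to get from dozens of steps down to roughly two with no loss of quality, which is why option (a) is still a research problem and not a tuning exercise.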

Open-source models that match Veo 3.1 or Sora 2

The gap between the best open model (currently LTX-Video 2.3 or Wan 2.1, depending on what you value) and the best closed model (Veo 3.1, Sora 2) is large and not closing as fast as the equivalent gap did in image generation. There are structural reasons:

Video training data is much harder to source legally at scale than image data.
Compute requirements are 10–20x higher than image training, putting it out of reach of smaller research labs.
The closed labs are investing heavily precisely because they see video as the next moat.

Could an open model match Veo 3.1 in 2026? Maybe. Could it match whatever Veo 4 looks like at the end of 2026? Almost certainly not. The gap is more likely to grow than shrink for at least 18 months.

Fully on-device generation

Phones and laptops generating AI video locally is a popular vision. The reality: even with aggressive quantisation, current video models need more memory than even M4 Max MacBooks can spare alongside general workloads. Apple silicon will get there eventually, but the gating factor is memory bandwidth, not raw compute, and bandwidth improves slowly.

For mobile specifically: 2027–2028 for usable on-device 480p generation. Earlier than that for slideshow-style "image with light motion" outputs that don't count as real video.

What this means for builders

If you're building anything that depends on AI video, here's how to think about the roadmap:

Don't lock yourself into one model. The frontier moves every quarter. Architect your system so swapping the model is a configuration change, not a refactor. (We have a single pricing.ts file with all model definitions for exactly this reason.)
Plan VRAM ahead. If you're buying hardware today, get 32GB or wait for 48GB consumer cards. The 24GB ceiling is going to bite within 12 months.
Get comfortable with quantisation. It's the only way to keep pace with the model size race on commodity hardware. The Wan2GP ecosystem is the current gold standard; learn it.
Don't wait for "the perfect open model" to launch. Whatever you can do today with LTX 2.3 or Wan 2.1 is already useful and lets you build muscle for the better tools that arrive. The teams that ship now will have a year of advantage when the next-gen models drop.
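As an illustration of the "configuration change, not a refactor" point, here is a minimal sketch of a model registry. It is not the contents of our pricing.ts: the field names, the `ACTIVE_MODEL_ID` constant, and the helper functions are hypothetical, and the Wan 2.1 clip length is a placeholder.

```typescript
// Hypothetical registry shape; field names are illustrative only.
interface ModelDefinition {
  id: string;
  displayName: string;
  maxSeconds: number; // longest clip the model is allowed to produce
  maxHeight: number;  // vertical resolution ceiling in pixels
  minVramGB: number;  // smallest card we would schedule it on
}

const MODELS: Record<string, ModelDefinition> = {
  "ltx-video-2.3": {
    id: "ltx-video-2.3",
    displayName: "LTX-Video 2.3",
    maxSeconds: 5,
    maxHeight: 720,
    minVramGB: 24,
  },
  "wan-2.1": {
    id: "wan-2.1",
    displayName: "Wan 2.1",
    maxSeconds: 5, // placeholder value, not a published limit
    maxHeight: 1080,
    minVramGB: 24,
  },
};

// Swapping models means changing this one value.
const ACTIVE_MODEL_ID = "ltx-video-2.3";

function resolveModel(id: string): ModelDefinition {
  const model = MODELS[id];
  if (!model) throw new Error(`Unknown model: ${id}`);
  return model;
}

/** Returns a list of problems with a request; an empty list means it is valid. */
function validateRequest(model: ModelDefinition, seconds: number, height: number): string[] {
  const problems: string[] = [];
  if (seconds > model.maxSeconds) problems.push(`max clip length is ${model.maxSeconds}s`);
  if (height > model.maxHeight) problems.push(`max height is ${model.maxHeight}p`);
  return problems;
}

const active = resolveModel(ACTIVE_MODEL_ID);
console.log(validateRequest(active, 5, 1080)); // [ 'max height is 720p' ]
```

Because callers only ever see `ModelDefinition`, adding a new model is one more registry entry, and request validation picks up its limits without further changes.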

What we're betting on at LoreMotion

For internal planning:

Q3 2026: migrate to LTX-Video 3.0 the moment it's stable. We expect a 30–40% throughput improvement on the same hardware.
Q4 2026: add Hunyuan 2 as a premium option once VRAM requirements are confirmed manageable.
Hardware: when we expand our worker fleet, we'll buy RTX 5090s (32GB) over more 3090s, even though the per-clip cost is higher. We need the headroom for whatever ships next.
What we're NOT doing: building anything that assumes audio-video integration or 30-second clip outputs. Both are coming but the timing is too uncertain to plan around.

The open-source AI video ecosystem is in roughly the same place open LLMs were in mid-2023 — capable, fast-moving, full of energy, but with real gaps versus the closed alternatives. The same play that worked then will work now: build with what's available today, watch the field weekly, and be ready to swap components quarterly.

Try the current generation free

If you haven't used a current open-source model lately, the easiest way to see the state of the art is to generate a free clip on LoreMotion. We run LTX-Video 2.3 on our free tier with no signup required. It's not Veo 3.1, but it's much closer than most people expect.

For deeper dives on individual models, see our LTX vs Wan vs Veo comparison and open-source video model roundup.