How We Moderate an Open-Source AI Video Generator

2026-05-30 · 8 min read · LoreMotion Team

A behind-the-scenes look at content moderation for open-source AI video models — why hosted APIs get safety for free, why self-hosted models don't, and the layered system we built to keep LoreMotion safe without ruining it for creators.

When you call a hosted video API like Veo or Kling, content moderation comes bundled in. The provider screens your prompt, screens the output, and silently rejects anything that violates their policy before a single frame reaches you. You never see the machinery, but it is always running.

Open-source models give you none of that. The weights that produce beautiful, controllable video are the same weights that will cheerfully produce things that are illegal, abusive, or that get your storage account and ad partnerships terminated overnight. The model does not know or care. When you run LTX-Video, Wan, or HunyuanVideo yourself, you are the safety layer. There is no one else.

LoreMotion runs open models on our own GPUs and offers them free, with no signup required for the first generation. That combination — free, anonymous, open-weight — is exactly the threat profile abuse actors look for. This post is an honest account of how we handle it: what we block, what we deliberately allow, and a specific abuse vector we closed recently that neither of our image checks could catch.

Why this is harder than it sounds

The naive approach to moderation is a keyword blocklist. It fails almost immediately, for two opposite reasons.

It blocks too much. A creative tool that rejects the word "blood" is useless for horror filmmakers. One that rejects "girl" breaks some of the most common phrasing in real prompts, where adult women are routinely described with wording like "a girl in a red dress." Over-filtering doesn't make a tool safe; it makes it worthless, pushing legitimate users away while doing nothing to stop a determined bad actor.

It blocks too little. Any naive filter is trivially defeated. Attackers space out letters (n a k e d), swap digits for letters (n4ked), inject zero-width characters between every character, or run a banned prompt through an LLM that rewrites it into flowery, "safe-looking" prose that means exactly the same thing. A filter that matches the literal word and nothing else is security theater.

So the real job is not "build a blocklist." It is to build a layered system where each layer catches what the others miss, tuned so that the false-positive rate stays low enough that real creators never notice it.

Layer one: the prompt gate

The cheapest possible defense is to reject a bad request before it ever reaches a GPU. Rendering costs money and time; a string comparison costs microseconds. So the first thing every prompt hits is a text gate at job-creation time.

Before any pattern runs, we normalize the prompt aggressively. That means Unicode normalization, stripping invisible and control characters (the zero-width-space trick), lowercasing, collapsing repeated letters, undoing common leetspeak substitutions, and de-spacing single letters that have been spread out to dodge word boundaries. Crucially, we run every rule against multiple normalized forms of the same prompt, because some attacks only show up after one transformation and others get destroyed by it. A prompt has to look clean in all of them to pass.
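
To make the idea of "multiple normalized forms" concrete, here is a minimal Python sketch. It is not our production code: the function names, the small leetspeak map, and the placeholder banned word "forbidden" are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny leetspeak map (illustration only).
LEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def base_form(prompt: str) -> str:
    """NFKC-normalize, drop invisible/control characters, lowercase."""
    text = unicodedata.normalize("NFKC", prompt)
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")  # keep word boundaries (tabs, newlines) as spaces
        elif unicodedata.category(ch) not in ("Cf", "Cc"):
            kept.append(ch)   # Cf covers zero-width characters such as U+200B
    return "".join(kept).lower()

def collapse_repeats(text: str) -> str:
    """Collapse runs of the same letter: 'baaad' becomes 'bad'."""
    return re.sub(r"([a-z])\1+", r"\1", text)

def despace_single_letters(text: str) -> str:
    """Join runs of three or more single characters: 'b a d' becomes 'bad'."""
    return re.sub(r"\b(?:\w ){2,}\w\b",
                  lambda m: m.group(0).replace(" ", ""), text)

def normalized_forms(prompt: str) -> set[str]:
    """Return several normalized variants; every rule runs against each one."""
    base = base_form(prompt)
    unleet = base.translate(LEET)
    forms = {base, unleet}
    for text in (base, unleet):
        joined = despace_single_letters(text)
        forms.update({joined, collapse_repeats(joined)})
    return forms

def passes(prompt: str, patterns: list[re.Pattern]) -> bool:
    """A prompt passes only if no pattern matches any normalized form."""
    return not any(p.search(form)
                   for form in normalized_forms(prompt)
                   for p in patterns)
```

With a placeholder rule such as `re.compile(r"\bforbidden\b")`, the inputs "f0rb1dden", "f o r b i d d e n", and a copy with a zero-width space in the middle are all rejected, while "a calm lake at sunrise" passes. The sketch also shows why a single normalized form is not enough: collapsing repeated letters turns "forbidden" into "forbiden", and undoing leetspeak turns an innocent "4k" into "ak", so each transformation can hide a match that another form still exposes.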

The patterns themselves are tiered by severity. The top tier, content that is illegal or that exposes us to immediate legal risk, is a hard block with no exceptions. The most important rule here is a proximity rule: any minor-referencing token appearing near any sexual token, in either order, within a generous window, is rejected outright. The window is wide on purpose. A narrow window of, say, 80 characters is defeated by any LLM rewriter that pads the sentence; a wide window is not. That width costs essentially nothing in false positives, because legitimate prompts do not place those two concepts near each other.
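
A proximity rule of this kind can be sketched in a few lines of Python. Everything specific here is an assumption: the two vocabularies are neutral placeholders (the real token lists are deliberately not shown), and the 400-character window is a made-up value chosen only to contrast with a narrow one.

```python
import re

# Placeholder vocabularies standing in for the two token categories.
GROUP_A = {"alpha", "alfa"}
GROUP_B = {"beta", "bravo"}
WINDOW = 400  # characters; hypothetical value, wide on purpose

def _positions(text: str, vocab: set[str]) -> list[int]:
    """Start offsets of every whole-word vocabulary hit in the text."""
    alternation = "|".join(re.escape(word) for word in sorted(vocab))
    pattern = re.compile(rf"\b(?:{alternation})\b")
    return [m.start() for m in pattern.finditer(text)]

def violates_proximity(text: str, window: int = WINDOW) -> bool:
    """True if any GROUP_A token lies within `window` characters of any
    GROUP_B token, regardless of which one comes first."""
    a_hits = _positions(text, GROUP_A)
    b_hits = _positions(text, GROUP_B)
    return any(abs(a - b) <= window for a in a_hits for b in b_hits)
```

Padding the gap between the two tokens with about 200 characters of filler slips past `violates_proximity(text, window=80)` but is still caught at the default width, which is the reason for keeping the window generous. In practice a check like this would run on the normalized prompt, not the raw one.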

Below that sits an adult-content tier. This is where the judgment calls live, and where we make a deliberate product decision that a lot of tools get wrong.

What we deliberately allow

LoreMotion is a creative tool, and creative tools have to permit things that are edgy without being abusive. We do not hard-block gore, profanity, generic adult themes, political content, satire, or copyrighted character names. We do not treat "girl" or "boy" as age tokens, because doing so breaks ordinary prompts. And we allow ordinary swimwear in ordinary contexts — "a woman in a bikini on the beach" is a normal request that the vast majority of beach, travel, and fashion content depends on.

That last carve-out matters for the rest of this story, because it is exactly the gap an abuse vector tried to crawl through.

Layers two and three: looking at the pixels

Text gates only see text. They cannot tell whether the model actually rendered something harmful, and they cannot see an uploaded image at all. So two more layers operate on actual pixels using a vision model.

For image-to-video, we screen the uploaded reference image before queueing the job. If you upload something that already violates policy, you get an instant rejection instead of waiting for a render.

After a render completes, we screen a frame from the output itself. We sample past the midpoint of the clip rather than the opening frame, because the classic evasion is a clip that opens innocently and reveals its real content in the back half. The vision layer flags nudity, sexualized or semi-nude content, see-through clothing, and overtly sexual posing even when the subject is technically clothed — while still allowing that ordinary swimwear carve-out, so a genuine beach clip passes.
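
Picking the frame to screen is simple arithmetic. The sketch below assumes a sampling point 60% of the way through the clip; that figure and the function name are illustrative, not our actual setting.

```python
def sample_frame_index(frame_count: int, percent: int = 60) -> int:
    """Index of the frame to screen, taken past the midpoint of the clip.

    `percent` is a hypothetical setting; integer math avoids float rounding.
    """
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    if not 50 < percent < 100:
        raise ValueError("percent must be strictly between 50 and 100")
    # Clamp so the index is always a valid frame, even for very short clips.
    return min(frame_count - 1, frame_count * percent // 100)
```

For a 120-frame clip this selects frame 72, well past the opening frames that an evasive clip would keep innocent.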

These two vision layers have an important property: they fail open on infrastructure errors but fail strict on judgment. If the vision service times out or errors, we don't block legitimate users over an outage. But if the model returns a confident "this violates policy," we act on it. For a free, public platform, that is the right balance.
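
The distinction between the two failure modes is easiest to see in code. This is a sketch under stated assumptions: the `Verdict` shape, the `classify` callable, and the 0.8 confidence threshold are hypothetical stand-ins, and treating a low-confidence verdict as a pass is a choice made for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    violates: bool     # the vision model's judgment
    confidence: float  # 0.0 to 1.0

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off for a "confident" verdict

def should_block(classify: Callable[[bytes], Verdict], image: bytes) -> bool:
    """Fail open on infrastructure errors, fail strict on confident verdicts."""
    try:
        verdict = classify(image)
    except (TimeoutError, ConnectionError):
        # The vision service is down or slow: do not punish the user for it.
        return False
    # The service answered: act on a confident "violates policy" verdict.
    return verdict.violates and verdict.confidence >= CONFIDENCE_THRESHOLD
```

Only errors that signal an outage are swallowed. Any other exception still propagates, so a bug in the classifier wrapper is not silently turned into an allow.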

The gap: innocent image, guilty prompt

Here is the vector that motivated our most recent change, and it is a good illustration of why no single layer is enough.

A user uploads a completely innocent, fully-clothed photo of a real person. Then, in the image-to-video prompt, they ask the model to change that person's clothing into a bikini, swimwear, or underwear.

Walk through our three existing layers and you'll see why each one waved it through:

The prompt gate allowed it, because plain "bikini" and "swimwear" are on our deliberate allow-list — we permit them for ordinary text-to-video beach content.
The uploaded-image check allowed it, because the uploaded photo genuinely is innocent. There is nothing wrong with the input.
The output frame check allowed it, because ordinary swimwear is, again, deliberately allowed.

Every layer behaved exactly as designed, and the abuse still got through. The harm here isn't "swimwear exists in a video." The harm is taking a real, identifiable person from an uploaded photo and non-consensually transforming them into revealing clothing. The signal that distinguishes abuse from a normal request is not any single word — it's the combination: a reference image of a real person, plus an instruction to change their clothing into revealing attire.

Closing it: a context-aware gate

The fix is a new gate that runs only in the exact situation where this abuse lives — an image-to-video job that carries a reference image. In that narrow context, and only there, we apply a stricter rule than we'd ever apply to text-to-video: we block clothing-change instructions (change, replace, dress, put-into, and similar) aimed at swimwear or undergarments, including the plain terms we otherwise allow.

The scoping is the whole point. Because the gate is wired to fire only on image-to-video jobs with an uploaded image, it has zero effect on text-to-video. "A woman in a bikini on the beach" as a from-scratch generation still works exactly as it did yesterday. We didn't make the product more restrictive for everyone to close a hole that existed on only one path. We made that one specific, high-risk path stricter and left the creative surface untouched.
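
In code, the scoping comes down to one early return. The sketch below is illustrative only: the verb stems and clothing terms are a short, incomplete sample taken from the description above, and the function name is an assumption, not the production rule.

```python
import re
from typing import Optional

# Illustrative, incomplete stems for clothing-change instructions.
CHANGE_VERB = re.compile(r"\b(?:chang|replac|swap|dress|put)\w*", re.IGNORECASE)
# Plain terms that stay allowed on the text-to-video path.
REVEALING = re.compile(r"\b(?:bikini|swimwear|swimsuit|underwear)\b",
                       re.IGNORECASE)

def clothing_change_gate(prompt: str, reference_image: Optional[bytes]) -> bool:
    """Return True when the job should be blocked.

    The rule applies only when a reference image is attached.
    """
    if reference_image is None:
        return False  # text-to-video: this gate never fires
    has_change = CHANGE_VERB.search(prompt) is not None
    has_revealing = REVEALING.search(prompt) is not None
    return has_change and has_revealing
```

With no reference image, "a woman in a bikini on the beach" and even "change her outfit to a bikini" both pass this gate, and any limits on them come from the other layers. With an uploaded image attached, the second prompt is blocked. The sketch imposes no distance limit between the verb and the clothing term, for the same reason the proximity window is kept wide: padding should not defeat the rule.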

There is a deliberate tradeoff baked in. On the image-to-video path, a prompt that merely describes swimwear can now get caught even if the uploaded photo already showed it — for example, animating an existing beach photo. We decided that's an acceptable price. The legitimate version of that request is rare; the abusive version is not; and the cost of a false block there is a mild inconvenience, while the cost of a miss is someone's likeness being abused. When the stakes are that asymmetric, you err strict.

Principles we keep coming back to

Building this system over many iterations, a few principles have held up:

Block at the cheapest layer that can see the signal. A string match is thousands of times cheaper than a GPU render. Push detection as early in the pipeline as it can reliably live.
Layers exist to cover each other's blind spots. Text gates can't see pixels; vision gates can't see intent in an edit instruction. The clothing-swap vector slipped through because we were reasoning about each layer in isolation. The fix came from looking at the request end-to-end.
Scope strictness to context. A rule that's correct for image-to-video can be completely wrong for text-to-video. Global rules over-block; context-aware rules let you be aggressive exactly where the risk is and permissive everywhere else.
Fail open on errors, fail strict on judgment. Don't punish real users for your infrastructure hiccups, but trust a confident harmful verdict.
Asymmetric harm justifies asymmetric thresholds. When a miss is far worse than a false positive, accept more false positives. When it's the reverse, protect the creative experience.
Never publish the exact patterns. A blocklist you print is a bypass guide. We'll happily describe the architecture and the philosophy; we won't hand out the regexes.

Moderation for open-source models isn't a feature you turn on. It's an ongoing, adversarial process — new vectors appear, you close them, the next one appears. What keeps it sustainable is refusing the false choice between "block everything" and "block nothing," and instead doing the harder work of figuring out precisely where the line is and defending exactly that line.

If you build on open models yourself, budget for this. The weights are the easy part.