There’s a moment you hit in AI video work where everything looks like it should be working… but isn’t.
The shots are beautiful.
The lighting is right.
The camera language feels almost expensive.
And yet the whole thing collapses into something slightly off—like a dream that keeps changing the actor’s face every time you blink.
That’s where this recent exploration landed me.
Not in “how do I make better prompts?”
But in something more uncomfortable:
I’ve been fighting the system in the wrong place.
The problem wasn’t the shots — it was the identity layer.
I started from a simple observation.
In 9:16 AI filmmaking, especially with Midjourney-style generation tools, I kept running into a frustrating split:
- If I used character reference tools (--oref), I got consistency… but boring framing
- If I removed them, I got beautiful cinematic shots… but identity drift
So I tried to solve it like a filmmaker:
“Fix continuity in post. Add control later.”
But that wasn’t the real issue.
The real issue was this:
Identity tools don’t just “hold a face.”
They reshape the entire visual system around that face.
And not in a helpful way.
They collapse composition into safe, repetitive portrait logic:
- too many headshots
- too many centered faces
- too little camera aggression
- too little cinematic risk
Lab note: I originally thought this was a limitation of prompting. It’s not. It’s a structural bias in how identity conditioning steers composition.
The first breakthrough: identity is not a reference problem — it’s a prompt completeness problem.
At first I assumed the solution was abstraction:
“Don’t describe too much. Let the model imagine.”
That was wrong.
Abstraction didn’t create freedom. It created instability.
The model didn’t get more cinematic—it just started re-inventing the character from scratch every shot.
So instead of solving identity with less information, I tried the opposite:
I over-specified it.
Age. Hair. Skin tone. Wardrobe. Era. Physical presence. Emotional posture.
Not as decoration.
As constraint.
And something interesting happened:
When identity becomes fully defined in text, it stops interfering with composition.
It becomes stable.
Predictable.
Reusable.
Almost like casting a real actor instead of re-generating a new one every shot.
Lab shift: from reference-driven identity to structured character blocks.
This is where the workflow actually changed.
Instead of:
- “use character reference”
- or “same as before”
- or abstract identity phrases
I started building a fixed identity block:
- Age
- Skin tone
- Hair style
- Hair color
- Body type
- Wardrobe baseline
- Emotional presence
And then I stopped touching it.
Every prompt became:
Identity block + camera instruction + scene action
That separation mattered more than anything else.
Lab note: this is where AI filmmaking starts feeling less like prompting and more like production design.
The second discovery: cinematic freedom comes from removing identity tools, not identity itself.
This was the contradiction I didn’t expect.
The most cinematic shots came from:
- removing --oref
- removing identity shortcuts
- removing reference anchoring
But not removing identity itself.
That distinction is everything.
Because what actually kills dynamism is not consistency—it’s how consistency is enforced.
Reference-based identity forces:
- face-first framing
- conservative composition
- portrait gravity
Text-defined identity allows:
- full-body shots
- aggressive angles
- environmental storytelling
- spatial freedom
Same character. Different physics.
The production reality: expensive shots, selective repair.
Of course, this introduced a new problem.
Now I had freedom—but less guaranteed consistency.
So the workflow split into two passes:
- cinematic generation pass
- selective identity repair pass
Not everything gets fixed.
Only the shots that matter:
- emotional close-ups
- dialogue beats
- narrative anchors
Everything else is left raw if it works.
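As a rough sketch of that split, assuming each shot is just a hypothetical (shot id, beat label) pair, the repair pass is a filter over the beats that matter. The sample shots and the function name are invented for illustration.

```python
# Beat labels come from the list above; the (shot_id, beat) layout is an assumption.
KEY_BEATS = {"emotional close-up", "dialogue beat", "narrative anchor"}


def split_passes(shots):
    """Split (shot_id, beat) pairs into a repair queue and a leave-raw list."""
    repair, raw = [], []
    for shot_id, beat in shots:
        (repair if beat in KEY_BEATS else raw).append(shot_id)
    return repair, raw


shots = [
    ("s01", "establishing wide"),
    ("s02", "emotional close-up"),
    ("s03", "walking insert"),
    ("s04", "dialogue beat"),
]
repair, raw = split_passes(shots)
print(repair)  # ['s02', 's04']
print(raw)     # ['s01', 's03']
```

Whether a raw shot "works" is still a judgment call made by eye; the sketch only decides which shots are eligible for the expensive pass.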
Lab note: this is where cost actually gets controlled. Not in generation—but in deciding what is worth stabilizing.
The hidden system: shot tagging becomes the real editing tool.
Once the pipeline expanded, I had to stop thinking in clips and start thinking in categories.
Every shot now gets tagged:
- role (setup, tension, reveal, payoff)
- camera (OTS, close-up, wide, reaction)
- identity status (OK, partial, broken)
- fix status (raw, fixed, locked)
What this does is simple but powerful:
It stops the workflow from becoming emotional.
You’re no longer reacting to every clip.
You’re routing them.
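The four tags can be written down as a minimal Python sketch. The tag values come from the list above; the class names, the `route` helper, and the sample shots are assumptions for illustration, not a description of any particular editing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SETUP = "setup"
    TENSION = "tension"
    REVEAL = "reveal"
    PAYOFF = "payoff"


class Camera(Enum):
    OTS = "OTS"
    CLOSE_UP = "close-up"
    WIDE = "wide"
    REACTION = "reaction"


class Identity(Enum):
    OK = "OK"
    PARTIAL = "partial"
    BROKEN = "broken"


class Fix(Enum):
    RAW = "raw"
    FIXED = "fixed"
    LOCKED = "locked"


@dataclass(frozen=True)
class Shot:
    shot_id: str
    role: Role
    camera: Camera
    identity: Identity
    fix: Fix


def route(shots):
    """Group shot ids into bins keyed by (identity status, fix status)."""
    bins = defaultdict(list)
    for s in shots:
        bins[(s.identity.value, s.fix.value)].append(s.shot_id)
    return dict(bins)


shots = [
    Shot("s01", Role.SETUP, Camera.WIDE, Identity.OK, Fix.RAW),
    Shot("s02", Role.TENSION, Camera.OTS, Identity.PARTIAL, Fix.RAW),
    Shot("s03", Role.REVEAL, Camera.CLOSE_UP, Identity.BROKEN, Fix.RAW),
    Shot("s04", Role.PAYOFF, Camera.REACTION, Identity.OK, Fix.LOCKED),
    Shot("s05", Role.TENSION, Camera.CLOSE_UP, Identity.PARTIAL, Fix.RAW),
]
table = route(shots)
for key, ids in table.items():
    print(key, ids)
```

Using enums means a mistyped tag fails immediately instead of silently creating a new category, and each bin in the table is a decision made once for a group of shots rather than a reaction to a single clip.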
Lab note: this is where AI video stops feeling like improvisation and starts feeling like orchestration.
The uncomfortable truth: smooth “Hollywood” AI is not about generation quality.
As I studied the reference material, something stood out.
The polish wasn’t coming from better AI.
It came from:
- extreme shot density control
- intentional pacing (fractions of seconds matter)
- selective lip sync usage
- ruthless editing rhythm
- and very deliberate over-the-shoulder spatial logic
But the real hidden cost?
Time.
Credits.
Iteration cycles.
This is not a fast workflow.
It’s a controlled burn.
The real insight I’m taking from this.
It’s not that AI filmmaking needs better prompts.
It’s that it needs clearer separation of roles:
Identity is not cinematography.
Cinematography is not identity.
Editing is not generation.
When those collapse into one prompt, everything becomes average.
When they separate, the system starts to behave like a real production pipeline.
Imperfect. Expensive. Slow.
But cinematic.
Closing thought:
I used to think the goal was to make AI “understand the shot.”
Now I think the goal is simpler and harder:
To stop the system from confusing who is in the frame with how the frame is built.
Once you separate those two things, something interesting happens.
The images stop feeling generated.
They start feeling staged.
And that’s where cinema quietly begins.
Steve Teare
video alchemist
TerminallyBored.Monster
Palouse, Washington USA
