Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Hongkai Zheng1,2†,   Ta-Ying Cheng2,   Benjamin Klein2,   Yisong Yue1,   Zhuoning Yuan2,‡ 
1California Institute of Technology,  2Netflix, Inc.
† Work done during an internship at Netflix
‡ Project Lead
TL;DR: A layered diffusion framework for video editing. Vera jointly generates an edit layer, an alpha matte, and a composite video, separating what to generate from what to preserve.

Disclaimer: this is a research prototype, not an official Netflix product.

"Add a massive sea turtle with a textured olive-green shell and weathered flippers swimming gracefully..."
Video panels: Input, Edit Layer, Composite.

Method

Inference Pipeline

Vera inference pipeline

Given a source video and a text editing instruction, Vera's MoT architecture jointly generates an edit layer, an alpha matte, and a composite video. The edit layer and alpha matte are then composited with the source video to produce the final edited output.
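The final compositing step can be illustrated with a minimal sketch. It assumes the standard "over" operator with straight (non-premultiplied) alpha, which the text above does not spell out; the function name `composite_over` and the array layout are hypothetical choices for illustration, not Vera's actual code.

```python
import numpy as np


def composite_over(source, edit_rgb, alpha):
    """Blend an edit layer over a source video using a per-pixel alpha matte.

    Assumed layout (illustrative only):
      source, edit_rgb: float arrays of shape (T, H, W, 3), values in [0, 1]
      alpha:            float array of shape (T, H, W) or (T, H, W, 1), values in [0, 1]
    """
    source = np.asarray(source, dtype=np.float32)
    edit_rgb = np.asarray(edit_rgb, dtype=np.float32)
    alpha = np.asarray(alpha, dtype=np.float32)

    # Add a trailing channel axis so alpha broadcasts over RGB.
    if alpha.ndim == source.ndim - 1:
        alpha = alpha[..., None]
    alpha = np.clip(alpha, 0.0, 1.0)

    # Standard "over" blend: alpha selects the edit layer, (1 - alpha) keeps the source.
    return alpha * edit_rgb + (1.0 - alpha) * source
```

Under this assumption, pixels where the matte is zero are copied from the source unchanged, which is what makes the edit content-preserving outside the generated region.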

Model Architecture

We extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture. Separate DiTs process the edit layer, alpha matte, and composite video independently, while interacting through joint self-attention to encourage coherent compositing.
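To make the idea of joint self-attention across branches concrete, here is a minimal single-head NumPy sketch of the general Mixture-of-Transformers pattern: each branch keeps its own projection weights, while attention is computed over the concatenated tokens of all branches. The function names, the weight dictionary layout, and the omission of multi-head splitting, normalization, and text conditioning are all simplifying assumptions; Vera's actual layers may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(branch_tokens, branch_weights):
    """Single-head joint self-attention over several branches.

    branch_tokens:  list of (N_i, D) arrays, one per branch
                    (e.g. edit layer, alpha matte, composite).
    branch_weights: list of dicts with 'q', 'k', 'v', 'o' matrices of shape (D, D),
                    one dict per branch (hypothetical parameter layout).
    Returns a list of (N_i, D) arrays, one per branch.
    """
    # Branch-specific projections: every branch uses its own parameters.
    q = [x @ w["q"] for x, w in zip(branch_tokens, branch_weights)]
    k = [x @ w["k"] for x, w in zip(branch_tokens, branch_weights)]
    v = [x @ w["v"] for x, w in zip(branch_tokens, branch_weights)]

    # Joint attention: concatenate along the token axis so that every
    # token can attend to tokens from all branches.
    q_all = np.concatenate(q, axis=0)
    k_all = np.concatenate(k, axis=0)
    v_all = np.concatenate(v, axis=0)
    d = q_all.shape[-1]
    attn = softmax(q_all @ k_all.T / np.sqrt(d), axis=-1)
    mixed = attn @ v_all

    # Split the result back per branch and apply branch-specific output projections.
    sizes = [x.shape[0] for x in branch_tokens]
    parts = np.split(mixed, np.cumsum(sizes)[:-1], axis=0)
    return [m @ w["o"] for m, w in zip(parts, branch_weights)]
```

In this sketch the only cross-branch interaction happens inside the shared attention matrix, so perturbing the tokens of one branch changes the outputs of the others.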

Results | Qualitative Comparisons

Object Addition
Background Change
"Add a medieval knight in polished silver plate armor walking steadily on the sand..."
Video panels: Input, Vera-14B, VACE-14B (each output has an accompanying difference heatmap).

Results | Quantitative Comparisons

Comparison with existing video editing methods. VLM-based metrics (CS, CT, IS) are averaged over three VLMs. OC and TF are near-identical across models and not bolded. Best results in other columns are in bold.

Column groups (left to right): Content Preservation, Video Quality, Instruction Compliance

Method        PSNR  SSIM   LPIPS  OC    TF    CS    CT    OSC    IS
Object Addition
Ditto         13.1  0.447  0.529  0.98  0.98  2.84  3.13  0.239  3.62
Lucy-Edit     19.0  0.798  0.301  0.98  0.98  2.54  2.79  0.238  2.98
VideoPainter  10.7  0.354  0.669  0.64  0.98  2.54  2.80  0.247  2.87
VACE-1.3B     10.3  0.323  0.667  0.98  0.98  2.55  3.01  0.250  2.99
VACE-14B      10.1  0.325  0.666  0.98  0.98  2.63  3.08  0.248  3.26
ReCo          16.8  0.778  0.280  0.98  0.98  3.18  3.35  0.243  3.77
Vera-1.3B     25.3  0.949  0.078  0.98  0.98  3.49  3.47  0.243  4.05
Vera-14B      26.1  0.950  0.082  0.98  0.98  3.59  3.66  0.244  4.13
Background Change
Ditto         21.7  0.888  0.092  0.97  0.97  3.29  3.25  0.240  4.26
Lucy-Edit     22.3  0.908  0.078  0.96  0.96  2.50  2.50  0.235  3.71
VideoPainter  29.9  0.975  0.032  0.83  0.97  2.29  2.31  0.237  3.64
VACE-1.3B     31.5  0.981  0.024  0.96  0.97  3.74  3.61  0.245  4.48
VACE-14B      31.7  0.983  0.021  0.96  0.96  3.68  3.59  0.243  4.44
ReCo          24.9  0.935  0.061  0.96  0.97  2.45  2.71  0.220  3.05
Vera-1.3B     35.2  0.993  0.010  0.96  0.97  3.34  3.25  0.243  4.35
Vera-14B      36.2  0.993  0.010  0.96  0.97  3.28  3.21  0.242  4.38
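
For readers unfamiliar with the PSNR column, the sketch below shows the standard definition, 10·log10(MAX² / MSE). How the reported numbers were actually computed (which frames or regions are compared, and how values are averaged across videos) is not stated on this page, so the global mean over all pixels and the `data_range` default are assumptions made for illustration.

```python
import numpy as np


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    data_range is the maximum possible pixel value (1.0 for floats in [0, 1],
    255 for 8-bit images). The MSE here is taken over all elements at once,
    which is an illustrative choice rather than the paper's protocol.
    """
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

With this definition, a uniform error of 0.1 on a [0, 1] scale corresponds to 20 dB, and each additional 10 dB means the mean squared error is ten times smaller.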

Results | User Preference Study

User study results comparing Vera-1.3B against five baselines. Bars show our win rate across three evaluation dimensions. Bold values with * indicate statistically significant results (p < 0.05, binomial test) where our model is preferred.

User study win rates by edit type
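
The significance criterion mentioned above can be sketched as an exact binomial test. The page does not say whether the test is one-sided or two-sided, nor what null win rate is used; the sketch assumes a one-sided test against a null win rate of 0.5 (no preference), and the function name `binomial_p_value` is hypothetical. Any counts passed to it are illustrative inputs, not numbers from the study.

```python
from math import comb


def binomial_p_value(wins, trials, p_null=0.5):
    """One-sided exact binomial p-value: P(X >= wins) for X ~ Binomial(trials, p_null).

    Assumes independent pairwise comparisons and a null hypothesis that
    raters have no preference (p_null = 0.5).
    """
    if not 0 <= wins <= trials:
        raise ValueError("wins must be between 0 and trials")
    return sum(
        comb(trials, k) * p_null ** k * (1.0 - p_null) ** (trials - k)
        for k in range(wins, trials + 1)
    )
```

For example, winning 10 of 10 comparisons gives p = 1/1024 ≈ 0.00098 under these assumptions, which falls below the 0.05 threshold, whereas winning 5 of 10 gives p ≈ 0.62.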

Dataset

We construct a layered video dataset to train our model using data pipelines that combine automated tools with manual annotations (see paper for details). The full corpus contains 486K frames at 832×480 resolution, organized into three complementary subsets: synthetic composites derived from VideoMatte240K with synthetic backgrounds; realistic single-object videos from Pexels and Mixkit with natural camera motion; and realistic multi-object videos with effects such as shadows, reflections, and occlusion. Below are representative samples from each subset. Each row shows the input video, alpha matte, edit layer (RGB), and target composite, along with the text editing instruction.

Data sources & acknowledgments: We gratefully acknowledge the following sources used in constructing our dataset. Synthetic composites are derived from VideoMatte240K. Real stock footage is sourced from Pexels and Mixkit under their respective free-use licenses. Please refer to each platform for full license terms.

Video columns: Input Video, Alpha Matte, Edit Layer, Composite.

Conclusion and Limitations

We investigated how to introduce editable layer structure into diffusion models for video editing, where generated edit layers must support coherent compositing with the source video. Vera provides a concrete formulation: it jointly produces an edit layer, an alpha matte, and a composite video, separating what to generate from what to preserve. The resulting editable layers can support iterative refinement in downstream editing workflows. Through controlled experiments, we identified three key ingredients that enable layer separation while retaining competitive composition quality: an MoT architecture with cross-layer interaction through joint self-attention, composite-branch supervision, and curated layered data with accurately aligned edit layers and alpha mattes.

Three limitations remain in this work. First, jointly generating three layers increases inference cost: Vera-1.3B is roughly 3× slower than VACE. Second, our evaluation is limited to object addition and background replacement. Extending the approach to relighting, complex visual effects, and broader editing operations will require layered supervision that captures the corresponding interactions. Third, our inference procedure approximates the preserved video with the source video and therefore assumes that preserved content contains only small semi-transparent regions. Direct recovery in cases such as glass or water requires suitable layered training data and explicit evaluation. Addressing these limitations would extend layered generation toward a broader set of production editing operations.

BibTeX

@article{zheng2026vera,
    title     = {Vera: A Layered Diffusion Model for Content-Preserving Video Editing},
    author    = {Zheng, Hongkai and Cheng, Ta-Ying and Klein, Benjamin and Yue, Yisong and Yuan, Zhuoning},
    journal   = {arXiv preprint arXiv:2606.23610},
    year      = {2026}
}