Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Kunyu Feng, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang
Accepted by ICLR 2026
TL;DR: FastVMT eliminates redundancy in video motion transfer, enabling fast and efficient motion pattern transfer from reference videos to generated content.
CLICK for the full abstract
Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: **motion redundancy** arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; **gradient redundancy** occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we restrict the corresponding attention layers to a local neighborhood, so that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a **3.43×** speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
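To illustrate the motion-redundancy idea, the sketch below (plain NumPy, not the repository's implementation) builds the kind of local-neighborhood mask that restricts attention to nearby tokens; the function name and the `window` parameter are illustrative, not the actual API.

```python
import numpy as np

def local_attention_mask(num_tokens: int, window: int) -> np.ndarray:
    """Boolean mask that keeps attention inside a local neighborhood.

    Entry (i, j) is True only when token j lies within `window`
    positions of token i, so interaction weights for distant image
    regions are never computed.
    """
    idx = np.arange(num_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# With window=1 each token attends only to itself and its two neighbors.
mask = local_attention_mask(6, 1)
```

In a real attention layer the masked-out entries would simply be skipped (or set to -inf before the softmax), which is where the runtime saving comes from.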
FastVMT_demo_video.mov
- 2026.01.28 Initial release with efficient tile-based AMF support
- Add more examples and demo videos
- Add support with CPU-offload to support low VRAM GPUs
- Attention Motion Flow (AMF): Custom implementation for transferring motion patterns from reference videos to generated content
- Efficient Tile-based AMF: Optimized computation with reduced memory usage while maintaining accuracy
- Flexible Inference Modes: Support for multiple generation modes (effi_AMF, No_transfer)
- VRAM Management: Built-in CPU offload strategies for running on consumer GPUs
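The tile-based idea behind efficient AMF can be sketched as follows. This is a minimal NumPy illustration, not the repository's AMF kernel: attention rows are computed one tile of queries at a time, so the full score matrix is never materialised at once, reducing peak memory without changing the result.

```python
import numpy as np

def tiled_attention_rows(q: np.ndarray, k: np.ndarray, tile: int = 2) -> np.ndarray:
    """Compute softmax attention weights tile by tile over the queries.

    Only a (tile, N) slice of the score matrix exists at any moment,
    instead of the full (N, N) matrix. Output is identical to the
    untiled computation.
    """
    out = []
    for start in range(0, q.shape[0], tile):
        scores = q[start:start + tile] @ k.T             # (tile, N) scores
        scores -= scores.max(axis=1, keepdims=True)      # numerical stability
        w = np.exp(scores)
        out.append(w / w.sum(axis=1, keepdims=True))     # row-wise softmax
    return np.vstack(out)
```

The repository's efficient AMF variant applies the same principle to the motion-flow computation while also exploiting locality.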
# Create conda environment
conda create -n fastvmt python=3.10
conda activate fastvmt
# Install dependencies
cd FastVMT
pip install -e .  # editable mode
Requirements:
- Python 3.10+
- PyTorch 2.0+
- CUDA 12.x
- 80GB+ GPU VRAM recommended (can run on lower VRAM with CPU offload)
We support two model variants:
| Model | VRAM Required | Quality |
|---|---|---|
| Wan2.1-T2V-1.3B | ~24GB | Good |
| Wan2.1-T2V-14B | ~80GB | Best |
# Download 14B model (default, best quality)
python examples/download_model.py --model 14b
# Download 1.3B model (lower VRAM requirement)
python examples/download_model.py --model 1.3b
# Download both models
python examples/download_model.py --model all

Or use the ModelScope CLI directly:
# Download 14B model
modelscope download --model Wan-AI/Wan2.1-T2V-14B --local_dir ./models/Wan2.1-T2V-14B
# Download 1.3B model
modelscope download --model Wan-AI/Wan2.1-T2V-1.3B --local_dir ./models/Wan2.1-T2V-1.3B

# Using 1.3B model (lower VRAM)
python examples/wan_1.3b_text_to_video.py
# Using 14B model (better quality)
python examples/wan_14b_text_to_video.py

import torch
from diffsynth import ModelManager, WanVideoPipeline, save_video, VideoData
# Load models
model_manager = ModelManager(device="cpu")
model_manager.load_models([
"models/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors",
"models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth",
"models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
], torch_dtype=torch.bfloat16)
pipe = WanVideoPipeline.from_model_manager(model_manager, torch_dtype=torch.bfloat16, device="cuda")
pipe.enable_vram_management(num_persistent_param_in_dit=None)
# Load reference video for motion transfer
ref_video = VideoData("data/source.mp4", height=480, width=832)
# Generate with motion transfer (num_frames auto-inferred from input_video)
video = pipe(
prompt="Documentary photography style. A lively puppy running quickly on a green grass field. The puppy has brown-yellow fur, ears perked up, with a focused and joyful expression. Sunlight shines on it, making the fur look extra soft and shiny. The background is an open grass field, occasionally dotted with wildflowers, with blue sky and white clouds visible in the distance. Strong perspective, capturing the puppy's dynamic movement and the vitality of the surrounding grass. Medium shot, side tracking view.",
negative_prompt="low quality, blurry",
num_inference_steps=50,
denoising_strength=0.75,
input_video=ref_video,
seed=42,
tiled=True,
sf=4,
mode="effi_AMF",
)
save_video(video, "output.mp4", fps=15, quality=5)

| Mode | Description | Speed | Accuracy |
|---|---|---|---|
| effi_AMF | Efficient tile-based Attention Motion Flow (default) | Fast | Good |
| No_transfer | Standard generation without motion transfer | Fastest | N/A |
Click for directory structure
FastVMT/
├── diffsynth/ # Core library
│ ├── models/ # Model implementations
│ │ ├── wan_video_dit.py # Modified DiT with Q/K extraction
│ │ ├── wan_video_vae.py # Video VAE encoder/decoder
│ │ └── wan_video_text_encoder.py
│ ├── pipelines/ # Inference pipelines
│ │ └── wan_video.py # Pipeline with AMF implementation
│ ├── schedulers/ # Noise schedulers (Flow Matching)
│ ├── prompters/ # Prompt processing
│ └── vram_management/ # Memory optimization utilities
├── examples/ # Example scripts
├── models/ # Model checkpoints
├── requirements.txt # Dependencies
└── setup.py # Package setup
This repository includes the following modifications to the original DiffSynth-Studio:
- Added Q/K tensor extraction in self-attention layers for AMF computation
- Custom forward pass preserving spatial size information
- Implemented Attention Motion Flow (AMF) computation algorithm
- Added efficient tile-based AMF variant for reduced memory usage
- Integrated guidance optimization steps for motion transfer
- Added tracking loss for improved temporal consistency
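The guidance-optimization step that exploits gradient redundancy can be sketched as below. This is an illustrative toy, not the pipeline's actual scheduler: the names `grad_fn`, `cache`, and `reuse_every` are hypothetical, but the principle matches the abstract — compute a fresh guidance gradient only occasionally and reuse the cached one in between.

```python
def guided_update(latent, step, grad_fn, cache, reuse_every=2, lr=0.1):
    """Illustrative gradient-reuse step.

    A fresh guidance gradient is computed only every `reuse_every`
    diffusion steps; in between, the cached gradient is reused,
    exploiting the fact that gradients change slowly along the
    diffusion trajectory.
    """
    if step % reuse_every == 0 or "grad" not in cache:
        cache["grad"] = grad_fn(latent)   # expensive: fresh gradient
    return latent - lr * cache["grad"]    # cheap: cached gradient reuse
```

With `reuse_every=2`, half of the gradient computations are skipped, which is the kind of saving that contributes to the reported overall speedup.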
If you use this code, please cite:
@article{ma2025fastvmt,
title={FastVMT: Eliminating Redundancy in Video Motion Transfer},
author={Ma, Yue and Wang, Zhikai and Ren, Tianhao and Zheng, Mingzhe and Liu, Hongyu and Guo, Jiayi and Feng, Kunyu and Xue, Yuxuan and Zhao, Zixiang and Schindler, Konrad and Chen, Qifeng and Zhang, Linfeng},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2025}
}

This project is open source and licensed under the MIT License. See LICENSE.md for details.
This repository borrows heavily from DiffSynth-Studio and Wan Video. Thanks to the authors for sharing their code and models.
This is the codebase for our research work. If you have any questions or ideas to discuss, feel free to open an issue.