VOID is an open-source AI model from Netflix that removes objects from videos and accounts for the physical interactions the removal causes, such as items falling.
It builds on a CogVideoX 3D Transformer base and is conditioned on a quadmask, a mask that encodes four regions: remove, overlap, affected, and background. Inputs are a source video, a quadmask, and a text prompt. Inference needs a GPU with at least 40 GB of VRAM and supports up to 197 frames at 384x672 resolution.
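
To make the quadmask idea and the stated input limits concrete, here is a minimal Python sketch. It is not VOID's code or API: the `Region` labels, `build_quadmask_frame`, and `check_inputs` are hypothetical names introduced for illustration. It also assumes that "overlap" means pixels covered by both the remove and affected masks, that "384x672" means height x width, and that the resolution is fixed rather than a maximum. The actual pixel values used by VOID's quadmask files are not given in the text above, so small integer labels stand in for them.

```python
from enum import IntEnum

# Limits quoted in the text above.
MAX_FRAMES = 197
HEIGHT, WIDTH = 384, 672  # assumption: "384x672" read as height x width


class Region(IntEnum):
    """Hypothetical integer labels for the four quadmask regions."""
    BACKGROUND = 0
    AFFECTED = 1
    OVERLAP = 2
    REMOVE = 3


def build_quadmask_frame(remove, affected):
    """Combine two boolean masks (lists of rows) into one label map.

    Pixels set in both masks are labelled OVERLAP. This reading of
    "overlap" is an assumption, not taken from VOID's documentation.
    """
    if len(remove) != len(affected):
        raise ValueError("masks must have the same number of rows")
    frame = []
    for rem_row, aff_row in zip(remove, affected):
        if len(rem_row) != len(aff_row):
            raise ValueError("mask rows must have the same length")
        row = []
        for rem, aff in zip(rem_row, aff_row):
            if rem and aff:
                row.append(Region.OVERLAP)
            elif rem:
                row.append(Region.REMOVE)
            elif aff:
                row.append(Region.AFFECTED)
            else:
                row.append(Region.BACKGROUND)
        frame.append(row)
    return frame


def check_inputs(num_frames, height, width, prompt):
    """Return a list of problems; an empty list means the inputs fit the limits."""
    problems = []
    if num_frames < 1 or num_frames > MAX_FRAMES:
        problems.append(f"frame count must be between 1 and {MAX_FRAMES}")
    # Assumption: the clip must match the stated resolution exactly.
    if (height, width) != (HEIGHT, WIDTH):
        problems.append(f"resolution must be {HEIGHT}x{WIDTH}")
    if not prompt.strip():
        problems.append("text prompt must not be empty")
    return problems
```

For example, `check_inputs(197, 384, 672, "an empty table")` returns an empty list, while a 198-frame clip produces one problem message. Checks like these are cheap to run before committing a 40 GB-class GPU to an inference job.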