Tracking Anything with Decoupled Video Segmentation

ICCV 2023




Abstract

Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.
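To make the decoupled design concrete, below is a minimal Python sketch of a loop in this spirit: a task-specific image-level segmenter proposes masks every few frames, a task-agnostic propagator carries the current segmentation forward, and the two hypotheses are merged. The `image_segmenter`, `propagator`, `fuse`, and `run_decoupled` names are illustrative placeholders, and the greedy IoU merging here stands in for DEVA's actual bi-directional, in-clip consensus fusion; it is a sketch of the idea, not the implementation.

```python
# Minimal sketch of a decoupled, (semi-)online video segmentation loop.
# `image_segmenter` and `propagator` are hypothetical interfaces; the greedy
# IoU fusion below is a stand-in for DEVA's bi-directional consensus fusion.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 0.0

def fuse(propagated: dict[int, np.ndarray],
         proposals: list[np.ndarray],
         next_id: int,
         iou_thresh: float = 0.5) -> tuple[dict[int, np.ndarray], int]:
    """Greedily match propagated masks to new image-level proposals:
    matched tracks are refreshed from the proposal, unmatched tracks are kept,
    and leftover proposals start new object tracks."""
    fused: dict[int, np.ndarray] = {}
    unmatched = list(range(len(proposals)))
    for obj_id, mask in propagated.items():
        best_j, best_iou = -1, iou_thresh
        for j in unmatched:
            score = iou(mask, proposals[j])
            if score > best_iou:
                best_j, best_iou = j, score
        if best_j >= 0:
            fused[obj_id] = proposals[best_j]   # refresh with image-level mask
            unmatched.remove(best_j)
        else:
            fused[obj_id] = mask                # keep the propagated hypothesis
    for j in unmatched:                          # unmatched proposals -> new objects
        fused[next_id] = proposals[j]
        next_id += 1
    return fused, next_id

def run_decoupled(frames, image_segmenter, propagator, detect_every: int = 5):
    """image_segmenter(frame) -> list of binary masks (task-specific, image-level);
    propagator.step(frame, masks) -> {object id: mask} (class/task-agnostic)."""
    segmentation: dict[int, np.ndarray] = {}
    next_id = 0
    for t, frame in enumerate(frames):
        propagated = propagator.step(frame, segmentation)
        if t % detect_every == 0:                # periodically inject image-level results
            proposals = image_segmenter(frame)
            segmentation, next_id = fuse(propagated, proposals, next_id)
        else:
            segmentation = propagated
        yield segmentation
```

The image-level module is the only task-specific part, so swapping tasks only requires swapping `image_segmenter`; the propagator is trained once and reused.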




Demo with Grounded Segment Anything (text prompt: "guinea pigs" and "chicken"):

Source: https://youtu.be/FM9SemMfknA



Demo with Grounded Segment Anything (text prompt: "pigs"):

Source: https://youtu.be/FbK3SL97zf8



Demo with Grounded Segment Anything (text prompt: "capybara"):

Source: https://youtu.be/couz1CrlTdQ



Demo with Segment Anything (automatic points-in-grid prompting); the DEVA result overlaid on the video is followed by the original video:

Source: DAVIS 2017 validation set "soapbox"
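In the "automatic points-in-grid" demo, SAM is queried with a regular grid of point prompts to produce class-agnostic per-frame masks, which the temporal propagation module then links over time. Below is a rough sketch of producing such per-frame proposals with the official `segment-anything` package; the checkpoint path is a placeholder and the downstream propagation call is not shown, so this is not DEVA's actual interface.

```python
# Sketch: per-frame, class-agnostic proposals via SAM's points-in-grid prompting.
# Assumes the official `segment-anything` package; checkpoint path is a placeholder.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

def frame_proposals(frame_bgr: np.ndarray) -> list[np.ndarray]:
    """Return binary masks for one frame, largest first, from grid-point prompts."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    records = mask_generator.generate(rgb)            # list of dicts from SAM
    records.sort(key=lambda r: r["area"], reverse=True)
    return [r["segmentation"] for r in records]       # HxW boolean arrays

# These per-frame proposals play the role of the image-level module in the
# decoupled pipeline; the temporal propagation model associates them over time.
```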



Demo with Segment Anything on an out-of-domain example; the DEVA result overlaid on the video is followed by the original video:

Source: https://youtu.be/FQQaSyH9hZI


Contact: Ho Kei (Rex) Cheng (hkchengrex@gmail.com)