Ho Kei (Rex) Cheng

I am a Ph.D. candidate at the University of Illinois Urbana-Champaign, advised by Alexander Schwing. Before that, I was at The Hong Kong University of Science and Technology, advised by Yu-Wing Tai and Chi Keung Tang.

I work on visual understanding, focusing on videos. My past research includes video object tracking, segmentation, and multimodal-conditioned video-to-audio synthesis. I have interned at Adobe Research (open-world video segmentation), Kaiber (video diffusion models), and Sony AI (multimodal flow matching models).

[GitHub] | [Google Scholar] | [CV]


Research
arXiv 2024
Project page / code / arXiv / Space demo / Replicate
Generates high-quality synchronized audio from video or text inputs, with an architecture that enables training on data from multiple sources even when some modalities are missing.
CVPR 2024 Highlight
Project page / code / arXiv
Uses an object transformer to combine pixel-level and object-level features for efficient and robust video object segmentation in challenging scenarios. Used by iMotions and Annolid.
ICCV 2023
Project page / code / arXiv
Achieves open-world video segmentation by combining universal image segmentation with temporal propagation. Easy to extend.
Ho Kei Cheng, Alexander Schwing.
ECCV 2022
Project page / code / arXiv
Approaches video object segmentation from a memory perspective with a pipeline that effectively models both short-term and long-term dependencies. Used by Supervisely and Track-Anything.
Ho Kei Cheng, Yu-Wing Tai, Chi Keung Tang.
NeurIPS 2021
Project page / code / arXiv
A simple yet effective method to model pixel correspondences between frames. Used by Trioscope and BURST.
Ho Kei Cheng, Yu-Wing Tai, Chi Keung Tang.
CVPR 2021
Project page / code / arXiv
Decouples interactive video segmentation into two components: single-frame interaction and temporal propagation, demonstrating significantly improved performance. Used by Sieve.
CVPR 2020
Project page / code / arXiv / pypi
An iterative refinement network that achieves high-quality 4K+ segmentation using only low-resolution training data (less than 500 pixels per side).


Invited Talks
Tools
Professional Activities
Misc