HuMoCon

Concept Discovery for Human Motion Understanding

Qihang Fang, Chengcheng Tang, Bugra Tekin, Shugao Ma, Yanchao Yang
The University of Hong Kong · Meta

Abstract

We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery, including explicit cross-modal feature alignment and preserving high-frequency information via velocity reconstruction. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon significantly outperforms state-of-the-art methods in human motion understanding.

Overview

🎯 HuMoCon Framework Overview (teaser figure)

HuMoCon introduces a novel approach to human motion understanding through automated concept discovery. Our framework identifies meaningful motion concepts and their relationships, enabling more interpretable and effective human behavior analysis.

Method

🏗️ HuMoCon Architecture (overview figure)

Our method consists of three main components: (1) Motion Encoder that processes raw motion sequences, (2) Concept Discovery Module that identifies semantic concepts via VQ-VAE-based discretization with masked and velocity reconstruction objectives, and (3) Concept Reasoning Module that establishes relationships between discovered concepts for comprehensive understanding. We explicitly align video and motion features during encoder pre-training and leverage LLM fine-tuning for downstream motion-video question answering tasks.
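To make the pipeline concrete, the sketch below illustrates in PyTorch how these pieces could fit together: a motion encoder, a VQ-style concept codebook with a straight-through nearest-code lookup, a velocity-reconstruction loss that preserves high-frequency detail, and a contrastive video-motion alignment loss. This is a minimal sketch under assumptions: the class and function names (MotionEncoder, ConceptDiscovery, velocity_reconstruction_loss, alignment_loss), the feature dimensions, the InfoNCE form of the alignment objective, and the MSE velocity term are illustrative choices, not the released implementation.

```python
# Minimal illustrative sketch (PyTorch). All names, dimensions, and loss forms
# below are assumptions for exposition, not HuMoCon's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionEncoder(nn.Module):
    """Encode a raw motion sequence (B, T, J*3) into per-frame features."""
    def __init__(self, in_dim=72, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, motion):                               # motion: (B, T, in_dim)
        return self.net(motion)                              # (B, T, feat_dim)


class ConceptDiscovery(nn.Module):
    """VQ-style discretization: snap each frame feature to its nearest concept code."""
    def __init__(self, num_concepts=512, feat_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_concepts, feat_dim)

    def forward(self, z):                                    # z: (B, T, D)
        B, T, D = z.shape
        flat = z.reshape(-1, D)
        dist = torch.cdist(flat, self.codebook.weight)       # (B*T, K) pairwise distances
        idx = dist.argmin(dim=-1).view(B, T)                 # concept id per frame
        z_q = self.codebook(idx)                             # quantized features
        commit = F.mse_loss(z, z_q.detach())                 # commitment term
        z_q = z + (z_q - z).detach()                         # straight-through gradient
        return z_q, idx, commit


def velocity_reconstruction_loss(decoded, motion):
    """Supervise frame-to-frame velocities to keep high-frequency motion detail."""
    vel_pred = decoded[:, 1:] - decoded[:, :-1]
    vel_gt = motion[:, 1:] - motion[:, :-1]
    return F.mse_loss(vel_pred, vel_gt)


def alignment_loss(motion_feat, video_feat, temperature=0.07):
    """Symmetric InfoNCE aligning temporally pooled motion and video features."""
    m = F.normalize(motion_feat.mean(dim=1), dim=-1)         # (B, D)
    v = F.normalize(video_feat.mean(dim=1), dim=-1)          # (B, D)
    logits = m @ v.t() / temperature
    targets = torch.arange(m.size(0), device=m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The decoder that maps quantized features back to motion (used by the masked and velocity reconstruction objectives) and the downstream LLM fine-tuning stage are omitted here for brevity.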

Experiments

Quantitative Results

BABEL-QA Benchmark

| Model | Pred type | Overall | Action | Direction | BodyPart | Before | After | Other |
|---|---|---|---|---|---|---|---|---|
| 2s-AGCN-M | cls. | 0.355 | 0.384 | 0.352 | 0.228 | 0.331 | 0.264 | 0.295 |
| 2s-AGCN-R | cls. | 0.357 | 0.396 | 0.352 | 0.194 | 0.337 | 0.301 | 0.285 |
| MotionCLIP-M | cls. | 0.430 | 0.485 | 0.361 | 0.272 | 0.372 | 0.321 | 0.404 |
| MotionCLIP-R | cls. | 0.420 | 0.489 | 0.310 | 0.250 | 0.398 | 0.314 | 0.387 |
| MotionLLM | gen. | 0.436 | 0.517 | 0.354 | 0.154 | 0.427 | 0.368 | 0.529 |
| Ours | gen. | 0.711 | 0.809 | 0.697 | 0.623 | 0.707 | 0.635 | 0.797 |

Our method outperforms baselines by a large margin on the BABEL-QA test set, achieving 0.711 overall accuracy compared to 0.436 by MotionLLM, with notable gains in BodyPart queries (0.623 vs. 0.154).

ActivityNet-QA Benchmark

| Model | Acc (%) ↑ | Score ↑ |
|---|---|---|
| FrozenBiLM | 24.7 | - |
| VideoChat | - | 2.2 |
| LLaMA-Adapter | 34.2 | 2.7 |
| Video-LLaMA | 12.4 | 1.1 |
| Video-ChatGPT | 35.2 | 2.7 |
| Video-LLaVA | 45.3 | 3.3 |
| VideoChat2 | 49.1 | 3.3 |
| MotionLLM | 53.3 | 3.5 |
| Ours | 54.2 | 3.6 |

On ActivityNet-QA, HuMoCon achieves 54.2% accuracy and a score of 3.6, outperforming previous methods including MotionLLM (53.3%, 3.5).

Ablation Study

BABEL-QA Ablation

| Model | Pred type | Overall | Action | Direction | BodyPart | Before | After | Other |
|---|---|---|---|---|---|---|---|---|
| MotionLLM | gen. | 0.436 | 0.517 | 0.354 | 0.154 | 0.427 | 0.368 | 0.529 |
| Ours w/o Lrec | gen. | 0.696 | 0.741 | 0.645 | 0.577 | 0.600 | 0.597 | 0.762 |
| Ours w/o Ldis & Lact | gen. | 0.637 | 0.693 | 0.478 | 0.606 | 0.667 | 0.526 | 0.709 |
| Ours w/o Lalign | gen. | 0.675 | 0.743 | 0.579 | 0.523 | 0.584 | 0.570 | 0.743 |
| Ours | gen. | 0.711 | 0.809 | 0.697 | 0.623 | 0.707 | 0.635 | 0.797 |

The ablation confirms that each objective contributes: removing the reconstruction losses (Lrec), the Ldis & Lact terms, or the feature-alignment loss (Lalign) drops overall accuracy from 0.711 to 0.696, 0.637, and 0.675, respectively, underscoring that velocity reconstruction and cross-modal alignment are both critical for performance.
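For completeness, the ablated objectives could enter encoder training as a simple weighted sum; the sketch below is a hypothetical composition, and the function name and weights are placeholders rather than values reported for HuMoCon.

```python
# Hypothetical combination of the ablated terms; the weights w_* are
# placeholders, not values reported for HuMoCon.
def combined_objective(l_rec, l_dis, l_act, l_align,
                       w_rec=1.0, w_dis=1.0, w_act=1.0, w_align=1.0):
    """Weighted sum of the reconstruction (Lrec), Ldis, Lact, and
    alignment (Lalign) losses ablated in the table above."""
    return w_rec * l_rec + w_dis * l_dis + w_act * l_act + w_align * l_align
```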

Qualitative Results

Figure: Example Q&A results demonstrating detailed motion understanding

The examples answer kinematic and contextual questions, such as which muscles are engaged during a push-up and how the phases of a jump unfold, showing HuMoCon's capability to reason about motion sequences.

Citation

@inproceedings{Fang2025HuMoCon,
  title     = {HuMoCon: Concept Discovery for Human Motion Understanding},
  author    = {Qihang Fang and Chengcheng Tang and Bugra Tekin and Shugao Ma and Yanchao Yang},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}

Acknowledgments

This work is supported by the Early Career Scheme of the Research Grants Council (grant #27207224), the HKU-100 Award, a donation from the Musketeers Foundation, and an Academic Gift from Meta. Data collection, processing, and model development were conducted at The University of Hong Kong.