MoBind: Motion Binding for Fine-Grained IMU

Abstract

We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a hierarchical contrastive strategy that first aligns token-level temporal segments, then fuses local (body-part) alignment with global (body-wide) motion aggregation. Evaluated on mRi, TotalCapture, and EgoHumans, MoBind consistently outperforms strong baselines across all four tasks, demonstrating robust fine-grained temporal alignment while preserving coarse semantic consistency across modalities.

MoBind Architecture

MoBind first encodes each IMU stream together with the motion of its corresponding body part, taken from the skeleton keypoints extracted from video, yielding token-level and local-level representations per sensor. These local representations are then aggregated across sensors to form global-level embeddings. The contrastive objective is applied at all three levels. In addition, a Masked Token Prediction (MTP) module is used only during training to preserve coarse semantic structure, preventing the model from over-focusing on fine-grained alignment.
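To make the three-level objective concrete, here is a minimal sketch of a contrastive loss applied at the token, local, and global levels and then summed. The symmetric InfoNCE form, the temperature value, the equal level weights, and the function names `info_nce` and `hierarchical_loss` are all assumptions for illustration; the text above only states that a contrastive objective is applied at all three levels. The MTP module is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(imu, pose, temperature=0.1):
    """Symmetric InfoNCE (assumed form): imu[i] is the positive for pose[i]."""
    n = len(imu)
    logits = [[cosine(imu[i], pose[j]) / temperature for j in range(n)]
              for i in range(n)]

    def diagonal_cross_entropy(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]  # pose-to-IMU direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))

def hierarchical_loss(levels, weights=(1.0, 1.0, 1.0)):
    """levels maps 'token', 'local', 'global' to (imu_embeddings, pose_embeddings).

    The equal weighting is a hypothetical choice, not taken from the paper.
    """
    names = ("token", "local", "global")
    return sum(w * info_nce(*levels[name]) for w, name in zip(weights, names))
```

In this sketch, correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized at every level where they occur.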

Cross-Modal Retrieval

Given a short segment from one modality, the goal is to retrieve the corresponding moment in the other. Each example here shows the query IMU signal, its corresponding ground-truth video segment, and the top three retrieved video segments. MoBind successfully retrieves the ground-truth segment, and the other top-ranked results are also visually similar to the ground truth.
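The retrieval step described above can be sketched as a nearest-neighbor search in the shared embedding space. This is an illustrative sketch only: it assumes the query IMU segment and the candidate video segments have already been embedded, it uses cosine similarity as the ranking score, and `retrieve_top_k` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_top_k(query, candidates, k=3):
    """Return indices of the k candidates most similar to the query, best first."""
    scores = [cosine(query, c) for c in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy example: one IMU query embedding against four video segment embeddings.
query_imu = [1.0, 0.0]
video_segments = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0], [-1.0, 0.0]]
top3 = retrieve_top_k(query_imu, video_segments, k=3)
```

The same function works in the opposite direction (video query, IMU candidates), since both modalities live in one embedding space.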

Temporal Synchronization

Given paired IMU–video sequences, this task aims to estimate the temporal offset between the two modalities. Here, we show four 20-second examples: the top row shows the IMU signals, while the bottom row shows the offset histogram obtained by accumulating predictions from short temporal segments. The results demonstrate that our method remains effective even under highly repetitive movements, where temporal alignment is often ambiguous.
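One way to read the histogram-based procedure is as a voting scheme: each short IMU segment votes for the shift at which it best matches the video, and the most frequent vote is taken as the offset. The sketch below follows that reading under stated assumptions: embeddings are unit-normalized (so a dot product acts as cosine similarity), offsets are measured in whole segments, and `estimate_offset` and `max_shift` are hypothetical names rather than the authors' implementation.

```python
from collections import Counter

def dot(u, v):
    """Dot product; equals cosine similarity for unit-norm embeddings."""
    return sum(a * b for a, b in zip(u, v))

def estimate_offset(imu_embs, video_embs, max_shift):
    """Accumulate per-segment offset votes and return (best_offset, histogram).

    imu_embs[t] and video_embs[t] are embeddings of the t-th short segment.
    A positive offset means the matching video segment comes later.
    """
    histogram = Counter()
    for t, query in enumerate(imu_embs):
        best_shift, best_score = None, float("-inf")
        for shift in range(-max_shift, max_shift + 1):
            s = t + shift
            if 0 <= s < len(video_embs):
                score = dot(query, video_embs[s])
                if score > best_score:
                    best_shift, best_score = shift, score
        if best_shift is not None:
            histogram[best_shift] += 1  # one vote per short segment
    best_offset = histogram.most_common(1)[0][0]
    return best_offset, histogram

# Toy data: 8 distinct video segments; the IMU stream starts 2 segments late.
video = [[1.0 if i == t else 0.0 for i in range(8)] for t in range(8)]
imu = [video[t + 2] for t in range(6)]
offset, votes = estimate_offset(imu, video, max_shift=3)
```

Accumulating votes is what makes the estimate tolerant of repetitive motion: an individual segment may match several shifts equally well, but the true offset tends to collect the most votes over the whole sequence.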

Spatial Localization

Given a synchronized IMU segment and a multi-person video, we identify the IMU wearer by comparing the IMU global embedding with each person’s global pose embedding produced by MoBind, using cosine similarity. The person with the highest similarity score is assigned as the IMU wearer. For body-part association, we follow the same principle at the local level. Specifically, we compare the IMU local embedding with candidate body-part embeddings across all detected individuals, and assign the IMU to the subject–body-part pair with the highest similarity. Here, we show qualitative examples of body-part localization, where the IMU signal shown in the top row is correctly matched to both the corresponding subject and the worn body location.
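Both assignment rules above reduce to an argmax over cosine similarities, as the following sketch shows. It assumes the global and local embeddings have already been computed; the function names and the dictionary layout keyed by (subject, body part) are hypothetical choices for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def identify_wearer(imu_global, person_globals):
    """Return the person whose global pose embedding best matches the IMU.

    person_globals maps a person id to that person's global pose embedding.
    """
    return max(person_globals, key=lambda p: cosine(imu_global, person_globals[p]))

def localize_body_part(imu_local, part_embeddings):
    """Return the (person, body_part) pair whose local embedding best matches.

    part_embeddings maps (person id, body part name) to a local pose embedding,
    covering every candidate body part of every detected individual.
    """
    return max(part_embeddings, key=lambda k: cosine(imu_local, part_embeddings[k]))

# Toy example with two detected people and two candidate body parts each.
people = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0]}
parts = {
    ("A", "left_wrist"): [1.0, 0.2, 0.0],
    ("A", "right_ankle"): [0.0, 0.0, 1.0],
    ("B", "left_wrist"): [0.1, 1.0, 0.0],
    ("B", "right_ankle"): [0.0, 0.9, 0.5],
}
wearer = identify_wearer([0.1, 0.9, 0.0], people)
subject_and_part = localize_body_part([0.0, 1.0, 0.5], parts)
```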
