A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction

Abstract

This paper focuses on leveraging Human Object Interaction (HOI) information to improve temporal action segmentation under timestamp supervision, where only one frame is annotated for each action segment. This information is obtained from an off-the-shelf pre-trained HOI detector that requires no additional HOI-related annotations in our experimental datasets. Our approach generates pseudo labels by expanding the annotated timestamps into intervals and allows the system to exploit the spatio-temporal continuity of human interaction with an object to segment the video. We also propose the (3+1) Real-time Cooking (ReC) dataset as a realistic collection of videos from 30 participants cooking 15 breakfast items. Our dataset has three main properties: 1) it is, to our knowledge, the first to offer synchronized third- and first-person videos, 2) it incorporates diverse actions and tasks, and 3) it consists of high-resolution frames to detect fine-grained information. In our experiments, we benchmark state-of-the-art segmentation methods under different levels of supervision on our dataset. We also quantitatively show the advantages of using HOI information, as our framework improves its baseline segmentation method on several challenging datasets with varying viewpoints, providing improvements of up to 10.9% and 5.3% in F1 score and frame-wise accuracy, respectively.
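The idea of expanding a single annotated timestamp into a labeled interval can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a hypothetical per-frame list `hoi_object` holding the object the person is interacting with (as an HOI detector might report, with `None` when no interaction is detected), and it simply grows each timestamp left and right while that object stays the same. The function name and the input format are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def expand_timestamps(
    timestamps: Sequence[int],
    labels: Sequence[str],
    hoi_object: Sequence[Optional[str]],
) -> List[Optional[str]]:
    """Grow each annotated frame into an interval of pseudo labels.

    timestamps: sorted frame indices, one per action segment.
    labels:     action label for each timestamp.
    hoi_object: hypothetical per-frame object id from an HOI detector
                (None when no interaction is detected).
    Frames that no interval reaches keep the value None.
    """
    num_frames = len(hoi_object)
    pseudo: List[Optional[str]] = [None] * num_frames

    for i, (t, label) in enumerate(zip(timestamps, labels)):
        pseudo[t] = label
        anchor = hoi_object[t]
        if anchor is None:
            continue  # no interaction at the annotated frame, nothing to grow

        # Never expand across a neighbouring annotated timestamp.
        lo = timestamps[i - 1] + 1 if i > 0 else 0
        hi = timestamps[i + 1] - 1 if i + 1 < len(timestamps) else num_frames - 1

        # Expand left while the interacted object is unchanged and unclaimed.
        left = t - 1
        while left >= lo and hoi_object[left] == anchor and pseudo[left] is None:
            pseudo[left] = label
            left -= 1

        # Expand right while the interacted object is unchanged.
        right = t + 1
        while right <= hi and hoi_object[right] == anchor:
            pseudo[right] = label
            right += 1

    return pseudo


# Toy example: two annotated frames (1 and 5) in an 8-frame video.
objects = ["knife", "knife", "knife", None, "cup", "cup", "cup", "cup"]
pseudo_labels = expand_timestamps([1, 5], ["cut", "pour"], objects)
```

In this toy setting, the frame with no detected interaction is left unlabeled, which is one simple way to avoid committing to a label near an uncertain action boundary.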

(3+1) Real-time Cooking (ReC) Dataset

Real-time instructional video dataset comparison. * indicates an approximation due to a hidden test set. “Views” refers to third-person views. Some statistical discrepancies between 1ReC and 3ReC are due to frame loss in some videos. “Envir.” includes various camera setups.

BibTeX

@inproceedings{sayed2023new,
  title={A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction},
  author={Sayed, Saif and Ghoddoosian, Reza and Trivedi, Bhaskar and Athitsos, Vassilis},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3132--3141},
  year={2023}
}

CVPR 2023

Video Samples

Diverse environments for Dish "Avocado Toast"

Multiple Viewpoints for the same Dish "Orange Juice"

Qualitative Results

Qualitative results on the (a) 50Salads, (b) GTEA, and (c) (3+1)ReC datasets. The baseline method suffers from over-segmentation, while our approach alleviates this issue by utilizing the continuity in human object interaction.
