A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction

University of Texas at Arlington  

CVPR 2023

Abstract

This paper focuses on leveraging Human Object Interaction (HOI) information to improve temporal action segmentation under timestamp supervision, where only one frame is annotated for each action segment. This information is obtained from an off-the-shelf pre-trained HOI detector that requires no additional HOI-related annotations in our experimental datasets. Our approach generates pseudo labels by expanding the annotated timestamps into intervals and allows the system to exploit the spatio-temporal continuity of human interaction with an object to segment the video. We also propose the (3+1) Real-time Cooking (ReC) dataset as a realistic collection of videos from 30 participants cooking 15 breakfast items. Our dataset has three main properties: 1) it is, to our knowledge, the first to offer synchronized third- and first-person videos, 2) it incorporates diverse actions and tasks, and 3) it consists of high-resolution frames to detect fine-grained information. In our experiments, we benchmark state-of-the-art segmentation methods under different levels of supervision on our dataset. We also quantitatively show the advantages of using HOI information, as our framework improves on its baseline segmentation method on several challenging datasets with varying viewpoints, providing improvements of up to 10.9% and 5.3% in F1 score and frame-wise accuracy, respectively.



The proposed training framework. The secondary label generator creates a new pseudo ground truth κ using the HOI detections ρ and the existing timestamp annotations. The binarized pseudo ground truth (α) also provides a new supervisory signal to the primary label generator for generating frame-wise labels β.
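As a rough illustration of the idea of expanding annotated timestamps into intervals, the following minimal sketch grows each annotated frame outward while a per-frame HOI signal stays continuously active. It is not the paper's label generator: the boolean `hoi_active` input, the `UNLABELED` marker, the function name `expand_timestamps`, and the rule that an earlier annotation wins contested frames are all assumptions made for this example.

```python
from typing import List, Tuple

UNLABELED = -1  # hypothetical marker for frames that receive no pseudo label


def expand_timestamps(
    hoi_active: List[bool],
    timestamps: List[Tuple[int, int]],
) -> List[int]:
    """Expand each (frame_index, action_id) annotation into an interval.

    A frame joins an interval only if the HOI signal is active on every
    frame between it and the annotated frame, and the interval never
    crosses a neighbouring annotated frame.
    """
    n = len(hoi_active)
    labels = [UNLABELED] * n
    ordered = sorted(timestamps)
    for i, (t, action) in enumerate(ordered):
        labels[t] = action  # the annotated frame always keeps its label
        if not hoi_active[t]:
            continue  # no interaction at the timestamp, so nothing to extend
        # Neighbouring annotations bound how far this interval may grow.
        left_limit = ordered[i - 1][0] + 1 if i > 0 else 0
        right_limit = ordered[i + 1][0] - 1 if i + 1 < len(ordered) else n - 1
        # Grow left; frames already claimed by the earlier annotation are kept.
        j = t - 1
        while j >= left_limit and hoi_active[j] and labels[j] == UNLABELED:
            labels[j] = action
            j -= 1
        # Grow right until the interaction stops or the next annotation nears.
        j = t + 1
        while j <= right_limit and hoi_active[j]:
            labels[j] = action
            j += 1
    return labels


hoi = [False, True, True, True, False, False, True, True, True, True]
pseudo = expand_timestamps(hoi, [(2, 0), (7, 1)])
print(pseudo)
```

Frames where the interaction signal drops out stay unlabeled in this sketch, so only the stretches with continuous interaction around a timestamp contribute pseudo labels.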



(3+1) Real-time Cooking (ReC) Dataset


Real-time instructional video dataset comparison. * indicates approximation due to a hidden test set. “Views” refers to 3rd person. Some statistical discrepancies between 1ReC and 3ReC are due to frame loss in some videos. “Envir.” includes various camera setups.

Video Samples

Diverse environments for the dish "Avocado Toast"

Multiple viewpoints for the same dish "Orange Juice"

Qualitative Results


Qualitative results on the (a) 50Salads, (b) GTEA, and (c) (3+1)ReC datasets. The baseline method suffers from over-segmentation, while our approach alleviates this issue by utilizing the continuity in human object interaction.
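To make over-segmentation concrete, the sketch below counts the contiguous segments in a frame-wise label sequence and computes frame-wise accuracy. The toy label sequences and the helper names `count_segments` and `frame_accuracy` are illustrative assumptions, not the evaluation code or data behind the reported results.

```python
from itertools import groupby
from typing import Sequence


def count_segments(labels: Sequence[int]) -> int:
    """Number of maximal runs of identical consecutive labels."""
    return sum(1 for _ in groupby(labels))


def frame_accuracy(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Fraction of frames whose predicted label matches the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if not gt:
        return 0.0
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)


# Toy example: two true segments, but the prediction flickers between labels.
gt = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [0, 0, 1, 0, 1, 1, 0, 1]

print(count_segments(gt))        # 2
print(count_segments(pred))      # 6
print(frame_accuracy(pred, gt))  # 0.75
```

In this toy case the prediction is correct on three quarters of the frames yet contains three times as many segments as the ground truth, which is why a segment-level measure is commonly reported next to frame-wise accuracy.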



BibTeX


      @inproceedings{sayed2023new,
        title={A New Dataset and Approach for Timestamp Supervised Action Segmentation Using Human Object Interaction},
        author={Sayed, Saif and Ghoddoosian, Reza and Trivedi, Bhaskar and Athitsos, Vassilis},
        booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
        pages={3132--3141},
        year={2023}
      }