End-to-End Temporal Action Detection Using Bag of Discriminant Snippets

authors

  • Murtaza, Fiza
  • Yousaf, Muhammad Haroon
  • Velastin Carroza, Sergio Alejandro
  • Qian, Yu

publication date

  • February 2019

start page

  • 272

end page

  • 276

issue

  • 2

volume

  • 26

International Standard Serial Number (ISSN)

  • 1070-9908

Electronic International Standard Serial Number (EISSN)

  • 1558-2361

abstract

  • Detecting human actions in long untrimmed videos is a challenging problem. Existing temporal-action detection methods have difficulty finding the precise start and end times of actions in untrimmed videos. In this letter, we propose a temporal-action detection framework, based on a Bag of Discriminant Snippets (BoDS), that detects multiple actions in an end-to-end manner. BoDS is based on the observation that multiple action classes and the background class share similar snippets, which causes incorrect classification of action regions and imprecise boundaries. We address this issue by finding the key-snippets in the training data of each class and computing their discriminative power, which is used in BoDS encoding. When testing an untrimmed video, we compute the BoDS representation for multiple candidate proposals and assign each a class label via a majority-voting scheme. We evaluate BoDS on the Thumos14 and ActivityNet datasets and obtain state-of-the-art results. On the sports subset of the ActivityNet dataset, we obtain a mean average precision (mAP) of 29% at a temporal intersection-over-union (tIoU) threshold of 0.7. On the Thumos14 dataset, we obtain a significant gain in mAP, improving from 20.8% to 31.6% at tIoU = 0.7.
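The abstract's core idea, matching each snippet of a candidate proposal against per-class key-snippets and labeling the proposal by weighted majority vote, can be sketched in plain Python. Note this is a minimal illustrative sketch, not the paper's implementation: the snippet features, the nearest-neighbor distance, and all names (`key_snippets`, `classify_proposal`, the toy class labels) are assumptions, and the real method encodes C3D features rather than 2-D toy vectors.

```python
from collections import Counter

def nearest_key_snippet(snippet, key_snippets):
    """Return (class_label, weight) of the closest key-snippet by
    squared Euclidean distance (distance choice is an assumption)."""
    best, best_dist = None, float("inf")
    for feat, label, weight in key_snippets:
        d = sum((a - b) ** 2 for a, b in zip(snippet, feat))
        if d < best_dist:
            best_dist, best = d, (label, weight)
    return best

def classify_proposal(snippets, key_snippets):
    """Label a candidate proposal by a weighted majority vote:
    each snippet votes for its nearest key-snippet's class,
    scaled by that key-snippet's discriminative power."""
    votes = Counter()
    for s in snippets:
        label, weight = nearest_key_snippet(s, key_snippets)
        votes[label] += weight
    return votes.most_common(1)[0][0]

# Toy key-snippets: (feature vector, class label, discriminative weight).
key_snippets = [
    ([0.0, 0.0], "background", 0.5),
    ([1.0, 1.0], "high_jump", 2.0),
    ([0.9, 1.1], "high_jump", 1.5),
]

# A proposal whose snippets mostly resemble the "high_jump" key-snippets.
proposal = [[0.95, 1.0], [1.0, 0.9], [0.1, 0.0]]
print(classify_proposal(proposal, key_snippets))  # -> high_jump
```

The discriminative weight lets ambiguous snippets (shared between actions and background, the problem the letter highlights) contribute little, while distinctive key-snippets dominate the vote.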

keywords

  • temporal-action detection; 3D convolutional network (C3D); untrimmed videos; Thumos14; ActivityNet; temporal-action proposals