Electronic International Standard Serial Number (EISSN)
1090-235X
abstract
In this work, we propose Temporally Aggregated Bag-of-Discriminant-Words (TAB), a new method for generating temporal action proposals from long untrimmed videos. TAB is based on the observation that action and background temporal regions of untrimmed videos contain many overlapping frames, which makes it difficult to segment actions from non-action regions. TAB addresses this issue by extracting class-specific codewords from action and background videos and assigning each codeword a discriminative weight based on its ability to distinguish between the two classes. We integrate these discriminative weights into Bag-of-Words encoding, yielding a representation we call Bag-of-Discriminant-Words (BoDW). We sample untrimmed videos into non-overlapping snippets and, in a single pass, temporally aggregate the BoDW representations of multiple snippets into action proposals using a binary classifier trained on trimmed videos. TAB can be used with different types of features, including those computed by deep networks. We demonstrate the effectiveness of the TAB proposal extraction method on two challenging temporal action detection datasets, MSR-II and Thumos14, where it improves upon the state of the art with recall rates of 87.0% and 82.0%, respectively, at a temporal intersection-over-union ratio of 0.8.
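The core encoding step described above can be illustrated with a minimal sketch: build a Bag-of-Words histogram over a codebook, then re-weight each bin by a per-codeword discriminative weight. The codebook, weights, feature dimensions, and function names below are hypothetical assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def bow_histogram(features, codebook):
    """Assign each local feature to its nearest codeword and build a
    normalized Bag-of-Words histogram (sketch; brute-force assignment)."""
    # features: (n, d) local descriptors; codebook: (k, d) codewords
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def bodw_encode(features, codebook, discriminative_weights):
    """Re-weight the BoW histogram by per-codeword discriminative weights
    (e.g. reflecting action-vs-background separability)."""
    return bow_histogram(features, codebook) * discriminative_weights

# Toy example: a 4-word codebook in a 2-D feature space, with hypothetical
# discriminative weights favoring codewords 1 and 2.
rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.1, 0.9, 0.9, 0.1])
# Snippet features clustered near codeword 1 at [1, 0].
snippet_features = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(10, 2))
encoding = bodw_encode(snippet_features, codebook, weights)
print(encoding.shape)
```

In the full pipeline, one such BoDW vector would be computed per non-overlapping snippet, and consecutive snippet vectors scored by the binary classifier would be aggregated into proposals.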
Classification
subjects
Computer Science
keywords
temporal action detection; bag of words; temporal action proposals