ISSN 0253-2778

CN 34-1054/N

Open Access · JUSTC · Information Science

Relation aware network for weakly-supervised temporal action localization

Cite this: https://doi.org/10.52396/JUST-2021-0061
  • Received Date: 02 March 2021
  • Revised Date: 28 April 2021
  • Publish Date: 31 October 2021
  • Abstract: Temporal action localization has become an important and challenging research direction owing to its wide range of applications. Since fully supervised localization requires substantial manual effort to obtain frame-level or segment-level annotations for untrimmed long videos, weakly supervised methods have attracted increasing attention in recent years. Weakly-supervised Temporal Action Localization (WS-TAL) aims to predict temporal action boundaries when only video-level labels are provided during training. However, existing methods often impose classification loss constraints on independent video segments alone, ignoring the relations within and between those segments. In this paper, we propose a novel framework called the Relation Aware Network (RANet), which models segment relations both within a video (intra-video) and across videos (inter-video). Specifically, the Intra-video Relation Module is designed to generate more complete action predictions, while the Inter-video Relation Module is designed to separate actions from the background. With this design, our model learns more robust visual feature representations for action localization. Extensive experiments on three public benchmarks, THUMOS14 and ActivityNet 1.2/1.3, demonstrate the strong performance of our method compared with the state of the art. (An illustrative code sketch of the two relation modules follows below.)
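The two relation modules named in the abstract lend themselves to a compact illustration. The following is a minimal, hypothetical PyTorch sketch, assuming the Intra-video Relation Module is realized as self-attention over segment features and the Inter-video Relation Module as a margin-based contrast between action (foreground) and background features; the names IntraVideoRelation and inter_video_separation_loss, and all hyperparameters, are illustrative assumptions rather than the paper's actual implementation.

```python
# Hypothetical sketch of the two modules described in the abstract; not the
# authors' implementation. Assumes self-attention for intra-video relations
# and a cosine-similarity margin loss for inter-video action/background
# separation. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraVideoRelation(nn.Module):
    """Self-attention over the segments of one video (one plausible reading
    of the abstract's Intra-video Relation Module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_segments, dim) segment features from a pre-trained backbone
        attn = torch.softmax(
            self.query(x) @ self.key(x).transpose(1, 2) * self.scale, dim=-1
        )
        # Residual connection: each segment keeps its own evidence while
        # aggregating context from related segments of the same video.
        return x + attn @ self.value(x)


def inter_video_separation_loss(fg_a, bg_a, fg_b, margin: float = 0.5):
    """Hypothetical margin loss for the Inter-video Relation Module: pull the
    foreground (action) features of two same-class videos together and push
    video A's foreground away from its own background features."""
    pos = F.cosine_similarity(fg_a, fg_b, dim=-1)  # same-class actions: high similarity
    neg = F.cosine_similarity(fg_a, bg_a, dim=-1)  # action vs. background: low similarity
    return F.relu(neg - pos + margin).mean()


if __name__ == "__main__":
    feats = torch.randn(2, 100, 1024)          # 2 videos, 100 segments, 1024-d features
    refined = IntraVideoRelation(1024)(feats)  # context-aware segment features
    loss = inter_video_separation_loss(
        refined[0].mean(0), feats[0].mean(0), refined[1].mean(0)
    )
    print(refined.shape, loss.item())
```

In this sketch, the residual connection lets each segment keep its own evidence while aggregating context from related segments, in line with the abstract's goal of more complete action predictions, and the margin term encodes the action/background separation objective.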


    Article Metrics

    Article views: 309 · PDF downloads: 314