AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition
Yue Meng, Rameswar Panda, Chung-Ching Lin, Prasanna Sattigeri, Leonid Karlinsky, Kate Saenko§, Aude Oliva, Rogerio Feris
Massachusetts Institute of Technology · MIT-IBM Watson AI Lab · IBM Research · Microsoft Research · Boston University§
ICLR 2021

Abstract

Temporal modelling is key to efficient video action recognition. While understanding temporal information can improve recognition accuracy for dynamic actions, removing temporal redundancy and reusing past features can significantly save computation, leading to efficient action recognition. In this paper, we introduce an adaptive temporal fusion network, called AdaFuse, that dynamically fuses channels from current and past feature maps for strong temporal modelling. Specifically, the necessary information from the historical convolution feature maps is fused with the current pruned feature maps to improve both recognition accuracy and efficiency. In addition, we use a skipping operation to further reduce the computation cost of action recognition. Extensive experiments on Something-Something V1 & V2, Jester, and Mini-Kinetics show that our approach achieves about 40% computation savings with accuracy comparable to state-of-the-art methods.
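The fusion mechanism described above can be illustrated with a short PyTorch sketch. For each channel of the current frame's feature map, a lightweight policy picks one of three actions: keep the freshly computed channel, reuse the corresponding channel from the previous frame, or skip it entirely; a Gumbel-Softmax sample keeps the discrete choice differentiable during training. The class and layer names below (AdaptiveTemporalFusion, the linear policy head) are illustrative assumptions, not the authors' released code, and for simplicity the sketch zeroes skipped channels rather than actually pruning their convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTemporalFusion(nn.Module):
    """Minimal sketch of per-channel adaptive temporal fusion.

    For each channel, a policy chooses among {keep, reuse, skip}:
    keep the current channel, copy it from the previous frame,
    or zero it out. Names and the policy-head design are assumptions.
    """
    def __init__(self, channels, tau=1.0):
        super().__init__()
        self.tau = tau
        # Policy head: pooled current + past features -> 3 logits per channel.
        self.policy = nn.Linear(2 * channels, 3 * channels)

    def forward(self, cur, prev):
        # cur, prev: (B, C, H, W) feature maps at time t and t-1
        b, c, _, _ = cur.shape
        ctx = torch.cat([cur.mean(dim=(2, 3)), prev.mean(dim=(2, 3))], dim=1)
        logits = self.policy(ctx).view(b, c, 3)
        # Differentiable hard sample over {keep, reuse, skip}
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)  # (B, C, 3)
        keep = gate[..., 0].view(b, c, 1, 1)
        reuse = gate[..., 1].view(b, c, 1, 1)
        # The skip action contributes nothing: its channel stays zero.
        return keep * cur + reuse * prev


# Usage: fuse features of consecutive frames inside a residual block.
fuse = AdaptiveTemporalFusion(channels=64)
cur = torch.randn(2, 64, 56, 56)   # features at time t
prev = torch.randn(2, 64, 56, 56)  # features at time t-1
out = fuse(cur, prev)
print(out.shape)  # torch.Size([2, 64, 56, 56])
```

In the full model, channels chosen for reuse or skip need not be recomputed by the convolution at all, which is where the computation savings come from; the sketch above only reproduces the fusion logic.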

Figure: Action recognition results on Something-Something V1 & V2

Figure: Accuracy versus efficiency comparison

Figure: Dataset-specific policy distribution

Figure: Policy distribution and trends for each residual block
Paper and Code

Yue Meng, Rameswar Panda, Chung-Ching Lin, Prasanna Sattigeri, Leonid Karlinsky, Kate Saenko, Aude Oliva, and Rogerio Feris.
"AdaFuse: Adaptive Temporal Fusion Network for Efficient Action Recognition."
In International Conference on Learning Representations (ICLR), 2021.
[PDF] [Code]