Shijie WANG Xuejiao HU Sheng LIU Ming LI Yang LI Sidan DU
Detecting key frames in videos has garnered substantial attention in recent years. It is a point-level task with considerable research value and broad application prospects in daily life. For instance, video surveillance systems, video cover generation, and highlight moment flashback all demand key frame detection techniques. However, the task is beset by challenges such as the sparsity of key frame instances, the imbalance between target frames and background frames, and the absence of a suitable post-processing method. In response to these problems, we introduce a novel and effective Temporal Interval Guided (TIG) framework to precisely localize specific frames. The framework incorporates a proposed Point-Level-Soft non-maximum suppression (PLS-NMS) post-processing algorithm, which is suited to point-level tasks and relies on a well-designed confidence score decay function. Furthermore, we propose a TIG-loss, which is sensitive to the temporal interval from the target frame, to optimize the two-stage framework. The proposed method can be broadly applied to key frame detection in video understanding, including action start detection and static video summarization. Extensive experiments validate the efficacy of our approach on the action start detection benchmark datasets THUMOS’14 and ActivityNet v1.3, where it achieves state-of-the-art performance. Competitive results are also demonstrated on the SumMe and TVSum datasets for deep learning based static video summarization.
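To make the idea of soft suppression on a point-level task concrete, the following is a minimal sketch of soft-NMS over frame indices. It is not the PLS-NMS algorithm itself: the abstract does not specify the confidence score decay function, so the Gaussian decay, the function name `point_soft_nms`, and the parameters `sigma` and `score_threshold` are all assumptions made for illustration.

```python
import math


def point_soft_nms(frames, scores, sigma=2.0, score_threshold=0.05):
    """Soft-NMS over 1D frame indices (illustrative sketch).

    ASSUMPTION: the Gaussian-shaped decay below is a stand-in; the decay
    function actually used by PLS-NMS is not given in the abstract.
    """
    candidates = list(zip(frames, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining candidate frame.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_frame, best_score = candidates.pop(best_idx)
        kept.append((best_frame, best_score))

        # Decay the remaining scores: frames close to the selected one are
        # suppressed strongly, distant frames are left almost unchanged.
        decayed = []
        for frame, score in candidates:
            distance = abs(frame - best_frame)
            weight = 1.0 - math.exp(-(distance ** 2) / (2.0 * sigma ** 2))
            new_score = score * weight
            if new_score >= score_threshold:
                decayed.append((frame, new_score))
        candidates = decayed
    return kept


detections = point_soft_nms(frames=[10, 11, 50], scores=[0.9, 0.8, 0.7])
```

In this toy call, frame 11 sits next to the top-scoring frame 10, so its score is reduced sharply rather than being discarded outright, while the distant frame 50 keeps essentially its original score.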
Lei WANG Shanmin YANG Jianwei ZHANG Song GU
Human action recognition (HAR) exhibits limited accuracy in video surveillance because monocular cameras capture only 2D information. To address this problem, a depth estimation-based human skeleton action recognition method (SARDE) is proposed in this study, with the aim of transforming 2D human action data into a 3D format to uncover action clues hidden in the 2D data. SARDE comprises two tasks: human skeleton action recognition and monocular depth estimation. The two tasks are integrated in a multi-task manner and trained end to end, sharing parameters to exploit the correlation between action recognition and depth estimation and to learn depth features that are effective for human action recognition. In this study, graph-structured networks with inception blocks and skip connections are investigated for depth estimation. The experimental results verify the effectiveness and superiority of the proposed method in skeleton action recognition, where it reaches state-of-the-art performance on the evaluated datasets.
In this paper, we propose an extension of the Attention Branch Network (ABN) that uses instance segmentation to generate sharper attention maps for action recognition. Methods for visual explanation such as Grad-CAM usually generate blurry maps that are not intuitive for humans to understand, particularly when recognizing actions of people in videos. Our proposed method, Object-ABN, tackles this issue by introducing a new mask loss that makes the generated attention maps close to the instance segmentation result. Further, the Prototype Conformity (PC) loss and multiple attention maps are introduced to enhance the sharpness of the maps and improve classification performance. Experimental results on UCF101 and SSv2 show that the maps generated by the proposed method are much clearer, both qualitatively and quantitatively, than those of the original ABN.
In this paper, we propose a multi-domain learning model for action recognition. The proposed method inserts domain-specific adapters between the domain-independent layers of a backbone network. Unlike a multi-head network that switches classification heads only, our model switches not only the heads but also the adapters, which facilitates learning feature representations universal to multiple domains. Unlike prior works, the proposed method is model-agnostic and does not assume a particular model structure. Experimental results on three popular action recognition datasets (HMDB51, UCF101, and Kinetics-400) demonstrate that the proposed method is more effective than a multi-head architecture and more efficient than separately training models for each domain.
The performance of video action recognition has improved significantly in recent decades. Current recognition approaches mainly utilize convolutional neural networks to acquire video feature representations. In addition to the spatial information of video frames, temporal information such as motions and changes is important for recognizing videos. Therefore, the use of convolutions in a spatiotemporal three-dimensional (3D) space for representing spatiotemporal features has garnered significant attention. Herein, we introduce recent advances in 3D convolutions for video action recognition.
With the development of cameras and sensors and the spread of cloud computing, life logs can be easily acquired and stored in general households for the various services that utilize the logs. However, it is difficult to analyze moving images that are acquired by home sensors in real time using machine learning because the data size is too large and the computational complexity is too high. Moreover, collecting and accumulating in the cloud moving images that are captured at home and can be used to identify individuals may invade the privacy of application users. We propose a method of distributed processing over the edge and cloud that addresses the processing latency and the privacy concerns. On the edge (sensor) side, we extract feature vectors of human key points from moving images using OpenPose, which is a pose estimation library. On the cloud side, we recognize actions by machine learning using only the feature vectors. In this study, we compare the action recognition accuracies of multiple machine learning methods. In addition, we measure the analysis processing time at the sensor and the cloud to investigate the feasibility of recognizing actions in real time. Then, we evaluate the proposed system by comparing it with the 3D ResNet model in recognition experiments. The experimental results demonstrate that the action recognition accuracy is the highest when using LSTM and that the introduction of dropout in action recognition using 100 categories alleviates overfitting because the models can learn more generic human actions by increasing the variety of actions. In addition, it is demonstrated that preprocessing using OpenPose on the sensor side can substantially reduce the transfer quantity from the sensor to the cloud.
Shilei CHENG Mei XIE Zheng MA Siqi LI Song GU Feng YANG
Characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video processing. Because its soft assignment lacks temporal information, the vector of locally aggregated descriptors (VLAD) should be considered a suboptimal framework for learning spatio-temporal video representations. Motivated by the development of attention mechanisms in natural language processing, in this work we present a novel model in which VLAD follows spatio-temporal self-attention operations, named spatio-temporal self-attention weighted VLAD (ST-SAWVLAD). In particular, sequential convolutional feature maps extracted from two modalities, i.e., RGB and Flow, are respectively fed into the self-attention module to learn soft spatio-temporal assignment parameters, which enables the aggregation of not only detailed spatial information but also fine motion information from successive video frames. In experiments, we evaluate ST-SAWVLAD on two competitive action recognition datasets, UCF101 and HMDB51, and the results show outstanding performance. The source code is available at:
This paper proposes a method for heatmapping people who are involved in a group activity. Such people grouping is useful for understanding group activities. In prior work, people grouping is performed based on simple inflexible rules and schemes (e.g., based on proximity among people and with models representing only a constant number of people). In addition, several previous grouping methods require the results of action recognition for individual people, which may include erroneous results. On the other hand, our proposed heatmapping method can group any number of people who dynamically change their deployment. Our method can work independently of individual action recognition. A deep network for our proposed method consists of two input streams (i.e., RGB and human bounding-box images). This network outputs a heatmap representing pixelwise confidence values of the people grouping. Extensive exploration of appropriate parameters was conducted in order to optimize the input bounding-box images. As a result, we demonstrate the effectiveness of the proposed method for heatmapping people involved in group activities.
Action recognition using skeleton data (3D coordinates of human joints) is an attractive topic due to its robustness to the actor's appearance, camera's viewpoint, illumination, and other environmental conditions. However, skeleton data must be measured by a depth sensor or extracted from video data using an estimation algorithm, and doing so risks extraction errors and noise. In this work, for robust skeleton-based action recognition, we propose a deep state-space model (DSSM). The DSSM is a deep generative model of the underlying dynamics of an observable sequence. We applied the proposed DSSM to skeleton data, and the results demonstrate that it improves the classification performance of a baseline method. Moreover, we confirm that feature extraction with the proposed DSSM renders subsequent classifications robust to noise and missing values. In such experimental settings, the proposed DSSM outperforms a state-of-the-art method.
Volleyball video analysis plays an important role in providing data for TV content and in developing strategies. Among all the topics of volleyball analysis, qualitative player action recognition is essential because it can provide not only the action being performed but also its quality, that is, how well the action is performed. However, most action recognition research focuses on discriminating between different actions. The quality of an action, which is helpful for evaluating and training player skill, has received little attention so far. The key problems in qualitative action recognition include occlusion, small inter-class differences, and variations in appearance caused by player changes. This paper proposes a recognition framework that combines 3D global features and multi-view local features, using a global team formation feature, a ball state feature, and abrupt pose features. The above problems are addressed by combining the 3D global features (which hide the unstable and incomplete 2D motion features caused by occlusion) with the multi-view local features (which capture detailed local motion features of body parts from multiple viewpoints). Firstly, the team formation feature extracts the 3D trajectories of all team members rather than of a single target player. This proposal focuses more on the team as a whole while eliminating individual effects. Secondly, the ball motion state feature extracts features from the 3D ball trajectory. The ball motion is not affected by personal appearance, so this proposal ignores the influence of the players' appearance and is more robust to changes of the target player. Lastly, the abrupt pose feature consists of two parts: the abrupt hit frame pose (which extracts the contour shape of the player's pose at the hit time) and the abrupt pose variation (which extracts the pose variation between the preparation pose and the ending pose during the action). 
These two features make differences in action quality more distinguishable by focusing on how standard and stable the motion is across actions of different quality. Experiments are conducted on game videos from the Semifinal and Final Game of the 2014 Japan Inter High School Games of Men's Volleyball in Tokyo Metropolitan Gymnasium. The experimental results show that the accuracy reaches 97.26% for action discrimination, an improvement of 11.33%, and 91.76% for action quality evaluation, an improvement of 13.72%.
Tie HONG Yuan Wei LI Zhi Ying WANG
This paper studies head action recognition as a specific problem in action recognition. Unlike most existing research, our head action recognition problem is specifically defined to meet the requirements of some practical applications. Based on our definition, we build a corresponding head action dataset that contains many challenging cases. For action recognition, we propose a real-time head action recognition framework based on HOF and ELM. The framework consists of face detection based ROI determination, HOF feature extraction in the ROI, and ELM based action prediction. Experiments show that our method achieves good accuracy and is efficient enough for practical applications.
Shilei CHENG Song GU Maoquan YE Mei XIE
Human action recognition in videos attracts great research interest in computer vision. The Bag-of-Words (BoW) model is commonly used to obtain video-level representations. However, the BoW model coarsely assigns each feature vector to its nearest visual word, and the collection of unordered words ignores the spatial information of the interest points, inevitably causing nontrivial quantization errors and limiting improvements in classification rates. To address these drawbacks, we propose an approach for action recognition that encodes spatio-temporal log-Euclidean covariance matrix (ST-LECM) features within a low-rank and sparse representation framework. Motivated by low-rank matrix recovery, we assume that local descriptors in a spatio-temporal neighborhood have similar representations and should be approximately low rank. The learned coefficients not only capture the global data structure but also preserve consistency. Experimental results show that the proposed approach yields excellent recognition performance on synthetic video datasets and is robust to action variability, view variations, and partial occlusion.
We propose a feature for action recognition called Trajectory-Set (TS), built on top of the improved Dense Trajectory (iDT). The TS feature encodes only trajectories around densely sampled interest points, without any appearance features. Experimental results on the UCF50 action dataset demonstrate that TS is comparable to state-of-the-art methods and outperforms iDT, achieving an accuracy of 95.0% compared to 91.7% for iDT.
Zhaoyang GUO Xin'an WANG Bo WANG Zheng XIE
In the field of action recognition, Spatio-Temporal Interest Points (STIPs)-based features have shown high efficiency and robustness. However, most state-of-the-art descriptors of STIPs focus on two-dimensional (2D) images and ignore information in the 3D spatio-temporal space. In addition, a compact representation of descriptors should be considered because of storage and computation time costs. In this paper, a novel local descriptor named 3D Gradient LBP is proposed, which extends the traditional Local Binary Patterns (LBP) descriptor into 3D spatio-temporal space. The proposed descriptor takes advantage of the neighbourhood information of cuboids in three dimensions, which accounts for its excellent descriptive power for the distribution of grey levels. Experiments on three challenging datasets (KTH, Weizmann and UT Interaction) validate the effectiveness of our approach in the recognition of human actions.
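As an illustration of how an LBP code can be carried from 2D images into a space-time cuboid, the sketch below thresholds the six face neighbours of each interior voxel against its centre and histograms the resulting 6-bit codes. This is a plain volumetric LBP, not the 3D Gradient LBP descriptor itself: the abstract does not describe the gradient step or the neighbourhood layout, so the six-neighbour choice and the name `lbp_3d_histogram` are assumptions.

```python
def lbp_3d_histogram(cuboid):
    """Normalized 64-bin histogram of 6-neighbour LBP codes in a cuboid.

    cuboid is a nested list indexed as cuboid[t][y][x] (grey levels).
    ASSUMPTION: six face neighbours (previous/next frame, up/down,
    left/right); the paper's actual neighbourhood is not specified here.
    """
    n_frames, height, width = len(cuboid), len(cuboid[0]), len(cuboid[0][0])
    offsets = [(-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1)]
    histogram = [0] * 64

    # Only interior voxels have all six neighbours available.
    for t in range(1, n_frames - 1):
        for y in range(1, height - 1):
            for x in range(1, width - 1):
                center = cuboid[t][y][x]
                code = 0
                for bit, (dt, dy, dx) in enumerate(offsets):
                    # Set the bit when the neighbour is at least as bright.
                    if cuboid[t + dt][y + dy][x + dx] >= center:
                        code |= 1 << bit
                histogram[code] += 1

    total = sum(histogram)
    return [count / total for count in histogram] if total else histogram


# A 3x3x3 intensity ramp has a single interior voxel.
ramp = [[[9 * t + 3 * y + x for x in range(3)] for y in range(3)] for t in range(3)]
descriptor = lbp_3d_histogram(ramp)
```

The resulting 64-dimensional histogram is compact, which is in line with the storage and computation concerns raised in the abstract.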
Yinan LIU Qingbo WU Linfeng XU Bo WU
Traditional action recognition approaches use pre-defined rigid areas to process the space-time information, e.g. spatial pyramids, cuboids. However, most action categories happen in an unconstrained manner, that is, the same action in different videos can happen at different places. Thus we need a better video representation to deal with the space-time variations. In this paper, we introduce the idea of mining spatial temporal saliency. To better handle the uniqueness of each video, we use a space-time over-segmentation approach, e.g. supervoxel. We choose three different saliency measures that take not only the appearance cues, but also the motion cues into consideration. Furthermore, we design a category-specific mining process to find the discriminative power in each action category. Experiments on action recognition datasets such as UCF11 and HMDB51 show that the proposed spatial temporal saliency video representation can match or surpass some of the state-of-the-art alternatives in the task of action recognition.
Chien-Quang LE Sang PHAN Thanh Duc NGO Duy-Dinh LE Shin'ichi SATOH Duc Anh DUONG
Depth-based action recognition has been attracting the attention of researchers because of the advantages of depth cameras over standard RGB cameras. One of these advantages is that depth data can provide richer information from multiple projections. In particular, multiple projections can be used to extract discriminative motion patterns that would not be discernible from one fixed projection. However, high computational costs have meant that recent studies have exploited only a small number of projections, such as front, side, and top. Thus, a large number of projections, which may be useful for discriminating actions, are discarded. In this paper, we propose an efficient method to exploit pools of multiple projections for recognizing actions in depth videos. First, we project 3D data onto multiple 2D planes from different viewpoints sampled on a geodesic dome to obtain a large number of projections. Then, we train and test action classifiers independently for each projection. To reduce the computational cost, we propose a greedy method to select a small yet robust combination of projections. The idea is that the most complementary projections are considered first when searching for the optimal combination. We conducted extensive experiments to verify the effectiveness of our method on three challenging benchmarks: MSR Action 3D, MSR Gesture 3D, and 3D Action Pairs. The experimental results show that our method outperforms other state-of-the-art methods while using a small number of projections.
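The greedy selection step can be pictured with a generic forward-selection sketch. It assumes that each projection's classifier has already produced class scores on a validation set and that projections are fused by summing scores; the fusion rule, the stopping criterion, and the names `ensemble_accuracy` and `greedy_select` are assumptions, not details given in the abstract.

```python
def ensemble_accuracy(score_sets, labels):
    """Accuracy when the class scores of several projections are summed."""
    correct = 0
    for sample_idx, label in enumerate(labels):
        n_classes = len(score_sets[0][sample_idx])
        summed = [sum(s[sample_idx][c] for s in score_sets) for c in range(n_classes)]
        predicted = max(range(n_classes), key=lambda c: summed[c])
        correct += int(predicted == label)
    return correct / len(labels)


def greedy_select(projection_scores, labels, max_projections=3):
    """Greedy forward selection of projections (illustrative sketch).

    ASSUMPTION: a projection is added only if it strictly improves the
    validation accuracy of the score-sum ensemble built so far.
    """
    selected, best_acc = [], 0.0
    remaining = set(projection_scores)
    while remaining and len(selected) < max_projections:
        best_name, best_name_acc = None, best_acc
        for name in sorted(remaining):
            trial = [projection_scores[p] for p in selected + [name]]
            acc = ensemble_accuracy(trial, labels)
            if acc > best_name_acc:
                best_name, best_name_acc = name, acc
        if best_name is None:
            break  # No remaining projection complements the current set.
        selected.append(best_name)
        remaining.remove(best_name)
        best_acc = best_name_acc
    return selected, best_acc


# Hypothetical validation scores: 4 samples, 2 classes, 3 projections.
scores = {
    "front": [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]],
    "side": [[0.6, 0.4], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]],
    "top": [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
}
chosen, accuracy = greedy_select(scores, labels=[0, 0, 1, 1])
```

In this toy example, "front" and "side" each misclassify a different sample, so combining them corrects both errors, while the uninformative "top" projection is never added.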
Jianhong WANG Pinzheng ZHANG Linmin LUO
Nonnegative component representation (NCR) is a mid-level representation based on nonnegative matrix factorization (NMF). Recently, it has attracted much attention and achieved encouraging results for action recognition. In this paper, we propose a novel hierarchical dictionary learning strategy (HDLS) for NMF to improve the performance of NCR. Considering the variability of action classes, HDLS clusters similar classes into groups and forms a two-layer hierarchical class model. The groups in the first layer are disjoint, while in the second layer, the classes in each group are correlated. HDLS takes account of the differences between the two layers and uses a different dictionary learning method for each: the discriminant class-specific NMF for the first layer and the discriminant joint dictionary NMF for the second layer. The proposed approach is extensively tested on three public datasets, and the experimental results demonstrate the effectiveness and superiority of NCR with HDLS for large-scale action recognition.
Local spatio-temporal features are popular in the human action recognition task. In practice, they are usually coupled with a feature encoding approach, which helps to obtain the video-level vector representations that can be used in learning and recognition. In this paper, we present an efficient local feature encoding approach called Approximate Sparse Coding (ASC). In the off-line learning phase, ASC computes the sparse codes for a large collection of prototype local feature descriptors using Sparse Coding (SC). In the encoding phase, it looks up the precomputed sparse code of the nearest prototype for each to-be-encoded local feature using Approximate Nearest Neighbour (ANN) search. It shares the low dimensionality of SC and the high speed of ANN, both of which are desirable properties for a local feature encoding approach. ASC has been extensively evaluated on the KTH dataset and the HMDB51 dataset. We confirmed that it is able to efficiently encode a large quantity of local video features into discriminative low-dimensional representations.
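The encoding phase described above can be sketched as a table lookup. The example assumes the prototype sparse codes were already computed off-line, replaces ANN search with an exact brute-force nearest-neighbour search for brevity, and pools the looked-up codes with max pooling of absolute values; the pooling rule and the names `nearest_prototype` and `encode_video` are assumptions rather than details from the abstract.

```python
def nearest_prototype(feature, prototypes):
    """Index of the prototype closest to the feature (squared L2 distance).

    ASSUMPTION: exact brute-force search stands in for the approximate
    nearest neighbour (ANN) search used for speed in ASC.
    """
    best_index, best_distance = -1, float("inf")
    for index, prototype in enumerate(prototypes):
        distance = sum((a - b) ** 2 for a, b in zip(feature, prototype))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def encode_video(features, prototypes, prototype_codes):
    """Encode local features by looking up precomputed sparse codes.

    ASSUMPTION: the video-level vector is obtained by max pooling the
    absolute values of the looked-up codes.
    """
    pooled = [0.0] * len(prototype_codes[0])
    for feature in features:
        code = prototype_codes[nearest_prototype(feature, prototypes)]
        pooled = [max(p, abs(c)) for p, c in zip(pooled, code)]
    return pooled


# Hypothetical prototypes with sparse codes computed off-line.
prototypes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
prototype_codes = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, -2.0]]
video_vector = encode_video([[0.1, 0.0], [4.8, 5.1]], prototypes, prototype_codes)
```

No sparse coding problem is solved at encoding time; the cost per local feature is one nearest-neighbour query plus a lookup, which is where the speed advantage comes from.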
Shijian HUANG Junyong YE Tongqing WANG Li JIANG Changyuan XING Yang LI
Traditional low-rank features lose the temporal information within an action sequence. To retain this temporal information, we split an action video into multiple action subsequences and concatenate the low-rank features of all subsequences in temporal order. We then recognize actions by learning a novel dictionary model from the concatenated low-rank features. However, traditional dictionary learning models usually neglect the similarity among the coding coefficients and perform poorly on data that are not linearly separable. To overcome these shortcomings, we present a novel similarity constrained discriminative kernel dictionary learning method for action recognition. The effectiveness of the proposed method is verified on three benchmarks, and the experimental results are promising for action recognition.
Ngoc Nam BUI Jin Young KIM Hyoung-Gook KIM
Current research in computer vision has increasingly focused on recognizing human action, owing to the potential utility of such recognition in various applications. Among many potential approaches, one involving Gaussian Mixture Model (GMM) supervectors with a Support Vector Machine (SVM) and a nonlinear GMM KL kernel has been shown to yield improved performance for recognizing human activities. In this study, based on tensor analysis, we develop and exploit an extended class of action features that we refer to as gradient-flow tensor divergence. The proposed method achieves a best recognition rate of 96.3% on the KTH dataset while reducing processing time.