Yaotong SONG Zhipeng LIU Zhiming ZHANG Jun TANG Zhenyu LEI Shangce GAO
Deep networks are undergoing rapid development. However, as network depth increases, the issue of how to fuse features from different layers becomes increasingly prominent. To address this challenge, we propose a cross-layer feature fusion module based on neural dendrites, termed dendritic learning-based feature fusion (DFF). Compared with other fusion methods, DFF offers superior biological interpretability owing to the nonlinear capabilities of dendritic neurons. By integrating the classic ResNet architecture with DFF, we devise ResNeFt. Benefiting from the unique structure and nonlinear processing capabilities of dendritic neurons, the fused features of ResNeFt exhibit enhanced representational power. Its effectiveness and superiority have been validated on multiple medical datasets.
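As a rough illustration of the dendritic fusion idea, the PyTorch sketch below fuses a shallow and a deep feature map through sigmoid synapses, multiplicative dendritic branches, a summing membrane, and a soma nonlinearity; the branch/synapse configuration and the 1x1 synaptic convolutions are our assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DendriticFusion(nn.Module):
    """Sketch of a dendritic cross-layer fusion unit (configuration assumed)."""
    def __init__(self, c_shallow, c_deep, c_out, branches=4, synapses=2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.ModuleList(nn.Conv2d(c_shallow + c_deep, c_out, 1)
                          for _ in range(synapses))
            for _ in range(branches))

    def forward(self, f_shallow, f_deep):
        # bring the deeper map to the shallow map's resolution, then fuse
        f_deep = F.interpolate(f_deep, size=f_shallow.shape[2:],
                               mode="bilinear", align_corners=False)
        x = torch.cat([f_shallow, f_deep], dim=1)
        membrane = 0.0
        for branch in self.branches:
            d = 1.0
            for synapse in branch:
                d = d * torch.sigmoid(synapse(x))  # synaptic nonlinearity
            membrane = membrane + d                # sum dendritic branch currents
        return torch.sigmoid(membrane)             # soma output
```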
Jia-ji JIANG Hai-bin WAN Hong-min SUN Tuan-fa QIN Zheng-qiang WANG
In this paper, the voxel-based three-dimensional (3D) point cloud object detection model Voxel-RCNN ("Towards High Performance Voxel-based 3D Object Detection") is used as the benchmark network. To address problems in current mainstream voxelization-based 3D point cloud detection methods, such as insufficient feature extraction in the 3D backbone and the lack of feature expression ability under the bird's-eye view (BEV), a high-performance voxel-based 3D object detection network (Reinforced Voxel-RCNN) is proposed. First, a 3D feature extraction module integrating an inverted residual convolutional network with weight normalization is designed for the 3D backbone. This module not only retains more point cloud feature information and enhances information interaction between convolutional layers, but also improves the feature extraction ability of the backbone network. Second, a spatial feature-semantic fusion module based on spatial and channel attention is proposed from the BEV perspective. The combined use of spatial and semantic features further improves the network's ability to express point cloud features. In experiments on the public KITTI dataset, the proposed method outperforms many voxel-based methods. Compared with the baseline network, both the 3D average precision and the BEV average precision improve on the Car, Cyclist, and Pedestrian categories: in 3D average precision, by 0.23% for Car, 0.78% for Cyclist, and 2.08% for Pedestrian; in BEV average precision, by 0.32% for Car, 0.99% for Cyclist, and 2.38% for Pedestrian. The findings demonstrate that the proposed enhancements effectively improve detection accuracy.
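A minimal sketch of the first idea, using dense Conv3d in place of the sparse convolutions a real voxel backbone would use (an assumption for brevity):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

class InvertedResidual3D(nn.Module):
    """Sketch of an inverted-residual 3D block with weight normalization."""
    def __init__(self, channels, expand=4):
        super().__init__()
        mid = channels * expand
        self.block = nn.Sequential(
            weight_norm(nn.Conv3d(channels, mid, 1)),          # expand
            nn.ReLU(inplace=True),
            weight_norm(nn.Conv3d(mid, mid, 3, padding=1, groups=mid)),
            nn.ReLU(inplace=True),
            weight_norm(nn.Conv3d(mid, channels, 1)),          # project
        )

    def forward(self, x):
        return x + self.block(x)  # residual path keeps information flowing
```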
Xiaosheng YU Jianning CHI Ming XU
Accurate segmentation of the fundus vessel structure can effectively assist doctors in diagnosing eye diseases. In this paper, we propose a fundus blood vessel segmentation network that combines cross-modal features and verify our method on the public dataset OCTA-500. Experimental results show that our method achieves high accuracy and robustness.
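A minimal sketch of one plausible cross-modal fusion scheme (the modalities, encoders, and fusion operator here are assumptions; the paper's actual design may differ):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """One encoder per modality, fused by concatenation and a 1x1 conv."""
    def __init__(self, c_out=64):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.enc_a, self.enc_b = encoder(), encoder()
        self.fuse = nn.Conv2d(64, c_out, 1)

    def forward(self, mod_a, mod_b):
        # encode each modality separately, then mix channel-wise
        f = torch.cat([self.enc_a(mod_a), self.enc_b(mod_b)], dim=1)
        return self.fuse(f)
```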
Time-series prediction is applied widely and is an important problem across many fields, such as stock prediction, sales prediction, and loan prediction, with great value in production and daily life. It requires a model that can effectively capture the long-term feature dependence between output and input. Recent studies show that the Transformer can improve time-series prediction ability. However, the Transformer has several problems that prevent it from being applied directly to time-series prediction: (1) local agnosticism: self-attention in the Transformer is not sensitive to short-term feature dependence, which leads to anomalies when modeling time-series; (2) memory bottleneck: the space complexity of regular self-attention grows quadratically with sequence length, making direct modeling of long time-series infeasible. To solve these problems, this paper designs an efficient model for long time-series prediction: a double-pyramid bidirectional feature fusion network with a parallel Temporal Convolutional Network (TCN) and FastFormer. This structure combines the fine-grained time-series information captured by the TCN with the global interaction information captured by FastFormer, and thus handles time-series prediction well.
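The sketch below illustrates the parallel-branch idea under stated assumptions: a causal dilated convolution supplies local detail, a simplified FastFormer-style additive attention supplies linear-time global interaction, and the two are fused by concatenation (the paper's pyramid and bidirectional structure are not reproduced).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    """Causal dilated convolution branch (fine-grained local patterns)."""
    def __init__(self, c, k=3, d=1):
        super().__init__()
        self.pad = (k - 1) * d
        self.conv = nn.Conv1d(c, c, k, dilation=d)

    def forward(self, x):                         # x: (B, C, T)
        return F.relu(self.conv(F.pad(x, (self.pad, 0))))  # left-pad = causal

class AdditiveAttention(nn.Module):
    """Simplified FastFormer-style attention: a softmax-weighted global
    query summarizes the sequence in linear time."""
    def __init__(self, c):
        super().__init__()
        self.score = nn.Linear(c, 1)
        self.proj = nn.Linear(c, c)

    def forward(self, x):                         # x: (B, T, C)
        alpha = torch.softmax(self.score(x), dim=1)   # (B, T, 1)
        g = (alpha * x).sum(dim=1, keepdim=True)      # global query
        return self.proj(x * g)                       # broadcast interaction

class ParallelFusion(nn.Module):
    """Run both branches in parallel and fuse by concatenation."""
    def __init__(self, c):
        super().__init__()
        self.tcn, self.attn = TCNBlock(c), AdditiveAttention(c)
        self.fuse = nn.Linear(2 * c, c)

    def forward(self, x):                         # x: (B, T, C)
        local = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        global_ = self.attn(x)
        return self.fuse(torch.cat([local, global_], dim=-1))
```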
Xianyu WANG Cong LI Heyi LI Rui ZHANG Zhifeng LIANG Hai WANG
Visual object tracking has always been a challenging task in computer vision. During tracking, the shape and appearance of the target may change greatly, and owing to the lack of sufficient training samples, most online-learning tracking algorithms encounter performance bottlenecks. In this paper, an improved real-time algorithm based on deep learning features is proposed, which combines multi-feature fusion, multi-scale estimation, adaptive updating of the target model, and re-detection after target loss. The effectiveness and advantages of the proposed algorithm are demonstrated through extensive comparative experiments with other state-of-the-art algorithms on large benchmark datasets.
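A minimal sketch of the adaptive-update and loss re-detection logic as described (the threshold, learning rate, and exponential moving average are illustrative assumptions):

```python
import numpy as np

def update_template(template, feature, response_peak,
                    peak_thresh=0.3, lr=0.02):
    """Update the target model only when the tracker is confident;
    otherwise flag that re-detection should be triggered."""
    if response_peak < peak_thresh:
        return template, True          # target likely lost -> re-detect
    updated = (1.0 - lr) * template + lr * feature  # moving-average update
    return updated, False
```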
Chao XU Yunfeng YAN Lehangyu YANG Sheng LI Guorui FENG
Altered fingerprints help criminals evade the police and cause great harm to society. In this letter, an altered fingerprint detection method is proposed. The method consists of two deep convolutional neural networks trained on time-domain and frequency-domain features, with a spectral attention module connecting the two networks. After the extraction networks, a feature fusion module is used to exploit the relationship between the two networks' features. We conduct ablation experiments and insert the proposed modules into several popular architectures. The results show that the proposed method improves altered fingerprint detection performance compared with recent neural networks.
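One plausible reading of a spectral attention module is sketched below: channel weights derived from each channel's magnitude spectrum (this is an assumption, not the authors' exact design).

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Channel attention computed from the frequency content of features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        mag = torch.fft.fft2(x).abs()           # per-channel magnitude spectrum
        w = self.mlp(mag.mean(dim=(2, 3)))      # (B, C) channel descriptors
        return x * w[:, :, None, None]          # reweight spatial features
```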
Hongzhe LIU Ningwei WANG Xuewei LI Cheng XU Yaze LI
In the neck part of a two-stage object detection network, feature fusion is generally carried out in either a top-down or bottom-up manner. However, two types of imbalance may exist: feature imbalance in the neck of the model, and gradient imbalance in the region-of-interest extraction layer due to the scale changes of objects. The deeper the network, the more abstract the learned features, i.e., the more semantic information can be extracted, but the less background, spatial location, and other high-resolution information is retained. In contrast, the shallow layers learn little semantic information but abundant spatial location information. We propose the Both Ends to Centre to Multiple Layers (BEtM) feature fusion method to solve the feature imbalance problem in the neck, and a Multi-level Region of Interest Feature Extraction (MRoIE) layer to solve the gradient imbalance problem. In combination with the Region-based Convolutional Neural Network (R-CNN) framework, our Balanced Feature Fusion (BFF) method offers significantly improved network performance compared with the Faster R-CNN architecture. On the MS COCO 2017 dataset, it achieves an average precision (AP) that is 1.9 points and 3.2 points higher than those of the Feature Pyramid Network (FPN) Faster R-CNN framework and the Generic Region of Interest Extractor (GRoIE) framework, respectively.
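A toy sketch of the both-ends-to-centre intuition, assuming all pyramid levels share one channel width and using simple averaging (the paper's actual fusion operators are not reproduced here):

```python
import torch.nn.functional as F

def both_ends_to_centre(feats):
    """feats: list of pyramid maps [P2 (shallow) ... P5 (deep)], all with the
    same channel count (an assumption). Each level is fused with spatial
    detail from the shallow end and semantics from the deep end."""
    def resize_to(src, ref):
        return F.interpolate(src, size=ref.shape[2:], mode="bilinear",
                             align_corners=False)
    fused = []
    for f in feats:
        shallow = resize_to(feats[0], f)    # spatial/location information
        deep = resize_to(feats[-1], f)      # semantic information
        fused.append((f + shallow + deep) / 3.0)
    return fused
```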
Convolutional neural networks (CNNs) have made extraordinary progress in image classification tasks. However, using a CNN directly to detect image manipulation is less effective. To address this problem, we propose an image filtering layer and a multi-scale feature fusion module that guide the model to perform image manipulation detection more accurately and effectively. A series of experiments shows that our model improves on previous work in image manipulation detection.
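A common realization of such a filtering layer is sketched below: a fixed SRM-style high-pass kernel that suppresses image content and exposes local residual noise (the specific kernel is a conventional choice in manipulation detection, not necessarily the authors').

```python
import torch
import torch.nn as nn

class HighPassFilterLayer(nn.Module):
    """Fixed high-pass filtering applied independently to each RGB channel."""
    def __init__(self):
        super().__init__()
        k = torch.tensor([[-1.,  2.,  -2.,  2., -1.],
                          [ 2., -6.,   8., -6.,  2.],
                          [-2.,  8., -12.,  8., -2.],
                          [ 2., -6.,   8., -6.,  2.],
                          [-1.,  2.,  -2.,  2., -1.]]) / 12.0
        # one copy of the kernel per input channel (depthwise filtering)
        self.register_buffer("kernel", k.view(1, 1, 5, 5).repeat(3, 1, 1, 1))

    def forward(self, x):                    # x: (B, 3, H, W)
        return nn.functional.conv2d(x, self.kernel, padding=2, groups=3)
```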
Yanyan ZHANG Meiling SHEN Wensheng YANG
We propose a target detection network (RMF-Net) based on a multi-scale strategy to solve the problems of large differences in detection scale and mutual occlusion, which result in inaccurate localization. A multi-layer feature fusion module and a multi-expansion dilated convolution pyramid module were designed based on the ResNet-101 residual network. Combining the shallow and deep features of the target and expanding the receptive field of the network improves the network's ability to express the multi-scale features of the target. Moreover, RoI Align pooling was introduced to reduce the anchor-frame inaccuracy caused by repeated quantization, improving positioning accuracy. Finally, an AD-IoU loss function was designed that adaptively optimizes the distance between the prediction box and the ground-truth box by comprehensively considering their overlap rate, centre distance, and aspect ratio, which improves detection accuracy for occluded targets. Ablation experiments on the RMF-Net model verified the effectiveness of each factor in improving network detection accuracy. Comparative experiments were conducted on the Pascal VOC2007 and Pascal VOC2012 datasets against various CNN-based target detection algorithms. The results demonstrate that RMF-Net exhibits strong scale adaptability at different occlusion rates, reaching detection accuracies of 80.4% and 78.5% on the two datasets, respectively.
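Since the exact AD-IoU formulation is not given here, the sketch below shows a standard CIoU-style loss built from the same three ingredients (overlap, centre distance, aspect ratio); the authors' adaptive weighting is not reproduced.

```python
import math
import torch

def ciou_style_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # centre distance, normalized by the enclosing box diagonal
    cpx = (pred[:, 0] + pred[:, 2]) / 2; cpy = (pred[:, 1] + pred[:, 3]) / 2
    ctx = (target[:, 0] + target[:, 2]) / 2; cty = (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    dist = ((cpx - ctx) ** 2 + (cpy - cty) ** 2) / \
           ((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps)
    # aspect-ratio consistency term
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + dist + alpha * v).mean()
```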
Shengzhou YI Junichiro MATSUGAMI Toshihiko YAMASAKI
Developing well-designed presentation slides is challenging for many people, especially novices, and the ability to build high-quality slideshows is becoming more important in society. In this study, a neural network was used to distinguish novice from well-designed presentation slides based on visual and structural features. For this purpose, a dataset containing 1,080 slide pairs was newly constructed. One of each pair was created by a novice, and the other was the version improved by the same person according to experts' advice. Ten checkpoints frequently pointed out by professional consultants were extracted and set as prediction targets. An intrinsic problem was that the label distribution was imbalanced, because only a portion of the samples exhibited each design problem. Therefore, re-sampling methods for addressing class imbalance were applied to improve the accuracy of the proposed model. Furthermore, we combined the target task with an assistant task for transfer and multi-task learning, which helped the proposed model achieve better performance. With the optimal settings for each checkpoint, the average accuracy of the proposed model rose to 81.79%. With the advice provided by our assessment system, the novices significantly improved their slide designs.
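One standard re-sampling recipe of the kind the study could apply is inverse-frequency oversampling, sketched with PyTorch's WeightedRandomSampler (the study's exact method may differ):

```python
import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    """Oversample the minority class so each mini-batch sees positive and
    negative slides at roughly equal rates. `labels` is a list of 0/1 ints."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels)
    weights = 1.0 / class_counts[labels].float()   # inverse-frequency weights
    return WeightedRandomSampler(weights, num_samples=len(labels),
                                 replacement=True)
```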
Ye TAO Fang KONG Wenjun JU Hui LI Ruichun HOU
As an important type of science and technology service resource, energy consumption data play a vital role in the process of value chain integration between home appliance manufacturers and the state grid. Accurate electricity consumption prediction is essential for demand response programs in smart grid planning. The vast majority of existing prediction algorithms only exploit data belonging to a single domain, i.e., historical electricity load data. However, dependencies and correlations may exist among different domains, such as the regional weather condition and local residential/industrial energy consumption profiles. To take advantage of cross-domain resources, a hybrid energy consumption prediction framework is presented in this paper. This framework combines the long short-term memory model with an encoder-decoder unit (ED-LSTM) to perform sequence-to-sequence forecasting. Extensive experiments are conducted with several of the most commonly used algorithms over integrated cross-domain datasets. The results indicate that the proposed multistep forecasting framework outperforms most of the existing approaches.
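A minimal ED-LSTM sketch under stated assumptions (the hidden size, forecast horizon, and the convention that the load value is the first input feature are illustrative):

```python
import torch
import torch.nn as nn

class EDLSTM(nn.Module):
    """Encoder-decoder LSTM for multistep load forecasting: the encoder
    summarizes the multi-domain input window, the decoder unrolls the
    forecast horizon autoregressively."""
    def __init__(self, n_features, hidden=64, horizon=24):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (B, T, n_features)
        _, state = self.encoder(x)           # compress history into (h, c)
        step = x[:, -1:, :1]                 # last observed load (assumed feature 0)
        outputs = []
        for _ in range(self.horizon):        # autoregressive decoding
            out, state = self.decoder(step, state)
            step = self.head(out)            # (B, 1, 1) next-step prediction
            outputs.append(step)
        return torch.cat(outputs, dim=1)     # (B, horizon, 1)
```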
Yuanbo FANG Hongliang FU Huawei TAO Ruiyu LIANG Li ZHAO
Speech-based deception detection using deep learning is one of the technologies for realizing a future deception detection system with a high recognition rate. Multi-network feature extraction can effectively improve the recognition performance of the system, but limited labeled data and the lack of effective feature fusion methods constrain network performance. On this basis, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. First, the static features of large amounts of unlabeled speech data are fed into the DAE for unsupervised training. Second, the frame-level features and static features of a small amount of labeled speech data are simultaneously fed into the LSTM network and the encoder output of the DAE for joint supervised training. Finally, a feature fusion algorithm based on an attention mechanism is proposed, which learns the optimal feature set during training. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods, and the model can achieve advanced performance with only a small amount of labeled data.
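A sketch of the attention-based fusion step as we read it: a small scoring network softmax-weights the two feature streams and sums them (the dimensions and the scorer are assumptions):

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse two feature streams (e.g., LSTM frame-level features and
    DAE-encoded static features) by learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feat_lstm, feat_dae):               # each: (B, dim)
        stacked = torch.stack([feat_lstm, feat_dae], dim=1)   # (B, 2, dim)
        w = torch.softmax(self.score(stacked), dim=1)         # (B, 2, 1)
        return (w * stacked).sum(dim=1)                       # weighted sum
```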
Wentao LYU Qiqi LIN Lipeng GUO Chengqun WANG Zhenyi YANG Weiqiang XU
In this paper, we present a novel method for vehicle detection based on the Faster R-CNN framework, integrating MobileNet into the Faster R-CNN structure. First, MobileNet is used as the base network to generate the feature map. To retain more information about vehicle objects, a fusion strategy is applied to multi-layer features to generate a fused feature map, which is then shared by the region proposal network (RPN) and Fast R-CNN. In the RPN, we employ a novel dimension clustering method to predict anchor sizes instead of choosing anchor properties manually. Our detection method improves detection accuracy and saves computational resources. The results show that the proposed method achieves a mean average precision (mAP) of 85.21% on the DIOR dataset and 91.16% on the UA-DETRAC dataset, improvements of 1.32% and 1.49%, respectively, over Faster R-CNN (ResNet152). Moreover, since the base network requires fewer operations and parameters, our method occupies only 42.52 MB of storage, far less than the 214.89 MB of Faster R-CNN (ResNet50).
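Dimension clustering is typically YOLO-style k-means under a 1 - IoU distance; a sketch follows (the paper's exact variant may differ):

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and k cluster boxes, corner-aligned."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    return inter / (box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter)

def cluster_anchors(boxes, k=9, iters=100, seed=0):
    """boxes: (N, 2) array of ground-truth (w, h) pairs. Returns k anchors."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to its highest-IoU cluster
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        new = np.array([np.median(boxes[assign == j], axis=0)
                        if np.any(assign == j) else clusters[j]
                        for j in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```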
Junxing ZHANG Shuo YANG Chunjuan BO Huimin LU
Vehicle logo detection technology is one of the research directions in intelligent transportation systems and an important extension of detection based on license plates and vehicle types. A vehicle logo is characterized by uniqueness, conspicuousness, and diversity, so thorough research is important in both theory and application. Although there are related works on object detection, most cannot achieve real-time detection across different scenes, while some single-stage real-time methods perform poorly on small objects. To address the scarcity of training samples, our work improves detection by constructing a vehicle logo dataset (VLD-45-S), multi-stage pre-training, multi-scale prediction, feature fusion between deep and shallow layers, dimension clustering of bounding boxes, and multi-scale detection training. While maintaining speed, this article improves the detection precision of vehicle logos. The generalization of the detection model and its anti-interference capability in real scenes are strengthened through data enrichment. Experimental results show that both the accuracy and speed of the detection algorithm are improved for small objects.
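A minimal sketch of fusing a deeper (semantic) map with a shallower (detailed) map for small-logo detection (the channel sizes and the 1x1 mixing convolution are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepShallowFusion(nn.Module):
    """Upsample the deep map, concatenate with the shallow map, and mix."""
    def __init__(self, c_deep, c_shallow, c_out):
        super().__init__()
        self.mix = nn.Conv2d(c_deep + c_shallow, c_out, kernel_size=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(deep, size=shallow.shape[2:], mode="nearest")
        return self.mix(torch.cat([deep, shallow], dim=1))
```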
In this paper, we propose a deep visual recognition model based on a hybrid KPCA network (H-KPCANet), which combines a one-stage KPCANet and a two-stage KPCANet. The proposed model consists of four types of basic components: the input layer, the one-stage KPCANet, the two-stage KPCANet, and the fusion layer. The one-stage KPCANet computes KPCA filters for its convolution layer, while the two-stage KPCANet learns PCA filters in the first stage and KPCA filters in the second stage. After binary quantization mapping and block-wise histogramming, the features from the two types of KPCANets are fused in the fusion layer, and the final feature of the input image is obtained by a weighted serial combination of the two feature types. The proposed algorithm is tested on digit recognition and object classification, and experimental results on the MNIST and CIFAR-10 visual recognition benchmarks validate the performance of H-KPCANet.
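For concreteness, the PCA-filter stage common to PCANet-style models can be sketched as follows; the KPCA stages would replace the eigendecomposition with kernel PCA (the patch size and filter count are illustrative):

```python
import numpy as np

def pca_filters(images, patch=7, n_filters=8):
    """Learn PCANet-style convolution filters: collect mean-removed patches
    and take the leading eigenvectors of their covariance as filters.
    `images` is an iterable of (H, W) grayscale arrays."""
    patches = []
    for img in images:
        H, W = img.shape
        for i in range(H - patch + 1):
            for j in range(W - patch + 1):
                p = img[i:i + patch, j:j + patch].ravel()
                patches.append(p - p.mean())      # patch-mean removal
    X = np.stack(patches)                          # (num_patches, patch*patch)
    cov = X.T @ X / len(X)
    eigval, eigvec = np.linalg.eigh(cov)           # ascending eigenvalues
    top = eigvec[:, ::-1][:, :n_filters]           # leading principal directions
    return top.T.reshape(n_filters, patch, patch)  # filters as 2D kernels
```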
Cheng XU Wei HAN Dongzhen WANG Daqing HUANG
In this paper, we propose a salient region detection method with multi-feature fusion and an edge constraint. First, an image feature extraction and fusion network based on a dense connection structure and multi-channel convolution is designed. Then, a multi-scale atrous convolution block is applied to enlarge the receptive field. Finally, to increase accuracy, a combined loss function comprising a classification loss and an edge loss is built for multi-task training. Experimental results verify the effectiveness of the proposed method.
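One way to build such a combined loss is sketched below, assuming binary saliency masks and a Laplacian edge extractor of our choosing:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, mask, edge_weight=1.0):
    """pred: (B, 1, H, W) logits; mask: (B, 1, H, W) binary ground truth.
    Pixel-wise classification loss plus an edge loss comparing boundary
    maps extracted from prediction and ground truth."""
    cls_loss = F.binary_cross_entropy_with_logits(pred, mask)
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                       device=pred.device).view(1, 1, 3, 3)
    edge_pred = F.conv2d(torch.sigmoid(pred), lap, padding=1)
    edge_gt = F.conv2d(mask, lap, padding=1)
    edge_loss = F.l1_loss(edge_pred, edge_gt)
    return cls_loss + edge_weight * edge_loss
```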
Zheng FANG Tieyong CAO Jibin YANG Meng SUN
Salient region detection is a fundamental problem in computer vision and image processing. Deep learning models perform better than traditional approaches but suffer from large parameter counts and slow speed. To handle these problems, in this paper we propose the multi-feature fusion network (MFFN), an efficient salient region detection architecture based on convolutional neural networks (CNNs). A novel feature extraction structure is designed to obtain feature maps from the CNN, and a fusion dense block fuses all low-level and high-level feature maps to derive the salient region results. MFFN is an end-to-end architecture that requires no post-processing. Experiments on benchmark datasets demonstrate that MFFN achieves state-of-the-art salient region detection performance while requiring far fewer parameters and much less computation time. Ablation experiments demonstrate the effectiveness of each module in MFFN.
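A sketch of a fusion dense block under stated assumptions (the growth rate, depth, and bilinear resizing are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDenseBlock(nn.Module):
    """Resize all low- and high-level maps to a common resolution, then let
    each conv see the concatenation of every earlier output (dense links).
    `c_in` is the sum of the input maps' channel counts."""
    def __init__(self, c_in, growth=32, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in + i * growth, growth, 3, padding=1)
            for i in range(layers))
        self.out = nn.Conv2d(c_in + layers * growth, 1, 1)  # saliency map

    def forward(self, feats):                # list of (B, C_i, H_i, W_i) maps
        size = feats[0].shape[2:]
        x = torch.cat([F.interpolate(f, size=size, mode="bilinear",
                                     align_corners=False) for f in feats], dim=1)
        for conv in self.convs:
            x = torch.cat([x, F.relu(conv(x))], dim=1)  # dense connectivity
        return self.out(x)
```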
Jingjie YAN Guanming LU Xiaodong BAI Haibo LI Ning SUN Ruiyu LIANG
In this letter, we propose a supervised bimodal emotion recognition approach based on two important human emotion modalities: facial expression and body gesture. An effective supervised feature fusion algorithm named supervised multiset canonical correlation analysis (SMCCA) is presented to establish linear connections among three sets of matrices, namely the feature matrices of the two modalities and their shared category matrix. Test results on bimodal emotion recognition with the FABO database show that SMCCA achieves better or comparable performance relative to unsupervised feature fusion algorithms including canonical correlation analysis (CCA), sparse canonical correlation analysis (SCCA), and multiset canonical correlation analysis (MCCA).
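For reference, a standard MCCA-style objective over the three sets (facial-expression features, body-gesture features, and the category encoding) can be written as follows; SMCCA's supervision enters through the third set, and the authors' exact constraints may differ:

\[
\max_{w_1, w_2, w_3} \; \sum_{i<j} w_i^{\top} C_{ij}\, w_j
\quad \text{s.t.} \quad w_i^{\top} C_{ii}\, w_i = 1, \; i = 1, 2, 3,
\]

where \(C_{ij}\) denotes the cross-covariance matrix between set \(i\) and set \(j\), and \(w_i\) is the projection vector for set \(i\).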
Automatically recognizing pain and estimating pain intensity is an emerging research area with promising applications in medicine and healthcare. The task plays a crucial role in diagnosing and treating patients who have limited ability to communicate verbally, and it remains a challenge in pattern recognition. Recently, deep learning has achieved impressive results in many domains. However, deep architectures require a significant amount of labeled training data, and they may fail to outperform conventional handcrafted features when data are insufficient, a problem that pain detection also faces. Furthermore, recent studies show that handcrafted features can provide information complementary to deep-learned features, so combining them may improve performance. Motivated by these considerations, in this paper we propose a method based on the combination of deep spatiotemporal and handcrafted features for pain intensity estimation. We use C3D, a deep 3-dimensional convolutional network that takes a continuous sequence of video frames as input, to extract spatiotemporal facial features; C3D models the appearance and motion of videos simultaneously. For handcrafted features, we extract geometric information by computing the distance between the normalized facial landmarks of each frame and those of the mean face shape, and we extract appearance information using histogram of oriented gradients (HOG) features around the normalized facial landmarks of each frame. Two levels of SVRs are trained on the spatiotemporal, geometric, and appearance features to obtain the estimation results. We tested the proposed method on the UNBC-McMaster shoulder pain expression archive database and obtained experimental results that outperform the current state of the art.
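The landmark-centred HOG step can be sketched directly with scikit-image (the window size and HOG settings are illustrative assumptions):

```python
import numpy as np
from skimage.feature import hog

def landmark_hog(frame, landmarks, win=32):
    """frame: (H, W) grayscale image; landmarks: (N, 2) array of (x, y).
    Compute HOG in a window around each landmark and concatenate."""
    h = win // 2
    padded = np.pad(frame, h, mode="edge")     # avoid clipping at borders
    descs = []
    for (x, y) in landmarks.astype(int):
        patch = padded[y:y + win, x:x + win]   # padding offsets the centre
        descs.append(hog(patch, orientations=8, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)))
    return np.concatenate(descs)               # per-frame appearance vector
```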
Weicheng XIE Junxu WEI Zhichao CHEN Tianqian LI
The particle filter is an important algorithm in the field of target tracking. However, it suffers from sample impoverishment caused by the introduction of re-sampling, and it is easily affected by illumination variation; both problems seriously degrade tracking performance. To solve this, we introduce a particle filter target tracking algorithm based on a dynamic niche genetic algorithm. Applying the dynamic niche genetic algorithm to re-sampling preserves particle diversity, and the algorithm dynamically fuses the color and profile features of the target to increase accuracy under illumination variation. According to the test results, the proposed algorithm tracks the target accurately, significantly increases the number of effective particles, enhances particle diversity, and exhibits better robustness and accuracy.
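For context, the standard systematic re-sampling step that the dynamic niche genetic algorithm replaces is sketched below; its tendency to duplicate high-weight particles is the source of the impoverishment the paper targets.

```python
import numpy as np

def systematic_resampling(particles, weights, rng=None):
    """particles: (N, d) states; weights: (N,) non-negative importance
    weights. Returns a resampled particle set of the same size."""
    rng = rng or np.random.default_rng()
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n     # one shared offset
    cumulative = np.cumsum(weights / weights.sum())
    indices = np.searchsorted(cumulative, positions)
    indices = np.minimum(indices, n - 1)              # guard float round-off
    return particles[indices]
```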