Hiroshi YAMAMOTO Keigo SHIMATANI Keigo UCHIYAMA
In order to support a person's various activities (e.g., primary industry, elderly care), IoT (Internet of Things) systems built on sensing technologies that observe the status of various people, things, and places in the real world are attracting attention. Existing studies have employed camera devices as the sensing technology for observing the condition of the real world, but cameras suffer from a limited observation period and the invasion of privacy. Therefore, new IoT systems utilizing three-dimensional LiDAR (Light Detection And Ranging) have been proposed, because 3D LiDAR can obtain three-dimensional spatial information about the real world while avoiding these problems. However, several problems remain when deploying 3D LiDAR in real fields. The 3D LiDAR requires considerable electric power to observe three-dimensional spatial information. In addition, annotating the large volume of point-cloud data needed to construct a machine-learning model requires significant time and effort. Therefore, in this study, we propose an IoT system that utilizes 3D LiDAR for observing the status of targets and equips it with new technologies to make the use of 3D LiDAR more practical. First, a linkage function is designed to achieve power saving for the entire system. The linkage function operates only a low-power sensing technology during normal operation and activates the power-hungry 3D LiDAR only when an observation target is estimated to be approaching. Second, a self-learning function is built that analyzes data collected not only by the 3D LiDAR but also by a camera, automatically generating a large amount of training data whose correct labels are estimated from the camera images. Through experimental evaluations using a prototype system, it is confirmed that the sensing technologies can be correctly interconnected to reduce total power consumption. In addition, the machine-learning model constructed by the self-learning function can accurately estimate the status of the targets.
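As an illustration of the linkage function described above, the following is a minimal Python sketch in which a low-power sensor is polled continuously and the power-hungry 3D LiDAR is switched on only when a target is estimated to be approaching. The driver objects, their method names, and the threshold are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the power-saving linkage function, assuming hypothetical
# driver objects for a low-power sensor and a 3D LiDAR.
import time

class LinkageController:
    def __init__(self, low_power_sensor, lidar, approach_threshold=0.8):
        self.sensor = low_power_sensor        # e.g., PIR or sonar, always on
        self.lidar = lidar                    # high-power 3D LiDAR, normally off
        self.approach_threshold = approach_threshold

    def run_once(self):
        score = self.sensor.read_approach_score()    # cheap observation
        if score >= self.approach_threshold:         # target likely approaching
            self.lidar.power_on()
            frames = self.lidar.capture(seconds=10)  # detailed 3D observation
            self.lidar.power_off()                   # return to low-power state
            return frames
        return None

    def run(self, poll_interval_s=1.0):
        while True:
            self.run_once()
            time.sleep(poll_interval_s)
```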
Jia-ji JIANG Hai-bin WAN Hong-min SUN Tuan-fa QIN Zheng-qiang WANG
In this paper, the Towards High Performance Voxel-based 3D Object Detection (Voxel-RCNN) three-dimensional (3D) point cloud object detection model is used as the benchmark network. Aiming at problems in current mainstream voxel-based 3D point cloud detection methods, such as the limited feature extraction ability of the 3D backbone and the lack of feature expression ability under the bird's-eye view (BEV), a high-performance voxel-based 3D object detection network (Reinforced Voxel-RCNN) is proposed. Firstly, a 3D feature extraction module that integrates an inverted residual convolutional network with weight normalization is designed for the 3D backbone. This module not only retains more point cloud feature information and enhances the information interaction between convolutional layers, but also improves the feature extraction ability of the backbone network. Secondly, a spatial feature-semantic fusion module based on spatial and channel attention is proposed from the BEV perspective. The combined use of channel features and semantic features further improves the network's ability to express point cloud features. In experiments on the public KITTI dataset, the proposed method outperforms many voxel-based methods. Compared with the baseline network, the 3D average accuracy and BEV average accuracy are improved for the three categories of Car, Cyclist, and Pedestrian. In 3D average accuracy, the improvement is 0.23% for Car, 0.78% for Cyclist, and 2.08% for Pedestrian; in BEV average accuracy, the improvements are 0.32%, 0.99%, and 2.38%, respectively. The findings demonstrate that the proposed enhancements effectively improve detection accuracy for the target categories.
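As a rough illustration of the first module, the following PyTorch sketch shows an inverted-residual 3D convolutional block wrapped with weight normalization. It uses dense 3D convolutions as a stand-in; the actual backbone, layer sizes, and sparse-convolution details of Reinforced Voxel-RCNN are not reproduced here.

```python
# Minimal PyTorch sketch of an inverted-residual 3D block with weight
# normalization (dense-convolution stand-in, illustrative only).
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class InvertedResidual3D(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            weight_norm(nn.Conv3d(channels, hidden, kernel_size=1)),  # expand
            nn.ReLU(inplace=True),
            weight_norm(nn.Conv3d(hidden, hidden, kernel_size=3, padding=1,
                                  groups=hidden)),                    # depthwise 3x3x3
            nn.ReLU(inplace=True),
            weight_norm(nn.Conv3d(hidden, channels, kernel_size=1)),  # project
        )

    def forward(self, x):
        return x + self.block(x)   # residual connection preserves voxel features

x = torch.randn(1, 16, 20, 50, 50)       # (batch, channels, D, H, W) voxel grid
y = InvertedResidual3D(16)(x)
```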
2D and 3D semantic segmentation play important roles in robotic scene understanding. However, current 3D semantic segmentation heavily relies on 3D point clouds, which are susceptible to factors such as point cloud noise, sparsity, estimation and reconstruction errors, and data imbalance. In this paper, a novel approach is proposed to enhance 3D semantic segmentation by incorporating 2D semantic segmentation from RGB-D sequences. Firstly, the RGB-D pairs are consistently segmented into 2D semantic maps using the tracking pipeline of Simultaneous Localization and Mapping (SLAM). This process effectively propagates object labels from full scans to corresponding labels in partial views with high probability. Subsequently, a novel Semantic Projection (SP) block is introduced, which integrates features extracted from localized 2D fragments across different camera viewpoints into their corresponding 3D semantic features. Lastly, the 3D semantic segmentation network utilizes a combination of 2D-3D fusion features to facilitate a merged semantic segmentation process for both 2D and 3D. Extensive experiments conducted on public datasets demonstrate the effective performance of the proposed 2D-assisted 3D semantic segmentation method.
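The semantic projection idea relies on projecting 3D points into the 2D views where consistent semantic maps are available. The following NumPy sketch shows this projection-and-gather step for one camera view; the function name, the pinhole camera model, and nearest-pixel sampling are simplifying assumptions and do not reproduce the paper's SP block.

```python
# Minimal NumPy sketch: project 3D points into a 2D semantic feature map and
# gather per-point features from the projected pixels (illustrative only).
import numpy as np

def project_and_gather(points_xyz, feat_2d, K, T_world_to_cam):
    """points_xyz: (N,3); feat_2d: (H,W,C); K: (3,3) intrinsics; T: (4,4) pose."""
    N = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((N, 1))])            # homogeneous coords
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]                   # camera frame
    valid = cam[:, 2] > 1e-6                                    # in front of camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                 # perspective divide
    H, W, C = feat_2d.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    gathered = np.zeros((N, C), dtype=feat_2d.dtype)
    gathered[valid] = feat_2d[v[valid], u[valid]]               # per-point 2D features
    return gathered
```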
Sota MORIYAMA Koichi ICHIGE Yuichi HORI Masayuki TACHI
In this paper, we propose a method for video reflection removal using a video restoration framework with enhanced deformable networks (EDVR). We examine the effect of each module in EDVR on video reflection removal and modify the models using 3D convolutions. The performance of each modified model is evaluated in terms of the RMSE between the structural similarity (SSIM) and the smoothed SSIM, which represents temporal consistency.
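A minimal sketch of the evaluation metric mentioned above, assuming a per-frame SSIM sequence has already been computed: the sequence is temporally smoothed with a moving average (the window size is an assumption), and the RMSE between the raw and smoothed sequences is reported.

```python
# Minimal sketch: RMSE between a per-frame SSIM sequence and its temporally
# smoothed version, as a temporal-consistency score (illustrative only).
import numpy as np

def temporal_consistency_rmse(ssim_per_frame, window=5):
    ssim = np.asarray(ssim_per_frame, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(ssim, kernel, mode="same")   # smoothed SSIM
    return float(np.sqrt(np.mean((ssim - smoothed) ** 2)))
```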
Hidenori YUKAWA Yu USHIJIMA Toru TAKAHASHI Toru FUKASAWA Yoshio INASAWA Naofumi YONEDA Moriyasu MIYAZAKI
A T-junction orthomode transducer (OMT) is a waveguide component that separates two orthogonal linear polarizations in the same frequency band. It has a common circular waveguide short-circuited at one end and two branch rectangular waveguides arranged in opposite directions near the short circuit. One of the advantages of a T-junction OMT is its short axial length. However, the two rectangular ports, which need to be orthogonal, have different levels of performance because of asymmetry. We therefore propose a uniaxially symmetrical T-junction OMT, in which the two branch waveguides are tilted 45° with respect to the short circuit. The uniaxially symmetrical configuration provides the same level of performance for the two ports, and its impedance matching is easier than that of the conventional configuration. The polarization separation principle can be explained using the principles of the orthomode junction (OMJ) and the turnstile OMT. Based on calculations, the proposed configuration demonstrated a return loss of 25 dB, an XPD of 30 dB, an isolation of 21 dB between the two branch ports, and a loss of 0.25 dB, with a bandwidth of 15% in the K band. The OMT was then fabricated as a single piece via 3D printing and evaluated against the calculated performance indices.
Karin WAKATSUKI Chiemi FUJIKAWA Makoto OMODANI
Herein, we propose a volumetric 3D display in which cross-sectional images are projected onto a rotating helix screen. This method enables image observation from all directions. A major challenge associated with this method is the presence of invisible regions that occur depending on the observation angle. In this study, we fabricated a mirror-image helix screen with two helical surfaces coaxially arranged in a plane-symmetrical configuration. The visible region was measured and found to be larger than that of the conventional helix screen. We confirmed that the improved visible region was almost independent of the observation angle and that the visible region was almost equally wide on both the left and right sides of the rotation axis.
Duc NGUYEN Tran THUY HIEN Huyen T. T. TRAN Truong THU HUONG Pham NGOC NAM
Distance-aware quality adaptation is a potential approach to reducing the resource requirements for the transmission and rendering of textured 3D meshes. In this paper, we carry out a subjective experiment to investigate the effects of the distance from the camera on the perceptual quality of textured 3D meshes. In addition, we evaluate the effectiveness of eight image-based objective quality metrics in representing the user's perceptual quality. Our study found that the perceptual quality in terms of mean opinion score increases as the distance from the camera increases. It is also shown that normalized mutual information (NMI), a full-reference objective quality metric, is highly correlated with the subjective scores.
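For reference, NMI between a reference rendering and a distorted rendering can be computed from a joint gray-level histogram, as in the sketch below; the bin count and the particular normalization (twice the mutual information divided by the sum of marginal entropies) are illustrative assumptions and may differ from the metric implementation used in the study.

```python
# Minimal sketch of normalized mutual information between two gray images
# (illustrative definition, not the exact implementation used in the study).
import numpy as np

def nmi(ref, dist, bins=64):
    hist, _, _ = np.histogram2d(ref.ravel(), dist.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))    # marginal entropies
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    nz = pxy > 0
    hxy = -np.sum(pxy[nz] * np.log(pxy[nz]))         # joint entropy
    mi = hx + hy - hxy                               # mutual information
    return 2.0 * mi / (hx + hy)
```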
Keiichiro TAKADA Yasuaki TOKUMO Tomohiro IKAI Takeshi CHUJOH
Video-based point cloud compression (V-PCC) utilizes video compression technology to efficiently encode dense point clouds, providing state-of-the-art compression performance with a relatively small computational burden. V-PCC converts 3-dimensional point cloud data into three types of 2-dimensional frames, i.e., occupancy, geometry, and attribute frames, and encodes them via video compression. However, the quality of these frames may be degraded by video compression. This paper proposes an adaptive neural-network-based post-processing filter on attribute frames to alleviate this degradation. Furthermore, a novel training method using occupancy frames is studied. The experimental results show average BD-rate gains of 3.0%, 29.3%, and 22.2% for Y, U, and V, respectively.
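One plausible way to use occupancy frames during training, sketched below in PyTorch, is to restrict the reconstruction loss of the post-processing filter to occupied pixels, since unoccupied pixels carry no point-cloud attributes. This is an assumption for illustration, not the paper's exact training loss.

```python
# Minimal PyTorch sketch of an occupancy-masked reconstruction loss
# (illustrative reading of "training using occupancy frames").
import torch

def occupancy_masked_loss(filtered_attr, original_attr, occupancy):
    """filtered/original_attr: (B,3,H,W); occupancy: (B,1,H,W) in {0,1}."""
    diff = (filtered_attr - original_attr) ** 2
    masked = diff * occupancy                      # ignore unoccupied pixels
    return masked.sum() / occupancy.sum().clamp(min=1.0)
```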
Kota SHIBA Atsutake KOSUGE Mototsugu HAMADA Tadahiro KURODA
This paper describes an in-depth analysis of crosstalk in a high-bandwidth 3D-stacked memory using a multi-hop inductive coupling interface and proposes two countermeasures. This work analyzes the crosstalk among seven stacked chips using a 3D electromagnetic (EM) simulator. The detailed analysis reveals two main crosstalk sources: concentric coils and adjacent coils. To suppress this crosstalk, this paper proposes two corresponding countermeasures: shorted coils and 8-shaped coils. The combination of these coils improves area efficiency by a factor of 4 in simulation. The proposed methods enable an area-efficient inductive coupling interface for high-bandwidth stacked memory.
He LI Yutaro IWAMOTO Xianhua HAN Lanfen LIN Akira FURUKAWA Shuzo KANASAKI Yen-Wei CHEN
Convolutional neural networks (CNNs) have become popular in medical image segmentation. The widely used deep CNNs are customized to extract multiple representative features from two-dimensional (2D) data and are generally called 2D networks. However, 2D networks are inefficient at extracting three-dimensional (3D) spatial features from volumetric images. Although most 2D segmentation networks can be extended to 3D networks, such naively extended 3D methods are resource-intensive. In this paper, we propose an efficient and accurate network for fully automatic 3D segmentation. Specifically, we designed a 3D multiple-contextual extractor to capture rich global contextual dependencies from different feature levels. We then leveraged an ROI-estimation strategy to crop the ROI bounding box. Meanwhile, we used a 3D ROI-attention module to improve the accuracy of in-region segmentation in the decoder path. Moreover, we used a hybrid Dice loss function to address the issues of class imbalance and blurry contours in medical images. By incorporating the above strategies, we realized practical end-to-end 3D medical image segmentation with high efficiency and accuracy. To validate the 3D segmentation performance of our proposed method, we conducted extensive experiments on two datasets and demonstrated favorable results over state-of-the-art methods.
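A common form of hybrid Dice loss combines a soft Dice term with cross-entropy, as in the PyTorch sketch below; the equal weighting and exact formulation here are assumptions and may differ from the loss used in the paper.

```python
# Minimal PyTorch sketch of a hybrid Dice + cross-entropy loss for 3D
# segmentation (illustrative formulation and weighting).
import torch
import torch.nn.functional as F

def hybrid_dice_ce_loss(logits, target, eps=1e-6, dice_weight=0.5):
    """logits: (B,C,D,H,W); target: (B,D,H,W) integer class labels."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 4, 1, 2, 3).float()
    dims = (0, 2, 3, 4)
    inter = (probs * onehot).sum(dims)
    union = probs.sum(dims) + onehot.sum(dims)
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()   # soft Dice term
    return dice_weight * dice + (1 - dice_weight) * ce
```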
Fairuz SAFWAN MAHAD Masakazu IWAMURA Koichi KISE
3D reconstruction methods using neural networks are popular and have been studied extensively. However, the resulting models typically lack detail, reducing the quality of the 3D reconstruction, because the network is not designed to capture the fine details of the object. Therefore, in this paper, we propose two networks designed to capture both the coarse and fine details of the object to improve the reconstruction of its detailed parts. The first network uses a multi-scale architecture with skip connections to associate and merge features from other levels. The second network is a multi-branch deep generative network that separately learns local features, generic features, and intermediate features through three different tailored components. In both architectures, the principle is to let the network learn features at different levels so that it can reconstruct both the fine parts and the overall shape of the 3D model. We show that both of our methods outperform state-of-the-art approaches.
Makoto OMODANI Hiroyuki YAGUCHI Fusako KUSUNOKI
We have proposed and developed e-Tile for interior/exterior wall decoration and ornaments. A prototype of a 2 m × 2 m energy-saving reflective panel was realized by arraying 400 e-Tiles on a flat plane. Prototypes of cubic displays were also realized by assembling e-Tiles into cubic shapes. Artistic display effects and a 3D impression were observed in these cubic prototypes. We expect e-Tile to be a promising solution for extending the application field of e-Paper to decorative uses, including architectural applications.
Kotaro MATSUURA Chihiro TSUTAKE Keita TAKAHASHI Toshiaki FUJII
Inspired by the framework of algorithm unrolling, we propose a scalable network architecture that computes layer patterns for light field displays, enabling control of the trade-off between the display quality and the computational cost on a single pre-trained network.
Fairuz Safwan MAHAD Masakazu IWAMURA Koichi KISE
Neural network-based three-dimensional (3D) reconstruction methods have produced promising results. However, they do not pay particular attention to reconstructing detailed parts of objects. This occurs because the network is not designed to capture the fine details of objects. In this paper, we propose a network designed to capture both the coarse and fine details of objects to improve the reconstruction of the fine parts of objects.
Kenshiro TAMATA Tomohiro MASHITA
A typical approach to reconstructing a 3D environment model is to scan the environment with a depth sensor and fit the accumulated point cloud to 3D models. In this kind of scenario, a general 3D environment reconstruction application assumes temporally continuous scanning. However, in some practical uses, this assumption is unacceptable; thus, a point cloud matching method for stitching several non-continuous 3D scans is required. Point cloud matching often includes errors in feature point detection because a point cloud is basically a sparse sampling of the real environment, and it may include quantization errors that cannot be ignored. Moreover, depth sensors tend to have errors due to the reflective properties of the observed surface. We therefore assume that feature point pairs between two point clouds will include errors. In this work, we propose a feature description method that is robust to the feature point registration errors described above. To achieve this goal, we designed a deep-learning-based feature description model that consists of a local feature description around the feature points and a global feature description of the entire point cloud. To obtain a feature description robust to feature point registration error, we input feature point pairs with errors and train the models with metric learning. Experimental results show that our feature description model can correctly estimate whether a feature point pair is close enough to be considered a match, even when the feature point registration errors are large, and that it achieves higher accuracy than methods such as FPFH or 3DMatch. In addition, we conducted experiments with different combinations of input point clouds (local only, global only, and both) and encoders.
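A minimal PyTorch sketch of the metric-learning step, assuming a contrastive pair loss: descriptor pairs labeled as matches are pulled together and non-matching pairs are pushed apart by at least a margin. The loss form and margin are assumptions; the paper's encoder and training objective may differ.

```python
# Minimal PyTorch sketch of contrastive metric learning on descriptor pairs
# (illustrative loss, not the authors' exact objective).
import torch
import torch.nn.functional as F

def contrastive_pair_loss(desc_a, desc_b, is_match, margin=1.0):
    """desc_a, desc_b: (B,D) descriptors; is_match: (B,) float in {0,1}."""
    d = F.pairwise_distance(desc_a, desc_b)
    pos = is_match * d.pow(2)                             # pull matches together
    neg = (1 - is_match) * F.relu(margin - d).pow(2)      # push non-matches apart
    return (pos + neg).mean()
```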
Zifen HE Shouye ZHU Ying HUANG Yinhui ZHANG
This paper presents a novel method for weakly supervised semantic segmentation of 3D point clouds using a novel graph and edge convolutional neural network (GECNN), targeting settings where only 1% or 10% of the points are labeled. Our general framework facilitates semantic segmentation by encoding both global- and local-scale features via a parallel graph and edge aggregation scheme. More specifically, global-scale graph structure cues of point clouds are captured by a graph convolutional neural network, which propagates pairwise affinity representations over the whole graph established in a d-dimensional feature embedding space. We integrate local-scale features derived from a dynamic edge feature aggregation convolutional neural network, which allows us to fuse both global and local cues of 3D point clouds. The proposed GECNN model is trained using a comprehensive objective that consists of incomplete, inexact, self-supervision, and smoothness constraints based on partially labeled points. The proposed approach enforces global and local consistency constraints directly through the objective losses. It inherently handles the challenges of segmenting sparse 3D point clouds with limited annotations in a large-scale point cloud space. Our experiments on the ShapeNet and S3DIS benchmarks demonstrate the effectiveness of the proposed approach for efficient (within 20 epochs) learning of large-scale point cloud semantics despite very limited labels.
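The following PyTorch sketch illustrates one way such a partially supervised objective can be assembled: a cross-entropy term over the labeled subset of points plus a smoothness term over graph edges. The edge construction, the smoothness form, and the weighting are illustrative assumptions, not the exact GECNN objective.

```python
# Minimal PyTorch sketch of a weakly supervised objective: supervision on
# labeled points plus smoothness over graph edges (illustrative only).
import torch
import torch.nn.functional as F

def weak_supervision_loss(logits, labels, label_mask, edges, smooth_weight=0.1):
    """logits: (N,C); labels: (N,); label_mask: (N,) bool; edges: (E,2) indices."""
    sup = F.cross_entropy(logits[label_mask], labels[label_mask])   # labeled points only
    probs = torch.softmax(logits, dim=1)
    smooth = (probs[edges[:, 0]] - probs[edges[:, 1]]).pow(2).sum(1).mean()
    return sup + smooth_weight * smooth
```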
Anis Ur REHMAN Ken KIHARA Sakuichi OHTSUKA
In daily reality, people often pay attention to several objects that change positions while being observed. In the laboratory, this process is investigated by a phenomenon known as multiple object tracking (MOT) which is a task that evaluates attentive tracking performance. Recent findings suggest that the attentional set for multiple moving objects whose depth changes in three dimensions from one plane to another is influenced by the initial configuration of the objects. When tracking objects, it is difficult for people to expand their attentional set to multiple-depth planes once attention has been focused on a single plane. However, less is known about people contracting their attentional set from multiple-depth planes to a single-depth plane. In two experiments, we examined tracking accuracy when four targets or four distractors, which were initially distributed on two planes, come together on one of the planes during an MOT task. The results from this study suggest that people have difficulty changing the depth range of their attention during attentive tracking, and attentive tracking performance depends on the initial attentional set based on the configuration prior to attentive tracking.
The performance of video action recognition has improved significantly in recent decades. Current recognition approaches mainly utilize convolutional neural networks to acquire video feature representations. In addition to the spatial information of video frames, temporal information such as motions and changes is important for recognizing videos. Therefore, the use of convolutions in a spatiotemporal three-dimensional (3D) space for representing spatiotemporal features has garnered significant attention. Herein, we introduce recent advances in 3D convolutions for video action recognition.
Hyun-Ho KIM Sung-Gyun LIM Gwangsoon LEE Jun Young JEONG Jae-Gon KIM
The emerging three degrees of freedom plus (3DoF+) video provides a more interactive and deeply immersive visual experience. 3DoF+ video adds motion parallax to 360-degree video, providing an omnidirectional view with limited changes of the viewing position. A large set of views is required to support such a 3DoF+ visual experience, so it is essential to compress the tremendous amount of 3DoF+ video data. Recently, MPEG has been developing a standard for efficient coding of 3DoF+ video consisting of multiple views, along with its test model, called the Test Model for Immersive Video (TMIV). In the TMIV, the redundancy between the input source views is removed as much as possible by selecting one or several basic views and predicting the remaining views from them. Each unpredicted region is cropped to a bounding box called a patch, and a large number of patches are then packed into atlases together with the selected basic views. As a result, multiple source views are converted into one or more atlas sequences to be compressed. In this letter, we present an improved clustering method using patch merging for atlas construction in the TMIV. The proposed method achieves significant BD-rate reduction in terms of various end-to-end evaluation metrics in the experiments, and it was adopted in TMIV 6.0.
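As a simplified illustration of patch merging, the sketch below greedily merges overlapping 2D bounding boxes of unpredicted regions so that fewer, larger patches are packed into the atlas; the actual clustering method adopted in TMIV 6.0 is more elaborate.

```python
# Minimal sketch: greedy merging of overlapping patch bounding boxes before
# atlas packing (illustrative simplification of the clustering idea).
def merge_patches(boxes):
    """boxes: list of (x0, y0, x1, y1); returns a list of merged boxes."""
    merged = []
    for box in boxes:
        x0, y0, x1, y1 = box
        for i, (mx0, my0, mx1, my1) in enumerate(merged):
            if x0 <= mx1 and mx0 <= x1 and y0 <= my1 and my0 <= y1:   # overlap test
                merged[i] = (min(x0, mx0), min(y0, my0),
                             max(x1, mx1), max(y1, my1))              # enlarge patch
                break
        else:
            merged.append(box)
    return merged
```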
Munekazu DATE Shinya SHIMIZU Hideaki KIMATA Dan MIKAMI Yoshinori KUSACHI
3D video content depends on the shooting conditions, that is, the camera positions. Depth range control in the post-processing stage is not easy, but it is essential because video from arbitrary camera positions must be generated. If light field information can be obtained, video from any viewpoint can be generated exactly, so post-processing becomes possible. However, a light field contains a huge amount of data, and capturing it is not easy. To reduce the data quantity, we proposed the visually equivalent light field (VELF), which exploits the characteristics of human vision. Although a number of cameras are needed, VELF can be captured by a camera array. Since camera interpolation is performed by linear blending, the calculation is simple enough that the ray distribution field of VELF can be constructed by optical interpolation in the VELF3D display. The display produces high image quality owing to its high pixel usage efficiency. In this paper, we summarize the relationship between the characteristics of human vision, VELF, and the VELF3D display. We then propose a method to control the depth range of the observed image on the VELF3D display and discuss the effectiveness and limitations of displaying the processed image on it. Our method can be applied to other 3D displays. Since the calculation is just weighted averaging, it is suitable for real-time applications.
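A minimal sketch of the linear blending underlying this approach: a view at a fractional position between two adjacent cameras is approximated as a weighted average of the two captured views, which is why the computation reduces to weighted averaging. The function below is illustrative only and omits the rest of the VELF pipeline.

```python
# Minimal NumPy sketch of linear-blending view interpolation between two
# adjacent camera views (illustrative; not the full VELF/VELF3D processing).
import numpy as np

def blend_views(view_left, view_right, t):
    """view_left/right: (H,W,3) images from adjacent cameras; t in [0,1]."""
    return (1.0 - t) * view_left + t * view_right
```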