1. Introduction
Compared with capturing high dynamic range (HDR) video at a higher bit depth with costly sensors [1]-[5], HDR video reconstruction is an economical algorithmic solution for generating high-quality HDR video. Specifically, several consecutive LDR frames captured with standard cameras are used to generate one HDR reference frame, and this process is repeated to reconstruct the entire HDR video. Recently, deep learning-based methods have made significant progress. Kalantari et al. [6] combine an optical flow estimation network and an encoder-decoder to generate vivid HDR video content. Chen et al. [7] propose a coarse-to-fine framework containing a deformable feature alignment-based fusion module to refine the final HDR reconstruction. However, such methods require large-scale HDR video training datasets, which are hard to prepare.
Similar to the datasets for other image processing tasks, two types of HDR training datasets exist: real-world datasets and synthetic datasets. In real-world datasets, the inputs are actually captured images and the ground truth is artificially synthesized, while synthetic datasets follow the opposite approach. To build a real-world dataset like [7], several consecutively captured frames containing one still subject are merged to produce one HDR ground truth, and the center frame is designated as the reference LDR frame. After that, frames with the same subject moving back and forth are captured as the neighboring LDR frames. However, real-world HDR video datasets like [7] only cover limited indoor scenes with single human motion. Meanwhile, unintended object movements in the background also degrade the quality of the dataset. Unlike image datasets [8], [9], producing HDR video datasets from real-world acquisition places higher demands on scene controllability.
To address these challenges and constraints, recent works on HDR video reconstruction use synthetic training datasets built from the same video sources as the NTIRE dataset [10] as a substitute. Although a huge amount of training data with various scenarios and illumination conditions can be obtained by converting existing HDR videos into alternately exposed LDR videos, such datasets still lack diversity compared to the datasets for high-level vision tasks [11], [12].
Nowadays, data augmentation is widely used to avoid over-fitting and to increase the generalization performance of the trained model. Mainstream image manipulation-based methods such as CutOut [13], CutMix [14], and MixUp [15] cause distortions and information loss, and are therefore not suitable for pixel-level processing. In this paper, a full-stage data augmentation method for HDR video reconstruction is proposed, as shown in Fig. 1. During training, a local area-based mixed data augmentation method (LMDA) is applied to synthetic datasets, enriching the samples with diverse exposure patterns and color modes. The motion and ill-exposure guided sample rank and adjustment strategy (MISRA) increases the impact of training samples lacking motion and ill-exposure information. The HDR-targeted test-time augmentation (HTTA) yields the final reconstruction results by removing undesirable over-exposed hallucinations. The proposed methods outperform conventional works with better metrics and more visually pleasing reconstructed HDR frames. This work extends our previous ICIP '23 work on LMDA [16]. The contributions of this work are summarized as follows:
- A local area-based mixed data augmentation method for HDR video reconstruction is proposed to achieve better data diversity by providing more challenging samples covering various exposure and color conditions.
- A motion and ill-exposure guided sample rank and adjustment strategy is proposed to select training samples that need to be compensated with extra augmentation.
- An HDR-targeted test-time augmentation method is proposed to enhance the reconstruction results, especially for over-exposed regions.
- Extensive experiments and ablation studies demonstrate the significance of the different components of the proposed method.
2. Related Work
2.1 HDR Video Reconstruction
Many researchers focus on solving the problem of ghost artifacts in HDR imaging for dynamic scenes. Compared to traditional methods [8], [17]-[24], recent deep learning-based approaches [25]-[30] show their superiority in deghosting and producing more faithful details in ill-exposed areas. However, these approaches are not designed for HDR video reconstruction, as they do not handle temporal consistency or the alternating exposure patterns of video inputs.
Some existing HDR video reconstruction methods produce HDR videos with specialized hardware such as per-pixel coded exposure [1], [2], scanline exposure/ISO [3], [31], and beam splitters [4], [32]. These methods are not widespread because the required hardware is costly. An alternative way to reconstruct HDR videos is to transform LDR videos captured with alternating exposures. Kang et al. [17] propose the first HDR video reconstruction algorithm, which merges aligned frames warped by optical flow. Mangiat and Gibson [33] propose a block-based motion estimation method coupled with a refinement stage. Kalantari et al. [34] propose a patch-based method to synthesize the missing content at each frame. However, these approaches still suffer from ghost artifacts, and their processing is computationally expensive.
Nowadays, deep learning-based methods lead to less processing time and better performance. Kalantari et al. [6] use a flow network to align neighboring frames by the estimated optical flow, and merge LDR frames into HDR frames with an encoder-decoder module. Chen et al. [7] improve upon this by adding a refinement network that includes a deformable feature alignment module. However, these supervised methods need a large amount of labeled data for training, which is time-consuming to prepare.
2.2 Training-Phase Data Augmentation and Test-Time Augmentation
Data augmentation has been effectively employed in the deep learning field. Although geometric transformations like flipping and rotation preserve labels after transformation [35], researchers have recently directed their efforts towards erasing and mixing approaches. DeVries et al. [13] randomly mask regions to force the model to learn more descriptive characteristics. Zhang et al. [15] mix different input images, leading to decision boundaries that transition linearly from class to class. Yun et al. [14] combine mixing and regional masking to achieve better robustness and uncertainty estimates for image classifiers. However, these approaches are specifically designed for high-level vision tasks and are not applicable to low-level image processing. Yoo et al. [36] and Ishii et al. [37] blend training inputs and ground truths to enable the model to handle complex scenes. However, such approaches are not suitable for HDR reconstruction, since the different bit depths of LDR and HDR hamper the learning of the reconstruction process.
For learning-based augmentation approaches, Tran et al. [38] propose an optimization method for GAN-based sample generation. Cubuk et al. [39] propose a search space containing existing augmentation operations and a related searching algorithm. However, such methods require considerable computational resources.
The current investigation of test-time augmentation can be classified into two main topics: the exploration of suitable test-time transformation sets and effective ensemble techniques. Kim et al. [40] propose a loss prediction module to enable the efficient selection of proper test-time transformations. Shanmugam et al. [41] propose a learning-based ensemble method for test-time augmented predictions. However, the above methods are designed for high-level vision tasks like image classification, where the ensemble entities are probability distributions and the cost of searching for a suitable test-time augmentation set is small.
3. Proposed Algorithms
3.1 Overview of Training Data Preparation
To construct the commonly used synthetic training datasets from open-source HDR videos, following [7], the conventional LDR input generation procedure \(\mathcal{G}\) is defined as:
\[\begin{align} & L_i\in\mathbb{R}^{W \times H \times C},\ H_i\in\mathbb{R}^{W \times H \times C},\ t_i\in\{T_1,\ldots ,T_n\} \tag{1} \\ & L_i = \mathcal{G} (H_i,t_i) = P\left(Clip\left(\left(H_i\times t_i\right)^{1/2.2}\right)\right) \tag{2} \end{align}\]
where \(H_i\) denotes the \(i\)-th ground truth HDR frame in an existing HDR sequence, \(L_i\) denotes the corresponding \(i\)-th LDR input frame, and \(t_i\) denotes the related exposure. \(n\) denotes the number of alternating exposures, which is generally set to 2 or 3. For convenience, \(T_n\) with a higher \(n\) denotes a higher exposure. \(\mathcal{G}\) denotes the generation process. After re-exposure, HDR frames are converted to the LDR domain using a gamma curve with \(\gamma= 2.2\) and a value clipping function denoted by \(Clip\). \(P\) denotes the post-processing including noise addition. After that, each LDRs-HDR pair, consisting of consecutive LDR frames and one reference HDR frame, is used as a training sample.
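For clarity, a minimal NumPy sketch of this generation procedure is given below. The alternating exposure values, the clipping range, and the additive Gaussian noise model used in the post-processing step are illustrative assumptions rather than the exact settings of [7].

```python
import numpy as np

def generate_ldr(hdr, exposure, gamma=2.2, noise_std=1e-3):
    """Synthesize an LDR frame from a linear HDR frame (sketch of Eqs. (1)-(2)).

    hdr:      HxWxC float array, linear-domain HDR radiance (non-negative)
    exposure: scalar exposure t_i chosen from the alternating exposure set
    """
    # Re-expose and apply the gamma curve (gamma = 2.2).
    ldr = (hdr * exposure) ** (1.0 / gamma)
    # Clip to the displayable LDR range.
    ldr = np.clip(ldr, 0.0, 1.0)
    # Post-processing P: modeled here as simple additive Gaussian noise
    # (the actual noise model in [7] may differ).
    ldr = np.clip(ldr + np.random.normal(0.0, noise_std, ldr.shape), 0.0, 1.0)
    return ldr

# Example: the 2-alternating-exposure setting (exposure values assumed).
exposures = [1.0, 4.0]
hdr_frames = [np.random.rand(256, 256, 3) for _ in range(3)]
ldr_inputs = [generate_ldr(h, exposures[i % 2]) for i, h in enumerate(hdr_frames)]
```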
3.2 Local Area-Based Mixed Data Augmentation (LMDA)
As shown in Fig. 2, the local area-based mixed data augmentation (LMDA) is the combination of two sub-policies: local exposure augmentation (LEA) and local RGB permutation (LRP). Each policy will be introduced as follows.
After the data preparation in Sect. 3.1, every area in one frame is exposed with the same exposure bias, and the features of regions with the same ill-exposure and information loss are therefore similar. Thus the trained CNN-based reconstruction models proposed in [6], [7] apply a similar reconstruction process to such areas, which over-smooths the reconstructed content and amplifies the ghost artifact problem. Given these conditions, a regularization is desired to force the model to learn how to reconstruct different regions respectively [36]. This work proposes the local exposure augmentation (LEA) method defined as follows: patches at the same position in the LDR inputs are randomly selected and replaced with differently re-exposed LDR patches, with only a small extra time cost.
\[\begin{align} & \mathbf{M}^e_i \in \{0,1\}^{W \times H},\ t^e_i \ne t_i. \tag{3} \\ & \hat{L}_i =\mathbf{M}^e_i \odot \mathcal{G} \left(H_i,t^e_i \right)+ (\mathbf{1}-\mathbf{M}^e_i) \odot \mathcal{G} (H_i,t_i). \tag{4} \end{align}\]
where \(\mathbf{M}^e_i\) denotes the binary mask generated for the random local area, and \(t^e_i\) denotes the applied exposure, which differs from the original one. \(\odot\) denotes the Hadamard product. \(\hat{L}_i\) denotes the new augmented LDR input. By employing LEA, the trained model implicitly learns not only “how” but also “where” to reconstruct, and provides a more fine-grained reconstruction for different regions. This operation also avoids disrupting the structural information of the original frames.
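A minimal sketch of LEA, building on the `generate_ldr` sketch in Sect. 3.1, is shown below; the uniform sampling of patch positions is an assumption for illustration, and the patch size and number follow Sect. 4.2.

```python
def local_exposure_augment(hdr, t_orig, t_alt, patch_w, n_patches, rng=np.random):
    """Local exposure augmentation (sketch of Eqs. (3)-(4)).

    Replaces randomly placed patches of the LDR input with patches
    re-exposed at a different exposure t_alt != t_orig. In practice the
    same mask positions are shared by all LDR frames of one sample.
    """
    h, w, _ = hdr.shape
    ldr_orig = generate_ldr(hdr, t_orig)   # G(H_i, t_i)
    ldr_alt = generate_ldr(hdr, t_alt)     # G(H_i, t_i^e)
    mask = np.zeros((h, w, 1), dtype=np.float32)   # binary mask M_i^e
    for _ in range(n_patches):
        y = rng.randint(0, h - patch_w + 1)
        x = rng.randint(0, w - patch_w + 1)
        mask[y:y + patch_w, x:x + patch_w] = 1.0
    # Hadamard mix of the two differently exposed versions.
    return mask * ldr_alt + (1.0 - mask) * ldr_orig
```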
Compared with patch mixing [14] and patch erasing [13], RGB permutation and blending (adding a constant value to pixels) disturb structure and texture minimally [36], but they are usually applied to the whole image, which limits the augmented color modes. Meanwhile, a robust color processing capability is desired for HDR-related algorithms, especially when under-exposure and saturation occur. Like LEA, a local RGB permutation (LRP) is therefore proposed and defined as follows: a random patch is selected as in LEA, and its color channels are shuffled randomly.
\[\begin{align} & \hat{L}_i =\mathbf{M}^c_i \odot \mathcal{C} (L_i)+ (\mathbf{1}-\mathbf{M}^c_i) \odot L_i. \tag{5} \\ & \hat{H}_i =\mathbf{M}^c_i \odot \mathcal{C} (H_i)+ (\mathbf{1}-\mathbf{M}^c_i) \odot H_i. \tag{6} \end{align}\]
where \(\mathbf{M}^c_i\) denotes the binary mask indicating the region with RGB permutation, and \(\mathcal{C}\) denotes the permutation function. \(\hat{H}_i\) denotes the related augmented HDR ground truth.
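A corresponding NumPy sketch of LRP is given below. Note that, unlike LEA, the same permutation and mask are also applied to the HDR ground truth so that the LDR-HDR correspondence is preserved; the patch size again follows Sect. 4.2.

```python
def local_rgb_permutation(ldr, hdr, patch_w, rng=np.random):
    """Local RGB permutation (sketch of Eqs. (5)-(6)).

    Shuffles the color channels inside one random patch of both the
    LDR input and the HDR ground truth with the same permutation C.
    """
    h, w, c = ldr.shape
    y = rng.randint(0, h - patch_w + 1)
    x = rng.randint(0, w - patch_w + 1)
    perm = rng.permutation(c)                      # random channel order
    ldr_aug, hdr_aug = ldr.copy(), hdr.copy()
    ldr_aug[y:y + patch_w, x:x + patch_w] = ldr[y:y + patch_w, x:x + patch_w][..., perm]
    hdr_aug[y:y + patch_w, x:x + patch_w] = hdr[y:y + patch_w, x:x + patch_w][..., perm]
    return ldr_aug, hdr_aug
```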
This work combines local exposure augmentation and local RGB permutation into LMDA. Given the detail disruption caused by patch mixing methods [14], the two sub-policies do not interfere with the features extracted from each other's augmented patches as long as their patches do not overlap, so they are applied simultaneously to achieve better robustness of the trained model. Note that, similar to [13]-[15], [36], LMDA is introduced only during the training phase and is not used during testing. While maintaining consistency in the alternating exposure formats (2 alternating exposures and 3 alternating exposures), the random window selection in LMDA enhances the robustness of the trained model, thereby enabling better performance when processing normal inputs. Examples are shown in Fig. 3.
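A minimal sketch of sampling non-overlapping LEA and LRP patch positions for one frame is shown below; the rejection sampling used to enforce the non-overlap condition is an assumed, simple realization and may differ from the exact implementation.

```python
def rects_overlap(a, b):
    """Axis-aligned overlap test for rectangles given as (y, x, h, w)."""
    ay, ax, ah, aw = a
    by, bx, bh, bw = b
    return not (ay + ah <= by or by + bh <= ay or ax + aw <= bx or bx + bw <= ax)

def lmda_patches(frame_h, frame_w, lea_w, n_lea, lrp_w, rng=np.random, max_tries=100):
    """Sample non-overlapping LEA and LRP patch positions (sketch).

    Returns (lrp_rect, [lea_rects]); each LEA patch is re-sampled until it
    does not intersect the single large LRP patch.
    """
    lrp = (rng.randint(0, frame_h - lrp_w + 1),
           rng.randint(0, frame_w - lrp_w + 1), lrp_w, lrp_w)
    lea_rects = []
    for _ in range(n_lea):
        for _ in range(max_tries):
            cand = (rng.randint(0, frame_h - lea_w + 1),
                    rng.randint(0, frame_w - lea_w + 1), lea_w, lea_w)
            if not rects_overlap(cand, lrp):
                lea_rects.append(cand)
                break
    return lrp, lea_rects
```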
3.3 Motion and Ill-Exposure Guided Sample Rank and Adjustment Strategy (MISRA)
For HDR reconstruction, the data diversity of the training dataset, especially the scene and camera motion bias, the exposure bias, and the semantic content bias, should be considered [10], [42]. Thus this work provides a simple evaluation of current synthetic training datasets [6], [7]. The average mean absolute error between adjacent original HDR frames in one sample is taken as the degree of contained motion information. Since the loss of information in over-exposed areas is a major cause of ghost artifacts, all HDR frames are also re-exposed with only the over-exposure, and the potential proportion of over-exposed regions is estimated with a threshold. As shown in Fig. 4, both sample distributions in the training datasets exhibit an extreme long-tail problem. This indicates that there are few training samples with both large motion and large over-exposed areas, which limits the model's capability.
Fig. 4 Dataset evaluations. The horizontal axis represents the evaluation subjects, while the vertical axis represents the number of samples.
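The simple per-sample evaluation described above could be sketched as follows, reusing the `generate_ldr` helper from Sect. 3.1; the saturation threshold and the choice of the highest exposure value are assumptions for illustration.

```python
def evaluate_sample(hdr_frames, t_high, sat_thresh=0.8):
    """Rough per-sample statistics behind Fig. 4 (sketch).

    Motion: average mean absolute error between adjacent HDR frames.
    Over-exposure: average fraction of pixels that saturate when every
    frame is re-exposed with only the highest exposure t_high.
    """
    motion = np.mean([np.abs(a - b).mean()
                      for a, b in zip(hdr_frames[:-1], hdr_frames[1:])])
    over_exposed = np.mean([(generate_ldr(h, t_high) > sat_thresh).mean()
                            for h in hdr_frames])
    return motion, over_exposed
```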
In order to enhance the impact of regular samples in the training phase, this paper proposes a motion and ill-exposure guided sample rank and adjustment strategy (MISRA) used in collaboration with LMDA. The whole process is shown in Fig. 5. After fetching a batch from the training dataset in an online augmentation manner, all samples, containing the original HDR frames before the data preparation of Sect. 3.1, are ranked by a fusion contribution score calculation module. The original LMDA is applied to the high-ranking samples, while the adjusted version is used for the low-ranking samples. The proposed fusion contribution score calculation module is detailed as follows:
\[\begin{align} & S^s_i = \left(\sum_{0<j\leq l_i} \left(G(L_j) > \psi\right)\right) / l_i \tag{7} \\ & S^m_i = \frac{\sum_{0<j<l_i} \left(1-p\left(\tilde{f}\left( f(L_j)\right),\tilde{f}\left( f(L_{j+1})\right)\right)\right)}{l_i-1} \tag{8} \\ & S_i = S^s_i+ \beta \cdot S^m_i \tag{9} \end{align}\]
where \(G\) denotes the grayscale transformation and \(\psi\) denotes the threshold that judges saturation. \(l_i\) denotes the length of the sample that contains the reference frame \(L_i\). \(f\) denotes the discrete Fourier transform, while \(\tilde{f}\) denotes the function that shifts the zero-frequency component to the center of the frequency-domain representation. \(p\) denotes the Pearson correlation coefficient computed after flattening its inputs into vectors. \(\beta\) denotes the weight in the sum. \(S^s_i\) denotes the average saturated area and \(S^m_i\) denotes the average motion degree. In other words, the average saturated area serves as the ill-exposure score, the average motion degree serves as the motion score, and the weighted sum of the two is used as the final score.
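A NumPy sketch of this score is given below; the grayscale conversion weights and the use of magnitude spectra for the Pearson correlation are assumptions where the text leaves details open.

```python
def fusion_contribution_score(frames, psi=0.8, beta=-1.0):
    """Fusion contribution score S_i of one sample (sketch of Eqs. (7)-(9))."""
    # Ill-exposure score: average fraction of saturated grayscale pixels.
    grays = [f @ np.array([0.299, 0.587, 0.114]) for f in frames]  # assumed weights
    s_sat = np.mean([(g > psi).mean() for g in grays])
    # Motion score: 1 - Pearson correlation between centered spectra of
    # adjacent frames, averaged over the sample (grayscale used for simplicity).
    spectra = [np.abs(np.fft.fftshift(np.fft.fft2(g))) for g in grays]
    corrs = [np.corrcoef(a.ravel(), b.ravel())[0, 1]
             for a, b in zip(spectra[:-1], spectra[1:])]
    s_motion = np.mean([1.0 - c for c in corrs])
    # Weighted sum; beta < 0 favors samples with less motion.
    return s_sat + beta * s_motion
```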
This process is also illustrated in Fig. 6. The saturated area in each frame is determined by the threshold. Considering that scenes containing intense object motion together with subtle camera motion are also challenging, all HDR frames are transformed into the frequency domain to calculate the motion score. Since samples with less motion need to be identified, \(\beta\) is set to a negative value.
As illustrated in Fig. 7, the original LMDA randomly selects a fixed window for local exposure augmentation (LEA), whereas the adjusted version employs a sliding window of the same size (and the same position in the first frame). The sliding window moves across the consecutive frames of a sample in a random direction with a constant per-frame distance. This operation explicitly introduces moving ill-exposed regions, thereby enhancing the impact of the augmented samples. Note that the local RGB permutation (LRP) is not involved.
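A sketch of the adjusted window placement is given below; the step length and the clamping at image borders are assumptions.

```python
def sliding_lea_windows(n_frames, frame_h, frame_w, patch_w, step, rng=np.random):
    """Window positions for the adjusted LMDA (sketch).

    The window keeps the same size, starts at a random position in the
    first frame, and slides in one random direction by a constant step
    per frame (clamped at the borders, an assumption).
    """
    y = rng.randint(0, frame_h - patch_w + 1)
    x = rng.randint(0, frame_w - patch_w + 1)
    dy, dx = 0, 0
    while dy == 0 and dx == 0:                 # ensure the window actually moves
        dy, dx = rng.choice([-1, 0, 1]), rng.choice([-1, 0, 1])
    positions = []
    for _ in range(n_frames):
        positions.append((y, x))
        y = int(np.clip(y + dy * step, 0, frame_h - patch_w))
        x = int(np.clip(x + dx * step, 0, frame_w - patch_w))
    return positions
```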
3.4 HDR-Targeted Test-Time Augmentation (HTTA)
Despite the additional computational cost and latency caused by test-time augmentation, recent advancements in GPU-based parallel computing and hardware make it an increasingly valuable auxiliary technique for low-level vision tasks. This paper proposes an HDR-targeted test-time augmentation method (HTTA). As shown in the bottom part of Fig. 1, the original inference inputs denote the unmodified inputs of the testing samples in the evaluation dataset. The augmented inference input set is generated from the original inference inputs through a predefined augmentation set. Given that pixel-level restoration requires larger memory and time costs than high-level vision tasks, this work only utilizes 4 geometric transformations: rotation (\(90^\circ\), \(270^\circ\)) and flipping (horizontal, vertical). Each transformation is applied individually.
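The predefined transformation set and its inverses could be sketched as follows; only the four transformations named above are used, the axis conventions are assumptions, and `reconstruct` is a hypothetical stand-in for the reconstruction network.

```python
# Forward test-time transformations and their inverses (sketch).
TTA_TRANSFORMS = {
    "rot90":  (lambda x: np.rot90(x, 1, axes=(0, 1)),
               lambda x: np.rot90(x, -1, axes=(0, 1))),
    "rot270": (lambda x: np.rot90(x, 3, axes=(0, 1)),
               lambda x: np.rot90(x, -3, axes=(0, 1))),
    "hflip":  (lambda x: x[:, ::-1], lambda x: x[:, ::-1]),
    "vflip":  (lambda x: x[::-1, :], lambda x: x[::-1, :]),
}

def htta_outputs(ldr_inputs, reconstruct):
    """Run the reconstruction on the original and each augmented input,
    then map every augmented output back to the original orientation."""
    outputs = [reconstruct(ldr_inputs)]
    for fwd, inv in TTA_TRANSFORMS.values():
        aug_inputs = [fwd(f) for f in ldr_inputs]
        outputs.append(inv(reconstruct(aug_inputs)))
    return outputs
```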
To perform the ensemble, the augmented inference output set is recovered to match the shape of the original inference output by applying the inverse of the predefined augmentations. All outputs are then combined with an ill-exposed outlier removal-based average ensemble method, shown in Fig. 8. Similar to Sect. 3.3, all output HDR frames are re-exposed with the over-exposure and processed into LDR frames, and the potential over-exposed areas are identified with thresholds. For such areas, the most deviating output, judged by its pixel values, is removed from the set, and the remaining outputs are combined with an average ensemble. Well-exposed areas are combined with a plain average ensemble. The results at different positions are merged to obtain the final inference output. To mitigate the information loss caused by saturation, the proposed method focuses on distinguishing over-exposed areas and ensuring a stable and effective reconstruction. Similar to classification-related test-time augmentation [40], [41], test-time augmentation here also helps to improve the accuracy of “pixel classification” in restoration tasks.
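A pixel-wise sketch of this ensemble is given below; identifying the outlier as the prediction farthest from the per-pixel median is an assumed interpretation of “the most different one”, and the re-exposure threshold follows Sect. 4.2.

```python
def outlier_removal_ensemble(hdr_outputs, t_high, sat_thresh=0.8):
    """Ill-exposed outlier removal-based average ensemble (sketch).

    hdr_outputs: list of aligned HDR predictions from HTTA.
    In potentially over-exposed pixels, the single most deviating
    prediction is dropped before averaging; elsewhere a plain average is used.
    """
    stack = np.stack(hdr_outputs)                        # (K, H, W, C)
    plain_avg = stack.mean(axis=0)
    # Locate potentially over-exposed pixels on the plain average,
    # re-exposed with the highest exposure as in Sect. 3.3.
    over_mask = np.clip(plain_avg * t_high, 0.0, None) ** (1.0 / 2.2) > sat_thresh
    # Drop the prediction farthest from the per-pixel median, then average.
    dist = np.abs(stack - np.median(stack, axis=0))
    worst = np.argmax(dist, axis=0)                      # outlier index per pixel
    keep = np.arange(stack.shape[0])[:, None, None, None] != worst[None]
    robust_avg = (stack * keep).sum(axis=0) / keep.sum(axis=0)
    return np.where(over_mask, robust_avg, plain_avg)
```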
4. Experiments
Experiments are performed to clarify the effectiveness of the proposals. For convenience, abbreviations (e.g., LMDA) are used in the following sections. In figures and charts, \(\mathcal{L}\) denotes LMDA, \(\mathcal{M}\) denotes MISRA, and \(\mathcal{H}\) denotes HTTA.
4.1 Baselines, Datasets and Metrics
The state-of-the-art HDR video reconstruction algorithm, Chen21 [7], is adopted as the baseline. Since the official code of Chen21 [7] applies RGB permutation during dataset preparation instead of during batch training, this work specifies the baseline as the version without RGB permutation, and RGB Perm.* as the original implementation. Kalantari19 [6] is also used to test the generalization ability of the proposals.
This paper utilizes the same synthetic training dataset [32], [43] as Chen21 [7] and keeps the same data processing approach, but the Vimeo-90K dataset [44], which contains only LDR images, is discarded for a fair comparison. The synthetic testing sequences in Chen21 [7] and Kalantari19 [6] are used for the evaluation. The real-world dataset in [7] is also adopted for the generalizability test. Notably, the training and testing datasets [7], [32], [43] exhibit variations in camera parameters such as image size, frame rate, and CRF setting, which significantly enhances the robustness of the trained model and provides a thorough validation. This paper also keeps the data-related settings consistent with [6] and [7]. The inputs in the training and testing datasets keep the same alternating exposure pattern (2 alternating exposures or 3 alternating exposures). In summary, the proposed full-stage augmentation method does not affect or depend on the original camera parameters of the input video clips in either the training or testing phase.
PSNR-T and HDR-VQM [45] are adopted as comparison metrics. The PSNR-T score is computed in the tone-mapped domain using the \(\mu\)-law with \(\mu=5000\). The computation settings of HDR-VQM are kept the same as [7].
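For reference, a minimal sketch of PSNR-T under the stated \(\mu\)-law setting is shown below, assuming the HDR values are normalized to [0, 1] before tone mapping.

```python
def psnr_t(hdr_pred, hdr_gt, mu=5000.0):
    """PSNR computed in the mu-law tone-mapped domain (sketch)."""
    def tonemap(x):
        # mu-law tone mapping with mu = 5000 on values normalized to [0, 1].
        return np.log(1.0 + mu * np.clip(x, 0.0, 1.0)) / np.log(1.0 + mu)
    mse = np.mean((tonemap(hdr_pred) - tonemap(hdr_gt)) ** 2)
    return 10.0 * np.log10(1.0 / mse)
```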
4.2 Implementation Details
This paper keeps the same training and test configurations as Chen21 [7]. Since the Vimeo-90K dataset is not adopted, the three training stages I, II and III of Chen21 [7] take more epochs (25, 20 and 3). The setup of the conventional augmentation works is the same as [36]. The patch number of local exposure augmentation is set to 10 and the patch width is set to 0.8% of the training input width, while the patch number of local RGB permutation is set to 1 and the width is set to 70%, equal to the patch-related methods in [36]. All methods except RGB Perm.* are applied during batch training. The number of alternating exposures is set to 2. The low-rank ratio in MISRA is set to 0.125. The threshold \(\psi\) for determining saturated areas is set to 0.8. \(\beta\) is set to \(-1\).
4.3 Evaluation Results and Analysis
The experimental results of different image manipulation-based data augmentation methods are shown in Table 1 and are divided into three groups for comparison: the baseline and non-RGB permutation methods, RGB permutation-related methods, and their combinations. Visualizations are shown in Fig. 9. CutOut [13], CutMix [14], and MixUp [15] clearly degrade the quality of the reconstruction. In contrast, the proposed local exposure augmentation (LEA) improves most metrics over Chen21 [7], which demonstrates the importance of preserving the structure of the original samples. In the second group, the proposed local RGB permutation (LRP) also outperforms all RGB-related methods. Blending severely disrupts under-exposed areas, resulting in a negative HDR-VQM score and poor visualization. By combining LEA and LRP, LMDA achieves better performance. The integration of MISRA and HTTA leads to further improvements in the final results. Note that the results of the conventional works and LMDA are consistent with the previous paper [16].
Table 1 Quantitative comparison of different image manipulation-based augmentation methods and proposed methods.
To demonstrate the generalization capability of the proposed methods, a related experiment is conducted and shown in Table 2. The model of Kalantari19 [6] contains 2.99M parameters, while the model of Chen21 [7] contains 6.44M parameters. The evaluation results indicate that the proposed data augmentation method improves the performance of different algorithms on both the synthetic and real-world datasets.
The impact of the training dataset size on the performance is investigated and shown in Table 3. LMDA is effective with almost any data size. When the training data volume is small, the overly challenging samples produced by MISRA may hinder the model's fitting. With increasing training data, MISRA gradually brings larger improvements. Thus, different data augmentation strategies can be chosen based on the specific situation in practical training.
Table 3 Evaluation results with different training dataset sizes. Only training-phase augmentations are included.
Experiments on different ensemble methods in test-time augmentation are also conducted. As shown in Table 4, compared with the conventional average ensemble, the ill-exposed outlier removal-based average ensemble achieves better performance on the sequences of Chen21 [7] and similar results on the real-world dataset [7], which indicates the effectiveness of HTTA for HDR tasks.
5. Discussion
5.1 Modularity and Generalization
In this paper, three proposals are presented, each of which is relatively independent and can be used separately in different algorithms according to specific requirements.
Similar to [36], [37], the proposed LMDA preserves the integrity of individual training samples, which is particularly essential for pixel-level restoration. Therefore, such a local strategy can be applied to other domains such as image deraining and image dehazing. The strategy of ranking samples and adjusting existing data augmentation methods accordingly can also be applied to different restoration tasks. For example, in depth estimation, samples with richer depth levels can be assigned higher ranks and given different augmentations. Similarly, the proposed HTTA can be adapted to the corresponding situations in other domains without affecting the training.
Hence, the proposed sub-methods can be separately or jointly transferred to other fields with reasonable domain knowledge.
5.2 Parameter Setting and Combination Approach
To identify the ideal size and number of patches in the local exposure augmentation (LEA), related experiments are performed and shown in Table 5. \(s\) indicates the ratio of the width of each patch to the width of the entire frame, and \(n\) indicates the number of patches. Large LEA patches force the model to erroneously learn a mapping from mixed-exposed LDR to HDR instead of the desired mapping from alternately exposed LDR to HDR. Conversely, multiple small LEA patches can be regarded as additional exposure noise, which improves the robustness of the trained model without sacrificing information. In the future, AutoAugment-related methods [39] can be used to determine the appropriate parameters efficiently.
The optimal way to combine LEA and LRP within LMDA is also investigated. Various combinations are tested: a) each operation is applied separately to each random batch; b) LRP is applied and fused on each patch that has already been augmented with LEA, using the same patch size; c) the original frame is divided into random partitions and a different augmentation method is applied to each partition, following YOCO [46]; d) LEA is applied after LRP for each batch sequentially without overlap. The related evaluation results are shown in Table 6. The sequential combination without overlap achieves the best evaluations, which demonstrates that a suitable combination should maintain the independence of the individual augmentation techniques by employing a relatively random scheme.
5.3 Fusion Contribution Score Calculation and Application Targets
Within the proposed MISRA, the fusion contribution score for ranking is determined by both motion and ill-exposure, especially over-exposure. This paper also investigates the feasibility of scoring based on under-exposed regions in a similar manner and explores the optimal combination. As shown in Table 7, the introduction of under-exposure scores causes an obvious degradation in the reconstruction results. Since the training samples contain an abundant amount of under-exposed regions, the introduced under-exposure scores can be regarded as noise in the ranking process. The current strategy of scoring based on over-exposed regions and motion degrees has proven to be a better choice.
In order to validate the effectiveness of MISRA, related experiments are performed by applying the adjusted LMDA augmentation to different samples within one batch. As shown in Table 8, applying the adjusted LMDA to all samples or to random samples without the ranking process leads to an obvious decline in the evaluation results. The possible reason is that high-ranked samples augmented with the adjusted LMDA include excessively challenging complex conditions, hampering the learning of an accurate reconstruction by existing deep learning models. Thus, MISRA is shown to be a suitable strategy for current HDR video reconstruction algorithms and datasets.
6. Conclusion
A full-stage data augmentation method called HDR-VDA is proposed for HDR video reconstruction. During training, a local area-based mixed data augmentation method provides diverse exposure and color patterns, which helps the model achieve a better capability on ill-exposed regions and complex color conditions. Combined with the motion and ill-exposure guided sample rank and adjustment strategy, additional information is supplemented into the training samples. To refine the result, an HDR-targeted test-time augmentation method is proposed to achieve a robust reconstruction. The overall method achieves a PSNR-T score of 38.91 dB and surpasses conventional works in terms of metrics and visual performance. Consequently, it offers an outlook for optimizing data augmentation in the HDR domain and other restoration tasks.
Acknowledgements
This work was supported by KAKENHI (21K11816).
References
[1] S.K. Nayar and T. Mitsunaga, “High dynamic range imaging: Spatially varying pixel exposures,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000, vol.1, pp.472-479, 2000.
[2] A. Serrano, F. Heide, D. Gutierrez, G. Wetzstein, and B. Masia, “Convolutional sparse coding for high dynamic range imaging,” Computer Graphics Forum, vol.35, no.2, pp.153-163, 2016.
[3] I. Choi, S.-H. Baek, and M.H. Kim, “Reconstructing interlaced high-dynamic-range video using joint learning,” IEEE Trans. Image Process., vol.26, no.11, pp.5353-5366, 2017.
[4] M. McGuire, W. Matusik, H. Pfister, B. Chen, J.F. Hughes, and S.K. Nayar, “Optical splitting trees for high-precision monocular imaging,” IEEE Comput. Graph. Appl., vol.27, no.2, pp.32-42, 2007.
[5] Y. Kamataki, Y. Kameda, Y. Kita, I. Matsuda, and S. Itoh, “Lossless coding of HDR color images in a floating point format using block-adaptive inter-color prediction,” IEICE Trans. Inf. & Syst., vol.E104-D, no.10, pp.1572-1575, Oct. 2021.
[6] N.K. Kalantari and R. Ramamoorthi, “Deep HDR video from sequences with alternating exposures,” Computer Graphics Forum, vol.38, no.2, pp.193-205, 2019.
[7] G. Chen, C. Chen, S. Guo, Z. Liang, K.-Y.K. Wong, and L. Zhang, “HDR video reconstruction: A coarse-to-fine network and a real-world benchmark dataset,” Proc. IEEE/CVF International Conference on Computer Vision, pp.2482-2491, 2021.
[8] N.K. Kalantari and R. Ramamoorthi, “Deep high dynamic range imaging of dynamic scenes,” ACM Trans. Graph., vol.36, no.4, 144, 2017.
[9] K.R. Prabhakar, R. Arora, A. Swaminathan, K.P. Singh, and R.V. Babu, “A fast, scalable, and reliable deghosting method for extreme exposure fusion,” 2019 IEEE International Conference on Computational Photography (ICCP), pp.1-8, 2019.
[10] E. Pérez-Pellitero, S. Catley-Chandar, R. Shaw, A. Leonardis, R. Timofte, Z. Zhang, C. Liu, Y. Peng, Y. Lin, G. Yu, et al., “Ntire 2022 challenge on high dynamic range imaging: Methods and results,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.1008-1022, 2022.
[11] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol.36, no.7, pp.1325-1339, 2013.
[12] S. Vicente, J. Carreira, L. Agapito, and J. Batista, “Reconstructing PASCAL VOC,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp.41-48, 2014.
[13] T. DeVries and G.W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
[14] S. Yun, D. Han, S. Chun, S.J. Oh, Y. Yoo, and J. Choe, “CutMix: Regularization strategy to train strong classifiers with localizable features,” Proc. IEEE/CVF International Conference on Computer Vision, pp.6022-6031, 2019.
[15] H. Guo, Y. Mao, and R. Zhang, “MixUp as locally linear out-of-manifold regularization,” Proc. AAAI Conference on Artificial Intelligence, vol.33, no.1, pp.3714-3722, 2019.
[16] F. Zhao, Q. Liu, and T. Ikenaga, “HDR-LMDA: A local area-based mixed data augmentation method for HDR video reconstruction,” 2023 IEEE International Conference on Image Processing (ICIP), pp.2020-2024, 2023.
[17] S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High dynamic range video,” ACM Trans. Graph., vol.22, no.3, pp.319-325, 2003.
[18] L. Bogoni, “Extending dynamic range of monochrome and color images through fusion,” Proc. 15th International Conference on Pattern Recognition, ICPR-2000, vol.3, pp.7-12, 2000.
[19] K. Jacobs, C. Loscos, and G. Ward, “Automatic high-dynamic range image generation for dynamic scenes,” IEEE Comput. Graph. Appl., vol.28, no.2, pp.84-93, 2008.
[20] F. Pece and J. Kautz, “Bitmap movement detection: HDR for dynamic scenes,” 2010 Conference on Visual Media Production, pp.1-8, 2010.
[21] W. Zhang and W.-K. Cham, “Reference-guided exposure fusion in dynamic scenes,” Journal of Visual Communication and Image Representation, vol.23, no.3, pp.467-475, 2012.
[22] T.-H. Oh, J.-Y. Lee, Y.-W. Tai, and I.S. Kweon, “Robust high dynamic range imaging by rank minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol.37, no.6, pp.1219-1232, 2015.
[23] B.-J. Yun, H.-D. Hong, and H.-H. Choi, “A contrast enhancement method for HDR image using a modified image formation model,” IEICE Trans. Inf. & Syst., vol.E95-D, no.4, pp.1112-1119, April 2012.
[24] M. Shimamoto, Y. Kameda, and T. Hamamoto, “HDR imaging based on image interpolation and motion blur suppression in multiple-exposure-time image sensor,” IEICE Trans. Inf. & Syst., vol.E103-D, no.10, pp.2067-2071, Oct. 2020.
[25] S.-W. Jung, H.-J. Kwon, D.-M. Son, and S.-H. Lee, “Generative adversarial network using weighted loss map and regional fusion training for LDR-to-HDR image conversion,” IEICE Trans. Inf. & Syst., vol.E103-D, no.11, pp.2398-2402, Nov. 2020.
[26] J. Wang, X. Li, and H. Liu, “Exposure fusion using a relative generative adversarial network,” IEICE Trans. Inf. & Syst., vol.E104-D, no.7, pp.1017-1027, July 2021.
[27] J. Wang, W. Wang, G. Xu, and H. Liu, “End-to-end exposure fusion using convolutional neural network,” IEICE Trans. Inf. & Syst., vol.E101-D, no.2, pp.560-563, Feb. 2018.
[28] Q. Yan, D. Gong, Q. Shi, A. van den Hengel, C. Shen, I. Reid, and Y. Zhang, “Attention-guided network for ghost-free high dynamic range imaging,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.1751-1760, 2019.
[29] Q. Yan, L. Zhang, Y. Liu, Y. Zhu, J. Sun, Q. Shi, and Y. Zhang, “Deep HDR imaging via a non-local network,” IEEE Trans. Image Process., vol.29, pp.4308-4322, 2020.
[30] Y. Niu, J. Wu, W. Liu, W. Guo, and R.W.H. Lau, “HDR-GAN: HDR image reconstruction from multi-exposed LDR images with large motions,” IEEE Trans. Image Process., vol.30, pp.3885-3896, 2021.
[31] S. Hajisharif, J. Kronander, and J. Unger, “Adaptive dualISO HDR reconstruction,” EURASIP Journal on Image and Video Processing, vol.2015, 41, 2015.
[32] J. Kronander, S. Gustavson, G. Bonnet, A. Ynnerman, and J. Unger, “A unified framework for multi-sensor HDR video reconstruction,” Signal Processing: Image Communication, vol.29, no.2, pp.203-215, 2014.
[33] S. Mangiat and J. Gibson, “High dynamic range video with ghost removal,” Applications of Digital Image Processing XXXIII, 779812, 2010.
[34] N.K. Kalantari, E. Shechtman, C. Barnes, S. Darabi, D.B. Goldman, and P. Sen, “Patch-based high dynamic range video,” ACM Trans. Graph., vol.32, no.6, 202, 2013.
[35] C. Shorten and T.M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of Big Data, vol.6, 60, 2019.
[36] J. Yoo, N. Ahn, and K.-A. Sohn, “Rethinking data augmentation for image super-resolution: A comprehensive analysis and a new strategy,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.8372-8381, 2020.
[37] Y. Ishii and T. Yamashita, “CutDepth: Edge-aware data augmentation in depth estimation,” arXiv preprint arXiv:2107.07684, 2021.
[38] N.-T. Tran, V.-H. Tran, N.-B. Nguyen, T.-K. Nguyen, and N.-M. Cheung, “On data augmentation for GAN training,” IEEE Trans. Image Process., vol.30, pp.1882-1897, 2021.
[39] E.D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q.V. Le, “Autoaugment: Learning augmentation strategies from data,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.113-123, 2019.
[40] I. Kim, Y. Kim, and S. Kim, “Learning loss for test-time augmentation,” Advances in Neural Information Processing Systems, vol.33, pp.4163-4174, 2020.
[41] D. Shanmugam, D. Blalock, G. Balakrishnan, and J. Guttag, “Better aggregation in test-time augmentation,” Proc. IEEE/CVF International Conference on Computer Vision, pp.1194-1203, 2021.
[42] A. Torralba and A.A. Efros, “Unbiased look at dataset bias,” CVPR 2011, pp.1521-1528, 2011.
[43] J. Froehlich, S. Grandinetti, B. Eberhardt, S. Walter, A. Schilling, and H. Brendel, “Creating cinematic wide gamut HDR-video for the evaluation of tone mapping operators and HDR-displays,” Proc. SPIE 9023, Digital Photography X, 90230X, 2014.
[44] T. Xue, B. Chen, J. Wu, D. Wei, and W.T. Freeman, “Video enhancement with task-oriented flow,” International Journal of Computer Vision, vol.127, no.8, pp.1106-1125, 2019.
[45] M. Narwaria, M.P. Da Silva, and P. Le Callet, “HDR-VQM: An objective quality measure for high dynamic range video,” Signal Processing: Image Communication, vol.35, pp.46-60, 2015.
[46] J. Han, P. Fang, W. Li, J. Hong, M.A. Armin, I. Reid, L. Petersson, and H. Li, “You only cut once: Boosting data augmentation with a single cut,” International Conference on Machine Learning, pp.8196-8212, 2022.