Open Access
Noisy Face Super-Resolution Method Based on Three-Level Information Representation Constraints

Qi QI, Zi TENG, Hongmei HUO, Ming XU, Bing BAI


Summary

To super-resolve low-resolution (LR) face images suffering from strong noise and blur interference, we present a novel approach to noisy face super-resolution (SR) based on three-level information representation constraints. First, we develop a feature distillation network that focuses on extracting pertinent face information, incorporating both a statistical anti-interference model and a latent contrast algorithm. We then incorporate a face identity embedding model and a discrete wavelet transform model, which serve as additional supervision mechanisms for the reconstruction process. The face identity embedding model constrains the reconstruction of identity information in a hypersphere identity metric space, while the discrete wavelet transform model operates in the wavelet domain to supervise the restoration of spatial structures. The experimental results clearly demonstrate the efficacy of the proposed method, as evidenced by the lower Learned Perceptual Image Patch Similarity (LPIPS) score and Fréchet Inception Distance (FID), and the overall practicability of the reconstructed images.

Publication
IEICE TRANSACTIONS on Fundamentals Vol.E108-A No.1 pp.40-44
Publication Date
2025/01/01
Publicized
2024/07/16
Online ISSN
1745-1337
DOI
10.1587/transfun.2024EAL2027
Type of Manuscript
LETTER
Category
Image

1.  Introduction

Face SR is an important research branch in the field of computer vision, mainly aimed at reconstructing a high-resolution (HR) face image from one or more LR face images captured in the same scene. Early research focused on interpolation methods [1], neighborhood embedding [2], and sparse representation [3]. Although these algorithms can achieve certain results, problems such as ringing, blurring, and artifacts often occur in the reconstructed images, severely limiting their quality.

As deep learning has become the mainstream research direction in computer vision, the successful application of deep neural networks to feature extraction and nonlinear mapping has provided new solutions to the face SR problem. For example, Cao et al. proposed a face SR algorithm based on an attention-aware mechanism [4], which uses deep reinforcement learning to find the face patches participating in face SR. Ma et al. proposed an iterative collaborative convolutional neural network that utilizes two recurrent networks for face image reconstruction and key point prediction [5]. Wang et al. applied generative facial priors to face SR and proposed a channel-split spatial feature transformation model to fuse reconstruction features with prior features [6]. However, because generative-prior-based methods can only recover face images with a fixed style (the style of the FFHQ dataset [7]), they generate detail features unrelated to the original face in order to improve human visual perception, and the generated results cannot effectively preserve identity information. In addition, existing face SR methods assume that the input image is noise-free. In practical application scenarios, input images contaminated by noise lead to a sharp decline in model performance, and the reconstructed face images may exhibit obvious identity confusion, which cannot meet practical application needs.

To solve the problem that existing methods cannot effectively remove the interference of degradation factors such as strong noise and blur from the face reconstruction process, we propose a noisy face SR method based on three-level information representation constraints. (1) We design a feature distillation network to extract effective face information, which exploits a statistical anti-interference model and a latent contrast algorithm to remove invalid information. (2) We design a face reconstruction network, which utilizes the extracted face features to reconstruct HR face images. (3) We deploy a face identity embedding model and a discrete wavelet transform model to further supervise the reconstruction of identity information and spatial structure from the hypersphere identity metric space and the wavelet domain, respectively.

2.  Proposed Method

2.1  Network Architecture

As shown in Fig. 1, the proposed network mainly consists of four parts, namely a feature distillation network, a face reconstruction network, an identity information embedding module, and a discrete wavelet transform module, forming a three-level framework of feature-level, semantic-level, and pixel-level reconstruction. (1) First, the feature distillation network removes invalid information such as noise from face images through a statistical anti-interference block (SAIB) and a latent space feature comparison algorithm, thereby achieving highly robust feature-level reconstruction in the latent space. (2) Then, the identity information embedding module applies identity recognition constraints to reduce the identity difference between the reconstructed face image and the HR face image, achieving semantic-level reconstruction of the face image. (3) Finally, the discrete wavelet transform module captures spatial and frequency information, decouples the reconstructed face image and the HR face image into different frequency sub-bands, and calculates the corresponding high-frequency band loss to achieve pixel-level reconstruction of high-frequency face details. The feature distillation network and the face reconstruction network are composed of multiple residual channel attention blocks (RCAB) [8] and Transformer blocks [9]. The corresponding layers of the two sub-networks are connected across layers to maximize the information flow between the convolution layers.

Fig. 1  Illustration of the proposed face SR network.

2.2  Latent Space Feature Distillation

Existing methods mainly constrain the reconstruction of face information at the pixel level. However, in noisy LR face images, accurate reconstruction of pixel-level information is very difficult, and relying solely on pixel-level constraints easily leads to semantic distortion. To address this issue, our method directly calculates the loss in the latent space and constrains the latent features, effectively improving the reconstruction accuracy of latent features. At the same time, backpropagating the latent feature loss from the middle of the network better guides the training of the encoder and improves its feature extraction ability. The reconstructed latent features can also be better expressed in the decoder network, efficiently guiding the decoder to recover face features.

To avoid the interference of noise and blur while extracting core features in the distillation network, a statistical anti-interference block (SAIB) adds random Gaussian noise \(\delta\) with a mean of 0 and a variance of 1 to the extracted features. Two convolution layers, denoted \(g(\cdot)\), then process the encoded features; this disturbance improves the anti-interference ability, making the network focus on more critical information and obtain latent features with stronger representation ability. The feature distillation loss \(\mathcal{L}_{\textit{latent}}\) based on the SAIB is defined as follows:

\[\begin{align} & F_t^{\textit{latent}}=\textit{SAIB} \left(\textit{Dis}\left(\theta_1,I_t\right)\right)= g\left(\textit{Dis}\left(\theta_1,I_t\right)+\delta\right), \tag{1} \\ & \mathcal{L}_{\textit{latent}}=\left\|F_{HR}^{\textit{latent}}-F_{SR}^{\textit{latent}} \right\|_1 \tag{2} \end{align}\]

where \(\theta_1\) is the parameter of the distillation network \(\textit{Dis}(\cdot)\), and \(F_t^{\textit{latent}}\) denotes the latent feature, obtained from the distilled feature \(\textit{Dis}(\theta_1,I_t)\) by the statistical anti-interference block \(\textit{SAIB}(\cdot)\). The subscript \(t\) indicates whether the input image \(I\) is an HR face or a noisy LR face.
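To make the latent-space distillation step concrete, the following PyTorch sketch implements the SAIB of Eq. (1) and the distillation loss of Eq. (2). The channel width and the exact layout of the two convolution layers \(g(\cdot)\) are our assumptions; the letter specifies only two convolution layers and zero-mean, unit-variance Gaussian noise.

```python
# Minimal sketch of the SAIB and the latent feature distillation loss.
import torch
import torch.nn as nn

class SAIB(nn.Module):
    def __init__(self, channels=64):  # channel width is a hypothetical choice
        super().__init__()
        # g(.): two convolution layers that process the perturbed features
        self.g = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, feat):
        # add random Gaussian noise with mean 0 and variance 1 (Eq. (1))
        delta = torch.randn_like(feat)
        return self.g(feat + delta)

def latent_loss(f_hr_latent, f_sr_latent):
    # L1 distance between HR and SR latent features (Eq. (2))
    return torch.mean(torch.abs(f_hr_latent - f_sr_latent))
```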

2.3  Face Identity Embedding

As a specific application within image SR, face SR can be guided by facial prior information during reconstruction, such as face component heatmaps [10], face shape with key points [11], face wavelet coefficients [12], and local structural priors [13]. However, in high-noise environments the semantic prior information in a noisy LR face image is difficult to estimate accurately, so the reconstruction performance of these methods also declines significantly. Unlike these methods, which directly use semantic face priors, our method uses face identity priors to guide the reconstruction of face images with realistic texture details and unchanged identity information.

To accurately recover identity-related face details, we propose an identity recognition constraint model to reduce the identity difference between reconstructed face images and HR face images. First, owing to the excellent performance of the hypersphere space in face identity representation, we use it as the identity metric space and utilize the pretrained identity information embedding model LightCNNv9 [14] to extract identity-related features. Then, Euclidean regularization maps the identity features onto the hypersphere space for identity loss calculation.

As shown in Fig. 2, there are significant differences in angle and amplitude between noisy LR face features and denoised HR face features in the hypersphere identity metric space. Moreover, as the resolution decreases, the average angle and amplitude differences between low- and high-resolution face feature pairs gradually increase. This indicates that angle and amplitude reflect the degree of feature degradation and can be used to predict the quality of reconstructed face images. Therefore, we propose a feature deconstruction based identity recognition loss, which calculates the angle and amplitude losses \((\mathcal{L}_a, \mathcal{L}_m)\) between the reconstructed and HR face identity vectors in the hypersphere metric space, and sums them to obtain the identity recognition loss \(\mathcal{L}_{id}\):

\[\begin{align} & \mathcal{L}_a=1-\frac{\left(F_{SR}^{id}\right)^T\left(F_{HR}^{id}\right)}{ \left\|F_{SR}^{id}\right\|_2\left\|F_{HR}^{id}\right\|_2}, \tag{3} \\ & \mathcal{L}_m=\left\|\,\left\|F_{SR}^{id}\right\|_2-\left\|F_{HR}^{id}\right\|_2\right\|_2, \tag{4} \\ & \mathcal{L}_{id}=\mathcal{L}_a+\mathcal{L}_m \tag{5} \end{align}\]

where \(F_{SR}^{id}\) and \(F_{HR}^{id}\) represent the identity feature vectors produced by the identity information embedding model LightCNNv9. By precisely constraining the angle and amplitude of the reconstructed face identity vector in the hypersphere metric space, our model can reconstruct HR face images with realistic texture details and unchanged identity information.
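A minimal sketch of the identity recognition loss of Eqs. (3)-(5) follows, assuming identity vectors of shape (batch, dim) already extracted by a pretrained embedding network such as LightCNNv9 (the extractor itself is not shown).

```python
# Feature-deconstruction identity loss: angle + amplitude terms.
import torch
import torch.nn.functional as F

def identity_loss(f_sr_id: torch.Tensor, f_hr_id: torch.Tensor) -> torch.Tensor:
    # angle term: 1 - cosine similarity between the identity vectors (Eq. (3))
    l_a = 1.0 - F.cosine_similarity(f_sr_id, f_hr_id, dim=1)
    # amplitude term: difference of the L2 norms (Eq. (4))
    l_m = torch.abs(f_sr_id.norm(p=2, dim=1) - f_hr_id.norm(p=2, dim=1))
    # total identity recognition loss (Eq. (5)), averaged over the batch
    return (l_a + l_m).mean()
```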

Fig. 2  Hypersphere metric space.

2.4  Wavelet Knowledge Distillation

Existing methods can accurately reconstruct the low-frequency details of face images, but they often fail to reconstruct high-frequency details effectively, and this phenomenon is more pronounced in high-noise environments. To solve this problem, we use the discrete wavelet transform to decouple the reconstructed face image and the HR face image into sub-bands of different frequencies, and then calculate the L1 loss on the high-frequency sub-bands to constrain the reconstruction of high-frequency face details.

The discrete wavelet transform is commonly used as a mathematical tool for decoupling pyramid images. In our method, the reconstructed face images and HR face images are decoupled into four sub-bands, namely LL, LH, HL, and HH, where LL is the low-frequency sub-band and the rest are high-frequency sub-bands. As shown in Fig. 3, if the discrete wavelet transform is denoted \(\Psi(\cdot)\), the high- and low-frequency sub-band images of image \(I\) can be represented as \(\Psi^H(I)\) and \(\Psi^L(I)\), respectively. The wavelet knowledge loss \(\mathcal{L}_{\textit{wav}}\) is defined as follows:

\[\begin{equation*} \mathcal{L}_{\textit{wav}}=\left\|\Psi^H\left(I_{\textit{HR}}\right)-\Psi^H \left(I_{\textit{SR}}\right)\right\|_1, \quad \Psi^H(I)=\left\{I^{\textit{LH}},I^{\textit{HL}},I^{\textit{HH}}\right\} \tag{6} \end{equation*}\]

where \(I_{SR}\) and \(I_{HR}\) represent the SR and HR face images, respectively. Compared with other frequency analysis methods such as the Fourier transform, the wavelet transform captures the spatial and frequency information of image signals more effectively [15]. The effectiveness of wavelet knowledge distillation in recovering high-frequency details lies in its ability to decompose images into different frequency bands, capturing details at various scales. This decomposition concentrates the face SR model on the high-frequency bands, where fine textures and details are effectively enhanced. Additionally, the localized nature of the wavelet transform allows precise analysis of local features, aiding the restoration of facial details. Moreover, the reversibility of the wavelet transform ensures that adjustments made in the wavelet domain can be accurately applied when reconstructing the original image, enhancing face super-resolution performance and enabling our model to reconstruct face images with high-quality high-frequency detail textures.
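The following sketch implements Eq. (6) assuming a single-level Haar decomposition; the letter does not name the wavelet basis, so the Haar filters are our choice.

```python
# Wavelet knowledge loss over the LH/HL/HH sub-bands of a Haar DWT.
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor) -> torch.Tensor:
    # 2x2 Haar analysis filters (normalized by 1/2)
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
    hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    c = x.shape[1]
    # one filter bank per input channel, applied depthwise with stride 2
    kernels = torch.stack([ll, lh, hl, hh]).to(x).unsqueeze(1).repeat(c, 1, 1, 1)
    # output has 4*c channels, ordered (LL, LH, HL, HH) per input channel
    return F.conv2d(x, kernels, stride=2, groups=c)

def wavelet_loss(i_hr: torch.Tensor, i_sr: torch.Tensor) -> torch.Tensor:
    hr, sr = haar_dwt(i_hr), haar_dwt(i_sr)
    c = i_hr.shape[1]
    # keep only the high-frequency sub-bands LH, HL, HH (Eq. (6))
    idx = [i for i in range(4 * c) if i % 4 != 0]
    return torch.mean(torch.abs(hr[:, idx] - sr[:, idx]))
```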

Fig. 3  Discrete wavelet transform.

3.  Experiments

3.1  Experimental Setup

In our experience, training with the entire CelebA dataset [16] does not significantly improve the performance of a face super-resolution network, but it does significantly increase the training time. Therefore, we selected 40000 face images from CelebA for training and used the next 1000 face images as the test set. To demonstrate that our method can reconstruct clear face images under noise and blur interference, three degradation models were used to synthesize the LR face images. We use the bicubic operation to produce LR images with a scale factor of 8 (Bic). Then, Gaussian noise with a noise level of 15 is added to the 8-fold downsampled images to obtain noisy LR face images (BicN). Finally, to produce degraded face images affected simultaneously by noise and blur, we first apply a Gaussian blur kernel with a standard deviation of 1.5 and a size of \(7 \times 7\) to the HR faces, then perform bicubic downsampling with a scale factor of 8, and finally add Gaussian noise with a noise level of 30 (BBicN).
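For reference, the three degradation pipelines could be synthesized as in the sketch below, implemented with OpenCV and NumPy. Interpreting the noise levels 15 and 30 on the 0-255 intensity scale is our assumption, as the letter does not state it explicitly.

```python
# Sketch of the Bic / BicN / BBicN degradation pipelines.
import cv2
import numpy as np

def degrade(hr: np.ndarray, mode: str = "Bic", scale: int = 8) -> np.ndarray:
    img = hr.astype(np.float32)
    if mode == "BBicN":
        # Gaussian blur: 7x7 kernel, standard deviation 1.5
        img = cv2.GaussianBlur(img, (7, 7), sigmaX=1.5)
    h, w = img.shape[:2]
    # bicubic downsampling with scale factor 8
    img = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    if mode in ("BicN", "BBicN"):
        sigma = 15.0 if mode == "BicN" else 30.0
        img += np.random.normal(0.0, sigma, img.shape)  # additive Gaussian noise
    return np.clip(img, 0, 255).astype(np.uint8)
```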

In terms of training settings, the Adam algorithm is used as the optimizer, the batch size is 16, and the model was trained on a TITAN X GPU. To evaluate the quality of the face SR results, we employ the Learned Perceptual Image Patch Similarity (LPIPS) score [17] and the Fréchet Inception Distance (FID) [18] to assess the perceptual realism of the generated faces, as pixel-space metrics only measure local distortions and may not align with human perception.
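A sketch of the evaluation step using the publicly available lpips and torchmetrics packages is given below; the letter does not state which LPIPS/FID implementations were used, so these library choices are assumptions.

```python
# Perceptual evaluation with LPIPS and FID (third-party implementations).
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

lpips_fn = lpips.LPIPS(net="alex")            # LPIPS expects inputs in [-1, 1]
fid = FrechetInceptionDistance(feature=2048)  # FID over Inception-v3 features

def evaluate_batch(sr: torch.Tensor, hr: torch.Tensor) -> float:
    # sr, hr: (batch, 3, H, W) float tensors scaled to [0, 1]
    d = lpips_fn(sr * 2 - 1, hr * 2 - 1).mean()
    fid.update((hr * 255).to(torch.uint8), real=True)
    fid.update((sr * 255).to(torch.uint8), real=False)
    return d.item()

# after accumulating all batches: fid_score = fid.compute().item()
```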

3.2  Comparative Experiments

To evaluate the performance of our method, the state-of-the-art (SOTA) face SR methods PaniniNet [19], DIC [5], GPEN [20], and GFPGAN [6] were selected for comparative experiments on the CelebA test set under different degradation processes. Table 1 presents the quantitative evaluation results of our method and these SOTA methods on the CelebA dataset. The quantitative comparison shows that our method achieves significantly lower LPIPS and FID scores than the SOTA methods. To further evaluate the visual quality of our method, a qualitative comparison was conducted under the different degradation processes.

Table 1  Quantitative evaluation results of our method and SOTA methods.

As shown in Fig. 4, in the absence of noise and blur (Bic), existing methods can reconstruct ideal facial details while preserving identity information to a certain extent. However, under the influence of noise, the performance of existing methods declines to varying degrees, whereas our method can still reconstruct clearer face images. As shown in Fig. 5, the face images reconstructed by our method have clearer texture details and better preserve identity information. Although GFPGAN combines generative facial priors with reconstruction features to produce face images with high-frequency detail textures, it can only recover face images with a fixed style and fails to retain identity information.

Fig. 4  Qualitative comparison results of our method and existing methods on all the degradation models.

Fig. 5  Qualitative comparison results of our method and existing methods on the BicN degradation model.

As shown in Fig. 6, after introducing the blur degradation factor, the contaminated LR input causes a sharp decline in the performance of each model, and the reconstructed face images exhibit obvious identity confusion, which cannot meet practical application needs. Benefiting from the proposed three-level information representation constraints, the visual quality of the face images reconstructed by our model does not decrease significantly even as the degradation process becomes more complex, so the method has good practicality.

Fig. 6  Qualitative comparison results of our method and existing methods on the BBicN degradation model.

4.  Conclusion

In this paper, we designed a feature distillation network to extract effective facial information, which exploits a statistical anti-interference model and a latent contrast algorithm to remove invalid information such as noise. We also designed a face reconstruction network that utilizes the extracted face features to reconstruct HR face images. Finally, we deployed a face identity embedding model and a discrete wavelet transform model to further supervise the reconstruction of identity information and spatial structure from the hypersphere identity metric space and the wavelet domain, respectively. The experimental results showed that the proposed method not only removes noise from faces in high-noise environments but also effectively improves the resolution of face images, obtaining better LPIPS and FID performance and good practicability.

References

[1] F.N. Fritsch and R.E. Carlson, “Monotone piecewise cubic interpolation,” SIAM J. Numer. Anal., vol.17, no.2, pp.238-246, 1980.

[2] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE Trans. Pattern Anal. Mach. Intell., vol.45, no.5, pp.5461-5480, 2023.

[3] J. Yang, J. Wright, T.S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol.19, no.11, pp.2861-2873, 2010.

[4] Q. Cao, L. Lin, Y. Shi, X. Liang, and G. Li, “Attention-aware face hallucination via deep reinforcement learning,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp.690-698, 2017.

[5] C. Ma, Z. Jiang, Y. Rao, J. Lu, and J. Zhou, “Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.5569-5578, 2020.

[6] X. Wang, Y. Li, H. Zhang, and Y. Shan, “Towards real-world blind face restoration with generative facial prior,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.9168-9178, 2021.

[7] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.4401-4410, 2019.

[8] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” Proc. European Conference on Computer Vision (ECCV), pp.286-301, 2018.

[9] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” Advances in Neural Information Processing Systems, vol.34, pp.15908-15919, 2021.

[10] X. Yu, B. Fernando, B. Ghanem, F. Porikli, and R. Hartley, “Face super-resolution guided by facial component heatmaps,” Proc. European Conference on Computer Vision (ECCV), pp.217-233, 2018.

[11] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “FSRNet: End-to-end learning face super-resolution with facial priors,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp.2492-2501, 2018.

[12] H. Huang, R. He, Z. Sun, and T. Tan, “Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution,” Proc. IEEE International Conference on Computer Vision, pp.1689-1697, 2017.

[13] J. Jiang, C. Chen, J. Ma, Z. Wang, Z. Wang, and R. Hu, “SRLSP: A face image super-resolution algorithm using smooth regression with local structure prior,” IEEE Trans. Multimedia, vol.19, no.1, pp.27-40, 2016.

[14] X. Wu, R. He, Z. Sun, and T. Tan, “A light CNN for deep face representation with noisy labels,” IEEE Trans. Inf. Forensics Security, vol.13, no.11, pp.2884-2896, 2018.

[15] S. Mallat, A Wavelet Tour of Signal Processing, pp.83-85, Elsevier, 1999.

[16] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” Proc. IEEE International Conference on Computer Vision, pp.3730-3738, 2015.

[17] R. Zhang, P. Isola, A.A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp.586-595, 2018.

[18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in Neural Information Processing Systems, vol.30, 2017.

[19] Y. Wang, Y. Hu, and J. Zhang, “Panini-net: GAN prior based degradation-aware feature interpolation for face restoration,” Proc. AAAI Conference on Artificial Intelligence, vol.36, no.3, pp.2576-2584, 2022.

[20] T. Yang, P. Ren, X. Xie, and L. Zhang, “GAN prior embedded network for blind face restoration in the wild,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.672-681, 2021.

Authors

Qi QI
  Party School of Liaoning Provincial Party Committee
Zi TENG
  Northeastern University
Hongmei HUO
  Party School of Liaoning Provincial Party Committee
Ming XU
  Shenyang University of Technology
Bing BAI
  Party School of Liaoning Provincial Party Committee
