1. Introduction
Fish skin color not only serves as an indicator of the physical health of the fish but also plays a vital role in determining its market value [1], [2]. Notably, the distinct skin color attributes of Plectropomus leopardus have a significant impact on its economic value in aquaculture [3], [4]. Therefore, detecting the skin color of Plectropomus leopardus is valuable for optimizing color-enhancing feeding, thereby reducing farming costs and enabling accurate estimation of economic returns on the farm.
Fish skin color detection has made significant progress over the years [5]. The conventional method, visual observation, may introduce inaccuracies due to variations in human perception [6]. To standardize color metrics, instrumental detection methods employing tools such as colorimeters have emerged [7]. However, this approach demands a considerable investment in equipment, manpower, and time. Subsequently, an innovative method utilizing a computer vision system gradually gained popularity as a preferred approach for skin color detection [8]. Nevertheless, this method requires removing the fish from the water and placing it at a fixed shooting angle, which may negatively affect its health, particularly when anesthesia is used. Consequently, researchers turned to in-situ detection methods, aiming to collect fish images and detect body color without interference [8].
However, current in-situ detection studies are conducted in controlled experimental environments with adjustable lighting. Therefore, to facilitate widespread implementation in real breeding scenarios, the following challenges still need to be addressed:
- Image color cast. Seawater’s light absorption leads to a greenish color cast in images.
- Deployment on mobile devices. Simple, fast mobile detection is more conducive to assisting aquaculturists in making decisions.
In this paper, we utilize the proposed Variance Gray World Algorithm (VGWA) approach to address the image color cast and subsequently apply Mosaic for image augmentation. Additionally, we incorporate Hybrid Spatial Pyramid Pooling (HSPP) to optimize the YOLOv5s network structure for feature extraction. The highlights of this paper are threefold:
- Proposed the VGWA approach to restore fish skin color by correcting image color cast.
- Proposed the HSPP module to fuse multiscale features, which combines cascaded and parallel connections of max-pooling layers.
- Deployed the VH-YOLOv5s model on the mobile phone, which can directly detect the skin color of the transmitted or real-time captured images.
2. Related Works
2.1 The Instrumental Color Detection Method
Instrumental color detection methods require the use of specialized equipment, such as a colorimeter. Ninwichian et al. [2] employed an SC80B® colorimeter to measure fish skin color intensity in hybrid catfish using five parameters: lightness, redness or greenness, yellowness or blueness, chroma, and hue angle. Hien et al. [9] used the CR200 colorimeter to measure the L*, a*, and b* values of the body skin, abdomen, and muscles of 12 fish through multiple repetitions. Pattanaik et al. [6] utilized a Hunter Lab Scan XE-colourimeter to measure the L*, a*, b*, and dE values for evaluating skin color through colorimetric analysis.
These methods require individual and repeated measurements of fish skin color, demanding both human and equipment resources and consuming a considerable amount of time.
2.2 The Computer Vision System Method
The computer vision system method involves capturing fish body images from a fixed position and utilizing image analysis software for body color detection. Tu et al. [10] used Adobe Photoshop CC 2018 software to analyze fish skin stripes and four points of the background for fish body color detection. Yang et al. [7] selected oval regions of interest (ROIs) from Carassius auratus and used ImageJ 1.51 software to measure the non-weighted red-green-blue intensity scores of each ROI for skin color evaluation. Gümüş et al. [11] analyzed the L*, a*, b* values from fish images and calculated the average for each fish using LensEye-NET software for the color analysis of Silurus glanis and Clarias gariepinus.
All the mentioned methods require taking the fish from the aquatic environment to capture images at a fixed position, relying on frequent manual operations and potentially affecting the fish’s health if anesthetized [12].
2.3 The In-Situ Color Detection Method
The in-situ detection method involves directly collecting data and detecting fish skin color without disturbing the fish. Nguyen et al. [8] introduced an in-situ vision-based measurement technique for detecting clownfish skin color. Yi et al. [13] used a fixed Q14 gray-scale at the middle of the bottom of the aquarium as a reference object for detecting fish skin color and measuring color changes.
While the aforementioned methods reduce the reliance on manpower, the majority of studies on fish skin color using in-situ detection have been conducted in laboratory settings [14]. Therefore, to implement the method in aquaculture, it is essential to correct the collected fish images for color cast.
2.4 Image Color Cast Correction
Presently, there are two types of image color cast correction methods: data-driven methods and image-restoration methods. On one hand, the data-driven method entails collecting datasets to train network models for color cast correction. J. Li et al. [15] constructed a water generative adversarial network model for color cast correction, capable of generating realistic monocular underwater images. Liu et al. [16] proposed a deep multiscale feature fusion net based on the conditional generative adversarial network to extract and augment multiscale features for image color correction. Although data-driven methods ensure color consistency, their large number of parameters makes them challenging to integrate into a fish skin color detection network. On the other hand, the image-restoration method relies on prior knowledge to restore the image to its original state before degradation. Zhang et al. [17] utilized a retinex-inspired method to effectively remove the color cast induced by underwater light scattering for color correction. C. Li & Zhang [18] integrated background light estimation with an automatic white balance method to overcome the limitations of the classical dark channel prior method for underwater image restoration and color cast correction. The image-restoration method can alleviate the color cast to a certain extent, but light intensity varies within the same image, and compensating the channels with a single indicator may result in over-correction or under-correction.
To solve the above problems, we have proposed the VH-YOLOv5s method to realize fish skin color detection in aquaculture.
3. Materials and Methodology
3.1 Dataset
3.1.1 Data Collection
Experimental data were collected using a data collection platform developed in the aquaculture workshop of Laizhou Mingbo Aquatic Co. in Shandong. Figure 2 (a) depicts a real breeding workshop, whereas Fig. 2 (b) shows the data-collection platform. The data-collection platform consists of a camera (Hikvision 3T86FWDV2-I3S, with 8 megapixels and a 4-mm focal length), camera bracket, camera memory card, computer, and other equipment. First, the camera was attached to a bracket approximately 1.5 m above the breeding pond. Subsequently, the camera memory card was connected to the computer via a network cable, and the captured fish video data were imported into the computer for storage. The video data were frame-intercepted and cropped to create a skin color dataset containing 4300 images, all in RGB format with a pixel resolution of \(800 \times 1280\), and stored in JPG format.
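The frame-interception step can be summarized by the following minimal OpenCV sketch; the sampling interval, the resize step, and the file naming are illustrative assumptions rather than the exact pipeline used in this study.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, step=30, size=(1280, 800)):
    """Sample every `step`-th frame, resize to 800x1280 pixels
    (OpenCV takes width x height), and store as JPG."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"frame_{saved:05d}.jpg"), cv2.resize(frame, size))
            saved += 1
        idx += 1
    cap.release()
    return saved
```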
3.1.2 Data Annotation
The fish skin color annotation provides the ground truth of fish skin color, enabling the evaluation of fish skin color detection models. In this paper, the skin color of Plectropomus leopardus was categorized into three classes based on the requirements for fish skin color detection and its relevance to the actual aquaculture environment: red Plectropomus leopardus with a red skin or red patches, pink Plectropomus leopardus with a slightly pink color, and black Plectropomus leopardus with a black skin or black patches [19]. The online labeling software, makesense.ai, was utilized to annotate the fish skin color in the images and generate the corresponding TXT files, as illustrated in Fig. 3. Number 1 symbolizes red Plectropomus leopardus, number 2 indicates pink Plectropomus leopardus, and number 3 represents black Plectropomus leopardus. The distribution of Plectropomus leopardus per color category in the Train and Validation datasets is presented in Table 1.
As shown in Table 1, the pink Plectropomus leopardus has a higher number of data samples, the red Plectropomus leopardus exhibits a more balanced distribution, and the black Plectropomus leopardus has a smaller number of data samples.
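For reference, the sketch below parses one exported TXT label file. It assumes the standard YOLO layout and a hypothetical 0-indexed class mapping (the classes are labeled 1-3 in Fig. 3, whereas YOLO label files are typically 0-indexed); the actual export settings may differ.

```python
from pathlib import Path

# Illustrative class-index mapping; the actual index order may differ.
CLASS_NAMES = {0: "red", 1: "pink", 2: "black"}

def load_labels(txt_path):
    """Parse a TXT label file, assuming the standard YOLO layout:
    `class_id x_center y_center width height`, all normalized to [0, 1]."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        cls, xc, yc, w, h = line.split()
        boxes.append({"color": CLASS_NAMES.get(int(cls), "unknown"),
                      "bbox": (float(xc), float(yc), float(w), float(h))})
    return boxes
```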
3.2 VH-YOLOv5s Network Architecture
VH-YOLOv5s is a one-stage target detection method comprising efficient input, backbone, neck, and head parts, designed to achieve high-speed and accurate object detection. It makes the following contributions: (1) The VGWA method is incorporated into the input part to effectively correct image color cast. (2) The Mosaic approach is applied at the input stage for effective image augmentation. (3) The BC-C3TR module, comprising the BottleneckCSP and C3TR modules, is strategically placed in the backbone and neck parts to enhance the model’s capacity for feature representation. (4) The HSPP module is positioned in the backbone part to capture multi-scale feature information. The skin color detection network structure is presented in Fig. 4, with the enhanced components highlighted by yellow, green, blue, and red bounding boxes, respectively.
3.2.1 Variance Gray World Algorithm
Due to the absorption and refraction of light in seawater, images may exhibit color cast, which can subsequently impact the accuracy of fish skin color detection. The proposed VGWA approach utilizes color space channel variance to optimize channel weights, enhancing fish skin color detection performance by compensating for channel discrepancies and correcting color cast. The VGWA approach calculates the gray value K by considering both variance and mean weighting, enabling efficient adjustment of color cast in the three channels, as shown in Eq. (1):
\[\begin{align} K=(C_{r} \ast R_{avg} +C_{g} \ast G_{avg} +C_{b} \ast B_{avg} )/3 \tag{1} \end{align}\]
where \(K\) represents the gray value of the VGWA, and \(C_{r}\), \(C_{g}\), and \(C_{b}\) are the weight coefficients for the channel means \(R_{avg}\), \(G_{avg}\), and \(B_{avg}\), respectively.
Owing to the variability of light absorption by seawater, the captured images often exhibit a greenish or bluish color cast. Therefore, \(C_{r}\), \(C_{g}\), and \(C_{b}\) in Eq. (1) are defined for three cases. When the image exhibits a bluish cast, that is, \(B_{avg} > max (G_{avg}, R_{avg})\), the weight coefficients for \(R_{avg}\), \(G_{avg}\), and \(B_{avg}\) are as shown in Eq. (2):
\[\begin{align} \left\{ \begin{array}{@{}l@{}} C_{r} =B_{var} /R_{var} \\ C_{g} =B_{var} /G_{var} \\ C_{b} =1 \end{array} \right. \tag{2} \end{align}\]
where \(R_{var}\), \(G_{var}\), and \(B_{var}\) are the variances of the pixel values of \(R\), \(G\), and \(B\), respectively. When the image exhibits a greenish color cast, that is, \(G_{avg} > max (R_{avg}, B_{avg})\), the weight coefficients for \(R_{avg}\), \(G_{avg}\), and \(B_{avg}\) are as shown in Eq. (3):
\[\begin{align} \left\{ \begin{array}{@{}l@{}} C_{r} =G_{var} /R_{var} \\ C_{g} =1 \\ C_{b} =G_{var} /B_{var} \end{array} \right. \tag{3} \end{align}\]
When there is no channel color cast, the weight coefficients for \(R_{avg}\), \(G_{avg}\), and \(B_{avg}\) are as shown in Eq. (4).
\[\begin{align} C_{r} =C_{g} =C_{b} =1 \tag{4} \end{align}\] |
The channel adjustment factors of the VGWA approach are given by Eq. (5).
\[\begin{align} \left\{ \begin{array}{@{}l@{}} K_{r} =K/R_{avg} \\ K_{g} =K/G_{avg} \\ K_{b} =K/B_{avg} \end{array} \right. \tag{5} \end{align}\]
where \(K_{r}\), \(K_{g}\), and \(K_{b}\) are the channel correction coefficients of \(R\), \(G\), and \(B\), respectively. The corrected pixel values are then obtained as shown in Eq. (6).
\[\begin{align} \left\{ \begin{array}{@{}l@{}} R'=K_{r} \ast R \\ G'=K_{g} \ast G \\ B'=K_{b} \ast B \\ \end{array} \right. \tag{6} \end{align}\]
where \(R\), \(G\), and \(B\) are the pixel values of the three channels in the image, and \(R'\), \(G'\), and \(B'\) are the three-channel pixel values of the image after color cast correction.
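The complete VGWA procedure defined by Eqs. (1)-(6) can be summarized by the following NumPy sketch; the function name and the final clipping to the 8-bit range are our own additions.

```python
import numpy as np

def vgwa_correct(img):
    """Variance Gray World correction following Eqs. (1)-(6).
    `img` is an H x W x 3 RGB array (uint8); returns the corrected image."""
    rgb = img.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    r_avg, g_avg, b_avg = r.mean(), g.mean(), b.mean()
    r_var, g_var, b_var = r.var(), g.var(), b.var()

    # Channel weights depend on which channel dominates (Eqs. (2)-(4)).
    if b_avg > max(g_avg, r_avg):        # bluish cast
        cr, cg, cb = b_var / r_var, b_var / g_var, 1.0
    elif g_avg > max(r_avg, b_avg):      # greenish cast
        cr, cg, cb = g_var / r_var, 1.0, g_var / b_var
    else:                                # no dominant cast
        cr = cg = cb = 1.0

    # Weighted gray value (Eq. (1)) and per-channel gains (Eq. (5)).
    k = (cr * r_avg + cg * g_avg + cb * b_avg) / 3.0
    kr, kg, kb = k / r_avg, k / g_avg, k / b_avg

    # Apply the gains (Eq. (6)) and clip back to the valid 8-bit range.
    corrected = np.stack([r * kr, g * kg, b * kb], axis=-1)
    return np.clip(corrected, 0, 255).astype(np.uint8)
```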
3.2.2 Mosaic Approach
In this paper, data augmentation was achieved through the Mosaic approach, effectively improving the performance of skin color detection in Plectropomus leopardus [20]. Mosaic expands the dataset by splicing, scaling, and randomly cropping multiple images to synthesize new ones, thereby improving the model's robustness. The pseudocode for the data augmentation process is depicted in Approach 1, and the visual effect is demonstrated in Fig. 5.
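Since Approach 1 is given only as pseudocode, the following is a minimal sketch of the core stitching step; label remapping and the random scaling and cropping of the source images are omitted, and the padding value and center range are assumptions.

```python
import random
import cv2
import numpy as np

def mosaic_4(images, out_size=640):
    """Tile four images around a random center point, as in the
    YOLOv4/YOLOv5 Mosaic augmentation."""
    assert len(images) == 4
    s = out_size
    canvas = np.full((s, s, 3), 114, dtype=np.uint8)   # gray padding value
    xc = random.randint(s // 4, 3 * s // 4)            # random mosaic center
    yc = random.randint(s // 4, 3 * s // 4)
    regions = [(0, 0, xc, yc), (xc, 0, s, yc), (0, yc, xc, s), (xc, yc, s, s)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        # Resize each source image to fill its quadrant of the canvas.
        canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
    return canvas
```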
3.2.3 BC-C3TR
The BC-C3TR module, which combines BottleneckCSP and C3TR, contributes to enhancing the model’s feature representation capability, thereby achieving improved generalizability.
(1) C3TR Module
The C3TR module is an improved transformer encoder. The transformer is an encoder-decoder model that uses self-attention to accelerate model training. The transformer model extends the focus of the model to multiple regions of interest and fuses the feature information of multiple representation subspaces, benefiting the accurate implementation of target localization and feature extraction for the regions of interest. The encoder module consists of several identical sublayers, each incorporating multi-head attention and a feedforward network, as shown in Fig. 6.
1) Multi-Head Attention
Multi-head attention applies self-attention to the query, key, and value matrices of multiple subspaces obtained through linear transformations, and concatenates the feature information from these subspaces to obtain the final attention features. The implementation of multi-head attention is shown in Eq. (7).
\[\begin{align} & \mathit{MultiHead} (Q,K,V)=\mathit{Concat} (head_{1}, \cdots, head_{8} )W^{o} \tag{7} \\ & head_{i} = \mathit{Attention} (Q_{i}, K_{i}, V_{i} ),\quad i=1,\cdots, 8 \tag{8} \\ & Q_{i} =QW_{i}^{Q}, \ K_{i} =KW_{i}^{K}, \ V_{i} =VW_{i}^{V} \tag{9} \end{align}\]
where \(Concat\) represents the concatenation of multiple heads, \(Attention\) represents the scaled dot-product attention, \(head_{i}\) represents the \(i\)-th self-attention head, and \(Q\), \(K\), and \(V\) represent the query, key, and value matrices, respectively. \(W^{o}\) represents the output linear transformation weights; \(W_{i}^{Q}\), \(W_{i}^{K}\), and \(W_{i}^{V}\) represent the linear transformation weights of the query, key, and value matrices, respectively.
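A compact PyTorch sketch of Eqs. (7)-(9) with eight heads is given below; the embedding dimension is an illustrative assumption, and PyTorch's built-in nn.MultiheadAttention performs the same computation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention in the spirit of Eqs. (7)-(9)."""
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^o in Eq. (7)

    def forward(self, x):                        # x: (batch, tokens, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        # Scaled dot-product attention computed independently for each head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)   # concat the heads
        return self.w_o(out)
```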
2) Feed-Forward Network
In addition to the multi-head attention sublayer, each encoder layer has a feedforward network that consists of two linear transformation layers and an activation function, i.e., ReLU; a detailed implementation is shown in Eq. (10).
\[\begin{align} \mathit{FFN}(x,W_{1}, W_{2}, b_{1}, b_{2} ) &= W_{2} (\mathit{ReLU} (x \times W_{1} + b_{1})) + b_{2} \notag\\ &= max (0,x \times W_{1} + b_{1}) \times W_{2} + b_{2} \tag{10} \end{align}\]
where \(W_{1}\) represents the weight of the first linear transformation layer, \(b_{1}\) is the bias of the first linear transformation layer, \(W_{2}\) indicates the weight of the second linear transformation layer, \(b_{2}\) is the bias of the second linear transformation layer, and \(x\) represents the input.
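The feed-forward sublayer of Eq. (10) can be sketched as follows; the hidden width is an assumption, as it is not reported above.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network of Eq. (10): two linear layers
    with a ReLU in between."""
    def __init__(self, d_model=256, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),   # x * W1 + b1
            nn.ReLU(),                      # max(0, .)
            nn.Linear(d_hidden, d_model),   # (.) * W2 + b2
        )

    def forward(self, x):
        return self.net(x)
```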
(2) BottleneckCSP Module
The BottleneckCSP module first applies the residual block used in deep networks, namely the Bottleneck, which reduces the possibility of gradient dispersion and retains more of the original information to a certain extent. Subsequently, a convolution layer, batch normalization, and an activation function are added after the concat layer to increase the depth of the network so that it can learn the fused feature information. The structure of the BottleneckCSP module is illustrated in Fig. 7.
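A simplified sketch of this structure is given below; the channel bookkeeping may differ slightly from Fig. 7 and from the official YOLOv5 implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: two convolutions plus a shortcut connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class BottleneckCSP(nn.Module):
    """CSP split: one branch stacks n bottlenecks, the other is a plain 1x1
    convolution; the concatenation is fused by BN + activation + convolution."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_ = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_, 1, bias=False)
        self.cv2 = nn.Conv2d(c_in, c_, 1, bias=False)
        self.m = nn.Sequential(*[Bottleneck(c_) for _ in range(n)])
        self.fuse = nn.Sequential(nn.BatchNorm2d(2 * c_), nn.SiLU(),
                                  nn.Conv2d(2 * c_, c_out, 1, bias=False))

    def forward(self, x):
        return self.fuse(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```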
3.2.4 Hybrid Spatial Pyramid Pooling
The HSPP module is designed to efficiently extract both global and local features by combining multi-scale information, thereby achieving comprehensive extraction of fish skin color features. It consists of three main components: CBMe, multi-scale maxpooling layers, and Concat. This module is a hybrid spatial pyramid pooling network that combines cascade connection and parallel connection, enabling multi-feature fusion after dimension reduction. Moreover, the meta-acon activation function is utilized in place of the SiLU activation function to enhance the applicability of this module. Figure 8 shows the structure of the HSPP module.
In contrast to the acon series, which switches the linearity or nonlinearity of the activation function by changing the value of \(\beta\), meta-acon dynamically learns the value of \(\beta_{c}\) from the input feature map \(x\) (as shown in Eq. (11)) to adaptively control the linearity or nonlinearity of the function, which facilitates generalization and improves transfer performance.
\[\begin{align} \mathit{MetaAcon}(x)=(p_{1} -p_{2} )x\cdot \sigma (\beta_{c} (p_{1} -p_{2} )x)+p_{2} x \tag{11} \end{align}\]
where \(x\) represents the input feature map, \(p_{1}\) is the first-order derivative as \(x\) tends toward positive infinity, \(p_{2}\) is the first-order derivative as \(x\) tends toward negative infinity, and \(\beta_{c}\) is the activation function performance factor. In this study, we learn \(\beta_{c}\) using the channel space, as indicated in Eq. (12).
\[\begin{align} \beta_{c} =\sigma \left( W_{1} W_{2} \sum\nolimits_{h=1}^H {\sum\nolimits_{w=1}^W {x_{c,h,w}}} \right) \tag{12} \end{align}\]
where \(W_{1}\) and \(W_{2}\) represent the input and output channel weights, \(W\) represents the width of the feature map, \(H\) represents the height of the feature map, \(c\) represents the channel index, and \(\sigma\) represents the sigmoid activation function.
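The following PyTorch sketch illustrates the meta-acon activation of Eqs. (11)-(12), a CBMe block, and one possible wiring of the hybrid pooling. The cascaded 5×5 max-poolings (equivalent to parallel 5/9/13 kernels), the channel reduction ratio, and the sigmoid switch (as in the original ACON formulation) are assumptions, since the exact layout is shown only in Fig. 8.

```python
import torch
import torch.nn as nn

class MetaAcon(nn.Module):
    """Meta-ACON in the spirit of Eqs. (11)-(12): beta_c is generated from the
    channel-wise spatial mean of x by two 1x1 convolutions and a sigmoid."""
    def __init__(self, c, r=16):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, c, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, c, 1, 1))
        self.fc1 = nn.Conv2d(c, max(c // r, 1), 1)
        self.fc2 = nn.Conv2d(max(c // r, 1), c, 1)

    def forward(self, x):
        beta = torch.sigmoid(self.fc2(self.fc1(x.mean(dim=(2, 3), keepdim=True))))
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(beta * dpx) + self.p2 * x

class CBMe(nn.Module):
    """Conv + BatchNorm + Meta-ACON block (name taken from Fig. 8)."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = MetaAcon(c_out)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class HSPP(nn.Module):
    """Hybrid pooling sketch: three cascaded 5x5 max-poolings (equivalent to
    parallel 5/9/13 kernels) concatenated with the reduced input and fused."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_ = c_in // 2
        self.reduce = CBMe(c_in, c_)                  # dimension reduction
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)
        self.fuse = CBMe(4 * c_, c_out)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)          # receptive field 5x5
        p2 = self.pool(p1)         # cascaded, equivalent to 9x9
        p3 = self.pool(p2)         # cascaded, equivalent to 13x13
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```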
4. Experiment
4.1 Implementation Details
To validate the performance of the VH-YOLOv5s method, this study employed precision, recall, mAP@0.5, and mAP@0.5:0.95 as metrics. The specific equations used are as follows:
\[\begin{align} & Precision=\frac{T_{P} }{T_{P} +F_{P} } \tag{13} \\ & Recall=\frac{T_{P} }{T_{P} +F_{N} } \tag{14} \end{align}\]
where \(T_{P}\) represents the number of samples that are actually positive and predicted to be positive, \(F_{P}\) represents the number of samples that are actually negative but predicted to be positive, and \(F_{N}\) represents the number of samples that are actually positive but predicted to be negative. All of the above indicators were calculated at an IoU threshold of 0.5.
The equations for mAP@0.5 and mAP@0.5:0.95 are as follows:
\[\begin{align} & AP=\int_0^1 P(R)dR \tag{15} \\ & mAP=\frac{\displaystyle\sum_{i=1}^N {AP_{i} } }{N} \tag{16} \end{align}\]
where \(P(R)\) represents the PR (Precision-Recall) curve that varies with the threshold, \(AP_{i}\) represents the average precision for the i-th class, and \(N\) represents the number of fish skin color categories. mAP@0.5 refers to the average AP of all categories when the IoU is set to 0.5, and mAP@0.5:0.95 refers to the average mAP under different IoU thresholds. The IoU ranges from 0.5 to 0.95 in increments of 0.05.
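For clarity, Eqs. (15)-(16) can be computed numerically as follows, using the standard all-point interpolation of the precision-recall curve; this is a generic sketch, not the exact evaluation script used in the experiments.

```python
import numpy as np

def average_precision(recall, precision):
    """Numerical form of Eq. (15): area under the precision-recall curve.
    `recall` is assumed sorted in ascending order."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precision), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]          # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]                # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_ap(ap_per_class):
    """Eq. (16): mean of the per-class average precisions."""
    return float(np.mean(ap_per_class))
```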
The experiments were conducted in Python, with images resized to \(640 \times 640\), momentum set to 0.937, initial learning rate set to 0.01, weight decay set to 0.0005, batch size set to 32, and a total of 200 training iterations. The specific configurations of the experimental hardware are shown in Table 2.
4.2 Comparative Experiments
The YOLOv5 network adopts an end-to-end structure, consisting primarily of five basic network configurations with increasing depth and width, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Meanwhile, it is essential to consider both the detection accuracy and the model size to ensure the high accuracy of the detection model on the mobile phone. The specific results are shown in Table 3.
While the size of the five models gradually increases with depth and width, as shown in Table 3, the test results for mAP indicate that network structures with lower depth and width exhibit better performance. Among them, the YOLOv5s model achieves the highest recall, 0.7% higher than the YOLOv5x model, while the YOLOv5n model achieves the highest mAP [0.5:0.95], 0.5% higher than the YOLOv5x model. Additionally, the YOLOv5s model exhibits superior performance in terms of the other parameters compared to the other network models, with only a modest increase in size of 10.5 M over the YOLOv5n model. Hence, taking into account the superior detection performance and the lightweight network requirements for deployment on mobile phones, the YOLOv5s network model was selected for the experiments conducted in this study.
The Gray World Algorithm (GWA) is an approach used to correct image colors, effectively eliminating color casts. Essentially, its purpose is to correct color imbalances by adjusting pixel values based on the computed channel means and scaling factors [21].
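For comparison with the VGWA sketch in Sect. 3.2.1, a plain GWA correction can be written as follows; the function name and the clipping step are our own additions.

```python
import numpy as np

def gwa_correct(img):
    """Plain Gray World correction: each channel is scaled so its mean matches
    the overall gray level, i.e. the simple average of the three channel means."""
    rgb = img.astype(np.float64)
    means = rgb.reshape(-1, 3).mean(axis=0)          # (R_avg, G_avg, B_avg)
    k = means.mean()                                 # gray value
    corrected = rgb * (k / means)                    # per-channel gains K / C_avg
    return np.clip(corrected, 0, 255).astype(np.uint8)
```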
Differing from the GWA approach, which relies solely on channel means, the VGWA approach integrates both variance and mean values to achieve more effective image color correction. As shown in Table 4, the VGWA approach has notably improved Precision, mAP@0.5, and mAP@0.5:0.95, with increases of 0.6%, 0.9%, and 1.1%, respectively.
Compared with recent models such as CenterNet [22], AutoAssign [23], and YOLOX [24], the VH-YOLOv5s model achieves better detection performance. The CenterNet network is equipped with the ResNet18 backbone network, while the AutoAssign network uses the ResNet50 backbone network. The experimental results are listed in Table 5.
Based on the experimental results presented in Table 5, the VH-YOLOv5s model demonstrates superior performance compared to the CenterNet, AutoAssign, and YOLOX models. Specifically, when compared to the YOLOX model, the VH-YOLOv5s model achieved a 1.7% improvement in mAP@0.5 and a 1.3% improvement in mAP@0.5:0.95. In comparison to the relatively lower performance of CenterNet, the VH-YOLOv5s model showed substantial improvement, with mAP@0.5 increased by 2.5% and mAP@0.5:0.95 improved by 2.7%. The skin color detection results of various methods on the same dataset are depicted in Fig. 9, where only prediction results exceeding a threshold of 0.5 are marked.
In Fig. 9 (a), 13 fish skin colors are marked, including 5 red Plectropomus leopardus, 1 black Plectropomus leopardus, and 7 pink Plectropomus leopardus. Figure 9 (b) displays the prediction results of the YOLOv5s model, with 12 accurate detections and 1 erroneous detection. Figure 9 (c) shows the prediction results of the CenterNet model, with 4 accurate detections and 9 missed detections. Figure 9 (d) presents the prediction results of the AutoAssign model, with 12 accurate detections and 1 duplicate detection. Figure 9 (e) demonstrates the prediction results of the YOLOX model, with 12 accurate detections and 1 missed detection. Figure 9 (f) illustrates the prediction results of the VH-YOLOv5s model, with 13 accurate detections. Notably, accurate predictions were achieved at a threshold of 0.5.
The CenterNet model’s use of a fixed anchor box size and dimensions results in a higher miss rate in the target detection model. The AutoAssign model employs a fully dynamic positive and negative sample label assignment method for training the skin color detection model, which leads to improved detection performance; however, it has a larger number of parameters than the VH-YOLOv5s model. The YOLOX model, despite having no anchors and fewer parameters, exhibits poorer skin color detection performance compared to the VH-YOLOv5s model. The VH-YOLOv5s model efficiently improves skin color detection performance by preprocessing the dataset and by extracting and fusing multi-scale feature information.
4.3 Ablation Study
In this paper, an ablation study is performed to showcase the enhancement in Plectropomus leopardus skin color detection performance contributed by each component of the VH-YOLOv5s network structure. The YOLOv5s model, the YOLOv5s\(+\)VGWA model, the YOLOv5s\(+\)VGWA\(+\)Mosaic model, the YOLOv5s\(+\)VGWA\(+\)Mosaic\(+\)BC-C3TR model, and the VH-YOLOv5s model are trained on the fish skin color detection dataset, and the experimental results and specific data are depicted in Fig. 10 and Table 6, respectively.
As depicted in Fig. 10, the precision, recall, mAP@0.5, and mAP@0.5:0.95 metrics show a general upward trend over the epochs, albeit with some fluctuations. However, the YOLOv5s model exhibits a less pronounced upward trend than the VH-YOLOv5s model. Additionally, with the integration of the improvements, the VH-YOLOv5s model shows evident advantages, achieving the best performance on the mAP@0.5 and mAP@0.5:0.95 indicators.
After integrating the proposed VGWA approach into the YOLOv5s model, significant improvements were observed, with a 1.6% increase in precision, a 1.3% improvement in mAP@0.5, and a 0.9% boost in mAP@0.5:0.95, confirming the effectiveness of the VGWA method in enhancing the network model’s performance. With the addition of the Mosaic method, there is only a slight decrease in Precision, but the overall performance improves significantly. Following the integration of the BC-C3TR module, the model demonstrated consistent improvement, particularly in terms of precision, recall, mAP@0.5, and mAP@0.5:0.95. Moreover, the proposed HSPP module, serving as the final layer of the backbone network, enhances mAP@0.5, mAP@0.5:0.95, precision, and recall by 0.3%, 0.4%, 0.1%, and 0.5%, respectively. Overall, the VH-YOLOv5s network model improves the precision, recall, mAP@0.5, and mAP@0.5:0.95, by 2.3%, 1.5%, 3.3%, and 3.4%, respectively, compared to the YOLOv5s network model.
The detection performance of the three skin color categories (red, pink, and black) for Plectropomus leopardus is shown in Table 7.
As shown in Table 7, the VH-YOLOv5s model exhibited considerable improvement in the detection performance of all three skin color categories of Plectropomus leopardus when compared with the YOLOv5s model. In particular, the average detection performance of Plectropomus leopardus increased by 3.3%, with red, pink, and black Plectropomus leopardus showing improvements of 3.6%, 3.4%, and 2.8%, respectively.
The loss function utilized in the VH-YOLOv5s model remains consistent with that of YOLOv5s. Hence, this paper employed loss trends visualization to demonstrate a multidimensional comparison between the proposed VH-YOLOv5s method and the initial benchmark model, as illustrated in Fig. 11.
The total loss function of YOLOv5s is presented in Fig. 11 (a) and (b). It encompasses three types of loss functions: the classification loss (cls_loss), the localization loss (box_loss), and the confidence loss (obj_loss), as shown in (c) and (d) of Fig. 11. From Fig. 11, it can be observed that the convergence of VH-YOLOv5s is similar to that of the YOLOv5s model, and both models converge quickly. Additionally, the VH-YOLOv5s model exhibits better stability during both training and validation.
4.4 Deployment of VH-YOLOv5s Model
To effectively promote and apply the skin color detection model in aquaculture and assist farmers in detecting the skin color of fish from transmitted or real-time captured images, the VH-YOLOv5s model is deployed on a mobile phone, ensuring ease of operation and fast detection speed. The specific implementation process is shown in Fig. 12.
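As a rough guide, one possible export path is sketched below; the paper does not state which mobile inference framework was used, so the TorchScript / PyTorch Mobile workflow and the checkpoint file name are assumptions.

```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

# Hypothetical export step: assumes a checkpoint named "vh_yolov5s.pt" saved in
# the YOLOv5 format (unpickling it requires the YOLOv5 source on the Python path).
ckpt = torch.load("vh_yolov5s.pt", map_location="cpu")
model = ckpt["model"].float().eval()

example = torch.zeros(1, 3, 640, 640)          # training input resolution
traced = torch.jit.trace(model, example)       # TorchScript via tracing
mobile_model = optimize_for_mobile(traced)     # mobile-oriented graph optimizations
mobile_model._save_for_lite_interpreter("vh_yolov5s.ptl")
```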
Figure 12 presents the mobile phone’s model detection interface, where “IMAGE” serves as the button for selecting the image to be detected on the mobile phone, “SKIN_DETECT-CPU” performs fish skin color detection with the CPU, and “SKIN_DETECT-GPU” employs the GPU for detection. The image of Plectropomus leopardus to be detected is displayed after pressing “IMAGE”. The data details are further depicted in Fig. 13.
In Fig. 13, the markings represent detections with Intersection over Union (IoU) values greater than or equal to 0.5. It is evident that each fish was accurately identified, and the skin color and detection probability were effectively detected. Furthermore, the VH-YOLOv5s model is relatively lightweight and has reduced memory usage on mobile phones.
5. Conclusions
In this paper, we proposed the VH-YOLOv5s method for detecting the skin color of Plectropomus leopardus and successfully deployed it on a mobile phone, which is highly significant for skin color monitoring in aquaculture. The VGWA approach effectively corrected image color cast and outperformed the baseline model by 1.3% on mAP@0.5. Additionally, the HSPP module, combining cascade connection and parallel connection structures, achieved efficient multi-scale feature fusion, resulting in a 0.3% improvement in mAP@0.5. The VH-YOLOv5s model was successfully integrated into mobile phones for real-time skin color detection from transmitted or captured images. Moreover, the successful implementation of skin color detection for Plectropomus leopardus serves as a valuable reference for future skin color detection requirements for other fish species.
Acknowledgments
This study was supported by the national key research and development plan project under Grant 2022YFD2001701 and the Shandong Province Major Scientific and Technological Innovation Project-Key technology research and creation of digital fishery intelligent equipment under Grant 2021TZXD006.
References
[1] A.D. Micah, B. Wen, Q. Wang, Y. Zhang, A. Yusuf, N.N.B. Thierry, O.S. Tokpanou, M.M. Onimisi, S.O. Adeyemi, J.-Z. Gao, and Z.-Z. Chen, “Effect of dietary astaxanthin on growth, body color, biochemical parameters and transcriptome profiling of juvenile blood parrotfish (Vieja melanurus ♀× Amphilophus citrinellus ♂),” Aquaculture Reports, vol.24, p.101142, 2022.
[2] P. Ninwichian, N. Phuwan, and P. Limlek, “Effects of tank color on the growth, survival rate, stress response, and skin color of juvenile hybrid catfish (Clarias macrocephalus × Clarias gariepinus),” Aquaculture, vol.554, p.738129, 2022.
[3] T. Maoka, W. Sato, H. Nagai, and T. Takahashi, “Carotenoids of Red, Brown, and Black Specimens of Plectropomus leopardus, the Coral Trout (Suziara in Japanese),” Journal of Oleo Science, vol.66, no.6, pp.579-584, 2017.
[4] X. Zhu, R. Hao, J. Zhang, C. Tian, Y. Hong, C. Zhu, and G. Li, “Dietary astaxanthin improves the antioxidant capacity, immunity and disease resistance of coral trout (Plectropomus leopardus),” Fish & Shellfish Immunology, vol.122, pp.38-47, 2022.
[5] K. Anantharajah, Z. Ge, C. McCool, S. Denman, C. Fookes, P. Corke, D. Tjondronegoro, and S. Sridharan, “Local inter-session variability modelling for object classification,” IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, pp.309-316, IEEE, 2014.
[6] S.S. Pattanaik, P.B. Sawant, M. Xavier K.A., P.P. Srivastava, K. Dube, B.T. Sawant, and N.K. Chadha, “Dietary carotenoprotien extracted from shrimp shell waste augments growth, feed utilization, physio-metabolic responses and colouration in Oscar, Astronotus ocellatus (Agassiz, 1831),” Aquaculture, vol.534, p.736303, 2021.
[7] T. Yang, S. Kasagi, A. Takahashi, and K. Mizusawa, “Effects of background color and feeding status on the expression of genes associated with body color regulation in the goldfish Carassius auratus,” General and Comparative Endocrinology, vol.312, p.113860, 2021.
[8] C.-N. Nguyen, V.-T. Vo, L.-H.-N. Nguyen, H. Thai Nhan, and C.-N. Nguyen, “In situ measurement of fish color based on machine vision: A case study of measuring a clownfish’s color,” Measurement, vol.197, p.111299, 2022.
[9] T.T.T. Hien, T.V. Loc, T.L.C. Tu, T.M. Phu, P.M. Duc, H.T. Nhan, and P.T. Liem, “Dietary Effects of Carotenoid on Growth Performance and Pigmentation in Bighead Catfish (Clarias macrocephalus Günther, 1864),” Fishes, vol.7, no.1, p.37, 2022.
[10] N.P.C. Tu, N.N. Ha, N.T.T. Linh, and N.N. Tri, “Effect of astaxanthin and spirulina levels in black soldier fly larvae meal-based diets on growth performance and skin pigmentation in discus fish, Symphysodon sp.,” Aquaculture, vol.553, p.738048, 2022.
[11] E. Gümüş, A. Yılayaz, M. Kanyılmaz, B. Gümüş, and M. Balaban, “Evaluation of body weight and color of cultured European catfish (Silurus glanis) and African catfish (Clarias gariepinus) using image analysis,” Aquacultural Engineering, vol.93, p.102147, 2021.
[12] K. Zhou, K. Zhang, X. Fan, W. Zhang, Y. Liang, X. Wen, and J. Luo, “The skin-color is associated with its physiological state: A case study on a colorful variety, hybrid grouper (Epinephelus fuscoguttatus × Epinephelus lanceolatus),” Aquaculture, vol.549, p.737719, 2022.
[13] M. Yi, H. Lu, Y. Du, G. Sun, C. Shi, X. Li, H. Tian, and Y. Liu, “The color change and stress response of Atlantic salmon (Salmo salar L.) infected with Aeromonas salmonicida,” Aquaculture Reports, vol.20, p.100664, 2021.
[14] X. Lei, H. Wang, J. Shen, Z. Chen, and W. Zhang, “A novel intelligent underwater image enhancement method via color correction and contrast stretching,” Microprocessors and Microsystems, p.104040, 2021.
[15] J. Li, K.A. Skinner, R.M. Eustice, and M. Johnson-Roberson, “WaterGAN: Unsupervised Generative Network to Enable Real-time Color Correction of Monocular Underwater Images,” IEEE Robotics and Automation Letters, vol.3, no.1, pp.387-394, Jan. 2018.
[16] X. Liu, Z. Gao, and B.M. Chen, “MLFcGAN: Multilevel Feature Fusion-Based Conditional GAN for Underwater Image Color Correction,” IEEE Geosci. Remote Sens. Lett., vol.17, no.9, pp.1488-1492, 2020.
[17] W. Zhang, L. Dong, and W. Xu, “Retinex-inspired color correction and detail preserved fusion for underwater image enhancement,” Computers and Electronics in Agriculture, vol.192, p.106585, 2022.
[18] C. Li and X. Zhang, “Underwater Image Restoration Based on Improved Background Light Estimation and Automatic White Balance,” 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, IEEE, 2018.
[19] T. Shimose and M. Kanaiwa, “Influence of the body color and size on the market value of wild captured coralgroupers (Serranidae, Plectropomus): Implications for fisheries management,” Fisheries Research, vol.248, p.106223, 2022.
[20] A. Bochkovskiy, C.-Y. Wang, and H.-Y.M. Liao, “YOLOv4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv: 2004.10934, 2020.
[21] J. van de Weijer, T. Gevers, and A. Gijsenij, “Edge-Based Color Constancy,” IEEE Trans on Image Process, vol.16, no.9, pp.2207-2214, 2007.
[22] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv: 1904.07850, 2019.
[23] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun, “AutoAssign: Differentiable label assignment for dense object detection,” arXiv preprint arXiv:2007.03496, 2020.
[24] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “YOLOX: Exceeding YOLO series in 2021,” arXiv preprint arXiv:2107.08430, 2021.