Underwater Waste Recognition and Localization Based on Improved YOLOv5



Introduction
Plastic waste in water bodies has gained widespread attention due to its persistence and negative impact on aquatic ecosystems and human health [1]. Plastic accounts for over 80% of the artificial debris observed in rivers. Research has demonstrated that removing plastic waste from underwater environments can significantly benefit the underwater ecosystem [2]. However, current methods for removing plastic waste from inland water surfaces rely mainly on mechanical equipment such as manual salvage ships, while underwater plastic waste is removed through manual manipulation of robotic arms, which is both time-consuming and inefficient. Fortunately, autonomous underwater vehicles (AUVs) have introduced a new approach to underwater plastic waste removal. When an AUV cleans up underwater waste, an accurate target detection and localization algorithm is crucial.
However, even minor changes in the environment can significantly alter the appearance of underwater objects. For example, changes in lighting affect shallow waters, and turbid water makes objects difficult to observe [3,4]. Traditional identification methods typically involve human visual counting [5,6], sampling with nets [7], or counting plastic samples within a fence in a specific area [8]. Ge et al. [9] used laser radar to identify waste on the shore, and Lorenzo-Navarro et al. [10] proposed a method using the Sauvola threshold algorithm for plastic classification and counting. However, traditional methods require a significant amount of human labor or additional equipment, and their efficiency is low [11].
Target detection is one of the core problems in computer vision, which involves identifying and locating objects of interest within images or videos and determining their size and position in the scene. With the development of deep learning, significant progress has been made in this area, bringing new advances to the recognition and detection of underwater plastic waste. By extensively training models, deep learning can extract target features from images and use them to complete object classification and recognition tasks. This approach has been widely employed in underwater target detection. In order to solve the problems of the complex underwater environment and insufficient underwater light in the detection of underwater plastic waste, Hu et al. [12] proposed an improved underwater plastic waste detection algorithm based on YOLOv5n. Kylili et al. [13] proposed a CNN algorithm for classifying floating plastic fragments in water bodies, but the algorithm requires object-centered cropped images. Liu et al. [14] improved the detection ability of underwater waste based on the YOLOv3 model using adversarial learning to enable the model to learn the features of the same target in different underwater environments.
In terms of underwater target localization, Xing et al. [15] proposed a novel cooperative, relative close-range localization approach for special environments based on the fusion of an RGB-D camera and an inertial measurement unit (IMU), and validated the efficiency of RGB-D cameras for underwater applications. Yang et al. [16] proposed a novel vision-based underwater positioning system using a light detection and ranging (LiDAR) camera and an inertial measurement unit. This previous work inspired us to employ RGB-D cameras for close-range underwater object localization.
Previous studies have mainly focused on either the recognition or the localization of underwater waste without integrating the two tasks, which limits their usefulness for AUVs that collect underwater waste. Furthermore, due to the complexity of the underwater environment and the substantial attenuation of light in water, the images captured by the camera often exhibit blurriness, low contrast, and color inconsistencies. Plastic waste in the water is often small, making it challenging to discern. We aim to address these challenges in order to accurately provide AUVs with the types and three-dimensional spatial coordinates of underwater plastic waste. Therefore, this article proposes an improved YOLOv5-based method for recognizing and localizing underwater waste. The main contributions of this article are as follows:
1. Introducing a CLAHE and Retinex-based weighted fusion algorithm to improve the quality of underwater images.
2. Designing a network model that combines CNN and Transformer based on the MobileViT backbone, introducing an attention mechanism and a CSP structure in the neck of the model, using Focal EIOU as the model's loss function, and adding a small object detection layer. The new network has higher recognition accuracy and detection speed.
3. Combining the MobileViT-YOLOv5 algorithm with the RealSense depth camera to achieve recognition and localization of underwater waste.
4. Adding underwater plastic waste images collected in real-world scenarios to the open-source DeepTrash underwater waste dataset and randomly adjusting the contrast and brightness of the images to expand the dataset.

Underwater Image Enhancement Algorithm Based on Weighted Fusion
This section mainly introduces several components of the weighted fusion algorithm for enhancing underwater images, including weighted logarithmic transformation, adaptive gamma correction, improved MSR algorithm, CLAHE algorithm, and fusion rules. The image enhancement process proposed in this article is then explained in detail.

Weighted Logarithmic Transformation
The logarithmic curve exhibits a steeper slope for lower value ranges and a flatter slope for higher value ranges. As a result, logarithmic transformation can expand the low grayscale values and compress the high grayscale values of an image [17]. However, the changes in brightness in the dark areas of the image after logarithmic transformation are not significant. Hence, this article proposes the use of a weighted logarithmic transformation [18] for brightness enhancement. A coefficient is added to the logarithmic transformation formula, which is equal to 1 when x = y or 0 otherwise, to enhance local brightness. The transformation formula can be expressed as: Here, s represents the output of the corresponding pixel (x, y) by weighted logarithmic transformation; m represents the number of rows of the image; n represents the number of columns of the image; e is the weighted logarithmic transformation coefficient; is the correction coefficient, usually set to 1; ∇ is the third-order Laplacian operator; τ represents the brightness level.
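As an illustration of why a logarithmic curve expands low gray levels and compresses high ones, the following minimal sketch applies the ordinary (unweighted) logarithmic transformation s = c · log(1 + r). It is not the weighted variant proposed in this article; the function name `log_transform` and the scaling constant `c`, chosen so that the maximum gray level maps to itself, are assumptions made for the example.

```python
import math


def log_transform(value, max_value=255.0):
    """Map a gray level in [0, max_value] through s = c * log(1 + r).

    Assumption: c is chosen so that max_value maps onto max_value.
    This is the plain logarithmic transformation, not the weighted one.
    """
    c = max_value / math.log(1.0 + max_value)
    return c * math.log(1.0 + value)


if __name__ == "__main__":
    # Equal input steps produce large output steps in the dark range
    # and small output steps in the bright range.
    dark_step = log_transform(20) - log_transform(10)
    bright_step = log_transform(250) - log_transform(240)
    print(f"dark step: {dark_step:.2f}, bright step: {bright_step:.2f}")
```

The same input step of 10 gray levels is stretched in the dark range and compressed in the bright range, which is the behavior the weighted transformation builds on.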

Adaptive Gamma Correction
The traditional Gamma correction algorithm is a global enhancement technique that modifies the distribution of pixel gray levels in an image using a nonlinear transformation of the gray values, resulting in nonlinear tone changes. This method can adjust images with excessively high or low gray levels, improving the overall brightness and contrast. The formula for the Gamma transformation is:

$$T(l) = l_{\max}\left(\frac{l}{l_{\max}}\right)^{\gamma}$$

Here, $T(l)$ represents the output of the Gamma transformation applied to a pixel with grayscale value $l$, $l_{\max}$ represents the maximum grayscale value in the image, and $\gamma$ is an adjustment coefficient. If $\gamma > 1$, the transformation compresses the grayscale levels of the brighter parts of the image, resulting in an overall darker enhancement. If $\gamma < 1$, it enhances the contrast of the darker parts of the image, emphasizing details and resulting in an overall brighter enhancement.
Traditional correction algorithms use the same enhancement function for pixels with different gray levels, which can lead to contrast distortion. Moreover, the adjustment coefficient γ needs to be selected by the user based on the image situation, and cannot be adaptively changed according to the image. The adaptive gamma transformation can selectively adjust local correction coefficients for pixel neighborhoods, resulting in superior results compared to traditional gamma transformations. Therefore, this article employs an adaptive gamma transformation [19,20] to enhance the details of the image.
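The effect of the adjustment coefficient can be checked with a small sketch of the traditional (global) Gamma transformation, assuming the usual form T(l) = l_max · (l / l_max)^γ. The function name `gamma_correct` is hypothetical, and the adaptive, neighborhood-dependent choice of γ used in this article is not reproduced here because its details are not given in the text.

```python
def gamma_correct(level, gamma, l_max=255.0):
    """Global Gamma transformation of one gray level.

    Assumed form: T(l) = l_max * (l / l_max) ** gamma.
    """
    if not 0 <= level <= l_max:
        raise ValueError("gray level must lie in [0, l_max]")
    return l_max * (level / l_max) ** gamma


if __name__ == "__main__":
    mid_gray = 128
    # gamma > 1 darkens the mid-tones, gamma < 1 brightens them.
    print(gamma_correct(mid_gray, 2.0), gamma_correct(mid_gray, 0.5))
```

The endpoints 0 and l_max are fixed points of the mapping for any γ > 0, so only the tones in between are redistributed.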

The Improved Multi-Scale Retinex Algorithm
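The single-scale Retinex (SSR) algorithm [21] utilizes a Gaussian filter to estimate the illumination component, expressed as follows:

$$R^{\mathrm{SSR}}_i(x, y) = \log I_i(x, y) - \log\left[F(x, y) * I_i(x, y)\right]$$

Here, $I_i$ is the $i$-th color channel of the input image, $F$ is the Gaussian surround function with scale parameter $\sigma$, and $*$ denotes convolution. The SSR algorithm has only one adjustable parameter $\sigma$, which controls the enhancement effect. In contrast, the MSR algorithm [22] processes the image using several values of $\sigma$ and weights each result to obtain the enhanced image. The formula for the MSR algorithm is as follows:

$$R^{\mathrm{MSR}}_i(x, y) = \sum_{k=1}^{N} \omega_k \left\{ \log I_i(x, y) - \log\left[F_k(x, y) * I_i(x, y)\right] \right\}$$

Here, $i \in \{R, G, B\}$, $N$ represents the number of scales, and $F_k$ is the Gaussian surround function at the $k$-th scale. When $N = 1$, the MSR algorithm reduces to the SSR algorithm. To ensure that the MSR algorithm can take advantage of multiple scales, $N$ is usually set to 3. Additionally, $\omega_k$ represents the weighting coefficient for the $k$-th scale in the weighted summation. Empirically, when $\omega_1 = \omega_2 = \omega_3 = 1/3$, the overall enhancement effect is better and the computation is simpler.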
Bilateral filtering is a non-linear filtering algorithm derived from Gaussian filtering. Its weight combines two factors, a geometric spatial distance factor and a pixel difference factor, so it takes both the spatial domain and the pixel range domain into account. It has strong edge-preserving, denoising, and smoothing capabilities and, compared with the Gaussian filter, retains more edge and detail information. The mathematical expression of bilateral filtering is:

$$g(k, l) = \frac{\sum_{(i, j)} f(i, j)\, w(i, j, k, l)}{\sum_{(i, j)} w(i, j, k, l)}$$

$$w(i, j, k, l) = \exp\left(-\frac{(i - k)^2 + (j - l)^2}{2\sigma_d^2} - \frac{\left(f(i, j) - f(k, l)\right)^2}{2\sigma_r^2}\right)$$

Here, $g(k, l)$ is the filtered output, $(k, l)$ represents the central coordinates of the current convolved region, $(i, j)$ represents the coordinates of neighboring pixels in the convolved region, $\sigma_d$ and $\sigma_r$ represent the standard deviations of the spatial and range Gaussian functions, respectively, and the function $f(x, y)$ represents the pixel value of the image at point $(x, y)$. In this article, bilateral filtering is used instead of Gaussian filtering to estimate the illumination component in the MSR algorithm.

Contrast Limited Adaptive Histogram Equalization
Histogram equalization (HE) is a non-linear transformation technique that improves the contrast and clarity of an entire image by transforming its grayscale histogram into a uniform distribution. However, HE performs poorly when some areas of the image are significantly brighter or darker than others. CLAHE [23] is an algorithm that enhances the local contrast of an image. It first divides the image into several blocks, applies HE to each block, and sets a threshold. When the count of a grayscale value in a block's histogram exceeds the threshold, the count is clipped and the excess is evenly distributed among the other grayscale levels. In this article, the CLAHE algorithm is used to process images because it limits the excessive enhancement of contrast, avoids introducing unnecessary noise, and effectively enhances image details.
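The clipping step can be sketched for a single block histogram as follows. This is a simplified illustration, not a full CLAHE implementation: the function name `clip_histogram` is hypothetical, the excess is spread over all bins in a single pass, and the tiling and interpolation between blocks are omitted.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    hist is a list of integer pixel counts, one entry per gray level.
    Simplification: one redistribution pass, so a bin may end up slightly
    above clip_limit again; practical implementations often iterate.
    """
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives the same share; leftover counts go to the first bins.
    return [
        count + share + (1 if index < remainder else 0)
        for index, count in enumerate(clipped)
    ]


if __name__ == "__main__":
    block_hist = [10, 2, 0, 4]
    print(clip_histogram(block_hist, clip_limit=4))
```

The total number of pixels is preserved, so the equalization mapping built from the clipped histogram still spans the full output range while its slope, and therefore the contrast gain, is limited.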

Fusion Rule
After preprocessing with the CLAHE and MSR algorithms, two enhanced underwater images are obtained, which need to be fused using a weighted fusion rule [24].
To start with, the R, G, and B channel values of the two enhanced images are extracted, and the weight $W_{pta}$ is calculated as shown in Eq. (8). Then, the weights $W_{ptb}$ of the images in the hue, saturation, and value (HSV) color space are calculated using Eq. (9), and the weights are normalized as shown in Eq. (10). Finally, the two preprocessed images $I_1$ and $I_2$ are weighted and fused to obtain the final enhanced image $I_{res}$, as shown in Eq. (11). Here, $R_i$, $G_i$, and $B_i$ represent the red, green, and blue channel values of the image, and $\sigma$ is the weight calculation parameter. Additionally, $H_i$, $S_i$, and $V_i$ represent the H, S, and V channel values, and $\bar{H}$, $\bar{S}$, and $\bar{V}$ represent the average values of the H, S, and V channels, respectively.
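The final normalize-and-blend step can be sketched as below. This is only an illustration under stated assumptions: the per-pixel weights are taken as given inputs (the weight definitions of Eqs. (8) and (9) are not reproduced), normalization is assumed to divide each weight by the sum of the two weights, and the function name `fuse_images` and the small constant `eps` are hypothetical.

```python
def fuse_images(image1, image2, weights1, weights2, eps=1e-12):
    """Blend two images pixel by pixel using normalized weights.

    Images and weight maps are flat lists of equal length.
    Assumption: each weight is normalized by the sum of both weights.
    """
    fused = []
    for p1, p2, w1, w2 in zip(image1, image2, weights1, weights2):
        total = w1 + w2 + eps  # eps avoids division by zero if both weights vanish
        fused.append((w1 * p1 + w2 * p2) / total)
    return fused


if __name__ == "__main__":
    i1 = [100.0, 50.0]
    i2 = [200.0, 150.0]
    print(fuse_images(i1, i2, weights1=[1.0, 3.0], weights2=[1.0, 1.0]))
```

Because the normalized weights sum to one at every pixel, the fused value always lies between the two input values.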

Algorithm Process
The flowchart of the underwater enhancement algorithm proposed in this article is shown in Fig. 1, and the algorithm steps are as follows:
1. Convert the original RGB image $I$ to the HSV color space, apply the weighted logarithmic transformation to the V component, perform the adaptive gamma transformation, and process the result with the improved MSR algorithm to obtain the first enhanced image $I_1$.
2. Apply CLAHE and median filtering to the original image $I$ to improve brightness and global contrast and to eliminate noise, yielding the second enhanced image $I_2$, and convert $I_2$ from the RGB color space to the HSV color space.
3. Perform weighted fusion on $I_1$ and $I_2$ to obtain the final enhanced result $I_{res}$.

MobileViT Model
The ViT model based on the Transformer architecture exhibits powerful performance in various computer vision tasks. However, the large number of parameters and slow inference speed of Transformer models demand high-performance hardware, which makes them ill-suited to common devices. Even when Transformer models are shrunk to match the resource constraints of mobile devices, their accuracy drops significantly and falls notably below that of lightweight CNNs. To address this issue, researchers have combined CNN and Transformer structures and proposed new network models such as Conformer [25], Mobile-Former [26], and CoTNet [27]. Building on this work, Mehta et al. [28] designed MobileViT, which uses the CNN to extract local features and the Transformer to extract global features. Compared to traditional lightweight CNNs under the same parameter constraints, MobileViT has better performance, generalization ability, and robustness. The network architecture is illustrated in Fig. 2. The MobileViT network is mainly composed of MobileNetV2 (MV2) blocks and MobileViT blocks. As the structure diagram of the MobileViT block indicates, the block captures both local and global information from its input feature map. As a result, compared to CNNs with the same number of parameters, the feature maps output by MobileViT contain richer feature information.

SPPCSPC Module
Since the introduction of the spatial pyramid pooling (SPP) [29] module in YOLOv3, both YOLOv4 and YOLOv5 have continued to use this design. YOLOv5 improved on the SPP module and proposed the spatial pyramid pooling fast (SPPF) module to enhance its efficiency. In this article, we introduce spatial pyramid pooling cross stage partial (SPPCSPC) [30] based on the SPP module, illustrated in Fig. 3. The CSP structure [31] is incorporated into the SPP module, where the input is divided into three different branches, and the feature maps from each branch are fused to enrich the feature information.

Attention Mechanism
In recent years, attention mechanisms have been widely applied in deep learning to focus on specific parts of the input information. The convolutional block attention module (CBAM) [32] is a simple and effective attention module based on spatial and channel attention that focuses on local feature information. The transformer encoder block captures global feature information, which is helpful for object detection. Inspired by this, the CBAM module and the transformer encoder block (TRE) [33] are introduced into the neck of the original YOLOv5. The new structure uses attention mechanisms to mine features and focus on the target regions of interest.
The CBAM module, illustrated in Fig. 4, is a module that combines both channel attention and spatial attention. The input feature map is processed sequentially by the channel attention submodule and the spatial attention submodule, which focus on information in the channel and spatial dimensions respectively. The corresponding weights are then fed back to the original input. This enables easy integration of the module into existing network structures for end-to-end training, and the computational cost of adding the module is negligible. Illustrated in Fig. 5, the TRE module mainly comprises two components: a multi-head attention block and a feedforward neural network structure. LayerNorm and dropout are employed to facilitate network convergence and mitigate overfitting. Multi-head attention allows the network to not only attend to the current position but also capture contextual semantic information, enabling it to extract more comprehensive and relevant features from the input.

Focal EIOU Loss Function
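YOLOv5 uses CIOU loss as the localization loss function, which reflects the relative difference in aspect ratio rather than the true differences in width and height. Building on CIOU, EIOU [34] splits the aspect-ratio term into separate width and height terms computed between the predicted and ground truth boxes. EIOU consists of three factors: overlapping area, center point distance, and width and height differences. The formula is as follows:

$$L_{EIOU} = 1 - IOU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \frac{\rho^2\left(w, w^{gt}\right)}{c_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{c_h^2}$$

Here, $b$, $w$, and $h$ are the center point, width, and height of the predicted box, $b^{gt}$, $w^{gt}$, and $h^{gt}$ are those of the ground truth box, $\rho(\cdot)$ denotes the Euclidean distance, $c$ is the diagonal length of the minimum enclosing rectangle of the predicted box and the ground truth box, and $c_w$ and $c_h$ represent the width and height of that rectangle, respectively.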
Bounding box regression suffers from sample imbalance during training: the number of high-quality anchor boxes with small regression errors is much smaller than the number of low-quality anchor boxes with large errors. To address this, focal loss is combined with EIOU to distinguish high-quality from low-quality anchor boxes. The formula is as follows:

$$L_{Focal\text{-}EIOU} = IOU^{\gamma}\, L_{EIOU}$$

Here, $\gamma$ is a parameter that controls the degree of suppression of outliers. In the focal EIOU loss function, the IOU value acts as a weighting function, so higher-quality regression targets receive a larger weight on their loss. This helps address the sample imbalance in box regression training and improves the accuracy of regression.
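A minimal sketch of the EIOU and focal EIOU computation for a single pair of boxes is given below. It assumes boxes in (x1, y1, x2, y2) corner format with positive area and the usual EIOU form (1 − IOU plus center-distance, width, and height penalties normalized by the enclosing rectangle). The function names and the default `gamma=0.5` are assumptions for illustration; the value of γ used in this article is not stated.

```python
def eiou_loss(pred, target):
    """Return (iou, eiou_loss) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlapping area and IOU.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + tw * th - inter)

    # Width and height of the minimum enclosing rectangle.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between the two box centers.
    dx = (px1 + px2) / 2.0 - (tx1 + tx2) / 2.0
    dy = (py1 + py2) / 2.0 - (ty1 + ty2) / 2.0

    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch)
    width_term = (pw - tw) ** 2 / (cw * cw)
    height_term = (ph - th) ** 2 / (ch * ch)
    return iou, 1.0 - iou + center_term + width_term + height_term


def focal_eiou_loss(pred, target, gamma=0.5):
    """Weight the EIOU loss by IOU ** gamma (gamma value is an assumption)."""
    iou, loss = eiou_loss(pred, target)
    return (iou ** gamma) * loss


if __name__ == "__main__":
    print(focal_eiou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

For two 2 × 2 boxes offset by one unit in each direction, the width and height terms vanish and only the IOU and center-distance terms contribute, which makes the separate roles of the three factors easy to see.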

Implementation of Improved Algorithm
The improved model architecture is illustrated in Fig. 6. The last fully connected layer and the global pooling layer of the MobileViT network are not involved in feature extraction and are therefore discarded. The improved model uses the weights of the MobileViT network as the pre-trained model, so during training and detection the input image is scaled to 256 × 256, which makes it difficult to extract feature information from small objects in the image. To address this, a small object detection layer is added, giving four object detection layers in total, to mitigate the negative impact of changes in object size. The improved model uses the lightweight MobileViT network as the backbone, which significantly reduces the number of parameters compared to the original backbone. As a classification network, MobileViT only needs to extract semantic information from the image, not positional information. Positional information is concentrated in shallow feature maps, while semantic information is concentrated in deep feature maps, and object detection tasks require both.
By concatenating shallow and deep feature maps (concatenating along the channel direction) and inputting the features extracted from each stage to path-aggregation network (PANet) for feature fusion, rich feature information can be obtained.

Figure 6: Improved YOLOv5 model
In PANet, to improve feature fusion, the original CSP2_X module is replaced with the TRE module, and a CBAM module is added after the TRE module to capture both global and local information and enhance the features. PANet combines a top-down upsampling path with a bottom-up downsampling path and, together with the attention modules used for information extraction, enhances detection of objects of different sizes.

Experiments and Results Analysis

Underwater Enhancement Algorithm
To demonstrate the effectiveness of the proposed underwater enhancement algorithm, underwater images were captured in the laboratory and compared with the CLAHE and the Multi-Scale Retinex with Color Restoration (MSRCR) [35] algorithms. The results are illustrated in Fig. 7.

Figure 7: Comparison of enhanced images
From a visual standpoint, the CLAHE algorithm does not significantly enhance the brightness of the image and only improves the details, which is not ideal. The MSRCR algorithm performs well in terms of brightness and contrast but poorly in terms of color restoration. In contrast, our proposed method yields a natural and smooth transition in brightness, improves detail information effectively, and produces natural and delicate colors.
The objective evaluation of underwater image enhancement is primarily carried out using the peak signal-to-noise ratio (PSNR), structural similarity (SSIM), underwater color image quality evaluation (UCIQE), and entropy. PSNR measures the level of image distortion or noise, with higher values indicating less distortion and noise. SSIM assesses the similarity between two images, taking into account brightness, contrast, and structure, with higher values indicating less image distortion. UCIQE evaluates images in terms of chromaticity, saturation, and contrast, with higher values indicating higher image quality. Lastly, entropy measures the richness of image details, with higher values indicating more abundant details. Table 1 shows the objective evaluation metric values of the enhancement results for different algorithms.

In summary, compared to other classical algorithms, the method proposed in this article produces richer color details and better brightness adjustment. In terms of objective evaluation, it performs better than the other algorithms in noise control, image quality, distortion level, and information entropy.

Object Recognition Algorithm

Dataset and Experimental Environment
The experimental dataset was built by adding underwater plastic waste images to the DeepTrash dataset shared by Gautam et al. [36]. It contains two categories: plastic and bottles. To increase the dataset size, the contrast and brightness of the original images were randomly adjusted, and the images were scaled and flipped. Before training, the experimental data was divided into a training-validation set and a test set in a 9:1 ratio. The training-validation set was then randomly divided into a training set and a validation set in a 9:1 ratio. The dataset division is shown in Table 2. The experiment was performed using the PyTorch (GPU) 1.7.1 deep learning framework, with an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00 GHz and an NVIDIA RTX A5000 graphics card with 24 GB of memory. The experiment was conducted on the Ubuntu 20.04.4 operating system, with NVIDIA driver version 470.103.01, CUDA version 11.3, and cuDNN version 8.2.1.
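The two-stage 9:1 division can be sketched as follows. The function name `split_dataset`, the fixed random seed, and the rounding of the subset sizes are assumptions made for the example; the actual subset sizes are those reported in Table 2.

```python
import random


def split_dataset(items, holdout_ratio=0.1, seed=0):
    """Split items 9:1 into train-val/test, then train-val 9:1 into train/val."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_test = round(len(shuffled) * holdout_ratio)
    test, trainval = shuffled[:n_test], shuffled[n_test:]

    n_val = round(len(trainval) * holdout_ratio)
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the proportions.
    names = [f"img_{index:04d}.jpg" for index in range(1000)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```

With this scheme the training, validation, and test sets hold roughly 81%, 9%, and 10% of the images, respectively.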
The experiment started from the YOLOv5 pre-trained model and applied transfer learning. The model was trained for 300 epochs with an input size of 256 × 256. During the first 60 epochs, the pre-trained weights were loaded and the backbone network was frozen, with a batch size of 64. For the remaining 240 epochs, the backbone network was unfrozen and the batch size was 32. The training hyperparameters were set as follows: the optimizer was stochastic gradient descent (SGD) with a momentum of 0.937 and a weight decay of 0.0005; the maximum learning rate was 0.04 and the minimum learning rate was 0.0016. Mosaic data augmentation and cosine annealing were employed. The trend of the loss value during training is illustrated in Fig. 8. The loss gradually decreased during the first 60 epochs, rose abruptly at the 61st epoch, when the backbone was unfrozen, and began to converge gradually from the 210th epoch.
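A cosine annealing schedule between the stated maximum and minimum learning rates can be sketched as below. The text does not specify whether warm-up is used or whether the schedule restarts when the backbone is unfrozen, so this sketch assumes a single cosine decay over all 300 epochs; the function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr_max=0.04, lr_min=0.0016):
    """Learning rate at a given 0-based epoch under single-cycle cosine annealing.

    Assumptions: no warm-up and no restart at the unfreezing epoch.
    """
    progress = epoch / (total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 60, 150, 299):
        print(epoch, round(cosine_lr(epoch), 5))
```

The schedule starts at the maximum learning rate, decays slowly at first, fastest around the midpoint, and flattens out as it approaches the minimum learning rate.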

Comparative Analysis of Different Detection Models
To demonstrate the superiority of our improved object detection algorithm, we conducted extensive experiments on the dataset and compared our results with the latest methods. The performance of the models is evaluated using the following metrics: precision (P), recall (R), average precision (AP), and mean average precision (mAP_0.5). In addition, the detection speed was assessed in terms of the number of plastic waste images detected per second, i.e., frames per second (FPS). The complexity of the models was quantified by the number of model parameters.
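The evaluation metrics can be illustrated with a short sketch. It assumes that detections have already been matched to ground truth boxes at an IOU threshold of 0.5 and sorted by descending confidence, and it computes AP as the area under the monotonically interpolated precision-recall curve, which is one common convention; the exact interpolation used in the experiments is not stated. The function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """AP for one class.

    is_true_positive: booleans for detections sorted by descending confidence.
    num_ground_truth: number of ground truth boxes of this class (must be > 0).
    """
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Make precision non-increasing as recall grows (interpolated envelope).
    for index in range(len(precisions) - 2, -1, -1):
        precisions[index] = max(precisions[index], precisions[index + 1])

    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=2))
    print(average_precision([True, False, True], num_ground_truth=2))
```

For the two-category dataset used here, mAP_0.5 would be the mean of the AP values of the plastic and bottle classes.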
To verify the effectiveness of the proposed algorithm, experiments were conducted to train and evaluate its performance, as well as the performance of five other algorithms: YOLOv5-m, YOLOX-m, YOLOv4, YOLOv3, and faster region-based convolutional neural networks (Faster-RCNN). All six algorithms were trained and evaluated under the same software, hardware environment, and dataset, and their performance was analyzed.
The analysis results are shown in Table 3. Compared to the YOLOv5-m algorithm, the proposed model achieves an 18.84% increase in detection speed and a 2.9% improvement in detection accuracy, with a mere 5.99% increase in model parameters. Compared to the anchor-free YOLOX-m algorithm, it improves detection speed by 22.60% and detection accuracy by 3.65%, and its parameter count is 88.9% smaller than that of YOLOX-m. Compared to YOLOv4, it increases detection speed by approximately 3.8%, improves detection accuracy by 4.3%, and reduces model parameters by 65.01%. Compared to YOLOv3, the detection accuracy improves by 5.9%, and the model has only 36.35% of the parameters of YOLOv3. Lastly, compared to the two-stage Faster-RCNN algorithm, it is 1.88 times faster in detection speed and achieves an 18.1% improvement in detection accuracy. These results indicate that the proposed model has superior overall performance and can meet the requirements of underwater waste detection tasks.

Underwater Waste Detection Results
Several images were selected from the test set for detection, and the results are illustrated in Fig. 9. The original images are shown on the left, while the recognition results are on the right. Fig. 9a presents a complex scene containing numerous small plastic bags, i.e., a complex small-target scene. Fig. 9b shows a strong light environment with direct sunlight. Fig. 9c depicts a dimly lit environment. Fig. 9d shows a murky environment in the laboratory, with water impurities that hinder recognition. In all of these scenarios, the proposed algorithm successfully detected the targets. The detection results demonstrate that the model proposed in this article can accomplish underwater waste detection in complex small-target scenes, strong light scenes, dimly lit scenes, and murky scenes.

Experimental Results on Recognition and Localization
An RGB-D camera not only captures real-time RGB images of the scene like a regular camera but also simultaneously captures the corresponding depth image of the scene. A RealSense D415 was used as the RGB-D depth camera for image acquisition, recognition, and localization.
To verify the feasibility of the proposed method, experiments were conducted in a laboratory underwater environment with the device. Some of the detection results are illustrated in Fig. 10, where the left side of each image is the original image, and the right side shows the recognition result. The 3D coordinates of the detected object are approximated by the 3D coordinates of the center point of the object detection box. In the figure, the white solid point in the rectangular box is the center point of the object detection box. Its coordinates represent the coordinates of this point in the camera coordinate system obtained by transformation. The number after the category represents the distance from the center point of the object detection box to the camera, which is the depth value of this point.
The positioning accuracy was analyzed through 10 experiments, and the results are shown in Table 4. After the pixel coordinates and depth value of the underwater target's center point are determined, its three-dimensional coordinates in the camera coordinate system can be calculated using the calibrated intrinsic and extrinsic parameters. The experimental results show that the error between the measured and actual depth is within 0.008 m, indicating that the system's overall positioning accuracy is high and can meet practical needs.
In summary, the method proposed in this article can enhance underwater images and achieve real-time identification of underwater garbage. It can use RGB-D cameras to locate the garbage with high precision.
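The conversion from the center of a detection box and its depth value to camera coordinates can be sketched with the pinhole camera model. This is a simplified illustration: the intrinsic parameters in the example are hypothetical placeholders rather than the calibrated values of the camera used in the experiments, and lens distortion and refraction at the water and housing interface are ignored. The function names are assumptions.

```python
def box_center(x1, y1, x2, y2):
    """Pixel coordinates of the center of a detection box."""
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (in meters) into camera coordinates.

    fx, fy are the focal lengths in pixels; cx, cy is the principal point.
    Assumption: ideal pinhole model without distortion or refraction.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


if __name__ == "__main__":
    # Hypothetical intrinsics and detection box, for illustration only.
    fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
    u, v = box_center(380.0, 200.0, 460.0, 280.0)
    print(pixel_to_camera(u, v, depth=0.5, fx=fx, fy=fy, cx=cx, cy=cy))
```

A point at the principal point maps onto the optical axis, and the lateral offset grows linearly with both the pixel offset and the measured depth, so depth errors translate directly into errors in all three coordinates.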

Conclusion
This article studies the recognition and localization of underwater waste. Firstly, a weighted fusion-based underwater image enhancement algorithm is proposed to improve image quality. Experimental results show that the proposed algorithm has better enhancement effects on brightness, contrast, detail information, and color restoration, and the enhanced results have smoother transitions with better visual effects. Secondly, an improved YOLOv5-based algorithm is proposed. Experimental results show that the improved algorithm has higher detection accuracy and faster detection speed on the underwater waste dataset, which meets the requirements of real-time detection. Finally, the RGB-D camera, underwater image enhancement, and underwater detection and recognition tasks are combined. The RealSense D415 camera is used to acquire the color and depth images, and the center point coordinates of the detection box are obtained to complete the recognition and localization of underwater targets. The experimental results demonstrate that the proposed method is effective for identifying and locating underwater plastic waste, with good recognition and localization accuracy. However, underwater targets may move frequently under the influence of water flow, so in future work we will study the identification and localization of dynamic targets in water.

Acknowledgement:
The authors would like to thank the anonymous reviewers and the editor for the very instructive suggestions that led to the much-improved quality of this article.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.