Water resources are an indispensable and precious resource for human survival and development. Water quality prediction plays a vital role in protecting and improving water resources. Changes in water quality are influenced by many factors, both long-term and short-term. Therefore, in view of the periodic and nonlinear characteristics of water quality changes, this paper takes dissolved oxygen as the research object and constructs a neural network model combining a convolutional neural network (CNN) and a long short-term memory network (LSTM) to predict the dissolved oxygen index in water quality. Firstly, we preprocessed the water quality data set obtained from the water quality monitoring platform. Secondly, we used a CNN to extract local features from the preprocessed water quality data and passed time series with better expressive power than the original water quality information to the LSTM layer for prediction. We chose optimal parameters by tuning the number of neurons in the LSTM network and the size and number of convolution kernels in the CNN. Finally, the conventional LSTM and the proposed model were evaluated on the water quality data. Experiments showed that the proposed model fits peaks more accurately than the conventional LSTM. Compared with the conventional LSTM model, its root mean square error, Pearson correlation coefficient, mean absolute percentage error, and mean square error improved by 5.99%, 2.80%, 2.24%, and 11.63%, respectively.

The importance of water to human life is self-evident; human survival and development are closely tied to water resources. With the rapid development of China's social economy, large quantities of industrial waste and pollutants have also been discharged into nature [

With the continuous development of Internet of Things technology and the Internet, the digital economy is developing rapidly, and big data and related technologies are increasingly altering our daily life. By analyzing the big data generated in production and life [

In recent years, neural networks and deep learning algorithms have developed rapidly and have been widely applied in all walks of life [

Among the various deep learning frameworks, CNN requires fewer parameters and is well suited to data with statistical stationarity and local correlation. LSTM neural networks are specially designed for time series with long-term dependencies and excel at learning temporal structure in higher-level feature sequences. These characteristics are very well suited to the needs of water quality prediction [

The rest of this article is organized as follows: An overview of related technologies is presented in Section 2. In Section 3, we describe the design of the proposed CNN-LSTM hybrid model, its prediction scheme, model composition, and evaluation indexes. Section 4 covers the process of building the model and the model comparison. Section 5 summarizes the paper and outlines directions for further research.

CNN is one of the most successful deep learning algorithms. Its network structures are commonly divided into one-dimensional (1D-CNN), two-dimensional (2D-CNN), and three-dimensional (3D-CNN) variants. 1D-CNNs are usually used for sequence data, 2D-CNNs for image and text processing, and 3D-CNNs for video processing [

RNN is a kind of artificial neural network, first proposed in the 1980s by scholars such as Jordan and Pineda. Unlike feedforward networks, an RNN is not limited by the input length and uses its internal state to process input sequences. The output at each time step is fed back as part of the next step's input, which gives the system memory [

In the LSTM network, each LSTM module consists of an input gate i, an output gate o, a forget gate f, and a cell state c. The forget gate determines which information to discard from the cell state; the input gate determines which information to add to the cell state; and the output gate determines which information to output from the cell state. The entire memory unit can be expressed by the following formulas:

f_{t} = σ(W_{f}·[h_{t−1}, x_{t}] + b_{f})
i_{t} = σ(W_{i}·[h_{t−1}, x_{t}] + b_{i})
g_{t} = tanh(W_{g}·[h_{t−1}, x_{t}] + b_{g})
o_{t} = σ(W_{o}·[h_{t−1}, x_{t}] + b_{o})
c_{t} = f_{t} ⊙ c_{t−1} + i_{t} ⊙ g_{t}
h_{t} = o_{t} ⊙ tanh(c_{t})

where h_{t} is the hidden state at time t; c_{t} is the cell state at time t; x_{t} is the input at time t; h_{t−1} is the hidden state at time t − 1, and the hidden state at the initial moment is 0. i_{t}, f_{t}, g_{t}, and o_{t} are the input gate, forget gate, select gate, and output gate, respectively; σ denotes the sigmoid activation function; W and b are the corresponding weight matrices and bias vectors; ⊙ denotes element-wise multiplication. During transmission between units, c_{t} is usually c_{t−1} passed from the previous state plus some values, and thus changes slowly, while h_{t} varies over a wide range, so different nodes can differ greatly [

Forgetting stage. This stage selectively forgets information passed in from the previous node, "forgetting the unimportant and remembering the important". That is, the value of f_{t} controls what is kept and what is forgotten in the previous state c_{t−1}.

Selecting the memory stage. In this stage, the input sequence x_{t} is selectively "remembered". The candidate values g_{t} computed from the current input are scaled by the input gate i_{t} before being written to the cell state.

Output stage. This stage determines which parts of the current state are produced as output, controlled primarily by o_{t}, with c_{t} passed through the tanh activation function.
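The three stages above can be sketched as a single forward step of one LSTM unit. The NumPy version below is illustrative only; the stacked weight layout is an assumption for compactness, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step.
    W has shape (4*H, D+H) with the i, f, g, o blocks stacked row-wise
    (an illustrative layout); b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i_t = sigmoid(z[0:H])            # input gate: what to write
    f_t = sigmoid(z[H:2*H])          # forget gate: what to keep from c_prev
    g_t = np.tanh(z[2*H:3*H])        # candidate ("select") values
    o_t = sigmoid(z[3*H:4*H])        # output gate: what to expose
    c_t = f_t * c_prev + i_t * g_t   # slowly changing cell state
    h_t = o_t * np.tanh(c_t)         # hidden state, bounded by tanh
    return h_t, c_t
```

Because h_t is the product of a sigmoid and a tanh, its entries stay within (−1, 1), while c_t accumulates over time, matching the slow/fast distinction noted above.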

In this paper, the water quality monitoring data set is constructed by a web crawler, and the dissolved oxygen concentration prediction model is constructed by CNN-LSTM hybrid network. The schematic block diagram is shown in

Water quality data change over time with a certain periodicity and are also affected by external factors, which makes the data vary nonlinearly. Therefore, it is difficult to predict changes in water quality directly. Using the LSTM model alone tends to distort the maxima and minima of the predicted series, introducing noise that is not relevant to the prediction. Using the CNN model alone can lead to overfitting caused by the excessive proportion of parameters in the fully connected layer [

As can be seen in

Commonly used evaluation indexes include mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and the Pearson correlation coefficient (PCC) [

MSE measures the deviation between the predicted value and the actual value and is a standard index for evaluating neural network performance. The calculation formula of MSE is:

MSE = (1/n) Σ_{i=1}^{n} (x_{i} − x_{i}′)²

RMSE reflects the degree of dispersion of the systematic error and represents the deviation range between the predicted and true values. Unlike MSE, whose units are the square of the original dimension, RMSE restores the original dimension and thus eliminates its influence. The larger the RMSE, the larger the deviation of the predictions from the measurements. The calculation formula of RMSE is:

RMSE = √MSE = √((1/n) Σ_{i=1}^{n} (x_{i} − x_{i}′)²)

MAPE reflects the ratio of the prediction error to the actual value, which makes it suitable for comparing different models on the same data. The smaller the MAPE, the better the model. The calculation formula of MAPE is:

MAPE = (100%/n) Σ_{i=1}^{n} |x_{i} − x_{i}′| / x_{i}

The Pearson correlation coefficient (PCC) is used to evaluate the correlation between the real and predicted values. The closer the PCC is to 1, the better the fit. The calculation formula of PCC is:

PCC = Σ_{i=1}^{n} (x_{i} − x̄)(x_{i}′ − x̄′) / √(Σ_{i=1}^{n} (x_{i} − x̄)² · Σ_{i=1}^{n} (x_{i}′ − x̄′)²)

where x_{i} is the measured value; x_{i}′ is the predicted value of the model output; x̄ and x̄′ are their respective means; i is the sample number; and n is the number of samples.
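For reference, the four indexes can be computed directly in NumPy; this is a straightforward sketch of the standard definitions, with MAPE expressed as a percentage:

```python
import numpy as np

def mse(x, x_pred):
    """Mean square error between measured x and predicted x_pred."""
    return np.mean((x - x_pred) ** 2)

def rmse(x, x_pred):
    """Root mean square error, in the original units of x."""
    return np.sqrt(mse(x, x_pred))

def mape(x, x_pred):
    """Mean absolute percentage error; assumes no zero measured values."""
    return 100.0 * np.mean(np.abs((x - x_pred) / x))

def pcc(x, x_pred):
    """Pearson correlation coefficient between measured and predicted series."""
    return np.corrcoef(x, x_pred)[0, 1]
```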

The water quality monitoring data are from China National Environmental Monitoring Station (

When we open the water quality monitoring platform, we find that the data on the page is loaded dynamically. Therefore, if a crawler directly requests the uniform resource locator (URL) of the current page, the returned hypertext markup language (HTML) contains only the source code of that page, not the complete, rendered content. There are usually two ways to crawl web pages that load data dynamically. The first is to analyze the data interface behind the dynamic loading and convert the returned JavaScript object notation (JSON) string into a Python object. The second is to use Selenium with a browser to load the page, simulate the behavior of a real user, and parse the rendered source code. Because the second method consumes central processing unit (CPU) time and memory and performs slowly, this paper uses the first method: reading the JSON string returned by the data interface and converting it into a Python object. By entering developer mode, we can view the page's source code and related files and identify the data interface used for dynamic loading.
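The JSON-to-Python conversion step can be illustrated with the standard library alone. The payload below is hypothetical; the field names (`total`, `tbody`, `do`, `temp`) are illustrative assumptions, not the monitoring platform's actual schema:

```python
import json

# Hypothetical JSON payload shaped like a dynamic data interface's response;
# field names are assumptions for illustration.
response_text = '''
{"total": 5676,
 "tbody": [{"time": "2018-07-01 08:00", "do": "8.61", "temp": "25.4"},
           {"time": "2018-07-01 12:00", "do": "8.35", "temp": "26.1"}]}
'''

payload = json.loads(response_text)   # JSON string -> Python dict
records = payload["tbody"]
dissolved_oxygen = [float(row["do"]) for row in records]
```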

After extracting the required form through the POST request method, all the records could be crawled by varying the PageIndex and PageSize parameters. After collating the crawled data, a total of 5676 water quality monitoring records covering the period from July 1, 2018, to January 31, 2021, were obtained. The sorted data are shown in
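The paging logic can be sketched offline. Here `fetch_page` is a stand-in for the real POST request (which would be issued with, e.g., `requests.post` carrying `PageIndex` and `PageSize` form fields); the canned responses and the empty-page stopping rule are assumptions for illustration:

```python
import json

def fetch_page(page_index, page_size):
    """Stub for the POST request to the data interface; serves canned JSON
    so the paging loop can run without network access."""
    pages = {1: '{"rows": [{"do": "8.61"}, {"do": "8.35"}]}',
             2: '{"rows": [{"do": "7.92"}]}'}
    return pages.get(page_index, '{"rows": []}')

def crawl_all(page_size=2):
    rows, page_index = [], 1
    while True:
        batch = json.loads(fetch_page(page_index, page_size))["rows"]
        if not batch:            # an empty page means we are past the last PageIndex
            break
        rows.extend(batch)
        page_index += 1
    return rows

records = crawl_all()
```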

| Indicator name | Number of data | Range | Number of missing data |
|---|---|---|---|
| Dissolved oxygen | 5656 | 1.04–19.55 | 20 |
| Water temperature | 5656 | 2.48–33.5 | 20 |
| Conductivity | 5457 | 197.80–732.10 | 219 |
| Turbidity | 5458 | 3.39–2571.46 | 218 |
| Permanganate index | 5022 | 1.67–12.33 | 654 |
| Ammonia nitrogen | 5177 | −0.02–1.62 | 499 |
| Total phosphorus | 5320 | 0.02–0.36 | 356 |
| Total nitrogen | 5177 | 1.45–9.93 | 499 |

With the advent of the era of big data, the clutter, complexity, and fuzziness of raw data pose huge challenges to data processing in aspects such as perception and computation [

Data preprocessing comprises three parts: feature selection, data cleaning, and data conversion. The flow chart is shown in

Feature selection. By observing the number of missing values in each indicator of the national monitoring site data in

Missing value handling. The causes of the missing value [
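The excerpt does not pin down the exact imputation method, so as one common repair, missing values in a time series can be filled by linear interpolation between their neighbors; the sketch below uses NumPy's `interp` and is an illustrative assumption rather than the paper's procedure:

```python
import numpy as np

def fill_missing(series):
    """Fill NaN gaps by linear interpolation over the sample index
    (one standard choice for evenly sampled monitoring data)."""
    series = np.asarray(series, dtype=float)
    idx = np.arange(series.size)
    mask = np.isnan(series)
    series[mask] = np.interp(idx[mask], idx[~mask], series[~mask])
    return series
```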

Outliers handling. During data collection, abnormal objects arise from heterogeneous data sources and from measurement and collection errors. Such objects are often called outliers. Outlier detection, also known as deviation detection or anomaly mining, is an important part of data mining; its task is to find objects that differ significantly from most of the data. Many data mining methods treat this difference information as noise [
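As one illustrative detector (the excerpt does not name the specific rule used), points lying more than three standard deviations from the mean can be flagged with the classical 3-sigma criterion:

```python
import numpy as np

def detect_outliers_3sigma(series):
    """Return a boolean mask marking values farther than three standard
    deviations from the mean (the classical 3-sigma rule)."""
    series = np.asarray(series, dtype=float)
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > 3 * sigma
```

Flagged points can then be dropped or treated like missing values and re-imputed.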

Normalization. To make the CNN-LSTM model converge faster and train more stably, the dissolved oxygen data are normalized to the interval [0, 1]. The normalization formula is:

x′ = (x − x_{min}) / (x_{max} − x_{min})

where x_{min} and x_{max} are the minimum and maximum of the dissolved oxygen data.

Data dividing. A model that performs well on the training set but poorly on the test set has weak generalization ability. To guard against this, this paper resamples the dissolved oxygen concentration by day, yielding 915 chronologically ordered samples, and divides them into a training set and a validation set in the ratio 8:2. That is, 732 samples from July 1, 2018, to July 1, 2020, are used for model training, and 183 samples from July 2, 2020, to January 31, 2021, are used to verify the performance of the model.
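Because the data are a time series, the 8:2 split is taken chronologically rather than by shuffling; a minimal sketch:

```python
def chronological_split(samples, train_ratio=0.8):
    """Split a time-ordered sequence into train/validation parts without
    shuffling, preserving temporal order (8:2 by default)."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# 915 daily samples -> 732 for training, 183 for validation
train_set, val_set = chronological_split(list(range(915)))
```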

Data restoring. When evaluating the trained model, the predicted data must be mapped back to the original scale to eliminate the impact of normalization on the error evaluation. The inverse normalization formula is:

x = x′ (x_{max} − x_{min}) + x_{min}
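The normalization and restoring steps form an exact round trip; a NumPy sketch of the min-max pair:

```python
import numpy as np

def normalize(x):
    """Min-max scale x to [0, 1]; also return (min, max) for later restoring."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()), x.min(), x.max()

def restore(x_scaled, x_min, x_max):
    """Inverse transform back to the original dissolved-oxygen scale."""
    return x_scaled * (x_max - x_min) + x_min
```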

The CNN-LSTM model is mainly composed of an input layer, a convolution layer, a pooling layer, an LSTM layer, a fully connected layer, and an output layer. In the Windows environment, the deep learning library Keras 2.3.1 is used to build the neural network, with Google's open-source framework TensorFlow 1.1.0 as the back end. Because adaptive moment estimation (Adam) converges quickly and learns well, all prediction models in this paper are trained with the Adam optimizer and a learning rate of 0.001. At the same time, to reduce the influence of human factors on the prediction model, each model is trained for 100 epochs (one epoch means traversing all samples in the training set) to obtain the parameters of the CNN-LSTM model.

The construction of the model consists of two parts: determining the LSTM parameters and determining the CNN parameters. The first part is the determination of the LSTM parameters. To determine the most appropriate number of neurons, we trained the LSTM model with different neuron counts and compared the evaluation metrics. From

| LSTM neurons | RMSE | PCC | MAPE | MSE |
|---|---|---|---|---|
| 8 | 0.9549 | 0.7686 | 6.9338 | 0.9119 |
| 16 | 0.9504 | 0.7708 | 6.9304 | 0.9032 |
| 32 | 0.9477 | 0.7721 | 6.9278 | 0.8982 |
| 64 | 0.9516 | 0.7702 | 7.0461 | 0.9056 |
| 128 | 0.9527 | 0.7697 | 7.0889 | 0.9075 |

The second part is the determination of the CNN parameters. Because the CNN-LSTM prediction model feeds the features extracted by the convolution and pooling layers into the LSTM, the parameters of these layers must be determined along with the LSTM parameters. The size and number of convolution kernels also affect the actual results. Therefore, this paper tests four settings for the number of convolution kernels (8, 16, 32, 64) and three settings for the kernel size (1, 2, 3). With the number of kernels fixed, the effect of different kernel sizes on the four evaluation indexes is tested; the parameter configurations and evaluation values are shown in

| Number of convolution kernels | Convolution kernel size | RMSE | PCC | MAPE | MSE |
|---|---|---|---|---|---|
| 8 | 1 | 0.9128 | 0.7885 | 6.8440 | 0.8333 |
| 8 | 2 | 0.9309 | 0.7801 | 6.8025 | 0.8667 |
| 8 | 3 | 0.9314 | 0.7799 | 6.8192 | 0.8677 |
| 16 | 1 | 0.9174 | 0.7864 | 6.7987 | 0.8416 |
| 16 | 2 | 0.9122 | 0.7889 | 6.7544 | 0.8321 |
| 16 | 3 | 0.9141 | 0.7880 | 6.7718 | 0.8335 |
| 32 | 1 | 0.9164 | 0.7869 | 6.7921 | 0.8399 |
| 32 | 2 | 0.9142 | 0.7879 | 6.8537 | 0.8359 |
| 32 | 3 | 0.9223 | 0.7874 | 6.9869 | 0.8506 |
| 64 | 1 | 0.9042 | 0.7925 | 6.7739 | 0.8176 |
| 64 | 2 | 0.8909 | 0.7937 | 6.7728 | 0.7937 |
| 64 | 3 | 0.9114 | 0.7892 | 6.8178 | 0.8306 |

It can be seen from

| Parameter | Parameter value |
|---|---|
| Training times | 100 |
| Number of convolution kernels | 64 |
| Convolution kernel size | 2 |
| LSTM neurons | 32 |
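With these parameters fixed (64 kernels of size 2, 32 LSTM neurons, Adam at learning rate 0.001, 100 epochs), the described architecture could be sketched in Keras roughly as follows. This is an illustrative configuration sketch, not the authors' code; in particular, the window length (`timesteps`) and feature count are assumptions, since the excerpt does not state them:

```python
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Dense
from keras.optimizers import Adam

timesteps, features = 7, 1  # assumed input window; not specified in the text

model = Sequential([
    Conv1D(filters=64, kernel_size=2, activation='relu',
           input_shape=(timesteps, features)),   # local feature extraction
    MaxPooling1D(pool_size=2),                   # downsample feature maps
    LSTM(32),                                    # long-term temporal memory
    Dense(1),                                    # dissolved-oxygen estimate
])
model.compile(optimizer=Adam(lr=0.001), loss='mse')
# model.fit(x_train, y_train, epochs=100, validation_data=(x_val, y_val))
```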

The change curve of the loss function of the proposed CNN-LSTM model is shown in

In order to compare the prediction effect of the two models on the actual data, this experiment uses the LSTM model and the CNN-LSTM model to predict the dissolved oxygen concentration in the test sample respectively. At the same time, in order to more clearly compare the prediction effects of the two models, we performed a visual comparative analysis of the first 100 samples, as shown in

The four evaluation metrics MAPE, PCC, MSE, and RMSE of the CNN-LSTM model and LSTM model are shown in

| Comparison of experimental results | RMSE | PCC | MAPE | MSE |
|---|---|---|---|---|
| LSTM | 0.9477 | 0.7721 | 6.9278 | 0.8982 |
| CNN-LSTM | 0.8909 | 0.7937 | 6.7728 | 0.7937 |

To address the many influencing factors and the difficulty of predicting water quality changes, this paper proposed a combined CNN-LSTM model to forecast dissolved oxygen in water quality. In this model, the features extracted by the CNN are retained in the LSTM's long-term memory, highlighting their role in the prediction process and thus improving the accuracy of the model. Compared with the traditional LSTM model, the RMSE, PCC, MAPE, and MSE improved by 5.99%, 2.80%, 2.24%, and 11.63%, respectively. The drawbacks of the proposed method are that it considers too few input factors and relies only on an iterative method for the above tasks. With this in mind, further research is necessary, including the following:

More input variables will be added to optimize the prediction model further, and multiple hidden layers will be used to improve the accuracy of dissolved oxygen concentration prediction.

Non-iterative methods will be incorporated into the model to further improve prediction accuracy [

While adding more input variables, some independent variables will be correlated with each other, so collinearity can be eliminated by principal component analysis.
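As an illustration of that last point (the paper leaves the exact procedure to future work), principal component analysis via SVD can project correlated input variables onto uncorrelated components before modeling:

```python
import numpy as np

def pca_components(X, k):
    """Project the rows of X (samples x variables) onto the top-k principal
    components, computed by SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)                  # center each variable
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                     # scores on the top-k components
```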

The authors would like to thank all anonymous reviewers for their insightful comments and constructive suggestions, which helped improve the quality of this paper.