Research Paper (Yafra Khan)

Predicting and Analyzing Water Quality using
Machine Learning: A Comprehensive Model

Yafra Khan (a), Dr. Chai Soo See (b)
Faculty of Computer Science and Information Technology (FCSIT)
Universiti Malaysia Sarawak, 94300, Kota Samarahan, Sarawak, Malaysia
(a) yafra.khan@gmail.com, (b) sschai@unimas.my
Abstract - The deteriorating quality of natural water resources like lakes, streams and estuaries is one of the direst issues faced by humanity. The effects of unclean water are far-reaching, impacting every aspect of life. Therefore, management of water resources to optimize their quality is crucial. The effects of water contamination can be tackled efficiently if data is analyzed and water quality is predicted beforehand. This issue has been addressed in many previous studies; however, more work needs to be done in terms of reliability, accuracy and usability of the current water quality management methodologies. The goal of this study is to develop a water quality prediction model using an Artificial Neural Network (ANN) by determining the dependency among different water quality parameters, in order to assist in decision making. This research uses historical water quality data taken from the United States Geological Survey (USGS). For this study, the data includes 7 parameters which affect water quality. The performance of the model is evaluated using Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Regression Analysis. Previous works on water quality prediction have also been analyzed and future improvements have been proposed in this paper.

Keywords: Artificial Neural Networks, Environmental Modeling, Machine Learning

1. INTRODUCTION

Natural water resources like groundwater and surface water have always been the cheapest and most widely available resources of fresh water. However, these resources are also the most likely to become contaminated due to various factors including human, industrial and commercial activities as well as natural processes. In addition, poor sanitation infrastructure and lack of awareness also contribute immensely to drinking water contamination [1]. The effects of water quality deterioration are far-reaching, impacting health, environment and infrastructure in a very adverse manner. According to the United Nations (UN), waterborne diseases cause the death of more than 1.5 million people each year, much greater than deaths caused by accidents, crimes and terrorism combined [2]. Therefore, it is crucial to devise novel approaches and methodologies for analyzing water quality and forecasting future water quality trends.

In order to carry out useful and efficient water quality analysis and to predict water quality patterns, it is important to determine the inter-dependence among different water quality parameters. Different methodologies have been proposed and applied in the past for analysis and monitoring of water quality and its parameter interdependence. The methodologies range from statistical techniques and visual modeling to analysis algorithms, prediction algorithms and decision making. Multivariate statistical techniques like Principal Component Analysis (PCA) have been used to determine the relationship among different water quality parameters [3]. The geo-statistical techniques that have been used include kriging, transitional probability, multivariate interpolation and regression analysis [4]. The algorithms for analysis and prediction include Artificial Intelligence (AI) techniques like Bayesian Networks (BN), Artificial Neural Networks (ANN) [5], Neuro-Fuzzy Inference [3], Support Vector Regression (SVR) [6], Decision Support Systems (DSS) and Auto-Regressive Moving Average (ARMA) [7]. However, the non-linear nature of water quality data, as in this research, makes it very complex to map input-output data and predict future water quality [8].

In order to carry out efficient water quality analysis and prediction, the dependence among different water quality parameters must be determined. The basic idea of this research is to devise a comprehensive methodology that predicts, analyzes and visualizes the water quality of particular regions with the help of certain water quality parameters. These parameters include physical, biological or chemical factors which influence water quality. There are certain quality standards set up by international organizations like the World Health Organization (WHO) and the Environmental Protection Agency (EPA), which serve as a benchmark for determining the quality of water. In its document “Parameters of Water Quality”, EPA mentions a total of 101 parameters which have an effect upon water quality in one way or another [9]. However, some parameters have a greater and more visible effect on water quality than others.

This paper intends to address this issue by suggesting a model based upon Machine Learning techniques in order to predict the future water quality trends of a particular area with the help of current water quality data and to determine relationships among different water quality parameters. An Artificial Neural Network (ANN) model is used in order to develop a comprehensive methodology for efficient water quality prediction and analysis. This includes a correlation analysis between different water quality parameters that determines the dependency and relationship among them.

2. DATA ACQUISITION AND STUDY AREA

Previous studies have shown that the richness and quality of data determine the accuracy and reliability of analysis [10]. Since the data of most water monitoring organizations lacks detail and contains insufficient observations [11], we have opted to acquire data from one of the most reliable water data sources in the world, whose data is usually pre-processed and frequently updated. The sample data for this research has been acquired from the U.S. Geological Survey’s (USGS) National Water Information System (NWIS), which is an open data repository supporting acquisition, processing and long-term storage of water quality data across the U.S. The data comprises 7 water quality parameters including temperature, turbidity, precipitation, nitrate, pH and salinity. However, only a selected subset of these parameters is used, in order to make use of the parameters which contribute the most towards water quality. To do this, we typically use the trial-and-error method and then determine the incremental contribution of each variable.

The study area of this research lies in Island Park village, situated in South-Western Nassau County at Latitude 40°36'31.8", Longitude 73°39'22.0" in the state of New York (Figure 1). The measurements from the monitoring station of Hog Island Channel have been used in this study, where water samples are collected and monitored by USGS using different techniques. For measurement, satellite telemetry is mainly used, with readings collected from 1.6 ft above the bottom. Data from 2014 with a time interval of 6 minutes has been acquired in order to carry out an efficient prediction process using this time-series data, which includes date/time, parameters and their measurements along with measurement units.

Figure 1. Area covered in the Island Park and Hog Island Channel Monitoring Station

3. THEORETICAL BACKGROUND OF APPLIED METHODOLOGY

The methodology used in this study comprises Machine Learning with training and testing data from the USGS online data repository. The theoretical background of the methodology is as follows:

3.1. Artificial Neural Network

ANN has been widely acknowledged as a methodology for classification of complex datasets such as those of environmental processes. It has the ability to efficiently describe the non-linear relationships of complex water quality datasets [12]. Moreover, it has strong adaptability to depict the changes that might occur in the water environment of a particular area. The algorithmic architecture of ANN attempts to simulate the structure and networks of a human brain, with an input layer, a hidden layer and an output layer, each consisting of nodes. There might be one or more hidden layers, depending upon the problem at hand.

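To make the layered structure concrete, the following is a minimal sketch of one forward pass through an input, a hidden and an output layer. It is an illustration only, not the authors' implementation: the function names, the toy weights and the omission of bias terms are assumptions, and the log-sigmoid activation is the one the paper names for its experiments in Section 4.

```python
import math

def log_sigmoid(v):
    """Logistic (log-sigmoid) activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(weights, inputs):
    """Each node forms a weighted sum of its inputs, then applies the activation.

    weights[i][j] is the weight on the connection from input j to node i.
    Bias terms are left out here (an assumption made for brevity).
    """
    return [
        log_sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in weights
    ]

def forward(layers, inputs):
    """Pass the inputs through each layer in turn and return the final outputs."""
    activations = inputs
    for weights in layers:
        activations = layer_forward(weights, activations)
    return activations

# Hypothetical toy network: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
output_weights = [[1.0, -1.0]]
prediction = forward([hidden_weights, output_weights], [0.2, 0.8])
```

Because the output node also uses the log-sigmoid, every prediction lies strictly between 0 and 1, which is one reason targets are usually scaled into that range before training.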
Figure 2. Structure of Artificial Neural Network

In addition to that, there are connections between the nodes with varying “weights” [13]. For this particular research, a feed-forward, back-propagation Neural Network with three layers has been used, i.e. one input, one hidden and one output layer. In the feed-forward process, the inputs are multiplied by the weights and the resultant value is moved forward towards the next layer, until it reaches the output layer, as follows:

\[ z_i = \sum_{j} w_{ij} x_j \]

where $w_{ij}$ is the weight transferred from the jth input to the ith node, $x_j$ is the input and $z_i$ is the summation computed at the ith node. In this study, for correlation between the parameters, the input layer initially consists of 10 units denoting the water quality parameters. The back-propagation process determines the error value by calculating the difference between the estimated value and the expected value, starting from the output layer and moving towards the input layer [5]. It is denoted by the symbol $\delta^{(l)}_j$, which is equal to the error of node j in layer l. For a training set $(x_j, y_j)$, the error term is:

\[ \delta^{(l)}_j = z_j - y_j \]

It is an iterative process, so after adjustment of the weights, the process is run repeatedly until convergence. In order to make an ANN model more efficient, we need to train it with as many examples as possible, while maintaining its generalization. For this study, data was divided into training data (70%), testing data (15%) and validation data (15%). It was made sure that the data for training, testing and validation is from the same time period, as seasonal variations affect water quality. For better analysis, data was scaled to fall within the range [0,1].

4. RESULTS AND DISCUSSION

A test was conducted to determine and forecast the correlation and dependency among different water quality parameters. This has been done using an Artificial Neural Network (ANN). Some statistics about the selected water quality parameters for the year 2014 were collected from USGS, including Minimum Value, Maximum Value and Mid-Range Value, in order to depict the range of values (Table 1). As chlorophyll concentration is directly representative of the ecological state of a water body [11], one of the significant experiments involves the prediction of the value of chlorophyll concentration by supplying the model with the values of Temperature, Salinity, Nitrate concentration, Turbidity, Dissolved Oxygen concentration and Specific Conductance.

Four ANN models have been created for this test for the parameters of Chlorophyll, Turbidity, Specific Conductance and Dissolved Oxygen. In these tests, there are 6 input units with samples ranging from January to March 2014, with the seventh quality parameter being the target. A feed-forward Neural Network has been used with the training algorithm of Scaled Conjugate Gradient (SCG) and the activation function of Log Sigmoid. After running the test, the performance parameters of Regression (R), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) have been calculated. The performance is shown graphically with MSE and Regression analysis of the four models (Figures 3, 4, 5 and 6). The values of the performance measures for the four ANN models for the training and testing processes are shown in Table 2.

The graphs for Regression Analysis show how well the data fits the function, both for training and testing. The closer the value of Regression is to 1, the better the function fits and hence the better the accuracy. The graphs for MSE show the number of epochs (iterations) it takes for the function to converge and the related MSE for training, testing and validation. We can see in Figure 3(a) that most data for Chlorophyll prediction falls in the range of 0 to 0.5, though there are a lot of outliers. Here, Regression for both training and testing is about 0.6; hence, the training and testing data does not fit that well. The graph of MSE (Figure 3(b)) shows that it takes 65 epochs, with best performance on epoch 59. We can see that the MSE for training, testing and validation almost overlaps; the MSE value lies around 10^-2 and decreases very slightly as the iterations increase. On the other hand, in the Regression Analysis for Conductance (Figure 4(a)), the data almost entirely fits the function with a negligible number of outliers, hence the value of Regression is approximately equal to 1 (0.998). The performance can also be seen in the MSE (Figure 4(b)), where the function converges even more quickly (57 epochs with best performance on epoch 51). The curves almost overlap at an MSE in the range of 10^-4 and the MSE value decreases drastically at this point. Similarly, when we look at the Regression Analysis for Dissolved Oxygen (DO), we can see that the data fits the function well (R=0.994), with only a few outliers visible. The MSE for DO shows the training and validation error almost completely overlapping, hence there is less chance of over-fitting. The Regression Analysis for Turbidity (Figure 6(a)) shows relatively under-fitted data points, and hence the regression value is about 0.5. If we look at the graph of MSE (Figure 6(b)), we can see that it has 189 epochs; hence it has taken relatively more time for the function to converge. The performance measures and analysis can be verified by looking at Table 2.

Parameter                      Minimum Value   Maximum Value   Mid-Range Value
Temperature (°C)               -1.0            28.4            14.7
Specific Conductance (µS/cm)   38900           49100           44000
Salinity (PSU)                 19.9            31.7            25.8
Nitrate (mg/L)                 0.01            1.0             0.505
Dissolved Oxygen (mg/L)        3.6             18.0            10.8
Turbidity (FNU)                <0.1            120             --
Chlorophyll (µg/L)             0.7             140             70.35

Table 1: Characteristics of Water Quality Data for 2014
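The two preprocessing steps described in Section 3.1, scaling each parameter into [0,1] and dividing the samples 70/15/15, can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function names are hypothetical, min-max scaling against the Table 1 extremes is an assumption about how the scaling was done, and the shuffled split with a fixed seed is only one way of drawing all three subsets from the same time period.

```python
import random

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return [(v - lo) / (hi - lo) for v in values]

def split_70_15_15(samples, seed=0):
    """Shuffle a copy of the samples, then cut it into 70% / 15% / 15%.

    Integer arithmetic is used for the cut points; any remainder goes
    to the validation subset.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 70 // 100
    n_test = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Salinity range from Table 1 (19.9 to 31.7 PSU); the three readings
# below are illustrative values, not measurements from the dataset.
scaled = min_max_scale([19.9, 25.8, 31.7], lo=19.9, hi=31.7)

# Hypothetical stand-in for 100 time-stamped samples.
train, test, validation = split_70_15_15(list(range(100)))
```

With these inputs the minimum maps to 0, the maximum to 1 and the mid-range salinity value to roughly 0.5, and the 100 samples are divided into subsets of 70, 15 and 15.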
Figure 3(a) Regression Analysis for Chlorophyll; Figure 3(b) Mean Squared Error for Chlorophyll
Figure 4(a) Regression Analysis for Conductance; Figure 4(b) Mean Squared Error for Conductance
Figure 5(a) Regression Analysis for Dissolved Oxygen; Figure 5(b) Mean Squared Error for Dissolved Oxygen
Figure 6(a) Regression Analysis for Turbidity; Figure 6(b) Mean Squared Error for Turbidity
                                                       Training Data               Testing Data
Parameter              Unit    Model   No. of Inputs   R       MSE       RMSE      R       MSE        RMSE
Chlorophyll            µg/L    ANN     6               0.665   0.0051    0.071     0.611   0.00504    0.070
Specific Conductance   µS/cm   ANN     6               0.998   0.00014   0.012     0.996   0.000141   0.011
Dissolved Oxygen       mg/L    ANN     6               0.994   0.00092   0.030     0.992   0.00119    0.034
Turbidity              FNU     ANN     6               0.534   0.00040   0.02      0.552   0.00025    0.015

Table 2: Performance measures with ANN
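The three measures reported in Table 2 can be computed as in the sketch below. The paper does not spell out how R is obtained, so treating it as the Pearson correlation between targets and network outputs is an assumption here; the function names and the sample values are likewise hypothetical.

```python
import math

def mse(targets, outputs):
    """Mean of the squared differences between targets and outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def rmse(targets, outputs):
    """Square root of the MSE, expressed in the same units as the data."""
    return math.sqrt(mse(targets, outputs))

def regression_r(targets, outputs):
    """Pearson correlation between targets and outputs (assumed meaning of R)."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    return cov / math.sqrt(var_t * var_o)

# Hypothetical scaled targets and model outputs, for illustration only.
targets = [0.10, 0.40, 0.35, 0.80]
outputs = [0.12, 0.38, 0.40, 0.75]
scores = (regression_r(targets, outputs), mse(targets, outputs), rmse(targets, outputs))
```

Since RMSE is simply the square root of MSE, the two columns of Table 2 can be cross-checked against each other: for the Chlorophyll training row, the square root of 0.0051 is about 0.071, which matches the reported RMSE.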
5. CONCLUSION

This paper analyzes water quality parameters to determine relationships and dependency among different parameters. The data used has been acquired from the USGS National Water Information System (NWIS), with data from the year 2014. The specified monitoring station is a channel situated in the State of New York. The measurements of 7 water quality parameters were scaled between 0 and 1 for better data handling. An Artificial Neural Network (ANN) has been used with Scaled Conjugate Gradient (SCG) as the training algorithm. Four ANN models for the measurements of the parameters of Chlorophyll, Specific Conductance, Dissolved Oxygen and Turbidity have been developed and analyzed. The performance measures that are used to depict the result are Regression, Mean Squared Error (MSE) and Root Mean Squared Error (RMSE).

The results of the conducted tests show that the parameter which has the most dependence upon the other six parameters is Specific Conductance, with a Regression value approximately reaching 1. It has the lowest MSE and RMSE values among all others, hence its performance accuracy is higher. Although the MSE value of Turbidity is low, which would indicate good accuracy, it takes longer to converge and does not fit the function well. In future studies, besides further improvements in prediction accuracy, there needs to be a more user-centric approach towards tackling water quality issues, by involving all the relevant stakeholders and using user-friendly tools and an interactive environment, so that the solution actually benefits the target users.

REFERENCES

[1] P. Zeilhofer, “GIS applications for mapping and spatial modeling of urban-use water quality: a case study in District of Cuiabá, Mato Grosso, Brazil,” Cad. Saúde …, vol. 23, no. 4, pp. 875–884, 2007.
[2] UN water, “Clean water for a healthy world,” Development, pp. 1–16, 2010.
[3] A. Babatunde and A. Olusegun, “Interrelationships among Water Quality Parameters in Recirculating Aquaculture System,” no. JUNE 2014, 2015.
[4] Y. Wang, Y. Wang, M. Ran, Y. Liu, Z. Zhang, L. Guo, Y. Zhao, and P. Wang, “Identifying Potential Pollution Sources in River Basin via Water Quality Reasoning Based Expert System,” 2013 Fourth Int. Conf. Digit. Manuf. Autom., pp. 671–674, 2013.
[5] S. Maiti and R. K. Tiwari, “A comparative study of artificial neural networks, Bayesian neural networks and adaptive neuro-fuzzy inference system in groundwater level prediction,” Environ. Earth Sci., vol. 71, no. 7, pp. 3147–3160, 2013.
[6] C. Min, “An Improved Recurrent Support Vector Regression Algorithm for Water Quality Prediction,” vol. 12, pp. 4455–4462, 2011.
[7] D. Hou, X. Song, G. Zhang, H. Zhang, and H. Loaiciga, “An early warning and control system for urban, drinking water quality protection: China’s experience,” Environ. Sci. Pollut. Res. Int., vol. 20, no. 7, pp. 4496–508, 2013.
[8] A. J, “Data mining application of non-linear mixed modeling in water quality analysis,” SAS Global Forum 2008, Data Mining and Predictive Modeling, 2008.
[9] The Environmental and Protection Agency, “Parameters of water quality,” Environ. Prot., p. 133, 2001.
[10] C. Leansing, T. Hartvigsen, and J. Reitan, “The Effect of Data Quality on Data Mining – Improving Prediction Accuracy by Generic Data,” Proc. 15th Int. Conf. Inf. Qual., 2010.
[11] Y. Park, K. H. Cho, J. Park, S. M. Cha, and J. H. Kim, “Development of early-warning protocol for predicting chlorophyll-a concentration using machine learning models in freshwater and estuarine reservoirs, Korea,” Sci. Total Environ., vol. 502, pp. 31–41, Jan. 2015.
[12] S. Song, X. Zheng, and F. Li, “Surface water quality forecasting based on ANN and GIS for the Chanzhi Reservoir, China,” 2nd Int. Conf. Inf. Sci. Eng. ICISE2010 - Proc., pp. 4094–4097, 2010.
[13] D. Graupe, “Principles of Artificial Neural Networks,” Advanced Series on Circuits and Systems, vol. 6. World Scientific, University of Illinois, Chicago, USA, 2007.