Research Paper (Team09)

Pump it Up: Data Mining the Water Table

Dr. Indrajeet Gupta, Himanshu Rathore, K.Shree Balaji, Manish Kumar and Jitendra Kumar
Abstract— Water is critical to a country's development, as it is used not only in agriculture but also for industrial development. Like many poor nations around the world, Tanzania faces serious problems in providing its people with water. In a nation where one third of the country is arid to semi-arid, it is very difficult for people to find clean, sanitary water unless they live near one of the three major lakes that border the country. As a result, ground water is the major source of water for the nation's people; however, it is not always clean. Many ground water wells are located near or next to toxic drainage systems, which leak into the fresh ground water and contaminate it. Consequently, Tanzanians turn to surface water, which contains contaminants such as bacteria or human waste, and people have no choice but to drink from, bathe in or wash their clothes in these sources. According to the Tanzania National Website, water-related illnesses such as malaria and cholera "account for over half of the diseases affecting the population," because people do not have access to sanitary options. Under these circumstances people, particularly women and girls, spend a significant amount of time traveling some distance to collect water. We use a dataset of water pumps in Tanzania to predict the operating condition of a water point. By finding which water pumps are functional, functional but in need of repair, and non functional, the Tanzanian Ministry of Water can improve the maintenance operations of the water pumps and make sure that clean, potable water is available to communities across Tanzania. While we were not able to identify all the pumps that need repair, our confidence in the ones we did identify is high, and we expect this to aid the maintenance process.

Index Terms—Machine Learning Algorithm, Deep Neural Networks, Random Forest, AdaBoost, XGBoost, Linear SVM.

I. INTRODUCTION
There is no doubt about the importance of water to human existence. People need clean water to survive and stay healthy. Lack of clean water contributes to the high mortality rates in children around the world. Tanzania has been blessed, both on the surface and below ground, with three times more renewable water resources than Kenya and 37 per cent more than Uganda. Despite the vast amounts of fresh water available, many Tanzanians still face water shortages due to insufficient capacity to access and store it, in both rural and urban areas.
The following statistics illustrate the magnitude of the problem:
• Access to water from a piped source all but stagnated over the past two decades. In 1991/92, 33.5 per cent of the population had such access; this figure was 33.1 per cent in 2010. Despite this, Tanzania is doing better than Uganda (15.3 per cent in 2006), is on par with Kenya (34.3 per cent in 2008-09), but is far behind Senegal (68.7 per cent in 2010);
• Urban areas witnessed a sharp deterioration in access to water, from 77.8 per cent to 58.6 per cent. On the other hand, rural areas experienced a slight improvement, from 19.2 per cent to 24.1 per cent, during the same period;
• A large majority of rural households (more than 70 per cent) were more than 15 minutes away from their main water source in 2010;
• Only 3 per cent of the total cultivated area in Tanzania was under irrigation in 2010.

Taarifa, with the help of the Tanzanian Ministry of Water, is looking into this problem. Taarifa is an online platform for crowd-sourced reporting and triaging of infrastructure-related issues. It allows citizens to engage with the government and register their issues with the local infrastructure of their area. This project aims to analyze the data collected by Taarifa and the Tanzanian Ministry of Water.
Pump it Up: Data Mining the Water Table is a competition hosted by DrivenData for the betterment of Tanzania. We want to help the Tanzanian Ministry of Water identify the water pumps that are functional but need repair, so that immediate action can be taken to keep them running in a healthy state. By fixing these water pumps early, the people of Tanzania could have improved and continuous access to running water.

We are using the data from Taarifa and the Tanzanian Ministry of Water to predict which water pumps are functional, functional needs repair, and non functional. The data was collected
using handheld sensors, paper reports, and user feedback via cellular phones. The dataset has features such as the location of the pump, water quality, source type, extraction technique used, and population demographics of the pump location. The training set has 59,400 rows and 40 features, including an output column. The output column specifies the status of the water pump as functional, functional needs repair, or non functional. Of the 40 features in the data, 31 are categorical variables, 7 are numerical variables, and 2 are date variables.

Figure 1. Class Imbalance
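
To put the class imbalance of Figure 1 in numbers, the following minimal sketch computes each class's share of the training set from the class counts reported for this dataset (32,259 functional, 4,317 functional needs repair, 22,824 non functional). The inverse-frequency weights are an illustrative assumption, one common way to make errors on rare classes cost more, and not necessarily the cost-based scheme used in this project. The names CLASS_COUNTS, class_shares and balanced_class_weights are hypothetical.

```python
# Class counts as reported for the training labels.
CLASS_COUNTS = {
    "functional": 32259,
    "functional needs repair": 4317,
    "non functional": 22824,
}


def class_shares(counts):
    """Return each class's fraction of all samples."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    Assumption: a common heuristic shown for illustration only; it is not
    confirmed to be the cost-based approach used in the paper.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in class_shares(CLASS_COUNTS).items():
        print(f"{label:>25}: {share:6.1%}")
    for label, weight in balanced_class_weights(CLASS_COUNTS).items():
        print(f"{label:>25}: weight {weight:.2f}")
```

The shares come out at roughly 54.3, 7.3 and 38.4 per cent, so a classifier that ignores the functional needs repair class entirely would still look accurate, which is why plain accuracy alone is a weak guide for the class of most interest here.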
Throughout this project we encountered several challenges and risks that we tried to mitigate:

• Missing values:
Our dataset contains many missing values. For example, the construction_year feature was missing in 20,709 rows, making it hard for us to estimate a decay rate for the water pumps or to add dynamic weather data. We also had 21,381 rows with missing population data. We mitigated this risk by replacing missing values with the most frequent value of the feature.

• Repeated values:
Our dataset contains many features that represent the same information at different levels of granularity. The feature groups (extraction_type, extraction_type_group, extraction_type_class), (payment, payment_type), (water_quality, quality_group), (source, source_class), (subvillage, region, region_code, district_code, lga, ward), and (waterpoint_type, waterpoint_type_group) each contain similar representations of the data at different grains. Hence, we risk overfitting during training if we include all the features in our analysis. We tried to avoid this risk by keeping, within each group, the feature with the finer grain, which holds more information, or by using correlation analysis across the features to see which one is a better fit.

• Class imbalance:
Our dataset has severe class imbalance, with 32,259 data points for functional water pumps, 4,317 for functional pumps that need repair, and 22,824 for non functional water pumps, as seen in Figure 1. Our focus is on functional pumps that need repair and can be fixed. In this case, we want to increase the true positive rate, which would result in effective maintenance. At the same time, we want to reduce false positives, i.e. misclassification of functional or non functional pumps as functional needs repair. This would reduce unnecessary expense. To mitigate the issue of class imbalance, we used cost-based approaches to rank classifier performance.

The most common approach to this problem is classification using a machine learning algorithm such as Random Forest. Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
Another approach was classification using a Deep Neural Network, which is considered to be one of the most efficient classifiers. Using the ReLU function in the hidden layers and a softmax function at the output layer, it showed an accuracy of around 77 percent. Other approaches to the same problem were AdaBoost and Support Vector Machine (SVM). These approaches required a lot of computational power, and their accuracy was not up to the mark.
Our approach for this project is a machine learning algorithm called XGBoost. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

II. RELATED WORK
Pump it Up: Data Mining the Water Table is a multi-class classification problem, and there have been various approaches to tackling it. Some have chosen neural networks, while others chose various machine learning classifiers to get a better result.
Neural networks are considered to be among the best classifiers. As our dataset is non-linear, we need a deep neural network to introduce non-linearity. This is one of the simplest approaches to our problem, but it has not shown much accuracy. The highest score we got using this approach was around 0.7749. Tuning a deep neural network also requires a good amount of time.
Another approach using a machine learning algorithm is Random Forest. Random forest is an ensemble learning method which creates many decision trees to predict the class. By tuning the random forest classifier we increased our score to around 0.8067. The main
disadvantage of random forests is their complexity. They are much harder and more time-consuming to construct than decision trees. Furthermore, tuning them is difficult and overfitting can easily occur.
We have opted for the XGBoost classifier, a machine learning algorithm which gives us better results than the neural network and the other machine learning algorithms. eXtreme Gradient Boosting is an implementation of gradient boosted decision trees designed for speed and performance.
The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported:
• Gradient Boosting algorithm, also called gradient boosting machine, including the learning rate.
• Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
• Regularized Gradient Boosting with both L1 and L2 regularization.
The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:
• Sparse Aware implementation with automatic handling of missing data values.
• Block Structure to support the parallelization of tree construction.
• Continued Training so that you can further boost an already fitted model on new data.
The XGBoost algorithm can be used for both classification and regression problems, and often gives the best results among the alternatives.

III. METHODOLOGY
A. Preliminary Data Analysis
The first task for the project was to explore the dataset, try to establish non-linear relationships between the different features of the dataset and the labels, and exclude those features that did not affect the labels. A major problem was that features like gps_height, population, latitude and longitude had many missing data points, so we filled those missing data points with the mean or median (as required) of the respective feature within the district in which the water point lies.

B. Data Visualization
[Figure: status_group versus water quality]
Here we can clearly see that if the water is soft there is a very high probability of the water point being functional, while if it is salty the probabilities of functional and non-functional are almost equal.
[Figure: status_group versus region]
Here we see that in some regions there is a very high probability of a water point being functional rather than non-functional.
Status_group variation with different features.
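
The per-category rates read off the data visualization plots can also be computed directly. The sketch below, in plain Python, tabulates the share of each status_group value within each category of one feature. The sample records are invented for illustration and the helper name status_share_by_category is hypothetical; water_quality and status_group are column names from the dataset.

```python
from collections import Counter, defaultdict


def status_share_by_category(rows, feature, target="status_group"):
    """For each value of `feature`, return the share of each target label."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    shares = {}
    for value, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[value] = {label: n / total for label, n in label_counts.items()}
    return shares


# Invented sample records, for illustration only (not taken from the dataset).
SAMPLE_ROWS = [
    {"water_quality": "soft", "status_group": "functional"},
    {"water_quality": "soft", "status_group": "functional"},
    {"water_quality": "soft", "status_group": "functional"},
    {"water_quality": "soft", "status_group": "non functional"},
    {"water_quality": "salty", "status_group": "functional"},
    {"water_quality": "salty", "status_group": "non functional"},
]

if __name__ == "__main__":
    result = status_share_by_category(SAMPLE_ROWS, "water_quality")
    for value, distribution in result.items():
        print(value, distribution)
```

Applied to the real training table, the same tabulation for water_quality or region would give the numbers behind the bars in the plots.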
C. Model selection and training
For further processing we dropped the unnecessary features to ease our training procedure, reducing our feature set to 18 features. We then had to choose a model to train on our dataset. We tried five different models and tuned each of them to maximize our result.
We first tried a linear SVM, which gave a score of around 0.7066.
For tuned AdaBoost, we got a score of around 0.724 on DrivenData.
For the Deep Neural Network, we got a score of 0.7749.
For Random Forest, we increased our score to 0.8067.
We got the best result from the XGBoost classifier, which gave a score of 0.8114.
Flow Chart:
[Figure: flow chart]
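
XGBoost, the best-scoring model in this comparison (0.8114), builds on gradient boosting, in which each new tree is fitted to the errors of the current ensemble and added with a learning rate. The following is a toy pure-Python sketch of that idea for squared-error regression on a single feature, using one-split trees (stumps) and an L2 penalty on the leaf values. It illustrates the technique under these simplifying assumptions; it is not the XGBoost library and not the model trained in this project, and all function names and data are hypothetical.

```python
def fit_stump(xs, residuals, l2=1.0):
    """Fit a one-split tree to the residuals.

    Leaf values are shrunk by an L2 penalty: sum(residuals) / (count + l2).
    Returns (threshold, left_value, right_value).
    """
    candidates = sorted(set(xs))[:-1]  # skip the largest value: right leaf would be empty
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / (len(left) + l2)
        right_value = sum(right) / (len(right) + l2)
        error = sum(
            (r - (left_value if x <= threshold else right_value)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict(model, x):
    """Base prediction plus the learning-rate-scaled output of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1, l2=1.0):
    """Boost stumps on residuals (the negative gradient of squared error)."""
    model = (sum(ys) / len(ys), learning_rate, [])
    for _ in range(n_rounds):
        residuals = [y - predict(model, x) for x, y in zip(xs, ys)]
        model[2].append(fit_stump(xs, residuals, l2))
    return model


def mse(model, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Invented toy data: a step function of a single feature.
XS = [1, 2, 3, 4, 5, 6, 7, 8]
YS = [1, 1, 1, 1, 5, 5, 5, 5]

if __name__ == "__main__":
    model = fit_gradient_boosting(XS, YS)
    print("training MSE:", mse(model, XS, YS))
```

A smaller learning rate or a larger l2 slows how quickly the residuals shrink, which is the trade-off the learning rate and regularization settings control in a real gradient boosting library.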
IV. EXPERIMENTAL RESULTS

We tried five different models. Random forest gave an accuracy of 0.8076, the linear SVM 0.7016, the deep neural network 0.7749, and AdaBoost 0.724. Finally, XGBoost showed the best accuracy, 0.8114, so we took the XGBoost model as the best trained model for “Pump it Up”.
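
The comparison can be tabulated with a few lines of Python. The sketch below ranks the five models by the accuracies listed in this section and reports each model's gap to the best one; the names ACCURACY and rank_models are hypothetical.

```python
# Accuracies as listed in the experimental results.
ACCURACY = {
    "Linear SVM": 0.7016,
    "AdaBoost": 0.724,
    "Deep Neural Network": 0.7749,
    "Random Forest": 0.8076,
    "XGBoost": 0.8114,
}


def rank_models(scores):
    """Return (model, accuracy, gap_to_best) tuples, best model first."""
    best = max(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(name, accuracy, best - accuracy) for name, accuracy in ranked]


if __name__ == "__main__":
    for name, accuracy, gap in rank_models(ACCURACY):
        print(f"{name:<20} {accuracy:.4f}  (gap to best: {gap:.4f})")
```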
The accuracies obtained with the various models are represented graphically below:
[Figure: accuracy of each model]
V. CONCLUSION
The people of Tanzania suffer greatly from the water crisis and the unavailability of water resources. The objective of the project was to predict the status of a water pump from three possible status values (functional, functional needs repair, non functional) based on a variety of qualitative and quantitative water pump attributes and geographical values. Data on about 59,400 water pumps was used for training and validating five distinct types of classification models (linear SVM, AdaBoost, DNN, Random Forest and XGBoost). The output of these models is shown on an interactive geomap, whereby the user can view the predicted statuses of more than 14,850 additional pumps whose functional status was not available as part of the original data set.
REFERENCES
[1]. Water In Crisis: Tanzania
https://thewaterproject.org/water-crisis/water-in-crisis-tanzania
[2]. Predicting the Functional Status of Pumps in Tanzania
https://towardsdatascience.com/predicting-the-functional-status-of-pumps-in-tanzania-355c9269d0c2
[3]. Tanzania: Water is life, but access remains a problem
https://blogs.worldbank.org/africacan/tanzania-water-is-life-but-access-remains-a-problem
[4]. DrivenData Problem Page
https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/
[5]. Pump it Up: Data Mining the Water Table
https://medium.com/@vaibhavshukla182/pump-it-up-data-mining-the-water-table-f903d4cfc7a8
