
Mining and visualising real-world data

About this module


An important part of Machine Learning is being able to manipulate and “get a feel for” the
data. This is a critical step in any data analysis process, as it aims to:

maximise the insight into the dataset and summarise its main characteristics

detect mistakes, missing values, outliers and anomalies

determine relationships among the input (explanatory) variables

uncover underlying structure and patterns

boost the data quality and avoid "garbage-in, garbage-out"

provide the basis for selecting the appropriate Machine Learning tools to apply

In this module, you will be introduced to the rich set of Python-based tools for data
manipulation and visualisation. You will be working with data from a publicly available case
study, provided as an external file ( retail_data.csv ), which you will need to load and
pre-process before you can apply any Machine Learning algorithms. You will also learn how to
further explore and interpret your data through simple summary measures and plotting.

Case Study
The dataset we will be using throughout today’s workshop is an aggregated and adapted
version of the Online Retail Case Study (https://archive.ics.uci.edu/ml/datasets/Online+Retail) from
the UCI Machine Learning repository, which is a great source for publicly available real-world
data for Machine Learning purposes. The dataset has been designed for this workshop with
the purpose of modelling the behaviour of customers ("returning" vs. "non-returning"
customers) based on their transaction activity (such as balance, max spent and number of
orders, among others).


Figure 1. Preview of the online retail dataset. A description of the columns of this table can be
found in data/features_description.md .

Quiz: Problem and data understanding


Before we can even think about classification, it is vital to first understand the problem
under study, define a clear Machine Learning objective and identify possible features
that might be useful to take into account.

What is the ML problem we will address?

What are the observations (samples)?

What are the features (attributes)?

What is the y (target/response/class) variable?

Answer:

In this case, the classification consists of modelling returning versus non-returning
customers.

The observations are the independent customers.

The features relate to the activity of the customers ( balance , max_spent ,
n_orders , etc.).

The target variable or class highlights whether the customer is returning or not ("yes"
vs. "no").

Loading the Python libraries


The libraries we will be using throughout the day for our hands-on activities include:

pandas : for high-performance, easy-to-use data structures and data analysis functions.

scipy : for routines such as numerical integration and optimisation.

numpy (NumPy): for its array data structure and data manipulation functions.


plotly : graphing library for interactive, publication-quality graphs.

sklearn (scikit-learn): for Machine Learning and model evaluation functions.

Start by opening the provided file jpm-vanilla.ipynb . At the very top of the file, you should
see all the various import statements we will be using throughout today’s workshop. These tell
the Python interpreter (the engine that runs the program) that these libraries are required for
the program to run. In this case:

PYTHON
# compatibility with python2 and 3
from __future__ import print_function, division

# numerical capacity
import scipy
import scipy.stats  # needed later for scipy.stats.itemfreq()
import numpy as np
import pandas as pd

# matplotlib setup
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# plotly setup
import plotly.plotly as py
from plotly.graph_objs import *
from plotly.tools import FigureFactory as FF
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode() (2)

# extra tools
from mpl_toolkits.mplot3d.axes3d import Axes3D
import visplots (1)

# the tools we will use from SKLEARN

# GENERAL SKLEARN TOOLS
from sklearn import preprocessing, metrics
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

# UNSUPERVISED LEARNING MODULE
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import silhouette_score

# DTS and RFS MODULE
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# SVM MODULE
from sklearn.svm import SVC

1. The visplots library has been developed in-house, and provides additional plotting
functionality for the visualisation of the classifiers' boundaries and their performance.


2. Run at the start of every IPython notebook to use plotly.offline . This injects the
plotly.js source files into the notebook.

Importing the data


As a first step we load the dataset from the provided retail_data.csv file with pandas . To
achieve this you will use the pd.read_csv() function. We just need to point to the location of
the dataset and indicate under what name we want to store the data, i.e. retail , and pandas
will do the rest.

At this stage, the data has only been loaded. Let’s have a look at the top few lines - we can
use the .head() method to achieve this.

PYTHON
# Import the data and explore the first few rows

retail = pd.read_csv('data/retail_data.csv', index_col='CustomerID')

header = retail.columns.values
retail.head()

Before you can feed data into a Machine Learning algorithm, you first need to convert the
imported DataFrame into a numpy array. It is also good practice to always check the
dimensionality of the input data using the shape attribute, to confirm that you really have
imported all the data correctly (e.g. one common mistake is to get the separator wrong and
end up with only one column).

PYTHON
# Convert to numpy array and check the dimensionality

npArray = np.array(retail)
print(npArray.shape)
> (1998, 11) (1)

1. These values tell us that our imported data consist of 1998 rows (samples) and 11 columns
(features).
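
If the printed shape had come back as (1998, 1) instead, the file was almost certainly parsed
with the wrong separator. A minimal sketch of the fix (retail_data.csv is comma-separated,
which is also the pandas default, so sep=',' is shown purely for illustration):

PYTHON
# Pass the separator explicitly if the columns are not split correctly
retail = pd.read_csv('data/retail_data.csv', sep=',', index_col='CustomerID')
print(retail.shape)
> (1998, 11)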

Split the data into input features, X, and outputs, y


The input features, X , are the variables that you use to predict the outcome. In this dataset,
there are ten input features, stored in the first ten columns (indices 0-9; since the upper
bound of a slice is not included, the corresponding indexing range is 0:10), all of which have
continuous values. The output label, y, holds the information of whether the customer has
returned or not ("yes" vs. "no"), and is stored in the final (eleventh) column (index 10). To
split the data, we assign the input feature columns and the output label column to separate arrays:


PYTHON
# Split to input matrix X and class vector y

X = npArray[:,:-1].astype(float) (1)
y = npArray[:,-1]

1. To use the values in X as continuous (floating point) values, we need to explicitly convert
or “cast” them into the float data type (which is Python’s data type for representing
continuous numerical values).

Try printing the size of the input matrix X and class vector y using the shape attribute:

PYTHON
# Print the dimensions of X and y

print("X dimensions:", X.shape)


> X dimensions: (1998, 10)

print("y dimensions:", y.shape)


> y dimensions: (1998,)

These tell us that X is a 2-dimensional array (matrix) with 1998 rows and 10 columns, while y
is a 1-dimensional array (vector) with 1998 elements. Also, based on the class vector y, the
customers are classified into two distinct categories: "yes" for returning customers and "no"
for non-returning customers.

Exploratory Data Analysis


Visualisation is an integral part of Data Science. Exploratory data analysis (EDA) is the
practice of analysing data sets to summarise their main characteristics, most often using
visual methods.

Plotly (https://plot.ly/) is an online collaborative data analysis and graphing tool that we will use
in order to construct fully interactive graphs. The Plotly API allows you to access all of the
library’s interactive functionality directly from Python (or other programming languages such
as R, JavaScript and MATLAB, among others). Crucially, Plotly has recently been made open-
source (https://plot.ly/javascript/open-source-announcement/), which now enables plotting without
requiring access to their API. Plotly Offline (https://plot.ly/python/offline/) brings interactive Plotly
graphs to the offline Jupyter (IPython) Notebook environment.

Investigate the y frequencies


An important aspect to understand before applying any classification algorithms is how the
output labels are distributed. Are they evenly distributed, or is one (or some, in cases where
the outcome is not just binary) of them rarer than the other(s)?


Imbalance in the distribution of labels (classes) is a frequent problem when working with real-
world data, and it can often lead to poor classification results for the minority class even if the
classification results for the majority class are very good. This is something to bear in mind
when evaluating the performance of a model: for imbalanced class distributions, the overall
accuracy can be high even when the model performs very poorly on the minority class. In
order to investigate the class frequencies, we will use the itemfreq() function as follows:

PYTHON
# Print the y frequencies

yFreq = scipy.stats.itemfreq(y)
print(yFreq)
> [['no' 260]
['yes' 1738]]
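
Note that itemfreq() has since been deprecated in SciPy. If it is unavailable in your
installation, NumPy provides an equivalent; a minimal sketch:

PYTHON
# Equivalent class counts using NumPy
labels, counts = np.unique(y, return_counts=True)
print(labels, counts)
> ['no' 'yes'] [ 260 1738]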

In our current dataset, you can see that the y values are categorical (i.e. they can only take one
of a discrete set of values) and have a non-numeric representation, "yes" or "no". This can be
problematic for scikit-learn and plotting functions in Python, since they assume numerical
values, so we need to map the text categories to numerical representations using
LabelEncoder and the fit_transform function from the preprocessing module:

PYTHON
# Convert the categorical to numeric values, and print the y frequencies

le = preprocessing.LabelEncoder()
y = le.fit_transform(y)

yFreq = scipy.stats.itemfreq(y)
print(yFreq)
> [[   0  260]
 [   1 1738]]
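
The encoder keeps track of the original categories, so the numeric values can always be
mapped back to their string labels. A quick sketch:

PYTHON
# Recover the original string labels from the encoded values
print(le.classes_)
> ['no' 'yes']

print(le.inverse_transform([0, 1]))
> ['no' 'yes']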

Visualising the data is a good way to get a feel for how it is distributed. As a simple example,
try plotting the frequencies of the output labels, 1 and 0, using the Bar graphical object
(trace) from Plotly.


PYTHON
# Display the y frequencies in a barplot with Plotly

# (1) Create the Data object
data = [
    Bar(
        x=['Non-returning customers', 'Returning customers'],
        y=[yFreq[0][1], yFreq[1][1]],
        marker=dict(color=['blue','orange'])
    )
]

# (2) Create a Layout object
layout = Layout(
    xaxis=dict(title="Class"),
    yaxis=dict(title="Count"),
    width=500
)

# (3) Create a Figure object
fig = dict(data=data, layout=layout)

# (4) Plot
iplot(fig)

In general, the Plotly syntax consists of 4 main steps:

1. Creating the Data object. The Data object may contain one or more graphical objects such
as Scatter, Box and Bar, often referred to as "traces". Data is in a list-like format, so we
must use square bracket notation "[ ]".

2. Creating a Layout object. Layouts and most of their individual arguments (such as the
xaxis and yaxis) are in dict format.

3. Creating a Figure object. This is the step where we combine the Data with the Layout.

4. Finally, using the iplot command to plot offline with Plotly.

The code above gives you the following bar chart:


Figure 2. Bar chart showing frequencies of returning (class "1") and non-returning (class "0")
customers.

More examples of Plotly barplots can be found at https://plot.ly/python/bar-charts/. In
addition, a full list of barplot arguments can be found at
https://plot.ly/python/reference/#bar/.

Data scaling
To avoid attributes with greater numeric ranges dominating those with smaller numeric
ranges, it is usually advisable to scale your data prior to fitting a classification model. Feature
scaling is generally performed as part of the data pre-processing.

In order to investigate the range and descriptive statistics of our features, we can apply the
describe() function from pandas to the original retail DataFrame (not the numpy
array!). For instance:

PYTHON
retail.describe()

The output of this function would result in the following table:


Figure 3. Descriptive statistics for each feature in the online retail dataset.

Informative as this table is, visual aids generally make the findings easier to interpret.
Boxplots are commonly used to investigate the differences in ranges of the input features. A
boxplot is a standardised way of displaying the distribution of the data based on the "five
number summary" (minimum, first quartile, median, third quartile, and maximum). At this
stage, let us start by constructing a boxplot using the raw data:

PYTHON
# Create a boxplot of the raw data

nrow, ncol = X.shape

data = [
Box(
y=X[:,i], # values to be used for box plot
name=header[i], # label (on hover and x-axis)
marker=dict(color = "purple"),
) for i in range(ncol)
]

layout = Layout(
xaxis=dict(title="Feature"),
yaxis=dict(title="Value"),
showlegend=False,
)

fig = dict(data=data, layout=layout)

iplot(fig)


Figure 4. Boxplot highlighting the different feature ranges of the raw data.

There are many ways of scaling, but the most common mechanism is auto-scaling: for each
column, the values are centred on the mean and divided by their standard deviation. This
scaling can be applied by calling the scale() function from scikit-learn’s preprocessing
module.

PYTHON
# Auto-scale the data

X = preprocessing.scale(X)
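
Since auto-scaling subtracts each column’s mean and divides by its standard deviation, a
quick sanity check is that every column of the scaled X should now have (approximately)
zero mean and unit standard deviation:

PYTHON
# Sanity check: each scaled column should have mean ~0 and std ~1
print(np.allclose(X.mean(axis=0), 0))
> True

print(np.allclose(X.std(axis=0), 1))
> True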

The outcome of the scaling can once more be visualised as a boxplot, reusing the previous
plotting script:

PYTHON
# Create a boxplot of the scaled data (simple or enhanced)

#### WRITE YOUR CODE HERE ####
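
One possible minimal solution is to reuse the Box trace construction from the raw-data
boxplot on the now-scaled X (the enhanced variants from the Plotly documentation add
per-trace colours, jittered points and so on):

PYTHON
# One possible solution: the same boxplot code, now applied to the scaled X
data = [
    Box(
        y=X[:,i],
        name=header[i],
        marker=dict(color="purple"),
    ) for i in range(ncol)
]

layout = Layout(
    xaxis=dict(title="Feature"),
    yaxis=dict(title="Scaled value"),
    showlegend=False,
)

iplot(dict(data=data, layout=layout))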

If you feel more adventurous, you can visit https://plot.ly/python/box-plots/ to find more Plotly
examples (also arguments (https://plot.ly/python/reference/#box)) and create even more advanced
boxplots such as the following:


Figure 5. Boxplot showing the scaled (pre-processed) data.

Transformations on the data


Scaling is only one of the transformations that can be applied. You can find more on
transforming (including scaling) and pre-processing data on sklearn’s preprocessing
documentation page (http://scikit-learn.org/stable/modules/preprocessing.html).

Investigate the relationship between input features


To visualise the relationship between two features (such as balance and max_spent), the most
basic plot is the scatter plot. Try plotting the first two variables against each other. You can
also relate associations between features to their y classifications by making the colour of the
points dependent on the corresponding y value.


PYTHON
# Create an enhanced scatter plot of the first two features

f1 = 0
f2 = 1 (1)

# Returning customers (class "1") represented with orange x
trace1 = Scatter(
x=X[y == 1, f1],
y=X[y == 1, f2],
mode='markers',
name='Returning customers ("1")',
marker=dict(
color='orange',
symbol='x'
)
)

# Non-returning customers (class "0") represented with blue circles
trace2 = Scatter(
x=X[y == 0, f1],
y=X[y == 0, f2],
mode='markers',
name='Non-returning customers ("0")',
marker=dict(
color='blue',
symbol='circle'
)
)

layout = Layout(
    xaxis=dict(title=header[f1]),
    yaxis=dict(title=header[f2]),
    height=600,
)

fig = dict(data=[trace1, trace2], layout=layout) (2)

iplot(fig)

1. f1 and f2 specify the features that you wish to plot against each other.

2. Remember that we can concatenate more than one trace into a Data object using a list
("[ ]"), whereas the Layout object is in dict format.

The code gives us the following plot:


Figure 6. Scatter plot of a combination of features against each other with the colour and
marker shape indicating whether the points are classified as returning or non-returning
customers.

Examples of Plotly scatterplots can be found at https://plot.ly/python/line-and-scatter/ (or,
for a list of arguments, refer to https://plot.ly/python/reference/#scatter/).

Try plotting different combinations of three features in the same plot.


The scatterplots we have seen so far investigated the relationship between two variables
(features). A three-dimensional graph lets you introduce a third axis, typically called the z axis,
and can help you understand the relationship between three variables. Plotly’s fully
interactive functionality allows you to plot, hover, zoom and rotate 3-dimensional scatterplots.
For a full list of arguments on 3d plots in Plotly click here
(https://plot.ly/python/reference/#scatter3d). Other examples on 3D scatterplots using Plotly can be
found at https://plot.ly/python/3d-scatter-plots/.

PYTHON
# Hint: Investigate the Scatter3d object from Plotly
# Axes in 3D Plotly plots work a little differently than in 2D:
# axes are bound to a Scene object (use help(Scene)).

#### WRITE YOUR CODE HERE ####
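
A minimal sketch of one way to do this, using the Scatter3d trace with axis titles bound to a
Scene (the feature indices f1, f2 and f3 here are arbitrary; pick whichever combination you
want to inspect):

PYTHON
# A possible 3D scatter sketch; f1, f2 and f3 are arbitrary feature indices
f1, f2, f3 = 0, 1, 2

data = [
    Scatter3d(
        x=X[y == cls, f1],
        y=X[y == cls, f2],
        z=X[y == cls, f3],
        mode='markers',
        name=name,
        marker=dict(color=col, size=3)
    )
    for cls, name, col in [(1, 'Returning customers ("1")', 'orange'),
                           (0, 'Non-returning customers ("0")', 'blue')]
]

layout = Layout(
    scene=Scene(
        xaxis=dict(title=header[f1]),
        yaxis=dict(title=header[f2]),
        zaxis=dict(title=header[f3])
    ),
    height=600
)

iplot(dict(data=data, layout=layout))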

For instance, try to generate:


Figure 7. Three-dimensional scatterplot. The colour highlights whether the points are classified
as returning or non-returning customers like in the previous plot. In this case we look at the
relationship between the features mean_spent , balance and max_spent at the same time.

Try different combinations of f1 and f2 (in a grid/scatterplot matrix if you can).
A scatterplot matrix shows a grid of scatterplots where each attribute is plotted against all
other attributes. For example, try to create a scatterplot matrix of the first four features such
as the one presented in the figure. You can find further information on how to create and
customise subplots with Plotly at https://plot.ly/python/subplots/.

PYTHON
# Hints: You may want to use nested loops that iterate through the
# rows and columns of the grid, and also import and make use of the
# make_subplots() function from Plotly

#### WRITE YOUR CODE HERE ####
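
A minimal sketch of one approach, using make_subplots() and two nested loops (the grid
size, marker styling and figure dimensions are just one possible choice):

PYTHON
from plotly.tools import make_subplots

nFeat = 4  # scatterplot matrix of the first four features

fig = make_subplots(rows=nFeat, cols=nFeat, print_grid=False)

for i in range(nFeat):        # rows of the grid
    for j in range(nFeat):    # columns of the grid
        fig.append_trace(
            Scatter(
                x=X[:, j],
                y=X[:, i],
                mode='markers',
                marker=dict(color=y, size=3),  # colour by class
                showlegend=False
            ),
            i + 1, j + 1      # subplot positions are 1-indexed
        )

fig['layout'].update(height=700, width=700)
iplot(fig)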

The syntax of loops in Python can be found at http://www.learnpython.org/en/Loops.

An example of a scatterplot matrix using Plotly:


Figure 8. Scatter plot matrix where features are plotted against each other. The colour
highlights whether the points are classified as returning or non-returning customers like in the
previous plot. This kind of chart is handy when one needs to look at the relationships between
multiple variables simultaneously.

Further reading
http://numpy.scipy.org - To find out more about numpy.

http://scipy.org/Tentative_NumPy_Tutorial - A NumPy tutorial.

http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.

http://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction - Basic Machine
Learning vocabulary used within scikit-learn.


http://scikit-learn.org/stable/supervised_learning.html#supervised-learning - scikit-
learn supervised learning page.

https://plot.ly/python/ - Plotly Python Library

https://plot.ly/python/offline/ - Plotly Offline

Wrap up of Module 2
A vector can be represented by a NumPy array with one dimension.

A matrix can be represented by a NumPy array with two dimensions.

A NumPy array can only hold data of a single data type. If you try to assign a value of an
incompatible type to an element, NumPy will raise an error.

NumPy arrays are indexed from 0.

The colon can be used in the array indices to specify a range of array elements, e.g.
X[1:3]. This is known as index slicing. The index to the left of the colon (1) is included in
the range, while the index to the right of the colon (3) is not (see the sketch at the end of
this list).

Most scikit-learn functions expect data to be presented as NumPy arrays.

Plotting two features (dimensions) against each other is a good way of identifying features
or feature combinations that might be useful in classification.
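
A small sketch illustrating the indexing, slicing and single-dtype points above:

PYTHON
A = np.array([10.0, 20.0, 30.0, 40.0])

print(A[0])    # indexing starts at 0
> 10.0

print(A[1:3])  # index 1 is included, index 3 is excluded
> [ 20.  30.]

# All elements share a single data type; an incompatible assignment fails
try:
    A[0] = "not a number"
except ValueError as err:
    print("Error:", err)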

Last updated 2016-11-25 06:32:37 GMT

