
13/05/2019 Workshop 1

Classification using a Decision Tree

This notebook demonstrates generating a practice dataset, visualising it, and using a decision tree to classify
it. It is important that you generate a unique dataset, so that your results differ from those of your peers and
you can undertake a unique analysis. To achieve this, wherever a random number seed is specified you need to
replace the 0 with the last 3 digits of your student number.

You are required to document your work in markdown cells. Empty cells have been included, but you can add
more if you want, either for code experimentation or for further explanation. Concise documentation for markdown
can be found at
( and
p/markdown-here/wiki/Markdown-Here-Cheatsheet (

Original notebook by Dr Kevan Buckley, University of Wolverhampton, 2019. This submission by your name
and student number

In markdown cells like this one explain the code or results below

In [1]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pydotplus
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Make sure that you replace the zero with the last 3 digits of your student number. If this includes a
leading zero use the last 4 digits.

In [2]: features, target = make_classification(
    n_samples=200, n_features=4, n_classes=3, n_clusters_per_class=1, random_state=275)

Generating a synthetic classification problem with 200 samples, 4 features and 3 classes, with one cluster per class.

In [3]: features.shape

Out[3]: (200, 4)

Returns the dimensions of the features array: 200 samples, each with 4 features.

In [4]: target.shape

Out[4]: (200,)

Returns the dimensions of the target array: one class label for each of the 200 samples.

In [5]: features[0]

Out[5]: array([-0.20225961, -0.68772806, -0.01596298, 0.2098523 ])

localhost:8888/notebooks/Desktop/AI %26 Machine Learning/Workshop 1.ipynb 1/8


In [6]: target[0]

Out[6]: 2

In [7]: feature_names = ['feature_0', 'feature_1', 'feature_2', 'feature_3']


Out[7]: ['feature_0', 'feature_1', 'feature_2', 'feature_3']

setting names for the features

In [8]: features_df = pd.DataFrame(features, columns=feature_names)

Creating the features dataframe, using the feature names as column labels.

In [9]: features_df.head()

Out[9]: feature_0 feature_1 feature_2 feature_3

0 -0.202260 -0.687728 -0.015963 0.209852

1 -0.294325 -0.984434 0.045413 0.415847

2 -0.106930 -0.324252 0.156815 0.376904

3 -0.362694 -1.206135 0.085266 0.559605

4 -0.310437 -0.944848 0.440628 1.070669

showing the preview of the dataframe, head shows the first 5 rows

In [10]: target_df = pd.DataFrame(target, columns=['target'])

Creating the target dataframe, with a single column labelled 'target'.

In [11]: target_df.head()

Out[11]: target

0 2

1 2

2 2

3 2

4 2

showing the first 5 rows for the target dataframe

In [12]: dataset = pd.concat([features_df, target_df], axis=1)

Concatenating the two dataframes column-wise (axis=1), so that each row holds the four features and its target.


In [13]: dataset.head()

Out[13]: feature_0 feature_1 feature_2 feature_3 target

0 -0.202260 -0.687728 -0.015963 0.209852 2

1 -0.294325 -0.984434 0.045413 0.415847 2

2 -0.106930 -0.324252 0.156815 0.376904 2

3 -0.362694 -1.206135 0.085266 0.559605 2

4 -0.310437 -0.944848 0.440628 1.070669 2

showing the first 5 rows of the new dataframe

Below are scatter graphs which show relationships between two different features. Scatter graphs can be
used to visualise data and find trends.

In [14]: sns.relplot(
x='feature_0', y='feature_1', hue='target', style='target', data=dataset)

Showing the relationship between two of the variables, in this case feature_1 plotted against feature_0.


In [15]: sns.relplot(
    x='feature_0', y='feature_2', hue='target', style='target', palette='prism', data=dataset)

Showing the relationship between feature_2 and feature_0.

In [16]: sns.relplot(
    x='feature_0', y='feature_3', hue='target', style='target', palette='prism', data=dataset)

Showing the relationship between feature_3 and feature_0.


In [17]: sns.relplot(
    x='feature_1', y='feature_2', hue='target', style='target', palette='prism', data=dataset)

Showing the relationship between feature_2 and feature_1.

In [18]: sns.relplot(
    x='feature_1', y='feature_3', hue='target', style='target', palette='prism', data=dataset)

Showing the relationship between feature_3 and feature_1.


In [19]: sns.relplot(
    x='feature_2', y='feature_3', hue='target', style='target', palette='prism', data=dataset)

Showing the relationship between feature_3 and feature_2.
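
The six scatter plots above cover every pair of the four features. As a small sketch of where that number comes from, the pairs can be enumerated with the standard library; this only lists the pairs and does not draw anything.

```python
from itertools import combinations

feature_names = ['feature_0', 'feature_1', 'feature_2', 'feature_3']

# Every unordered pair of features; 4 features give 4 * 3 / 2 = 6 pairs.
pairs = list(combinations(feature_names, 2))

for x_name, y_name in pairs:
    print(x_name, 'vs', y_name)
```

The pairs come out in the same order as the plots in cells 14 to 19, so a loop over `pairs` could replace the six separate calls if preferred.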

In [20]: training_features, test_features, training_target, test_target = train_test_split(
    features, target, random_state=275)

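Splitting the data into a training set, used to fit the model, and a test set, used to evaluate it. No test size is given, so the default of 25% of the samples is held out for testing.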

In [21]: print(training_features.shape, test_features.shape)

(150, 4) (50, 4)

Showing the dimensions of the training and test arrays. Of the 200 samples in the dataset, 150 were used for
training and 50 were used to test the model.
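
To make the 150/50 figures concrete, here is a simplified stand-in for a shuffled split. It is not scikit-learn's implementation: the helper name `simple_split` and its parameters are hypothetical, and the 0.25 test fraction is assumed to match the default that `train_test_split` applies when no size is given.

```python
import math
import random


def simple_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing.

    Hypothetical helper for illustration only; train_test_split does its
    own shuffling, so the rows it selects will differ.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(200))  # stand-in for the 200 generated samples
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))  # 150 50
```

Every sample ends up in exactly one of the two sets, which is what allows the test set to act as unseen data.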

In [22]: dtc = DecisionTreeClassifier(criterion='entropy')


Out[22]: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,

max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,

Creating the decision tree classifier, using entropy as the criterion for choosing splits.
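
As a rough sketch of what the entropy criterion measures, the functions below compute the Shannon entropy of a node and the information gain of a split. The function names and the class counts are made up for illustration and are not taken from the fitted tree.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the sample count per class."""
    total = sum(class_counts)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node holding 50 samples of each of 3 classes, split so that
# class 0 is isolated in the left child.
parent = [50, 50, 50]
left, right = [50, 0, 0], [0, 50, 50]
print(entropy(parent))
print(information_gain(parent, [left, right]))
```

A pure node has entropy 0, and an evenly mixed three-class node has the maximum of log2(3) bits, so a split that isolates one class, as in this example, yields a large gain.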

In [23]: model = dtc.fit(training_features, training_target)

fitting the decision tree classifier model


In [24]: predictions = model.predict(test_features)


Out[24]: array([1, 0, 1, 1, 2, 1, 1, 0, 0, 1, 2, 1, 0, 0, 2, 1, 2, 1, 1, 0, 1, 1,
2, 0, 1, 2, 1, 0, 2, 0, 0, 2, 0, 2, 1, 2, 2, 0, 1, 1, 1, 2, 2, 2,
2, 1, 2, 0, 0, 1])

showing the array of the predicted values from the testing features.

In [25]: matrix = confusion_matrix(test_target, predictions)

A confusion matrix compares the predictions with the actual test targets, showing how accurate the predictions were.
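
The counting behind a confusion matrix can be sketched in a few lines. This is an illustrative re-implementation, not scikit-learn's code: `build_confusion_matrix` is a hypothetical helper and the labels are made up rather than taken from the notebook's test data. It follows the same layout as `confusion_matrix`, with rows for true classes and columns for predicted classes.

```python
def build_confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count (true, predicted) label pairs; rows are true, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


# Made-up labels for illustration, not the notebook's test data.
true_labels = [0, 0, 1, 1, 1, 2, 2]
predicted_labels = [0, 0, 1, 2, 1, 1, 2]

toy = build_confusion_matrix(true_labels, predicted_labels, n_classes=3)
for row in toy:
    print(row)
# [2, 0, 0]
# [0, 2, 1]
# [0, 1, 1]
```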

In [26]: print(matrix)

[[14  0  0]
 [ 0 17  4]
 [ 0  3 12]]

This is a 3x3 confusion matrix showing that 43 correct predictions and 7 incorrect predictions were made, out
of a total of 50 predictions. A confusion matrix describes the performance of a classification model on a set of
test data for which the true values are known. The main diagonal, from top left to bottom right, gives the
correctly predicted values (14, 17 and 12), whereas every other value was incorrectly predicted.

Even though no labels are shown on this matrix, the true values would be labelled down the left and the
predicted values along the top. Taking value 1 as an example, the row for true value 1 and the column for
predicted value 1 intersect at 17, so it was predicted correctly in 17 cases. Similarly, the same row shows
that in 4 cases value 1 was predicted to be value 2, which is incorrect.
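
The figures quoted above can be checked directly against the printed matrix. The sketch below copies the matrix in as a plain nested list, with rows as true values and columns as predicted values; the variable names are chosen here for illustration.

```python
# Confusion matrix from cell 26: rows are true values, columns are predictions.
matrix = [
    [14, 0, 0],
    [0, 17, 4],
    [0, 3, 12],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))  # main diagonal
incorrect = total - correct

print(total, correct, incorrect)  # 50 43 7
print(matrix[1][1])  # 17: true value 1 predicted as 1
print(matrix[1][2])  # 4: true value 1 predicted as 2
print(correct / total)  # 0.86
```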

For value 0 no incorrect predictions were made. This can also be seen in the first scatter plot, of feature_0
against feature_1, which shows no anomalies: the plotted data shows a positive correlation (the data points
fit a line with a positive slope) between the two features being compared.

Looking at the scatter graphs from the second one onwards, only target classes 0 and 2 are linearly
separable: a hyperplane can put target 0 on one side and target 2 on the other side, while the data points of
target 1 fall on both sides.

In [27]: print(classification_report(test_target, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       0.85      0.81      0.83        21
           2       0.75      0.80      0.77        15

   micro avg       0.86      0.86      0.86        50
   macro avg       0.87      0.87      0.87        50
weighted avg       0.86      0.86      0.86        50

The classification report is derived from the same counts as the confusion matrix. Precision, also known as
the positive predictive value, is the proportion of positive predictions that were correct. It is calculated by
dividing the true positives by the total number of predicted positives. For value 1 that is 17 true positives
divided by 20 predicted positives (17 + 3, the column for value 1), giving 0.85 or 85%, as also shown in the
report.
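
The same calculation can be repeated for every class by dividing each diagonal entry by its column total. This is a sketch using the printed confusion matrix as a nested list, not the code scikit-learn runs internally.

```python
# Confusion matrix from cell 26: rows are true values, columns are predictions.
matrix = [
    [14, 0, 0],
    [0, 17, 4],
    [0, 3, 12],
]
n_classes = len(matrix)

precisions = []
for k in range(n_classes):
    true_positives = matrix[k][k]
    # Everything predicted as class k, whether right or wrong (column total).
    predicted_positives = sum(matrix[row][k] for row in range(n_classes))
    precisions.append(true_positives / predicted_positives)

print([round(p, 2) for p in precisions])  # [1.0, 0.85, 0.75]
```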


Recall is the percentage of all relevant results that the algorithm classified correctly, that is, how many of
the actual positives were found or "recalled". In other words, when the true value is yes, how often does the
classifier predict yes? Looking again at value 1, the total is 17 true positives (correctly predicted) plus 4
false negatives (incorrectly predicted as another class), which is 21. Dividing the true positives by this total
gives 17/21 = 0.8095, which rounds to 0.81 as the report shows.
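
Recall for every class follows the same pattern, this time dividing each diagonal entry by its row total. Again this is a sketch based on the printed confusion matrix, with variable names chosen for illustration.

```python
# Confusion matrix from cell 26: rows are true values, columns are predictions.
matrix = [
    [14, 0, 0],
    [0, 17, 4],
    [0, 3, 12],
]

recalls = []
for k, row in enumerate(matrix):
    true_positives = row[k]
    # All samples whose true class is k (row total), matching "support".
    actual_positives = sum(row)
    recalls.append(true_positives / actual_positives)

print([round(r, 2) for r in recalls])  # [1.0, 0.81, 0.8]
```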

The f1 score combines the other two metrics: it is the harmonic mean of precision and recall, calculated as 2 x (precision x recall) / (precision + recall).
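
As a worked check, the f1 scores in the report can be reproduced from the confusion matrix using the harmonic mean, 2PR / (P + R). The helper name `f1_from_matrix` is hypothetical and the matrix is copied from the printed output.

```python
# Confusion matrix from cell 26: rows are true values, columns are predictions.
matrix = [
    [14, 0, 0],
    [0, 17, 4],
    [0, 3, 12],
]


def f1_from_matrix(matrix, k):
    """F1 for class k: harmonic mean of its precision and recall."""
    true_positives = matrix[k][k]
    precision = true_positives / sum(row[k] for row in matrix)  # column total
    recall = true_positives / sum(matrix[k])  # row total
    return 2 * precision * recall / (precision + recall)


f1_scores = [f1_from_matrix(matrix, k) for k in range(len(matrix))]
print([round(f, 2) for f in f1_scores])  # [1.0, 0.83, 0.77]
```

Because the harmonic mean is pulled towards the smaller of the two values, a class only scores a high f1 when both precision and recall are high.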

In [28]: dot_data = tree.export_graphviz(
    model, out_file=None, feature_names=feature_names)
graph = pydotplus.graph_from_dot_data(dot_data)

Out[28]: True

Data School (2018) Making sense of the confusion matrix [online]. [Accessed 16 February 2019]. Available at: (

