
Seite 1, Printdate: 28/04/2020, 16:00 Uhr


Visualise and Validate Machine Learning Data in VS Code
Subhead 1: Explainable models create trust

Visualization inside the code editor has produced some interesting and promising innovations in
recent years. These include, in particular, the steady integration of specialized machine learning
tools: the Jupyter notebook format in VS Code, MS Power BI, or TensorBoard, which TensorFlow
calls to display and record training results. This shows how far code visualization has already
come and how far it could still go.

by Max Kleiner
However, an immediate benefit is already clear today: areas such as robotics, expert systems,
mathematical optimization, anomaly detection, feature reduction or model-based control would be
easier to explain if the model could show the features it used for a decision directly in a
corresponding graphic. This report lays the groundwork for that.
The goal is to understand as precisely as possible why and how an AI makes certain decisions. With
image recognition algorithms, for example, a colored heat map shows the regions of an image that
are particularly relevant for its classification.

We start with a simple data set for a classification system and visualize the classifier's
decisions with a confusion matrix and an associated heat map. As an IDE, I use Visual Studio Code
with two configuration files: tasks.json and the project-specific settings.json, which includes the
test units and path details. Both files are shown as listings at the end of the article
(Listings 10 and 11).
As an introduction to VS Code with Python, I can recommend the tutorial [1] that Microsoft
published with the March 2020 release (version 1.44): "Tutorials for creating Python
containers and building Data Science models".
Now we start with the imported modules in Listing 1 and run our script logregclassifier2.py [2],
or [7] as a notebook.

Listing 1
# import the modules we need
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
End

The Dataset
After several ML projects, I wanted to write this article to share my experience and perhaps
help some of you get started with classification in machine learning. The data itself is
deliberately neutral and simple, so that the procedure stays as clear and understandable as
possible. I chose a (completely meaningless) series of numbers from 0 to 9 (the samples) as
training data, in order to classify a target as 0 or 1 (see footnote 1). When the labels (targets)
are known, this is called supervised learning. So we want to train the system until the low
numbers are likely to be classified as 0 and the high numbers as 1:
X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

y=[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Listing 2
# arrays for the input (X) and output (y) values:
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
End

We can use the np.arange(10) command to create an array that contains all the integers from 0
to 9. By convention, X is a two-dimensional array (matrix) and y a one-dimensional target
(vector). reshape(-1, 1) means that we have exactly one feature as a column, with one row per
sample. Features are the input variables that help the model find unknown patterns.
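
The effect of reshape(-1, 1) is easiest to see from the array shapes. The following minimal sketch uses only NumPy and the values from Listing 2; the helper name flat is an assumption made for illustration.

```python
import numpy as np

flat = np.arange(10)        # one-dimensional: shape (10,)
X = flat.reshape(-1, 1)     # -1 lets NumPy infer the row count: shape (10, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

print(flat.shape, X.shape, y.shape)
print(X[:3].tolist())       # first three samples, each a row with one feature
```

scikit-learn estimators expect X in this samples-by-features layout, which is why the flat array is reshaped before fitting.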
Next I define the model and train it right away with the fit method, which establishes a
relationship between the influencing variables (determinants) and the target:

Listing 3
# Once input and output are prepared, define the classification model.
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(X, y)
print(model)
End

Now the model is set up, so I can use predict() for a first classification, compute a score
and immediately create the confusion matrix for validation. Needless to say, implementing
ML-based solutions can lead to major cost savings, better predictability and higher
availability of the systems.

Listing 4
print(model.predict(X))
print(model.score(X, y))
# One false positive prediction: the fourth observation is a zero
# that was wrongly predicted as one.
print(confusion_matrix(y, model.predict(X)))
End
Real: [0 0 0 0 1 1 1 1 1 1]

Predict: [0 0 0 1 1 1 1 1 1 1]

Score: 0.9

Confusion Matrix:

0: [[3 1] :4

1: [0 6]] :6

Footnote 1: It could also be patients 0 to 9 who are taking a medical test.



And lo and behold, a false positive (false alarm) has crept in. The model mistakenly classified a
0 as 1, as if a fire alarm had gone off in a perfectly quiet situation. The confusion matrix
shows this as the 1 in the top right cell: an actual 0 that was predicted as 1.
Ideally, with a score of 100%, the matrix looks like this:
0: [[4 0] :4

1: [0 6]] :6
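
To make the four cells of the matrix tangible, the counts can also be tallied by hand. The following sketch is not part of the original script; it uses plain Python and the Real and Predict rows shown above, and the variable names are chosen for illustration only.

```python
# Labels and predictions copied from the output of Listing 4.
real    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predict = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Rows are the actual class, columns the predicted class.
cm = [[0, 0], [0, 0]]
for actual, predicted in zip(real, predict):
    cm[actual][predicted] += 1

(tn, fp), (fn, tp) = cm
accuracy = (tn + tp) / len(real)

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Score:", accuracy)
```

The single entry in the top right cell is the false alarm; the score is the share of samples on the main diagonal.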

The data set becomes an image


The next step is to prepare the matrix visually, in order to show the relationship between the
real data and the predicted data at a glance.

Listing 5
# confusion matrix of the trained model (rows: actual, columns: predicted)
cm = confusion_matrix(y, model.predict(X))
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
# write the count into each of the four cells
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()
End

Fig. 1: Confusion matrix with Pyplot


file: logreg2cm2.png

This graphic can also be made simpler and more modern with an additional library. We need
the Python library Seaborn, which is best installed with pip install seaborn directly in the
integrated terminal of VS Code. By the way, validate and translate are two funny words.

Listing 6
import seaborn as sns
# compute the confusion matrix:
cm = confusion_matrix(y, model.predict(X))
sns.heatmap(cm, annot=True)
plt.title('heatmap confusion matrix')
plt.show()
End

Fig. 2: Confusion matrix with Seaborn


File: heatmapconfusionmatrix.png

Class 0 has 3 correct cases (true negatives) and class 1 has 6 correct cases (true positives). The
user accuracy also shows a single false positive result. Such errors (consumer risk versus
producer risk) are also referred to as transfer errors or errors of type 1; errors of type 2 are
then the false negatives.
The heatmap() function from the seaborn library defines the type of diagram I am using; its
arguments control the appearance of the diagram. Let's take a look at the error analysis, which is
governed by the default probability threshold of 0.5. The switch from 0 to 1 happens too early, so
our model classifies sample 3 as 1 although it is a 0. Of course, the model's so-called
hyperparameters can be tuned to find a fairer distribution of the classification.
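
How the threshold enters the decision can be sketched with predict_proba. The example below rebuilds the model from Listing 3 so that it runs on its own; the classify helper and the stricter cut-off of 0.7 are assumptions for illustration, not values from the original script.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
model = LogisticRegression(solver='liblinear', random_state=0).fit(X, y)

# estimated probability of class 1 for every sample
proba_1 = model.predict_proba(X)[:, 1]

def classify(probabilities, threshold):
    """Label a sample 1 when its class-1 probability exceeds the threshold."""
    return (probabilities > threshold).astype(int)

default_labels = classify(proba_1, 0.5)   # the default cut-off
strict_labels = classify(proba_1, 0.7)    # hypothetical stricter cut-off

print(np.round(proba_1, 2))
print(default_labels)
print(strict_labels)
```

Raising the cut-off can only turn predicted 1s into 0s, so it is one possible lever against false positives, at the price of possibly more false negatives.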

It has to be said that the effect on discrete, dichotomous variables [0, 1] cannot be explained
and verified with classic linear regression analysis.

Hyperparameters
The current distribution with the associated classification looks like this:

Fig. 3: The first 3 samples are counted as 0 and the rest as 1.


File: class_logplot2.png

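Listing 7
sns.set(style='whitegrid')
# regplot expects one-dimensional data, so X is flattened;
# logistic=True requires the statsmodels package
sns.regplot(x=X.ravel(), y=model.predict_proba(X)[:, 1], logistic=True,
            scatter_kws={"color": "red"}, line_kws={"color": "blue"})
            # label=model.predict(X))
plt.title('Logistic Probability Plot')
plt.show()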

Listing 7 passes the estimated probability as the target to the regplot function. Not every
classifier exposes its internal probabilities. The naive Bayes classifier, named after the
English mathematician Thomas Bayes, is also probabilistic (see footnote 2); it is derived from
Bayes' theorem. The corresponding decision boundary can also be made visible for the analysis; it
helps to interpret the result or to find a better solver (see below):

Footnote 2: The basic (and therefore naive) assumption of the naive Bayes classifier is that the
features used are strictly independent of each other.

Fig. 4: Decision Boundary with the false positive (blue dot in white area)
file: classifier_decision2.png

Imagine a medical research institute proposing a screening program to test a large group of
people for the presence of a particular disease (which one does not matter for the moment). An
important counterargument against such a screening is the false positive results, which we have to
treat as a conditional probability:

T    precision   recall   f1-score   support   CM
0    1.00        0.75     0.86       4         [[3 1]
1    0.86        1.00     0.92       6          [0 6]]

Table 1: Classification Report

We can see from the table that there is 1 false positive and no false negative. This means that a
positive result corresponds to an actual disease in only 86% of all cases. The precision is
calculated as follows:

True positive / (True positive + False positive) = 6 / (6 + 1) = 0.8571 ≈ 0.86
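
The remaining figures in Table 1 follow from the same four counts. A small sketch in plain Python, assuming the confusion matrix [[3 1], [0 6]] from the table; the helper f1 is introduced here for illustration:

```python
# Counts read from the confusion matrix [[3 1], [0 6]] in Table 1.
tn, fp, fn, tp = 3, 1, 0, 6

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (positive): how many predicted 1s are real, how many real 1s were found.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (negative): the same ratios with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

for label, p, r in [(0, precision_0, recall_0), (1, precision_1, recall_1)]:
    print(label, round(p, 2), round(r, 2), round(f1(p, r), 2))
```

Rounded to two decimals, the two printed rows reproduce the precision, recall and f1-score columns of the classification report.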

It is therefore crucial to include the false positive cases when judging the accuracy of tests
(screenings). By the way, similar examples of conditional probability can be found on the website
"Lies with Statistics" [3]. I calculated and visualized one such case (from the field of
mammography), and there the false positives look more complex:

Fig. 5: Non-linear analysis of false positives in a hyperplane (Support Vector Machine)


File: cell_class_boundaries.png

Optimise with Optic


Now we want to bring the hyperparameters mentioned above into play. Several of them exist, and
they are part of the model evaluation:

• C is a positive floating point number (1.0 by default) that defines the relative strength of the
regularization. Smaller values indicate a stronger regularization.
• solver is a string ('liblinear' by default in older scikit-learn releases, 'lbfgs' since
version 0.22) that decides which solver is used to fit the model. Other options are 'newton-cg',
'lbfgs', 'sag' and 'saga'.
• max_iter is an integer (100 by default) that defines the maximum number of iterations the solver
performs during model fitting.

Listing 8
model = LogisticRegression(solver='liblinear', C=1, random_state=0).fit(X, y)
# show more model details
print(model)
# Output:
# LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
#                    intercept_scaling=1, max_iter=100, multi_class='warn',
#                    n_jobs=None, penalty='l2', random_state=0, solver='liblinear',
#                    tol=0.0001, verbose=0, warm_start=False)

The actual adjustment is simple and consists of using a different solver:

model = LogisticRegression(solver='lbfgs', C=1, random_state=0).fit(X, y)
print(classification_report(y, model.predict(X)))

In Listing 8 above we can see the preset model parameters, which can of course be changed.
However, the best value of a model hyperparameter for a specific problem cannot be determined
directly. You can use empirical values, copy values that were used for other problems, or search
for the best value by trial and error. I mainly use the value C (regularization), the kernel or
the solver to optimize.
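
Searching by trial and error can be sketched as a simple loop over candidate settings. The candidate lists below are hypothetical and the score is the training accuracy on the ten samples, so this only illustrates the mechanics, not a recommended tuning procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Hypothetical candidate values; a real project would choose its own grid.
solvers = ['liblinear', 'lbfgs']
c_values = [0.1, 1.0, 10.0]

scores = {}
for solver in solvers:
    for c in c_values:
        model = LogisticRegression(solver=solver, C=c, random_state=0)
        model.fit(X, y)
        # accuracy on the training data, as in Listing 4
        scores[(solver, c)] = model.score(X, y)

for (solver, c), score in scores.items():
    print(f"solver={solver:<10} C={c:<5} score={score:.2f}")

best = max(scores, key=scores.get)
print("best combination:", best)
```

With larger data sets, the same idea is usually combined with held-out data or cross-validation, for example through GridSearchCV from sklearn.model_selection.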

The difference between parameters and hyperparameters:

Model hyperparameters have to be defined before the training and cannot be learned by the
model (e.g. learning rate, hidden layers, regularization).
Model parameters are then learned by the model and are derived from the data (e.g. word
frequency, weighting, bias, variance).

Hyperparameters are the values we supply to the model, for example the number of hidden nodes
and layers, the input features, the learning rate or the activation function of a neural network,
while parameters are the values learned by the machine, such as weights and biases.
In machine learning, a model M with parameters and hyperparameters looks like this:

$Y \approx M_H(\Phi \mid D)$

where Φ are the parameters and H the hyperparameters. D is the training data and Y the output data
(the class labels in the case of a classification task). For our example, with the input X and the
target y, this reads $y \approx M_H(\Phi \mid X)$.
A model hyperparameter is a configuration that is external to the model and whose value cannot be
estimated from the data.

• They are often used in processes to help estimate model parameters.
• They are often specified by the practitioner.
• They can often be set using heuristics.
• They are often tuned for a given predictive modelling problem.

We cannot know the best value for a model hyper-parameter on a given problem. We may use
rules of thumb, copy values used on other problems, or search for the best value by trial and error.
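
In scikit-learn the distinction is visible on the estimator object itself. The following sketch assumes the model and data from Listings 2 and 3: get_params() returns the hyperparameters that were set before training, while coef_ and intercept_ only exist after fit() has learned them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression(solver='liblinear', C=1.0, random_state=0)

# Hyperparameters: chosen by us, available before any training.
hyper = model.get_params()
print("C =", hyper['C'], "| solver =", hyper['solver'], "| max_iter =", hyper['max_iter'])

# Parameters: exist only after fit() has estimated them from the data.
model.fit(X, y)
print("weight (coef_):", model.coef_)
print("bias (intercept_):", model.intercept_)
```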

Once we have all this information, we can decide which modelling strategy best fits the available
data and the desired output. For our number series, the results are now optimal in terms of the
quality of the algorithm, and they also stand up to a visual comparison with a decision tree.
In the domain of predictive maintenance, for example, there are multiple modelling strategies. I
name the two I have worked with most, each with the question it aims to answer and the kind of
data it requires:

1. Regression models to predict remaining useful lifetime (RUL)

2. Classification models to predict failure within a given time window

For this scenario, we need static and historical data, and every event must be labelled. Moreover,
several events of each type of failure must be part of the data set. Ideally, we build such
models when the degradation process is linear [9].

Fig. 6: Optimal decision of the classification


File: class_logplot3optsolver.png

The decision tree procedure in Listing 9 is a common option for regression or classification on a
multivariate data set. I can use the procedure, for example, to classify the solvency of customers
or to build a function that predicts false reports (see footnote 3) or fake news.
In practice, however, the procedure presents data scientists with major challenges regarding
interpretation and overfitting (memorizing the training examples), even though the tree itself
offers transparent and legible graphics. For the graphics I use the installed Graphviz 2.38 in
VS Code and an additional line in the code that appends the path information directly to the PATH
of the operating system. This way I can adapt the script to another version or platform directly
in the code.

Listing 9
import os
import unittest
# project-specific module from the author's workspace; not needed for the tree itself
# from converter import app, request
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# make the Graphviz (and Pandoc) binaries visible to the running script
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
os.environ["PATH"] += os.pathsep + 'C:/Program Files/Pandoc/'
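
Listing 9 only shows the imports and the path setup. How the tree itself could be trained and exported is sketched below, assuming the number series from Listing 2; the feature name 'number' is hypothetical, and the code is an illustration rather than the author's original script.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("score:", tree.score(X, y))

# out_file=None returns the DOT source as a string instead of writing a file;
# the Graphviz binaries are only needed later to render it as an image.
dot_source = export_graphviz(tree, out_file=None,
                             feature_names=['number'],   # hypothetical feature name
                             class_names=['0', '1'],
                             filled=True)
print(dot_source[:200])
```

The DOT text can then be rendered with the Graphviz tools that the PATH entries above make available.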

Footnote 3: Fraud detection is a knowledge-intensive activity.



Fig. 7: The confusion matrix no longer contains any misclassifications.


File: heatmapconfusionmatrix_solver.png

Important: the layout of the confusion matrix is unfortunately not standardized. In this example,
the truth ("Real", "Actual") is in the rows and the estimate ("Predict") in the columns (from
"Present" to "Target"), but depending on the software used, the two dimensions can be swapped. It
seems important to me to start the matrix at 0, i.e. to place the true negatives at the top left
as a standard, see Fig. 8. And of course, for an N-class problem the confusion matrix is an N×N
matrix, so it is not limited to binary classification.
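
Both points, the orientation and the N×N case, can be sketched with scikit-learn's confusion_matrix. The three-class labels below are hypothetical and serve only as an illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem (classes 0, 1 and 2).
actual    = [0, 0, 1, 1, 1, 2, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 2, 2, 0]

# scikit-learn convention: rows = actual class, columns = predicted class.
cm = confusion_matrix(actual, predicted, labels=[0, 1, 2])
print(cm)

# Tools that put the prediction in the rows show the transposed matrix.
print(cm.T)
```

Passing labels explicitly fixes the order of rows and columns, so class 0 always starts at the top left.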

Fig. 8: A standardized confusion matrix, File: cm_mock_template.png



Jupyter Notebook in VS Code

Here is a look at the integration of Jupyter [6]. Jupyter (formerly IPython Notebook) is an
open source project with which I can easily combine markdown text and executable, interactive
Python source code on one canvas, known as a notebook. Visual Studio Code supports
working with Jupyter notebooks as well as Python code files, and my experience with debugging and
code metrics has also been good.
To work with Jupyter notebooks, an Anaconda environment or another Python environment must be
activated in VS Code, and the Jupyter package must be installed in it beforehand. This makes it
possible to integrate graphics, write documentation and execute interactive code directly in
VS Code:

Fig. 9: Work with Jupyter


File: vscode_jupyter_librosa_demo2.png

Fig. 10: With the terminal, images can also be controlled interactively in code!
File: vscode_jupyter_librosa_demo3.png
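
Besides .ipynb notebooks, the Python extension of VS Code also treats `# %%` comments in an ordinary .py file as cell boundaries that can be run one by one in the interactive window. A minimal sketch; the cell contents are assumptions chosen to match the data set of this article.

```python
# %% [markdown]
# # Logistic regression demo
# A cell marked with [markdown] is rendered as text in the interactive window.

# %%
import numpy as np

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# %%
# Each cell can be run on its own; variables stay alive between cells.
share_of_ones = y.mean()
print("share of class 1:", share_of_ones)
```

Because the markers are plain comments, the same file still runs unchanged as a normal Python script.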

Listing 10 viper2\.vscode\settings.json
{
    "python.pythonPath": "C:\\Users\\Max\\AppData\\Local\\Programs\\Python\\Python37\\python.exe",
    "python.testing.pytestArgs": [
        "freshonion"
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.nosetestsEnabled": false,
    "python.testing.pytestEnabled": false,
    "python.testing.unittestArgs": [
        "-v",
        "-s",
        "./freshonion",
        "-p",
        "*test.py"
    ],
    "python.testing.promptToConfigure": false
}
End

Listing 11 \viper2\.vscode\tasks.json
{
    // See https://go.microsoft.com/fwlink/?LinkId=733558
    // for the documentation about the tasks.json format
    // build from older win8.1. to win10.2 by max
    "version": "2.0.0",
    "tasks": [
        {
            "label": "buildpython",
            "type": "shell",
            "command": "C:\\Users\\Max\\AppData\\Local\\Programs\\Python\\Python37\\python.exe",
            "args": ["${file}"],
            // "showOutput" stems from the older 0.1.0 format; 2.0.0 uses "presentation"
            "presentation": { "reveal": "always" },
            "problemMatcher": [],
            "group": {
                "kind": "build",
                "isDefault": true
            }
        }
    ]
}
End

Max Kleiner's professional environment covers machine learning, e-learning, OOP, UML and system
architecture, in which he works as a trainer, developer, consultant and publicist. His focus is on
training, IT security, databases and event-driven frameworks. As a lecturer and consultant at a
university of applied sciences and on behalf of a company, he has added microcontrollers and IoT
to his fields. His book "Patterns in C #", published in 2003, is still up to date with the Clean
Code Initiative.

https://basta.net/speaker/max-kleiner/

Links & Literature


[1] https://code.visualstudio.com/docs/python/data-science-tutorial
[2] http://www.softwareschule.ch/examples/logregclassifier2.py.txt
[3] https://de.statista.com/statistik/lexikon/definition/8/luegen_mit_statistiken/
[4] https://sourceforge.net/projects/cai/
[5] https://maxbox4.wordpress.com/blog/
[6] https://code.visualstudio.com/docs/python/jupyter-support
[7] https://github.com/maxkleiner/maXbox/blob/master/logisticregression2.ipynb

Literature of the Free Book:


[8] https://www.oreilly.com/programming/free/python-data-for-developers.csp
[9] https://towardsdatascience.com/how-to-implement-machine-learning-for-predictive-maintenance-4633cdbe4860

Appendix Source package for MS PowerBI: PBIDesktop_x64.msi

News

Python Data for Developers


A Curated Collection of Chapters from the O'Reilly
Data and Programming Library

Get the free ebook


Data is everywhere, and not just for data scientists. Developers are increasingly seeing it enter their
realm, requiring new skills and problem solving. Python has emerged as a giant in the field,
combining an easy-to-learn language with strong libraries and a vibrant community. If you have a
programming background (in Python or otherwise), this free ebook will provide a snapshot of the
landscape for you to start exploring more deeply.

https://www.oreilly.com/
