
Machine Learning Algorithms, Applications and Practices in Data Science
Kalidas Yeturu

August 30, 2019

Contents

1 Terminology
2 Compacted abstract
3 Introduction
4 Figures
5 Artificial Intelligence
5.1 Notion of state space & search
5.2 State space - Search algorithms
5.2.1 Enumerative search methods
5.2.2 Heuristic search methods - Example A* algorithm
5.3 Planning algorithms
5.4 Formal logic
5.4.1 Predicate or Propositional logic
5.4.2 First order logic
5.4.3 Automated theorem proving
5.4.3.1 Forward chaining
5.4.3.2 Incompleteness of the forward chaining
5.4.3.3 Backward chaining
5.5 Resolution by refutation method
5.6 AI framework adaptability issues

6 Supervised methods
6.1 Data sets
6.2 Linear Regression
6.2.1 Polynomial fitting
6.2.2 Thresholding and Linear Regression
6.3 Logistic Regression
6.4 Support Vector Machine - Linear Kernel
6.5 Decision Tree
6.6 Ensemble methods
6.6.1 Boosting algorithms
6.6.2 Gradient Boosting Algorithm
6.7 Bias Variance Trade off
6.7.1 Bias Variance Experiments
6.8 Cross validation & Model selection
6.8.1 Learning curves
6.9 Multi class and multi variate scenarios
6.9.1 Multi variate linear regression
6.9.2 Multi class classification
6.9.2.1 Multi class SVM
6.9.2.2 Multi-class logistic regression
6.10 Regularization
6.10.1 Regularization in gradient methods
6.10.2 Regularization in other methods
6.11 Metrics in machine learning
6.11.1 Confusion matrix
6.11.2 Precision-Recall Curve
6.11.3 ROC curve

7 Practical considerations in model building
7.1 Noise in the data
7.2 Missing values
7.3 Class imbalance
7.4 Model maintenance

8 Unsupervised Methods
8.1 Clustering
8.1.1 K-Means
8.1.2 Hierarchical Clustering
8.1.3 Density Based Clustering
8.2 Comparison of clustering algorithms over data sets
8.3 Matrix Factorization
8.4 Principal Component Analysis
8.5 Understanding the SVD algorithm
8.5.1 LU Decomposition
8.5.2 QR Decomposition
8.6 Data Visualization
8.6.1 Multi Dimensional Scaling
8.6.2 tSNE
8.6.3 PCA based visualization
8.6.4 Research directions

9 Graphical Methods
9.1 Naive Bayes Algorithm
9.2 Expectation Maximization
9.2.1 E and M steps
9.2.2 Sampling error minimization
9.3 Markovian networks
9.3.1 Hidden Markov Model
9.3.2 Latent Dirichlet Analysis

10 Deep learning
10.1 Neural network
10.1.1 Gradient magnitude issues
10.1.2 Relation to ensemble learning
10.2 Encoder
10.2.1 Vectorization of text
10.2.2 Auto encoder
10.2.3 Restricted Boltzmann Machine
10.3 Convolutional neural network
10.3.1 Filter Learning
10.3.2 Convolution layer
10.3.3 Max pooling
10.3.4 Fully connected layer
10.3.5 Popular CNN architectures
10.4 Recurrent neural network
10.4.1 Anatomy of simple RNN
10.4.2 Training a simple RNN
10.4.3 LSTM
10.4.4 Examples of sequence learning problem statements
10.4.5 Sequence to sequence mapping
10.5 Generative adversarial network
10.5.1 Training GAN
10.5.2 Applications of GANs

11 Applications and Laboratory exercises
11.1 Automatic differentiation
11.2 Machine learning exercises
11.3 Clustering exercises
11.4 Graphical model exercises
11.4.1 Exercise - Topics in text data
11.4.2 Exercise - Topics in image data
11.4.3 Exercise - Topics in audio data
11.5 Data visualization exercises
11.6 Deep learning exercises

12 Optimization

1 Terminology
Some of the terms common in the machine learning world and used in this chapter are listed with their meanings in (Table 1), in alphabetical order.

Term Meaning
AI Artificial Intelligence
AE Auto Encoder
ANN Artificial Neural Network
AUC Area Under the Curve
BFS Breadth First Search
CNN Convolutional Neural Network
CV Cross Validation
DBSCAN Density Based Spatial Clustering of Applications with Noise
DBN Deep Belief Network
DFS Depth First Search
DT Decision Tree
EM Expectation Maximization
FN False Negative
FP False Positive
FPR False Positive Rate
GA Genetic Algorithm
GAN Generative Adversarial Network
HMM Hidden Markov Model
LDA Latent Dirichlet Allocation
LOO Leave One Out (cross validation)
LSTM Long Short Term Memory
LU Lower Upper (triangular matrix decomposition)
MAE Mean Absolute Error
MDS Multi Dimensional Scaling
MSE Mean Squared Error
NB Naive Bayes algorithm
PCA Principal Component Analysis
PDDL Planning Domain Definition Language
PR Curve Precision Recall Curve
QR Orthogonal-triangular matrix decomposition
RELU Rectified Linear Unit (activation function)
RBM Restricted Boltzmann Machine
RL Reinforcement Learning
RMSD Root Mean Squared Deviation
RNN Recurrent Neural Network
ROC Receiver Operating Characteristic
SVD Singular Value Decomposition
SVM Support Vector Machine
TFIDF Term Frequency Inverse Document Frequency
TN True Negative
TP True Positive
TPR True Positive Rate
VAE Variational Auto Encoder

Table 1: Terminology and abbreviations

2 Compacted abstract
Data Science is an umbrella term referring to the concepts and practices of a subset of the topics under Artificial Intelligence (AI) methodologies. AI is a framework for defining a notion of intelligence in software systems or devices in terms of knowledge representation and reasoning methodologies. There are two main types of reasoning over data - deductive and inductive. The major class of machine learning and deep learning methods comes under inductive reasoning, where, essentially, missing pieces of information are interpolated from existing data through numerical transformations. Today, however, AI is mostly identified with deduction systems, while it is actually a comprehensive school of thought and a formal framework. The AI framework offers rigor and robustness to the solutions developed, and there is still scope for onboarding today's deep learning solutions to reap the benefits of that sturdiness. Data Science is about the end to end development of a smart solution, involving the creation of pipelines of activities for data generation, business decision making and solution maintenance with humans in the loop. Data generation is a cycle of activities involving collection, refinement, feature transformations, devising more insightful heuristic measures based on domain peculiarities, and iterations to enhance the quality of data driven decisions. Business decision making is a pipeline of activities involving the design of mappers from data to business decisions. The mappers are typically machine learning methods, fine tuned to give the best possible performance in a given period of study subject to business constraints, based on the quality and magnitude of the data and its subsets. Solution maintenance is a critical component that involves setting up alarms to detect when a given decision maker model no longer works as desired. The maintenance work calls for repair actions such as identifying data to gather, comparing metrics of different models, and monitoring the patterns and trends in the input data.

3 Introduction
Artificial intelligence is a formal framework of system representation and reasoning that encompasses inductive and deductive reasoning methodologies for problem formulation and solution design [1]. The newly emerged stream of study, Data Science, refers to the application of techniques and tools from statistics, optimization and logic theories to ordered or unordered collections of data. The field has acquired a number of key terms, and the terminology is slowly growing based on which technique is prominently used and what type of data it is applied to. Coming to the basic definitions, data is a time stamped fact, albeit noisy, recorded by a sensor in the context of a process of a system under study. Each datum is effectively represented as a finite number sequence corresponding to a semantic in the physical world. The science aspect of Data Science is all about symbolic representation of data, mathematical and logical operations over the same, and relating the findings back to physical world scenarios. On the implementation front, the engineering systems that store and retrieve data at scale, both in terms of size and frequency, are referred to as Big Data systems. The nature of input data is closely tied to the context, field, use case and discipline, and the word data is highly analogous to the word signal in the field of signal processing; the discipline of data science thereby inherits a majority of the techniques of signal processing, one popular example being the machine learning methodology.
Some of the characteristics of data, and of the noise therein, pertain to set theoretic notions, the nature of the data sources and popular domain categories. The characteristics include missing values, ordered or unordered sets, and homogeneous or heterogeneous groups of data elements. The sources that generate data include raw streams or carefully engineered feature transformations. Some of the popular domain categories of data include numeric, text, image, audio and video. The word numeric is an umbrella term here, covering essentially any type of communication between state of the art computer systems. The techniques that operate on data include statistical approaches, optimization formulations and automatic logical or symbolic manipulation. For numeric data, the techniques essentially deduce a mapping function that optimally maps a given input to the output, both represented as number sequences, typically of different sizes. The statistical approaches used here mainly fall into probabilistic generative and discriminative methodologies. The optimization techniques mainly involve discrete and continuous state space representations and error minimization. The logical manipulation techniques involve determining rules and deducing steps to prove or disprove assertions.
This chapter, belonging to a broader theme of practices and principles for data science, elucidates the process of mapping a given problem statement to a quantified assertion driven by data. The chapter focuses on machine learning algorithms, applications and practices. The mapping process first identifies the characteristics of the data and the noise, followed by defining and applying mathematical operations on the data, and finally infers findings that relate back to the given problem scenario and its context. Most of the popular and much needed categories of techniques as of today are covered to a good depth in the light of data science aspects within the scope of this chapter, while domain specific engineering aspects are referred to appropriate external content. The machine learning approaches covered here include discriminative and generative modelling methodologies such as supervised, unsupervised and deep learning algorithms. The data characterization topics include practices on handling missing values, resolving class imbalance, vector encoding and data transformations. The supervised learning algorithms covered include decision trees, logistic regression, ensembles of classifiers including random forests and gradient boosted trees, neural networks, support vector machines, the Naive Bayes classifier and Bayesian logistic regression. The chapter includes standard model accuracy metrics, the ROC curve, the bias-variance trade-off, cross validation and regularization aspects. The deep learning algorithms covered include auto-encoders, CNN, RNN and LSTM methods. Unsupervised mechanisms such as different types of clustering, the special category of reinforcement learning methodologies, and learning using EM in probabilistic generative models including GMM, HMM and LDA are also discussed. As industry emphasizes model maintenance heavily, techniques involving setting alarms on differences in data distribution and retraining, transfer learning and active learning methodologies are described. A cursory description of symbolic representation and reasoning in artificial intelligence, including state space representation and search, and first order logic unification and deduction, is also presented, with references to external texts handling the topic in full depth. The workings of the algorithms are explained over case studies on top of data sets popular as of today, with references to code implementations, libraries and frameworks. A brief introduction to currently researched topics including automatic machine learning model building, model explainability and visualization is also covered in the chapter. Finally the chapter concludes with an overview of the Big Data frameworks for distributed data warehousing systems available today, with examples of how data science algorithms use these frameworks to work at scale.

4 Figures

Figure 1: Artificial Intelligence framework - A real world problem or task is represented as a symbolic world problem or computer program composed of entities (e.g. data structures) and operations (e.g. algorithms). Encoding is the first step, identifying and formulating a problem statement. Experiments, inferences and metrics are iterated at much lower costs in the symbolic world. The inferences are related back to the real world through decoding. A real world problem itself is recursively defined as a conglomeration of multiple sub-problems, which may themselves be simulations, together with a synthesis of the results for a holistic inference.

Figure 2: Three types of convex error curves - hinge, logistic and squared loss functions are shown. Let t be the true value and y be the predicted value. The curves are shown for the true value t = 1. The squared loss ((t - y)^2, shown in red) has a global minimum at 1. The hinge loss (max(0, 1 - y*t), shown in green) has a bend at 1. The logistic loss function (log(1 + e^{-t*y}), shown in blue) penalizes a mismatch in the sign of the prediction and the true value. The L1 loss function (|y - t|, shown in light green) is convex but not differentiable; it has a kink at 1. A custom loss function, e^{|y-t|} - 1, shown in pink, has a kink at 1 as well.
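The curves in this caption are easy to tabulate directly; a minimal NumPy sketch (variable names are mine, plotting omitted):

import numpy as np

t = 1.0                                   # true value, as in Figure 2
y = np.linspace(-2.0, 3.0, 500)           # range of predicted values

squared  = (t - y) ** 2                       # global minimum at y = t
hinge    = np.maximum(0.0, 1.0 - y * t)       # bend at y = 1
logistic = np.log(1.0 + np.exp(-t * y))       # penalizes sign mismatch
l1       = np.abs(y - t)                      # convex, kink at y = t
custom   = np.exp(np.abs(y - t)) - 1.0        # the custom loss from the caption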

Figure 3: Polynomial fitting - An underlying function y = sin(x) + e^{-x^2} is used for generating the data. The original function is shown as the dashed green curve. Random noise from a uniform distribution, xi in U(-0.5, 0.5), is added to the curve and a data set {(x, y + xi)} is constructed, shown as green dots. Polynomials are fitted using ridge regression with the loss function L(w) = ||y - X^T w||^2 + ||w||^2, for varying degrees 1, 5 and 15, shown respectively in blue, navy and red. The illustration aims to show the overfitting nature of the higher degree polynomial, in red.
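A sketch of the kind of experiment behind this figure, using scikit-learn's Ridge with polynomial features (the underlying function, range and noise are taken from the caption; everything else is illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

x = np.linspace(0, 6, 80)
y_true = np.sin(x) + np.exp(-x ** 2)                  # assumed underlying function
y = y_true + np.random.uniform(-0.5, 0.5, x.shape)    # additive uniform noise

for degree in (1, 5, 15):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0))
    model.fit(x.reshape(-1, 1), y)
    y_hat = model.predict(x.reshape(-1, 1))   # the degree-15 fit chases the noise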

Figure 4: An example of the moons data set is shown here for 100 points, where 50% of the points belong to one class (red) and the remaining 50% to the other class (blue). The points are generated along the two halves of a circle and random Gaussian noise is added. The upper and the lower half circles are filled with points of the two classes. The two half circles are then cut apart, shifted laterally by the radius, and pushed inwards to give a complex intertwining of the points, making a challenging data set for any classifier.

Figure 5: Separation of the two classes of points in the moons data set using a degree one polynomial (a line). The colours in the 2D plot range from red to blue in proportion to the confidence score of the respective class. The colours in ambiguous regions are light red and light blue; the farther a region is from the boundary between the two classes, the darker its colour.

Figure 6: A decision tree classifier on the moons data set of two classes of points - the red and the blue ones. The X-axis and the Y-axis correspond to the two dimensions of the points (here they are 2D). Given that both dimensions are of numeric type, a dimension is split about its mean value, one of the partitions is selected and split about its mean, and so on, along similar lines to binary search, albeit in 2D and for splitting points. The whole area is thus split into rectangles (as the data is 2D). It is important to note that the width and height of each rectangle are, for some k in N, a 1/2^k fraction of the total value range along each dimension.

Figure 7: The red curve in the left and the right plots is the true sine curve from which data is sampled uniformly to create (x, y) tuples. The problem is of regression type: given the x coordinate, the task is to predict the corresponding y coordinate. The left plot shows attempts to fit a degree 0 polynomial (i.e. of the form y = b) and the right plot shows attempts to fit a degree 1 polynomial (i.e. of the form y = a_0*x + a_1). Each of the plots has 100 models, corresponding to the number of subsets sampled from the true sine curve; in this plot, each of the 100 models has just 2 points in its training set. The task is to assess the behaviour of the average model for its mean and variance.

Figure 8: For each column, i.e. each x-coordinate, the average value is calculated across the models. The resulting average models, for the degree 0 polynomials (left plot) and the degree 1 polynomials (right plot), are shown in dark red. The average models are symmetric. One can sense from the slope and the intercept of the line (degree 1 polynomial) that the net variation about the line is minimized.

Figure 9: The plots show the variation about the mean mu(x) along each column, i.e. each x-coordinate. Each of the 100 models (in either plot) is built with just 2 points in its training set. The standard deviation sigma(x) is calculated for each x-coordinate value over the set of values computed by each of the models. A pink line is drawn from mu(x) - sigma(x) to mu(x) + sigma(x) to illustrate the amount of variation. One can note that the variation is higher for the degree 1 polynomial than for the degree 0 polynomial. This is expected behaviour because the space explored by the degree 1 polynomials is larger, or in other words, the capacity of the degree 1 polynomial is higher.

Figure 10: The plots show the impact of adding more training data points to each and every one of the models. Each of the 100 models (in either plot) is built with 100 points in its training set (instead of just 2 as in Figure 9). As the x-coordinates are uniformly sampled from a given range, the larger the sample, the greater the similarity between the sample sets. The reduction in variance is due to the increase in the training set sizes, which makes each subset emit a similar individual model.

Figure 11: A decision tree for a regression problem over 1-dimensional points is shown here. For a given x-coordinate, its corresponding y-coordinate is to be estimated using the decision tree approach. In this figure, a split is chosen along the x-axis where the sum of the mean squared errors in the two partitions is minimized. A tree of depth k splits the x-axis into 2^k partitions, first splitting about the mean of the x-coordinate values and then recursively splitting each half. In this figure, the blue lines show a tree of depth 1, which splits the x-coordinates into two zones. The light yellow lines show a tree of depth 2, which splits the x-axis into 4 zones.

Figure 12: Contours of the curves (|x|^p + |y|^p)^{1/p} = c for various norms, p = 0.5, 1, 2, 3, 100, are shown in maroon, green, red, blue and purple respectively. Note that for a given value c, the contours for higher values of p in the Lp norm tend toward a maximal square, i.e. with corners where all dimensions have their highest magnitude values. Observe closely that, though for p = 1 the contour has a square shape, it is still not maximal as just defined. A weaker argument for why L1 begets sparsity compared to L2 is to assume that optimization converges to solutions at the corners. (A stronger argument using partial derivatives is provided in the section on regularization.)

Figure 13: Curves fitted for the sinusoidal data y_hat = sin(x) + 0.1 * U[0, 1] using L1 and L2 regularizations for least squares error minimization are shown here. The x-coordinate is expanded in dimensionality to a degree 10 polynomial, i.e. x -> (x^0, ..., x^{10}), and predictions use the loss function L(w) = sum_x (w . x - y_hat)^2 + Lp(w), where w is also 10 dimensional. The actual data and the values predicted using L1 and L2 regularizations are shown in green, red and blue respectively. In this scenario L2 is better, as in the case of L1 several components of the w vector are zero.

Figure 14: Moons data for a two class classification problem is generated using scikit-learn's make_moons() function. A logistic regression classifier is fit using the LBFGS optimizer. The classifier's predictions are refined further by imposing thresholds on the predicted probabilities of points in the test data set. The plot of the precision and recall scores across these thresholds, called the PR curve, is depicted here.

Figure 15: The data set, the classifier and the process of applying thresholds are identical to those described in (Figure 14). However, here the plot is between the False Positive Rate (FPR = FP/(FP + TN)) and the True Positive Rate (TPR = TP/(TP + FN)). This plot of FPR versus TPR values at various thresholds is called the ROC curve. An important summary of this curve is the Area Under the Curve (AUC), which is also indicated in the plot here.
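Both the PR curve of Figure 14 and the ROC curve of Figure 15 can be produced with scikit-learn's metric utilities; a minimal sketch (sample sizes, noise level and split ratio are assumptions):

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]        # probability of the positive class

precision, recall, pr_thresholds = precision_recall_curve(y_te, scores)
fpr, tpr, roc_thresholds = roc_curve(y_te, scores)
print("AUC =", auc(fpr, tpr))                 # area under the ROC curve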

Figure 16: The learning curve over various training set sizes is shown here for the moons data set and the classifier used in (Figure 14). Cross Validation (CV) with stratified shuffling over 5 random splits into learning and validation subsets, in 80-20% proportions, is carried out on a given training data set. Each execution of the CV results in a mean accuracy score on the learning and the validation subsets. CV is carried out on random subsets of the training data of different sizes, and the mean accuracy scores on the learning and the validation sets are shown in green and red respectively. The scores generally increase as the learning set size increases.

Figure 17: The validation curve over various degrees of polynomial feature expansion of the input dimensionality is shown here for the moons data set and the classifier used in (Figure 14). The data, of the form (x_1, x_2, label), is expanded into a higher dimension k as the set of all distinct terms of the form {x_1^a * x_2^b | forall (a, b) : 0 <= (a + b) <= k}. Accuracy scores are computed on the learning and the validation sets and shown in green and red respectively, for values of the expansion degree parameter from 1 to 6.

Figure 18: The confusion matrix for the two class classification problem on the moons data set is shown here. The steps followed for data set generation and the classifier are identical to the process used in (Figure 14). For the classes 0 and 1, the X-axis of the plot is the predicted class and the Y-axis is the true class. The elements of a true class in a row are spread across the columns, and the elements of the matrix are normalized row-wise, i.e. the fractions along a row sum to 1. The only true predictions are along the diagonal, i.e. the (i, i)-th elements of the matrix; all off diagonal elements along a row are wrong predictions. The more correct a class, the darker the blue hue of its cell in the plot of the confusion matrix.
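A row-normalized confusion matrix of this kind can be computed as below; a sketch on the same moons set-up (sample size, noise and split are assumptions):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
y_pred = LogisticRegression(solver="lbfgs").fit(X_tr, y_tr).predict(X_te)

cm = confusion_matrix(y_te, y_pred).astype(float)
cm /= cm.sum(axis=1, keepdims=True)   # row-normalize: each true class sums to 1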

Figure 19: A comparison of classifiers on different types of data distributions is depicted here. The data distributions are of the moons, concentric circles and linearly separable types, generated using scikit-learn's make_moons(), make_circles() and make_classification() functions, and are shown along the first three rows top to bottom. The data sets are of the form {(x_1, x_2, y) | y in {0, 1}}, to which a small amount of noise is purposefully added to mimic a more realistic scenario where a small fraction of points is inseparable. For the moons and the circles data points, Gaussian noise with zero mean and variance 0.3 is added to each of the coordinates, and for the linearly separable case uniform noise from U[0, 2] is added. The first column shows the raw data distribution, with red and blue points for the two classes, and the following columns from left to right show Nearest Neighbours, Linear SVM, Neural Net, Decision Tree, Random Forest and AdaBoost respectively.

Figure 20: Visualizations of some of the major clustering algorithms against different data distributions (shapes) are shown here. Four types of 2-d data distributions are considered (row-wise) - (i) S-curve, (ii) two concentric circles, (iii) moons and (iv) three blobs. Three clustering algorithms are evaluated on each data set (column-wise) - (i) Agglomerative, (ii) K-Means and (iii) DBSCAN. Both Agglomerative and K-Means are told to determine 3 clusters a priori, and for DBSCAN a radius of 0.1 units and a minimum of 100 samples are specified. As for the data sets, 1000 points are generated using scikit-learn's make_s_curve, make_circles, make_moons and make_blobs utility functions, with a noise of 0.1 for the first three. It is important to note that changing the parameter configurations of the algorithms and the data set distributions affects the outcome of the clustering; what is shown here is just an illustration of the intuition behind these algorithms. The parameters of the algorithms are chosen by hand after several iterations of visualization, which will not be the case in a real world scenario. One can note that arbitrarily shaped clusters are determined well by DBSCAN, albeit with a few noise points; however, determining the parameters of the algorithm is the main challenge. K-Means recovers natural globular groups of points when they are well separated, but it injects unwanted clusters when the points are not globular. Agglomerative clustering performs similarly to K-Means, however it is sensitive to linkages between natural groups of points. Though the illustration is for 2 dimensions, the concepts extend naturally to multi-dimensional points. Feature expansion or kernel based pairwise distances are applicable as well for points in high dimensional spaces.
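A sketch of the comparison grid, using the parameters quoted in the caption (other settings, such as the circles factor, are assumptions):

from sklearn.datasets import make_moons, make_circles, make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

datasets = {
    "moons":   make_moons(n_samples=1000, noise=0.1)[0],
    "circles": make_circles(n_samples=1000, noise=0.1, factor=0.5)[0],
    "blobs":   make_blobs(n_samples=1000, centers=3)[0],
}
algorithms = {
    "kmeans":        KMeans(n_clusters=3),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan":        DBSCAN(eps=0.1, min_samples=100),   # parameters from the caption
}
for dname, X in datasets.items():
    for aname, algo in algorithms.items():
        labels = algo.fit_predict(X)   # DBSCAN marks noise points with label -1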

Figure 21: A dendrogram is constructed for 10 points occurring in 3 groups of 3, 3 and 4 points. The dendrogram recovers the natural grouping of the points by having leaves occurring in these numbers, shown by leaf nodes in 3 different colours. Agglomerative clustering is carried out with the single linkage option and the Euclidean distance metric between pairs of points.

Figure 22: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) based lower dimensional feature transformations of data points in 8 dimensions are depicted here. A set of 1000 data points is generated from 4 natural groups of points, shown by 4 distinct colours (leftmost figure). The points are then transformed into an 8 dimensional space by the mapping (x_1, x_2) -> (R(), R(), x_1, x_2, R(), R(), R(), R()), where R() is a function that generates a Gaussian random number 1 + 0.5 * N(0, 1). The input data is then transformed using PCA into a lower dimensional space by specifying the number of components as 2, i.e. considering only the top 2 eigenvectors (middle figure). The dot product of each eigenvector with an 8 dimensional point generates a coordinate. One can observe that the original 4 groups of points in the 8 dimensional space are recovered after the PCA based feature reduction, as indicated by the 4 distinct coloured groups. Similar to the PCA exercise, an NMF based factorization is carried out for 2 components, i.e. X_{1000x8} = W_{1000x2} x H_{2x8}. Each of the data points is then transformed using the H matrix and the points are plotted in 2 dimensions. One can observe that both PCA and NMF are able to recover the natural groups of points in the input data, despite the addition of noisy dimensions.
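The PCA and NMF transformations of the caption can be sketched as follows (the construction of the noisy dimensions is approximated, and the shift making the input non-negative for NMF is my addition):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, NMF

X2, groups = make_blobs(n_samples=1000, centers=4, random_state=0)
R = 1.0 + 0.5 * np.random.randn(1000, 6)      # Gaussian noise dimensions, 1 + 0.5*N(0,1)
X8 = np.hstack([R[:, :2], X2, R[:, 2:]])      # embed (x1, x2) at positions 3 and 4

X_pca = PCA(n_components=2).fit_transform(X8)             # top-2 eigenvector projection
X_nmf = NMF(n_components=2).fit_transform(X8 - X8.min())  # NMF needs non-negative input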

Figure 23: Principal Component Analysis is used to determine the major axis of a distribution of 1000 points along a parallelogram. The top eigenvector is plotted as a line in the 2 dimensional space. The centre of the data is shown by a red dot.

Figure 24: This figure illustrates the application of the Multi Dimensional Scaling (MDS) algorithm for recovering the structure of input data. A sample of 100 points is generated in the form of an S curve using scikit-learn's make_s_curve function. All the points are centered about their mean and then rotated by -60 degrees. The rotated set of points is used for generating all pairwise distances, which are fed to the MDS algorithm to recover the structure of the points in 2D. The actual points are shown in the leftmost figure, the center figure shows the rotated points, and the last figure shows the recovered structure of the points. One can appreciate the usefulness of the MDS algorithm for visualizing higher dimensional points in a 2D plane to obtain intuition regarding the nature of the data.


Figure 25: Markov and Hidden Markov Models are illustrated using a hypothetical boy's feelings of being loved or not loved by the girl of his dreams. The observed states are a phone call or a smile. The hidden states are loved or not loved. The figure (A) on the left shows a Markov model of transitions between direct observations. The figure (B) on the right shows a Hidden Markov Model where hidden states are introduced, shown by dotted outlines in blue and green.

Figure 26: Illustration of a neural network with 3 layers. A layer is counted only if it has weights to be learned from data; the input layer is therefore not counted, though visibly 4 layers are seen. The notation H_{ij} denotes the j-th neuron of the i-th hidden layer. The j-th neuron of the output layer is denoted O_j. Each edge in the neural network carries a weight which is learned during the training process. In this neural network, the input layer is 4 dimensional, hidden layer 1 is 3 dimensional, hidden layer 2 is 5 dimensional and the output layer is 2 dimensional.

Figure 27: A Restricted Boltzmann Machine (RBM) for a 4 dimensional input and one hidden layer with 3 neurons is depicted here. Each edge in the neural network corresponds to a parameter which is learned during the training phase. The neurons X_j and X'_j correspond to the actual input and the input reconstructed after transformation by the hidden layer. The RBM algorithm minimizes the net reconstruction error by the contrastive divergence mechanism.


Figure 28: The figure (A) on the left shows the basic cell of a Recurrent Neural Network (RNN). The context from the previous part of the sequence is encoded by the previous hidden vector, annotated as #A, H_A. The input vector X_{1xi} is indicated by an orange rectangle. The input is multiplied with a dimensionally compatible matrix Wxh_{ixh}, shown in light red, to yield a hidden vector representation (#B annotation), H_B = X x Wxh, of dimension 1xh. The hidden vector is updated as H_C = H_A + H_B (shown with the #C annotation) and is output as the hidden vector of this unit. The H_C vector is then transformed to emit the output, Y_{1xo} = H_C x Why_{hxo}. The reader should note that there are many customizations of this basic unit; the depicted cell is only for illustration. The figure (B) on the right shows the batch based mechanism of hidden vector propagation between input batches. The dimensionality of the hidden vector is H_{BxN}, where B is the batch size and N is the number of neurons. Each batch has B input elements, where each input is I dimensional. The output is generated as one 1xO vector for each input of the batch, i.e. Y_{BxO}.


Figure 29: In figure (A), the overall architecture of Generative Adversarial Networks (GAN) is depicted. Both the generator (G) and the discriminator (D) neural networks are trained simultaneously in the GAN framework, however using two different loss functions. The filters in the G and D networks are updated separately based on the loss function values. The G network computes its loss over fake data points being correctly recognized as fake by the discriminator, i.e. the lower the loss, the more likely a fake data point passes as a real one. The D network computes its loss over data points being correctly classified for their true labels, be they fake or real. The G network up-scales a random input vector to the data dimension, shown as a 1xn to mxn transformation. In figure (B), an illustration of the bed-of-nails approach to up-scaling is shown, in which an input X_{2x2} data point is scaled to a 5x4 data point and convolved using an F_{3x3} filter. The blank values are all padded with a specified value, typically zero. The filter is then trained the same way as a regular convolution filter, however on the up-scaled 5x4 data. The spacing between data points here is 2, corresponding to a stride of 2 along the horizontal and vertical axes.


Figure 30: In figure (A) an illustration of automatic differentiation of a multiplication operation over two operands is shown. The task of differentiation itself is carried forward recursively to the sub-trees of the operands. In figure (B) an illustration of how a loop is handled in automatic differentiation is shown on the left; the recurrence equations and updated values are shown in the other two boxes. The initial equation is y = x^4 and the iterations are y = y * x + x^2, for which dy/dx is determined. Recurrence equations for the value y^(k) of y in the k-th iteration and the value y'^(k) of its derivative in the k-th iteration are defined. For a starting value of x = 3, the derivative y'^(2) = 1491, as described in the intermediate steps.
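The recurrence can be checked numerically by carrying the value and the derivative forward together, in the spirit of forward-mode automatic differentiation; a minimal sketch:

x = 3.0
y, dy = x ** 4, 4 * x ** 3            # initial value y = x^4 and its derivative

for _ in range(2):                    # two iterations of y <- y*x + x^2
    # product rule: d(y*x + x^2)/dx = y'*x + y + 2x, using the previous y and y'
    y, dy = y * x + x ** 2, dy * x + y + 2 * x

print(dy)                             # 1491.0, matching y'(2) in the caption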

5 Artificial Intelligence
Artificial intelligence is a broad spectrum of algorithms, software design paradigms and thinking methodologies in which a notion of intelligence and automation is defined over software components and connected devices to cater to the needs of a given domain (Figure 1). Intelligence is programmed over knowledge representation and reasoning methodologies and is broadly categorized into two major types - deductive and inductive reasoning. Knowledge representation involves symbols, and operators defined over them, to represent facts about a given world and its transformations. The symbolic representation and operators translate directly to the objects and operations of a system architecture or simulation set up that can then be executed to evaluate and generate predictions. Deductive reasoning involves the programmatic combination of facts to generate newer facts subject to the constraints of logic or a given set of rules, mainly involving notions of state and search. Inductive reasoning caters to the interpolation of missing values based on numerical combinations of other elements; the whole of machine learning comes under this category. When the missing values to be interpolated correspond to a specially named attribute called the target or label, this leads to a major class of methods called supervised learning. Artificial intelligence offers a framework on top of which diverse methods from the deductive and the inductive reasoning schools of thought can interplay to provide a solution to a given problem scenario. Owing to the high cost of labelling a datum, it is important to automatically understand the structure of the data; the class of algorithms that compute patterns and sub-patterns in data without the need for any label information is called unsupervised learning. Unsupervised methods, and some of the supervised methods, have embraced deep learning to learn the structure of data and to map between feature spaces. For over a decade, deep learning has been predominantly used to automatically learn sub-structures in data. While artificial intelligence provides a systematic conceptual framework for defining intelligence and developing rigorous solutions, there is still major scope to on-board continuous state space representation systems such as deep learning into this generic framework and reap the benefits of robustness.

5.1 Notion of state space & search


The AI methodology starts by defining a state, and a notion of navigation of the state space, for any given domain. Some states, called final states, are desirable, and a search happens for them. A state can be thought of as a snapshot of parameter values - for instance a weight vector or a matrix of values. A state space is a conceptual notion of states and their neighbours - for instance, notional connections between weight vectors whose elements are slightly modified. Navigation in the state space corresponds to generating the neighbour states, or the next states, of a given current state, and keeping track of the visited states. A search corresponds to identifying a desirable state satisfying domain conditions, such as the best heuristic score so far. While the very thought of a state space may trouble a beginner over how to store an infinite number of states, in reality the space is only notional or conceptual; states are unveiled only as required, and typically only a minute fraction of the state space is explored, intelligently.

Example of continuous state space: One of the popular state space search methods is the gradient descent algorithm. For instance, consider the scenario where a hyperplane is fit through a collection of points in a k dimensional space. We are fitting a hyperplane, a vector w, such that

y = w . x

where x is the input data. The idea is to find the w that minimizes a loss function L(w),

w* = arg min_w L(w)

The weight vector is updated along the negative gradient direction of the loss function, so that L(w_new) < L(w_old), where alpha denotes the learning rate:

w_new = w_old - alpha * grad L(w) at w = w_old

The AI framework considers each w vector as a state. The next state from a given state is computed by the gradient descent step. The navigation will never fall into an infinite loop, as L(w) only decreases over the iterations of states. The search for the optimal w stops based on domain constraints - here, a limit on the number of iterations or on the change in L(w) values. The example given here is also an instance of the hill climbing method, where -L(w) is the notional hill that the iterations over w climb.
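A minimal sketch of this state-space view of gradient descent, for a squared loss and a fixed learning rate (all names and stopping constants are illustrative):

import numpy as np

def gradient_descent_fit(X, y, alpha=0.01, iters=1000, tol=1e-8):
    """Fit y ~ X.w by moving through weight-vector states along -grad L."""
    w = np.zeros(X.shape[1])                 # start state
    prev_loss = np.inf
    for _ in range(iters):
        residual = X @ w - y
        loss = (residual ** 2).mean()        # L(w), mean squared loss
        if prev_loss - loss < tol:           # domain stopping constraint
            break
        prev_loss = loss
        grad = 2 * X.T @ residual / len(y)   # gradient of L at w
        w = w - alpha * grad                 # next state: w_new = w_old - alpha*grad
    return w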
The above example is over a set of continuous parameters, i.e. the components of the w vector. More common examples in AI text books introduce the idea via discrete state spaces. A discrete state space involves variables that assume discrete values, such as boolean or multinomial values, or a countably finite set of numeric values. Movement from one state to another is equivalent to changing the value of one discrete variable between the states. A set of states is designated as final states.

Example of discrete state space: Consider the classic set of aptitude test problems on river crossing. One of the problems is moving a Lion, a Goat and a Cabbage from the left to the right bank of a river. The boat has capacity to accommodate only the man and any one of the objects above. The constraints are to avoid situations where one entity eats another - (i) if the Goat and the Cabbage are left alone, the Goat will eat the Cabbage, and (ii) if the Lion and the Goat are left alone, the Lion will eat the Goat.

Consider providing a state space representation. Let us abbreviate the entities as L - Lion, G - Goat, C - Cabbage and M - Man. The words Left and Right denote the left bank and the right bank respectively. The state of the system depicts the Left and Right banks at a given point in time.

The initial state is (Left,L,G,C,M),(Right,-,-,-,-). A next state can be (Left,L,G,-,-),(Right,-,-,C,M). However this state is not valid, as it violates the constraint that the Lion will eat the Goat if they are not accompanied by the Man. The desired final state is (Left,-,-,-,-),(Right,L,G,C,M). It is left as an exercise to the reader to identify a correct sequence of states; a sketch of the state representation in code follows below.

In certain other problems, a given final state may not be reachable at all from a given start state, or there may be a more optimal sequence of states. A cost may be associated with each move, and the idea is to find the least cost sequence of moves. There are different types of methods for performing state space search, and each procedure can be customized for a given problem scenario.
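A sketch of the state representation and the successor (NEXT) function for this puzzle; running a search such as BFS over these successors, while tracking visited states, recovers a valid plan (all names are mine):

ENTITIES = {"L", "G", "C"}                   # Lion, Goat, Cabbage; "M" is the man

def is_valid(*banks):
    """A bank is unsafe if the man is absent while a predator-prey pair is present."""
    for bank in banks:
        if "M" not in bank and ({"G", "C"} <= bank or {"L", "G"} <= bank):
            return False
    return True

def next_states(state):
    left, right = state
    here, there = (left, right) if "M" in left else (right, left)
    moves = [set()] + [{e} for e in ENTITIES & here]   # man alone, or with one entity
    for cargo in moves:
        new_here = here - cargo - {"M"}
        new_there = there | cargo | {"M"}
        if is_valid(new_here, new_there):
            pair = (new_here, new_there) if "M" in left else (new_there, new_here)
            yield (frozenset(pair[0]), frozenset(pair[1]))

start = (frozenset({"L", "G", "C", "M"}), frozenset())
goal = (frozenset(), frozenset({"L", "G", "C", "M"}))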

5.2 State space - Search algorithms


Based on the notion of state space navigation and keeping track of the explored states, there are different types of search algorithms, as categorized below.

• Deterministic search

– Enumerative methods such as:
∗ Depth First Search (DFS)
∗ Breadth First Search (BFS)
∗ Depth First Iterative Descent (DFID)

– Heuristic methods such as:
∗ Best First Search (BeFS)
∗ Tabu Search (TabuS)
∗ Beam Search (BS)
∗ Hill Climbing (HC)

• Randomized search methods such as:

– Simulated Annealing
– Iterated Hill Climbing
– Genetic Algorithms
– Ant Colony Optimization

We discuss some of the methods in each category to give insight into the AI framework introduced above.

5.2.1 Enumerative search methods

Enumerative methods obtain a list of next states from a given state and explore them systematically, with mechanisms to avoid falling into an infinite loop. The algorithms in this section are given as pseudocode, with the characteristic symbolic notations explained in Table 2.

The first algorithm to look at is Depth First Search (DFS). Starting from a given state, the next states are identified, the search moves to one of them, and the process repeats. A to-do list Ω is maintained to keep track of which states are yet to be processed. If, in the middle of processing, a final state is reached, the search terminates and returns it. Otherwise, when the final state is not reachable, the search exhausts itself once the explored set χ has covered all the states. The pseudocode is provided in Algorithm 1.

The space complexity of the algorithm, expressed in terms of the cardinality | · | of the set of open states, is |Ω| ∝ O(D × B), where D is the depth of the current state and B is the maximum branching factor.

The algorithm for Breadth First Search (BFS) is just a modification of Line 13 of Algorithm 1: instead of prepending the new states to the Ω list, the new states are appended, and the line becomes Ω ← Ω ⌢ Nx. This change drastically alters the space complexity to an exponential rate, |Ω| ∝ O(B^D), where B and D are the branching factor and the depth of the current node respectively.

Note that the only difference in behaviour is in terms of |Ω|; the time to explore the full tree is the same for both the DFS and BFS algorithms, as is the size of the visited set χ.

Based on the premise that the final state exists not at the far end of the search space but somewhere in the middle, it is worthwhile to combine the space-wise benefits of DFS with the exploratory nature of BFS. The combination leads to another algorithm, called Depth First Iterative Descent (DFID), where repeated invocations of DFS happen at increasing depths as the iterations progress (Algorithm 2).

5.2.2 Heuristic search methods - Example A* algorithm

DFS and BFS are blind searches, where no importance is given to assessing where the final state exists. As state spaces grow exponentially even for moderately complex models, choosing the right neighbours - those that take the search closer to the final goal state - becomes a major factor in the practical usability of the solutions.
Table 2: Symbols and meaning for pseudocodes of AI algorithms
Notation        Meaning
Small case letters   States
x.π             Predecessor state of x
Ω               List of states yet to explore
χ               Set of states already explored
Ω[i]            Access the i-th element of the list
FINAL(x)        Tests if x is a final state
NEXT(x)         Next states of state x
x.δ             Depth of the current state
A ⌢ B           Concatenation of B to A
x.cost          Cost of the path from the start till x
edge(x, y)      Cost of the edge from x to y

Algorithm 1 Depth First Search
Require: s /*Start state*/
1: s.π ← NIL
2: Ω ← [s]
3: χ ← {}
4: while |Ω| > 0 do
5:   x ← Ω[0]
6:   if FINAL(x) then
7:     return x
8:   end if
9:   Ω ← Ω − x; χ ← χ ∪ {x}
10:  Nx ← NEXT(x)
11:  Nx ← Nx − χ − Ω
12:  Set, (∀y ∈ Nx) : y.π ← x
13:  Ω ← Nx ⌢ Ω
14: end while
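A direct Python rendering of Algorithm 1; flipping the breadth_first flag realizes the Line 13 modification that turns DFS into BFS (function and variable names are mine):

def depth_first_search(start, is_final, next_states, breadth_first=False):
    """Search from start; returns the found state and the predecessor links."""
    parent = {start: None}          # the x.π links
    open_list = [start]             # Ω
    explored = set()                # χ
    while open_list:
        x = open_list.pop(0)        # take Ω[0] and remove it from Ω
        if is_final(x):
            return x, parent        # a path is recoverable via parent links
        explored.add(x)
        fresh = [y for y in next_states(x)
                 if y not in explored and y not in open_list]
        for y in fresh:
            parent[y] = x
        # prepend for DFS (Line 13); append for BFS
        open_list = open_list + fresh if breadth_first else fresh + open_list
    return None, parent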

Algorithm 2 DFID - Depth First Iterative Descent
Require: s /*Start state*/
1: flag ← True /*Keep iterating as long as new states exist to explore*/
2: ∆ ← 0 /*Depth to explore in the state space tree*/
3: while flag == True do
4:   ∆ ← ∆ + 1 /*Incrementally search for higher depths*/
5:   flag ← False /*Start pessimistically*/
6:   s.π ← NIL
7:   s.δ ← 0 /*Current depth*/
8:   Ω ← [s]
9:   χ ← {}
10:  while |Ω| > 0 do
11:    x ← Ω[0]
12:    Ω ← Ω − x; χ ← χ ∪ {x}
13:    if x.δ > ∆ then
14:      continue /*Ignore deeper states*/
15:    end if
16:    if FINAL(x) then
17:      return x
18:    end if
19:    Nx ← NEXT(x)
20:    Nx ← Nx − χ − Ω
21:    flag ← (|Nx| > 0) /*Evidence that new states yet exist*/
22:    Set, (∀y ∈ Nx) : y.π ← x and y.δ ← x.δ + 1
23:    Ω ← Nx ⌢ Ω /*Prepend*/
24:  end while
25: end while

One of the major types of heuristic search is the A* algorithm [2]. Here the best path so far, in terms of least cost, is stored as a chain of predecessor links from each state back to the start state. Among the open states, the node with the least estimated total cost is chosen. The heuristic function must always underestimate the actual cost to the goal state from any given state. It is also required that all edges have positive weights, which alleviates the need to maintain the χ set. The A* algorithm is shown in Algorithm 3.

An example of a heuristic function H() is any machine learning model that takes a vector representation of a state and outputs a score indicative of how close it is to the goal state. A heuristic function can also be a hand composed metric over the individual elements of a state representation.

Algorithm 3 A* Algorithm
Require: s, H() /*Start state and heuristic function*/
1: s.π ← NIL
2: s.cost ← 0
3: Ω ← [s]
4: while |Ω| > 0 do
5:   x ← arg min over x ∈ Ω of (x.cost + H(x))
6:   Ω ← Ω − {x}
7:   if FINAL(x) then
8:     return x
9:   end if
10:  N ← NEXT(x)
11:  for y ∈ N do
12:    v ← x.cost + edge(x, y)
13:    if v < y.cost then
14:      y.cost ← v
15:      y.π ← x
16:      Ω ← Ω ⌢ y
17:    end if
18:  end for
19: end while
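A Python sketch of Algorithm 3, using a heap to realize the arg-min over Ω (the tie-breaking counter is my addition; stale heap entries are tolerated and merely re-expanded):

import heapq, itertools

def a_star(start, is_final, next_states, edge_cost, h):
    """Sketch of Algorithm 3. h() must underestimate the true cost to the goal."""
    counter = itertools.count()              # tie-breaker for the heap ordering
    cost = {start: 0.0}                      # best known x.cost
    parent = {start: None}                   # the x.π links
    open_heap = [(h(start), next(counter), start)]   # ordered by cost + heuristic
    while open_heap:
        _, _, x = heapq.heappop(open_heap)
        if is_final(x):
            return x, cost[x], parent
        for y in next_states(x):
            v = cost[x] + edge_cost(x, y)    # candidate cost through x
            if v < cost.get(y, float("inf")):
                cost[y] = v
                parent[y] = x
                heapq.heappush(open_heap, (v + h(y), next(counter), y))
    return None, float("inf"), parent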

5.3 Planning algorithms


Planning can be defined as identifying a series of activities or steps that results in achieving final goal
state from a starting state [3, 4]. The representation of states in the context of the planning algorithms is
different from the state representation in the previous algorithms we have seen. Each state is represented
as a collection of facts over the configuration of the system in question. A given problem or domain is
modeled as a set of boolean variables where each variable captures one minute aspect of the system. Any
state of the system can be represented as a boolean algebraic expression. Any algebraic expression can
be transformed into a disjunctive normal form which is a logical OR of several clauses where clause is a

28
conjunction of literals.

Example of a state: Consider a hypothetical problem of attending a meeting by 9 AM at the office. The first task is to identify all the pertinent boolean variables in the domain. Some of them are enumerated below.
• Wake up

• Freshen up

• Have breakfast

• Start driving from home

• Reach office

• Attend meeting

There can be other variables as well such as below.

• Read news paper

• Drop kids at school

• Buy vegetables

• Go to gym

The idea is to define a truth value for each of these variables.


If activities and actions are all represented as states, the planning problem corresponds to finding paths that reduce a cost function. In the space of planning algorithms, all edges are labeled by actions, and optionally a cost is associated with each edge or state. There are two approaches to reaching the target goal state from the start state in the context of planning algorithms - one is called forward state space search and the other is called backward state space search. In planning algorithms, much as with cost, the notion of a valid path is that the start state exists on the path.

Forward state space search (FSSP): The algorithm starts from the start state and exhaustively explores neighbours until the goal state is found to be either reachable or not reachable. This approach is simple to understand, and at any point in time only valid paths from the start state exist. However, it is costly when the state space is large.

Backward state space search (BSSP): In this approach, the algorithm starts from the goal state and works backwards to find the start state. The idea is to focus on the goal state by excluding states that are not relevant to it. While the focus on the goal is evident, this approach results in several temporary paths to the goal state which may not be valid, meaning they cannot be linked back to the start state.

Goal stack planning: This algorithm combines the benefit of validity from the FSSP algorithm and the benefit of focus on the goal from the BSSP algorithm by introducing a notion of a state stack. The stack keeps track of states originating from the goal state, pruned based on validity from the start state, as given in the recursive Algorithm 4 [5].

Algorithm 4 Recursive Goal Stack Planning Algorithm (RGSP)
Require: S, G /*start and goal states*/
1: Π = [] //empty plan
2: if G ⊆ S then
3: return Π
4: end if
5: X = G − S
6: for g ∈ X do
7: choose an action a such that (a.Add ∩ X ̸= ϕ) ∧ (a.Del ∩ G = ϕ)
8: Πa = RGSP (S, a.P re)
9: S = P rogress(S, a.P re) //S to Pre-conditions
10: S = P rogress(S, Πa ) //Pre-conditions to Plan upon selected action
11: S = P rogress(S, a) //Upon selected action to new state
12: Π = Π + Πa + a
13: end for
14: return Π
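The Pre/Add/Del fields used above follow the STRIPS-style action representation. A minimal sketch of such an action and of the Progress step (the class and field names are our own, illustrative choices):

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    pre: set = field(default_factory=set)     # facts required before the action
    add: set = field(default_factory=set)     # facts made true by the action
    delete: set = field(default_factory=set)  # facts made false by the action

def progress(state, action):
    """Apply an action to a state (set of facts): the Progress step."""
    assert action.pre <= state, "preconditions not met"
    return (state - action.delete) | action.add

drive = Action("drive_to_office",
               pre={"freshened_up", "had_breakfast"},
               add={"reached_office"},
               delete={"at_home"})
s = {"at_home", "freshened_up", "had_breakfast"}
print(progress(s, drive))
# {'freshened_up', 'had_breakfast', 'reached_office'} (set order may vary)
```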

5.4 Formal logic


The foundation of AI reasoning systems is computational logic theory [6]. A state of the system is
considered a set of truth statements about various observations. The recorded observations are assumed to
be true. As the system moves from one state to another, the observations change over time. Facts may get
added or removed over time, as in the case of the planning algorithms [4]; however, facts may also be
assumed to be true perpetually. New facts are deduced from older facts by combining them using a set of
rules of inference. Facts are expressed in a mechanistic fashion with a specific ordering of words rather
than in natural language form. The positions of the words have meanings, and a clear pattern is intended
to be explicitly expressed. Such an expression of explicit patterns of word ordering is called predicate logic.
Calculation and modification of the components of these word patterns is called predicate calculus.

5.4.1 Predicate or Propositional logic

Propositional logic is a way of ordering words such that the form is explicit and a clear commonality
is seen across diverse truth statements.
For instance, consider the fact that the Sun rises in the east and sets in the west. The same is expressed as,

• rises(sun,east)

• sets(sun,west)

For another instance, a student reads a text book for a course. Here the facts capture two students - studentA
and studentB; three books - bookA, bookB and bookC; three courses - course1, course2 and course3; and
the enrollment of studentA in course1 and course2. Now, as studentA has not enrolled for course3, he does not
read the text book for course3.

• reads(studentA,bookA).

• reads(studentA,bookB).

• textbook(course1,bookA).

Table 3: Rules of inference
Rule name                 Operators
Modus ponens              [A→B, A] → B
Modus tollens             [A→B, ¬B] → ¬A
Conjunction               [A, B] → A ∧ B
Disjunctive syllogism     [A∨B, ¬A] → B
Addition                  [A] → A ∨ B
Simplification            [A, B] → A
Hypothetical syllogism    [A→B, B→C] → A→C
Constructive Dilemma      [A→B, C→D, A∨C] → B ∨ D
Destructive Dilemma       [A→B, C→D, ¬B∨¬D] → ¬A ∨ ¬C

• textbook(course2,bookB).

• textbook(course3,bookC).

• enrolled(studentA,course1).

• enrolled(studentA,course2).

5.4.2 First order logic

As we have seen in the previous case, predicate logic requires each and every fact to be recorded and
processed. However, this is neither feasible nor scalable when a common predicate pattern could instead be
written in a reduced form using variables. Examples of first order logic involve variables as below.

• A student reads the text book of the course he/she has enrolled for

• enroll(s,c), textbook(c,t) → reads(s,t)

• for all students ’s’, for all textbooks ’t’ and for all courses ’c’

5.4.3 Automated theorem proof

Given a knowledgebase of truth statements and a query truth statement, the task is to prove or disprove the query.
There are two ways of accomplishing this - (i) Forward proof method and (ii) Backward proof method.

5.4.3.1 Forward chaining A set of rules is provided to the automated theorem proving algorithm. The
algorithm systematically expands the knowledgebase by combining pieces of knowledge to produce
derived knowledge.
Some of the rules of inference are shown in (Table 3). Here, → denotes implication, ¬ denotes negation,
∨ denotes logical OR, ∧ (or a comma) denotes logical AND, and [ ] denotes grouping.
The rules are verifiable via truth tables. There can be more rules based on combinations in the truth table.

5.4.3.2 Incompleteness of the forward chaining The rules of inference may not be sufficient for
proving a given query in the forward method. New rules need to be added for each special query that
cannot be addressed.
For instance, the following rule cannot be deduced from the given set of rules (Table 3).

[B ∧ D, A ∨ C] → (A ∧ B) ∨ (C ∧ D)

To handle this, one can add it as a new rule. However, there is no guarantee that the augmented rule set
will cover all future queries.

5.4.3.3 Backward chaining In backward chaining, the query (or goal) is considered alongside the
knowledgebase and deductions are evaluated.
One approach is to add the query to the knowledgebase and to prove the whole group to be true for
some assignment of values to the variables. However, this reduces to evaluating all possible values and
identifying a particular combination of variable assignments leading to truth. The scenario quickly becomes
the satisfiability problem, which has exponential worst-case time complexity.
Another way to prove that a query is deducible is the method of contradiction, which is presented next.

5.5 Resolution by refutation method


In order to circumvent the need to exhaustively search for truth values of variables when a goal is added
to the knowledgebase, an alternative approach is to consider the negation of the goal and arrive at a contradiction.
Resolution is a special rule of inference, improving on the disjunctive syllogism (Equation 1).

[A ∨ B, C ∨ ¬B] → A ∨ C    (1)

• Negate the goal and add it to the knowledge base

• Apply the resolution rule repeatedly

• Arrive at a contradiction (or refutation)
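As an illustrative sketch (the clause encoding below is our own, not from any particular prover), a single resolution step over clauses represented as sets of literals can be written as:

```python
def resolve(c1, c2):
    """Resolve two clauses (sets of literals; '~p' is the negation of 'p').
    Returns the resolvents obtained by cancelling a complementary pair."""
    def neg(lit):
        return lit[1:] if lit.startswith("~") else "~" + lit
    resolvents = []
    for lit in c1:
        if neg(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {neg(lit)}))
    return resolvents

# (A | B) and (C | ~B) resolve on B to give (A | C), as in Equation 1.
print(resolve({"A", "B"}, {"C", "~B"}))  # [{'A', 'C'}]
# Refutation: resolving {p} with the negated goal {~p} yields the
# empty clause, i.e. a contradiction.
print(resolve({"p"}, {"~p"}))  # [set()]
```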

The Prolog system [7] uses backtracking together with the resolution method. Further, the knowledge base is
expressed as Horn clauses and the inference method used is the SLD algorithm.
In order to convert a quantified formal statement into clausal form, skolemization is used
[8]. The method used for symbolic matching and substitution is called the unification algorithm [9].

5.6 AI framework adaptability issues


There has been a tremendous amount of prior work on AI systems, including symbolic logic, unification
theory, inference mechanisms, state space search, planning systems and the Prolog inference engine. However,
only bits and pieces of these concepts are used in state of the art systems, and none at scale as regards
implementation systems such as PDDL [10] and Prolog [7]. For instance, we would encourage the reader
to try to program a simple river crossing problem in PDDL, or to program an image captioning
algorithm using Prolog. It is practically impossible to use the AI implementations in their current form.
However, the concepts can be carried forward to bring in robust abstractions.
One major limitation of state space search algorithms is that they are discrete and non-differentiable. There
is no notion of continuous state space search in the traditional systems; however, real world problems
are routinely modeled as optimization problems that necessitate gradients in most cases. If no gradient
information is available, a randomized method such as a genetic algorithm may be used, while still operating
in the continuous state space regime.
Another major limitation of the traditional AI systems is the need for the heuristic function. For
instance, in the case of the A* algorithm, there are seven theorems concerning the correctness of the algorithm
when the heuristic is admissible, i.e. of the underestimating type. The question reduces to: who bells the cat? That is,
who will provide the heuristic function? In a practical scenario, an engineer coming to work will never be given
a heuristic function; he or she is typically given a bigger business problem, vaguely stated, or at most some exemplars
or data sets. In such a situation the traditional AI framework becomes inapplicable, if not
undesirable.
There is major scope today to marry the schools of thought that evolved over decades in the state
space and formal logic regimes with the current machine learning methodologies, including deep learning,
to reap the benefits of both areas of strength.

6 Supervised methods
Supervised learning methods are a class of machine learning algorithms where known correspondences be-
tween data and expected outcome are provided as examples [11, 12, 13]. The examples constitute a data set
which is also referred to as ground truth or gold standard. The input and output are both in the form of
vectors.
The problem of supervised learning can be expressed mathematically as in (Equation 2), where a data
point in d dimensional space is mapped to a data point in m dimensional space. The primary aspect of
supervised learning is that a known set of such correspondences is given, and the task is to learn the mapping
function M. However, the mapping need not be one-to-one; it can be many-to-one as well. The set of
known mappings is provided as a data set S (Equation 3).

M : Rd → Rm (2)

S = {(x1 , y1 ), . . . , (xN , yN )|xi ∈ Rd , yi ∈ Rm } (3)

We fundamentally need a goodness metric for assessing the quality of the mapping function. The goodness
metric is formulated as a loss formulation (Equation 4), where L(·, ·) is a loss function.

M∗ = arg min_M L(M(xi), yi)    (4)

The task of determining a minimizing M is computationally hard to express unless we bring in simplifying
assumptions. Parameterize the mapping function, M in terms of certain parameters Θ and minimize over
the parameters (Equation 5).

Θ∗ = arg min_Θ L(M(Θ, xi), yi)    (5)

Based on the semantics and form of the M function and the parameters Θ, different classes of learning
algorithms have come into existence. The two primary types of supervised methods are - (i) M (·, ·) is a
differentiable function and (ii) M (·, ·) is not a differentiable function. The Θ values may be learned over
iterations by gradient descent formulation or through a discrete mechanism of hill climbing.
For instance, as we will see in the sections to come, the parameter improvement in the case of a decision tree
is choosing the right split every time a new subtree is constructed. In the case of parameter update by gradient

Table 4: Examples of loss functions
Loss function               Description
Mean Squared Error (MSE)    (1/|X|) Σ_{xi∈X} ||M(Θ, xi) − yi||²
Mean Absolute Error (MAE)   (1/|X|) Σ_{xi∈X} ||M(Θ, xi) − yi||
Hinge loss                  (1/|X|) Σ_{xi∈X} max{1 − (2 ∗ yi − 1) · (2 ∗ M(Θ, xi) − 1), 0}
Logistic loss               (1/|X|) Σ_{xi∈X} −((2 ∗ yi − 1) · pi)
Cross entropy loss          (1/|X|) Σ_{xi∈X} −(yi · log(pi) + (1 − yi) ∗ log(1 − pi))
Information gain loss       (1/|X|) Σ_{xi∈X} (yi · log(yi) − yi · log(pi))

Table 5: Synthetic data sets
Name            Description
Circles         Concentric circles, where each circle is of a particular class
Moons           Yin-yang type of semi-overlapping concentric moons, where each moon is of a particular class (Figure 4)
Blobs           Globular clusters of data
Classification  Classification type of data where points are generated in multi-dimensional space
descent, the function needs to be differentiable, and parameter changes are taken in the decreasing direction of the
loss function.
The loss function L(·, ·) computes an error value for the deviation between the actual yi vector and the
predicted M(Θ, xi) vector. Examples of loss functions are tabulated in (Table 4). The loss functions for single
dimensional points are illustrated in (Figure 2). For the probability of prediction required in the logistic and
cross entropy loss functions, the probability that the j th element of the predicted output vector is on is given
by the softmax expression,

pi[j] = e^{M(Θ,xi)[j]} / Σ_{k=1}^{n} e^{M(Θ,xi)[k]}
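A numerically stable version of this softmax computation, as a small numpy sketch (the scores below are hypothetical model outputs):

```python
import numpy as np

def softmax(scores):
    """p[j] = exp(s[j]) / sum_k exp(s[k]), shifted for numerical stability."""
    s = scores - np.max(scores)   # shifting does not change the result
    e = np.exp(s)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical M(Theta, x) outputs
print(softmax(scores))                # probabilities summing to 1
```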

6.1 Data sets


The supervised learning methodology requires a data set to operate. Though unsupervised learning problems
also require a data set, there is a stark difference: in supervised methods there is an (input, output) pair,
and typically the output is of fixed dimension. When we study sequence-to-sequence models, input and output
pairs also occur; however, in sequence-to-sequence problems the input and output sizes are not fixed, and no
such constraint is required.
There are a number of publicly available data sets for supervised learning methods of varying complexity
and characteristics. Each data set challenges a particular method; a method may work well on some data sets
and not on others. Some of the data sets are real and some are synthetic. The scikit-learn library provides
a number of synthetic data set generators, some of which are shown in (Table 5).
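For instance, these synthetic data sets can be generated with scikit-learn as follows (the sample counts and noise levels are arbitrary choices):

```python
from sklearn.datasets import make_circles, make_moons, make_blobs, make_classification

X, y = make_circles(n_samples=200, noise=0.05, factor=0.5)  # concentric circles
X, y = make_moons(n_samples=200, noise=0.1)                 # two interleaving moons
X, y = make_blobs(n_samples=200, centers=3)                 # globular clusters
X, y = make_classification(n_samples=200, n_features=10,    # generic multi-dimensional
                           n_informative=4, n_classes=2)    # classification data
```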

6.2 Linear Regression
Linear regression is the fundamental regression algorithm where we need to predict the output y coordinate
from the input x. Imagine the scenario where there are N data points in 1 dimension (i.e. the number of features
is just one). Each data point has a corresponding y coordinate. The task is to predict, for a given input x,
what the y coordinate could be. There are several possible ways in which this task can be accomplished.

• Let the data set be D = {(xi , yi )(∀i ∈ [1..N ])}

• Let y = H(b, x) = b where b is a constant number that we will learn

• We need to define an error function, compute the error for each input, and the cumulative error over
all points in the data set

• Let the error be a function of the model parameters, E(b) = Σ_{i=1}^{N} (yi − H(b, xi))². This is also
called the squared error.

The task is to identify a function H that minimizes the error. It is difficult to directly operate on the
space of functions; it is easier to determine the best operating parameters of a given function.
Therefore, the problem is recast as identifying the model parameters that minimize the error. The solution
is to start with some value of b and, over iterations, move the value along the negative gradient
direction. The gradient is computed as in (Equation 6).


E(b) = Σ_{i=1}^{N} (yi − H(b, xi))² = Σ_{i=1}^{N} (yi − b)²

b∗ = arg min_b E(b)

b_new ← b_old − ∇E(b)|_{b=b_old}

∇E(b) = ∂E(b)/∂b = Σ_{i=1}^{N} 2 ∗ (yi − b) ∗ (−1)    (6)

Let us take a look at the analytical solution for determining b, although it is not practically feasible to write
one down when dealing with thousands of dimensions. Setting the gradient (Equation 6) to zero yields the
optimal b value analytically.


∇E(b) = 0 ⇒ Σ_{i=1}^{N} 2 ∗ (yi − b) ∗ (−1) = 0

⇒ b = (1/N) ∗ Σ_{i=1}^{N} yi

As we can see, it is simply the mean value of the y coordinates when the error function is the squared error.
The problem of determining an optimal b is the same as finding a horizontal line that best passes through
all the points. The problem complexity can be slightly increased by fitting a line which
has a slope as well. The task of fitting a line with slope now reduces to identifying the pair of variables

together, i.e. (m, c), that minimizes the total error as in the previous case. Minimizing corresponds to moving in
the negative direction of the gradient vector (Equation 7).

y = H(m, c, x) = m ∗ x + c

E(m, c) = Σ_{i=1}^{N} (yi − H(m, c, xi))² = Σ_{i=1}^{N} (yi − (m ∗ xi + c))²

(m, c)∗ = arg min_{(m′,c′)} E(m′, c′)

∇E(m, c) = (∂E(m, c)/∂m, ∂E(m, c)/∂c)    (7)
The weight update equation for the two variables is now (m, c)_new = (m, c)_old − ∇E(m, c)|_{(m,c)=(m,c)_old}.
Though determining the analytical solution for inputs involving several hundreds or thousands
of features is not practically feasible, for the purpose of understanding the concept we can take a look
at the analytical solution. Equating the gradient to zero (Equation 7), we get the following solution.

∇E(m, c) = 0

⇒ (∂E(m, c)/∂m, ∂E(m, c)/∂c) = 0

⇒ ∂E(m, c)/∂m = 0    (8)

⇒ ∂E(m, c)/∂c = 0    (9)
Consider Equation 8,

∂E(m, c)/∂m = 0

⇒ 2 ∗ Σ_{i=1}^{N} (yi − (m ∗ xi + c)) ∗ xi = 0

⇒ m = (Σ_{i=1}^{N} (yi − c) ∗ xi) / (Σ_{i=1}^{N} xi ∗ xi)    (10)

Consider Equation 9,

∂E(m, c)/∂c = 0

⇒ 2 ∗ Σ_{i=1}^{N} (yi − (m ∗ xi + c)) ∗ 1 = 0

⇒ c = (1/N) ∗ Σ_{i=1}^{N} (yi − m ∗ xi)    (11)

As we see, Equation 10, which updates m, is a function involving c, and vice versa, Equation 11, which
updates c, depends on m. This cyclic dependency is expected in the multivariate case, i.e. with many
features. An iterative process that starts with an initial setting of m, c and updates the values successively
converges to an optimal solution. Gradient descent of the (m, c) vector starts with initial values
of m and c and converges to an optimal solution over successive iterations.
If we consider the partial derivatives of the ∇E(m, c) vector with respect to m and c, we can observe that they
result in non-negative constants (exercise to the reader). This asserts that there exists a global minimum
(m, c) vector for the squared error loss function.
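As a small numpy sketch of this iterative process (the learning rate, iteration count and synthetic data are arbitrary choices), gradient descent on (m, c) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 100)   # true m=3, c=2 plus noise

m, c, alpha = 0.0, 0.0, 0.1   # initial values and learning rate
for _ in range(2000):
    err = y - (m * x + c)
    grad_m = -2.0 * np.sum(err * x)   # dE/dm
    grad_c = -2.0 * np.sum(err)       # dE/dc
    m -= alpha * grad_m / len(x)      # averaged gradients keep the step stable
    c -= alpha * grad_c / len(x)
print(round(m, 2), round(c, 2))       # close to 3.0 and 2.0
```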

6.2.1 Polynomial fitting

Linear regression may be applied to fit higher order curves. Consider the same data set of 2D points, D as
in the previous section. Convert each point to a degree d polynomial by feature transformation,

Dd = {(x′i , yi )|x′i = [x0i , . . . , xdi ]}

Now the learning task becomes identifying the coefficients for each of the powers. Consider
an array of coefficients (to be learned) as,

Θ = [a0 , . . . , ad ]

The predicted ŷi is then the inner product of these coefficients with the transformed input coordinate,

(∀x′i ∈ Dd) : ŷi = M(Θ, x′i) = Σ_{k=0}^{d} ak ∗ xi^k
The polynomial fitting problem now corresponds to reduction of loss between the actual and the predicted
y coordinates. Using the least squares loss function, the gradient with respect to Θ becomes,

L(Θ, Dd) = Σ_{x′i∈Dd} ||yi − ŷi||² = Σ_{x′i∈Dd} ||M(Θ, x′i) − yi||²

∇Θ L = Σ_{x′i∈Dd} 2 ∗ [(M(Θ, x) − yi)^T × ∇Θ M(Θ, x)]_{x=x′i}

∇Θ M(Θ, x) = [∂M(Θ, x)/∂a0, . . . , ∂M(Θ, x)/∂ad]^T = [x^0, . . . , x^d]^T

An example of polynomial fitting with degrees 1, 4 and 15 on a noisy sinusoidal function is
presented in (Figure 3). The figure depicts the actual data points and the predicted curve in blue and the
true function in red. One can observe that as the degree increases, the curve becomes wiggly
and overfits the points. When the degree is smaller than required, the curve exhibits a general shift
in its position, also termed bias. There exists an optimal parameter setting for a given modeling procedure;
here the procedure is polynomial fitting and the parameter is the degree.
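A sketch of this degree-d feature transformation followed by linear regression, using scikit-learn (the degrees mirror the figure; the data is a hypothetical noisy sinusoid):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 30)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, round(model.score(x, y), 3))  # training R^2 rises with degree
```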

6.2.2 Thresholding and Linear Regression

Though linear regression is a plain real valued attribute prediction problem, it can be converted to a
classification problem by imposing a threshold. However, the threshold has to be learned from the data instead of
being fixed manually. This applies when the set of points represents a boundary or gulf region
between two (or more) classes of points. Applying a threshold to the linear regression output then places
a point in one of the buckets surrounding the gulf region of points over which the regression problem is solved.
However, more interpretable and sophisticated methodologies such as logistic regression, support vector
machines, decision trees and other formulations are available in place of converting a linear regression problem
to a classification problem. Interestingly, these formulations still use gradient descent in the case
of differentiable loss functions, and in a philosophical sense what is solved is also a regression. Caution should
be exercised here: just because gradient descent is deployed in all these formulations, the
intuitions behind regression and classification problems remain different.
In the case of neural networks, one often finds a regression problem being solved for each output class represented
by the neurons in the final layer of the network. The outputs are then compared, using a softmax classifier, to find
the maximally scoring neuron and hence the predicted class. In this case, regression scores are
compared to solve a classification problem. However, it is recommended to treat
a neural network as a separate problem formulation from plain regression, though gradient
descent is deployed in both places.

6.3 Logistic Regression


Logistic regression is one of the fundamental classification algorithms, where the log odds in favor of one of the
classes is defined and maximized via a weight vector. Whereas in linear regression w · x is directly
used to predict the y coordinate, in the logistic regression formulation w · x is defined as the log odds in favor of
the predicted class being 1. Apart from this interpretation of the w · x values, the rest of the formulation relies on
regression of the w vector to minimize a logit loss function, hence the name logistic regression.
It is essentially a classification algorithm even though the word ’regression’ is present.
Let y′ ∈ {0, 1} be the actual label of a data point in a two class problem. Let D = {(x′, y′)} be the data set
of points. Let D1 = {(x′, 1) | y′ = 1} and D0 = {(x′, 0) | y′ = 0}. Let y denote the predicted
class. Now, let us define w · x = log(P(y = 1|w, x) / P(y = 0|w, x)), where P(y = 1|w, x) denotes the probability
of the predicted class being 1 given the w and x vectors. With this setting, the logistic regression weight update
equations are derived as below.

P(y = 0|w, x) = 1 − P(y = 1|w, x)

w · x = log( P(y = 1|w, x) / P(y = 0|w, x) )

⇒ P(y = 1|w, x) = 1 / (1 + e^{−w·x})

P(D|w) = P(D1|w) × P(D0|w)    (i.i.d.)
       = Π_{(x1,y1)∈D1} P(y = 1|w, x1) × Π_{(x0,y0)∈D0} P(y = 0|w, x0)
       = Π_{(x,y)∈D} P(y = 1|w, x)^y × P(y = 0|w, x)^{(1−y)}

⇒ log(P(D|w)) = Σ_{(x,y)∈D} ( y ∗ log(P(y = 1|w, x)) + (1 − y) ∗ log(P(y = 0|w, x)) )
              = Σ_{(x,y)∈D} ( y ∗ log(P(y = 1|w, x)) + (1 − y) ∗ log(1 − P(y = 1|w, x)) )
              = Σ_{(x,y)∈D} ( y ∗ log(1/(1 + e^{−w·x})) + (1 − y) ∗ log(1 − 1/(1 + e^{−w·x})) )
              = Σ_{(x,y)∈D} ( y ∗ log(e^{w·x}/(1 + e^{w·x})) + (1 − y) ∗ log(1/(1 + e^{w·x})) )
              = Σ_{(x,y)∈D} ( y ∗ (w · x) − y ∗ log(1 + e^{w·x}) − (1 − y) ∗ log(1 + e^{w·x}) )
              = Σ_{(x,y)∈D} ( y ∗ (w · x) − log(1 + e^{w·x}) )

L(w) = −log P(D|w) = − Σ_{(x,y)∈D} ( y ∗ (w · x) − log(1 + e^{w·x}) )    (12)

In logistic regression, we need to determine the w vector such that the probability of the data is maximized
(Equation 12). The per-example interpretation of the loss function is given by (Equation 13).

L(w, x′, y′) = −log(P(y = y′|w, x′)) = log(1 + e^{w·x′}) − y′ ∗ (w · x′)    (13)

The optimal value is obtained by taking the derivative of the loss function with respect to w, which is the
gradient, and iterating by moving in the direction of the negative gradient calculated at each step (Equation 14):
w∗ = arg max_w P(D|w) = arg min_w −log(P(D|w)) = arg min_w L(w)

w_new = w_old − α × ∇L(w)|_{w=w_old}    (14)

Here ∇L(w) = Σ_{(x,y)∈D} (1/(1 + e^{−w·x}) − y) ⊙ x.
Considering the second derivative of the loss function, the gradient of the gradient, results in the Hessian
matrix (Equation 15). The outer product x x^T ensures positive semi-definiteness of the Hessian.

∂∇L(w)/∂w = Σ_{(x,y)∈D} (e^{w·x}/(1 + e^{w·x})²) ∗ x x^T ≥ 0    (15)

Consider a vector y in the same d dimensional vector space as the input points, Rd ,

(∀y ∈ Rd ) : y T × (x × xT ) × y = (y T × x) × (xT × y) ≥ 0

There is an interesting simplified form of the logistic regression formulation when y ∈ {−1, +1} instead of
{0, 1}; the loss function can be simplified. Consider the loss function (Equation 13) for the two scenarios
y′ = 0 and y′ = 1.

L(w, x, y) = log(1 + e^{w·x}) − y ∗ (w · x)    (16)

y = 0 ⇒ L(w, x, 0) = log(1 + e^{w·x})    (17)

y = 1 ⇒ L(w, x, 1) = log(1 + e^{w·x}) − (w · x) = log(1 + e^{w·x}) − log(e^{w·x}) = log(1 + e^{−w·x})    (18)

⇒ L(w, x, β) = log(1 + e^{−β∗(w·x)})    (19)

where β = 2 ∗ y − 1 converts the {0, 1} label into the {−1, +1} form used in (Equation 19).
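A compact numpy sketch of this gradient descent on synthetic two-class data (learning rate, iteration count and the centered data without a bias term are our own simplifying choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w = np.zeros(2)   # no bias term; the classes are symmetric about the origin
alpha = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # P(y=1 | w, x) for every point
    grad = X.T @ (p - y)               # gradient of the negative log likelihood
    w -= alpha * grad / len(y)
p = 1.0 / (1.0 + np.exp(-X @ w))
print(w, ((p > 0.5) == y).mean())      # learned weights, training accuracy
```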


For instance, fitting a logistic regression classifier to a moons type data set is shown in (Figure 5).
The shades of red indicate the probability, according to the trained classifier, of each pixel in the image region
belonging to the red class; the shades of blue indicate the corresponding probability for the blue class.
The multi-class classification scenario is more advanced than the binary scenario and is presented in the
subsequent sections.

6.4 Support Vector Machine - Linear Kernel


A linear kernel support vector machine [14, 15] is a formulation that segregates two classes of points by a
margin. A margin is a region bounded by two parallel hyper planes (the naming convention for planes in more
than 3 dimensions). The margin region between the two separating hyper planes should not contain any
points. In the soft margin formulation, points falling in the margin are given a certain penalty.
The data points lying on the margin boundaries are called support vectors. The data points on either side
of the separating hyper planes belong to the two classes, although certain points may be wrongly classified,
contributing to the error value.
The first step in the formulation of an SVM is to demarcate the two classes with +1 and -1 labels. The
formulation then amounts to determining a vector w such that w · x ≥ +1 for the +1 class of points and w · x ≤ −1
for the other class of points.
Let D = {(x, y)|y ∈ {−1, +1}}

D− = {(x, y)|y = −1}


D+ = {(x, y)|y = +1}

Combining the two equations for +1 and -1 class,

w · x ≥ +1(∀(x, y) ∈ D+ → y = +1)
w · x ≤ −1(∀(x, y) ∈ D− → y = −1)
⇒ (∀y) : y ∗ (w · x) ≥ 1

Table 6: ξ interpretations
ξ value Interpretation
ξ≤0 Points are correctly classified
0<ξ<1 Points lying in the margin region
ξ=1 Points lying on the boundary hyper planes
ξ>1 Points occurring on the wrong side of the margin

The constraint for SVM is given in Equation 20. However, the error cases are for the points falling
inside the margin and points occurring on the wrong side. The penalty for these errors for each point is
shown in the Equation 21. The higher the ξi value, the more inconsistent that point xi is. The lesser the ξi
value for the ith point, the more consistent the point is. The example scenarios are shown in the Table 6.

(∀y) : y ∗ (w · x) ≥ 1 (20)

(∀(x, y) ∈ D), (∃ξi > 0) : yi ∗ (w · xi ) ≥ 1 − ξi (21)

When we take two points A+ , B+ on the +1 hyper plane, then

w · A+ = +1
w · B+ = +1
⇒ w · (A+ − B+ ) = 0
⇒ w ⊥ Hyperplane

Consider two points along the w vector, one point A+ on the +1 hyper plane and another point A−
on the -1 hyper plane. The size of the margin λ is given by the distance between these two points lying
along the w vector on either of the hyper planes, as given in Equation 22.

A+ − A− = λw
w · A+ = +1
w · A− = −1
w · (A+ − A− ) = +1 − (−1) = 2
w · (λw) = 2
λ||w||2 = 2
2
⇒λ=
||w||2

2
λ= (22)
||w||2

Identifying the margin resulting in maximal separation of the two classes corresponds to Equation 23.

w∗ = arg max_w λ = arg min_w ||w||²/2

w∗ = arg min_w ||w||²/2 ∋ (∀i ∈ [1..N]) : yi ∗ (w · xi) ≥ 1 − ξi    (23)

The error terms ξi can be added to the optimization function with constraints in Equation 23 to assume
the form Equation 24.

w∗ = arg min_w { ||w||²/2 + Σ_{i=1}^{N} ξi }    (24)

However, in this form the positive and negative errors cancel each other out, as in (Equation 25), while it
is required to penalize only the ξj > 0 entries. In order to focus on the correction of errors, a hinge loss
function is introduced into the optimization function in (Equation 23), assuming the form (Equation 26).

(∃i, j ∈ [1..N ]) : ξi < 0 ∧ ξj > 0 ⇒ ξi + ξj ≤ ξj (25)

w∗ = arg min_w { ||w||²/2 + Σ_{i=1}^{N} max(0, ξi) } = arg min_w { ||w||²/2 + Σ_{i=1}^{N} max(0, 1 − yi ∗ (w · xi)) }    (26)

The more advanced forms of SVM are based on kernel functions and Mercer’s theorem to express the w vector
as a linear combination of the input data points. We will discuss those techniques under kernel methods in
the subsequent sections.
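A minimal sketch of this soft-margin objective, optimized by subgradient descent on the hinge loss (the regularization constant, step size and synthetic data are arbitrary choices; in practice a library SVM such as scikit-learn's LinearSVC would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)   # SVM labels are -1/+1

w = np.zeros(2)
C, alpha = 1.0, 0.01
for _ in range(1000):
    margins = y * (X @ w)
    viol = margins < 1                       # points inside or beyond the margin
    # subgradient of ||w||^2/2 + C * sum(max(0, 1 - y * w.x))
    grad = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    w -= alpha * grad
print(w, (np.sign(X @ w) == y).mean())       # weights and training accuracy
```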

6.5 Decision Tree


A decision tree algorithm [16] takes as input a table X and recursively partitions the table into sub-tables
and so on, improving a purity score of the ”label” column in each partition. The purity score is a
measure based on the proportions of the individual classes in a mixture of class labels. The higher the proportion
of one of the classes, the more pure the collection is.
The attributes of a table can be numeric or categorical. A numerical attribute can partition the table into
two parts about its mean value - rows with attribute value less than the mean as one part and the remaining rows
as the other part. A categorical attribute can partition the table into as many parts as its number of possible values.

Outline of decision tree

• Determine base propensities of the classes

• Determine base purity score

• Evaluate the weighted purity score for each attribute over its possible values, where the weight is the fraction
of rows having that value

• Select the attribute that gives highest purity gain over the base purity

• Split the table into parts based on attribute values

• Compute decision tree for each of the part recursively


A decision tree constructs a function y = F(x), where F is composed of a sequence of tests over the
attribute values, finally landing in a node. The node has propensities of the individual classes, which are output
as class probabilities for the input x. A new data point goes through a series of tests on the selected attributes
and their corresponding values, starting from the root node down to a leaf node. The decision
tree algorithm is shown in Algorithm 5.
An application of the decision tree classifier to the moons data is shown in (Figure 6). Each node in the
decision tree corresponds to either a horizontal or a vertical line. A particular sub-region is sub-divided by
horizontal and vertical lines according to the decision tree depth considered. The tree shown here is for a
depth of 3, where the whole region is partitioned into 8 sub-regions. If the depth is k, the 2D region would
be partitioned into 2^k sub-regions.
An application of the decision tree regression model is shown in (Figure 11). Here the feature space is one
dimensional, and the output is a real valued quantity that needs to be predicted. The output in each sub-tree
of the decision tree is its mean value. The tree is computed for a depth of 3, hence resulting in 2³ horizontal
partitions. Each horizontal partition corresponds to a mean value computed within that sub-branch. The
tree with depth 1 is also illustrated, which has just two mean values (2¹).
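For instance, a depth-3 decision tree of the kind shown in Figure 6 can be fit with scikit-learn as follows:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3)   # at most 2^3 = 8 leaf regions
tree.fit(X, y)
print(tree.score(X, y))           # training accuracy
print(tree.predict_proba(X[:1]))  # class propensities at the reached leaf
```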

6.6 Ensemble methods


Ensembling is a method of combining several classifiers to produce a better classifier in terms of reduced
variance and bias [17, 18]. There are three types of ensembling approaches - (i) bagging, (ii) boosting and
(iii) random forests. Bagging is a process where a powerful classifier is trained on different subsets of the data
and the average prediction is taken; such a process reduces the variance of the predictions. Boosting is a process
where, incrementally over steps, several weak classifiers are combined so as to reduce the error. Random forest is a
technique where multiple randomized classifiers are built on different subsets of the data columns to reduce both
bias and variance. Each of these methods has its own set of control parameters and characteristic behavior on
diverse data sets.
A sketch of the bagging approach is given in Algorithm 6. It is reasonable to assume that the test data
distribution spans the diverse sub-samples of the training data. In this case, conceptually, the average
prediction from multiple classifiers built over those different subsets of training data behaves differently from
a single model built over the entire training data. The behavior is analogous to the cross validation process,
where a model’s performance is averaged over multiple sub-samples of training data. The averaged model
will have lower variance than a single model. However, nothing can be guaranteed about the training set
accuracy or the bias; if all the constituent models are too simple, the bias of their combination in bagging
is still going to be high.
The boosting algorithm, as depicted in Algorithm 7, creates the ensemble with a focus on error reduction at
each step of adding a new classifier. Typically the classifiers are all weak; however, due to the reduction
of the remaining error at each step of adding a new classifier to the ensemble, the final classifier attains much
Algorithm 5 Decision Tree Algorithm - DT
Require: X /*input table*/
Node = new Node() /*Node to return after tree construction*/
C = cols(X) /*set of columns*/
Node.basep = purity(X) /*purity on label column*/
Node.probas = probas(X[’label’]) /*class propensities*/
maxpurity = Node.basep /*highest enhanced purity*/
maxc = ’label’ /*attribute that enhances purity*/
β = [] /*condition tests*/
Γ = [] /*sub trees*/
if Customize Stopping Criteria then
return Node
end if
for c ∈ C do
Π = [] /*table partitions based on values*/
βc = [] /*condition tests for this attribute*/
if type(c)=categorical then
for v ∈ set(X[c]) do

Π ← Π ∪ [X[c = v]]
βc ← βc ∪ [”c == v”]
end for
else
µ = mean(vals(X[c]))
Π ← [X[c < µ], X[c ≥ µ]]
βc = [”c < µ”, ”c ≥ µ”]
end if
wtpurity = 0
for i = 1 : |Π| do
wtpurity+ = |Π[i]|
|X| ∗ purity(Π[i])
end for
if wtpurity > maxpurity then
maxpurity = wtpurity
maxc = c
β = βc
end if
end for
if maxc ̸= ’label’ then
for i = 1 : |β| do

Γ ← Γ ∪ DT(X[β[i]]) /*Recursion for sub tree building*/
end for
end if
return Node

Algorithm 6 Bagging Methodology
Require: X /*Data set*/
Γ = []
for i = 1 : M do
Xsub ← sample(X)

Γ ← Γ ∪ train(Xsub)
end for
if classification problem then
y ← voting({γ(x) : ∀γ ∈ Γ})
else
y ← mean({γ(x) : ∀γ ∈ Γ})
end if

higher accuracy than the individual weak ones. However, unlike bagging, the boosting algorithms do not reduce
variance and they are prone to overfitting, though they reduce bias. The advantage is that a bunch of
simple weak classifiers is sufficient rather than a single complex classifier. Reducing overfitting is as simple
as reducing the size of the ensemble or considering even simpler constituent classifiers.

Algorithm 7 Boosting Methodology


Require: X /*Data set*/
Γ0 = 0 /*List of classifiers so far*/
Error = L(Γ(X), y) /*Loss function*/
for i = 1 : M do
Γi ← Γi−1 + α × f (x) /*Add a new learner*/
fi , αi ← arg minf,α L(Γi (X), y) /*Reduce loss further*/
/*For a convex loss function*/
Setting ∂L(Γi(X), y)/∂f = 0 → fi
Setting ∂L(Γi(X), y)/∂α = 0 → αi
end for
y = ΓM (x) /*to predict for new input vector*/

The random forest (RF) algorithm [19], as depicted in Algorithm 8, creates an ensemble of multiple classifiers
built over different subsets of the columns of the input data. However, unlike bagging, where subsets of rows are
used for model building, each classifier sees the whole data. This is also unlike boosting, where a new classifier
targets the remaining error; in RF the new classifier has no notion of remaining error.
A random forest reduces variance by averaging the decision over diverse classifiers. A random forest also
reduces bias if the individual classifiers are sufficiently complex over the given subsets of features.
A comparison of the three ensembling methods is given in Table 7. The bagging and
random forest methodologies are more closely related to each other than either is to boosting. The nature
of the constituent classifiers affects bias and variance.
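All three approaches are available directly in scikit-learn; a short sketch (the data set, base learners and ensemble sizes are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50),
    "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50),
    "random forest": RandomForestClassifier(n_estimators=50),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))  # training accuracies
```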

6.6.1 Boosting algorithms

Boosting algorithms work by focusing on the error in each iteration and reducing that error by adding a new
model. There are two main types of boosting algorithms - AdaBoost, which focuses on erroneous
data points via a notion of sample weights, and gradient boosting, where the error value is

Algorithm 8 Random forest methodology
Require: X /*Data set*/
Γ0 = [] /*List of classifiers so far*/
χ = cols(X) /*set of columns*/
for i = 1 : M do
χi = subset(χ) /*random subset*/
h∗ = arg min_h L(h(X[χi]), y) /*best classifier on the subset of columns*/
Γi ← Γi−1 ∪ h∗
end for
if classification problem then
y ← voting({γ(x) : ∀γ ∈ Γ})
else
y ← mean({γ(x) : ∀γ ∈ Γ})
end if

Table 7: Comparison of ensembling methods


Aspect Bagging Boosting Random forest
Training data Multiple subsets Whole data Whole data
for building con-
stituent classifiers
Typical complexity Complex Weak Complex
of the constituent
classifiers
Variance reduction Yes No Yes
Bias reduction No Yes No
Complexity control Constituent classi- Ensemble size Constituent classi-
fiers fiers and ensemble
size

Table 8: AdaBoost notations
Symbol      Meaning
Wk(i)       Weight of the ith data point in the kth iteration
hm(x)       Classifier built in the mth iteration
ϵm          Sum of weights of erroneous points in the mth iteration
αm          Weight of each classifier hm(x)
1{a ̸= b}   Indicator function: 1 when a ̸= b and 0 otherwise

predicted and subtracted.

AdaBoost algorithm: This algorithm [20] employs a notion of a weight for each point. The higher the
weight, the more the point contributes to the overall error; the lower the weight, the less it
contributes. The algorithm starts with an initial uniform setting of the weights for all points
and updates the weights over iterations as in Algorithm 9. The notations used are
described in Table 8.

Algorithm 9 AdaBoost algorithm


1: W0(i) = 1/N (∀i ∈ [1..N])
2: for m = 1 : M do
3: hm = arg min_h Σ_{i=1}^{N} Wm−1(i) ∗ 1{yi ̸= h(xi)}
4: ϵm = Σ_{i=1}^{N} Wm−1(i) ∗ 1{yi ̸= hm(xi)}
5: αm = 1/2 ∗ log((1 − ϵm)/ϵm)
6: for i = 1 : N do
7: if yi = hm(xi) then
8: γ = 1/(1 − ϵm)
9: else
10: γ = 1/ϵm
11: end if
12: Wm(i) = 1/2 ∗ Wm−1(i) ∗ γ
13: end for
14: end for
15: y = Σ_{m=1}^{M} αm ∗ hm(x)
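A small numpy sketch of these updates, using threshold stumps as the weak learners (the stump search, data and round count are our own simplifications, not part of the original algorithm statement):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = np.where(x + 0.1 * rng.normal(size=100) > 0, 1, -1)   # noisy sign labels

def best_stump(x, y, w):
    """Weak learner: threshold stump minimizing the weighted error."""
    best = None
    for t in np.unique(x):
        for s in (1, -1):
            pred = np.where(x > t, s, -s)
            err = w[pred != y].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

W = np.ones(100) / 100
F = np.zeros(100)   # running ensemble score on the training points
for m in range(10):
    eps, t, s = best_stump(x, y, W)
    eps = np.clip(eps, 1e-9, 1 - 1e-9)   # guard against log(0)
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = np.where(x > t, s, -s)
    F += alpha * pred
    # weight update: scale by 1/(1-eps) if correct, 1/eps if wrong, then halve
    W = 0.5 * W * np.where(pred == y, 1 / (1 - eps), 1 / eps)
print((np.sign(F) == y).mean())   # training accuracy of the boosted ensemble
```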

Let us discuss why this algorithm works in terms of the mathematics behind α, ϵ, W. The first thing to
start with is defining the error function. Let yi ∈ {−1, +1} (∀i ∈ [1..N]). Such an assumption does not restrict
the scope of the AdaBoost derivation, as we will see. Let FM(x) = Σ_{m=1}^{M} αm ∗ hm(x)
be the ensemble. Let us define a loss function L() over the FM() classifier. The error on mis-classification is
measured using the LA error function (Equation 27). However, for ease of derivation of the update equations
for the α, ϵ, W quantities, we consider the exponential form: the exponentiated value of LA is still a good
indicator of the error, with all positive values, LB in Equation 28. Since a sum of terms occurs inside the
exponentiation, applying Jensen’s inequality to the convex exponential function yields Equation 31, an
upper bound on the error term LA.


i=N
LA = −1/N ∗ yi ∗ FM (xi ) (27)
i=1
A
LB = eL (28)
⇒ (LB ≥ 0) ∧ (LB ≥ LA ) (29)

i=N
LB ≤ 1/N ∗ e−yi ∗FM (xi ) (30)
i=1

i=N
L(FM ) := 1/N ∗ e−yi ∗FM (xi ) (31)
i=1

The AdaBoost formulation reduces an upper bound on the error rather than the error directly. Let Wk(i)
be such that Σ_{i=1}^{N} Wk(i)/Zk = 1 (∀k ∈ [0..M]), where Zk is the normalizing factor. The best classifier and its
weight are selected by minimizing the loss function in Equation 32.

L(FM+1) = L(FM + α ∗ h)
        = 1/N ∗ Σ_{i=1}^{N} e^{−yi ∗ FM+1(xi)}
        = 1/N ∗ Σ_{i=1}^{N} e^{−yi ∗ (FM(xi) + α ∗ h(xi))}
        = 1/N ∗ Σ_{i=1}^{N} e^{−yi ∗ FM(xi)} × e^{−yi ∗ α ∗ h(xi)}

Let WM(i) := e^{−yi ∗ FM(xi)} / ZM

⇒ L(FM+1) = l(α, h) = 1/N ∗ Σ_{i=1}^{N} WM(i) × e^{−yi ∗ α ∗ h(xi)}

αM+1, hM+1 = arg min_{(α,h)} l(α, h) = arg min_{(α,h)} { 1/N ∗ Σ_{i=1}^{N} WM(i) × e^{−yi ∗ α ∗ h(xi)} }    (32)

We can note from Equation 32 that hM+1 does not depend on α or ZM; therefore the
optimal hM+1 is found as given in Equation 33.

L(FM+1) = 1/N ∗ Σ_{i=1}^{N} WM(i) × e^{−yi ∗ α ∗ h(xi)}
        = 1/N ∗ { Σ_{yi=h(xi)} WM(i) ∗ e^{−α} + Σ_{yj̸=h(xj)} WM(j) ∗ e^{α} }
        = 1/N ∗ { e^{−α} ∗ Σ_{yi=h(xi)} WM(i) + e^{α} ∗ Σ_{yj̸=h(xj)} WM(j) }
        = 1/N ∗ { e^{−α} ∗ (1 − Σ_{yj̸=h(xj)} WM(j)) + e^{α} ∗ Σ_{yj̸=h(xj)} WM(j) }
        = 1/N ∗ { e^{−α} − e^{−α} ∗ Σ_{yj̸=h(xj)} WM(j) + e^{α} ∗ Σ_{yj̸=h(xj)} WM(j) }
        = 1/N ∗ { e^{−α} + (e^{α} − e^{−α}) ∗ Σ_{yj̸=h(xj)} WM(j) }
        = 1/N ∗ { e^{−α} + (e^{α} − e^{−α}) ∗ Σ_{i=1}^{N} 1{yi ̸= h(xi)} WM(i) }

Since the minimizing h does not depend on α,

⇒ arg min_h l(α, h) = arg min_h Σ_{i=1}^{N} 1{yi ̸= h(xi)} WM(i)

hM+1 = arg min_h L(FM+1) = arg min_h Σ_{i=1}^{N} 1{yi ̸= h(xi)} WM(i)    (33)

Once the minimizing hM+1 is determined, the next task is to determine the minimizing αM+1 of the loss
function. The weighted error ϵM+1 is defined in (Equation 34) and αM+1 is determined as in (Equation 35).


ϵM+1 = Σ_{i=1}^{N} WM(i) ∗ 1{yi ̸= hM+1(xi)}    (34)

l(α, hM+1) = 1/N ∗ { e^{−α} + (e^{α} − e^{−α}) ∗ Σ_{i=1}^{N} 1{yi ̸= hM+1(xi)} WM(i) }
           = 1/N ∗ { e^{−α} + (e^{α} − e^{−α}) ∗ ϵM+1 }

αM+1 = arg min_α l(α, hM+1)    (35)

In order to identify α, taking the partial derivative of (Equation 35) with respect to α yields the
following equations (Equation 36).

∂l(α, hM+1)/∂α = 0
⇒ { −e^{−α} + (e^{α} + e^{−α}) ∗ ϵM+1 } = 0
⇒ { −1 + (e^{2α} + 1) ∗ ϵM+1 } = 0
⇒ e^{2α} = 1/ϵM+1 − 1

αM+1 = 1/2 ∗ log((1 − ϵM+1)/ϵM+1)    (36)

The weight update equation for the next iteration WM +1 depends on the previous WM as in (Equation 37).

(∃γ) : WM+1(i) = γ ∗ e^{−yi ∗ FM+1(xi)}    (37)

WM+1(i) = γ ∗ e^{−yi ∗ (FM(xi) + αM+1 ∗ hM+1(xi))}
        = γ ∗ e^{−yi ∗ FM(xi)} ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)}
        = γ ∗ WM(i) ∗ ZM ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)}

Requiring the weights to sum to one,

⇒ Σ_{i=1}^{N} WM+1(i) = γ ∗ ZM ∗ Σ_{i=1}^{N} WM(i) ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)} = 1
= γ ∗ ZM ∗ ( Σ_{yi=hM+1(xi)} WM(i) ∗ e^{−αM+1} + Σ_{yi̸=hM+1(xi)} WM(i) ∗ e^{αM+1} ) = 1
= γ ∗ ZM ∗ ( (1 − ϵM+1) ∗ e^{−αM+1} + ϵM+1 ∗ e^{αM+1} ) = 1

Consider Equation 36,

αM+1 = 1/2 ∗ log((1 − ϵM+1)/ϵM+1)

⇒ e^{αM+1} = √((1 − ϵM+1)/ϵM+1)

⇒ e^{−αM+1} = √(ϵM+1/(1 − ϵM+1))

Then normalizing the weights WM+1(i) (∀i ∈ [1..N]) derives Equation 38.

γ ∗ ZM ∗ ((1 − ϵM+1) ∗ e^{−αM+1} + ϵM+1 ∗ e^{αM+1}) = 1
⇒ γ ∗ ZM ∗ ((1 − ϵM+1) ∗ √(ϵM+1/(1 − ϵM+1)) + ϵM+1 ∗ √((1 − ϵM+1)/ϵM+1)) = 1
⇒ γ ∗ ZM ∗ (√((1 − ϵM+1) ∗ ϵM+1) + √(ϵM+1 ∗ (1 − ϵM+1))) = 1
⇒ γ = 1 / (ZM ∗ 2 ∗ √((1 − ϵM+1) ∗ ϵM+1))

γ = 1 / (ZM ∗ 2 ∗ √((1 − ϵM+1) ∗ ϵM+1))    (38)
Substituting (Equation 38) in (Equation 37) and simplifying for the cases of yi ̸= hM +1 (xi ) and the other
class of points results in separate weight update equations (Equation 39) and (Equation 40).

WM+1(i) = γ ∗ WM(i) ∗ ZM ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)}
        = (1 / (ZM ∗ 2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ ZM ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)}
        = (1 / (2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ e^{−yi ∗ αM+1 ∗ hM+1(xi)}

(yi = hM+1(xi)) : WM+1(i) = (1 / (2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ e^{−αM+1}
                          = (1 / (2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ √(ϵM+1/(1 − ϵM+1))

(yi ̸= hM+1(xi)) : WM+1(i) = (1 / (2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ e^{αM+1}
                           = (1 / (2 ∗ √((1 − ϵM+1) ∗ ϵM+1))) ∗ WM(i) ∗ √((1 − ϵM+1)/ϵM+1)

(yi = hM+1(xi)) : WM+1(i) = WM(i) / (2 ∗ (1 − ϵM+1))    (39)

(yi ̸= hM+1(xi)) : WM+1(i) = WM(i) / (2 ∗ ϵM+1)    (40)

6.6.2 Gradient Boosting Algorithm

Gradient boosting [21] is another highly popular boosting algorithm, where over iterations the error itself is
predicted and subtracted from the output of the classifier. The word gradient is used to convey that the
error is proportional to the negative of the gradient of a loss function.
Let y = F(x) be the machine learned function that predicts the y coordinate for the input x. Let L(F(x), y)
be the loss function that computes the difference between the predicted and actual outputs. Let L(F(x), y) =
1/2 ∗ (y − F (x))2 be the squared loss function. Then the following derivations compute new values of F in
Equation 41.

L(F(x), y) = 1/2 ∗ (y − F(x))²

∂L(F(x), y)/∂F(x) = 1/2 ∗ 2 ∗ (y − F(x)) ∗ (−1)

⇒ ∇L(F(x), y) = ∂L(F(x), y)/∂F(x) = (F(x) − y)

⇒ F_new(x) = F_old(x) − ∇L(F(x), y)|_{F(x)=F_old(x)}

F_new(x) = F_old(x) − predicted(F_old(x) − y)    (41)

Building a sequence of classifiers: The steps in building the sequence of classifiers are as follows.

• Let F1 (x) be the first classifier built over the data set

• Let e1 (x) = F1 (x) − y be the classifier or regressor for the error

• Let F2 (x) = F1 (x) − e1 (x) be the updated classifier

• Let e2 (x) = F2 (x) − y be the classifier or regressor for the error of the updated classifier

• Let F3 (x) = F2 (x) − e2 (x) be the updated classifier

• and so on . . .

• Let FM +1 (x) = FM (x) − eM (x)


• Then we can expand FM+1(x) = F1(x) − Σ_{i=1}^{M} ei(x)

For any other loss function of the form L(F(x) − y), if the function is not constant, then the
gradient term satisfies ∇L(F(x) − y) ∝ (F(x) − y). There are other variants of gradient boosting techniques; one
very popular technique is called xgboost, and it combines features of random forests and gradient boosting.
Though boosting algorithms reduce error, they are prone to overfitting. Unlike bagging, boosting
algorithms can be ensembles of several weak classifiers. The focus in boosting is error reduction, whereas the
focus of bagging is variance reduction.
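A minimal sketch of this error-fitting loop, using shallow regression trees to predict the residuals (the tree depth, number of rounds and synthetic data are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 200)

F = np.full(200, y.mean())             # F1: initial constant predictor
learners = []
for _ in range(50):
    e = F - y                          # current error F_M(x) - y
    tree = DecisionTreeRegressor(max_depth=2).fit(x, e)
    learners.append(tree)
    F = F - tree.predict(x)            # F_{M+1}(x) = F_M(x) - e_M(x)
print(round(np.mean((F - y) ** 2), 4)) # training MSE shrinks over the rounds
```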

6.7 Bias Variance Trade off


Bias and variance are two important characteristics of any machine learning model, and they closely relate
to the training and validation set accuracies [22]. The simpler the model, the lower the test set accuracy
and even the training set accuracy. On the other hand, if the model is too complex, the training set accuracy
is high whereas the test set accuracy is poor. A machine learning model needs to generalize well on
unseen data, thereby requiring the model to neither underfit nor overfit.
Bias is a problem related to underfitting of the model to the data. The problem is evident from low
training set accuracies itself, and is addressed by increasing the complexity of the model. Variance
is a problem related to overfitting of the model to the training data. Overfitting is a scenario where the model
performs poorly on any input outside of the training data provided to it. The variance
problem is addressed by bagging, by increasing the training data size, or by reducing the complexity of the
model. It is desirable to have low bias and low variance, although achieving both is difficult if not
impossible in practical problem scenarios.
The algorithm for determining the bias and variance of a given model is shown in Algorithm 10. Multiple
subsets are drawn from the input (with replacement). For each subset, a classifier is built that minimizes the error
on that subset, and finally an ensemble of such classifiers is formed (the Γ list). Bias is then defined as the deviation of
the averaged prediction from the true value, and variance as the spread of values across the constituent classifiers’
predictions.
The derivation of the bias variance trade off involves considering expectations of test set errors over diverse
data sets. Following the usual notation, Ex∈X[f(x)] denotes the expectation of the function f(x) over
a range of inputs x ∈ X. Some of the terminology is shown in the tabulation below. Let us derive the bias
variance relation using a least squares error function.
Symbol     Meaning
D          Data set of (x, y) points
sample()   Function that selects a subset of points
Π          Set of all models built over subsets of D
M(x)       Average of the model predictions
δS(x)      Error of model MS on a given data point

D = {(x, y)}
Π = {MS |MS = model(sample(D))}
M (x) = EMS ∈Π [MS (x)]
δS (x) = {MS (x) − ŷ}2

The relationship between bias and variance is hidden in the δS(x) function for the model MS built over
the subset of data S. The final error ϵ is the expected value taken over all the data points in the test set and over
all the models hence built, as depicted in Equation 42, where bias is denoted by the symbol β and variance by the
symbol ν. The bias is defined as in Equation 43 and the variance in Equation 44. The final error is the sum
of the bias and variance components (Equation 45), because expanding the expectation over the constituent
terms of the δS(x) function results in a clean separation of the bias and variance terms.

δS(x) = {MS(x) − ŷ}²
      = {MS(x) − ŷ − M(x) + M(x)}²
      = {(MS(x) − M(x)) + (M(x) − ŷ)}²
      = {MS(x) − M(x)}² + {M(x) − ŷ}² + 2 ∗ (MS(x) − M(x)) ∗ (M(x) − ŷ)

ϵ = ES[Ex[δS(x)]]    (42)

β = ES[Ex[{M(x) − ŷ}²]] = Ex[ES[{M(x) − ŷ}²]] = Ex[{M(x) − ŷ}²]    (43)

ν = Ex[ES[{MS(x) − M(x)}²]]    (44)

ϵ = Ex[ES[δS(x)]]
  = Ex[ES[{MS(x) − M(x)}² + {M(x) − ŷ}² + 2 ∗ (MS(x) − M(x)) ∗ (M(x) − ŷ)]]
⇒ ϵ = β + ν + Ex[ES[2 ∗ (MS(x) − M(x)) ∗ (M(x) − ŷ)]]

The third term in the ϵ expansion evaluates to zero as follows.

Ex[ES[2 ∗ (MS(x) − M(x)) ∗ (M(x) − ŷ)]]
= 2 ∗ Ex[ES[MS(x) ∗ M(x) − M(x) ∗ M(x) − MS(x) ∗ ŷ + M(x) ∗ ŷ]]
= 2 ∗ Ex[M(x)²] − 2 ∗ Ex[M(x)²] − 2 ∗ Ex[M(x) ∗ ŷ] + 2 ∗ Ex[M(x) ∗ ŷ]
= 0

∴ ϵ = β + ν    (45)

Algorithm 10 Bias Variance Calculation


Require: X /*Input data set*/
Γ = [] /*List of classifiers*/
for i = 1 : M do
Xsub = subset(X)
h∗ = arg minh L(h(Xsub ), y)

Γ ← Γ h∗
end for
/*Define bias and variance as functions over individual classifiers’ outputs*/
Given (x, y) ∈ X /*x is input and y is true value*/
bias(x) := (mean({γ(x) : ∀γ ∈ Γ}) − y)²
variance(x) := var({γ(x) : ∀γ ∈ Γ})
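A numpy sketch of Algorithm 10 for a constant (horizontal line) model, mirroring the experiment in the next subsection (the sample counts, noise level and true curve are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)

# Each "model" is a horizontal line (the mean of y) fit on a bootstrap subset.
preds = []
for _ in range(1000):
    idx = rng.integers(0, len(x), 50)        # subset drawn with replacement
    preds.append(np.full(len(x), y[idx].mean()))
preds = np.array(preds)                      # shape (models, points)

avg = preds.mean(axis=0)                     # M(x): averaged prediction
bias2 = np.mean((avg - np.sin(2 * np.pi * x)) ** 2)  # against the true curve
variance = np.mean(preds.var(axis=0))        # spread across the models
print(round(bias2, 3), round(variance, 4))
```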

6.7.1 Bias Variance Experiments

Bias and variance are computed over two models - (i) a horizontal line and (ii) an inclined line, both of
which are valid models (Figure 7). The data is a set of 2D coordinates drawn from a true curve
of sinusoidal shape with some added noise. A thousand random subsets are selected from the data
set and both models ((i) and (ii)) are fit and plotted.
The average of the predictions is plotted in (Figure 8). This average relates to the bias of the
model with the given parameter settings. The variance about each data point is plotted for both the models
in (Figure 9). The smaller the width of the pink bounded region, the lower the variance. As can be seen,
the variance for some points is high while for others it is low. The denser regions correspond to
low variance zones.
The variance plots for the cases when the training data size is larger are shown in (Figure 10). As the training
data size increases, the variance reduces for both models.

6.8 Cross validation & Model selection


Cross validation is used for model configuration or parameter selection. The process of cross validation is
closely related to the bagging method, since in both processes different subsets of the training data are used
for building classifiers. The training data is repeatedly split into two parts - a learning part and a validation part.
On each learning subset, a machine learning model is built and evaluated on the corresponding validation subset.
The aggregate of the evaluation metric is computed and reported as the final cross validation metric.
A model parameter configuration that results in overfitting causes each and every model to overfit to
its learning subset and exhibit poor performance on its validation subset. The final aggregate metric over
multiple overfitting models still results in a poor cross validation metric.
Two of the most popular cross validation methods are K-Fold and Leave One Out (LOO). In the K-Fold
cross validation algorithm, the data set is shuffled and split into K parts of roughly equal size. Iteratively, over
K iterations, each of the subsets is chosen as the validation set and the remaining (K-1) folds are combined
and used as the learning subset. In LOO cross validation, just one exemplar is chosen as the validation subset
and all remaining points are used as the learning subset. A sketch of the cross validation method is given in
Algorithm 11. The metric function metric(ypred, ytrue) takes two arguments as input, the predicted and actual
values of the y coordinates. The function model() builds a model classifier or regressor. The data X has an implicit
label column X[′ label′ ].

Algorithm 11 Cross Validation


Require: X, metric /*Input data set and metric function*/
scores = []

for i = 1 : M do

/*In K fold - subset() retrieves combined (K-1) folds and M is K*/


/*In LOO - subset() retrieves all but one exemplar*/
Xlearn = subset(X)

/*In K fold - the remaining fold is validation set*/


/*In LOO - the remaining exemplar is the validation set*/
Xval = X − Xlearn

M = model(Xlearn )
yval = Xval [′ label′ ]


scores = scores ∪ metric(M(Xval), yval)
end for
return mean(scores)
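With scikit-learn, K-Fold cross validation of a model reduces to a few lines (the model, data set and K are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print(scores.mean())   # aggregate cross validation metric over the 5 folds
```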

The model selection process involves exploring diverse configurations of the model parameters and evaluating
each over the training data set using the cross validation methodology. For each configuration of the model,
a cross validation score is computed, and finally the model configuration with the highest score is output.

Model selection process Selection of a model corresponds to selection of parameters for a chosen
ML methodology. For instance, if SVM is the chosen methodology, then the parameters to select are the
kernel and the C penalty attribute. As another example, if the chosen methodology is a decision tree, then the
parameters to select include the depth of the tree, the impurity function, the minimal leaf size and the minimal
purity metric value. The steps in model selection are summarized in (Algorithm 12).

Algorithm 12 Model selection - pseudocode


1: Decide on the modeling methodology - for e.g. SVM, Decision tree, Logistic regression, Random forest
etc.
2: Provide possible sets of values for the configurable attributes of the model as a grid to explore automat-
ically
3: Choose a metric to evaluate, it can be customized metric as well
4: for param ∈ parameter sets do
5: for (learning subset, validation subset) ∈ splits of the training data do
6: Select subset of training data as learning data and remaining data as validation subset
7: Perform cross validation and compute the metric
8: end for
9: Compute average value of the metric
10: end for
11: Choose the parameter combination that gives highest score for the metric
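This grid exploration is exactly what scikit-learn's GridSearchCV automates; a short sketch for the decision tree example (the parameter grid is an arbitrary choice):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
grid = {"max_depth": [1, 2, 3, 5, 8], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(), grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```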

6.8.1 Learning curves

Learning curves correspond to the evaluation of the model over various parameter values or over various sizes of the
data. They may also correspond to any other aspect of model evaluation where the idea is to determine the
metric values that lead to the selection of (near) optimal parameters for the model. An illustration
of learning curves for various values of the depth of a decision tree is shown in (Figure 16). Another illustration
of a learning curve over multiple subsets of the training data of various sizes is shown in (Figure 17).

6.9 Multi class and multi variate scenarios


The algorithms discussed so far handle the scenarios of two-label classification and univariate regression.
These algorithms need to be modified for multi class and multi variate scenarios. Modification of the linear
regression algorithm for the multivariate scenario is straightforward: it is equivalent to carrying out multiple
linear regressions simultaneously, one for each output variable occurring in a summation term of the loss
function. However, modification of the two class classification algorithms for multi class classification is not
straightforward and typically involves tricky customization techniques.

6.9.1 Multi variate linear regression

Multi variate linear regression over an output vector can be posed as minimization of a loss
function which is the sum of the errors for each of the output variables.

Let y = [y1 , · · · , yK ] be the K dimensional output vector for any data point. The input data
i.e. rows are indicated by X matrix where Xi denotes ith row. Let Y denote a matrix of output vectors of
all data points, i.e. Yj denotes the j th column of Y matrix across all data points. Let Wj be the vector of
weights corresponding to j th column of Y . The j th column of the ith row is denoted by Yj [i]. The ith row of
the input is denoted by X[i]. The prediction formulation then becomes Equation 46, where N is the number
of data points.

Yj [i] = Wj · X[i](∀i ∈ [1 : N ] ∧ ∀j ∈ [1 : K]) (46)

Here Yj is an (N × 1) vector and Yj[i] is the scalar value in that vector corresponding to the ith row. Wj is a vector
of (d × 1) dimensions, where d is the dimensionality of the input data. The input data is denoted by X, which
has (N × d) dimensions, where N is the number of input points. The loss function is defined as the summation
of the losses incurred over the individual Wj vectors over the entire data set, Equation 47.

L′(X, W, Y) = Σ_{j=1}^{K} Σ_{i=1}^{N} L(Xi · Wj, Yj[i])    (47)

This loss function is then minimized with respect to each of the Wj[k] variables, and the gradient descent
algorithm is applied to find the optimal values (Equation 48).

∇L′ : ∂L′/∂Wj[k] = ∂(Σ_{i=1}^{N} L(Xi · Wj, Yj[i]))/∂Wj[k] = Σ_{i=1}^{N} ∂L(Xi · Wj, Yj[i])/∂Wj[k]    (48)

The weight update equation now becomes Equation 49.

Wj [k]new ← Wj [k]old − α × ∇L′ |Wj [k]=Wj [k]old (49)

6.9.2 Multi class classification

There are a variety of techniques to handle multi class classification problems, based on the nature of the
classification algorithm applied. For support vector machines, multiple classes are handled in one-vs-rest
or all-pairs approaches. For logistic regression, multiple classes are handled as a summation of the loss
functions of the individual logistic regressions over the pertinent classes, each class having its own weight vector.

6.9.2.1 Multi class SVM There are two major approaches to handling multiple classes in support
vector machines (SVM) [15]. In the first approach, an SVM is built for each class, treating that class
as positive and the remaining classes as negative exemplars (Algorithm 13). In the other approach, an SVM is
built for each pair of classes, and a voting scheme or weighted confidence of prediction is used for a new
input (Algorithm 14).

6.9.2.2 Multi-class logistic regression The multi class logistic regression formulation needs a proper definition of the log odds in favor of one class against some other class or the rest of the data. The formulation could be very similar to multi variate linear regression, where the task is to determine each of the Wo vectors corresponding to output dimension o. However, as logistic regression deals with probabilities of each class, which must sum up to 1, the formulation takes a different turn.

Algorithm 13 SVM - One vs Rest
Require: D = {(x, y)} /*input data - feature vector and label*/

Let χ be the set of output classes {y : (x, y) ∈ D}

Γ = []
for c ∈ χ do
    D′ ← {(x, 1(y = c)) : (x, y) ∈ D} /*class c as positive, the rest as negative*/
    Γ ← Γ ∪ [Ψc(D′)] /*train an SVM Ψc and append it to Γ*/
end for

/*Prediction for input x*/

y = arg max_{Ψ∈Γ} Ψ(x)

/*Alternatively, all the SVMs can be used to transform the input vector*/

DΓ = {([Ψ(x) : Ψ ∈ Γ], y) : (x, y) ∈ D}
MΓ = Classifier(DΓ)

/*Prediction for input x, after the same transformation*/

y = MΓ([Ψ(x) : Ψ ∈ Γ])

Algorithm 14 SVM - All Pairs

Require: D = {(x, y)} /*input data - feature vector and label*/

Let χ be the set of output classes {y : (x, y) ∈ D}

Γ = []
for c1, c2 ∈ χ (∀c1 ≠ c2) do
    D′ ← {(x, 1) : (x, y) ∈ D ∧ y = c1} ∪ {(x, 0) : (x, y) ∈ D ∧ y = c2}
    Γ ← Γ ∪ [Ψc1,c2(D′)] /*train an SVM for the pair (c1, c2) and append it to Γ*/
end for

/*Prediction for input x*/

y = arg max_{c∈χ} |{Mc(x) = 1 : Mc ∈ Γ}| /*voting*/

/*Alternatively, all the SVMs can be used to transform the input vector*/

DΓ = {([M(x) : M ∈ Γ], y) : (x, y) ∈ D}
MΓ = Classifier(DΓ)

/*Prediction for input x, after the same transformation*/

y = MΓ([M(x) : M ∈ Γ])
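Both strategies are available as ready-made wrappers in scikit-learn; the sketch below, on the iris data set, is one plausible way to realize Algorithms 13 and 14 (without the optional second-stage classifier).

```python
# One-vs-rest and all-pairs multi-class SVMs (Algorithms 13 and 14).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
ovr = OneVsRestClassifier(SVC()).fit(X, y)   # one SVM per class
ovo = OneVsOneClassifier(SVC()).fit(X, y)    # one SVM per class pair, vote at prediction
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```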

In order to define the log odds, the definition of the other class becomes important. One of the classes is chosen as the pivot class and the log odds are defined in terms of that class. Some of the symbols and their meanings are given in the tabulation below.

Symbol      Meaning
Wc          Weight vector for determining class c
Wc · x      Log odds in favor of class c against the pivot class K, i.e.
            Wc · x = log(P(y = c|Wc)/P(y = K|WK)) (∀c ∈ [1..(K − 1)])
K           Pivot class
The relation between class c and class K is derived as below to result in Equation 50.

\begin{align*}
W_c \cdot x &= \log\left(\frac{P(y=c|W_c)}{P(y=K|W_K)}\right) \quad (\forall c \in [1..(K-1)]) \\
\Rightarrow P(y=c|W_c) &= e^{W_c \cdot x} \cdot P(y=K|W_K) \quad (\forall c \in [1..(K-1)]) \\
\because \sum_{c=1}^{K-1} P(y=c|W_c) &+ P(y=K|W_K) = 1 \\
\Rightarrow \left(\sum_{c=1}^{K-1} e^{W_c \cdot x} \cdot P(y=K|W_K)\right) &+ P(y=K|W_K) = 1 \\
\Rightarrow P(y=K|W_K) \cdot \left(1 + \sum_{c=1}^{K-1} e^{W_c \cdot x}\right) &= 1
\end{align*}

\[ P(y=K|W_K) = \frac{1}{1 + \sum_{c=1}^{K-1} e^{W_c \cdot x}} \qquad (50) \]

However, to unify the notation for classes 1..(K − 1) and the Kth class, the steps below result in (Equation 51).

\begin{align*}
(\forall c \in [1..(K-1)]) : P(y=c|x,[W_1,\ldots,W_K]) &= P(y=c|x,W_c) = e^{W_c \cdot x} \cdot P(y=K|W_K) \\
&= e^{W_c \cdot x} \cdot \frac{1}{1 + \sum_{c'=1}^{K-1} e^{W_{c'} \cdot x}} \\
c = K \rightarrow P(y=K|[W_1,\ldots,W_K]) = P(y=K|W_K) &= \frac{1}{1 + \sum_{c'=1}^{K-1} e^{W_{c'} \cdot x}} \\
&= \frac{e^{\alpha}}{e^{\alpha}} \cdot P(y=K|W_K) = \frac{e^{\alpha}}{e^{\alpha} + \sum_{c=1}^{K-1} e^{(W_c \cdot x + \alpha)}} \\
(\alpha = W_K \cdot x) \rightarrow P(y=K|W_K) &= \frac{e^{W_K \cdot x}}{e^{W_K \cdot x} + \sum_{c=1}^{K-1} e^{W_c \cdot x + W_K \cdot x}} \\
&= \frac{e^{W_K \cdot x}}{e^{W_K \cdot x} + \sum_{c=1}^{K-1} e^{(W_c + W_K) \cdot x}} \\
\Rightarrow (\forall c \in [1..K]) : W'_c &\vdash (W_c + W_K) \\
\Rightarrow P(y=c|W'_c) &= P(y=c|W_c)
\end{align*}

\[ (\forall [W_1,\ldots,W_K]), (\forall c \in [1..K]) : P(y=c|W_c) = \frac{e^{W_c \cdot x}}{\sum_{c'=1}^{K} e^{W_{c'} \cdot x}} \qquad (51) \]

Now the likelihood of the data with respect to Wc (∀c ∈ [1..K]) is defined as in Equation 52, due to the i.i.d. property of the individual data elements.

\[ P(\{(x_1, y_1), \ldots, (x_N, y_N)\} \,|\, [W_1, \ldots, W_K]) = \prod_{i=1}^{N} P(y = y_i | x_i, [W_1, \ldots, W_K]) = \prod_{i=1}^{N} P(y_i | x_i, W_{y_i}) \qquad (52) \]

Expressing the likelihood of data using (Equation 51) results in the steps as below.

\begin{align*}
[W_1, \ldots, W_K]^* &= \arg\max_{[W_1, \ldots, W_K]} \prod_{i=1}^{N} P(y_i | x_i, W_{y_i}) \\
&= \arg\max_{[W_1, \ldots, W_K]} \log\left(\prod_{i=1}^{N} P(y_i | x_i, W_{y_i})\right) \\
\text{Let } l(W_1, \ldots, W_K) &= \log\left(\prod_{i=1}^{N} P(y_i | x_i, W_{y_i})\right) \\
\Rightarrow l(W_1, \ldots, W_K) &= \sum_{i=1}^{N} \log(P(y_i | x_i, W_{y_i})) \\
&= \sum_{i=1}^{N} \log\left(e^{W_{y_i} \cdot x_i} \cdot \frac{1}{\sum_{c=1}^{K} e^{W_c \cdot x_i}}\right) \\
&= \sum_{i=1}^{N} \left((W_{y_i} \cdot x_i) - \log\left(\sum_{c=1}^{K} e^{W_c \cdot x_i}\right)\right) \\
&= \sum_{i=1}^{N} W_{y_i} \cdot x_i - \sum_{i=1}^{N} \log\left(\sum_{c=1}^{K} e^{W_c \cdot x_i}\right)
\end{align*}
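The gradient of this log likelihood with respect to a weight vector Wc works out to Σ_{i=1}^{N} (1[y_i = c] − P(y = c|x_i)) x_i, which suggests a simple gradient ascent procedure. The numpy sketch below follows that route; the synthetic data, learning rate and iteration count are illustrative assumptions.

```python
# Multi-class (softmax) logistic regression by gradient ascent on l(W_1..W_K).
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, K = 300, 2, 3
y = rng.integers(0, K, size=N)
X = rng.normal(size=(N, d)) + 3 * np.eye(K)[y][:, :d]  # crude per-class shift

W = np.zeros((K, d))
Yonehot = np.eye(K)[y]
for _ in range(200):
    P = softmax(X @ W.T)                       # P[i][c] = P(y = c | x_i)
    grad = (Yonehot - P).T @ X / N             # gradient of the log likelihood
    W += 0.1 * grad                            # gradient *ascent*
print((softmax(X @ W.T).argmax(axis=1) == y).mean())   # training accuracy
```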

6.10 Regularization
Regularization is an attempt to reduce the sensitivity of the model with respect to fine changes in the input. Consider a model,

\[ \hat{Y} = M(\Theta, X) \]

Now consider a simple mean squared loss over N points,

\[ L(X) = \frac{1}{N} ||\hat{Y} - Y||^2 \]
N
Note that we have expressed the loss function in terms of the input points X. Taking the gradient of the loss function with respect to the input points results in a function g(Θ),

\[ \nabla_X L(X) = g(\Theta) \]

This function is sensitive to the magnitudes of Θ, which implies that slight changes in the input are going to cause fluctuations in the model predictions.
In order to avoid this, a usual trick is to add the magnitude of the weights, or the absolute values of the parameters, to the penalty or loss function. This term is added with an appropriate scaling factor to fine tune the amount of regularization needed.

6.10.1 Regularization in gradient methods

An Lp norm of a k dimensional w vector, in the case of a linear regression or logistic regression problem, is defined as

\[ L_p(w) = \sum_{i=1}^{k} |w_i|^p \]

A value of p = 1 gives the L1 norm (used in lasso regression) and a value of p = 2 gives the L2 norm (used in ridge regression). These norms are used as regularization terms in a loss function when fine tuning a machine learning model. An illustration of the two norms over a 2D scenario (i.e. k = 2) is shown in (Figure 13).
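A small sketch of how these penalties are used in practice with scikit-learn is given below; alpha is the scaling factor on the norm term and its value here is an arbitrary choice.

```python
# L1 (lasso) and L2 (ridge) regularized linear regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty drives some weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks all weights smoothly
print(lasso.coef_)
print(ridge.coef_)
```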

6.10.2 Regularization in other methods

In case of decision tree based methods, regularization corresponds to limiting the depth of the tree. In case of neural networks, dropout has a regularization effect. In case of the adaboost method, regularization corresponds to both the depth of the constituent decision trees and the number of estimators.

6.11 Metrics in machine learning


A machine learning model needs to be evaluated for optimal choice of parameters, optimal performance on a data set, low bias and good generalizability. Good performance on the training set indicates a low bias model, i.e. one that is non-trivial and able to capture the signal in the data. After demonstrable accuracy on the training data set comes generalizability. The algorithm should not over-learn the training data; it should learn just enough that performance on unseen data improves. Generalization is characterized by low variance. These points are summarized in the 4 steps below.

• First identify the essential model configuration that results in decent training set accuracies

• That is, first identify a low bias model

• Next identify a low variance model, for better performance on test data, i.e. better generalizability

• Search over the parameter space and data sub-sets

6.11.1 Confusion matrix

In case of a two class classification problem, a typical confusion matrix is defined over positive and negative classes. The steps are as below.

• Choose one of the two classes as positive, and the other as negative

• This choice varies from one domain to another and is subjective to the individual or the team involved

• The ground truth data (i.e. data set) can now be categorized into two parts - (i) actual positives and
(ii) actual negatives

• The model predictions can now be categorized into two parts - (i) predicted positives and (ii) predicted
negatives

• From these sets of data points, commonalities can be determined to form a confusion matrix (Table 9).

Table 9: Confusion matrix

                      Actual positives    Actual negatives
Predicted positives   True positives      False positives
Predicted negatives   False negatives     True negatives

In case of multiple classes, designating positive and negative classes is not possible. In this case a matrix of predicted and actual classes is constructed. If the number of classes is k, then the dimensionality of this matrix is k × k. The elements on the diagonal, i.e. each (i, i)th cell, correspond to correct predictions. The elements off the diagonal correspond to wrong predictions, both horizontally and vertically. An illustration of the confusion matrix is given in (Figure 18). In this matrix, with rows for actual classes and columns for predicted classes, the off-diagonal elements of a class's column are its false positives, and the off-diagonal elements of a class's row are its false negatives. The true positives of each class are the diagonal elements. The true negatives of a class are the sum of all elements outside that class's row and column.
The accuracy metric from the confusion matrix is (Equation 53). Here α denotes accuracy, the confusion matrix is denoted by CM[][], the number of classes is k and the total number of elements is N.

\[ \alpha = \frac{\sum_{i=1}^{k} CM[i][i]}{N} \qquad (53) \]

Precision for the jth class is given by (Equation 54).

\[ \pi[j] = \frac{CM[j][j]}{\sum_{i=1}^{k} CM[i][j]} \qquad (54) \]

Recall for the jth class is given by (Equation 55).

\[ \rho[j] = \frac{CM[j][j]}{\sum_{i=1}^{k} CM[j][i]} \qquad (55) \]

The same equations are applicable for the two class problem as well.
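A small numpy sketch of these metrics, with rows for actual classes and columns for predicted classes, is given below; the toy labels are illustrative.

```python
# Multi-class confusion matrix, accuracy, precision and recall (Eqs. 53-55).
import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
k = 3
CM = np.zeros((k, k), dtype=int)
for t, p in zip(y_true, y_pred):
    CM[t][p] += 1                          # rows: actual class, columns: predicted

accuracy = np.trace(CM) / CM.sum()         # Equation 53
precision = np.diag(CM) / CM.sum(axis=0)   # column sums: everything predicted as j
recall = np.diag(CM) / CM.sum(axis=1)      # row sums: everything actually j
print(accuracy, precision, recall)
```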

6.11.2 Precision-Recall Curve

Another metric for model assessment is the precision-recall curve, also called the PR curve. This is actually not a curve but a 2D scatter plot on which various classifiers are plotted. We want to select the classifier that gives high precision and high recall scores. This can be used for model selection as well. However, it is customary to use the PR curve after the model selection process, for threshold selection. Most classifiers can be configured to emit an equivalent of a probability value for each of the predictions. The probability value can further be thresholded to refine the prediction quality.
The steps in PR curve formation are summarized in the list below, followed by a small code sketch. An illustration of the PR curve for a decision tree on the moons data set is shown in (Figure 14).

• Perform model selection (using cross validation or other means)

• Configure the model to emit an output score (such as probability score)

• For each threshold on the score, determine confusion matrix values

• Compute precision and recall scores

• Plot them on a 2D plot of PR curve

• Each threshold becomes a point on the PR curve; these points are all connected to form a continuous looking shape

• Pick the threshold that gives the highest precision and recall scores among others
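A minimal sketch of this procedure with scikit-learn, assuming the decision tree and moons data set mentioned above, is given below.

```python
# PR curve construction: sweep thresholds over the classifier's scores.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)
scores = clf.predict_proba(X)[:, 1]              # probability of the positive class
precision, recall, thresholds = precision_recall_curve(y, scores)
plt.plot(recall, precision)
plt.xlabel("Recall"); plt.ylabel("Precision")
plt.show()
```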

6.11.3 ROC curve

The Receiver Operating Characteristic (ROC) curve is also a 2D plot, between the Recall and Fall-out scores of a classifier. Fall-out corresponds to how many of the actual negatives are leaking in as false positives; it is the ratio of false positives to the total number of actual negatives. Recall corresponds to how many of the actual positives are retrieved as true positives by the classifier. We want the classifier to retrieve correctly while not allowing leakage from the negative lot.
The process of constructing the curve is similar to PR curve construction, although here Recall is on the Y-axis and Fall-out on the X-axis. An illustration of the ROC curve is shown in (Figure 15) for a decision tree classifier on the moons data set.
A comparison of different classifiers' behaviour with respect to the shapes of the input data points is shown in (Figure 19). We need to note that not all classifiers are suitable for a given problem at hand. The best methodology is assessed both by the judgement and reasoning of the engineer and by benchmarking studies involving PR curves and ROC curves.
The best scoring threshold value on a ROC curve is the point which gives the highest recall with the least fall-out. In an overall sense, to assess the quality of the chosen classification methodology and the optimal parameters selected, the ROC curve can be used to compute the best (recall, fall-out) point across diverse methodologies and their configurations. One such assessment is the area under the curve (AUC). This value is numerically computed by partitioning the ROC curve into vertical slices and summing up the areas of each of the slices. The higher the AUC score, the better the classifier.
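A minimal sketch of ROC and AUC computation with scikit-learn, under the same moons and decision tree assumptions as before, is given below.

```python
# ROC curve and AUC: recall (TPR) on Y, fall-out (FPR) on X.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
scores = DecisionTreeClassifier(max_depth=4).fit(X, y).predict_proba(X)[:, 1]
fpr, tpr, thresholds = roc_curve(y, scores)
print("AUC =", auc(fpr, tpr))   # numerically integrated area under the curve
```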

7 Practical considerations in model building


The supervised learning methods discussed so far may work well in ideal scenarios. However, real world data comes with a number of problems at various stages of the machine learning model building pipeline. Some of the concerns a practitioner would face include the following.

• Noise in the data [23, 24]

• Missing values [25]

• Class imbalance [26, 27]

• Data distribution changes [28, 29]

7.1 Noise in the data


The notion of noise in the data ranges from physical properties of measuring devices to semantic levels of human understanding. In this context, what counts as noise must be defined from one problem scenario to another. Any undesirable, unwanted or non-pattern content can be defined as noise.
The best way to eliminate noise is to preprocess the data. Preprocessing need not be a simple task; it can be an involved effort by a team of customers, data engineers, developers, scientists and management to identify and exclude noisy patterns to provide clean data.
The solution can even include engineering interventions, such as forcing color coding of certain entities in a computer vision task or sound proofing a recording room in a speech processing task. The cleaner the data, the easier the machine learning model is to build and maintain.
Sometimes the noise has a highly consistent pattern. In such cases, it is also common to build machine learning models to identify and eliminate the noise.
In summary there are 4 important aspects of noise.
In summary there are 4 important aspects of noise.

• Noise in the machine learning practitioner's world is typically defined at the semantic level

• The noise patterns may be strong or weak. If a pattern is strong, it is a good idea to eliminate it by creating rules

• Noise results in sensitivity of the accuracy with respect to slight perturbations in the input values. Sensitivity analysis should detect noise.

• Label corruption is a form of noise

7.2 Missing values


Missing values are a regular phenomenon in real world data collection [25]. However, a classifier needs values for all the columns that it takes as input. Any missing value needs to be filled in for the classifier or regression model to generate output. The process of coming up with a proposed value for a missing value is called missing value imputation. Some of the popular missing value imputation techniques are highlighted below, followed by a small code sketch.

• Numeric attributes - Mean, Median or Mode among values of that attribute in the data set

• Categorical entity - Mode value among values of that attribute in the data set

• Inferring from other data points based on notion of neighbors

• Inferring from other attributes of a given data point (conditional inference)

• Missing values may also be imputed based on adjacency of attributes or columns or adjacency of rows.

• Fill them with zero

• Fill them with constant positive value

• It may also be possible to remove the data points containing missing values (although this is rarely advisable)
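A small sketch of the mean and mode strategies using scikit-learn's SimpleImputer is given below; the tiny matrix with np.nan cells is illustrative.

```python
# Mean and mode (most frequent) missing value imputation.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
mean_imp = SimpleImputer(strategy="mean").fit_transform(X)           # numeric columns
mode_imp = SimpleImputer(strategy="most_frequent").fit_transform(X)  # categorical-style
print(mean_imp)
print(mode_imp)
```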

7.3 Class imbalance


The problem of class imbalance is the lack of availability of training data for a class that corresponds to a rare event, or for which labeling is highly costly. Rare events are peculiar to each domain and occur only sporadically. Some examples are fraudulent credit card transactions in the banking domain, the break down of an electrical generator in the power grid sector, or a disease condition in healthcare.
Due to the predominance of the abundant class of points, the classifier will be biased towards the majority class. Even a dummy predictor that outputs the majority class label for any input is going to be highly accurate, as the majority class dominates as in (Equation 56). The precision will also be high (Equation 57), because only the under represented class contributes to FP. However, the class specific recall for the under represented class is drastically lower.

\[ \lim_{TP \to \infty} \frac{TP + TN}{TP + TN + FP + FN} \to 1 \qquad (56) \]

\[ \lim_{TP \to \infty} \frac{TP}{TP + FP} \to 1 \quad (\because 0 \le FP \ll \infty) \qquad (57) \]
In order to overcome the problem of representation of data points from either of the classes, some useful techniques are as follows; a small code sketch follows the list.

• Over sample the under represented class (SMOTE)

• Under sample the over represented class

• Increase weight of the smaller class (class weights vector)

• Increase weights of the points in the smaller class

• Synthetic data augmentation

• Preprocess to exclude bulk of over represented data
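A sketch of two of these countermeasures is given below: class weights via scikit-learn's class_weight option, and SMOTE over-sampling, which here assumes the third-party imbalanced-learn package is installed.

```python
# Class weights and SMOTE over-sampling for an imbalanced problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE   # third-party: pip install imbalanced-learn

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X, y)  # up-weight the rare class
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)      # synthesize minority points
print(y.mean(), y_res.mean())   # minority fraction before and after resampling
```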

7.4 Model maintenance


Maintenance of a machine learning solution already deployed in production takes up a major part of the day to day activities of a data scientist [30, 31]. Unlike machine learning competitions and challenges, where a model is demonstrated on one data set and published, in an industry scenario the life cycle of a machine learning model starts after it is put into production use.
The nature of the data changes over time based on the stake holder entities involved in data generation. An example is an e-commerce website that suggests advertisements to users based on their search and access patterns. A multitude of factors play a role here, including seasonality, user behavioural changes or events, and current ongoing trends, among other factors. The events of life play a significant role in user behavioral patterns; for instance, consider a parent blessed with a baby: they start to search for infant items in the initial months, then child items and so on.
A model deployed in production does not perform when the production data hitting the model differs in nature from the test data on which it reported high accuracies during deployment. Such a scenario is commonly referred to as data distribution difference. The usual steps in case of data distribution difference are summarized as follows.

• Set up alarms to detect difference between model training and/or test data and production data

• Define metrics to calculate how similar or dissimilar a pair of data sets is

• Determine the data points on which the prediction is ambiguous and send them for labeling

• Retrain the model periodically or based on alarm

• As new knowledge gets acquired or new features are discovered, upgrade the model

• As constituent or dependent sub-model improves, retrain the model

8 Unsupervised Methods
The bulk of the data in the machine learning world is unlabelled. This is because the labeling process is costly, both in terms of data acquisition and error free manual annotation [32, 33]. When only a handful of labels are available, unsupervised methods provide a mechanism to propagate the labels to thousands of newer points. In case of supervised methods, the word 'learn' is quite understandable, as it means calculating an optimal mapping function from the input dimensional space to the output dimensions or 'label'. The word 'learn' is less intuitive in the unsupervised scenario. However, unsupervised learning corresponds to learning structure in the data. Based on how the structure is represented and on the learning process itself, there are a number of unsupervised learning methods, some of which are listed below.

• Clustering [34, 35, 36]

• Matrix Factorization [37, 38]

• Principal component analysis [39, 40, 41]

• Graphical methods [42, 43, 44, 45]

8.1 Clustering
Clustering is more of a process, a paradigm or a philosophy than a single method [34, 35]. It is a process of aggregating related entities together and segregating dissimilar entities from each other. The representation of aggregates, their properties and the quantification of the quality of aggregation and segregation differ from one algorithm to another. Each algorithm differs from the others in terms of its capabilities and limitations, expressed via the nature of the data that can be clustered, the characteristics of the algorithm itself, the output or resultant clusters and the representation of clusters.
Some of the popular algorithms include the following. Each algorithm comes with its own parameters, execution time and characteristic behaviour.

• K-Means

• Hierarchical Clustering

• Density based clustering

• Matrix Factorization

• Principal Component Analysis

8.1.1 K-Means

The K-Means algorithm [46, 47] determines disconnected blobs of data based on the notion of the centroid of a cluster, distances between points and centroids, and membership of points to centroids, as in (Equation 58). Though the formulation looks like determining all possible subsets, the actual implementation of the K-Means algorithm is much simpler in terms of time complexity and converges much faster than exponential time. The iterative algorithm is given in (Algorithm 15).
There are diverse variations of the K-Means algorithm, such as K-Medoids, Fuzzy K-Means and others. However, in all of these algorithms the spirit of the formulation is maintained. One very different approach is to formulate K-Means as a gradient descent algorithm as in Equation 59. The gradient descent formulation of K-Means converges much faster than the iterative data point membership based approach and is suitable for online learning as well, as the centroid update corresponds to a weight update (Equation 60).


\[ \chi^* = \arg\min_{\chi'} \sum_{i=1}^{k} \sum_{x \in \chi'[i].set} (x - \chi'[i].cntr)^2 \qquad (58) \]

Algorithm 15 K-Means
Require: X /*d-dimensional data*/, k /*number of clusters*/
1: (∀c ∈ [1..k]) : χ[c].cntr = RAND(d, 1) /*k random centroids, each of d dimensions*/
2:
3: for iter = 1 : N do
4:    (∀c ∈ [1..k]) : χ[c].set = {}
5:    for x ∈ X do
6:        c = arg min_{c∈[1..k]} DIST(χ[c].cntr, x)
7:
8:        χ[c].set = χ[c].set ∪ {x}
9:    end for
10:   (∀c ∈ [1..k]) : χ[c].cntr = MEAN(χ[c].set)
11: end for
12: return χ


\[ w^* = [w_1 .. w_k]^* = \arg\min_{[w_1 .. w_k]} \sum_{i=1}^{k} \sum_{x \in X} (x - w_i)^2 \qquad (59) \]

\[ w^{(i+1)} = w^{(i)} - \alpha \cdot \frac{\partial \left( \sum_{i=1}^{k} \sum_{x \in X} (x - w_i)^2 \right)}{\partial w} \Bigg|_{w = w^{(i)}} \qquad (60) \]
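A minimal numpy sketch of the iterative Algorithm 15 is given below; the blob data, iteration count and seed are illustrative assumptions, and empty clusters are not handled.

```python
# K-Means: alternate nearest-centroid assignment and centroid recomputation.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from the data
    for _ in range(iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                  # membership: nearest centroid
        centroids = np.array([X[labels == c].mean(axis=0)   # recompute centers;
                              for c in range(k)])           # empty clusters unhandled
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
centroids, labels = kmeans(X, 2)
print(centroids)
```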

8.1.2 Hierarchical Clustering

Hierarchical clustering [48, 49] is among the most commonly used techniques, especially in biological data analysis, including evolutionary characteristics of gene and protein sequences. In this form, the clusters are visualized as dendrograms, which are essentially tree representations of the points based on similarity or dissimilarity metrics. Distance scores are defined between every pair of points. The distance metric is highly customizable, capturing any notion of dissimilarity; examples include the euclidean measure, the manhattan measure, or the negative of a similarity score.
Once distances are defined between every pair of points, the clustering algorithm proceeds by iteratively identifying sub clusters. The procedure can be top down or bottom up. In the top down process, at the beginning all points are assumed to be in a single cluster. As the algorithm proceeds, the single cluster is split into two or more parts. However, such divisive clustering has to examine an exponential number of subsets to determine where to split. This problem is mitigated by the bottom up approach. In this formulation, at the beginning all points are assigned to individual clusters. The clusters are then merged to produce the next level, by defining a notion of cluster to cluster distance. The algorithm for bottom up hierarchical clustering is shown in (Algorithm 16).

Algorithm 16 Bottom Up Hierarchical Clustering

Require: X /*data set*/, k /*number of clusters*/
1: /*initialize*/
2: χ = {}
3: for i = 1 : |X| do
4:    Ci = {xi}
5:    χ = χ ∪ {Ci}
6: end for
7: while |χ| > k do
8:    (i∗, j∗) = arg min_{i,j∈[1..|χ|] : i≠j} DIST(Ci, Cj)
9:    χ = (χ − {Ci∗, Cj∗}) ∪ {Ci∗ ∪ Cj∗} /*merge the closest pair of clusters*/
10: end while
11: return χ

A dendrogram is shown in (Figure 21), where a set of points was generated using the make_blobs utility of the scikit-learn library. The points are then clustered using agglomerative clustering and the dendrogram of pairwise distances is plotted; a small code sketch follows.
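A minimal sketch of that experiment, assuming make_blobs and scipy's hierarchy utilities, is given below.

```python
# Bottom-up (agglomerative) clustering visualized as a dendrogram.
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)
Z = linkage(X, method="average")   # euclidean pairwise distances, average linkage
dendrogram(Z)                      # tree of iterative merges
plt.show()
```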

8.1.3 Density Based Clustering

The density based clustering algorithm DBSCAN [36] is based on recursively determining chains of dense points satisfying radius and density criteria. The algorithm is useful in determining arbitrarily shaped clusters (Algorithm 17). However, the algorithm is mildly sensitive to the starting point, resulting in slight ambiguity at the cluster boundaries, although these issues are negligible when dealing with high volumes of data. The running time is linear in the number of data points.

8.2 Comparison of clustering algorithms over data sets


Each clustering algorithm performs well on some data sets and poorly on others. For instance, k-means works well on globular data but does not work well on arbitrarily shaped clusters. DBSCAN works on arbitrarily shaped clusters; however, the speed of execution becomes an issue. An illustration comparing the clustering algorithms discussed in this section on some known synthetic data sets is shown in (Figure 20).

Algorithm 17 DBSCAN
Require: X /*data set*/, η /*minimum number of points within the radius*/, ρ /*radius*/
1: /*initialize*/
2: χ = [] /*list of clusters*/
3: Γ = {} /*dense points*/
4: for all x ∈ X do
5:    if |{x′ ∈ X : |x − x′| ≤ ρ}| ≥ η then
6:        Γ = Γ ∪ {x}
7:    end if
8: end for
9: while |Γ| > 0 do
10:   C = {x} for some x ∈ Γ /*cluster starting from some dense point*/
11:   Γ′ = {} /*gather all dense points of the cluster, iteratively*/
12:   while |C − Γ′| > 0 do
13:       Γ′ = C
14:       C = {x′ ∈ Γ : (∃x ∈ Γ′) |x′ − x| ≤ ρ} /*identify any other reachable dense points*/
15:   end while
16:   C = {x ∈ X : (∃x′ ∈ Γ′) |x − x′| ≤ ρ} /*determine all member points*/
17:   χ = χ ∪ [C] /*append to the list of clusters*/
18:   Γ = Γ − Γ′ /*focus on the remaining dense points*/
19: end while
20: return χ
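A small sketch of density based clustering with scikit-learn's DBSCAN is given below; eps plays the role of the radius ρ and min_samples that of η, with illustrative values.

```python
# DBSCAN on an arbitrarily shaped (moons) data set.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # -1 marks noise points
print(set(labels))
```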

8.3 Matrix Factorization
Matrix factorization techniques [37, 38] are commonly considered when dealing with recommendation systems. The scenario corresponds to a group of m entities mapping to a group of n other entities. A very common example is movie recommendation for users. In this case the task is to determine proximity and distance between a movie and a user as if both were the same type of entity. In order to obtain such a representation, where movies and users can be compared and matched, both need to be cast as vectors of identical dimension.
Consider a matrix Xm×n corresponding to m users and n movies. The actual content of the matrix can carry any semantics, such as rankings. We need a representation of each user and movie as a k dimensional vector. One method, based on singular value decomposition, is illustrated below. Along with regularization, the formulation is as in (Equation 61).

\[ X_{m \times n} = A_{m \times k} \times B_{k \times n} \]

\[ A^*, B^* = \arg\min_{A,B} |X - AB|^2 \]

\[ A^*, B^* = \arg\min_{A,B} |X - AB|^2 + |A| + |B| \qquad (61) \]

Matrix factorization applied to the Olivetti faces, with the non-negative components of the reconstructed matrices plotted, is shown in (Figure ??); a small code sketch follows.
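A plausible sketch of that experiment, using non-negative matrix factorization from scikit-learn on the Olivetti faces (the call downloads the data set), is given below; the number of components is an arbitrary choice.

```python
# X ~ A x B on the Olivetti faces via non-negative matrix factorization.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

X = fetch_olivetti_faces(shuffle=True, random_state=0).data   # (400, 4096)
model = NMF(n_components=16, init="nndsvd", max_iter=400)
A = model.fit_transform(X)    # A: (400 x 16) per-face coefficients
B = model.components_         # B: (16 x 4096); rows can be displayed as images
print(A.shape, B.shape)
```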

8.4 Principal component Analysis


Principal component Analysis (PCA) [40, 39, 41] is a popular technique in the machine learning community for dimensionality reduction. The idea is to calculate a handful of vectors from the input data which are representative of the whole data. Any new input can now be represented simply by its dot products with the representative vectors. The effect is that the input dimensionality reduces drastically.
Consider a hypothetical example of a data set of one megapixel grey scale photographs of human faces, each of width and height 1000 × 1000. Each and every pixel is a feature, so the input image is a vector of 10^6 dimensions. Assume one is able to identify some 10 vectors (each of input dimensionality, i.e. 10^6) representative of the eyes, ears, mouth, nose and head regions. Now the input image can be described by its dot product with each of these 10 vectors, resulting in just 10 numbers. This is equivalent to transforming 10^6 dimensions into 10 dimensions. How to automatically identify the representative vectors is the problem addressed by PCA. Consider a data set X_{N×d} of N data points, each having d dimensions. The eigen vectors of the correlation matrix give the directions of maximal spread of the data points. The top k eigen values and their corresponding eigen vectors are selected, and any input is feature transformed into the k dimensional space (Equation 62).

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} X[i] \]

\[ (\forall i \in [1..N]) : X[i] = X[i] - \bar{x} \]

\[ ([\lambda_1, \ldots, \lambda_d], [v_1, \ldots, v_d]) = eig(X^T \times X) \quad [(\forall i \le j \in [1..d]) : \lambda_i \ge \lambda_j] \]

\[ x_{d \times 1} \vdash x'_{k \times 1} = (x \cdot v_1, \ldots, x \cdot v_k) \qquad (62) \]

The eigen value detection is carried out through numerical methods, of which the Golub and Kahan [50] method is the most widely used. The procedure is to employ Givens row and column transformations repeatedly to construct a bi-diagonal matrix. There are various variants of PCA, such as the Kernel, Sparse, Truncated and Incremental forms [51, 52].
Application of the PCA algorithm to the face recognition problem is shown in (Figure 22). The eigen vectors are of the same dimension as the input image. These vectors are scaled between 0 and 255 and displayed back as images to inspect which pixels constitute the significant values of the eigen vectors. Application of PCA for detection of the prominent direction in a point cloud, originating from measurement of a mechanical part, is shown in (Figure 23). The axis detection problem has significance in mechanical part quality assessment scenarios.
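A minimal numpy sketch of Equation 62 is given below; the random data is illustrative.

```python
# PCA: center, eigen-decompose X^T X, project onto the top-k eigen vectors.
import numpy as np

def pca_transform(X, k):
    Xc = X - X.mean(axis=0)                       # center about the average
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # symmetric eigen decomposition
    order = np.argsort(eigvals)[::-1]             # sort eigen values descending
    V = eigvecs[:, order[:k]]                     # top-k eigen vectors
    return Xc @ V                                 # k-dimensional embedding

X = np.random.randn(200, 10)
print(pca_transform(X, 2).shape)   # (200, 2)
```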

8.5 Understanding the SVD algorithm


SVD stands for Singular Value Decomposition [53, 54, 55], a numerical iterative method for matrix factorization. The algorithm is used for determining eigen values, the rank of a matrix and the null space, and this method forms the crux of the Principal Component Analysis discussed earlier in the chapter. The state of the art SVD algorithm involves QR Decomposition, Householder transformations and Givens matrix rotations.
The following are some of the key concepts involved in understanding SVD algorithm.

• Solving system of linear equations and LU Decomposition

• Householder transformation

• Givens matrix rotations

• Orthonormal matrices

• Eigen vectors and eigen values detection

8.5.1 LU Decomposition

LU Decomposition stands for Lower Upper triangular decomposition. The triangular matrices speed up the process of back substitution in root finding.
For instance, consider the problem of solving the following system of linear equations.

     
\[ \left(A = \begin{bmatrix} 10 & 20 & 30 \\ 3 & 30 & 45 \\ 5 & 22 & 54 \end{bmatrix}\right) \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \left(b = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix}\right) \]
The Gaussian elimination process reduces the A matrix to a triangular matrix and then back substitutes values to determine x. It uses the augmented matrix to simultaneously affect the b values.

\[ \begin{bmatrix} 10 & 20 & 30 &:& 140 \\ 3 & 30 & 45 &:& 198 \\ 5 & 22 & 54 &:& 211 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 2 & 3 &:& 14 \\ 3 & 30 & 45 &:& 198 \\ 5 & 22 & 54 &:& 211 \end{bmatrix} \dashrightarrow \begin{bmatrix} 1 & 2 & 3 &:& 14 \\ 0 & 24 & 36 &:& 156 \\ 0 & 12 & 39 &:& 141 \end{bmatrix} \dashrightarrow \begin{bmatrix} 1 & 2 & 3 &:& 14 \\ 0 & 1 & 0 &:& 2 \\ 0 & 0 & 1 &:& 3 \end{bmatrix} \Rightarrow \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \]

The back substitution process is the most attractive aspect of the triangular form of matrices. However, in order to solve the system of linear equations every time a new b is input, the row operations would need to be replayed. A more efficient approach is to remember the effect of the triangularization process, and the LU decomposition algorithm precisely addresses this problem.
The LU decomposition is a Lower Upper triangular factorization of a square matrix.

\[ A_{m \times m} = L_{m \times m} \times U_{m \times m} \]

\[ \begin{bmatrix} 10 & 20 & 30 \\ 3 & 30 & 45 \\ 5 & 22 & 54 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \times \begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix} \]

Now solving for the x vector can be accomplished in two stages, as below.

\[ Ax = b \Rightarrow L \times (U \times x) = b \]

The steps for solving the system of linear equations are:

1. Let y = Ux

2. First solve Ly = b using forward substitution ⇒ y

3. Then solve Ux = y using back substitution ⇒ x

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \times \begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix} \]

Solving for the L side of the equation,

\[ \begin{bmatrix} 1 & 0 & 0 \\ 0.3 & 1 & 0 \\ 0.5 & 0.5 & 1 \end{bmatrix} \times \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 198 \\ 211 \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 156 \\ 63 \end{bmatrix} \]

Solving for the U side of the equation,

\[ \begin{bmatrix} 10 & 20 & 30 \\ 0 & 24 & 36 \\ 0 & 0 & 21 \end{bmatrix} \times \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 140 \\ 156 \\ 63 \end{bmatrix} \rightarrow \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \]

The LU decomposition has simplified solving systems of linear equations. There are a number of algorithms to accomplish this factorization, including:

• Doolittle’s algorithm

• Crout’s algorithm

• With full pivoting

• With partial pivoting

We present in (Algorithm 18) a simplified LU decomposition procedure based on Doolittle's algorithm.
Consider a given square matrix An×n that needs to be LU factorized. Let us denote each element of A in typical computer programming style notation, A[i][j].
We will deduce a factorization A = L × U, where L is lower triangular with L[i][j] = 0 (∀j > i) and L[i][i] = 1, and U is an upper triangular matrix such that U[i][j] = 0 (∀i > j).

   
\[ A = \begin{bmatrix} 1 & 0 & \ldots & & \\ L[2][1] & 1 & 0 & \ldots & \\ L[3][1] & L[3][2] & 1 & 0 & \ldots \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ L[n][1] & L[n][2] & \ldots & L[n][n-1] & 1 \end{bmatrix} \times \begin{bmatrix} U[1][1] & U[1][2] & \ldots & U[1][n-1] & U[1][n] \\ 0 & U[2][2] & U[2][3] & \ldots & \\ 0 & 0 & U[3][3] & U[3][4] & \ldots \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & U[n][n] \end{bmatrix} \]

Algorithm 18 LU Decomposition (Doolittle)
1: (∀i ∈ [1 . . . n]) : U[1][i] = A[1][i]
2: (∀i ∈ [1 . . . n]) : L[i][1] = A[i][1] / U[1][1], L[i][i] = 1
3: for i = 2 . . . n do
4:    for m = i . . . n do
5:        U[i][m] = A[i][m] − Σ_{k=1}^{i−1} L[i][k] ∗ U[k][m]
6:    end for
7:    for m = i + 1 . . . n do
8:        L[m][i] = (A[m][i] − Σ_{k=1}^{i−1} L[m][k] ∗ U[k][i]) / U[i][i]
9:    end for
10: end for
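The worked example above can be checked with scipy's LU routine, as in the sketch below; note that scipy applies partial pivoting, so a permutation matrix P also appears.

```python
# Checking the LU factorization of the worked example.
import numpy as np
from scipy.linalg import lu

A = np.array([[10.0, 20, 30], [3, 30, 45], [5, 22, 54]])
P, L, U = lu(A)                     # partial pivoting: A = P L U
print(np.allclose(P @ L @ U, A))    # True
b = np.array([140.0, 198, 211])
y = np.linalg.solve(L, P.T @ b)     # forward substitution stage
x = np.linalg.solve(U, y)           # back substitution stage
print(x)                            # [1. 2. 3.]
```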

However, when it comes to calculation of the eigen vectors of a given matrix, in applications such as PCA, a more desirable factorization is the QR decomposition, where Q is an ortho-normal matrix and R is an upper triangular matrix. The algorithms accomplishing such a factorization are more involved than the LU decomposition methods.

8.5.2 QR Decomposition

In LU decomposition, the L and U matrices are not designed to be orthogonal. Requiring the first factor to be orthogonal helps in devising the SVD algorithm. In order to factorize a matrix as A = Q × R, where Q is an orthogonal matrix, the QR factorization algorithm is used. There are multiple ways of performing this decomposition, including:

• Gram-Schmidt method

• Householder reflection based method

• Givens rotations based method

The QR decomposition algorithm based on Givens rotations is presented in (Algorithm 19).
Recall the 2 × 2 rotation matrix, whose first row corresponds to the X axis and second row to the Y axis. In order to rotate any given point by an angle θ in the X-Y plane, the rotation matrix is as below.

\[ R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \]

Applying this matrix to any other matrix rotates all of its column vectors in the plane by θ: R_{2×2}(θ) ∗ A_{2×m} = A^{rotated}_{2×m}.
The Givens rotation matrix is a generalization of the rotation matrix to a high dimensional space. Consider an identity matrix in which each row is a vector. In order to convert it to a rotation matrix that rotates any given vector in the i-j plane, i.e. from the ith towards the jth dimension by an angle θ, the matrix is constructed as follows. Start from the identity: R_{n×n}[i][i] = 1 (∀i ∈ [1 . . . n]) ∧ R[i][j] = 0 (∀i ≠ j). Then set up the rotation elements: R[i][i] = cos(θ), R[i][j] = −sin(θ), R[j][i] = sin(θ), R[j][j] = cos(θ).

 
\[ G(i, j, \theta) = \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \cos(\theta) & \ldots & -\sin(\theta) & & \\ & & \vdots & \ddots & \vdots & & \\ & & \sin(\theta) & \ldots & \cos(\theta) & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix} \]

Here the cos(θ) entries sit at positions (i, i) and (j, j), the −sin(θ) entry at (i, j) and the sin(θ) entry at (j, i); all other diagonal entries are 1 and the remaining entries are 0.

This matrix, when applied to another matrix, rotates the ith and jth coordinates of each column vector in the i-jth plane, from the ith towards the jth dimension, by an angle θ.
We use this matrix to make zero selected cells of an input matrix A, by the operation

\[ A^{modified} = G(i, j, \theta) \ast A \]

The angle θ is set such that

\[ \theta^* \leftarrow \sin(\theta) \ast A[i][i] + \cos(\theta) \ast A[j][i] = 0 \]

Such a θ∗ will make

\[ A^{modified} = G(i, j, \theta^*) \ast A \rightarrow A^{modified}[j][i] = 0 \]

which is the same as making zero a selected cell of the matrix.
Let us denote the operation of making zero the (j, i)th cell of a matrix A using a Givens rotation by the operator Z(j, i, A), which internally constitutes two steps - (i) selecting θ∗ and (ii) applying the Givens rotation matrix.
A sequence of Givens rotations on a matrix A can convert it to an upper triangular matrix as in (Algorithm 19). Note that the Q matrix remains ortho-normal after the series of multiplications in the iterations, being a product of rotations. The R matrix is an upper triangular matrix.

Algorithm 19 QR Decomposition - Simplified pseudocode
1: Input: A
2: Q = I, R = A
3: for i = n : 2 do
4:    for j = i − 1 : 1 do
5:        G = Z(i, j, R) /*Givens rotation that zeroes R[i][j]*/
6:        R = G ∗ R
7:        Q = Q ∗ G^T /*accumulate the inverse rotations so that A = Q × R*/
8:    end for
9: end for
10: return (Q, R)
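A minimal numpy sketch of Givens based QR, in the spirit of Algorithm 19 though with the loop order and signs chosen here for concreteness, is given below.

```python
# Givens-rotation QR: each rotation zeroes one sub-diagonal cell of R.
import numpy as np

def givens_qr(A):
    m, n = A.shape
    Q, R = np.eye(m), A.astype(float).copy()
    for j in range(n):                        # zero column j below the diagonal
        for i in range(m - 1, j, -1):
            if R[i, j] == 0.0:
                continue                      # already zero, no rotation needed
            r = np.hypot(R[j, j], R[i, j])
            c, s = R[j, j] / r, -R[i, j] / r  # chosen so s*R[j,j] + c*R[i,j] = 0
            G = np.eye(m)
            G[j, j] = G[i, i] = c
            G[j, i], G[i, j] = -s, s
            R = G @ R                         # apply the rotation
            Q = Q @ G.T                       # accumulate so that A = Q R
    return Q, R

A = np.random.randn(4, 4)
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))
```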

The Singular Value Decomposition (SVD) algorithm makes use of QR decomposition and Givens rotations to produce a factorization of a non-square matrix A_{m×n}.
Let A = UΣV^T be the SVD decomposition, where U_{m×m}, Σ_{m×n} and V_{n×n} are the factor matrices. Both U and V are ortho-normal matrices. The Σ matrix is a diagonal matrix, i.e. all elements (∀i ≠ j) : Σ[i][j] = 0. The diagonal elements of Σ are such that Σ[i][i] ≥ Σ[j][j] (∀i ≤ j). If we denote Σ[i][i] by σi, the non-zero diagonal elements are σ1 . . . σk, where k ≤ min{m, n}.
There are several implementations of the SVD algorithm, including the list below. We show a simplified pseudocode for ease of understanding for a beginner reader (Algorithm 20). This procedure is based on the fact that upper triangulation of a lower triangular matrix results in a diagonal matrix.

• Iterative Householder matrix transformations

• Golub-Reinsch algorithm

• Golub-Kahan algorithm

• Bi-diagonal algorithm

• Demmel-Kahan algorithm

• Numerical methods based algorithm

• Jacobi Rotation algorithm

There are a number of very important applications of SVD factorization including the following list.

• Eigen value computation

• Computing pseudo inverse of a matrix

• Principal Component Analysis

• Clustering problems

• Multi Dimensional Scaling

• Low rank approximations of matrices

Algorithm 20 SVD Algorithm - Simplified pseudocode
1: Input: A_{m×n} matrix.
2: Û_{m×m}, Z_{m×n} = QR(A) //QR factorization of A
3: ⇒ A = Û × Z
4: Note that Z is an upper triangular matrix
5: Consider Z^T_{n×m}
6: V_{n×n}, D_{n×m} = QR(Z^T) //QR factorization of Z^T
7: ⇒ Z^T = V × D
8: A = Û × (Z^T)^T = Û × (V × D)^T = Û × D^T × V^T
9: Note that D^T is a diagonal matrix
10: Now, D^T needs to be cast as Σ with diagonal element ordering
11: Let (∃P) : D^T = P × Σ
12: Let U = Û × P // to absorb the row permutations
13: Then we have A = U × Σ × V^T, as required by the SVD factorization
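In practice the factorization is obtained from a library routine; the numpy sketch below checks the shapes and the reconstruction A = UΣV^T on a random matrix.

```python
# SVD of a non-square matrix and reconstruction check.
import numpy as np

A = np.random.randn(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # s holds the singular values
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)              # embed diag(s) into (5 x 3)
print(np.allclose(U @ Sigma @ Vt, A), s)          # True; s is sorted descending
```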

8.6 Data Visualization


Data visualization [56] is a fundamental need for a data science practitioner. There are effectively two types of visualization in data science - (i) metric plots and (ii) data distribution plots. While metric plots are a routine aspect of the every day life of a practitioner, the data visualization algorithms are only a handful.
Examples of some of the metric based visualizations are listed below.

• Correlation scatter plots

• Precision-Recall and ROC curves

• Validation curves about metric of choice versus data set or parameters

• Confusion matrix

• Kullback-Leibler divergence score between data sets

• Time series plots about metric value of interest

Though metric visualization suffices in most cases, data visualization helps in getting an intuition about the data and in devising better methods or features. Some examples of data visualization approaches are listed below.

• Network graph visualization of connections between data points

• Dendrogram of clustered data points

• Multi Dimensional Scaling (MDS)

• Student’s t-distributed Stochastic Neighborhood Embedding (tSNE)

• PCA based dimensionality transformation

In this chapter let us look at the data point visualization strategies - MDS, tSNE and PCA based
visualization.

8.6.1 Multi Dimensional Scaling

The MDS algorithm [57] works by projecting higher dimensional data onto a lower dimensional space subject to distance constraints. The key idea behind this algorithm is to deduce a topological encoding of the points by requiring all pairwise distances between points to be conserved across the dimensions.
The formulation is based on pairwise distances between points. Consider all points centered about their average, and consider all input points to be unit vectors in the D dimensional space. Given two points u and v, the squared euclidean distance between them is then

\[ ||u - v||^2 = 2(1 - u \cdot v) \]

Let the input set of points be X, where X[i] denotes the ith point of the set. Now assume an isomorphic mapping between the points in the D dimensional space and a lower dimensional space (typically 2 dimensional). Let the points in the lower dimensional space be Y, such that X[i] has a counterpart Y[i]. Now,

\[ (\forall i, j) : ||X[i] - X[j]|| \approx ||Y[i] - Y[j]|| \]

The distances can be stated in terms of dot products as,

\begin{align*}
d^X_{ij} &= ||X[i] - X[j]||^2 = 2(1 - X[i] \cdot X[j]) \\
d^Y_{ij} &= ||Y[i] - Y[j]||^2 = 2(1 - Y[i] \cdot Y[j]) \\
D^X &= [d^X_{ij}]_{\forall i,j} = 2J - 2(X^T X) \\
D^Y &= [d^Y_{ij}]_{\forall i,j} = 2J - 2(Y^T Y)
\end{align*}

where J is the all-ones matrix. We need to minimize the difference between the distance matrices,

\[ Y^* = \arg\min_{Y} ||D^X - D^Y||^2 \]

A possible modification is to consider a kernel function in place of the plain dot product,

\[ X[i] \cdot X[j] \vdash K(X[i], X[j]) \]

where K(·, ·) is a kernel function.


A major drawback of this method is that it is not incremental: when a new data point gets added, the whole Y needs to be recomputed. Another issue is that the method is sensitive to noise. These issues limit the practical applicability of the method in industry scale use cases.
Visualization of multi dimensional scaling of 20 random points is shown in (Figure 24). The original points and the reconstructed points are shown in orange and green colors. The plot also shows a variant of MDS, called non-metric MDS, which enforces only the ordering of pairwise distances.
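A minimal sketch of metric and non-metric MDS with scikit-learn is given below; the 20 random 5-dimensional points are an illustrative stand-in for the figure's data.

```python
# Metric and non-metric MDS embeddings into 2 dimensions.
import numpy as np
from sklearn.manifold import MDS

X = np.random.RandomState(0).rand(20, 5)
Y_metric = MDS(n_components=2, random_state=0).fit_transform(X)
Y_nonmetric = MDS(n_components=2, metric=False,   # preserve only distance ordering
                  random_state=0).fit_transform(X)
print(Y_metric.shape, Y_nonmetric.shape)
```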

8.6.2 tSNE

In this method [58], the pairwise distance between points attains a different interpretation, as points selecting their neighbors. The tSNE algorithm is much more successful than MDS and has been used in a number of domains for data visualization. This method provides an impressive supplementary perspective for commenting on the nature of the data in slow paced scenarios such as genomics research.

For a given point X[i], the tSNE method considers the distance to another point X[j] as proportional to the probability that the ith point would select the jth as its neighbor. The probability that the jth point is selected as a neighbor, normalized over all other points being selected as neighbors, is given by P[i][j] as below.

\[ P[i][j] = \frac{e^{-(d^X_{ij})^2 / (2 \sigma_i^2)}}{\sum_{k} e^{-(d^X_{ik})^2 / (2 \sigma_i^2)}} \]

For the corresponding points in the other dimensional space Y (typically reduced to 2 dimensions), the equivalent definition of the probability of picking the jth point as a neighbor of the ith point is given by,

\[ Q[i][j] = \frac{(1 + ||Y[i] - Y[j]||^2)^{-1}}{\sum_{k} (1 + ||Y[i] - Y[k]||^2)^{-1}} \]

The difference between these two distributions is the error, defined as below.

\[ error(Y) = KL(P || Q) = \sum_{i \ne j} P[i][j] \log\left(\frac{P[i][j]}{Q[i][j]}\right) \]

The values of the elements of the matrix Y are determined by gradient descent on the error(Y) function.
The "t" in the method's name corresponds to the distribution of Q[·][·] being a heavy tailed Student's t-distribution.
The method has been successfully demonstrated on diverse data sets including MNIST digits, Olivetti faces, Netflix, ImageNet and other popular data sets.
However, one drawback of this method is that it is not applicable in an incremental scenario where new data points are added to the previous ones. Moreover, multiple executions of the method on the same data set, in a batch-wise mode, would result in different Y values and hence different layouts of the points. The method is also prone to the curse of dimensionality, in which case, when the input dimensions are large, the P[·][·] values are all similar and small. The method is applicable in scenarios of adding another perspective on a given data set, where the data set is more or less mature and is expected to be static. For industry scenarios of seasonal data and interpretation of the cause of errors of machine learning models, it is still an active area of research. A small code sketch is given below.
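The sketch below runs tSNE with scikit-learn on the digits data set; the perplexity value, which controls the effective neighborhood size, is an illustrative choice.

```python
# tSNE embedding of the digits data set into 2 dimensions.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(Y.shape)   # (1797, 2); gradient descent minimized KL(P || Q)
```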

8.6.3 PCA based visualization

A simple alternative to the MDS and tSNE algorithms is to plot each data point in a lower dimensional space based on the eigen vectors of the data set. For a given set of points in D dimensional space, the PCA algorithm outputs D eigen vectors.
Consider the top 2 eigen values by magnitude, λ1, λ2, and their corresponding eigen vectors v̂1, v̂2. For each point u ∈ X, compute (u · v̂1, u · v̂2) as a 2 dimensional embedding and plot it for visualization.
The major limitation of this approach is that non-linear relationships between features are not captured by the visualization, as standard PCA is a linear algorithm. Though other variants such as kernel PCA may be applicable, the visualization is still biased by the majority distribution in the data set.

8.6.4 Research directions

A visualization algorithm is closely tied to the vectorization of the data points. Vectorization means the features that determine a data point. The features themselves may be engineered or automatically learned, for example using deep learning methodologies such as an auto encoder, the earlier layers of a convolutional neural network or the hidden vector of a recurrent neural network. A good quality vectorization separates data points into blobs in high dimensional space based on the true clusters. Consequently, a low quality vectorization cannot separate data points into clean blobs, and natural clusters overlap, leading to ambiguity in predictions. A necessary condition for clear separation of data in the low dimensional space being visualized is high quality of the vectorization and features of the input points. Typically the distance between points in the low dimensional space is less informative and may lead to false conclusions if not backed up by quantitative metrics.
There is scope for assessing points which regress over time as the model is trained on new seasonal data. There is also a need for visualization to be incremental as new data flows in.

9 Graphical Methods
Graphical models [42] in the machine learning world are widely associated with a minor subset of graph based models called Bayesian networks [59, 60]. However, the general concept of using a graph abstraction goes much beyond the probabilistic framework, and several state of the art practical systems are not constrained by probabilistic framework requirements. For instance, a very common and widely used concept such as a flow chart is a graphical model of control flow. In case of distributed computing, the message passing framework among nodes is a graphical formulation. There is a wide variety of practical systems that employ graph based abstractions, such as page rank, social networking and cellular automata, among several others.
In this section we focus on Bayesian networks, which impose a probabilistic framework on the rows and columns of a data matrix. This abstraction helps in human interpretability of data in terms of relations between the columns, which is most desirable in the machine learning world. More precisely, if the cause and effect relations in the data [43, 44] are captured, they can be used to carefully introduce interventions as required. In order to understand the cause and effect relationships in data, we need to model the data generating process and consider the given data set as a sampling from that process. Given the snapshot of sampling from a hypothetical generating process, the task then remains to assess the right parameters of the process. Once the right parameters are determined, the generating model can be used to synthetically generate data as required.
In this approach of understanding cause and effect relationships between data items, the features are interconnected. One feature determines another; the measured value of a feature is determined by the values of other features. For instance, shadow regions in an image correspond to paths occluded from a bright light region. A sequence of previous frames in a video determines the content of the next video frame. The cause and effect relations may be between data in two time frames (temporal) or between two features within a data point (spatial). Another example, in the context of recommendation systems, is a woman buying several baby products because she has recently been blessed with a baby.
A graphical model is represented by a graph of nodes and edges. The nodes correspond to features or
hidden concepts. The nodes are also called states. The edges may be directed or undirected and are often
weighted to indicate the probability of transition between a pair of nodes. A typical graphical model imposes
certain restrictions on the topology of the graph and the edge weights.
There are two types of nodes - (i) observed nodes and (ii) hidden nodes. The observed nodes correspond to the data set. The hidden nodes correspond to the hypothesis of the generative process. It is not necessary that hidden nodes be present; however, observed nodes are compulsory.
The edges in the graph are of two types - (i) transitions between hidden states and (ii) connections from hidden states to observed states. The latter are called emissions. The edge weights are probabilities of transitions between nodes. The observed states may be implicit in case of modeling real valued outputs. In case of the observations being real valued data, there is a probability associated with each value, typically defined by a parameterized function.
The probabilities on the outgoing edges of a node sum up to 1. The probability values may be determined by a formula (parametric models) or be singleton values. The parameters of the formulae are determined by iterative algorithms. One algorithm that is used for estimating the parameter values is expectation maximization.
Some of the popular graphical algorithms include the following.

• Naive Bayes [61]

• Gaussian Mixture Model [62]

• Markov Model [63]

• Hidden Markov Model [64]

• Latent Dirichlet Allocation [65, 66]

Extended abstraction of graphical models: A wide variety of problems can be considered as graphical models by extrapolating the abstract concept of nodes to implicit nodes and implicit edges. The number of abstract nodes may be uncountably infinite when they represent real valued elements. It is not necessary that graphical models obey a probabilistic framework; examples include social network models [67], message passing models [45], petrinets [68] and cellular automata [69]. It is only a small, however highly impactful, subset of problems that can be posed as Bayesian networks, the class of graphical models that do obey a probabilistic framework.

9.1 Naive Bayes Algorithm


The Naive Bayes algorithm (NB) is a Bayesian graphical model that has nodes corresponding to each of the columns or features. It is called naive because it ignores the prior distribution of parameters and assumes independence of all features and of all rows. Ignoring the prior has both an advantage and a disadvantage. The advantage is that we can plug in any type of distribution over the individual features and learn the maximum likelihood parameters from the data; we need not restrict the class of prior distributions to the exponential family in order to simplify the algebra of the product of likelihood and prior. The disadvantage is that it is a maximum likelihood model; it does not improve the posterior iteratively. Despite these trade-offs, the NB method is still a probabilistic generative model, i.e. given parameters, one can synthetically generate data. The nodes emit values, which are the observed feature values. The values may be real valued for numeric attributes or a discrete set of symbols for categorical attributes. The label column itself corresponds to a node. The label may be a real valued quantity, as in the case of a regression problem, or a categorical type.
NB makes two primary assumptions - (i) all columns are independent of each other and only dependent on the label and (ii) all rows are independent of each other. Based on the nature of the attributes there are three major versions of the NB algorithm - (i) Bernoulli NB [61], (ii) Gaussian NB and (iii) Multinomial NB [70]. Though the names are different, the underlying formulation is common and generic, which we present below.

Let X denote the data set and x ∈ X denote a single data point. Let x[i] denote the ith column of the data element. Assuming there are n columns, i ∈ [1 . . . n]. Let x[L] denote the label value of the data element. Let Ci denote the random variable for the ith column. Let P(Ci = x[i]|L = x[L]) denote the conditional probability of the ith column of a data point taking the value observed in the data point x, given its label value as observed. Let each of the columns Ci have a set of parameters, compactly denoted Θi.
Given that all columns are independent of each other, the conditional probability of a data point given its label value is modeled as,

\[ P(x | L = x[L]) = \prod_i P(C_i = x[i] | L = x[L]) \]


Every random variable (∀i) : Ci comes with its own parameters, based on the type of data observed and the modeling assumption. For example, if a column indicates coin tossing, the observed data is either heads or tails, and the modelling scientist chose the distribution to be of Bernoulli type, then the parameter of Ci would be λ, the probability of observing heads.
The machine learning task is to learn these parameters (∀i) Θi from the data. Let Θ = {Θ1 ∪ Θ2 ∪ · · · ∪ Θn}. The task of determining the right parameters is posed as maximizing the posterior, as in (Equation 63). The posterior is proportional to the product of P(X|Θ) (likelihood) and P(Θ) (prior).

\[ \Theta^* = \arg\max_{\Theta} P(\Theta | X) = \arg\max_{\Theta} P(X|\Theta) \cdot P(\Theta) \qquad (63) \]

In case of multi-class classification problems, the NB formulation ties up each Θi with the classes Γ = {x[L] : x ∈ X}, as in (Equation 64). In this equation, X[1 . . . L − 1] denotes all columns excluding the Lth column. All the rows of the Lth column are denoted by X[L].

\[ P(X|\Theta) = P(X[1 \ldots L-1], X[L] \,|\, \Theta) = P(X[1 \ldots L-1] \,|\, X[L], \Theta) \cdot P(X[L] \,|\, \Theta) \qquad (64) \]

In (Equation 64), the joint term over X[L] and Θ can be modeled by combining the Θi's with labels to generate a multitude of parameters. For instance, given a set of labels Γ and a set of parameters Θ, the number of parameters when combined with each and every label becomes |Γ| × |Θ|. Let us denote the combination of the parameters as,

\[ \Theta^L = \{\Theta_i^{\gamma} : \gamma \in \Gamma, i \in 1 \ldots L-1\} \]

Note: It is completely up to the person modeling the problem to choose which variables have which parameters and distributions. One can have the label parameterized as well.
Based on the independence of columns and rows, the equation can be simplified as in (Equation 65).

\begin{align*}
P(X|\Theta) &= \prod_{x \in X} P(x|\Theta) \\
&= \prod_{x \in X} \prod_{i \in [1 \ldots L-1]} P(C_i = x[i], L = x[L] \,|\, \Theta) \\
&= \prod_{x \in X} \prod_{i \in [1 \ldots L-1]} \left(P(C_i = x[i] \,|\, L = x[L], \Theta) \cdot P(L = x[L] \,|\, \Theta)\right)
\end{align*}

\[ P(X|\Theta) = \prod_x \prod_i P(C_i = x[i] \,|\, L = x[L], \Theta_i) \cdot P(L = x[L]) \qquad (65) \]

In case of multi-class classification problems, let Γ = {x[L] | x ∈ X} denote the set of all possible label values of the data items in the given data set. Then determining the conditional probability P(Ci = x[i]|L = x[L], Θi) requires creating a separate instance of the parameter set for each and every possible value of L, i.e. (∀γ ∈ Γ) : Θγi.
In case of a multi-class label having k labels and n columns, the total number of parameter sets is k × n. The algorithm for learning the parameters is given in (Algorithm 21). The probability of the data set given the parameters is,

\[ P(X \,|\, [\Theta_i^{\gamma}]_{i,\gamma}) = \prod_x \prod_i P(C_i = x[i] \,|\, \Theta_i^{x[L]}) \]

In prediction mode, the Naive Bayes algorithm determines the class whose posterior probability scores the best given the data.
The error function over the parameters is defined via log(P(·)),

\[ \Lambda([\Theta_i^{\gamma}]_{i,\gamma}) = \sum_x \sum_i \log(P(C_i = x[i] \,|\, \Theta_i^{x[L]})) \]

Algorithm 21 Naive Bayes - Multi-class

1: for γ ∈ Γ do
2:    for i ∈ [1 . . . n] do
3:        //In the usual NB formulations, the parameters can be determined analytically and very easily
4:        Θγi ← solution of ∂Λ(·)/∂Θγi = 0
5:    end for
6: end for

It is possible to derive other simplified variants, where the Θi are independent of the label classes γ ∈ Γ. The variants of the NB formulation for P(R = v|θ) are tabulated below, where R is the random variable in question (Ci) and θ are its parameters.

Parameters     Data value           Formulation type
θ = (µ, σ)     Real valued          Gaussian NB
θ = λ          x[i] in 2 classes    Bernoulli NB
θ = [λk]       x[i] in k classes    Multinomial NB
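A small sketch of the Gaussian variant with scikit-learn is given below; the iris data set is an illustrative choice, and theta_ holds the fitted per class, per column means.

```python
# Gaussian Naive Bayes: per class gamma and column i, Theta is (mu, sigma).
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)
print(nb.theta_.shape)     # (3 classes x 4 columns) fitted means
print(nb.predict(X[:3]))   # class with the highest posterior score
```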

9.2 Expectation Maximization


The Expectation Maximization (EM) technique [71, 72] is typically easy to understand on any one specific problem, but its general-purpose formulation can be confusing. In this section we clarify the way in which graphical models are dealt with and how the method's formulation generalizes.
In any engineering problem, parameters need to be learned from data. The learning may be done by humans or by automated methods; eventually the parameter values need to be determined.
In the case of graphical models, the parameters are probability distributions on each of the hidden and observed nodes which generate the observed data. A probability distribution is essentially a function that gives a score for each observed measurement, with constraints imposed on the scores by the axioms of probability. A cost function, expressed as a function of the parameters, is minimized over parameter settings using optimization algorithms.

In a probabilistic generative model, data can be sampled from the probability distribution. The probability of the current data set being sampled is computed, and a corresponding error function is defined so as to increase this probability over the parameters.
Assessing the true probability distribution from a given data set is often an intractable task. The approach is to assume some probability distribution Q(·) over the model and improve it over iterations with respect to some loss function. In EM, the observed and hidden states are all considered random variables, where each variable assumes certain values with certain probability scores. Each random variable has an associated formula which is used to assess the probability of a given observation. Let us consider a scenario of spam and non-spam classification of email messages and the definition of nodes.

Example of email spam and non-spam problem - posing as graphical model

• Define spam node

• Define non-spam node

• Determine vocabulary of the whole set of messages

• Each of the nodes can generate all of the words of the vocabulary

• However, the probability of emitting each word differs between the nodes

• Given a collection of spam and non-spam messages

• The task is to learn these probabilities

• There can be other features, such as bulk email addresses and misspelled names; they can also be modeled likewise, albeit differently

The set of parameters used in all of the probability distribution functions of the nodes of the graphical model is indicated by the symbol Θ. The probability distribution function itself is indicated by Q(Θ) over the parameter values. The set of all hidden variables is indicated by the symbol Z. The probability distribution function for the values of Z given Θ is denoted by Q(Z|Θ). The actual (unknown) probability distribution is denoted by P(·). The actual data is denoted by the symbol X.
The problem now is how to learn from data the parameters Θ of the function Q(·). As it is impossible to exactly assess the true distribution P(·) of parameters and of the values taken by hidden and observed states, we can only approximate it by an empirical distribution and refine that over multiple iterations. The EM method starts with some distribution Q(·) over parameters and node values and improves the parameters over iterations so as to reduce the difference with respect to P(·). The EM procedure tries to find the Θ that maximizes the likelihood of the data given the parameters (Equation 66).

Θ* = arg max_Θ P(X|Θ)    (66)

We need a formulation in terms of Q(·), since P(·) is unknown. The following algebra results in a form that brings Q(·) into the formulation.

P(X|Θ) = Σ_Z P(X, Z|Θ) = Σ_Z (Q(Z|Θ)/Q(Z|Θ)) ∗ P(X, Z|Θ)
       = Σ_Z Q(Z|Θ) ∗ (P(X, Z|Θ)/Q(Z|Θ))
       = E_Z[P(X, Z|Θ)/Q(Z|Θ)]
Finding the maximizing Θ* is equivalent to determining the maximizer of the log likelihood (Equation 67).

Θ* = arg max_Θ P(X|Θ)
⇒ Θ* = arg max_Θ E_Z[P(X, Z|Θ)/Q(Z|Θ)]

Θ* = arg max_Θ log(E_Z[P(X, Z|Θ)/Q(Z|Θ)])    (67)
However, dealing with log(Σ ·) forms is more difficult than Σ log(·) forms. We apply Jensen's inequality to (Equation 67) to obtain (Equation 68). This form gives a lower bound that needs to be tightened, after which Θ* can be determined.

log(E_Z[P(X, Z|Θ)/Q(Z|Θ)]) ≥ E_Z[log(P(X, Z|Θ)/Q(Z|Θ))]    (68)
The bound in (Equation 68) is tight when the argument of the logarithm takes (approximately) the same value across all Z (Equation 69).

P(X, Z|Θ) ≈ Q(Z|Θ)    (69)

Thus updating Q(·) for its posterior value increases the value of the lower bound which is also called
tightening the lower bound. This is equivalent to determining the posterior distribution of Z given Θ and
the data set X.

Q(Z|Θ) ≈ P(X, Z|Θ)
⇒ (∃c) : Q(Z|Θ) = c ∗ P(X, Z|Θ)

Σ_Z Q(Z|Θ) = 1 ⇒ Q(Z|Θ) = c ∗ P(X, Z|Θ) / Σ_{Z′} c ∗ P(X, Z′|Θ)
                        = P(X, Z|Θ)/P(X|Θ) → P(Z|X, Θ)
The Θ itself needs to be updated to maximize the likelihood over the data (Equation 70).

Θ* = arg max_Θ Q(X|Θ)    (70)

Q(X|Θ) = Σ_Z Q(X, Z|Θ) = Σ_Z Q(X|Z, Θ) ∗ Q(Z|Θ) = E_Z[Q(X|Z, Θ)]

⇒ arg max_Θ Q(X|Θ) = arg max_Θ E_Z[Q(X|Z, Θ)] = arg max_Θ log(E_Z[Q(X|Z, Θ)])

⇒ arg max_Θ log(E_Z[Q(X|Z, Θ)]) ≥ arg max_Θ E_Z[log(Q(X|Z, Θ))]

The error function is defined as Λ(Θ) = E_Z[log(Q(X|Z, Θ))].
The maximizing value Θ* is derived analytically or by gradient methods, using a convenient Q(·) such as the exponential family of distributions.

Θ* ← solution of ∂Λ(·)/∂Θ = 0

9.2.1 E and M steps

E-step: The expectation step corresponds to computing problem-specific metrics based on the given values of Q(·). The metrics are averaged over the probability scores of the current assignments of the latent variables.

M-step: The maximization step is mostly analytically deduced per problem by experts. This step updates the Θ values based on the aggregate metrics computed in the E-step above.

The two steps are repeated until the probability values saturate to within a threshold error, or for a fixed number of iterations. As with all iterative methods, we almost never know the exact number of iterations in advance. The engineering principle is simply that the solution is better than nothing, i.e. one starts with a random setting and only improves over iterations.
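To make the E and M steps concrete, below is a minimal numpy sketch of EM for a one-dimensional mixture of two Gaussians; the initialization, iteration count and variable names (mu, sigma, pi_k, resp) are illustrative choices, and production code would additionally guard against degenerate variances.

import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

mu = np.array([-1.0, 1.0])       # initial Theta: component means
sigma = np.array([1.0, 1.0])     # component standard deviations
pi_k = np.array([0.5, 0.5])      # mixing weights

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: posterior responsibility Q(Z|Theta) of each component per point
    w = np.stack([pi_k[k] * gauss(x, mu[k], sigma[k]) for k in range(2)])
    resp = w / w.sum(axis=0)
    # M-step: analytically optimal Theta given the responsibilities
    Nk = resp.sum(axis=1)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)
    pi_k = Nk / len(x)

print(mu, sigma, pi_k)   # should approach means (-2, 3) and weights (0.3, 0.7)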

9.2.2 Sampling error minimization

The E step closely relates to drawing a sample from the given Q(·) distribution. This can be seen in the case of K-means, where each point updates its cluster membership. Let us denote sample generation by X′ ← Γ(Θ).
The M step corresponds to analytically deducing the optimal parameters for the error between the
generated sample and the given data. The alternative formulation then corresponds to (Algorithm 22). The
Γ(Θ) is a sampler and any of the standard techniques can be used such as MCMC or Gibbs among others.

Algorithm 22 Simplified pseudocode - Sampling error minimization


Require: X and initial Θ
1: for i ∈ [1 . . . n] do
2: Λ(Θ) = |Γ(Θ) − X|2
3: Θ∗ = arg minΘ Λ(Θ)
4: end for
5: return Θ∗

This method of sampling based parameter estimation is used in Restricted Boltzmann Machines (RBM)
and the formulation is based on minimizing contrastive divergence.

85
Genetic algorithms: Genetic algorithms can be posed under the sampling error minimization formulation as well. A new population of individuals is generated using crossover and mutation operators, which corresponds to the sampling step. The maximization step corresponds to applying a fitness function to the new population and selecting the top individuals for further iterations.
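A toy sketch of this loop is shown below for the classic one-max problem (maximize the number of 1 bits in a bit string); the population size, mutation rate and fitness function are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
n_bits, pop_size = 20, 50
pop = rng.integers(0, 2, size=(pop_size, n_bits))
fitness = lambda p: p.sum(axis=1)   # "one-max": count of 1 bits

for _ in range(100):
    # Selection: keep the fittest half (the "maximization" step)
    top = pop[np.argsort(fitness(pop))[-pop_size // 2:]]
    # Crossover: splice random parent pairs (the "sampling" step)
    a = top[rng.integers(0, len(top), pop_size)]
    b = top[rng.integers(0, len(top), pop_size)]
    cut = rng.integers(1, n_bits, (pop_size, 1))
    idx = np.arange(n_bits)
    pop = np.where(idx < cut, a, b)
    # Mutation: flip each bit with small probability
    pop ^= (rng.random(pop.shape) < 0.01).astype(pop.dtype)

print(fitness(pop).max())   # approaches n_bits over iterations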

9.3 Markovian networks


A Markov model [63] is a Bayesian network without hidden variables. The transitions between states and the emissions of output symbols are defined for every node. In the case of categorical data with a finite vocabulary size, a Markov network is defined directly over the observed symbols. The transition and emission probabilities need to be learned from data using a maximum likelihood formulation.
A set of states Σ = {S1, . . . , Sn} is defined. Each state emits an output symbol from the vocabulary W = {w1, . . . , wm}. Each state can emit any of the output symbols with probability (∀i, j) : P(wi|Sj). The transitions between states are defined using the probability (∀i, j) : P(Si|Sj). The parameters of the Markov model are then all these probabilities bundled into a single set Θ.
The learning aspect of the Markov model defines the likelihood of a training sequence over the model probabilities. Let Ω = [o1, . . . , ok] : oi ∈ W be a training data sequence of length k. Let ξ = [σ1, . . . , σk] be the sequence of states that would have emitted the observations Ω, i.e. σj ∈ Σ.
The likelihood of the sequence, P(Ω|ξ), factorizes due to the Markovian assumptions, and finding the best state sequence is the maximization problem in (Equation 71).

P(Ω|ξ) = P(o1, o2, . . . , ok|σ1, . . . , σk)
       = P(σ1) ∗ Π_{i=2}^{k} P(oi|σi) ∗ P(σi|σi−1)

ξ* = arg max_ξ P(Ω|ξ)    (71)

Given a particular assignment for Θ, (Equation 71) can be solved using a dynamic programming algorithm such as Viterbi to determine the optimal assignment of states. The state assignment to the input data can also be thought of as generation of data points, which are tuples of the form <emit, Sj, ok>, <transit, Si, Sj> and <start, Sk>, denoting emission, transition and starting respectively. Let us denote the data obtained from the optimal Markov state assignment as,

D′ = {<emit, Sj, ok>, <transit, Si, Sj>, <start, Sk>}
Based on these new assignments, the underlying probabilities of individual states can be adjusted as in (Equation 72).

Θ* = arg max_Θ P(D′|Θ)    (72)

By choosing convenient exponential distributions and maximizing log(P(D′|Θ)), the Θ* can be deduced analytically. When the sampling and maximization steps are repeated over time, the system converges to appropriate Θ* values.
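For concreteness, below is a minimal sketch of the Viterbi dynamic program mentioned above, assuming the start, transition and emission probabilities are already known; the toy two-state model is invented for illustration.

import numpy as np

def viterbi(obs, start, trans, emit):
    """obs: observation indices; start[s], trans[s_prev, s], emit[s, o]."""
    n_states, T = len(start), len(obs)
    logp = np.full((T, n_states), -np.inf)   # best log-probability per state
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = logp[t - 1] + np.log(trans[:, s]) + np.log(emit[s, obs[t]])
            back[t, s] = np.argmax(scores)
            logp[t, s] = scores[back[t, s]]
    # Trace back the optimal state assignment xi*
    path = [int(np.argmax(logp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 0, 1, 1], start, trans, emit))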

An illustration of a very simple Markovian model, where output states are directly connected, is shown in (Figure ??). A simple Markovian model has no hidden states that would impose further structure on the observed states.

9.3.1 Hidden Markov Model

The Hidden Markov Model (HMM) [64, 62] is an extension of the Markov model concept in which hidden states are used as well. A set of states Σ = {S1, . . . , Sn} is defined. Each state emits an output symbol from the vocabulary W = {w1, . . . , wm}. Each state can emit any of the output symbols with probability (∀i, j) : P(wi|Sj). However, in an HMM we do not have direct transitions between observed states, i.e. there is no (∀i, j) : P(Si|Sj).
Instead we define hidden variables Z = {z1, . . . , zq}. The transitions from hidden variables to observed states are defined as (∀i, j) : P(Si|zj). The transitions between hidden states are defined as (∀i, j) : P(zi|zj). Let the parameters all be bundled into a single set Θ, just as in the case of the Markov model. The steps are outlined in (Algorithm 23).

Algorithm 23 Simplified Pseudocode - HMM algorithm


Require: //Training data sequences
1: Initialize : Θ
2: for i = 1 : niter do
3: //Expectation step - deduce optimal tuples for emissions, transitions and start states
4: Construct D′
5: //Maximization step - using analytical pre-computed formulae
6: Find optimal: Θ∗ = arg maxΘ P (D′ |Θ)
7: end for
8: return Θ∗

9.3.2 Latent Dirichlet Analysis

Latent Dirichlet Analysis (LDA) [65, 66] is a very important method for automatic detection of topics in data. Though it is mostly applied to text data, non-textual data can be cast into the topic detection formulation as well. The approach is to have latent topic variables in documents, with topics generating the vocabulary, which are the observed symbols. Starting with a random distribution of topics per document and vocabulary per topic, the algorithm arrives at optimal parameter values iteratively. The user needs to specify only the number of topics, and the algorithm simultaneously clusters documents and vocabulary according to topics. The iterations are similar to the K-means algorithm, where each point changes its cluster membership and each cluster centroid changes its location.
Let W denote vocabulary, W = {w1 , . . . , wn }. Let D denote the set of documents, D = {d1 , . . . , dm }.
Let T denote the set of topics, T = {t1 , . . . , tk }. Let the probability distribution of words per topic be,
q(w|t). Let the probability distribution of topics per document be, q(t|d). Let the parameters behind these
probability distributions be, Θ.
The topic detection algorithm proceeds much like the standard EM algorithm as below.

• Initialize q(·).

• Select optimal assignments - (i) topics per document and (ii) words per topic based on closeness
measure as in K-means.

87
• Based on the memberships of topics, refine the Θ values using pre-defined update equations.

• Repeat the above steps for a certain number of iterations.

A simplified pseudocode of the algorithm for illustration purposes is shown in (Algorithm 24). The actual implementation is much more complex than the one shown here and involves adjusting counts in the topic-document and topic-word matrices. We present a formulation based on a K-means-like methodology for the benefit of a reader beginning to understand the concepts behind deploying topic models.

Algorithm 24 Simplified pseudocode - Topic modeling


Require: k //number of topics
1: (∀w ∈ W ) : map[w] = RAN D(1, . . . , k)
2: (∀d ∈ D) : map[d][w] = RAN D(1, . . . , k)
3: for iter = 1 . . . niter do
4: (∀w ∈ W ) : q(w|t) //calculate probability of word per topic
5: (∀d ∈ D) : q(t|d) //calculate probability of topic per document
6: (∀t) : q(t) //calculate probability of topic
7: for d ∈ D do
8: for t ∈ T do
9: for w ∈ W do
10: UPDATE: map[d][w] = arg maxt′ q(t′ , w|d) = arg maxt′ q(w|t′ ) ∗ q(t′ |d)
11: UPDATE: map[w] = arg maxt′ q(t′ , w) = arg maxt′ q(w|t′ ) ∗ q(t′ )
12: end for
13: end for
14: end for
15: end for

The topic model algorithm is not restricted to text data. Any other multimedia data can be modeled in the topic detection framework. Some examples of processing non-textual data using the topic modeling framework are outlined below.

Topic modeling of audio data:

• Consider a raw audio file such as .wav where sampled amplitudes are stored

• Create a sliding window of some size k amplitudes

• Slide through each audio file, at some stride (i.e. hop length) and generate slides

• Cluster all the slides into h clusters

• Each slide corresponds to a cluster

• Consider each cluster as a word and the vocabulary size is the number of clusters (h)

• For each audio file, its constituent slides correspond to its words (if the length of the input is l, the number of slides is l − k + 1)

• Now each audio file is a sentence composed of words in a vocabulary which are clusters of audio slides

Topic modeling of image data:

• Consider an image file

• Consider a sliding rectangular region, call it sliding box

• Slide through an image from top-left to bottom-right corner using the selected rectangular sliding box

• Cluster all the rectangular image patches into h clusters

• If each image is composed of l slides, then it is equivalent to a sentence having l words, where each word corresponds to a cluster

• Now the problem is posed as topic modeling on a set of sentences which are images whose vocabulary
are cluster centroids of image patches

10 Deep learning
Deep learning [73, 74] is a methodology for transforming an input in a given dimensional space to another
dimensional space. The vectors in the transformed space are relatively spaced according to strength of
correlations between features in the input space. The correlations can be non-linear and highly complex
leading to applicability of deep learning algorithms to a wide array of real world problem statements. The
deep networks are used for determining complex patterns in the input which are difficult to be captured and
maintained by human feature engineering teams [75].
The intuition behind why deep neural networks work today, the way they do, comes from an interpretation of the representation of a computer program. A computer program can be abstracted as a flow chart where nodes denote computations and edges denote flow of information. Assume the compute nodes are further decomposed to the finest level of basic arithmetic operations over integers and real numbers. Now the question is whether a neural graph or network can be constructed to emulate this compute graph. The answer is yes, due to the Universal Approximation Theorem [76, 77]. The universal approximation theorem only states that one can compose, by hand, weights of a neural network to mimic any arbitrary function. However, the task in machine learning is to learn from data. Learning a complex compute graph is practically not feasible, and therefore simple architectures with layered structures and sparse cycles have come into existence.
A number of deep neural network architectures are used in state of the art machine learning models today, including the following. In this section we will discuss their architectures and working details [75].

• Deep Belief Networks (DBN) [78]

• Restricted Boltzmann Machine (RBM) [79, 80]

• Auto Encoder (AE) [81]

• Variational Auto Encoder (VAE) [82]

• Convolutional Neural Networks (CNN) [83]

• Recurrent Neural Networks (RNN) [84] and variants including Long Short Term Memory networks
(LSTM) [85]

• Generative Adversarial Networks (GAN) [86] and variants such as Deep Convolutional GAN (DCGAN)
[87]

Table 10: Standard activation functions

Activation function   Formula
Sigmoid               σ(x) = 1/(1 + e^{−x})
ReLu                  relu(x) = max(0, x)
Leaky-ReLu            lrelu(x) = max(α ∗ x, x)
tanh                  tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})

10.1 Neural network


Artificial Neural Network (ANN), or simply called a neural network, is the state of the art computational model for data transformation via a weighted graph of nodes and edges [88, 89, 90, 91]. The weights of the network are learned from the data. Every node has incoming edges and outgoing edges. A node consumes data through its incoming edges, performs arithmetic operations on the input data and emits output values. The output values are broadcast through all the outgoing edges. During the learning phase, bits and pieces of error values flow in the reverse direction and update the weights all along. The data flowing into a neuron has already been multiplied by the edge weight as it flows.
The first task a neuron does is a simple addition of all its inputs. It then applies a transformation, called an activation function, before emitting the transformed data value. The output value gets replicated on all of the outbound edges. As the value flows through, it gets multiplied by the edge weights, and then the process repeats at the receiving neurons and so on. There are various types of activation functions, and custom activation functions can be used as well. Some of the commonly used activation functions are listed in (Table 10).
An illustration of a typical neural network is given in (Figure ??).
Consider a data set having N points, S = {(x1, y1), . . . , (xN, yN)}, where each point is d-dimensional. A single layer neural network with sigmoid activation function consumes each xi and emits a corresponding output ŷi, with bias b. This bias is the equivalent of the y-intercept in the equation of a line (Equation 11).

ŷi = σ(w · xi + b)

Shown below is the weight update for minimizing the squared error. There are several other forms of error as well; some of the most popular ones are described in (Table 11).

L(w, x) = Σ_{i=1}^{N} (ŷi − yi)^2

w* = arg min_w L(w, x)

⇒ w_new = w_old − η ∗ ∇_w L(w, x)|_{w=w_old}
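The sketch below applies this weight update to a single sigmoid neuron in numpy; the toy data, learning rate and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # toy labels

w = np.zeros(2)
b = 0.0
eta = 0.5
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(500):
    y_hat = sigmoid(X @ w + b)
    # dL/dw for L = sum (y_hat - y)^2, using sigmoid' = y_hat * (1 - y_hat)
    g = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    w -= eta * (X.T @ g) / len(X)
    b -= eta * g.mean()

print(((sigmoid(X @ w + b) > 0.5) == y).mean())   # training accuracy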
A neural network is a hierarchical organization from left to right. The left-most layer is numbered 1 and is always the input layer. The right-most layer is the output layer, which is always the last layer. In principle, one can have a minimal 2-layer neural network: one layer of inputs and the other of outputs. Each layer has neurons. The neurons are numbered from top to bottom, with the top-most neuron being 1.
Table 11: Some of the standard loss functions

Loss function        Formula
Mean Squared Error   (1/N) Σ_{i=1}^{N} (ŷi − yi)^2
Hinge loss           max(0, 1 − yi ∗ ŷi)
Cross entropy loss   −{yi ∗ log(ŷi) + (1 − yi) ∗ log(1 − ŷi)}

As layer 1 is the input, the number of neurons in the first layer is the same as the dimensionality of the input data. As it might be difficult to add a bias column to the input, we reserve one extra neuron in the 1st layer: the bias neuron, which always emits 1. Let us indicate the last layer by L, which is also the total number of layers in the neural network. There are connections between neurons from the ith layer to the (i + 1)th layer. The connections are typically indicated as from left to right. However, during the training phase, back-propagated error values flow in the reverse direction.
Consider a neural network having L layers. Let us denote the jth neuron of the ith layer by N[i][j]. Also, let us denote the size of the ith layer by the attribute N[i].size, including the bias neuron, which always emits 1. Let us denote the weight on the connection between N[i][j] and N[i + 1][k] by W[i][j][k]. Compactly, W[i] then denotes a matrix of size N[i].size × N[i + 1].size. The output of the ith layer is an N[i].size-dimensional vector, say X[i], where X[i][j] denotes the value emitted by the jth neuron of this layer. As the data flows into the network, at each node N[i + 1][k] a summation over incoming edges happens. The summation is,

Z[i + 1][k] = Σ_{j=1}^{N[i].size} X[i][j] ∗ W[i][j][k]

The summation can be captured in the form of a matrix multiplication as,

Z[i + 1] = (X[i]^T × W[i])^T = W[i]^T × X[i]


The summations of weighted inputs are then passed through an activation function at each neuron. Let us denote the activation function by ψ(·), and the final output of the jth neuron of the ith layer by ψ(Z[i][j]). In compact notation, application of ψ(·) at the vector level is,

X[i] = ψ(Z[i])

Therefore the final output is X[L].
The final output is obtained by (Equation 73).

X[L] = ψ(Z[L]) = ψ(W [L − 1]T ∗ X[L − 1]) (73)

The output of ith layer, expressed in terms of the previous layer, recursively becomes (Equation 74).
Starting from the input X[1], the value gets computed by application of the weight matrices to generate
X[L].

(∀i = [L : 2]) : X[i] = ψ(W [i − 1]T X[i − 1]) (74)
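A minimal numpy sketch of the forward recursion in (Equation 74) is given below; the layer sizes are arbitrary and the bias neuron is omitted for brevity.

import numpy as np

rng = np.random.default_rng(4)
sizes = [4, 5, 3, 2]   # N[1].size ... N[L].size (illustrative choice)
W = [rng.normal(size=(sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]
psi = np.tanh          # activation function

def forward(x1):
    X = x1
    for Wi in W:       # X[i] = psi(W[i-1]^T X[i-1])
        X = psi(Wi.T @ X)
    return X           # X[L], the network output

print(forward(rng.normal(size=4)))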

Let us denote the ground truth output vector by Y, where Y[j] denotes the jth neuron's expected value, counting from top to bottom. Corresponding to the bias output of the Lth (last) layer, assume that Y[N[L].size] = 1 always. Recall that the bias neuron of any layer always emits 1.

Now we can formulate a loss function and gradient descent equations for the learning part of the network. Let us denote all the weight matrices simply by W; the loss function is then as shown in (Equation 75). Note that loss(W, X[1]) is a single real number after substitution of its parameter values.

loss(W, X[1]) = L(X[L], Y)    (75)

The partial derivative of the loss function with respect to the weight W[i][j] is given by the following equations, where ⊙ denotes the element-wise product and matrix dimensions are annotated as subscripts.

∂L(·)/∂W[i][j] = [(∂L(·)/∂X[L])^T]_{1×N[L].size} × [∂X[L]/∂W[i][j]]_{N[L].size×1}

∂X[L]/∂W[i][j] = ∂ψ(Z[L])/∂W[i][j] = [∂ψ(Z[L])/∂Z[L]]_{N[L].size×N[L].size} × [∂Z[L]/∂W[i][j]]_{N[L].size×1}

∂Z[L]/∂W[i][j] = ∂([W[L−1]^T]_{N[L].size×N[L−1].size} × [X[L−1]]_{N[L−1].size×1})/∂W[i][j]

= [∂(W[L−1]^T × X[L−1])/∂X[L−1]]_{N[L].size×N[L−1].size} × [∂X[L−1]/∂W[i][j]]_{N[L−1].size×1}
  + [∂(W[L−1]^T × X[L−1])/∂W[L−1]^T]_{N[L].size×N[L−1].size} × [∂W[L−1]^T/∂W[i][j]]_{N[L−1].size×1}

∂(W[L−1]^T × X[L−1])/∂X[L−1] = W[L−1]^T

∂(W[L−1]^T × X[L−1])/∂W[L−1]^T = X[L−1]

Putting it all together, the simplification becomes,

∂X[L]/∂W[i][j] = ∂ψ(Z[L])/∂W[i][j] = ∂ψ(Z[L])/∂Z[L] × ∂Z[L]/∂W[i][j]

= ∂ψ(Z[L])/∂Z[L] × ∂(W[L−1]^T × X[L−1])/∂W[i][j]

= ∂ψ(Z[L])/∂Z[L] × { ∂(W[L−1]^T × X[L−1])/∂W^T[L−1] × ∂W^T[L−1]/∂W[i][j] + ∂(W[L−1]^T × X[L−1])/∂X[L−1] × ∂X[L−1]/∂W[i][j] }

= ∂ψ(Z[L])/∂Z[L] × { X[L−1] × ∂W^T[L−1]/∂W[i][j] + W^T[L−1] × ∂X[L−1]/∂W[i][j] }

We also note that X[k] is independent of W[i][j] whenever k ≤ i, and the corresponding partial derivatives are set to zero wherever they arise.
The recursive definition becomes (Equation 76).

∂X[k]/∂W[i][j] =
  0                                        (i ≥ k)
  (∂ψ(Z[k])/∂Z[k]) × X[k−1]                (i = k − 1)    (76)
  W^T[k−1] × ∂X[k−1]/∂W[i][j]              (i < k − 1)

The loss function derivative becomes (Equation 77).

∂L(W, X[L])/∂W[i][j] = ∂L(W, X[L])/∂X[L] ∗ ∂X[L]/∂W[i][j]    (77)

Table 12: Some of the standard encoding methods
Method name Brief description
Binary encoding All columns are either 0 or 1
One hot encoding A special type of binary encoding, where only one 1 is present
Vector encoding Converting data into a numeric vector of fixed dimension
Text vectorizer Converting textual data into vector forms

10.1.1 Gradient magnitude issues

As we see from (Equations 76 and 77), the W[i][j] term is recursively multiplied along the way from the last layer L down to the (i + 1)th layer. This poses a problem for weight updates in the left-side layers of the neural network. Let us denote by ∇ψ(Z[i]) the gradient of the activation function at Z[i]. If (∀i, j) : ∇ψ(Z[i]) ∗ W[i][j] > 1, then the product of the terms becomes ≫ 1; this is called the exploding gradient problem. When (∀i, j) : ∇ψ(Z[i]) ∗ W[i][j] < 1, the product of the terms becomes ≪ 1; this is called the vanishing gradient problem.
The solution to these problems is two fold - (i) the choice of the ψ(·) function and (ii) the initialization of W[i][j]. The vanishing or exploding gradients problem occurs for the σ(·) activation function; the other activation functions in (Table 10) do not have this problem to the same degree. Also, the W[i][j] initialization can be done using Xavier's method, i.e. setting the values to a Gaussian random variable with mean 0 and variance inversely proportional to the average of the number of input and output neurons.

10.1.2 Relation to ensemble learning

There is a strong relationship between neural networks and ensembling methods [92, 93]. An ensemble of weak classifiers in a boosting algorithm makes a strong classifier; refer to the AdaBoost algorithm (9). In the case of a neural network, the outputs of the previous layer are fed into the next layer and so on. In the case of a boosting algorithm, the errors from the current cumulative classifier are fed into the new classifier as it is added, and so on. In neural networks too, the output of some neurons is input to further neurons and so on. The loss function is minimized from the right end of the neural network, which is equivalent to minimizing losses successively over layers from left to right. Essentially one can find resemblances between boosting algorithms and neural networks, and both are ensembling types of methods.

10.2 Encoder
Encoding corresponds to the transformation of points in a given dimensional space to another dimensional space. Typically the transformation is to a lower dimensional space. It is related to data compression, although the reverse transformation may not be possible. There are different ways of encoding real world data; some of the methods are tabulated in (Table 12).

10.2.1 Vectorization of text

Vectorization of text is one of the fundamental steps in text analytics [94]. Consider the scenario of classifying e-mail messages as spam or not-spam. In order for a classifier to consume the data, the input text needs to be in the form of a vector. There are various ways of vector representation of textual data, and all of the approaches operate over vocabulary sets [95].

• Gather all the words in the data set into a vocabulary set.

93
• Convert each e-mail message into a binary vector of 0’s and 1’s based on presence or absence of words

• The size of the vector is same as the size of the vocabulary

• The whole data set now is in a form ready to be consumed by standard machine learning algorithms

Count vectorization: In this technique, for each word, a count of the number of occurrences within a document or paragraph is stored in the vector representation, instead of mere presence or absence. A count vectorizer may be more informative than a plain binary vectorizer.

TF-IDF vectorization: In this technique [96] the frequency of a word within a document is computed, and it is then discounted by the number of documents in which that particular term occurs. This is more informative than a mere count vectorizer.

k-gram vectorization: In this technique [97], instead of a single word, a stretch of k words is considered as a single token. In this setting the size of the vocabulary increases exponentially in k. If, for instance, a vocabulary has 1000 words and we consider 3-grams, then the new vocabulary size will be 1000^3. Though this technique results in very sparse spaces, it often retains highly informative features, especially when combined with TF-IDF vectorization.

One-hot vectorization: This technique is applicable to category-type attributes. If an attribute has m categories and, at any point in time, i.e. in a given row of data, the attribute can take only one of the m values, then it is processed via one-hot encoding. In this encoding, a vector of length m is generated for that attribute and only one bit is set to 1, while the rest are all 0. The vectors for individual attributes are all appended together to form the final feature vector for the data.
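As a hedged illustration, the scikit-learn calls below realize the count, TF-IDF and k-gram vectorizers described above; the three-message corpus is invented for demonstration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap pills"]

counts = CountVectorizer().fit_transform(docs)           # count vectorization
tfidf = TfidfVectorizer().fit_transform(docs)            # TF-IDF vectorization
grams = CountVectorizer(ngram_range=(1, 3)).fit_transform(docs)  # k-grams, k <= 3

print(counts.shape, tfidf.shape, grams.shape)   # sparse document-term matrices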

10.2.2 Auto encoder

An auto-encoder [81] is a machine learning based vector encoding that reduces the difference between generated data and the input data. In this approach a given data point is processed via an operator matrix to generate a transformed vector. The transformed vector is then reverse-mapped back into the input space. The difference between the input and the re-generated input is minimized using an optimization formulation that learns the elements of the operator matrix.
Let a data point be x in a d-dimensional space. Let W_{d×k} be a transformation matrix. Now, the point can be transformed into k-dimensional space by,

x′ = W^T × x

The transformed point can then be transformed back into d-dimensional space by,

x^R = W × x′

Let the loss function be,

L(W) = ||x − x^R||^2

An optimal W is determined using gradient descent,

W* = arg min_W L(W)

On a data set X of points, the loss function is computed as a cumulative summation,

L(W) = Σ_{x∈X} ||W W^T x − x||^2
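Below is a minimal gradient descent sketch for this cumulative loss; the dimensions d = 10, k = 3, the synthetic near-low-rank data and the learning rate are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
# Synthetic data lying near a 3-dimensional subspace of R^10 (an assumption,
# so that a k = 3 encoding can reconstruct it well).
B = rng.normal(size=(3, 10))
X = rng.normal(size=(500, 3)) @ B + 0.01 * rng.normal(size=(500, 10))

W = rng.normal(scale=0.1, size=(10, 3))   # d = 10, k = 3
eta = 0.002

for _ in range(3000):
    R = X @ W @ W.T - X                    # reconstruction residual
    # gradient of sum_x ||W W^T x - x||^2 with respect to W
    grad = 2 * (X.T @ R @ W + R.T @ X @ W) / len(X)
    W -= eta * grad

print(np.mean((X @ W @ W.T - X) ** 2))     # error shrinks toward the noise level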

10.2.3 Restricted Boltzmann Machine

The Boltzmann distribution is a probability function used in statistical physics to characterize the state of a system of particles with respect to temperature and energy. The system can exist in several states, but the chance of being in a certain subset of states is higher than in others. The chance itself is parameterized over certain property values. As a system loses energy, it saturates to a certain state; when energy is high, it fluctuates heavily, and as it cools down it stabilizes to one of the states [79].
The analogy is used in artificial intelligence based state space search as well. In the case of simulated annealing, the probability of transition to a next state is proportional to a heuristic function value and the number of iterations, which emulates system cooling.
This method finds applications in synthetic generation of data, filling in missing values, and encoding of representations, which are of high importance in the data science regime. Imagine a situation where all features of a data point are interacting and influencing each other. A given data set can be considered a sample drawn from a probability distribution. How to induce a probability distribution on interactions between features, when considered over a collection of samples, constitutes the learning task. A pair of data points that exhibit certain feature values may be more highly related than some other pairs in the data set. Determining such a probability distribution helps in engineering better features, or in doing feature transformations compliant with the hidden structure in the data.
The past work is referred to as Ising models [98], where a lattice of electrons is studied for their spins subject to ambient energy. The spin of an electron can be +1 or −1. Given a lattice of electrons, the spin of an electron depends probabilistically on the spins of its neighbors. When the system has high energy, the spins are all random; as the system cools down, it converges to a specific structure of spins.
Relating this back to data science problems, one can imagine a data set where all features are binary. Each feature flips its binary value based on its neighbors. A number of such feature flippings are generated to form a synthetic data set. The difference between the synthetic data set and the given data is evaluated on any metric of choice, and the probability distribution parameters are adjusted. Over several iterations, the system converges to optimal parameter values that capture the structure in the data.
However, these models have limitations in capturing underlying or hidden principles connecting the direct features. When dealing with a large number of features, such as a data set of one-megapixel images, meaning that there are 1 million features, the direct feature correlation and neighborhood approach is not practical: we do not know how large a neighborhood to use, and there are too many parameters. Alternatively, one can model the problem by introducing hidden variables that connect one group of features to one hidden variable, another group to another hidden variable, and so on. The hidden states and the observed states may have inter-connections both between and among themselves. Such a model is called a Boltzmann model. The general Boltzmann model is impractical to solve due to the circular nature of the connections: the system forgets the learned parameters and repeats calculations.
A Restricted Boltzmann Machine (RBM) [80] is a simplification over the general Boltzmann machine approach, in the sense of imposing more restrictions on the structure of the graphical model. A bipartite graph is created, composed of hidden states and observed (or visible) states. There are no connections among the hidden states and no connections among the visible states themselves. These restrictions allow the system to learn parameters and converge over iterations.

A Bernoulli RBM is modeled as a mapping from a visible vector v in d-dimensional space to a hidden vector h in k-dimensional space via a matrix W_{d×k}. Both the visible and the hidden vectors are binary vectors. The hidden vector is derived from the visible vector by applying the W matrix and setting the bits of the h vector with probabilities given by a sigmoid activation function. Similarly, the visible vector is generated back from the h vector based on the sigmoid activation function.

h ∼ σ(W^T × v)
v^R ∼ σ(W × h)

The difference between the reconstructed visible vector v^R and the actual visible vector v is minimized over the data set X, (Equation 78).

W* = arg min_W Σ_{v∈X} ||W × W^T v − v||^2    (78)

An illustration of an RBM is shown in (Figure 27) where hidden and visible nodes are marked.

10.3 Convolutional neural network


A Convolutional Neural Network (CNN) [83] is a deep neural network architecture under the supervised learning methodology. It was mainly developed to address automatic learning of features in images through several layers of convolutions and scaling. Though such networks are typically used for images, and the initial demonstrations were on images, they can be applied to any type of data where all features have homogeneous semantics.

Feature homogeneity and localization: This is the notion that all features have the same semantics and value distributions, and possess proximity. Same value distribution means the probability that a feature Fi assumes a value v, P(Fi = v), is the same as P(Fj = v) for another feature Fj. Same semantics means all features have the same meaning: for instance, all pixels, all audio amplitudes, all temperatures, all electrostatic forces and so on. Localization of features corresponds to a notion of neighborhood among the features. For instance, a given pixel has 8 neighboring pixels, to the left, right, top, bottom and the 4 corners. In the case of an audio signal, the samples are taken at regular intervals, say every 10ms i.e. 100Hz; the neighbors of a given amplitude recording are the one on the left and the one on the right. In the case of a video recording, a given frame has a predecessor frame and a next frame, excluding of course the boundaries. In the case of text data, a given sentence is a sequence of words; the previous word and the next word constitute the neighborhood of a given word.

Notion of a filter: Consider the example of a household sieve, which segregates larger items from smaller ones. Though the filter outputs the smaller items mixed up among each other, it has still carried out the useful step of eliminating the large elements. Consider another example of a filter which segregates elements by their weight: in a mass spectrometer, elements of differing masses are separated out into different streams.

Convolution: Given a notion of neighborhood among the features of a data point, a weighted average over the neighborhood of a particular feature (imagine a pixel and its neighborhood) often captures more information.

For instance, consider a high resolution photograph of a house, of say 1 megapixel, with 1000 pixel width and 1000 pixel height. In order for us to recognize that image as a house, only a stamp sized image of a mere 100 pixels, with 10 pixel width and 10 pixel height, may be sufficient. Looking at information in higher resolution than required often leads to wrong predictions. It is necessary to purposefully down-sample the data while retaining aggregate information. One way to accomplish this task in signal processing is to take weighted averages of neighborhoods and down-sample.
A fundamental building block of a CNN is the convolution filter. A filter is a matrix of smaller dimension that is used for localized weighted averaging. Assume we are given an image in the form of a matrix X_{W×H}. Consider a filter as a small matrix F_{(2k+1)×(2k+1)} (assuming a simple odd-sized, square filter). For a given pixel position i, j in the input image, the convolution is defined as in (Algorithm 25).

Algorithm 25 Example - image convolution


Require: X, F //input and filter
1: X ′ = 0 //output image after convolution
2: for w = 0 . . . W − 1 do
3: for h = 0 . . . H − 1 do
4: //for each pixel, perform convolution
5: sum = 0
6: for p = −k . . . k do
7: for q = −k . . . k do
8: if 0 ≤ w + p ≤ W − 1 and 0 ≤ h + q ≤ H − 1 then
9: sum ← sum + X[w + p][h + q] × F [p + k][q + k]
10: end if
11: end for
12: end for
13: X ′ [w][h] ← sum
14: end for
15: end for
16: //Convolution is defined as, X ′ = X ⊙ F
17: //The same is computed by the above pseudocode
18: return X ′
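For reference, the same zero-padded, same-size operation can be obtained from scipy, which calls it correlation because the filter is not flipped; the toy image and Laplacian-style filter below are arbitrary choices for illustration.

import numpy as np
from scipy.signal import correlate2d

X = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
F = np.array([[0., 1., 0.],
              [1., -4., 1.],
              [0., 1., 0.]])                   # 3x3 Laplacian-style filter

# mode="same" with zero fill matches the boundary handling of Algorithm 25
X_out = correlate2d(X, F, mode="same", boundary="fill", fillvalue=0)
print(X_out)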

Tensor convolution: The convolution operation may be defined over a multi-dimensional filter as well. For instance, let X_{d1×d2×...×dk} be a k-dimensional tensor. Then a convolutional filter will also have k dimensions, of size F_{p1×p2×...×pk}. A convolution at position a is then defined as in (Equation 79) for a multi-dimensional matrix or tensor.

X′[a] = Σ_{i=1}^{k} X[a − pi/2 : a + pi/2, i] ⊙ F[0 : pi, i]    (79)

The convolution operation is efficiently realized in a highly parallel environment on GPU architectures. This is the primary reason why convolutional neural networks train in orders of magnitude less time on a GPU than on a CPU.

10.3.1 Filter Learning

The filters in a convolutional neural network need to be learned from data. In order to illustrate the filter
learning process, let us consider a very simple example which can trivially extend to complex and real world
cases.
Assume we are building a one dimensional CNN, where the input is one dimensional and the output is a single numeric value. The input goes through a convolution phase with a one dimensional filter. Let X = [x0, . . . , xn−1] be an input of dimension 1 × n, i.e. a one dimensional array. Let σ(·) be the activation function of the neural network. Let the convolutional filter F = [f0, f1, f2] be one dimensional with 3 elements. The set up is shown in (Algorithm 26). The input is convolved with F[] to generate an intermediate output. The convolved vector is then multiplied with the W[] vector to generate the output (Equation 80). The output is compared to the ground truth via a loss function.

o(X, F, W) = σ(W · (X ⊙ F))    (80)

Algorithm 26 CNN Filter learning example


1: //initialize convolution X ′ = [0]1×n
2: for i = 0 : n − 1 do
3: sum = 0
4: for p = [−1 · · · + 1] do
5: if 0 ≤ i + p ≤ n − 1 then
6: sum = sum + X[i + p] ∗ F [p + 1]
7: end if
8: end for
9: X ′ [i] = sum
10: end for
11: ŷ = σ(Σ_{i=0}^{n−1} wi ∗ X′[i])
12: //Loss function L(y, ŷ)

The derivative of the loss function with respect to the filter values is computed below, using f0 as an example. The loss function derivative for each of the fi is given in (Equation 81). The dimensions are indicated for clarity.
∂L(ŷ, y)/∂f0 = ∂L(ŷ, y)/∂ŷ × ∂ŷ/∂f0

∂ŷ/∂f0 = ∂σ(W · X′)/∂f0 = ∂σ(W · X′)/∂W × ∂W/∂f0 + ∂σ(W · X′)/∂X′ × ∂X′/∂f0

(∵ ∂W/∂f0 = 0)

= ∂σ(W · X′)/∂W × 0 + ∂σ(W · X′)/∂X′ × ∂X′/∂f0 = ∂σ(W · X′)/∂X′ × ∂X′/∂f0

(∵ ∂(W · X′)/∂X′ = W)

= W × ∂X′/∂f0

(∵ ∂X′/∂f0 = ∂(X ⊙ F)/∂f0 = ∂[x_{i−1} ∗ f0 + x_i ∗ f1 + x_{i+1} ∗ f2]_{i=0...n−1}/∂f0 = [x_{−1} . . . x_{n−2}]^T)

= W^T × [x_{−1} . . . x_{n−2}]

(∀i ∈ [0, . . . , 2]), (∀j ∉ [0, . . . , n−1] : x_j = 0) :

∂L(ŷ, y)/∂f_i = ∂L(ŷ, y)/∂ŷ × ([x_{i−1}, . . . , x_{n−2+i}]_{1×n} × W_{n×1})_{1×1}    (81)

Each of the filter's weights is updated using gradient descent as in (Equation 82), where η is the learning rate.

f_i^{(t)} = f_i^{(t−1)} − η ∗ ∂L(ŷ, y)/∂f_i    (82)

Filter update for multiple output neurons: The filter update equation presented in (Equation 82) can be extended to more complex scenarios. The filter F (in Equation 80) is situated between a pair of input and output layers. Extending the scenario of (Equation 82) to multiple output neurons o = 1 . . . O gives (Equation 83).

f_i^{(t)} = f_i^{(t−1)} − η ∗ Σ_{o=1}^{O} ∂L(ŷo, y)/∂f_i    (83)

Updating multi dimensional tensor filters: The filter presented in (Equation 83) is one dimensional. However, the formulation does not stop one from extending the filters to multiple dimensions. If the input is a k-dimensional tensor, the convolving filter is also a k-dimensional tensor. The update equation for a tensor-type filter is given in (Equation 84).

f_{i1,...,ik}^{(t)} = f_{i1,...,ik}^{(t−1)} − η ∗ Σ_{o=1}^{O} ∂L(ŷo, y)/∂f_{i1,...,ik}    (84)

Table 13: Stride size

Stride   Output size
1        Original size
2        Half
3        One-third
k        1/k

Filter update in interior layers: A filter need not be between the end layers; it can be situated between any pair of layers in the interior of a CNN. Let a filter sit between the lth layer and the (l + 1)th layer, where the output of the lth layer is X[l] and the output of the (l + 1)th layer is X[l + 1]. The derivative of the loss function with respect to a filter cell between X[l + 1] and X[l] is given by,

∂L(·)/∂f_{i1,...,ik} = ∂L(·)/∂X[l + 1] × ∂X[l + 1]/∂f_{i1,...,ik}

10.3.2 Convolution layer

A typical CNN has several hundreds of filters at a convolutional layer, and several tens of layers. Each filter may also be a tensor in more than 3 dimensions. The dimensionality of a filter in the lth layer matches the dimensionality of the output of the lth layer. Each filter generates one output of the same size as its input, after convolution. If there are nf_{(l+1)} filters in the (l + 1)th layer, then the number of outputs generated is nf_{(l+1)}. All these outputs are weighted and combined into the required number of neurons in the (l + 1)th layer. The number of convolution layers in some of the popular CNN architectures is given in (Table 14).

10.3.3 Max pooling

Max pooling corresponds to selecting the maximum value within a neighborhood. Such an operation is desirable when the data needs to be down-sampled. A down-sampling operation makes good sense after an averaging step. For instance, in order to recognize a given image as an object of interest, it is sufficient to blur the image repeatedly and reduce its size until a small thumbnail-sized image is good enough to infer from.
In CNNs, max pooling is carried out with a stride (typically ≥ 2). Stride is the size of the jump in each of the tensor dimensions. If the jump is ≥ 2, then the image size reduces by that scale. Stride values and the corresponding output sizes are reported in (Table 13).
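A minimal numpy sketch of stride-2 max pooling is given below; it assumes, for brevity, that the input size is divisible by the stride.

import numpy as np

def max_pool2d(X, stride=2):
    # Group the input into stride x stride tiles and take the max of each tile
    H, W = X.shape
    return X.reshape(H // stride, stride, W // stride, stride).max(axis=(1, 3))

X = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(X))   # 2x2 output, each cell the max of a 2x2 neighborhood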

10.3.4 Fully connected layer

A fully connected layer refers to connections from all inputs into a designated set of output neurons in that layer. Typically the fully connected layer is the last layer of a CNN. If it is the final layer of the CNN, the neurons of this layer correspond to the output categories. In order to determine the final class, a softmax operation is used to emit probability scores relative to the other outputs.
If the (l + 1)th layer is a fully connected layer, then all the neurons of the previous (lth) layer are connected to each and every neuron of the (l + 1)th layer. The number of weights is L[l].size × L[l + 1].size.
A trained neural network can be saved with all its weights and re-used in a number of different applications. During the prediction phase, the input flows from left to right through the given neural network via the specified layers.

10.3.5 Popular CNN architectures

Some of the popular CNN architectures are shown in (Table 14).

Table 14: Popular CNN architectures


Name Input image type Layers Activation Reference Other comments
LeNet-5 32x32 greyscale Seven Sigmoid [99] First proof of concept on MNIST database
AlexNet 224x224 RGB Eight ReLu [100] ImageNet winner
ZFNet 224x224 RGB Eight ReLu [101] Minor modifications to AlexNet, Baseline high accuracy on ImageNet
Inception 224x224 RGB Twenty two ReLu [102] Higher accuracy on ImageNet
VGGNet 224x224 RGB Sixteen ReLu [103] Higher accuracy on ImageNet
ResNet 224x224 RGB One hundred and fifty two ReLu [104] State of the art highest accuracy on ImageNet

10.4 Recurrent neural network


Learning from sequential data requires a mechanism to remember context. In plain feed-forward neural networks and convolutional neural networks, there is no provision for memory. A recurrent neural network addresses this issue [84] by looping one of the outputs back to previous layers. Though the concept looks simple, it poses challenges when formulating back propagation equations. A certain protocol needs to be in place to define how the weights should be updated, such as back propagation through time.

Multi axle bus & the road analogy: The unrolling steps and the slicing of the input can best be imagined through an analogy with something we see every day. Imagine an RNN as a multi axle bus, where each of its tyres corresponds to an input that it consumes simultaneously with the others. The road itself corresponds to a variable sized input sequence of vectors. Exactly as a bus moves along a road, an RNN moves along an input sequence of vectors and processes them. Much like a bus that can move fast or slow, an RNN can take a longer or shorter stride through the input sequence. The unrolled steps of an RNN correspond to the axles of the bus. Each unrolled step of an RNN takes its corresponding input, much like the pressure impact of the road conditions on each of the axles. The previous stretch of the road and the current position of the bus affect the driver's perception of the road quality. In the case of an RNN, the hidden vector encodes the memory of the previous inputs together with the current input, and this vector changes as the RNN slides through the input.

10.4.1 Anatomy of simple RNN

The input is a sequence of vectors S = [x1, . . . , xN] where each xi ∈ R^d is a d-dimensional vector. An RNN cell is composed of one or more units. Each unit consumes one vector, some xj, at a time as input. If an RNN is composed of U units, the units respectively consume the vectors xi, . . . , xi+U−1 from the input. A hidden vector h_{k×1} is maintained to encode the state of the RNN upon receiving a certain sequence of inputs. Each unit is made up of 3 fundamental matrices - Wxh_{d×k}, Whh_{k×k}, Why_{k×m}. The consumption of input is shown in the list below. An illustration of the RNN unit is in (Figure 28).

• A single RNN unit consumes a single vector from the input and processes it

• Consume input xi

• Pass it through W xh matrix to generate hidden vector h

• Use the previous hidden vector to update the state using W hh matrix

• Use the output matrix W hy to emit output in m dimensional space

• Proceed to process the next point

• Multiple RNN units consume, consecutive sequence of inputs simultaneously and process them

• There is communication from left to right among the multiple RNN units

The steps are presented in mathematical equations below for a single unit RNN.

h_input = xi^T × Wxh
h_hprev = h_{(i−1)}^T × Whh
h_{(i)} = h_input + h_hprev
ŷi = h_{(i)}^T × Why

In the case of an RNN having multiple units, the h vector flows from left to right among the RNN units. Let us denote by Wxh_u, Whh_u, Why_u the three transformation matrices of the uth unit of an RNN cell. If the uth unit consumes xi, then the (u + 1)th unit consumes xi+1, and h_{(i)} is sent as input to the (u + 1)th unit. The U units emit the output vectors ŷi . . . ŷi+U−1. The left-most unit gets an h vector of 0 in most architectures; however, it can be configured to assume any other value as well. The RNN unit is given as a function in (Equation 85).

ŷ, h = Γ(x, h) (85)

A series of RNN units thus perform the operation as in (Equation 86).

[ŷi , . . . , ŷ(i+U −1) ] = Γ([xi , . . . , xi+U −1 ], h(i−1) ) (86)
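The sketch below implements the single-unit recurrence of the equations above in numpy, exactly as written (no nonlinearity between the two additive terms); the dimensions d, k, m and the 0.1 weight scale are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
d, k, m = 4, 8, 3
Wxh = rng.normal(scale=0.1, size=(d, k))
Whh = rng.normal(scale=0.1, size=(k, k))
Why = rng.normal(scale=0.1, size=(k, m))

def rnn_unit(x, h_prev):
    h = x @ Wxh + h_prev @ Whh   # h(i) = h_input + h_hprev, as above
    y_hat = h @ Why               # emit output in m dimensions
    return y_hat, h               # (y_hat, h) = Gamma(x, h), Equation 85

h = np.zeros(k)                   # left-most unit starts with h = 0
for x in rng.normal(size=(5, d)): # slide over an input sequence of vectors
    y_hat, h = rnn_unit(x, h)
print(y_hat)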

10.4.2 Training a simple RNN

Training an RNN results in updates to the elements of the matrices of the RNN unit. The output vectors are regressed against the ground truth via a loss function, and the gradient of the loss function is used for the weight updates. The loss function for the output vectors is given in (Equation 87). In this loss function a stride value of 1 is used; however, the step between successive positions i need not be 1, it can be more as well. The number of RNN units is denoted by U here.

L([Wxh, Whh, Why]) = Σ_{i=1}^{N} Σ_{j=i}^{i+U−1} ||ŷj − yj||^2    (87)

We should note that in (Equation 87) the Wxh, Whh, Why matrices are updated in one shot for the entire set of units 1 . . . U, as if [ŷi, . . . , ŷi+U−1] were a single vector. As the back-propagation happens over the unrolled units, it is commonly referred to as Back Propagation Through Time (BPTT).

10.4.3 LSTM

The Long Short Term Memory cell [85] is a special configuration of the basic RNN unit. In order for the RNN unit to remember context over long unrolled lengths, the h vector size has to be increased. However, increasing the h vector size increases sparsity in the model and induces sensitivity in the output vector generation. A solution to this problem is to capture the necessary complexity in the form of additional matrices called gates. The LSTM introduces three types of gates - input gate, output gate and forget gate. The input gate combines the input and updates the h vector, the output gate combines the current and previous h vectors, and the forget gate shields the current h vector from updates in the given RNN unit.

10.4.4 Examples of sequence learning problem statements:

Semantic encoding: In learning a semantic representation for a given sentence, the same problems of variable input size and dependence on the order of words occur. The task here is to learn a vector encoding for a given sequence of words, where the vector is a representation of the semantics.

10.4.5 Sequence to sequence mapping

Language translation: Consider the task of translating a sentence in a given language to another lan-
guage. The input is a sequence of words in one language. The output is a sequence of words in another
language. The sizes of input and output can be quite different. In the training data, there can be several
(input,output) pairs where lengths are varying. In this scenario, a fixed dimension input mechanism is not
suitable. A word at ith position depends on words at j th (j < i) positions.

Figure captioning: Given an image, the task is to generate a sequence of words describing it. Once objects in an image are identified and tagged with textual words, the image can be scanned from the top-left corner to the bottom-right corner in a left-to-right manner; this corresponds to an input sentence. The corresponding output sentence is a meaningful figure caption in the training data set of (input, output) pairs. The mapping can be learned via sequence to sequence learning, as in language translation.

Automated dialog-bot: Consider the case of a repository of software troubleshooting, where conversations between a human and a computer operator are recorded over time. This is a sequence to sequence scenario where the input is a query sequence of words and the output is another sequence of words constituting the response. Now, given a new query, how can we generate a response automatically? The problem is much like language translation, albeit the vocabulary and the words may be the same on both sides.

10.5 Generative adversarial network


A Generative Adversarial Network (GAN) [86] is a graphical model for the generation of synthetic data. The working process is similar to an RBM [80], where data is sampled from a Bernoulli network, i.e. hidden states are 0 or 1. In a GAN system as well, data is sampled from a neural network. The generator network in a GAN is architected in the reverse sense of a typical feed-forward network. In a feed-forward network, the size of the vector diminishes and the final output layer forms a vector for performing softmax. In a GAN, the architecture is reversed: it starts from a small vector, increases the vector dimensions over the layers, and the final output layer has the same size as the input image.

10.5.1 Training GAN

Training a GAN is accomplished using gradient descent defined over the output generated by the GAN versus ground truth data points. A GAN is linked to a discriminator network to form a loss function. The discriminator network classifies a given input from the adversarial network as fake or real. The discriminator network computes the data distribution difference between generated data and the known real data as the loss function. The cumulative difference between the discriminator network's encoding of the adversary-generated data and of the ground truth data is minimized by gradient descent over the adversarial network weights.
Let G(·) be the generator network. Upon input z, the generator network generates a synthetic data point. Let D(·) be the discriminator network, which generates a binary output: real or fake.
Name Remark
Scikit-learn Standard machine learning
Keras, Tensorflow Deep learning
Mxnet Deep learning
PyTorch Deep learning
Mallet Topic modeling
NLTK Natural language processing

Let X denote the ground truth (known real) data. Let XG denote the fake data generated by G(·) from random inputs to the generator network. The loss function is then formulated as,

L(G) = E_{xG∈XG}[log(1 − D(xG))] − E_{x∈X}[log(D(x))]

10.5.2 Applications of GANs

GANs are applied for data augmentation purposes. When the training data is not sufficient and does not capture the required varieties of sub-distributions, GANs are used to generate synthetic data. Some GAN-generated fake images appear so real that the human eye cannot distinguish between real and fake (Figure 29).
GANs may also be used to generate sequence data [105]. The formulation is closely related to reinforce-
ment learning where the generator is modeled as RL agent.

11 Applications and Laboratory exercises


Machine learning libraries are available in a wide variety of programming languages. Industry practice typically recommends Python, due to its ease of integration with other programs and its scripting capabilities on Unix based server-side systems. However, it is also common to see people using R, Matlab and other tools. Some popular, production grade libraries and their domains are presented in (Table 11).
There are a number of other implementations, such as Weka and a number of packages in R and Matlab, which are also useful. However, from the point of view of a big software team integrating with other code bases, though these implementations serve for building great prototypes, their scalability aspects restrict their usage in production grade systems.

11.1 Automatic differentiation


Automatic differentiation [106] refers to the generation of an expression parse tree after application of differentiation as an operator. In a graphical representation of a program, the loss function is essentially an expression tree. The more complex a loss function, the more difficult it is to perform gradient calculation manually and analytically. Today it has become the norm and the de-facto standard to use automatic differentiation for loss functions, minimizing human error and effort.
An illustration of a very minimalistic differentiation is shown in (Figure 30). The figure shows application of the differential operator to the multiplication operator,

d(X ∗ Y) → d(X) ∗ Y + d(Y) ∗ X

The recursive definition of the conversion of expression trees extends to loop structures as well. Most typically, the loop iterations are treated as incremental and the differentiation operation moves inside the loop. It is therefore the responsibility of the engineer to create loss functions in which each iteration of the loop corresponds to an operation on an independent parameter of the data. Given that a loss function is a sequence of operations rather than a single operation, the sequence is processed from top to bottom (or left to right) for application of the differential operator. Whenever the loss function refers to a previously used variable, the automatic differentiation machinery guarantees that the previously computed differential value of that variable is reused, seamlessly.
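To make the mechanism concrete, the following is a minimal sketch of forward-mode automatic differentiation using dual numbers; the Dual class and the derivative helper are illustrative constructions for this note, not the API of any particular library.

# Minimal forward-mode automatic differentiation with dual numbers.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, other):
        # Product rule applied mechanically: d(x*y) = d(x)*y + x*d(y)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    def __add__(self, other):
        # Sum rule: d(x+y) = d(x) + d(y)
        return Dual(self.value + other.value, self.deriv + other.deriv)

def derivative(f, x):
    # Seed the dual part with 1 to obtain df/dx at x.
    return f(Dual(x, 1.0)).deriv

# f(x) = x*x + x has derivative 2x + 1; at x = 3 this prints 7.0.
print(derivative(lambda x: x * x + x, 3.0))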

11.2 Machine learning exercises


Some of the laboratory exercises in the traditional machine learning problems would include the following list [95, 75]. A starter sketch for the first item is given after the list.

• Synthetic data generation and plotting - (i) Concentric circles, (ii) Moons and (iii) Blobs

• Gradient descent - (i) Full data, (ii) Batch, (iii) Stochastic variants

• Classifiers to train and fit - (i) Logistic regression, (ii) Support vector machine, (iii) Decision Tree

• Ensemble classifiers to train and fit - (i) Bagging classifier, (ii) AdaBoost, (iii) Gradient Boost

• Validation curves - Plot cross validation scores against (i) the depth of a decision tree and (ii) the size of the learning set

• Regression data - Generate synthetic data by perturbation of a sinusoidal curve by noise

• Regression models - (i) Fit linear regression, (ii) Decision tree regressor, (iii) Use polynomial features
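A minimal sketch for the first exercise above, assuming scikit-learn and matplotlib are available:

# Generate and plot the three standard synthetic data sets.
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles, make_moons, make_blobs

datasets = {
    "Concentric circles": make_circles(n_samples=300, factor=0.5, noise=0.05),
    "Moons": make_moons(n_samples=300, noise=0.05),
    "Blobs": make_blobs(n_samples=300, centers=3, random_state=0),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (title, (X, y)) in zip(axes, datasets.items()):
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(title)
plt.show()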

11.3 Clustering exercises


Some of the exercises on clustering would include the following list. A starter sketch for the first item is given after the list.

• Generate synthetic data - use sklearn libraries to generate blobs of data and retrieve k clusters using
k-means

• Perform agglomerative clustering of data points and threshold the dendrogram tree to match the
clusters

• Change the distance measure, repeat k-means clustering and count the number of steps till convergence

• Pose k-means as a gradient descent problem and retrieve the clusters

• Perform DBSCAN clustering of the S curve shaped data and compare against k-means on the same
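A minimal sketch for the first clustering exercise above, assuming scikit-learn is available:

# Generate blobs of data and retrieve k clusters using k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # retrieved cluster centers
print(km.labels_[:10])       # cluster assignments of the first 10 points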

11.4 Graphical model exercises


Latent Dirichlet Allocation (LDA) is most suitable for automatic topic extraction. The topics are not restricted to text; they can be sub-regions of images in an image data set or audio clips in an audio data set.
The following three exercise questions clarify the applicability scope of the LDA method.

11.4.1 Exercise - Topics in text data

Consider a corpus of text data composed of sentences, where each sentence is composed of words. The task is to apply text pruning to remove non-informative words and then initiate the steps of LDA. Given a number k of topics, deduce the topic-word matrix and the document-topic matrix, iterate until saturation and determine the optimal assignments. Use scikit-learn library functions for performing LDA; a minimal sketch follows.
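A minimal sketch of this exercise on a toy four-sentence corpus (the corpus and the choice of k = 2 topics are illustrative):

# Topic extraction with scikit-learn's LDA on a tiny stand-in corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["the cat sat on the mat",
          "dogs and cats are pets",
          "stocks fell as markets reacted",
          "investors traded stocks and bonds"]

# stop_words="english" prunes non-informative words.
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)   # document-topic matrix
print(doc_topic.round(2))
print(lda.components_.shape)            # topic-word matrix: (2, vocab size)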

11.4.2 Exercise - Topics in image data

Consider an image as composed of sub-image regions. Consider a sliding window, and slide it through the whole image from the top-left corner to the bottom-right corner with a stride of choice. As the window slides over the image, a large number of patches is formed per image. In a given data set of images, the patches per image multiply and form a large data set of patches. Cluster all the patches into some k clusters. Each cluster now corresponds to a word in our vocabulary. A given image is composed of patches, each belonging to some cluster or other, and hence to a certain vocabulary. Convert the image data set to a text corpus, albeit one whose words are cluster ids. Do not perform word pruning; carry out LDA on this image-to-text converted corpus. A sketch of the conversion step is given below.
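The conversion described above can be sketched as follows; the window size, stride, number of clusters and the random stand-in images are illustrative assumptions:

# Convert gray-scale images (2D numpy arrays) into "visual word" documents.
import numpy as np
from sklearn.cluster import KMeans

def image_to_words(images, window=8, stride=4, k=50):
    patches, owners = [], []
    for i, img in enumerate(images):
        for r in range(0, img.shape[0] - window + 1, stride):
            for c in range(0, img.shape[1] - window + 1, stride):
                patches.append(img[r:r+window, c:c+window].ravel())
                owners.append(i)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.array(patches))
    # One "document" per image; its words are the cluster ids of its patches.
    docs = [[] for _ in images]
    for owner, word in zip(owners, labels):
        docs[owner].append(word)
    return docs

docs = image_to_words([np.random.rand(32, 32) for _ in range(5)])
print(docs[0][:10])   # the first few "words" of the first image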

11.4.3 Exercise - Topics in audio data

Consider a raw audio waveform, a time series of amplitudes, as composed of several sub-regions in 1D. Consider a sliding window of a chosen interval. Slide it through the audio file from left to right to generate several patches per audio file. Over the whole database of audio files, the patches per file lead to a large number of patches across all files. Cluster the patches into some k clusters. Each cluster id now serves as a word in the vocabulary. The analysis is then similar to the image patch exercise above: apply LDA on the audio-to-text converted corpus to detect topics in the audio data.

11.5 Data visualization exercises


Visualization of high dimensional data in a lower dimensional space adds value to our understanding of the nature of the data. In this regard, exercises are listed below for practice. A starter sketch combining MDS, t-SNE and PCA is given after the list.

• Generate a synthetic data set of blobs using scikit-learn libraries

• Compute all pairwise distances and perform multi-dimensional scaling to plot the data on a 2D plane and visualize

• Perform tSNE and visualize the clusters

• Determine the top 2 principal components using PCA and plot the points about the two axes. In this exercise the blobs of data points should appear as distinct clusters in 2D as well

• Repeat the exercise on concentric circles data set, moons data set and the S-curve data sets

• Repeat the exercises on MNIST digits data set and identify if the digits are occurring in separate
regions on the 2D transformed space as well
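A minimal sketch covering the MDS, t-SNE and PCA items above, on a synthetic blobs data set:

# Reduce 10-dimensional blobs to 2D by three methods and plot side by side.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, TSNE

X, y = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)
embeddings = {
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
    "t-SNE": TSNE(n_components=2, random_state=0).fit_transform(X),
    "PCA": PCA(n_components=2).fit_transform(X),
}
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, Z) in zip(axes, embeddings.items()):
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=10)
    ax.set_title(name)
plt.show()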

11.6 Deep learning exercises


Some of the exercises to practice deep learning tools are listed below. A starter sketch for the first item is given after the list.

• Convolutional neural networks to be built and evaluated on MNIST and ImageNet data sets.

• Recurrent neural network (using LSTM) to be built to learn to map English-to-French.

• Consider a data set of audio wave forms for digit pronunciations. An LSTM to be built to vector
encode each digit pronunciation and determine classification accuracy on the vector encoded audio
waves.

• Build an auto-encoder to convert a digit image (from MNIST data set) into a vector and determine
classifier accuracy over the vectors using known digit classes
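A minimal sketch for the first exercise above, a small convolutional network on MNIST using Keras; the architecture and the single training epoch are illustrative choices:

# Train and evaluate a small CNN on MNIST.
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0   # add channel axis, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, batch_size=128)
print(model.evaluate(x_test, y_test))   # [loss, accuracy]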

12 Optimization
The optimization methods discussed so far include gradient descent for linear regression, logistic regression and neural networks. In the case of tree based methods, impurity scores are iteratively reduced using a hill climbing paradigm for choosing which attribute to split. However, in a number of real world problems related to molecular dynamics simulation, manifold learning and topological optimization, the loss function may be complicated and not differentiable. For a rigorous study of optimization problems and their numerical setting, the reader is referred to some of the standard texts in this domain [107, 108, 109].
Genetic algorithms [110] are used in cases where a gradient descent formulation is difficult and multiple local minima are expected. In a genetic algorithm, a solution is represented as a set of chromosomes. A chromosome is composed of a set of genes which are related to each other in certain semantic aspects. A gene is the fundamental unit that goes through mutation and cross-over operations over time. A simple problem may have just one chromosome with a few genes. A gene can be equated to a parameter-value assignment; a chromosome then corresponds to multiple parameter-value assignments that constitute a solution. A population is seeded with initial random assignments of parameter-value pairs, i.e. individuals. The individuals undergo mutation and cross-over operations over iterations, resulting in newer genes and hence newer individuals. The expanded population is then subjected to a fitness function, in which only the top k fit individuals are retained and the rest are discarded. A cross-over operation takes two solutions, mixes a subset of parameter-values between them and generates newer individuals. A mutation operation takes a solution and randomly modifies one of its parameter-values. The rates of mutation and cross-over operations are updated over time: initially the mutation rate is high, and as time progresses the cross-over rate is increased, since it results in a finer search as the population approaches the optimal solution.
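A minimal sketch of this procedure, where an individual is a vector of parameter values and the fitness function is the negated sphere function (all sizes and rates are illustrative):

# A tiny genetic algorithm: mutation, cross-over, top-k selection.
import random

GENES, POP, TOP_K, GENERATIONS = 8, 30, 10, 50

def fitness(ind):                 # higher is better; optimum at all zeros
    return -sum(g * g for g in ind)

def mutate(ind, rate):            # randomly perturb some parameter-values
    return [g + random.gauss(0, 1) if random.random() < rate else g
            for g in ind]

def crossover(a, b):              # mix a subset of parameter-values
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

population = [[random.uniform(-5, 5) for _ in range(GENES)]
              for _ in range(POP)]
for gen in range(GENERATIONS):
    rate = max(0.05, 0.5 * (1 - gen / GENERATIONS))   # decaying mutation rate
    children = [mutate(crossover(*random.sample(population, 2)), rate)
                for _ in range(POP)]
    # Keep only the top k fit individuals of the expanded population.
    population = sorted(population + children, key=fitness,
                        reverse=True)[:TOP_K]
print(fitness(population[0]), population[0])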

References
[1] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education Inc., 2003.

[2] N. J. Nilsson, Principles of artificial intelligence. San Francisco, CA, USA: Morgan Kaufmann Pub-
lishers Inc., 1980.

[3] S. M. Lavalle, Planning Algorithms. Cambridge University Press, 2006.

[4] T. Bylander, “The computational complexity of propositional STRIPS planning.,” Artif. Intell., vol. 69, no. 1-2, pp. 165–204, 1994.

[5] K. R. Ryu and K. B. Irani, “Learning from goal interactions in planning: Goal stack analysis and
generalization.,” in AAAI (W. R. Swartout, ed.), pp. 401–407, AAAI Press / The MIT Press, 1992.

[6] J. H. Siekmann, “Computational logic.,” in Computational Logic (J. H. Siekmann, ed.), vol. 9 of
Handbook of the History of Logic, pp. 15–30, Elsevier, 2014.

[7] I. Bratko, Prolog Programming for Artificial Intelligence. Addison-Wesley, 1986.

[8] M. Baaz and R. Iemhoff, “On skolemization in constructive theories.,” J. Symb. Log., vol. 73, no. 3,
pp. 969–998, 2008.

[9] J. Levy, “A unification algorithm for concurrent prolog.,” in ICLP (S.-Å. Tärnlund, ed.), pp. 333–341, Uppsala University, 1984.

[10] D. McDermott, M. Ghallab, and A. C. Howe, “PDDL-the planning domain definition language,” in
Proceedings of the International Conference on Artificial Intelligence Planning Systems, 1998.

[11] R. Duda, P. Hart, and D. Stork, Pattern classification. Wiley New York, 2001.

[12] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2008.

[13] A. Ng, “Cs229 lecture notes - supervised learning.” 2012.

[14] C. Cortes and V. Vapnik, “Support vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.

[15] G. J. J. van den Burg and P. J. F. Groenen, “Gensvm: A generalized multiclass support vector
machine.,” Journal of Machine Learning Research, vol. 17, pp. 225:1–225:42, 2016.

[16] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” Systems, Man
and Cybernetics, IEEE Transactions on, vol. 21, no. 3, pp. 660–674, 1991.

[17] C. Shi, X. Kong, P. S. Yu, and B. Wang, “Multi-label ensemble learning,” in Proceedings of the
ECML/PKDD 2011, 2011.

[18] G. Webb and Z. Zheng, “Multistrategy ensemble learning: Reducing error by combining ensemble
learning techniques,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 980–
991, 2004.

[19] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[20] Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm,” in International Con-
ference on Machine Learning, pp. 148–156, 1996.

[21] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statis-
tics, vol. 29, no. 5, pp. 1189–1232, 2001.

[22] D. H. Wolpert, “On bias plus variance,” Neural Computation, vol. 9, no. 6, pp. 1211–1243, 1997.

[23] S. Yakowitz and E. Lugosi, “Random search in the presence of noise, with application to machine
learning.,” SIAM J. Scientific Computing, vol. 11, no. 4, pp. 702–712, 1990.

[24] Y. L. Lui, H. T. Wong, C.-S. Leung, and S. Kwong, “Noise resistant training for extreme learning
machine.,” in ISNN (2) (F. Cong, A. C.-S. Leung, and Q. Wei, eds.), vol. 10262 of Lecture Notes in
Computer Science, pp. 257–265, Springer, 2017.

[25] Q. Zhang, A. Rahman, and C. D’Este, “Impute vs. ignore: Missing values for prediction.,” in IJCNN,
pp. 1–8, IEEE, 2013.

[26] R. Alejo, R. M. Valdovinos, V. García, and J. H. Pacheco-Sánchez, “A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios.,” Pattern Recognition Letters, vol. 34, no. 4, pp. 380–388, 2013.

[27] P. Soda, “A multi-objective optimisation approach for class imbalance learning.,” Pattern Recognition,
vol. 44, no. 8, pp. 1801–1810, 2011.

[28] M. Hutter and M. Zaffalon, “Distribution of mutual information from complete and incomplete data,”
Computational Statistics & Data Analysis, vol. 48, no. 3, pp. 633–657, 2005.

[29] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, “Large scale online learning of image similarity
through ranking,” Journal of Machine Learning Research, vol. 11, pp. 1109–1135, 2010.

[30] F. Yang, A. Ramdas, K. G. Jamieson, and M. J. Wainwright, “A framework for multi-a(rmed)/b(andit) testing with online fdr control.,” in NIPS (I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds.), pp. 5959–5968, 2017.

[31] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful ap-
proach to multiple testing,” Journal of the royal statistical society. Series B (Methodological), vol. 57,
no. 1, pp. 289–300, 1995.

[32] G. E. Hinton and T. J. Sejnowski, eds., Unsupervised Learning. Cambridge, MA: MIT Press, 1999.

[33] Z. Ghahramani, “Unsupervised learning,” in Advanced lectures on machine learning (O. Bousquet,
G. Raetsch, and U. von Luxburg, eds.), Lecture Notes in Artificial Intelligence 3176, Berlin: Springer-
Verlag, 2004.

[34] R. Xu and D. Wunsch, Clustering. Wiley-IEEE Press, 2008.

[35] E. Rasmussen, “Clustering algorithms,” in Information Retrieval. Data Structures & Algorithms.
(W. F. R. Baeza-Yates, ed.), pp. 419–442, Prentice Hall, 1992.

[36] E. Schubert, J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, “Dbscan revisited, revisited: Why and
how you should (still) use dbscan.,” ACM Trans. Database Syst., vol. 42, no. 3, pp. 19:1–19:21, 2017.

[37] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in NIPS, pp. 556–562,
2000.

[38] R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in Advances in Neural Information
Processing Systems, vol. 20, 2008.

[39] S. Ding, “Independent component analysis based on learning updating with forms of matrix transfor-
mations and the diagonalization principle.,” in FCST, pp. 203–210, IEEE Computer Society, 2006.

[40] M. Agarwal, H. Agrawal, N. Jain, and M. Kumar, “Face recognition using principle component analysis,
eigenface and neural network.,” in ICSAP, pp. 310–314, IEEE Computer Society, 2010.

[41] I. Jolliffe, Principal component analysis. Wiley Online Library, 2005.

[42] D. Edwards, Introduction to Graphical Modelling. Springer, June 2000.

[43] D. Heckerman and R. Shachter, “A definition and graphical representation for causality,” in Proc.
Eleventh Conf. on Uncertainty in Artificial Intelligence (UAI-95) (P. Besnard and S. Hanks, eds.),
(Montréal, Québec), pp. 262–273, Aug. 1995.

[44] M. Pacer and T. L. Griffiths, “A rational model of causal inference with continuous causes.,” in NIPS
(J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, eds.), pp. 2384–
2392, 2011.

[45] J. Zhang, A. G. Schwing, and R. Urtasun, “Message passing inference for large scale graphical models
with high order potentials.,” in NIPS (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and
K. Q. Weinberger, eds.), pp. 1134–1142, 2014.

[46] G. Hamerly and C. Elkan, “Learning the k in k-means,” in Advances in Neural Information Processing
Systems, vol. 17, 2003.

[47] D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding,” in SODA ’07: Pro-
ceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, (Philadelphia, PA,
USA), pp. 1027–1035, Society for Industrial and Applied Mathematics, 2007.

[48] I. Jonyer, L. B. Holder, and D. J. Cook, “Graph-based hierarchical conceptual clustering,” International
Journal on Artificial Intelligence Tools, vol. 10, no. 1-2, pp. 107–135, 2001.

[49] M. Bateni, S. Behnezhad, M. Derakhshan, M. Hajiaghayi, R. Kiveris, S. Lattanzi, and V. S. Mirrokni, “Affinity clustering: Hierarchical clustering at scale.,” in NIPS (I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds.), pp. 6867–6877, 2017.

[50] G. Golub and W. Kahan, “Calculating the singular values and pseudo-inverse of a matrix,” J. Soc.
Indust. Appl. Math.: Ser. B, Numer. Anal., vol. 2, pp. 205–224, 1965.

[51] N. N. Schraudolph, S. Günter, and S. V. N. Vishwanathan, “Fast iterative kernel pca.,” in NIPS (B. Schölkopf, J. C. Platt, and T. Hofmann, eds.), pp. 1225–1232, MIT Press, 2006.

[52] U. Demšar, P. Harris, C. Brunsdon, A. S. Fotheringham, and S. McLoone, “Principal component analysis on spatial data: An overview,” Annals of the Association of American Geographers, vol. 103, pp. 106–128, July 2012.

[53] V. Klema and A. Laub, “The singular value decomposition: Its computation and some applications,”
IEEE Transactions on Automatic Control, vol. 25, no. 2, pp. 164–176, 1980.

[54] L. Hogben, Handbook of linear algebra. Chapman & Hall/CRC, 2007.

[55] G. Strang, Introduction to Linear Algebra. Wellesley, MA: Wellesley-Cambridge Press, fourth ed., 2009.

[56] U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery. Morgan Kaufmann, 2001.

[57] B. Wu, J. S. Smith, B. M. Wilamowski, and R. M. Nelms, “Dcmds-rv: density-concentrated multi-dimensional scaling for relation visualization.,” J. Visualization, vol. 22, no. 2, pp. 341–357, 2019.

[58] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning
Research, vol. 9, pp. 2579–2605, 2008.

[59] F. Jensen and T. Nielsen, Bayesian Networks and Decision Graphs. New York, NY: Springer, 2nd ed.,
June 2007.

[60] R. Neal, Bayesian Learning for Neural Networks. New York: Springer-Verlag, 1996.

[61] A. McCallum and K. Nigam, “A comparison of event models for naive bayes text classification,” in AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48, AAAI Press, 1998.

[62] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” in Proceedings of IEEE, vol. 77, pp. 257–286, IEEE, 1989.

[63] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming. John Wiley &
Sons, 2014.

[64] S. R. Eddy, “What is a hidden markov model?,” Nat Biotech, vol. 22, pp. 1315–1316, Oct. 2004.

[65] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning
Research, vol. 3, pp. 993–1022, 2003.

[66] S. Nakajima, I. Sato, M. Sugiyama, K. Watanabe, and H. Kobayashi, “Analysis of variational bayesian
latent dirichlet allocation: Weaker sparsity than map.,” in NIPS (Z. Ghahramani, M. Welling,
C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 1224–1232, 2014.

[67] S. Wasserman and K. Faust, Social network analysis: Methods and applications. Cambridge Univ Pr,
1994.

[68] J. Peterson, Petri Net Theory and the Modelling of Systems. Prentice Hall, 1983.

[69] E. F. Codd, Cellular Automata. New York: Academic Press, 1968.

[70] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger, “Tackling the poor assumptions of naive bayes
text classifiers,” in ICML (T. Fawcett and N. Mishra, eds.), pp. 616–623, AAAI Press, 2003.

[71] C. Do and S. Batzoglou, “What is the expectation maximization algorithm?,” Nature biotechnology,
vol. 26, no. 8, pp. 897–899, 2008.

[72] T. Heskes, O. Zoeter, and W. Wiegerinck, “Approximate expectation maximization.,” in NIPS (S. Thrun, L. K. Saul, and B. Schölkopf, eds.), pp. 353–360, MIT Press, 2003.

[73] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[74] Y. Bengio, “Deep learning of representations for unsupervised and transfer learning,” Unsupervised
and Transfer Learning Challenges in Machine Learning, vol. 7, p. 19, 2012.

[75] A. Geron, Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and tech-
niques to build intelligent systems. Sebastopol, CA: O’Reilly Media, 2017.

[76] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems (MCSS), vol. 2, pp. 303–314, Dec. 1989.

[77] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, 1991.

[78] G. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural
Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[79] A. Fischer and C. Igel, “An introduction to restricted boltzmann machines,” in Progress in Pattern
Recognition, Image Analysis, Computer Vision, and Applications (L. Alvarez, M. Mejail, L. Gomez,
and J. Jacobo, eds.), (Berlin, Heidelberg), pp. 14–36, Springer Berlin Heidelberg, 2012.

[80] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent temporal restricted boltzmann machine.,”
in NIPS (D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, eds.), pp. 1601–1608, Curran Associates,
Inc., 2008.

[81] W. Joo, W. Lee, S. Park, and I.-C. Moon, “Dirichlet variational autoencoder.,” CoRR,
vol. abs/1901.02739, 2019.

[82] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for
deep learning of images, labels and captions.,” in NIPS (D. D. Lee, M. Sugiyama, U. von Luxburg,
I. Guyon, and R. Garnett, eds.), pp. 2352–2360, 2016.

[83] L. Lu, Y. Zheng, G. Carneiro, and L. Yang, eds., Deep Learning and Convolutional Neural Networks for
Medical Image Computing - Precision Medicine, High Performance and Large-Scale Datasets. Advances
in Computer Vision and Pattern Recognition, Springer, 2017.

[84] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85 –
117, 2015.

[85] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8,
pp. 1735–1780, 1997.

[86] I. J. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks.,” CoRR, vol. abs/1701.00160,
2017.

[87] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional
generative adversarial networks.,” in ICLR (Y. Bengio and Y. LeCun, eds.), 2016.

[88] P. J. Braspenning, F. Thuijsman, and A. J. M. M. Weijters, eds., Artificial Neural Networks: An Introduction to ANN Theory and Practice, vol. 931 of Lecture Notes in Computer Science, Springer, 1995.

[89] L. C. Jain and G. N. Allen, “Introduction to artificial neural networks.,” in Electronic Technology
Directions (L. C. Jain, ed.), pp. 36–62, IEEE Computer Society, 1995.

[90] M. A. Nielsen, “Neural networks and deep learning,” 2018.

[91] C. C. Aggarwal, Neural Networks and Deep Learning - A Textbook. Springer, 2018.

[92] L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 12, pp. 993–1001, 1990.

[93] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and active learning,” in Ad-
vances in Neural Information Processing Systems, vol. 7, pp. 231–238, 1995.

[94] C. C. Aggarwal, Machine Learning for Text. Springer Publishing Company, Incorporated, 1st ed.,
2018.

[95] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten-
hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research,
vol. 12, pp. 2825–2830, 2011.

[96] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing & Man-
agement, vol. 39, pp. 45–65, January 2003.

[97] W. B. Cavnar and J. M. Trenkle, “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd
Annual Symposium on Document Analysis and Information Retrieval, (Las Vegas, US), pp. 161–175,
1994.

[98] B. A. Cipra, “An introduction to the ising model,” The American Mathematical Monthly, vol. 94, no. 10, pp. 937–959, 1987.

[99] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recog-
nition,” in Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.

[100] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural
networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[101] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European
conference on computer vision, pp. 818–833, Springer, 2014.

[102] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Ra-
binovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 1–9, 2015.

[103] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition.,”
in ICLR (Y. Bengio and Y. LeCun, eds.), 2015.

[104] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition.,” in CVPR,
pp. 770–778, IEEE Computer Society, 2016.

[105] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy
gradient.,” in AAAI (S. P. Singh and S. Markovitch, eds.), pp. 2852–2858, AAAI Press, 2017.

[106] S. M. Kakade and J. D. Lee, “Provably correct automatic sub-differentiation for qualified programs.,”
in NeurIPS (S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,
eds.), pp. 7125–7135, 2018.

[107] J. Nocedal and S. Wright, Numerical optimization. Springer Science & Business Media, 2006.

[108] K. Deb, Multi-Objective Optimization using Evolutionary Algorithms. John Wiley & Sons, 2001.

[109] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

[110] T. Y. Lim, “Structured population genetic algorithms: a literature survey.,” Artif. Intell. Rev., vol. 41,
no. 3, pp. 385–399, 2014.
