Deep Learning With Keras

DEEP LEARNING IN PYTHON
Introduction to
deep learning
Deep Learning in Python
Imagine you work for a bank

● You need to predict how many transactions each customer
will make next year
Example as seen by linear regression

Age
Bank Balance Number of

Transactions
Retirement Status
…
Example as seen by linear regression

Model with no interactions Model with interactions
Predicted Predicted
Transactions Not Retired Transactions Not Retired
Retired Retired
Bank Balance Bank Balance

Interactions
● Neural networks account for interactions really well
● Deep learning uses especially powerful neural networks
● Text
● Images
● Videos
● Audio
● Source code
Course structure
● First two chapters focus on conceptual knowledge
● Debug and tune deep learning models on conventional
prediction problems
● Lay the foundation for progressing towards modern
applications
● This will pay off in the third and fourth chapters
Build deep learning models with keras

In [1]: import numpy as np
In [2]: from keras.layers import Dense
In [3]: from keras.models import Sequential
In [4]: predictors = np.loadtxt('predictors_data.csv', delimiter=',')
In [5]: n_cols = predictors.shape[1]
In [6]: model = Sequential()
In [7]: model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
In [8]: model.add(Dense(100, activation='relu')
In [9]: model.add(Dense(1))
Deep learning models capture interactions

Age
Number of
Bank Balance Transactions
Retirement Status
…
Interactions in neural network

Input Layer
Hidden Layer
Age
Output Layer
Income
Number of
# Accounts Transactions
Let’s practice!
Forward
propagation
Course Title
Bank transactions example

● Make predictions based on:
● Number of children
● Number of existing accounts
Forward propagation
Input Hidden Layer
1
# Children 2 5
1 2
Output
9 # Transactions
-1
-1
3 1 1
# Accounts
Forward propagation
Input Hidden Layer
1
# Children 2 5
2
1 Output
9 # Transactions
-1
-1
3 1 1
# Accounts
Forward propagation
Input Hidden Layer
1
# Children 2 5
2
1 Output
9 # Transactions
-1
-1
3 1 1
# Accounts
Forward propagation
Input Hidden Layer
1
# Children 2 5
2
1 Output
9 # Transactions
-1
-1
3 1 1
# Accounts
Course Title
Forward propagation
● Multiply - add process
● Dot product
● Forward propagation for one data point at a time
● Output is the prediction for that data point
Forward propagation code

In [2]: input_data = np.array([2, 3])
In [3]: weights = {'node_0': np.array([1, 1]),

...: 'node_1': np.array([-1, 1]),
...: 'output': np.array([2, -1])}
In [4]: node_0_value = (input_data * weights['node_0']).sum()
In [5]: node_1_value = (input_data * weights['node_1']).sum()
Input Hidden Layer Output

1
2 5 2
1
3 -1
1 -1
1
Forward propagation code

In [6]: hidden_layer_values = np.array([node_0_value, node_1_value])
In [7]: print(hidden_layer_values)
[5, 1]
In [8]: output = (hidden_layer_values * weights['output']).sum()
In [9]: print(output)
9
Input Hidden Layer Output

1
2 5 2
1
9
3 -1
1 -1
1
Let’s practice!
Activation
functions
Linear vs Nonlinear Functions
Linear Functions Nonlinear Functions

Activation functions
● Applied to node inputs to produce node output
Improving our neural network

Input Hidden Layer
1
2 5
1 2
Output
9
-1
-1
3 1 1
Input Hidden Layer
1 tanh(2+3)
2
2
1 Output
9
-1
-1
3 1
tanh(-2+3)
ReLU (Rectified Linear Activation)

Rectifier
In [2]: input_data = np.array([-1, 2])
In [3]: weights = {'node_0': np.array([3, 3]),

...: 'node_1': np.array([1, 5]),
...: 'output': np.array([2, -1])}
In [4]: node_0_input = (input_data * weights['node_0']).sum()
In [5]: node_0_output = np.tanh(node_0_input)
In [6]: node_1_input = (input_data * weights['node_1']).sum()
In [7]: node_1_output = np.tanh(node_1_input)
In [8]: hidden_layer_outputs = np.array([node_0_output, node_1_output])
In [9]: output = (hidden_layer_output * weights['output']).sum()
In [10]: print(output)
1.2382242525694254
Let’s practice!
Deeper networks
Multiple hidden layers

2 -1 Age
3
-3
4 1
4
2 7
-5 2
55
Calculate with ReLU Activation Function


2 -1 Age
3
-3
4 1
4
2 7
-5 2
55


2 -1 Age
3
-3
4 1
4
2 7
-5 2
55


2 -1 Age
3
-3
4 1
4
2 7
-5 2
55


2 -1 Age
3
-3
4 1
4
2 7
-5 2
55


2 Age
3
55


2 Age
3
55


2 Age
3
55


2 Age
3 26
55


Age
3 26
-5
55


Age
3 26
-5
55


Age
3 26
-5
55


2 -1 Age
3 26
-3
4 1
4
2 7
-5 2
55 0


2 -1 Age
3 26 0
-3
4 1
364
4
2 7
-5 2
55 0 52

Representation learning
● Deep networks internally build representations of pa"erns in
the data
● Partially replace the need for feature engineering
● Subsequent layers build increasingly sophisticated
representations of raw data
Representation learning
Deep learning
● Modeler doesn’t need to specify the interactions
● When you train the model, the neural network gets weights
that find the relevant pa"erns to make be"er predictions
Let’s practice!
The need
for optimization
A baseline neural network

Input Hidden Layer
1
2 5
2
1 Output
9
-1
-1
3 1 1
● Actual Value of Target: 13

● Error: Actual - Predicted = 4
A baseline neural network

Input Hidden Layer
1
2 5
3
1 Output
13
-1
-2
3 1 1
● Actual Value of Target: 13

● Error: Actual - Predicted = 0
Predictions with multiple points

● Making accurate predictions gets harder with more points
● At any set of weights, there are many values of the error
● … corresponding to the many points we make predictions for
Loss function
● Aggregates errors in predictions from many data points into
single number
● Measure of model’s predictive performance
Squared error loss function

Prediction Actual Error Squared Error
10 20 -10 100
8 3 5 25
6 1 5 25
● Total Squared Error: 150

● Mean Squared Error: 50
Loss function
Loss function
We
ight1 h t 2
ei g
W
Loss function
● Lower loss function value means a be!er model
● Goal: Find the weights that give the lowest value for the loss
function
● Gradient descent
Gradient descent
● Imagine you are in a pitch dark field
● Want to find the lowest point
● Feel the ground to see how it slopes
● Take a small step downhill
● Repeat until it is uphill in every direction
Gradient descent steps

● Start at random point
● Until you are somewhere flat:
● Find the slope
● Take a step downhill
Optimizing a model with a single weight
Loss(w)
Minimum value
w
Let’s practice!
Gradient descent
Gradient descent
Loss(w)
w
Gradient descent
● If the slope is positive:
● Going opposite the slope means moving to lower
numbers
● Subtract the slope from the current value
● Too big a step might lead us astray
● Solution: learning rate
● Update each weight by subtracting
learning rate * slope
Slope calculation example

2
3 6 Actual Target Value = 10
● To calculate the slope for a weight, need to multiply:

● Slope of the loss function w.r.t value at the node we
feed into
● The value of the node that feeds into our weight
● Slope of the activation function w.r.t value we
feed into

2

feed into
feed into

2
● Slope of mean-squared loss function w.r.t prediction:

● 2 * (Predicted Value - Actual Value) = 2 * Error
● 2 * -4

2

feed into
feed into

2

feed into
feed into

2

feed into
feed into

2
● 2 * -4 * 3
● -24
● If learning rate is 0.01, the new weight would be
● 2 - 0.01(-24) = 2.24
Network with two inputs affecting prediction
3
1
4
Code to calculate slopes and update weights

In [2]: weights = np.array([1, 2])
In [3]: input_data = np.array([3, 4])
In [4]: target = 6
In [5]: learning_rate = 0.01
In [6]: preds = (weights * input_data).sum()
In [7]: error = preds - target
In [8]: print(error)
5
Code to calculate slopes and update weights

In [9]: gradient = 2 * input_data * error
In [10]: gradient
Out[10]: array([30, 40])
In [11]: weights_updated = weights - learning_rate * gradient
In [12]: preds_updated = (weights_updated * input_data).sum()
In [13]: error_updated = preds_updated - target
In [14]: print(error_updated)
-2.5
Let’s practice!
Backpropagation
Backpropagation
● Allows gradient descent to update all weights in neural network (by
ge!ing gradients for all weights)
● Comes from chain rule of calculus
● Important to understand the process, but you will generally use a
library that implements this
Prediction
Error
Backpropagation process
● Trying to estimate the slope of the loss function w.r.t each weight
● Do forward propagation to calculate predictions and errors
Backpropagation process ReLU Activation Function

Actual Target Value = 10
2 0
0.5
-2 1 1
4 2 2
-1 3
1
Backpropagation process ReLU Activation Function

Actual Target Value = 10
Error = 3
2 0
0.5 0 1
-2 1 1
7
4 2 2
-1 3
1 1 3
● Go back one layer at a time
● Gradients for weight is product of:
1. Node value feeding into that weight
2. Slope of loss function w.r.t node it feeds into
3. Slope of activation function at the node it feeds into
1
ReLU Activation Function
1
● Need to also keep track of the slopes of the loss function
w.r.t node values
● Slope of node values are the sum of the slopes for all
weights that come out of them
Let’s practice!
Backpropagation in
practice
Backpropagation
ReLU Activation Function
1 Actual Target Value = 10
1 Error = 3
7
2
● Top weight’s slope = 1 * 6

● Bo!om weight’s slope = 3 * 6
Backpropagation
0
0 6
1
1 3 18
Calculating slopes associated with any weight

● Gradients for weight is product of:
1. Node value feeding into that weight
2. Slope of activation function for the node being fed into
3. Slope of loss function w.r.t output node
Backpropagation
0 Current
Gradient
0 6 Weight Value
1 0 0
1 6
2
2 0
1 3 18
3 18
Backpropagation: Recap
● Start at some random set of weights
● Use forward propagation to make a prediction
● Use backward propagation to calculate the slope of
the loss function w.r.t each weight
● Multiply that slope by the learning rate, and subtract
from the current weights
● Keep going with that cycle until we get to a flat part
Stochastic gradient descent

● It is common to calculate slopes on only a subset of the
data (‘batch’)
● Use a different batch of data to calculate the next
update
● Start over from the beginning once all data is used
● Each time through the training data is called an epoch
● When slopes are calculated on one batch at a time:
stochastic gradient descent
Let’s practice
Creating a
keras model
Model building steps

● Specify Architecture
● Compile
● Fit
● Predict
Model specification
In [2]: from keras.layers import Dense
In [3]: from keras.models import Sequential
In [4]: predictors = np.loadtxt('predictors_data.csv', delimiter=',')
In [7]: model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
In [8]: model.add(Dense(100, activation='relu'))
Let’s practice!
Compiling and
fi!ing a model
Why you need to compile your model
● Specify the optimizer

● Many options and mathematically complex
● “Adam” is usually a good choice
● Loss function
● “mean_squared_error” common for regression
Compiling a model
In [3]: model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
In [6]: model.compile(optimizer='adam', loss='mean_squared_error')

What is fi!ing a model
● Applying backpropagation and gradient descent with your

data to update the weights
● Scaling data before fi!ing can ease optimization
Fi!ing a model
In [3]: model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
In [6]: model.compile(optimizer='adam', loss=‘mean_squared_error')
In [7]: model.fit(predictors, target)

Let’s practice!
Classification
models
Classification
● ‘categorical_crossentropy’ loss function
● Similar to log loss: Lower is be!er
● Add metrics = [‘accuracy’] to compile step for easy-to-
understand diagnostics
● Output layer has separate node for each possible outcome,
and uses ‘so"max’ activation
Quick look at the data

close_def_
shot_clock dribbles touch_time shot_dis shot_result
dis
10.8 2 1.9 7.7 1.3 1
3.4 0 0.8 28.2 6.1 0
0 3 2.7 10.1 0.9 0
10.3 2 1.9 17.2 3.4 0

Transforming to categorical
shot_result Outcome 0 Outcome 1
1 0 1
0 1 0
0 1 0
0 1 0
Classification
In[1]: from keras.utils import to_categorical
In[2]: data = pd.read_csv('basketball_shot_log.csv')
In[3]: predictors = data.drop(['shot_result'], axis=1).as_matrix()
In[4]: target = to_categorical(data.shot_result)
In[5]: model = Sequential()
In[6]: model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
In[7]: model.add(Dense(100, activation='relu'))
In[8]: model.add(Dense(100, activation='relu'))
In[9]: model.add(Dense(2, activation='softmax'))
In[10]: model.compile(optimizer='adam', loss='categorical_crossentropy',

...: metrics=['accuracy'])
In[11]: model.fit(predictors, target)

Classification
Out[11]:
Epoch 1/10
128069/128069 [==============================] - 4s - loss: 0.7706 - acc: 0.5759
Epoch 2/10
128069/128069 [==============================] - 5s - loss: 0.6656 - acc: 0.6003
Epoch 3/10
128069/128069 [==============================] - 6s - loss: 0.6611 - acc: 0.6094
Epoch 4/10
128069/128069 [==============================] - 7s - loss: 0.6584 - acc: 0.6106
Epoch 5/10
128069/128069 [==============================] - 7s - loss: 0.6561 - acc: 0.6150
Epoch 6/10
128069/128069 [==============================] - 9s - loss: 0.6553 - acc: 0.6158
Epoch 7/10
128069/128069 [==============================] - 9s - loss: 0.6543 - acc: 0.6162
Epoch 8/10
128069/128069 [==============================] - 9s - loss: 0.6538 - acc: 0.6158
Epoch 9/10
128069/128069 [==============================] - 10s - loss: 0.6535 - acc: 0.6157
Epoch 10/10
128069/128069 [==============================] - 10s - loss: 0.6531 - acc: 0.6166
Let’s practice!
Using models
Using models
● Save
● Reload
● Make predictions
Saving, reloading and using your Model

In [1]: from keras.models import load_model
In [2]: model.save('model_file.h5')
In [3]: my_model = load_model('my_model.h5')
In [4]: predictions = my_model.predict(data_to_predict_with)
In [5]: probability_true = predictions[:,1]

Verifying model structure

In [6]: my_model.summary()
Out[6]:
_____________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
=========================================================================================
dense_1 (Dense) (None, 100) 1100 dense_input_1[0][0]
_____________________________________________________________________________________________
dense_2 (Dense) (None, 100) 10100 dense_1[0][0]
_____________________________________________________________________________________________
_____________________________________________________________________________________________
=========================================================================================
Total params: 21,502
Trainable params: 21,502
Non-trainable params: 0
Let’s practice!
Understanding
model
optimization
Why optimization is hard

● Simultaneously optimizing 1000s of parameters with
complex relationships
● Updates may not improve model meaningfully
● Updates too small (if learning rate is low) or too large (if
learning rate is high)
Stochastic gradient descent

sgd.py
def get_new_model(input_shape = input_shape):

model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))
return(model)
lr_to_test = [.000001, 0.01, 1]
# loop over learning rates

for lr in lr_to_test:
model = get_new_model()
my_optimizer = SGD(lr=lr)
model.compile(optimizer = my_optimizer, loss = 'categorical_crossentropy')
model.fit(predictors, target)
The dying neuron problem

-3
1 0
Vanishing gradients
tanh function
Vanishing gradients
● Occurs when many layers have very small slopes (e.g. due to
being on flat part of tanh curve)
● In deep networks, updates to backprop were close to 0
Let’s practice!
Model validation
Validation in deep learning

● Commonly use validation split rather than cross-
validation
● Deep learning widely used on large datasets
● Single validation score is based on large amount of
data, and is reliable
● Repeated training from cross-validation would take
long time
Model validation
In [1]: model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics=['accuracy'])
In [2]: model.fit(predictors, target, validation_split=0.3)
Epoch 1/10
89648/89648 [==============================] - 3s - loss: 0.7552 - acc: 0.5775 - val_loss: 0.6969 -
val_acc: 0.5561
Epoch 2/10
val_acc: 0.6102
...
Epoch 8/10
val_acc: 0.6037
Epoch 9/10
val_acc: 0.6110
Epoch 10/10
val_acc: 0.6126
Early Stopping
In [3]: from keras.callbacks import EarlyStopping
In [4]: early_stopping_monitor = EarlyStopping(patience=2)
In [5]: model.fit(predictors, target, validation_split=0.3, epochs=20,

...: callbacks = [early_stopping_monitor])
Output from early stopping

Train on 89648 samples, validate on 38421 samples
Epoch 1/20
val_acc: 0.6151
Epoch 2/20
val_acc: 0.6154
...
Epoch 8/20
val_acc: 0.6160
Epoch 9/20
val_acc: 0.6172
Epoch 10/20
val_acc: 0.6134
Epoch 11/20
val_acc: 0.6169
Let’s practice!
Thinking
about model
capacity
Overfi!ing
Model Capacity
Workflow for optimizing model capacity

● Start with a small network
● Gradually increase capacity
● Keep increasing capacity until validation score is no longer
improving
Sequential experiments
Mean Squared
Hidden Layers Nodes Per Layer Next Step
Error
Increase
1 100 5.4
Capacity
Increase
1 250 4.8
Capacity
Increase
2 250 4.4
Capacity
Decrease
3 250 4.5
Capacity
3 200 4.3 Done

Let’s practice!
Stepping up
to images
Recognizing handwri!en digits

● MNIST dataset
● 28 x 28 grid fla!ened to 784 values for each image
● Value in each part of array denotes darkness of that pixel
Let’s practice!
Final thoughts
Next steps
● Start with standard prediction problems on tables of
numbers
● Images (with convolutional neural networks) are common
next steps
● keras.io for excellent documentation
● Graphical processing unit (GPU) provides dramatic
speedups in model training times
● Need a CUDA compatible GPU
● For training on using GPUs in the cloud look here:
h!p://bit.ly/2mYQXQb
Congratulations!

Deep Learning With Keras

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Deep Learning With Keras

Hochgeladen von

Copyright:

Verfügbare Formate

DEEP LEARNING IN PYTHON

Imagine you work for a bank

Example as seen by linear regression

Bank Balance Number of

Example as seen by linear regression

Bank Balance Bank Balance

Build deep learning models with keras

In [2]: from keras.layers import Dense

In [3]: from keras.models import Sequential

In [4]: predictors = np.loadtxt('predictors_data.csv', delimiter=',')

In [5]: n_cols = predictors.shape[1]

In [6]: model = Sequential()

In [7]: model.add(Dense(100, activation='relu', input_shape = (n_cols,)))

In [8]: model.add(Dense(100, activation='relu')

Deep learning models capture interactions

Interactions in neural network

Bank transactions example

Forward propagation code

In [2]: input_data = np.array([2, 3])

In [3]: weights = {'node_0': np.array([1, 1]),

In [4]: node_0_value = (input_data * weights['node_0']).sum()

In [5]: node_1_value = (input_data * weights['node_1']).sum()

Input Hidden Layer Output

Forward propagation code

In [8]: output = (hidden_layer_values * weights['output']).sum()

Input Hidden Layer Output

Linear vs Nonlinear Functions

Linear Functions Nonlinear Functions

Improving our neural network

ReLU (Rectified Linear Activation)

In [2]: input_data = np.array([-1, 2])

In [3]: weights = {'node_0': np.array([3, 3]),

In [4]: node_0_input = (input_data * weights['node_0']).sum()

In [5]: node_0_output = np.tanh(node_0_input)

In [6]: node_1_input = (input_data * weights['node_1']).sum()

In [7]: node_1_output = np.tanh(node_1_input)

In [8]: hidden_layer_outputs = np.array([node_0_output, node_1_output])

In [9]: output = (hidden_layer_output * weights['output']).sum()

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function

Multiple hidden layers

Calculate with ReLU Activation Function