
# 1. Feedforward Propagation

We first implement feedforward propagation for a neural network using a set of pre-trained weights.
We then implement the backpropagation algorithm to learn the parameters ourselves.
Throughout, the terms "weights" and "parameters" are used interchangeably.

## 1.1 Visualizing the data

Each training example is a 20 pixel by 20 pixel grayscale image of a handwritten digit. Each pixel is
represented by a floating point number indicating the grayscale intensity at that location. The
20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector. Each of these training
examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X
where every row is a training example for a handwritten digit image. The second part of the
training set is a 5000-dimensional vector y that contains labels for the training set.

## 1.2 Model Representation

Our neural network has 3 layers: an input layer, a hidden layer and an output layer. Recall
that the inputs are 20 x 20 grayscale images, "unrolled" into 400 input features that we feed
into the neural network, so the input layer has 400 neurons. The hidden layer has 25 neurons,
and the output layer has 10 neurons corresponding to the 10 digits (or classes) our model
predicts. The +1 in the above figure represents the bias term.
We have been provided with a set of already trained network parameters. These are stored in
ex4weights.mat and will be loaded into theta1 and theta2 followed by unrolling into a vector
nn_params. The parameters have dimensions that are sized for a neural network with 25
units in the second layer and 10 output units (corresponding to the 10 digit classes).
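
The unrolling step can be sketched as follows. This is a minimal illustration, not the author's loading code: the matrices are random stand-ins with the shapes described above (the real values come from ex4weights.mat), and the names theta1_back and theta2_back are hypothetical. It assumes column-major (order='F') unrolling, which matches the reshape calls used in the cost function later in this article.

```python
import numpy as np

# Layer sizes taken from the text: 400 inputs, 25 hidden units, 10 classes.
input_layer_size, hidden_layer_size, num_labels = 400, 25, 10

# Stand-in matrices with the same shapes as the trained weights
# (hypothetical values; the real ones come from ex4weights.mat).
rng = np.random.default_rng(0)
theta1 = rng.standard_normal((hidden_layer_size, input_layer_size + 1))  # 25 x 401
theta2 = rng.standard_normal((num_labels, hidden_layer_size + 1))        # 10 x 26

# Unroll both matrices column by column (order='F') into one flat vector.
nn_params = np.hstack((theta1.ravel(order='F'), theta2.ravel(order='F')))

# Roll the vector back into matrices using the same order.
split = hidden_layer_size * (input_layer_size + 1)
theta1_back = np.reshape(nn_params[:split],
                         (hidden_layer_size, input_layer_size + 1), order='F')
theta2_back = np.reshape(nn_params[split:],
                         (num_labels, hidden_layer_size + 1), order='F')

print(nn_params.shape)  # (10285,)
```

The only requirement is that unrolling and re-rolling use the same order; mixing 'F' and 'C' would silently scramble the weights.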

## 1.3 Feedforward and cost function

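First we implement the cost function, followed by its gradient (for which we use the
backpropagation algorithm). Recall that the cost function for the neural network with
regularization is

$$
J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\left(\left(h_\Theta(x^{(i)})\right)_k\right) - \left(1-y_k^{(i)}\right)\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\left[\sum_{j=1}^{25}\sum_{k=1}^{400}\left(\Theta^{(1)}_{j,k}\right)^2 + \sum_{j=1}^{10}\sum_{k=1}^{25}\left(\Theta^{(2)}_{j,k}\right)^2\right]
$$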

where $h_\Theta(x^{(i)})$ is computed as shown in Figure 2 and K = 10 is the total number of possible
labels. Note that $h_\Theta(x^{(i)}) = a^{(3)}$ is the vector of activations of the output units. Also, whereas the
original labels (in the variable y) were 1, 2, …, 10, for the purpose of training a neural network we
need to recode the labels as vectors containing only the values 0 or 1, such that

$$
y = \begin{bmatrix}1\\0\\0\\\vdots\\0\end{bmatrix},\ \begin{bmatrix}0\\1\\0\\\vdots\\0\end{bmatrix},\ \dots,\ \text{or}\ \begin{bmatrix}0\\0\\0\\\vdots\\1\end{bmatrix}
$$

This process is called one-hot encoding. We do this with the get_dummies function from the
pandas library.
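
To make the encoding concrete, here is a small sketch on made-up labels. The array y_demo and the name y_onehot are illustrative only; the real y has 5000 rows and contains all ten classes.

```python
import numpy as np
import pandas as pd

# Hypothetical label column laid out like y: an (m, 1) array of class ids in 1..10.
y_demo = np.array([[1], [3], [10], [3]])

# One row per example, one 0/1 column per class that appears in the labels.
y_onehot = pd.get_dummies(y_demo.flatten(), dtype=int)

print(y_onehot.columns.tolist())  # [1, 3, 10]
print(y_onehot.to_numpy())
```

Note that get_dummies only creates columns for classes that actually occur in the input. On the full training set every digit is present, so the result has 10 columns, but on a small subset it may have fewer.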

Sigmoid function:

    import numpy as np

    def sigmoid(z):
        # Element-wise logistic function g(z) = 1 / (1 + e^(-z)).
        return 1 / (1 + np.exp(-z))

Cost function:

    import pandas as pd

    def nnCostFunc(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda):
        # Roll the flat parameter vector back into the two weight matrices.
        theta1 = np.reshape(nn_params[:hidden_layer_size*(input_layer_size+1)],
                            (hidden_layer_size, input_layer_size+1), order='F')
        theta2 = np.reshape(nn_params[hidden_layer_size*(input_layer_size+1):],
                            (num_labels, hidden_layer_size+1), order='F')

        # Forward propagation; a column of ones is prepended as the bias unit.
        m = len(y)
        ones = np.ones((m, 1))
        a1 = np.hstack((ones, X))
        a2 = sigmoid(a1 @ theta1.T)
        a2 = np.hstack((ones, a2))
        h = sigmoid(a2 @ theta2.T)

        # One-hot encode the labels as a float array of shape (m, num_labels).
        y_d = pd.get_dummies(y.flatten(), dtype=float).to_numpy()

        # Unregularized cross-entropy terms.
        temp1 = np.multiply(y_d, np.log(h))
        temp2 = np.multiply(1 - y_d, np.log(1 - h))
        temp3 = np.sum(temp1 + temp2)

        # Regularization excludes the bias column (column 0) of each matrix.
        sum1 = np.sum(np.power(theta1[:, 1:], 2))
        sum2 = np.sum(np.power(theta2[:, 1:], 2))

        return -temp3 / m + (sum1 + sum2) * lmbda / (2 * m)

# 2. Backpropagation

In this part of the exercise, you will implement the backpropagation algorithm to compute the
gradients for the neural network. Once you have computed the gradient, you will be able to
train the neural network by minimizing the cost function using an advanced optimizer such as
fmincg.

## 2.1 Sigmoid gradient

We first implement the sigmoid gradient function. The gradient of the sigmoid function
can be computed as

$$
g'(z) = \frac{d}{dz}\,g(z) = g(z)\,\bigl(1 - g(z)\bigr), \qquad g(z) = \frac{1}{1 + e^{-z}}
$$
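
A minimal sketch of such a function, assuming the standard identity that the derivative of the sigmoid g equals g(z) times (1 - g(z)). The name sigmoid_grad is illustrative and not taken from the text.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Hypothetical helper name; computes g(z) * (1 - g(z)) element-wise.
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_grad(np.array([0.0])))  # [0.25]
```

The gradient peaks at z = 0, where the sigmoid equals 0.5, and decays towards zero for large positive or negative z.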

## 2.2 Random initialization

When training neural networks, it is important to randomly initialize the parameters for
symmetry breaking. Here we randomly initialize parameters named initial_theta1 and
initial_theta2, corresponding to the hidden layer and the output layer, and unroll them into a
single vector as we did earlier.

    # Layer sizes described in section 1.2.
    input_layer_size = 400
    hidden_layer_size = 25
    num_labels = 10

    def randInitializeWeights(L_in, L_out):
        # Draw weights uniformly from [-epsilon, epsilon); the extra column is for the bias term.
        epsilon = 0.12
        return np.random.rand(L_out, L_in + 1) * 2 * epsilon - epsilon

    initial_theta1 = randInitializeWeights(input_layer_size, hidden_layer_size)
    initial_theta2 = randInitializeWeights(hidden_layer_size, num_labels)

    # unrolling parameters into a single column vector
    nn_initial_params = np.hstack((initial_theta1.ravel(order='F'), initial_theta2.ravel(order='F')))

## 2.3 Backpropagation

Backpropagation is not such a complicated algorithm once you get the hang of it.
I strongly urge you to watch Andrew Ng's videos on backprop multiple times.

In summary, we loop through every training example and perform steps 1 to 4; steps 5 and 6 are applied once, after the loop:

1. Run forward propagation to get the output activation a3.
2. Calculate the error term d3 by subtracting the actual output (the one-hot label) from our
calculated output a3.
3. For the hidden layer, calculate the error term d2 (that is, $\delta^{(2)}$) from d3 ($\delta^{(3)}$), where $\odot$ is the element-wise product and $g'$ is the sigmoid gradient:

$$
\delta^{(2)} = \left(\Theta^{(2)}\right)^{T}\delta^{(3)} \odot g'\!\left(z^{(2)}\right)
$$

4. Accumulate the gradients in delta1 and delta2.
5. Obtain the gradients for the neural network by dividing the accumulated gradients (of step 4)
by m.
6. Add the regularization terms to the gradients.

    def nnGrad(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, lmbda):
        # Roll the flat parameter vector back into the two weight matrices.
        initial_theta1 = np.reshape(nn_params[:hidden_layer_size*(input_layer_size+1)],
                                    (hidden_layer_size, input_layer_size+1), order='F')
        initial_theta2 = np.reshape(nn_params[hidden_layer_size*(input_layer_size+1):],
                                    (num_labels, hidden_layer_size+1), order='F')
        # One-hot labels as a float array of shape (m, num_labels).
        y_d = pd.get_dummies(y.flatten(), dtype=float).to_numpy()
        delta1 = np.zeros(initial_theta1.shape)
        delta2 = np.zeros(initial_theta2.shape)
        m = len(y)

        for i in range(X.shape[0]):
            # Step 1: forward propagation for example i.
            ones = np.ones(1)
            a1 = np.hstack((ones, X[i]))
            z2 = a1 @ initial_theta1.T
            a2 = np.hstack((ones, sigmoid(z2)))
            z3 = a2 @ initial_theta2.T
            a3 = sigmoid(z3)

            # Step 2: output error, kept as a (1, num_labels) row vector.
            d3 = (a3 - y_d[i])[np.newaxis, :]

            # Step 3: hidden-layer error; the second factor is the sigmoid gradient g'(z2).
            z2 = np.hstack((ones, z2))
            g_prime = sigmoid(z2) * (1 - sigmoid(z2))
            d2 = np.multiply(initial_theta2.T @ d3.T, g_prime[:, np.newaxis])

            # Step 4: accumulate, dropping the bias row of d2.
            delta1 = delta1 + d2[1:, :] @ a1[np.newaxis, :]
            delta2 = delta2 + d3.T @ a2[np.newaxis, :]

        # Step 5: average over the m examples.
        delta1 /= m
        delta2 /= m

        # Step 6: regularize every column except the bias column.
        delta1[:, 1:] = delta1[:, 1:] + initial_theta1[:, 1:] * lmbda / m
        delta2[:, 1:] = delta2[:, 1:] + initial_theta2[:, 1:] * lmbda / m

        return np.hstack((delta1.ravel(order='F'), delta2.ravel(order='F')))

Incidentally, the for-loop in the code above can be eliminated with a fully vectorized
implementation. For those who are new to backprop, though, the loop version is fine and makes
each step easier to follow. Running the above function with the initial parameters gives
nn_backprop_Params, which we will use when performing gradient checking.
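
As a rough sketch of what such a vectorized version might look like (an illustration under stated assumptions, not the author's code): the function name nn_grad_vectorized is hypothetical, it takes the weight matrices already reshaped and the labels already one-hot encoded as Y, and it returns the two gradient matrices without unrolling them. The demo data at the bottom is random and much smaller than the real network.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def nn_grad_vectorized(theta1, theta2, X, Y, lmbda):
    """Hypothetical loop-free gradient; Y is the (m, K) one-hot label matrix."""
    m = X.shape[0]
    ones = np.ones((m, 1))

    # Forward pass for all examples at once.
    a1 = np.hstack((ones, X))                          # (m, n+1)
    a2 = np.hstack((ones, sigmoid(a1 @ theta1.T)))     # (m, h+1)
    a3 = sigmoid(a2 @ theta2.T)                        # (m, K)

    # Error terms; g'(z2) is written as a2 * (1 - a2) on the non-bias units.
    d3 = a3 - Y                                        # (m, K)
    d2 = (d3 @ theta2[:, 1:]) * a2[:, 1:] * (1 - a2[:, 1:])  # (m, h)

    # The sum over examples becomes a matrix product.
    grad1 = d2.T @ a1 / m                              # (h, n+1)
    grad2 = d3.T @ a2 / m                              # (K, h+1)

    # Regularize everything except the bias column.
    grad1[:, 1:] += lmbda / m * theta1[:, 1:]
    grad2[:, 1:] += lmbda / m * theta2[:, 1:]
    return grad1, grad2

# Small random problem: 5 examples, 4 features, 6 hidden units, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.random((5, 4))
Y_demo = np.eye(3)[rng.integers(0, 3, 5)]
t1 = rng.standard_normal((6, 5)) * 0.1
t2 = rng.standard_normal((3, 7)) * 0.1
g1, g2 = nn_grad_vectorized(t1, t2, X_demo, Y_demo, lmbda=1.0)
print(g1.shape, g2.shape)  # (6, 5) (3, 7)
```

Each gradient has the same shape as the weight matrix it belongs to, so the two can be unrolled with ravel(order='F') exactly as in the loop version.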

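## 2.4 Gradient checking

Why do we need gradient checking? To make sure that our backprop implementation has no bugs
and works as intended. We can approximate the derivative of our cost function with

$$
\frac{\partial}{\partial\theta_i} J(\theta) \approx \frac{J(\theta + \epsilon\, e_i) - J(\theta - \epsilon\, e_i)}{2\epsilon}
$$

where $e_i$ is the unit vector for component i and $\epsilon$ is a small step (0.0001 in the code that follows).
The gradients computed using backprop and the numerical approximation should agree to at
least 4 significant digits, which gives us confidence that the backprop implementation is bug free.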

    def checkGradient(nn_initial_params, nn_backprop_Params, input_layer_size,
                      hidden_layer_size, num_labels, myX, myy, mylambda=0.):
        myeps = 0.0001
        flattened = nn_initial_params
        flattenedDs = nn_backprop_Params
        n_elems = len(flattened)
        # Pick ten random elements, compute numerical gradient, compare to respective D's
        for i in range(10):
            x = int(np.random.rand() * n_elems)
            # Perturb only element x of the parameter vector.
            epsvec = np.zeros((n_elems, 1))
            epsvec[x] = myeps
            cost_high = nnCostFunc(flattened + epsvec.flatten(), input_layer_size, hidden_layer_size,
                                   num_labels, myX, myy, mylambda)
            cost_low = nnCostFunc(flattened - epsvec.flatten(), input_layer_size, hidden_layer_size,
                                  num_labels, myX, myy, mylambda)
            mygrad = (cost_high - cost_low) / float(2 * myeps)
            print("Element: {0}. Numerical Gradient = {1:.9f}. BackProp Gradient = {2:.9f}.".format(x, mygrad, flattenedDs[x]))
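
The same central-difference idea can be tried on a toy cost whose gradient is known exactly. The sketch below is illustrative only: numerical_grad and toy_cost are hypothetical names, and the quadratic cost stands in for the network cost so that the check runs instantly.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Central-difference estimate of the gradient of f at theta (illustrative helper)."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        # Perturb one component at a time, as in the check above.
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Toy cost with a known gradient: J(theta) = sum(theta^2), so dJ/dtheta = 2 * theta.
def toy_cost(theta):
    return np.sum(theta ** 2)

theta = np.array([0.5, -1.0, 2.0])
analytic = 2 * theta
numeric = numerical_grad(toy_cost, theta)
print(np.max(np.abs(analytic - numeric)))  # largest disagreement between the two
```

Checking every component is affordable here because the vector is tiny. For the full network, sampling a handful of components (as the function above does) keeps the number of cost evaluations manageable.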

## 2.5 Learning parameters using fmincg

After you have successfully implemented the neural network cost function and gradient
computation, the next step is to use fmincg (here, SciPy's fmin_cg) to learn a good set of
parameters for the neural network. theta_opt contains the unrolled parameters that we just
learned, which we roll back into theta1_opt and theta2_opt.

    import scipy.optimize as opt

    # Minimize the cost with conjugate gradient, using nnGrad for the gradient.
    theta_opt = opt.fmin_cg(maxiter=50, f=nnCostFunc, x0=nn_initial_params, fprime=nnGrad,
                            args=(input_layer_size, hidden_layer_size, num_labels, X, y.flatten(), lmbda))

    # Roll the optimized vector back into the two weight matrices.
    theta1_opt = np.reshape(theta_opt[:hidden_layer_size*(input_layer_size+1)],
                            (hidden_layer_size, input_layer_size+1), order='F')
    theta2_opt = np.reshape(theta_opt[hidden_layer_size*(input_layer_size+1):],
                            (num_labels, hidden_layer_size+1), order='F')
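
To see how fmin_cg passes the extra arguments to both the cost and the gradient function, here is a small sketch on a toy quadratic objective. The objective, the target vector and the names f, fprime and w_opt are assumptions made for illustration, not the network cost.

```python
import numpy as np
import scipy.optimize as opt

# Toy objective (not the network cost): f(w) = ||w - target||^2.
target = np.array([1.0, -2.0, 3.0])

def f(w, target):
    return np.sum((w - target) ** 2)

def fprime(w, target):
    # Analytic gradient of f with respect to w.
    return 2 * (w - target)

# args is forwarded to both f and fprime after the parameter vector.
w_opt = opt.fmin_cg(f=f, x0=np.zeros(3), fprime=fprime,
                    args=(target,), maxiter=50, disp=False)
print(np.round(w_opt, 4))
```

The call has the same shape as the one used for the network: a flat parameter vector as x0, and the remaining inputs bundled into args.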

## 2.6 Prediction using learned parameters

It's time to see how well our newly learned parameters perform by calculating the
accuracy of the model. Recall that with a linear classifier such as logistic regression
we got an accuracy of 95.08%. The neural network should give us better accuracy.

    def predict(theta1, theta2, X, y):
        m = len(y)
        ones = np.ones((m, 1))
        a1 = np.hstack((ones, X))
        a2 = sigmoid(a1 @ theta1.T)
        a2 = np.hstack((ones, a2))
        h = sigmoid(a2 @ theta2.T)
        # Labels are 1-based (1..10), hence the +1 on the zero-based argmax.
        return np.argmax(h, axis=1) + 1

Comparing these predictions with y should give an accuracy of about 96.5% (this may vary by about 1% due to the random initialization).
Note that tweaking the hyperparameters can yield even better accuracy.
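
One way the accuracy might be computed from the predictions is sketched below. The helper name accuracy and the demo arrays are hypothetical; in the real run, the predictions would come from predict(theta1_opt, theta2_opt, X, y) and the labels from y.

```python
import numpy as np

def accuracy(pred, y):
    """Percentage of predictions equal to the labels (illustrative helper)."""
    return np.mean(pred == y.flatten()) * 100

# Hypothetical predictions and labels, shaped like predict's output and y.
pred_demo = np.array([1, 3, 10, 3, 7])
y_demo = np.array([[1], [3], [10], [4], [7]])
print(accuracy(pred_demo, y_demo))  # 80.0
```

Flattening y matters: comparing a length-m vector with an (m, 1) column would broadcast to an m by m matrix and give a misleading average.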