
machine learning - What does the hidden layer in a neural network compute? - Cross Validated (5/16/2018)

Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.

What does the hidden layer in a neural network compute?

I'm sure many people will respond with links to 'let me google that for you', so I want to say up front that I've tried to figure this out; please forgive my lack of understanding here, but I cannot figure out how the practical implementation of a neural network actually works.

I understand the input layer and how to normalize the data, and I also understand the bias unit, but when it comes to the hidden layer, what the actual computation in that layer is, and how it maps to the output, is just a little foggy. I've seen diagrams with question marks in the hidden layer, boolean functions like AND/OR/XOR, activation functions, input nodes that map to all of the hidden units, and input nodes that map to only a few hidden units each, and so I just have a few questions on the practical aspect. Of course, a simple explanation of the entire neural network process, like you would explain to a child, would be awesome.

What computations are done in the hidden layer?

How are those computations mapped to the output layer?

How does the output layer work? De-normalizing the data from the hidden layer?

Why are some layers in the input layer connected to the hidden layer and some are not?

machine-learning neural-networks nonlinear-regression

asked Jul 2 '13 at 15:59 by JCab; edited Mar 3 '17 at 17:39 by David J. Harris

9 People around here are nice, I have never seen a “let me google that for you” answer but many surprisingly thorough and insightful answers to what seemed at first to be basic questions.
Unfortunately, I can't help you with yours but it seems quite relevant so I am happily voting it up. – Gala Jul 2 '13 at 16:07

3 Thanks for the comment and the vote Gael, I'm probably a bit jaded by the SO community as we all know how those folks can get :) Glad to see more of a spirit of collaboration over here as
opposed to trying to earn badges and points by editing/closing questions. – JCab Jul 2 '13 at 16:37

1 I am not an expert in neural networks specifically, although I do get involved in their applications and methods. My maybe-not-so-helpful answer would be that the specific computations in the hidden layer depend on the 'cost function' that you are imposing on your output, i.e., what you try to achieve. For example, if you want to group the input elements into clustered sets, you will compute distances between elements in the hidden layer. This may go through various iterations and optimization cycles within this layer, until you meet an error criterion that allows the process to 'leave' this layer. – Lucozade Jul 2 '13 at 17:50

4 Answers

Three sentence version:

Each layer can apply any function you want to the previous layer (usually a linear transformation
followed by a squashing nonlinearity).
The hidden layers' job is to transform the inputs into something that the output layer can use.
The output layer transforms the hidden layer activations into whatever scale you wanted your
output to be on.
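As a rough sketch of that three-sentence version (the weights, biases, and inputs below are arbitrary illustrative numbers, not from any real network), each layer is just "linear transformation, then squashing nonlinearity":

```python
import math

def sigmoid(z):
    # Squashing nonlinearity: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs):
    # Each output unit: a linear transformation of the previous
    # layer's values, followed by the squash.
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [0.5, -1.0]                                       # input layer
h = layer([[1.0, 2.0], [-1.5, 0.5]], [0.0, 0.1], x)   # hidden activations
y = layer([[2.0, -1.0]], [0.0], h)                    # output layer
```

The hidden layer transforms `x` into `h`; the output layer then maps `h` onto whatever scale the target lives on.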

Like you're 5:

If you want a computer to tell you if there's a bus in a picture, the computer might have an easier time if
it had the right tools.

So your bus detector might be made of a wheel detector (to help tell you it's a vehicle) and a box
detector (since the bus is shaped like a big box) and a size detector (to tell you it's too big to be a car).
These are the three elements of your hidden layer: they're not part of the raw image, they're tools you
designed to help you identify busses.

If all three of those detectors turn on (or perhaps if they're especially active), then there's a good
chance you have a bus in front of you.

Neural nets are useful because there are good tools (like backpropagation) for building lots of detectors
and putting them together.

Like you're an adult

A feed-forward neural network applies a series of functions to the data. The exact functions will depend
on the neural network you're using: most frequently, these functions each compute a linear
transformation of the previous layer, followed by a squashing nonlinearity. Sometimes the functions will
do something else (like computing logical functions in your examples, or averaging over adjacent pixels
in an image). So the roles of the different layers could depend on what functions are being computed,
but I'll try to be very general.

Let's call the input vector x, the hidden layer activations h, and the output activation y. You have some
function f that maps from x to h and another function g that maps from h to y.

So the hidden layer's activation is f(x) and the output of the network is g(f(x)).

Why have two functions (f and g) instead of just one?

If the level of complexity per function is limited, then g(f(x)) can compute things that f and g can't do individually.

An example with logical functions:

For example, if we only allow f and g to be simple logical operators like "AND", "OR", and "NAND",
then you can't compute other functions like "XOR" with just one of them. On the other hand, we could
compute "XOR" if we were willing to layer these functions on top of each other:

First layer functions:

Make sure that at least one element is "TRUE" (using OR)


Make sure that they're not all "TRUE" (using NAND)

Second layer function:

Make sure that both of the first-layer criteria are satisfied (using AND)

The network's output is just the result of this second function. The first layer transforms the inputs into
something that the second layer can use so that the whole network can perform XOR.
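The layered OR/NAND/AND construction above can be written out directly, with each Python function standing in for one gate:

```python
def OR(a, b):
    return a or b

def NAND(a, b):
    return not (a and b)

def AND(a, b):
    return a and b

def xor(a, b):
    # First layer: two simple gates applied to the same inputs.
    at_least_one = OR(a, b)      # at least one input is TRUE
    not_both = NAND(a, b)        # the inputs are not both TRUE
    # Second layer: combine the first-layer outputs.
    return AND(at_least_one, not_both)
```

No single gate from the allowed set computes XOR, but this two-layer composition does, which is exactly the point of the example.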

An example with images:

Slide 61 from this talk--also available here as a single image--shows (one way to visualize) what the
different hidden layers in a particular neural network are looking for.

The first layer looks for short pieces of edges in the image: these are very easy to find from raw pixel
data, but they're not very useful by themselves for telling you if you're looking at a face or a bus or an
elephant.

The next layer composes the edges: if the edges from the bottom hidden layer fit together in a certain
way, then one of the eye detectors in the middle of the left-most column might turn on. It would be hard to
make a single layer that was so good at finding something so specific from the raw pixels: eye
detectors are much easier to build out of edge detectors than out of raw pixels.

The next layer up composes the eye detectors and the nose detectors into faces. In other words, these
will light up when the eye detectors and nose detectors from the previous layer turn on with the right
patterns. These are very good at looking for particular kinds of faces: if one or more of them lights up,
then your output layer should report that a face is present.

This is useful because face detectors are easy to build out of eye detectors and nose detectors,
but really hard to build out of pixel intensities.

So each layer gets you farther and farther from the raw pixels and closer to your ultimate goal (e.g.
face detection or bus detection).

Answers to assorted other questions

"Why are some layers in the input layer connected to the hidden layer and some are not?"

The disconnected nodes in the network are called "bias" nodes. There's a really nice explanation here.
The short answer is that they're like intercept terms in regression.

"Where do the "eye detector" pictures in the image example come from?"

I haven't double-checked the specific images I linked to, but in general, these visualizations show the
set of pixels in the input layer that maximize the activity of the corresponding neuron. So if we think of
the neuron as an eye detector, this is the image that the neuron considers to be most eye-like. Folks
usually find these pixel sets with an optimization (hill-climbing) procedure.

In this paper by some Google folks with one of the world's largest neural nets, they show a "face
detector" neuron and a "cat detector" neuron this way, as well as a second way: They also show the
actual images that activate the neuron most strongly (figure 3, figure 16). The second approach is nice
because it shows how flexible and nonlinear the network is--these high-level "detectors" are sensitive
to all these images, even though they don't particularly look similar at the pixel level.

Let me know if anything here is unclear or if you have any more questions.

answered Jul 2 '13 at 18:41 by David J. Harris; edited Feb 17 at 21:02

2 maybe not appropriate to put the question here...about the part of "An example with images", I already know how to
visualize the first layer, but get lost when I visualize the second layer and the next, do you know how to? or point some code
links to me if you know that, thanks, David. – HaveF Jul 4 '13 at 3:52

2 So is there just one defined algorithm for every single node on a given layer and the weights are what make the outputs
different? Or can you program every node on the layer to be different? – JCab Jul 5 '13 at 17:52

3 @GeorgeMcDowd this gets at the key issue: looking at pixels and identifying busses is hard, as you suggested. Fortunately,
looking at pixels and finding edges is easy--that's all the first hidden layer tries to do. The next layer tries to make inferences
based on edges, which is much easier than trying to do so based on pixels. – David J. Harris Jul 8 '13 at 16:27

4 SO should give you some other reward (than just points) for the time and effort you put into this answer! – user601 Jun 1 '14
at 7:09

2 I know we should avoid saying thanks in comments, but this reply can't pass without a "Thank you!" – MAS Feb 6 '15 at 6:20

I'll try to add to the intuitive operational description...

A good intuitive way to think about a neural network is to consider what a linear regression model attempts to do. A linear regression takes some inputs and builds a linear model that multiplies each input value by an optimal weighting coefficient and tries to map the sum of those results to an output response that closely matches the true output. The coefficients are determined by finding the values that minimize some error metric between the desired output value and the value learned by the model. Put another way, the linear model tries to learn a coefficient multiplier for each input and sum them all to determine the relationship between the (multiple) input values and the (typically single) output value. That same model can almost be thought of as the basic building block of a neural network: a single-unit perceptron.

But the single-unit perceptron has one more piece that processes the sum of the weighted data in a non-linear manner. It typically uses a squashing function (sigmoid, or tanh) to accomplish this. So you have the basic unit of the hidden layer: a block that sums a set of weighted inputs and then passes the summed response to a non-linear function to create a (hidden layer) output node response. The bias unit is, just as in linear regression, a constant offset added to each node to be processed. Because of the non-linear processing block, you are no longer limited to linear-only responses (as in the linear regression model).
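A minimal sketch of that single-unit perceptron (the numbers below are illustrative, not from any trained network):

```python
import math

def perceptron_unit(weights, bias, inputs):
    # Sum each input times its weight, plus the constant bias offset.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Squash the sum so the response is non-linear and lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(perceptron_unit([2.0, -1.0], 0.5, [1.0, 3.0]))  # one node's response
```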

Ok, but when you have many of the single perceptron units working together, each can have different input weight multipliers and different responses (even though ALL process the same set of inputs with the same non-linear block previously described). What makes the responses different is that each has different coefficient weights that are learned by the neural network via training (some forms use gradient descent). The results of all of the perceptrons are then processed again and passed to an output layer, just as the individual blocks were processed. The question then is: how are the correct weights determined for all of the blocks?

A common way to learn the correct weights is by starting with random weights and measuring the error
response between the true actual output and the learned model output. The error will typically get
passed backwards through the network and the feedback algorithm will individually increase or

decrease those weights by some proportion to the error. The network will repeatedly iterate by passing
forward, measuring the output response, then updating (passing backwards weight adjustments) and
correcting the weights until some satisfactory error level is reached. At that point you have a regression
model that can be more flexible than a linear regression model; it is what is commonly called a
universal function approximator.
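That forward-measure-update cycle can be sketched on a toy problem. This trains a single sigmoid unit on logical AND with plain gradient descent; it is a simplification of what the answer describes, since a real network also passes the error back through multiple layers:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: a single sigmoid unit is enough to learn logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

random.seed(0)
w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]  # random start
b = 0.0
lr = 1.0

def total_error():
    return sum((sigmoid(w[0]*x[0] + w[1]*x[1] + b) - t) ** 2
               for x, t in data)

before = total_error()
for _ in range(2000):                           # repeated iterations
    for x, t in data:
        y = sigmoid(w[0]*x[0] + w[1]*x[1] + b)  # forward: measure output
        grad = (y - t) * y * (1 - y)            # error signal fed backwards
        w[0] -= lr * grad * x[0]                # adjust each weight in
        w[1] -= lr * grad * x[1]                # proportion to the error
        b    -= lr * grad
after = total_error()
```

After training, `after` is much smaller than `before`: the error shrinks as the weights are corrected, exactly the loop the paragraph describes.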

One of the ways that really helped me to learn how a neural network truly operates is to study the code
of a neural network implementation and build it. One of the best basic code explanations can be found
in the neural network chapter of (the freely available) 'The Scientist and Engineer's Guide to DSP', Ch. 26. It is mostly written in a very basic language (I think it was Fortran) that really helps you see what is going on.

answered Jul 2 '13 at 22:04 by pat; edited Jul 2 '13 at 22:44

I'm going to describe my view of this in two steps: The input-to-hidden step and the hidden-to-output
step. I'll do the hidden-to-output step first because it seems less interesting (to me).

Hidden-to-Output

The output of the hidden layer could be different things, but for now let's suppose that they come out of
sigmoidal activation functions. So they are values between 0 and 1, and for many inputs they may just
be 0's and 1's.

I like to think of the transformation between these hidden neurons' outputs and the output layer as just
a translation (in the linguistic sense, not the geometric sense). This is certainly true if the
transformation is invertible, and if not then something was lost in translation. But you basically just have
the hidden neurons' outputs seen from a different perspective.

Input-to-Hidden

Let's say you have 3 input neurons (just so I can easily write some equations here) and some hidden
neurons. Each hidden neuron gets as input a weighted sum of inputs, so for example maybe

hidden_1 = 10 * (input_1) + 0 * (input_2) + 2 * (input_3)

This means that the value of hidden_1 is very sensitive to the value of input_1 , not at all sensitive to
input_2 and only slightly sensitive to input_3 .

So you could say that hidden_1 is capturing a particular aspect of the input, which you might call the
" input_1 is important" aspect.

The output from hidden_1 is usually formed by passing that weighted sum through some function, so let's say you are using a sigmoid function. This function takes on values between 0 and 1; so think of it as a switch which says that either input_1 is important or it isn't.

So that's what the hidden layer does! It extracts aspects, or features of the input space.

Now weights can be negative too! Which means that you can get aspects like " input_1 is important
BUT ALSO input_2 takes away that importance":

hidden_2 = 10 * (input_1) - 10 * (input_2) + 0 * (input_3)

or input_1 and input_3 have "shared" importance:

hidden_3 = 5 * (input_1) + 0 * (input_2) + 5 * (input_3)
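The three hypothetical hidden units above can be computed directly; passing each weighted sum through a sigmoid makes the "switch" behaviour visible (the input values here are made up for illustration):

```python
import math

def unit(weights, inputs):
    # One hidden unit: weighted sum of the inputs, then a sigmoid "switch".
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.2, 0.5]                    # input_1, input_2, input_3

hidden_1 = unit([10, 0, 2], x)         # "input_1 is important"
hidden_2 = unit([10, -10, 0], x)       # input_2 takes importance away
hidden_3 = unit([5, 0, 5], x)          # input_1 and input_3 share importance

# When input_2 grows to match input_1, hidden_2's switch turns off:
hidden_2_cancelled = unit([10, -10, 0], [1.0, 1.0, 0.5])
```

With a small `input_2`, `hidden_2` saturates near 1; once `input_2` matches `input_1`, the two terms cancel and the unit drops back to 0.5, the sigmoid's "undecided" point.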

More Geometry

If you know some linear algebra, you can think geometrically in terms of projecting along certain
directions. In the example above, I projected along the input_1 direction.

Let's look at hidden_1 again, from above. Once the value at input_1 is big enough, the output of the
sigmoid activation function will just stay at 1, it won't get any bigger. In other words, more and more
input_1 will make no difference to the output. Similarly, if it moves in the opposite (i.e. negative)
direction, then after a point the output will be unaffected.

Ok, fine. But suppose we don't want sensitivity that extends to infinity in a certain direction, and we want the unit to be activated only for a certain range on a line. Meaning for very negative values there is no effect, and for very positive values there is no effect, but for values between, say, 5 and 16 you want it to wake up. This is where you would use a radial basis function for your activation function.
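A Gaussian is the usual choice of radial basis function; here is a sketch of that "wake up only in a range" behaviour, with the center and width picked (as an assumption, matching the answer's example) to cover roughly 5 to 16:

```python
import math

def rbf(x, center, width):
    # Gaussian radial basis function: near 1 only when x is close to
    # center, and near 0 for very negative and very positive x alike.
    return math.exp(-((x - center) / width) ** 2)

for x in (-100, 0, 10.5, 100):
    print(x, rbf(x, center=10.5, width=5.5))
```

Unlike the sigmoid, which stays saturated at 1 as the input keeps growing, this unit responds only inside its bump and goes quiet on both sides.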

Summary

The hidden layer extracts features of the input space, and the output layer translates them into the
desired context. There may be much more to it than this, what with multi-layer networks and such, but

this is what I understand so far.

EDIT: This page with its wonderful interactive graphs does a better job than my long and cumbersome
answer above could ever do: http://neuralnetworksanddeeplearning.com/chap4.html

answered Jul 3 '13 at 10:09 by Rohit Chatterjee; edited Sep 25 '14 at 8:18

Like the OP, I'm a bit confused about the hidden layer in neural networks. In your example, how does the NN algorithm find
the weights for the hidden_1, hidden_2, and hidden_3 neurons? And since hidden_1, hidden_2, and hidden_3 are derived
from the same input variables, wouldn't the weights converge to the same solution? – RobertF Jun 26 '16 at 20:55

Let us take the case of classification. What the output layer is trying to do is estimate the conditional probability that your sample belongs to a given class, i.e. how likely it is for that sample to belong to a given class. In geometrical terms, combining layers in a non-linear fashion via the threshold functions allows the neural networks to solve non-convex problems (speech recognition, object recognition, and so on), which are the most interesting ones. In other words, the output units are able to generate non-convex decision functions like those depicted here.
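This answer doesn't name a specific output function, but the standard choice for turning raw output-layer scores into conditional class probabilities is a softmax; a minimal sketch:

```python
import math

def softmax(scores):
    # Convert raw output-layer scores into conditional class
    # probabilities: each value lands in (0, 1) and they sum to 1.
    shifted = [s - max(scores) for s in scores]  # shift for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])  # largest score gets the most mass
```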

One can view the units in hidden layers as learning complex features from data that allow the output layer to better discern one class from another and to generate more accurate decision boundaries. For example, in the case of face recognition, units in the first layers learn edge-like features (detecting edges at given orientations and positions) and higher layers learn to combine those to become detectors for facial features like the nose, mouth or eyes. The weights of each hidden unit represent those features, and its output (assuming it is a sigmoid) represents the probability that that feature is present in your sample.

In general, the meaning of the outputs of the output and hidden layers depends on the problem you are trying to solve (regression, classification) and the loss function you employ (cross entropy, least squared errors, ...).

answered Jul 3 '13 at 7:49 by jpmuc

protected by Community ♦ Aug 8 '17 at 12:50


