
Chapter 7 (part 3)

Neural Networks
What are Neural Networks?
• An extremely simplified version of the brain
• Essentially a function approximator: it transforms inputs into outputs to the best of its ability

[Diagram: inputs → NN → outputs]
What are Neural Networks?
• For the most part, neural networks are useful in
situations where multiple regression is used.
• In a neural network
• the independent variables are called input cells
• the dependent variable is called an output cell
(depending on the network model there might be
more than one output).
• As in regression there is a certain number of
observations (say N).
• Each observation contains a value for each input cell
(or independent variable) and the output cell (or
dependent variable).
In an illustration of a neural network, each circle
is a cell of the network. There are three
different types of cells:

• Input cells are the cells in the first column of
each network. The first column of cells in the
network is called the Input Layer.
• The Output cell is the cell in the last column of
the network, and the last layer is the Output
Layer of the network.
• All other layers of the network are called
Hidden Layers.
• Cell 0 may be viewed as a special input cell,
analogous to the constant term in multiple
regression.
• Each other input cell in the Input Layer
corresponds to an independent variable.
• Each cell in the Output Layer represents a
dependent variable we want to predict.
• The arc in the network joining cells i and j has
an associated weight wij.
[Figure: a network with Input Layer cells 1-4, constant cell 0, Hidden Layer cells 5-7, and Output Layer cell 8.]
The figure above has 4 input variables and one hidden layer. With the
exception of cell 0, each cell in the network is connected by an arc to
each cell in the next network layer. There is an arc connecting cell 0
with each cell in a Hidden or Output Layer.
INPUT and OUTPUT VALUES
For any observation, each cell in the network
has an associated input value and output value.

For any observation, input cell i (other than
cell 0):
• has an input value equal to the value of input i for
that observation;
• has an output value equal to the value of input i for
that observation.
For example, if the first input equals 3, then
the input and output from cell 1 are both 3.
For cell 0, the input and output values always
equal 1.

Let INP(j) = input to cell j and OUT(j) = output
from cell j.

For now we suppress the dependence of INP(j)
and OUT(j) on the particular observation.

For any cell j not in the Input Layer,

INP(j) = Σ (over all i) wij (output from cell i)
The following equations illustrate the use of the
previous formula by computing the inputs to cells 5-7 of
the previous neural network figure. Suppose that for a
given observation input cell j takes the value Ij. Then:
INP(5) = w05(1) + w15I1 + w25I2 + w35I3 + w45I4
INP(6) = w06(1) + w16I1 + w26I2 + w36I3 + w46I4
INP(7) = w07(1) + w17I1 + w27I2 + w37I3 + w47I4
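To make this computation concrete, here is a minimal Python sketch of the weighted-sum formula for cells 5-7. All weight and input values below are made-up numbers for illustration only; the structure (cell 0 contributing the constant 1, and INP(j) being a weighted sum of the outputs of the previous layer) follows the formula above.

```python
# Hypothetical weights w[(i, j)] on the arcs from cells 0-4 into hidden cells 5-7.
w = {
    (0, 5): 0.10, (1, 5): 0.40, (2, 5): -0.30, (3, 5): 0.20, (4, 5): 0.50,
    (0, 6): -0.20, (1, 6): 0.10, (2, 6): 0.60, (3, 6): -0.40, (4, 6): 0.30,
    (0, 7): 0.05, (1, 7): -0.50, (2, 7): 0.20, (3, 7): 0.10, (4, 7): 0.40,
}

# One hypothetical observation: the values I1-I4 of the four inputs.
I = {1: 3.0, 2: 1.5, 3: -2.0, 4: 0.7}

# Input cells pass their values straight through; cell 0 always outputs 1.
out = {0: 1.0, **I}

# INP(j) = sum over all i of wij * (output from cell i)
INP = {j: sum(w[(i, j)] * out[i] for i in range(5)) for j in (5, 6, 7)}
print(INP)
```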
[Figure (repeated from above): Input Layer cells 1-4, constant cell 0, Hidden Layer cells 5-7, Output Layer cell 8.]
To determine the output from any cell not in
the input layer we need to use a transfer
function. We will later discuss the most
commonly used transfer functions, but for now
the transfer function f may stand for any
function. Then for any cell j not in the Input
Layer
OUT(j) = f(INP(j))

For our figure, this formula yields

OUT(5) = f(INP(5))
OUT(6) = f(INP(6))
OUT(7) = f(INP(7))
By using the formulas and the previous results, we
can now determine the input and output for our
output cell, cell 8.

INP(8) = w08(1) + w58OUT(5) + w68OUT(6) + w78OUT(7)

OUT(8) = f(INP(8))

For any observation, OUT(8) is our prediction for
the output cell.
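Continuing in the same spirit, here is a minimal sketch of the forward pass to the output cell. The hidden-cell inputs and the weights into cell 8 are made-up numbers, and a standard logistic sigmoid is used here only as a stand-in for the transfer function f discussed later.

```python
import math

def f(x):
    # Stand-in transfer function: a sigmoid mapping any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical inputs to the hidden cells 5-7 (e.g. computed as in the previous sketch).
INP = {5: 0.8, 6: -1.1, 7: 0.3}
OUT = {j: f(INP[j]) for j in INP}          # OUT(j) = f(INP(j))

# Hypothetical weights on the arcs into output cell 8; cell 0 contributes the constant 1.
w8 = {0: 0.2, 5: 0.7, 6: -0.4, 7: 0.9}
INP8 = w8[0] * 1.0 + sum(w8[j] * OUT[j] for j in (5, 6, 7))
OUT8 = f(INP8)                             # the network's prediction for this observation
print(OUT8)
```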
The complex math involved in neural networks is
used to compute weights that produce "good"
predictions.

More formally, let

OUTj(8) = output of cell 8 for observation j
Oj = actual value of cell 8 for observation j.

The real work in the neural network analysis is
to determine network weights wij that minimize

Σ (j = 1 to N) (OUTj(8) - Oj)²
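As a small illustration, the sum of squared errors can be computed as follows. The prediction and actual values are made up; only the form of the objective comes from the formula above.

```python
# OUT_j(8): hypothetical network predictions; O_j: hypothetical actual values.
predictions = [0.82, 0.11, 0.45, 0.93]
actuals     = [1.0,  0.0,  0.5,  1.0]

# Sum over j of (OUT_j(8) - O_j)^2 -- the quantity the network weights are chosen to minimize.
sse = sum((p - o) ** 2 for p, o in zip(predictions, actuals))
print(sse)
```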
SCALING DATA

For the computations in a neural net to be
tractable it is desirable to scale all data (inputs
and outputs) so that all values lie in one of
the following two intervals:
Interval 1: [0, 1]
Interval 2: [-1, +1]
If you want to scale your data to lie on [0, 1],
then any value x for Input i should be
transformed into

[x - (smallest value for Input i)] / (range for Input i)

Note: Range = Largest value for Input i -
Smallest value for Input i

and any value x for the output should be
transformed into

[x - (smallest value of output)] / (range for output)
If you want to scale your data to lie on [-1, +1],
then any value x for Input i should be
transformed into

2[x - (largest value for Input i)] / (range for Input i) + 1

If you want to scale your data to lie on [-1, +1],
then any value x for the output should be
transformed into

2[x - (largest value for output)] / (range for output) + 1
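A minimal sketch of both scaling formulas. The function names and example values are mine; the formulas follow the two transformations above.

```python
def scale_01(x, smallest, largest):
    # Scale x onto [0, 1]: (x - smallest) / range.
    return (x - smallest) / (largest - smallest)

def scale_pm1(x, smallest, largest):
    # Scale x onto [-1, +1]: 2 * (x - largest) / range + 1.
    return 2.0 * (x - largest) / (largest - smallest) + 1.0

print(scale_01(7.0, smallest=2.0, largest=12.0))    # 0.5
print(scale_pm1(7.0, smallest=2.0, largest=12.0))   # 0.0
print(scale_pm1(2.0, smallest=2.0, largest=12.0))   # -1.0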
THE SIGMOID TRANSFER FUNCTION
Vast experience with neural networks indicates
that the sigmoid transfer function usually
yields the best predictions.
If your data has been scaled to lie on [-1, +1],
the relevant sigmoid function is given by

f(input) = 2 / (1 + e^(-input)) - 1

Note that for this function an input near -∞
yields an output near -1 and an input near +∞
yields an output near +1.
If your data has been scaled to lie on [0, 1],
the relevant sigmoid function is given by

f(input) = e^(input) / (e^(input) + e^(-input))

Note that for this function an input near -∞
yields an output near 0 and an input near +∞
yields an output near 1.
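A minimal sketch of the two sigmoid functions, following the formulas above; the small numerical loop just confirms the limiting behaviour described in the notes.

```python
import math

def sigmoid_pm1(x):
    # Sigmoid for data scaled to [-1, +1]: 2 / (1 + e^(-x)) - 1.
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def sigmoid_01(x):
    # Sigmoid for data scaled to [0, 1]: e^x / (e^x + e^(-x)).
    return math.exp(x) / (math.exp(x) + math.exp(-x))

for x in (-10.0, 0.0, 10.0):
    # Large negative inputs give outputs near -1 and 0; large positive inputs give outputs near +1 and 1.
    print(x, round(sigmoid_pm1(x), 4), round(sigmoid_01(x), 4))
```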
REMARKS
1. The sigmoid function is often called the
squashing function because it "squashes" values
on the interval [-∞, +∞] to the unit interval [0, 1].
2. The slope of the sigmoid function for the [0, 1]
interval is given by

f'(Input) = f(Input)(1 - f(Input))

This implies that the sigmoid function is very
steep for intermediate values of the input and
very flat for extreme input values.
TESTING AND VALIDATION
When we fit a regression to data, we often use
80%-90% of the data to fit a regression
equation, and use the remaining data to
"validate" the equation.

The same technique is used in neural networks.


We begin by designating 80%-90% of our data
as the training or learning data set. Then we
"fit" a neural network to this data.
Suppose cell 8 is the output cell.

Let Oj(8) = the actual output for observation j
and
OUTj(8) = the value of the output cell for
observation j.

Let AVGOT(8) = average value of the output for
the training data.

Analogous to regression, define:

SST(Train) = Sum of Squares Total for the Training data
SST(Train) = Σ (Oj(8) - AVGOT(8))²

and

SSR(Train) = Σ (OUTj(8) - AVGOT(8))²

Define:

R2(Train) = SSR(Train)/SST(Train)
If the network is to be useful for forecasting,
the R2 computed from the Test portion of the
data should be close to R2(Train).
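A small sketch of the R2(Train) computation using the definitions above; the prediction and actual values are made up.

```python
# O_j(8): hypothetical actual outputs for the training data.
actuals     = [1.0, 0.0, 0.5, 1.0, 0.0]
# OUT_j(8): hypothetical network predictions for the same observations.
predictions = [0.9, 0.1, 0.45, 0.8, 0.2]

avgot = sum(actuals) / len(actuals)                     # AVGOT(8)
sst_train = sum((o - avgot) ** 2 for o in actuals)      # SST(Train)
ssr_train = sum((p - avgot) ** 2 for p in predictions)  # SSR(Train)
r2_train = ssr_train / sst_train                        # R2(Train)
print(r2_train)
```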
CONTINUOUS AND BINARY DATA
If your dependent variable assumes only
two values (say 0 and 1) we say we have
binary data.
In this case the usual procedure is to try to
train the network until as many outputs as
possible are either below 0.1 or above 0.9.
Then those observations with predictions
less than .1 are classified as 0 and those
observations with predictions larger than .9
are classified as 1.
If we do not have binary data we say that
the data is continuous.
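A minimal sketch of the 0.1/0.9 classification rule described above; the function name is mine.

```python
def classify_binary(prediction):
    # Classify a network output for binary data using the 0.1 / 0.9 cutoffs described above.
    if prediction < 0.1:
        return 0
    if prediction > 0.9:
        return 1
    return None   # in between: the network is not yet confident for this observation

print([classify_binary(p) for p in (0.03, 0.95, 0.50)])   # [0, 1, None]
```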
Examples of the Use of
Neural Networks
Example 1: The efficient market hypothesis of
financial markets states that the "past history"
of a stock's returns yields no information about
the future return of the stock. White (1988)
examines returns on IBM to see if the market
is efficient. He begins by running a multiple
regression where the dependent variable is the
next day's return on IBM stock and the five
independent variables are the return on IBM
during each of the last five days.
This regression yielded R2 = .0079, which is
consistent with the efficient market hypothesis.
White then ran a neural network (containing one
hidden layer) with the output cell corresponding to
the next day's return on IBM and 5 input cells
corresponding to the last five days' return on IBM.
This neural network yielded R2 = .179. This implies
that the past five days of IBM returns do contain
information that can be used to make predictions
about tomorrow's return on IBM!
According to the October 9, 1993 Economist,
Fidelity manages 2.6 billion dollars in assets using
neural nets. One of the neural net funds has beaten
the S&P 500 by 2-7% a quarter for more than
three years.
Example 2: Researchers at Carnegie Mellon
University have developed ALVINN
(Autonomous Land Vehicle In a Neural Network),
a neural network that can drive a car! It can tell
if cars are nearby and then slow down the car.
Within ten years a neural network may be
driving your car!
Example 3: In finance and accounting it is
important to accurately predict whether or not
a company will go bankrupt during the next year.
Altman (1968) developed a method (Altman's Z-
statistic) to predict whether or not a firm will
go bankrupt during the next year based on the
firm's financial ratios. This method used a
version of regression called discriminant
analysis. Neural networks using financial ratios
as input cells have outperformed Altman's Z.
Example 4: The September 22, 1993 New York
Times reported that Otis Elevator uses neural
networks to direct elevators. For example, if
elevator 1 is on floor 10 and going up, elevator 2
is on floor 6 going down, and elevator 3 is on
floor 2 and going up, which elevator should
answer a call to go down from floor 7?
Example 5: Many banks (Mellon and Chase are
two examples) and credit card companies use
neural networks to predict (on the basis of past
usage patterns) whether or not a credit card
transaction should be disallowed. AVCO
Financial used a neural net to determine
whether or not to lend people money. They
increased their loan volume by 25%.
Example 6: "Pen" computers and personal digital
assistants often use neural nets to read the
user's handwriting. The "inputs" to the network
are a binary representation of what the user
has written. For example, let a 1 = place where
we have written something and a 0 = a place
where we have not written something. An input
to the network might look like
0001111
0001000
0001110
0000010
0001110
The neural net must decide to classify this
input as an "s", a "5" or something else.
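As a rough sketch, the written character can be turned into a 0/1 input vector for the network as below; the flattening into 35 input cells is my illustration, not a detail from the slides.

```python
# The 5x7 bitmap from the slide, one string per row: 1 = ink, 0 = blank.
bitmap = [
    "0001111",
    "0001000",
    "0001110",
    "0000010",
    "0001110",
]

# Flatten the rows into one binary input vector, one value per input cell of the network.
inputs = [int(ch) for row in bitmap for ch in row]
print(len(inputs), inputs)
```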

LeCun tried to have a neural net "read"
handwritten zip code digits. 7291 digits were
used for training and 2007 for testing. Running
the neural net took three days on a Sun
workstation. The net correctly classified
99.86% of the Training data and 95.0% of the
Test data.
Why Neural Networks Can Beat
Regression: The XOR Example
The classical XOR data set can be used to
obtain a better understanding of how neural
networks work, and why they can pick up
patterns that regression often misses. The XOR
data set also illustrates the usefulness of a
hidden layer. The XOR data set contains two
inputs, one output, and four observations.
The data set is given below:

Observation   Input 1   Input 2   Output
     1           0         0        0
     2           1         0        1
     3           0         1        1
     4           1         1        0
We see that the output equals 1 if either input (but not both) is equal to 1.
If we try to use regression to predict the output from the two inputs, we obtain
the equation Predicted Output = .5. This equation yields an R2 = 0, which means that
linear multiple regression yields poor predictions indeed.
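This result can be checked directly; the least-squares fit below (numpy is my choice of tool) recovers the constant prediction .5 and R2 = 0.

```python
import numpy as np

# The XOR data: a column of 1s for the constant term, then Input 1 and Input 2.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
print(coef)                                   # approximately [0.5, 0, 0]
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(1.0 - ss_res / ss_tot)                  # R2 = 0
```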
Now let's use the neural network of the
following figure to predict the output.
[Figure: a network with input cells 1 and 2, constant cell 0, and a single output cell 3 (no hidden layer).]

We assume that for some θ the transfer function for the output cell is defined
by f(x) = 1 if x ≥ θ and f(x) = 0 if x < θ. Given this transfer function, it is natural to
ask whether there are any values for θ and the wij's that will enable the above
network to make the correct predictions for the data set in the table on the
previous slide.
From the figure in the previous slide we find

INP(3) = w03 + w13(Input 1) + w23(Input 2)

and

OUT(3) = 1 if INP(3) ≥ θ and OUT(3) = 0 if INP(3) < θ

This implies that this network will yield correct
predictions for each observation if and only if
the following four inequalities hold:

Observation 1: w03 < θ
Observation 2: w03 + w13 ≥ θ
Observation 3: w03 + w23 ≥ θ
Observation 4: w03 + w13 + w23 < θ
There are no values of w03, w13, w23, and θ that satisfy
all four of these inequalities.
To see this, note that together
Observation 3: w03 + w23 ≥ θ
Observation 4: w03 + w13 + w23 < θ
imply w13 < 0.
Adding w13 < 0 to Observation 1: w03 < θ
implies that w03 + w13 < θ, which contradicts
Observation 2: w03 + w13 ≥ θ.
Thus we have seen that there is no way the neural network
of this figure can correctly predict the output for each
observation.
Suppose we add a hidden layer with two nodes
to the previous figure's neural net. This yields
the neural network in the following figure.

[Figure: a network with input cells 1 and 2, constant cell 0, hidden cells 3 and 4, and output cell 5.]

We define the transfer function fi at cell i by fi(x) = 1 if x ≥ θi and fi(x) = 0 if x < θi.
If we choose θ3 = .4, θ4 = 1.2, θ5 = .5, w03 = w04 = w05 = 0, w13 = w14 = w23 = w24 = 1,
w35 = .6, w45 = -.2, then the neural network in the figure above will yield
correct predictions for each observation. We now verify that this is the case.
Observation 1:
Input 1 = Input 2 = 0, Output = 0

Cell 3 Input = 1(0) + 1(0) = 0
Cell 3 Output = 0 (0 < .4)
Cell 4 Input = 1(0) + 1(0) = 0
Cell 4 Output = 0 (0 < 1.2)
Cell 5 Input = .6(0) - .2(0) = 0
Cell 5 Output = 0 (0 < .5)
Observation 2:
Input 1 = 1, Input 2 = 0, Output = 1

Node 3 Input = 1(1) + 1(0) = 1
Node 3 Output = 1 (1 ≥ .4)
Node 4 Input = 1(1) + 1(0) = 1
Node 4 Output = 0 (1 < 1.2)
Node 5 Input = .6(1) - .2(0) = .6
Node 5 Output = 1 (.6 ≥ .5)
Observation 3:
Input 1 = 0, Input 2 = 1, Output = 1

Node 3 Input = 1(0) + 1(1) = 1
Node 3 Output = 1 (1 ≥ .4)
Node 4 Input = 1(0) + 1(1) = 1
Node 4 Output = 0 (1 < 1.2)
Node 5 Input = .6(1) - .2(0) = .6
Node 5 Output = 1 (.6 ≥ .5)
Observation 4:
Input 1 = Input 2 = 1, Output = 0

Node 3 Input = 1(1) + 1(1) = 2
Node 3 Output = 1 (2 ≥ .4)
Node 4 Input = 1(1) + 1(1) = 2
Node 4 Output = 1 (2 ≥ 1.2)
Node 5 Input = .6(1) - .2(1) = .4
Node 5 Output = 0 (.4 < .5)

We have found that the hidden layer enables us
to perfectly fit the XOR data set!
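The verification above can also be checked by running the small threshold network with the stated weights and thresholds; this sketch reproduces the four observation checks.

```python
def step(x, theta):
    # Threshold transfer function: 1 if x >= theta, else 0.
    return 1 if x >= theta else 0

# Thresholds and weights from the slides (cell 0 feeds every cell with weight 0, so it is omitted).
theta = {3: 0.4, 4: 1.2, 5: 0.5}
w = {(1, 3): 1.0, (2, 3): 1.0, (1, 4): 1.0, (2, 4): 1.0, (3, 5): 0.6, (4, 5): -0.2}

def predict(i1, i2):
    out3 = step(w[(1, 3)] * i1 + w[(2, 3)] * i2, theta[3])
    out4 = step(w[(1, 4)] * i1 + w[(2, 4)] * i2, theta[4])
    return step(w[(3, 5)] * out3 + w[(4, 5)] * out4, theta[5])

# The four XOR observations: the network reproduces the target output in every case.
for i1, i2, target in [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]:
    print(i1, i2, predict(i1, i2), target)
```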
REMARK

One might think that having more hidden layers
will lead to much more accurate predictions.
Vast experience shows, however, that more than one
hidden layer rarely yields a significant improvement
over a single hidden layer.
