GUIDE
For any further information about the SPAD software, training and consulting activities, please visit us at www.coheris.com or contact us by e-mail:

SPAD Software: info-spad@coheris.com
SPAD Hot line: support-spad@coheris.com
Training: formation-spad@coheris.com
Consulting: consulting-spad@coheris.com
Books: publication-spad@coheris.com
For further information about the COHERIS Group offer (CRM, BI, Data Mining, Data Quality Management, Merchandising, SFA), visit us at www.coheris.com
Table of contents
DESCRIPTIVE STATISTICS WITH SPAD ............ 5
This procedure supplies a rapid and automatic description of your nominal and
continuous variables.
The Survey.sba base is an opinion survey file, which will be used for this example. The file is
supplied with the application and installed automatically on your PC.
The Cases, Weighting and Parameters tabs are available for almost all SPAD methods.
- Cases: lets you select the cases used for the method
- Weighting: lets you adjust the distribution of the cases in the sample
- Parameters: options and settings of the method
To define a logical filter: click on Logical filter, click on the operator, then click on the operand; check the global definition of the filter, and click on Validate.
In case of error, you can delete an expression from the filter by selecting the expression to discard and clicking on Delete.
The cases satisfying the filter are considered as active, while the others are supplementary.
To select the individuals from a list: select List as the selection method and the status of the cases, then choose your cases in the Available list and use the transfer buttons to select them.
To select cases by interval, choose Interval as the selection method.
In the Weighting tab, select the weighting type, enter the theoretical percentage for each category and click on OK.
You can repeat this operation for another variable. In this way you obtain an adjustment as a function of several variables with a single weighting variable. This requires a calculation by successive approximations, as shown in the window below.

Attention: the weighting calculated in the Weighting tab of a method is temporary (the weighting variable is not saved). This approach lets you make quick tests and measure the influence of the weighting on the results of the method. When a satisfactory weighting variable has been obtained, it is preferable to create a permanent weighting variable with the Tools - Weighting command of the main menu (Data Management Manual, paragraph 4.3).
Then, in the Weighting tab of a method, select this variable as the weight variable.
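The adjustment by successive approximations described above can be sketched as follows. This is a minimal sketch of iterative proportional fitting on invented data; the category names, counts and theoretical percentages are assumptions for illustration, not SPAD's exact algorithm:

```python
# Hypothetical sketch of weighting by successive approximations (iterative
# proportional fitting). Cases, categories and theoretical percentages are
# invented for illustration; SPAD's exact algorithm may differ.
counts = {("male", "urban"): 30, ("male", "rural"): 10,
          ("female", "urban"): 40, ("female", "rural"): 20}
targets = {"sex": {"male": 50.0, "female": 50.0},    # theoretical percentages
           "area": {"urban": 60.0, "rural": 40.0}}

weights = {cell: float(n) for cell, n in counts.items()}
total = sum(weights.values())

for _ in range(50):                                   # successive approximations
    for dim, idx in (("sex", 0), ("area", 1)):
        margins = {}
        for cell, w in weights.items():
            margins[cell[idx]] = margins.get(cell[idx], 0.0) + w
        for cell in weights:
            target = targets[dim][cell[idx]] / 100.0 * total
            weights[cell] *= target / margins[cell[idx]]

male = sum(w for cell, w in weights.items() if cell[0] == "male")
print(round(100 * male / total, 2))                   # adjusted margin: 50.0
```

After a few iterations the weighted margins of both variables match their theoretical percentages simultaneously, which is exactly the behaviour the window above illustrates.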
The Marginal distributions tab
We select the categorical variables in the list below.
The Parameters button allows you to display or hide the categories without any respondent, and to display or hide the missing data as a new category.
The Statistics button displays summary statistics on the selected variables. For example,
select the Region where the respondent lives (V1), then click on the statistics button. A
window opens with statistics on the variable:
For the categorical variables, this statistics window shows the count and the percentage associated with each category. For the continuous variables, it shows the count, the mean, the standard deviation, and the minimum and maximum.
The Parameters button allows you to set global or specific parameters for the histogram characteristics, such as the number of classes, the min and max bounds, and the histogram bar width.
You can also select continuous variables for categorization. As a result, each distinct value
is displayed with its frequency.
It is a preliminary step before splitting the continuous variable into classes.
It is not allowed to do both histograms and categorization for the same variable.
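The categorization output described above can be sketched with made-up data; the ages below are invented for the example:

```python
from collections import Counter

# Illustrative sketch of categorization: list each distinct value of a
# continuous variable with its frequency, as a preliminary step before
# splitting the variable into classes. The ages are invented for the example.
ages = [25, 31, 25, 47, 31, 52, 25, 47, 31, 60]
for value, freq in sorted(Counter(ages).items()):
    print(value, freq)   # e.g. "25 3" means the value 25 occurs 3 times
```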
Once you have specified your request, then you validate the method by clicking on the
OK button.
RESULTS
Results are accessible in the Execution view or by right-clicking on the method and
choosing the Results command. Then, depending on the method, different choices are
available between the results editor, the Graphics gallery and Excel results.
The results editor
The Result Editor opens up in a new window.
[Results table: for each category, the count and the corresponding percentages (e.g. 155 cases = 49.21 %), with the complementary categories summing to 100 %.]
Characterizing elements: categories, categorical variables and continuous variables (both as elements to characterize and as characterizing elements).
In this example, the variable to characterize is V8, "The family is the only place where you feel well". All the other variables, whether categorical or continuous, are selected as characterizing elements.
Once you have set the parameters, then you validate the method by clicking on the OK
button and run the chain.
[Characterisation table of the answer "yes": for each characteristic category (married, every day, indissoluble, a lot, yes, primary school, retired people, unemployed person, student, etc.) of the variables Marital status, Do you watch TV, Opinion about marriage, Are you worried about the risk of a nuclear plant accident, Do you have children, Are you worried about the risk of a road accident, Educational level of the respondent, Current situation of the respondent, Are you worried about the risk of a mugging, and Do you think the society needs to change, the table gives the % of category in group, the % of category in set, the % of group in category, the test-value, the probability and the weight.]
% of category in group:
Frequency of the category in the group divided by the frequency of the group
% of category in set:
Frequency of the category in the whole population
% of group in category:
Frequency of the group in the category divided by the frequency of the category
Test-value:
When the test-value is greater than zero, the category is over-represented in the group; it is under-represented if the test-value is negative. By default, SPAD displays only the characterizing elements with a test-value greater than or equal to 1.96 (i.e. a probability of 0.025 for a one-sided test).
Probability:
The probability evaluates the scale of the difference between the percentage of the category in the group and the percentage of the category in the population. The lower the probability, the more significant the difference and the greater the test-value related to this probability (the test-value is the fractile of the normal law that corresponds to the same probability).
Weight:
Weight of the cases in the category
CHARACTERISATION BY CONTINUOUS VARIABLES

Category: yes (Weight = 230.00, Count = 230)

Characteristic variables                              Category   Overall    Category   Overall    Test-value  Probability
                                                      mean       mean       std. dev.  std. dev.
Age of respondent                                       46.100     43.756     16.752     16.581       4.12       0.000
Religion : importance given                              3.383      3.241      2.081      2.022       2.04       0.021
Relatives, brothers, sisters ... : importance given      5.726      5.629      1.380      1.436       1.98       0.024
Salary of the respondent                              4044.990   4408.550   3690.140   4575.340      -2.09       0.018

Category: no (Weight = 83.00, Count = 83)

Characteristic variables                              Category   Overall    Category   Overall    Test-value  Probability
                                                      mean       mean       std. dev.  std. dev.
Salary of the respondent                              5377.780   4408.550   6311.000   4575.340       2.10       0.018
Number of children                                       1.542      1.860      1.772      1.671      -2.02       0.022
Age of respondent                                       36.855     43.756     13.971     16.581      -4.41       0.000
Category mean:
Weighted mean of the variable in the category
Overall mean:
Weighted mean of the category in the overall population
Interpretation:
One can see that the Age of respondent is the most characterizing continuous variable of the group who answered "yes" to the question "The family is the only place where you feel well".
This group is significantly older than the average respondent, with an average age of 46 years compared to 43.75 years for the overall population.
If a variable appears in the Means column, each cell of the cross table will display the
weighted average corresponding to the cases of the cell.
If a variable appears in the Frequencies column, each cell of the cross table will display
the weighted sum of the values of the variable for the cases of the cell.
By clicking on Local filter, you can define a specific filter for each command.
The graph editor of the BIVAR method is the same as the one used for factorial analyses. Its capabilities will be described in the section Factorial analyses.
VOCABULARY
Active Variables
Supplementary variables
Contribution
Cosines
[Cars data table: for each car, the values of Cubic capacity, Power, Speed, Weight, Length and Width.]
The matrix plot, performed with the STATS method, gives a good overview of the pairwise relationships between variables.
Normed PCA means that all the active variables are centered and standardized beforehand by SPAD. As a consequence, all the variables contribute equally to the overall inertia.
When the PCA is not normed (only centered), the distance between a variable and the origin is equal to the variance of the variable.
Most of the time, it is advisable to perform a normed analysis, in order to assign the same importance to each active variable. It is particularly recommended when the measurement scales are different.
In our example, the measurement scales are strongly different, so we will perform a normed PCA.
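The "normed" preprocessing can be sketched as follows. The values come from the cars example above; the point is that each standardized variable carries the same inertia, namely 1:

```python
from statistics import mean, pstdev

# Sketch of the normed preprocessing: center and standardize each active
# variable, so that every variable has variance (inertia) 1 regardless of its
# original measurement scale. Values are taken from the cars example.
cubic_capacity = [1396, 1294, 1461, 1294, 1721, 1580, 1769, 2068, 1769, 1998, 1905, 1993]
speed = [174, 189, 181, 184, 180, 170, 180, 180, 182, 190, 194, 185]

def standardize(values):
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

for name, raw in (("Cubic capacity", cubic_capacity), ("Speed", speed)):
    z = standardize(raw)
    inertia = sum(v * v for v in z) / len(z)   # variance of the normed variable
    print(name, round(inertia, 6))             # 1.0 for both variables
```

With six active variables, the total inertia (the trace before diagonalisation) is therefore 6, as shown in the eigenvalues listing below.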
RETAINED COORDINATES
The number of retained coordinates is useful for the methods that follow the PCA in the
chain. These methods can be DEFAC (factors description) and RECIP/SEMIS (clustering).
The linear correlation coefficient measures the intensity of the relationship between two continuous variables. It ranges from -1 to +1: the closer the correlation coefficient is to +1 or -1, the more closely the two variables are related.
TEST-VALUES MATRIX

     |  CYLI   PUIS   VITE   POID   LONG   LARG
-----+------------------------------------------
CYLI | 99.99
PUIS |  6.35  99.99
VITE |  4.19   7.06  99.99
POID |  7.14   4.99   2.74  99.99
LONG |  6.42   4.14   2.90   6.40  99.99
LARG |  4.34   3.05   1.86   4.25   6.41  99.99
-----+------------------------------------------
     |  CYLI   PUIS   VITE   POID   LONG   LARG
This matrix is related to the previous one: SPAD translates the test of correlation in terms of test-values. The higher the test-value, the more closely related the two variables. A test-value lower than 2 can be interpreted as the absence of a linear relationship between the two variables.
EIGENVALUES

COMPUTATIONS PRECISION SUMMARY : TRACE BEFORE DIAGONALISATION.. 6.0000
                                 SUM OF EIGENVALUES............ 6.0000

HISTOGRAM OF THE FIRST 6 EIGENVALUES
+--------+------------+------------+------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |
|        |            |            | PERCENTAGE |
+--------+------------+------------+------------+
|    1   |   4.6173   |   76.96    |    76.96   |
|    2   |   0.8788   |   14.65    |    91.60   |
|    3   |   0.3035   |    5.06    |    96.66   |
|    4   |   0.1055   |    1.76    |    98.42   |
|    5   |   0.0732   |    1.22    |    99.64   |
|    6   |   0.0216   |    0.36    |   100.00   |
+--------+------------+------------+------------+
In the second column (Eigenvalue) above, we find the variance on the new factors that were successively extracted. In the third column, these values are expressed as a percentage of the total variance. As we can see, factor 1 accounts for 77 percent of the variance, factor 2 for 15 percent, and so on. As expected, the sum of the eigenvalues is equal to the number of variables. The fourth column contains the cumulative variance extracted. The variances extracted by the factors are called the eigenvalues; this name derives from the underlying computations.
Eigenvalues and the Number-of-Factors Problem
Now that we have a measure of how much variance each successive factor extracts, we can
return to the question of how many factors to retain. By its nature this is an arbitrary
decision. However, there are some guidelines that are commonly used, and that, in
practice, seem to yield the best results.
The Kaiser criterion. First, we can retain only factors with eigenvalues greater than 1. In
essence this is like saying that, unless a factor extracts at least as much as the equivalent of
one original variable, we drop it. This criterion was proposed by Kaiser (1960), and is
probably the one most widely used. In our example above, using this criterion, we would
retain 1 factor (principal component).
The scree test. A graphical method is the scree test first proposed by Cattell (1966). We can
plot the eigenvalues shown above in a simple line plot.
[Scree plot: line plot of the 6 eigenvalues.]
Cattell suggests finding the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. To the right of this point, presumably, one finds only "factorial scree" ("scree" is the geological term referring to the debris which collects on the lower part of a rocky slope). According to this criterion, we would probably retain 1 or 2 factors in our example.
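Applied to the eigenvalues printed above, the Kaiser criterion reads:

```python
# Kaiser criterion: retain only the factors whose eigenvalue exceeds 1, i.e.
# factors extracting at least as much variance as one standardized variable.
eigenvalues = [4.6173, 0.8788, 0.3035, 0.1055, 0.0732, 0.0216]
retained = [ev for ev in eigenvalues if ev > 1.0]
print(len(retained))   # 1 factor retained, as in the example above
```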
RESEARCH OF IRREGULARITIES (THIRD DIFFERENCES)
+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    1 -- 2    |   -2785.86   | **************************************************** |
+--------------+--------------+------------------------------------------------------+

RESEARCH OF IRREGULARITIES (SECOND DIFFERENCES)
+--------------+--------------+------------------------------------------------------+
| IRREGULARITY | IRREGULARITY |                                                      |
|   BETWEEN    |    VALUE     |                                                      |
+--------------+--------------+------------------------------------------------------+
|    1 -- 2    |    3163.20   | **************************************************** |
|    2 -- 3    |     377.34   | *******                                              |
+--------------+--------------+------------------------------------------------------+
[Plot: Anderson's Laplace confidence intervals for the eigenvalues.]
Third and second differences, as well as Anderson's Laplace intervals, are other guidelines to help the SPAD user choose the number of dimensions to retain for further analyses.
LOADINGS OF VARIABLES ON AXES 1 TO 5

ACTIVE VARIABLES
---------------------+------------------------------+------------------------------+------------------------------
VARIABLES            |           LOADINGS           | VARIABLE-FACTOR CORRELATIONS |      NORMED EIGENVECTORS
IDEN - SHORT LABEL   |    1     2     3     4     5 |    1     2     3     4     5 |    1     2     3     4     5
---------------------+------------------------------+------------------------------+------------------------------
CYLI - Cubic capacity| 0.96  0.01 -0.15  0.04 -0.23 | 0.96  0.01 -0.15  0.04 -0.23 | 0.45  0.01 -0.27  0.11 -0.84
PUIS - Power         | 0.90  0.38 -0.02 -0.16  0.04 | 0.90  0.38 -0.02 -0.16  0.04 | 0.42  0.41 -0.03 -0.49  0.15
VITE - Speed         | 0.75  0.62  0.20  0.08  0.04 | 0.75  0.62  0.20  0.08  0.04 | 0.35  0.66  0.37  0.26  0.13
POID - Weight        | 0.91 -0.18 -0.35 -0.06  0.11 | 0.91 -0.18 -0.35 -0.06  0.11 | 0.42 -0.19 -0.63 -0.18  0.42
LONG - Length        | 0.92 -0.30  0.05  0.22  0.07 | 0.92 -0.30  0.05  0.22  0.07 | 0.43 -0.32  0.10  0.69  0.26
LARG - Width         | 0.80 -0.48  0.34 -0.14 -0.02 | 0.80 -0.48  0.34 -0.14 -0.02 | 0.37 -0.51  0.62 -0.42 -0.06
---------------------+------------------------------+------------------------------+------------------------------
For a normed PCA, correlations (variable-factor) and loadings are equivalent.
As expected, the first factor is generally more highly correlated with the variables than the second factor: as previously described, these factors are extracted successively and account for less and less variance overall.
Normed eigenvectors are the coefficients that describe the linear relationship between the active normed variables and the factors. In this example, we have:

Factor1 = 0.45 x (CYLI - Mean(CYLI)) / STDEV(CYLI) + 0.42 x (PUIS - Mean(PUIS)) / STDEV(PUIS) + 0.35 x ...
Note:
SPAD prints out neither the contributions nor the cosines for the active variables. However, it is possible to calculate them this way:
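One possible reconstruction (our sketch, not a SPAD printout), based on the loadings and eigenvalues shown above:

```python
# Hedged sketch: for a normed PCA, an active variable's squared cosine on an
# axis is its squared loading (the variable lies at distance 1 from the
# origin), and its contribution to an axis is its squared loading divided by
# the eigenvalue of that axis. Values are taken from the loadings table above.
loadings_cyli = [0.96, 0.01, -0.15, 0.04, -0.23]    # CYLI on axes 1 to 5
eigenvalues = [4.6173, 0.8788, 0.3035, 0.1055, 0.0732]

squared_cosines = [c * c for c in loadings_cyli]
contributions = [100.0 * c * c / ev for c, ev in zip(loadings_cyli, eigenvalues)]

print(round(contributions[0], 1))   # CYLI accounts for about 20% of axis 1
```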
DISTO: the distance between the case and the center of gravity of the overall sample. This is helpful to identify the "average" cars (close to the center of gravity) and the more specific ones, far from the center of gravity.
To create a new factorial graph, select Graph - New , the following window
appears:
The preselection step allows you to select the different elements to display in the graph:
Active or supplementary cases
Active or supplementary variables
If you forget to select an element, you have to create a new graph and redo the
preselection.
THE TOOL BAR OF THE GRAPH EDITOR
The tool bar gives access to the following functions: points selection, factors selection, total unselection, framing selection, write or delete the labels, set as ghost or cancel the ghosts, information on points, horizontal or vertical symmetric view, correlation circle, and refresh.
SAVE A GRAPH
An internal save is attached to the chain: if the chain is re-executed, or if the user deletes the results of the chain, these internal saves are deleted.
This type of save uses the commands Save and Save as internal save of the Graphics menu. When you save in internal format, you give a title to the saved graphic; later you can reload it with the command Open internal save of the Graphics menu. The advantage of the internal format is that all the annotation functions and the properties of the factorial planes remain available.
The editor for the factorial planes also lets you save the graphics in .BMP or .PCX format; these images can then be inserted into a word-processor document. The EMF metafile format gives the best image quality. This type of save is made with the command Save as - Screen Image BMP/PCX.
GENERAL PRINCIPLES
The construction of a graphic after an analysis follows these general principles:
Go to the New Graphics menu, which opens the pre-selections dialogue box.
For a single analysis, you can open several graphics at once through the Graphics menu and make different pre-selections. All the graphics you create can be saved in the internal or the archive format.
To modify your graph, apply the following rule:
- Select the points with the tool bar or the Selection menu
- Format them with the Format menu
- Deselect to see the effect of the changes
IMPORTANT
To manipulate (move, change, etc.) the labels and the texts on a graphic, enlarge the frame. For this you have to be in standard mode, that is: no selection mode button is highlighted, and the status bar is empty.
[Data table and graphs: image attributes of the drinks (SUZE, VODKA, GIN, MALIBU, BEER, WHISKY, MARTINI, PASTIS) crossed with opinion items such as Refreshing, Not elegant, Good during the day, Oldy not trendy, Close to me, To relax oneself, Friendly product, By habits, With friends, Liked by youngs, Become expensive, We can mix it, Make snobish.]
The following graph has been designed with the SPAD Amado procedure. Using the SCA results, rows and columns are ranked by decreasing first-factor coordinates, which gives a visual structure to the table. The width of a column is proportional to its frequency.
Retained coordinates
The number of retained coordinates is useful for the methods that follow the MCA in the chain. These methods can be DEFAC (factors description) and RECIP/SEMIS (clustering).
EIGENVALUES

COMPUTATIONS PRECISION SUMMARY : TRACE BEFORE DIAGONALISATION.. 2.8571
                                 SUM OF EIGENVALUES............ 2.8571
Starting partition
Three procedures are available:
- The first one consists in searching for stable clusters by crossing several partitions built from randomly selected centroids. The item Number defines the number of partitions (2 by default) and Size determines the number of centroids for each partition.
- The others produce a single partition based on N centroids chosen by the SPAD user or randomly selected.
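The stable-clusters idea can be sketched on toy data; this is a hedged illustration of crossing two partitions, not SPAD's implementation:

```python
from random import Random

# Hedged sketch of the "stable clusters" idea: build two partitions from
# randomly selected centroids, then cross them; cases falling in the same pair
# of clusters in both partitions form a stable cluster. Toy 1-D data.
rng = Random(0)
cases = [rng.uniform(0.0, 100.0) for _ in range(40)]

def partition(data, centroids):
    # one assignment step: each case goes to its nearest centroid
    return [min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            for x in data]

p1 = partition(cases, rng.sample(cases, 3))   # first partition, 3 centroids
p2 = partition(cases, rng.sample(cases, 3))   # second partition, 3 centroids

stable = set(zip(p1, p2))                     # crossing the two partitions
print(len(stable))                            # at most 3 x 3 product clusters
```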
[Hierarchical tree: percentage of cases in each node of the dendrogram.]
The tree window lets you display or delete the node numbers, display the aggregation criteria, and draw a vertical or horizontal tree.
The level index is the gain of between-cluster inertia obtained by subdividing one node into two nodes.
The largest bar corresponds to the cut of the tree into two clusters.
PARTI - DECLA
The PARTI procedure constructs partitions by pruning an aggregation tree. It creates the partitions requested by the user, or found by an automatic search for the best partitions, possibly improving them by iterations on moving centers (consolidation). The partitions created in this way are then characterized automatically.
The DECLA procedure lets you describe the partitions determined by the PARTI procedure.
We can characterize either each cluster of a partition, or the partition itself globally. All the available elements (active and illustrative) may participate in the characterization: categories of categorical variables, the categorical variables themselves, continuous variables, the frequencies and the factorial axes.
CLUSTERING CONSOLIDATION AROUND THE CENTERS OF THE 7 CLUSTERS,
ACHIEVED BY 10 ITERATIONS WITH MOVING CENTERS

BETWEEN-CLUSTERS INERTIA INCREASE
+-----------+---------------+----------------+---------------+
| ITERATION | TOTAL INERTIA | INTER-CLUSTERS |     RATIO     |
|           |               |    INERTIA     |               |
+-----------+---------------+----------------+---------------+
|     0     |    2.35008    |    0.77272     |    0.32881    |
|     1     |    2.35008    |    0.82435     |    0.35078    |
|     2     |    2.35008    |    0.82613     |    0.35153    |
|     3     |    2.35008    |    0.82630     |    0.35160    |
|     4     |    2.35008    |    0.82630     |    0.35160    |
+-----------+---------------+----------------+---------------+
STOP AFTER ITERATION 4. RELATIVE INCREASE OF BETWEEN-CLUSTER INERTIA
WITH RESPECT TO THE PREVIOUS ITERATION IS ONLY 0.000 %.
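The ratio in the last column (between-cluster inertia over total inertia) can be illustrated on a toy one-dimensional example:

```python
from statistics import mean

# Illustrative computation of the consolidation criterion: the ratio of the
# between-cluster inertia to the total inertia (the moving-centers iterations
# can only increase it). The two clusters below are toy data.
clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
points = [x for c in clusters for x in c]
g = mean(points)                                          # overall center

total_inertia = sum((x - g) ** 2 for x in points) / len(points)
between = sum(len(c) * (mean(c) - g) ** 2 for c in clusters) / len(points)

print(round(between / total_inertia, 4))   # well-separated clusters: 0.9681
```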
CLUSTERS REPRESENTATIVES

CLUSTER 1/ 7   COUNT: 128
  RK   DISTANCE   IDENT.
   1   0.51034    0980
   2   0.56936    0091
   3   0.58376    0485
   4   0.58376    0619
   5   0.62658    0368
   6   0.62658    0897
   7   0.63989    0704
   8   0.66465    0184
   9   0.66465    0232
  10   0.66465    0238

CLUSTER 2/ 7   COUNT: 358
  RK   DISTANCE   IDENT.
   1   0.66989    0459
   2   0.80053    0043
   3   0.80753    0322
   4   0.86366    0393
   5   0.86366    0450
   6   0.86366    0780
   7   0.86366    0540
   8   0.86366    0460
   9   0.90535    0082
  10   0.91404    0593

CLUSTER 3/ 7   COUNT: 72
  RK   DISTANCE   IDENT.
   1   0.58799    0741
   2   0.60470    0940
   3   0.61735    0639
   4   0.61735    0788
   5   0.69764    0789
   6   0.70722    0758
   7   0.78494    0766
   8   0.78494    0806
   9   0.82442    0742
  10   0.82442    0946

CLUSTER 4/ 7   COUNT: 82
  RK   DISTANCE   IDENT.
   1   0.74814    0156
   2   0.98976    0575
   3   1.01170    0730
   4   1.07622    0569
   5   1.12107    0721
   6   1.12879    0148
   7   1.12879    0660
   8   1.12879    0715
   9   1.14287    0566
  10   1.14460    0360

CLUSTER 5/ 7   COUNT: 67
The characterizing elements are classified by order of importance with the help of a statistical criterion (the test-value), with which a probability is associated: the larger the test-value, the lower the probability, and the better the element characterizes the cluster.
In the case of the description of the clusters by the categories of the categorical variables, an option allows sorting the characterizing categories by decreasing test-values or by percentages.
CLUSTERS CHARACTERISATION BY ACTIVE CATEGORIES
CHARACTERISATION BY CATEGORIES OF CLUSTERS OR CATEGORIES
OF CUT "b" OF THE TREE INTO 7 CLUSTERS

Cluster 1 / 7  (12.80 % of the sample, weight 128)

T.VALUE  PROB.   ---- PERCENTAGES ----   CHARACTERISTIC        OF VARIABLES                                    WEIGHT
                 GRP/CAT CAT/GRP GLOBAL  CATEGORIES
 24.52   0.000    81.01  100.00  15.80   BEPC-BE-BEPS          Diploma in 5 categories                           158
  4.73   0.000    17.59   71.88  52.30   tenant                Occupation status of housing in 4 categories      523
  3.10   0.001    18.31   40.63  28.40   25 to 34 yo           Age in 5 categories                               284
  3.08   0.001    17.61   46.09  33.50   Employee              Job category                                      335
  2.85   0.002    20.67   24.22  15.00   Lower than 25 yo      Age in 5 categories                               150
 -2.04   0.021     8.73   15.63  22.90   Manager               Job category                                      229
 -2.27   0.012     8.97   20.31  29.00   owner                 Occupation status of housing in 4 categories      290
 -2.33   0.010     2.08    0.78   4.80   Other                 Job category                                       48
 -2.72   0.003     3.61    2.34   8.30   Lower than 2.000      Urban area size (number of inhabitants)            83
 -3.01   0.001     5.92    7.81  16.90   65 yo and more        Age in 5 categories                               169
 -3.28   0.001     6.22   10.16  20.90   35 to 49 yo           Age in 5 categories                               209
 -3.81   0.000     0.00    0.00   6.70   free housing, other   Occupation status of housing in 4 categories       67
 -4.49   0.000     0.00    0.00   8.70   2.000 - 20.000        Urban area size (number of inhabitants)            87
 -6.27   0.000     0.00    0.00  15.00   University            Diploma in 5 categories                           150
 -7.06   0.000     0.00    0.00  18.20   Bac - Brevet sup.     Diploma in 5 categories                           182
 -7.22   0.000     0.00    0.00  18.90   No one                Diploma in 5 categories                           189
-10.07   0.000     0.00    0.00  32.10   CEP                   Diploma in 5 categories                           321
[Similar characterisation tables follow for clusters 2/7 to 7/7.]
OBJECT
The general purpose of this procedure, called VAREG, is to learn more about the
relationship between several independent or predictor variables and a dependent
continuous variable.
VAREG allows you to perform least-squares adjustment models with a constant term. It can be used for many different analyses, including:
Simple regression
Multiple regression
Analysis of variance
Analysis of covariance
VAREG enables you to specify interactions (crossed effects) up to the 3rd order. Each regression coefficient is associated with the test of nullity, which is valid in the classical context where the random term is assumed to follow a Laplace-Gauss (normal) law.
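As an illustration of the underlying computation, here is a hedged sketch of a simple regression on invented data; this is not the VAREG implementation itself:

```python
from math import sqrt
from statistics import mean

# Hedged sketch of a least squares fit with a constant term, plus the Student
# statistic testing the slope against zero (valid when the random term follows
# a Laplace-Gauss law). The data are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
intercept = my - slope * mx

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r * r for r in residuals) / (len(x) - 2)   # residual variance
t_slope = slope / sqrt(s2 / sxx)                    # Student statistic

print(round(slope, 3))   # 1.977: a strong, significant linear effect
```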
The REPEATED statement enables you to specify effects in the model that represent
repeated measurements on the same experimental unit for the same response.
The VAREG procedure automatically generates a rule file that allows you to create a new data set (with the Deployment Archiving\Archiving\Predictions method) containing the input dataset in addition to predicted values and residuals.
The treatment of missing data is handled by the parameters.
OUTPUTS
Summary statistics on the variables of the model are output: marginal distributions of the categorical variables; mean, standard deviation, minimum and maximum of the continuous variables. The method supplies the identification of the coefficients of the model: coefficients of the continuous variables, the categories of the factors and of the possible interactions. Subsequently it is possible to output the variance-covariance matrix or the correlation matrix.
The procedure prints the coefficients, the estimation of their standard deviation, the corresponding Student's statistic, as well as the critical probability and the associated test-value. Also shown are the sum of the squares of the deviations, the multiple correlation coefficient, and the estimate of the variance of the residuals. Finally, the test of simultaneous nullity of all the coefficients (test of a constant endogenous "y") is provided.
In the case of an analysis of variance, you also get the sum of the squared deviations according to their source (residual, criteria or interaction), as well as Fisher's statistics, the critical probabilities and the associated test-values. In the case of repeated observations, the repeatability variance is displayed, as well as the estimates obtained taking this variance into account.
DEFINE A MODEL
The interface allows you to specify the definition of one or more models. The functions of
the CTRL, SHIFT keys are standard.
1. In the Selection list choose the TYPE of the variable(s) you want to define
2. Then in the Variables Available list, select one or more variables of that TYPE, and
confirm your choices with the transfer button. A double click on a variable also
confirms the choice.
To delete a variable or an interaction from the model under construction, select it in the
model list and confirm with the transfer button.
3. Save a model
Once you have specified at least one endogenous variable and one exogenous
variable, click on the "Validate" button to add the model to the Model list.
Delete a model
Select the model from the list and click on "Delete" button.
Change a model
Select the model in the list and click on "Modify" button.
PARAMETERS
The VAREG parameters allow you to handle missing data and to specify whether
measurements are repeated or not. With the printout parameters, you can specify the
desired outputs.
If LSUPR = Deleted case, the cases presenting missing data for one of the
variables of the model (endogenous or exogenous) will be eliminated from the
analysis.
If LSUPR = Mean imputation, missing exogenous data will be replaced by the mean of
the corresponding variable.
Warning: if LSUPR = Mean imputation, the endogenous variable must not have any
missing data.
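As a sketch of what mean imputation does (the helper name is illustrative, not a SPAD function), each missing exogenous value is replaced by the mean of the observed values of that variable:

```python
def impute_mean(column):
    """Replace missing values (None) by the mean of the observed values,
    as in LSUPR = Mean imputation for an exogenous variable."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]
```

For example, `impute_mean([2.0, None, 4.0])` fills the gap with the mean of 2.0 and 4.0.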
Missing data handling for categorical variables (LZERO)
Possible values:
Re-coded / Deleted case
Choose LREP = Repetitions in sequence if the repetitions follow one another in the
rows of the data table.
Choose LREP = Repetitions in confusion if the repetitions are unordered.
Output Parameters
Summary statistics on the variables in the model (LSTAT)
Possible values:
Yes / No
If LSTAT = Yes, one obtains the marginal distributions of the categorical variables of the
model, as well as the usual statistics for the continuous variables: mean,
standard deviation, minimum and maximum.
No
(No output)
Variances, covariances
(Output the variance-covariance matrix)
Correlations
(Output the correlation matrix)
If LABEL = short, 4 characters are used for categorical variable labels and 20 for
continuous variable labels.
Data
This dataset corresponds to the perception that 100 companies have of their suppliers.
The criteria are the following:
Delivery time
Price index
Price flexibility
Perceived quality
Service quality
Commercial image
Product quality
Satisfaction
The main goal is to find the best model explaining Satisfaction by a subset of the other
items.
The first 56 cases are shown below (values use decimal commas, as in the original file):

Id | Company Size | Delivery delay | Price Index | Price Flexibility | Perceived Quality | Service Quality | Commercial Image | Product Quality
1 | < 50 employees | 4,1 | 0,6 | 6,9 | 4,7 | 2,4 | 2,3 | 5,2
2 | >= 50 employees | 1,8 | 3 | 6,3 | 6,6 | 2,5 | 4 | 8,4
3 | >= 50 employees | 3,4 | 5,2 | 5,7 | 6 | 4,3 | 2,7 | 8,2
4 | >= 50 employees | 2,7 | 1 | 7,1 | 5,9 | 1,8 | 2,3 | 7,8
5 | < 50 employees | 6 | 0,9 | 9,6 | 7,8 | 3,4 | 4,6 | 4,5
6 | >= 50 employees | 1,9 | 3,3 | 7,9 | 4,8 | 2,6 | 1,9 | 9,7
7 | < 50 employees | 4,6 | 2,4 | 9,5 | 6,6 | 3,5 | 4,5 | 7,6
8 | >= 50 employees | 1,3 | 4,2 | 6,2 | 5,1 | 2,8 | 2,2 | 6,9
9 | < 50 employees | 5,5 | 1,6 | 9,4 | 4,7 | 3,5 | 3 | 7,6
10 | >= 50 employees | 4 | 3,5 | 6,5 | 6 | 3,7 | 3,2 | 8,7
11 | < 50 employees | 2,4 | 1,6 | 8,8 | 4,8 | 2 | 2,8 | 5,8
12 | < 50 employees | 3,9 | 2,2 | 9,1 | 4,6 | 3 | 2,5 | 8,3
13 | >= 50 employees | 2,8 | 1,4 | 8,1 | 3,8 | 2,1 | 1,4 | 6,6
14 | < 50 employees | 3,7 | 1,5 | 8,6 | 5,7 | 2,7 | 3,7 | 6,7
15 | < 50 employees | 4,7 | 1,3 | 9,9 | 6,7 | 3 | 2,6 | 6,8
16 | < 50 employees | 3,4 | 2 | 9,7 | 4,7 | 2,7 | 1,7 | 4,8
17 | < 50 employees | 3,2 | 4,1 | 5,7 | 5,1 | 3,6 | 2,9 | 6,2
18 | < 50 employees | 4,9 | 1,8 | 7,7 | 4,3 | 3,4 | 1,5 | 5,9
19 | < 50 employees | 5,3 | 1,4 | 9,7 | 6,1 | 3,3 | 3,9 | 6,8
20 | < 50 employees | 4,7 | 1,3 | 9,9 | 6,7 | 3 | 2,6 | 6,8
21 | < 50 employees | 3,3 | 0,9 | 8,6 | 4 | 2,1 | 1,8 | 6,3
22 | < 50 employees | 3,4 | 0,4 | 8,3 | 2,5 | 1,2 | 1,7 | 5,2
23 | < 50 employees | 3 | 4 | 9,1 | 7,1 | 3,5 | 3,4 | 8,4
24 | >= 50 employees | 2,4 | 1,5 | 6,7 | 4,8 | 1,9 | 2,5 | 7,2
25 | < 50 employees | 5,1 | 1,4 | 8,7 | 4,8 | 3,3 | 2,6 | 3,8
26 | < 50 employees | 4,6 | 2,1 | 7,9 | 5,8 | 3,4 | 2,8 | 4,7
27 | >= 50 employees | 2,4 | 1,5 | 6,6 | 4,8 | 1,9 | 2,5 | 7,2
28 | < 50 employees | 5,2 | 1,3 | 9,7 | 6,1 | 3,2 | 3,9 | 6,7
29 | < 50 employees | 3,5 | 2,8 | 9,9 | 3,5 | 3,1 | 1,7 | 5,4
30 | >= 50 employees | 4,1 | 3,7 | 5,9 | 5,5 | 3,9 | 3 | 8,4
31 | >= 50 employees | 3 | 3,2 | 6 | 5,3 | 3,1 | 3 | 8
32 | < 50 employees | 2,8 | 3,8 | 8,9 | 6,9 | 3,3 | 3,2 | 8,2
33 | < 50 employees | 5,2 | 2 | 9,3 | 5,9 | 3,7 | 2,4 | 4,6
34 | >= 50 employees | 3,4 | 3,7 | 6,4 | 5,7 | 3,5 | 3,4 | 8,4
35 | >= 50 employees | 2,4 | 1 | 7,7 | 3,4 | 1,7 | 1,1 | 6,2
36 | >= 50 employees | 1,8 | 3,3 | 7,5 | 4,5 | 2,5 | 2,4 | 7,6
37 | >= 50 employees | 3,6 | 4 | 5,8 | 5,8 | 3,7 | 2,5 | 9,3
38 | < 50 employees | 4 | 0,9 | 9,1 | 5,4 | 2,4 | 2,6 | 7,3
39 | >= 50 employees | 0 | 2,1 | 6,9 | 5,4 | 1,1 | 2,6 | 8,9
40 | >= 50 employees | 2,4 | 2 | 6,4 | 4,5 | 2,1 | 2,2 | 8,8
41 | >= 50 employees | 1,9 | 3,4 | 7,6 | 4,6 | 2,6 | 2,5 | 7,7
42 | < 50 employees | 5,9 | 0,9 | 9,6 | 7,8 | 3,4 | 4,6 | 4,5
43 | < 50 employees | 4,9 | 2,3 | 9,3 | 4,5 | 3,6 | 1,3 | 6,2
44 | < 50 employees | 5 | 1,3 | 8,6 | 4,7 | 3,1 | 2,5 | 3,7
45 | >= 50 employees | 2 | 2,6 | 6,5 | 3,7 | 2,4 | 1,7 | 8,5
46 | < 50 employees | 5 | 2,5 | 9,4 | 4,6 | 3,7 | 1,4 | 6,3
47 | < 50 employees | 3,1 | 1,9 | 10 | 4,5 | 2,6 | 3,2 | 3,8
48 | >= 50 employees | 3,4 | 3,9 | 5,6 | 5,6 | 3,6 | 2,3 | 9,1
49 | < 50 employees | 5,8 | 0,2 | 8,8 | 4,5 | 3 | 2,4 | 6,7
50 | < 50 employees | 5,4 | 2,1 | 8 | 3 | 3,8 | 1,4 | 5,2
51 | < 50 employees | 3,7 | 0,7 | 8,2 | 6 | 2,1 | 2,5 | 5,2
52 | >= 50 employees | 2,6 | 4,8 | 8,2 | 5 | 3,6 | 2,5 | 9
53 | >= 50 employees | 4,5 | 4,1 | 6,3 | 5,9 | 4,3 | 3,4 | 8,8
54 | >= 50 employees | 2,8 | 2,4 | 6,7 | 4,9 | 2,5 | 2,6 | 9,2
55 | < 50 employees | 3,8 | 0,8 | 8,7 | 2,9 | 1,6 | 2,1 | 5,6
56 | < 50 employees | 2,9 | 2,6 | 7,7 | 7 | 2,8 | 3,6 | 7,7
Variable | Mean | Number of missing values
Delivery Time | 3,515 | 0
Price Index | 2,364 | 0
Price Flexibility | 7,894 | 0
Perceived Quality | 5,248 | 0
Service Quality | 2,916 | 0
Commercial Image | 2,665 | 0
Product Quality | 6,971 | 0
Usage Index | 46,100 | 0
The R² criterion
Curve of R² according to the number of variables
The following graph displays the evolution of the R² criterion according to the number of
variables entered in the model. The higher this criterion, the better the model.
But as this criterion increases automatically as new variables enter the model, we
must evaluate the relative gain of adding each new variable. Further on, we will see
criteria that penalize the R² for each newly entered variable: the adjusted R² and the
Mallows C(p).
Looking at the graph below, we see that the R² increases significantly up to 3 variables.
The next variables are redundant and do not bring any additional information that could
significantly improve the model.
The R² can be interpreted as the part of the variance explained by the model. It takes its
values between 0 and 1.
[Graph: value of R² (y-axis from 0.45 to 0.80) against the number of model variables (1 to 8).]
1 var
This output presents the 3 best adjustments with one exogenous variable.

Adjustments with 1 variable + constant, DDL(Student) = 98
Adjustment 1 (Full printout): R**2 = 0.5051, Fisher = 100.0162, Probability = 0.0000, Test-Value = 8.283

Variable label | Coefficient | Student | Probability | Test-Value
Usage Index | 0,0676 | 10,00 | 0,000 | 8,28

The next two adjustments (variable labels not shown in this extract):
0,4215 | 8,48 | 0,000 | 7,33
0,7189 | 8,06 | 0,000 | 7,04
6 vars
Adjustments with 6 variables + constant, DDL(Student) = 93
Adjustment 1 (Full printout): R**2 = 0.8009, Fisher = 62.3410, Probability = 0.0000, Test-Value = 11.408

Variable label | Coefficient | Student | Probability | Test-Value
Delivery Time | 0,3061 | 8,10 | 0,000 | 7,03
Price Index | 0,2446 | 5,95 | 0,000 | 5,47
Price Flexibility | 0,2912 | 7,99 | 0,000 | 6,95
Perceived Quality | 0,4324 | 7,39 | 0,000 | 6,54
Commercial Image | -0,1978 | 2,35 | 0,021 | 2,31
Product Quality | -0,0470 | 1,49 | 0,139 | 1,48

The next adjustments (variable labels not shown in this extract):

Coefficient | Student | Probability | Test-Value
0,0777 | 1,49 | 0,140 | 1,47
0,2846 | 7,84 | 0,000 | 6,85
0,4210 | 7,13 | 0,000 | 6,35
0,4536 | 5,87 | 0,000 | 5,40
-0,1926 | 2,28 | 0,025 | 2,24
-0,0417 | 1,33 | 0,188 | 1,32

Coefficient | Student | Probability | Test-Value
-0,0624 | 1,14 | 0,256 | 1,14
0,2891 | 7,83 | 0,000 | 6,84
0,4167 | 7,03 | 0,000 | 6,28
0,5884 | 7,93 | 0,000 | 6,91
-0,1894 | 2,23 | 0,028 | 2,20
-0,0453 | 1,42 | 0,159 | 1,41

A further adjustment (variable labels not shown in this extract):

Coefficient | Student | Probability | Test-Value
0,3247 | 9,05 | 0,000 | 7,65
0,2291 | 5,73 | 0,000 | 5,29
0,2993 | 8,25 | 0,000 | 7,14
0,4303 | 7,31 | 0,000 | 6,49
-0,2100 | 2,49 | 0,015 | 2,44
[Graph: value of adjusted R² (y-axis from 0.44 to 0.79) against the number of model variables (1 to 8).]
[Graph: value of Mallows Cp (y-axis from 0.0 to 1.3) against the number of model variables (1 to 8).]
1. R²:
R² = 1 − SSE / SST
SSE: Error Sum of Squares
SST: Total Sum of Squares.

2. Adjusted R²:
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable that is used to build the model.
Adjusted R² = 1 − (n − 1)(1 − R²) / (n − p)
n: the number of observations,
p: the number of variables used for the model plus one.

3. Mallows C(p):
The Mallows C(p) is positively related to the error (SSE) and to the number of
explanatory variables in the model: a model with many variables or with a high error
will be penalized by this criterion.
C(p) = SSE / SST + 2p − n
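The three criteria above can be computed directly; a minimal sketch (the helper names are illustrative):

```python
def r_squared(sse, sst):
    """R2 = 1 - SSE / SST: part of the variance explained by the model."""
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, p):
    """Penalizes R2 for each additional variable; p is the number of
    model variables plus one."""
    return 1.0 - (n - 1) * (1.0 - r2) / (n - p)

def mallows_cp(sse, sst, n, p):
    """C(p) = SSE / SST + 2p - n, the variant printed above."""
    return sse / sst + 2 * p - n
```

Adding a variable always raises the plain R², while the adjusted form only rises if the gain outweighs the (n − 1)/(n − p) penalty.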
References:
Furnival, G.M. and Wilson, R.W. (1974), "Regression by Leaps and Bounds",
Technometrics, 16, 499-511.
Logistic Regression
LOGISTIC REGRESSION
LOGIT INTRODUCTION
Logistic regression aims to explain the probability of a binary event. This probability
cannot be modeled by a traditional regression using the least squares method.
Instead, a so-called LOGIT transformation is applied; the model belongs to the family of
generalized linear models and is estimated by maximum likelihood.
If P is the probability that we are trying to explain, the ratio P / (1 − P) is defined as the
ODDS, and the quantity that is finally modeled is the logarithm of these ODDS.
We want to explain P(Y = 1 | X1, X2).
Thus: P(Y = 1 | X1, X2) + P(Y = 2 | X1, X2) = 1
The logit of the probability P is the logarithm of the quotient P / (1 − P):
Logit(P) = Log( P / (1 − P) )    (1)
The logit is zero when P = 1/2.
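The transformation in (1) is easy to verify numerically; a minimal sketch:

```python
import math

def logit(p):
    """Logit(P) = log(P / (1 - P)), the log of the ODDS."""
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Recover the probability P from its logit."""
    return 1.0 / (1.0 + math.exp(-x))
```

As stated above, `logit(0.5)` is 0; probabilities above 1/2 map to positive logits and the transform is invertible.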
Table 1 – Construction of the D matrix

RACE (categories) | D2 | D3 | D4
White (1) | 0 | 0 | 0
Black (2) | 1 | 0 | 0
Hispanic (3) | 0 | 1 | 0
Others (4) | 0 | 0 | 1
The logistic model is then written this way:
Logit P(Y = 1 | Z = 1) = β0
Logit P(Y = 1 | Z = 2) = β0 + β2
Logit P(Y = 1 | Z = 3) = β0 + β3
Logit P(Y = 1 | Z = 4) = β0 + β4
Thus, the explanatory variable Z with k categories is transformed into (k − 1) design
variables, noted du. If the first category is the reference, the logit is written:
Logit P(Y = 1 | Z) = β0 + Σ (u = 2 to k) βu du
that is, here:
Logit P(Y = 1 | Z) = β0 + β2 d2 + β3 d3 + β4 d4
with
du = 1 if Z = u
du = 0 otherwise
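The (k − 1) design variables can be built mechanically; a small sketch (the function name is illustrative), using the RACE categories of Table 1 with the first category as reference:

```python
def design_variables(value, categories):
    """Code one observation of a k-category variable into k-1 0/1 design
    variables; the first category is the reference (all zeros)."""
    return [1 if value == c else 0 for c in categories[1:]]

RACE = ["White", "Black", "Hispanic", "Others"]
```

For example, `design_variables("White", RACE)` gives the all-zero reference row, while `"Black"` gives the D2 indicator.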
Iterations number:
Specifies the maximum number of iterations to perform.
By default, Iterations number = 25. If convergence is not attained within that number of
iterations, the displayed output created by the procedure contains results based on the
last maximum likelihood iteration.
Alpha threshold for the tests (in %):
Sets the significance level α for the (100 − α)% confidence intervals of the regression
parameters or odds ratios. The value must be between 0 and 100. By default, α is
equal to 5%.
(End of the previous coding table: for the last category, the design columns A1, A2
and A5 are all −1.)

GLM
Four columns are created to indicate group membership. The design matrix columns
for A are as follows.
GLM Coding
Design Matrix
A | A1 | A2 | A5 | A7
1 | 1 | 0 | 0 | 0
2 | 0 | 1 | 0 | 0
5 | 0 | 0 | 1 | 0
7 | 0 | 0 | 0 | 1

For instance, if the reference level is 7 (REF='7'), the design matrix columns for A are as
follows.
Comparison to a Reference Coding
Design Matrix
A | A1 | A2 | A5
1 | 1 | 0 | 0
2 | 0 | 1 | 0
5 | 0 | 0 | 1
7 | 0 | 0 | 0
Variable selection:
The selection options are available only if the model contains simple factors (no
interactions).
No selection
The model is estimated with all the input variables; this is the default option.
Forward
The procedure first estimates parameters for factors forced into the model. These
factors are the intercepts and the first n explanatory factors in the model statement,
where n is the number specified by the Number of variables in initial model (n is
zero by default). Next, the procedure computes the score chi-square statistic for each
factor not in the model and examines the largest of these statistics. If it is significant at
the Threshold (%) for the variables entry in model level, the corresponding factor is
added to the model. Once a factor is entered in the model, it is never removed from the
model. The process is repeated until none of the remaining effects meet the specified
level for entry or until the Number of variables in final model value is reached.
Backward
Parameters for the complete model as specified in the model statement are estimated
unless the Number of variables in initial model option is specified. In that case, only
the parameters for the intercepts and the first n explanatory factors in the model
statement are estimated, where n is the Number of variables in initial model. Results
of the Wald test for individual parameters are examined. The least significant factor
that does not meet the Threshold (%) for a variable to stay in the model is removed.
Once a factor is removed from the model, it remains excluded. The process is repeated
until no other factor in the model meets the specified level for removal or until the
Number of variables in final model value is reached. Backward selection is often less
successful than forward or stepwise selection because the full model fit in the first step
is the model most likely to result in a complete or quasi-complete separation of
response values.
Stepwise
This option is similar to the FORWARD option except that factors already in the model
do not necessarily remain. Factors are entered into and removed from the model in
such a way that each forward selection step may be followed by one or more backward
elimination steps. The stepwise selection process terminates if no further factor can be
added to the model or if the factor just entered into the model is the only factor
removed in the subsequent backward elimination.
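The forward logic described above can be sketched generically (this is not SPAD's exact score chi-square computation: the score function and entry threshold here are placeholders):

```python
def forward_selection(candidates, score, entry_gain, max_vars=None):
    """Greedy forward selection: at each step, add the candidate whose
    inclusion most improves score(selected); stop when the best gain
    falls below entry_gain or when max_vars factors have entered.
    Once entered, a factor is never removed."""
    selected, remaining = [], list(candidates)
    while remaining and (max_vars is None or len(selected) < max_vars):
        base = score(selected)
        gains = {c: score(selected + [c]) - base for c in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < entry_gain:
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a toy score that adds a fixed gain per variable, say 0.5 for "x1", 0.3 for "x2" and 0.05 for "x3", an entry threshold of 0.1 selects "x1" then "x2" and stops.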
LOGISTIC REGRESSION
MODEL PRESENTATION

MODEL DEFINITION
================
RESPONSE VARIABLE ............... : Type of client
NUMBER OF RESPONSE LEVELS ....... : 2
NUMBER OF OBSERVATIONS .......... : 468
LINK FUNCTION ................... : BINARY LOGIT
OPTIMIZATION TECHNIQUE .......... : FISHER'S SCORING

RESPONSE PROFILE
================
VARIABLE RESPONSE : Type of client
==========================
ORDER  RESPONSE  FREQUENCY
--------------------------
1      Good        237
2      Bad         231
==========================
PROBABILITY MODELED IS: Type of client = Good
ODDS RATIO ESTIMATES
=========================================================================
EFFECT | ESTIMATE | CONFIDENCE LIMITS *
-------------------------------------------------------------------------
Seniority 1 VS 5 | 0.244 | 0.109 – 0.548
Seniority 2 VS 5 | 0.554 | 0.200 – 1.538
Seniority 3 VS 5 | 1.417 | 0.535 – 3.755
Seniority 4 VS 5 | 0.687 | 0.263 – 1.798
Salary domiciliation 1 VS 2 | 4.389 | 2.485 – 7.752
Size of savings 1 VS 4 | 1.488 | 0.101 – 22.004
Size of savings 2 VS 4 | 1.904 | 0.150 – 24.208
Size of savings 3 VS 4 | 1.457 | 0.126 – 16.898
Profession 1 VS 3 | 1.933 | 0.816 – 4.577
Profession 2 VS 3 | 1.301 | 0.745 – 2.271
Age of client 1 VS 4 | 0.374 | 0.146 – 0.962
Age of client 2 VS 4 | 0.764 | 0.350 – 1.668
Age of client 3 VS 4 | 1.255 | 0.585 – 2.690
Family Situation 1 VS 4 | 4.300 | 0.851 – 21.734
Family Situation 2 VS 4 | 2.194 | 0.455 – 10.579
Family Situation 3 VS 4 | 0.906 | 0.166 – 4.960
Average outstanding 1 VS 3 | 0.190 | 0.041 – 0.882
Average outstanding 2 VS 3 | 0.469 | 0.114 – 1.922
Average transactions 1 VS 4 | 0.336 | 0.154 – 0.732
Average transactions 2 VS 4 | 0.510 | 0.219 – 1.188
Average transactions 3 VS 4 | 0.676 | 0.325 – 1.404
Number of withdrawals 1 VS 3 | 7.534 | 3.164 – 17.939
Number of withdrawals 2 VS 3 | 3.006 | 1.419 – 6.366
Overdraft 1 VS 2 | 0.876 | 0.519 – 1.479
Checkbook 1 VS 2 | 8.081 | 2.867 – 22.779
=========================================================================
* 95% WALD CONFIDENCE LIMITS
CONFUSION MATRIX

FREQUENCIES
                 ESTIM Good     Bad     TOTAL
OBSERV Good         191          45       236
       Bad           38         194       232
TOTAL               229         239       468

ROW PERCENTAGES
                 ESTIM Good     Bad     TOTAL
OBSERV Good        80.932     19.068   100.000
       Bad         16.379     83.621   100.000
TOTAL              48.932     51.068   100.000

COLUMN PERCENTAGES
                 ESTIM Good     Bad     TOTAL
OBSERV Good        83.406     18.828    50.427
       Bad         16.594     81.172    49.573
TOTAL             100.000    100.000   100.000

CLASSIFICATION
                 CLASS. WELL    BAD     TOTAL
OBSERV Good        80.932     19.068   100.000
       Bad         83.621     16.379   100.000
TOTAL              82.265     17.735   100.000
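The percentages above follow directly from the frequency table; a small sketch reproducing the row percentages and the overall good-classification rate:

```python
def classification_rates(matrix):
    """Row percentages and overall good-classification rate from a
    confusion matrix given as {observed: {estimated: count}}."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[g][g] for g in matrix)
    row_pct = {g: {e: 100.0 * n / sum(matrix[g].values())
                   for e, n in matrix[g].items()}
               for g in matrix}
    return 100.0 * correct / total, row_pct

# Frequencies from the confusion matrix above
cm = {"Good": {"Good": 191, "Bad": 45},
      "Bad":  {"Good": 38,  "Bad": 194}}
overall, rows = classification_rates(cm)
```

Here `overall` rounds to 82.265 and `rows["Good"]["Good"]` to 80.932, matching the printout.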
Dataset
The dataset is extracted from a survey in which 100 respondents judge their suppliers.
The criteria are:
Delivery time
Prices level
Prices flexibility
Image
Services
Commercial image
Product quality
About the suppliers, we know the size of the company in two classes: more or less than 50
employees.
The goal of the analysis is to study the differences between these two classes.
The first 50 cases are shown below (values use decimal commas, as in the original file):

Delivery Time | Prices Level | Prices Flexibility | Image | Services | Commercial Image | Product Quality | Supplier's Company Size
4,1 | 0,6 | 6,9 | 4,7 | 2,4 | 2,3 | 5,2 | < 50 employees
1,8 | 3 | 6,3 | 6,6 | 2,5 | 4 | 8,4 | >= 50 employees
3,4 | 5,2 | 5,7 | 6 | 4,3 | 2,7 | 8,2 | >= 50 employees
2,7 | 1 | 7,1 | 5,9 | 1,8 | 2,3 | 7,8 | >= 50 employees
6 | 0,9 | 9,6 | 7,8 | 3,4 | 4,6 | 4,5 | < 50 employees
1,9 | 3,3 | 7,9 | 4,8 | 2,6 | 1,9 | 9,7 | >= 50 employees
4,6 | 2,4 | 9,5 | 6,6 | 3,5 | 4,5 | 7,6 | < 50 employees
1,3 | 4,2 | 6,2 | 5,1 | 2,8 | 2,2 | 6,9 | >= 50 employees
5,5 | 1,6 | 9,4 | 4,7 | 3,5 | 3 | 7,6 | < 50 employees
4 | 3,5 | 6,5 | 6 | 3,7 | 3,2 | 8,7 | >= 50 employees
2,4 | 1,6 | 8,8 | 4,8 | 2 | 2,8 | 5,8 | < 50 employees
3,9 | 2,2 | 9,1 | 4,6 | 3 | 2,5 | 8,3 | < 50 employees
2,8 | 1,4 | 8,1 | 3,8 | 2,1 | 1,4 | 6,6 | >= 50 employees
3,7 | 1,5 | 8,6 | 5,7 | 2,7 | 3,7 | 6,7 | < 50 employees
4,7 | 1,3 | 9,9 | 6,7 | 3 | 2,6 | 6,8 | < 50 employees
3,4 | 2 | 9,7 | 4,7 | 2,7 | 1,7 | 4,8 | < 50 employees
3,2 | 4,1 | 5,7 | 5,1 | 3,6 | 2,9 | 6,2 | < 50 employees
4,9 | 1,8 | 7,7 | 4,3 | 3,4 | 1,5 | 5,9 | < 50 employees
5,3 | 1,4 | 9,7 | 6,1 | 3,3 | 3,9 | 6,8 | < 50 employees
4,7 | 1,3 | 9,9 | 6,7 | 3 | 2,6 | 6,8 | < 50 employees
3,3 | 0,9 | 8,6 | 4 | 2,1 | 1,8 | 6,3 | < 50 employees
3,4 | 0,4 | 8,3 | 2,5 | 1,2 | 1,7 | 5,2 | < 50 employees
3 | 4 | 9,1 | 7,1 | 3,5 | 3,4 | 8,4 | < 50 employees
2,4 | 1,5 | 6,7 | 4,8 | 1,9 | 2,5 | 7,2 | >= 50 employees
5,1 | 1,4 | 8,7 | 4,8 | 3,3 | 2,6 | 3,8 | < 50 employees
4,6 | 2,1 | 7,9 | 5,8 | 3,4 | 2,8 | 4,7 | < 50 employees
2,4 | 1,5 | 6,6 | 4,8 | 1,9 | 2,5 | 7,2 | >= 50 employees
5,2 | 1,3 | 9,7 | 6,1 | 3,2 | 3,9 | 6,7 | < 50 employees
3,5 | 2,8 | 9,9 | 3,5 | 3,1 | 1,7 | 5,4 | < 50 employees
4,1 | 3,7 | 5,9 | 5,5 | 3,9 | 3 | 8,4 | >= 50 employees
3 | 3,2 | 6 | 5,3 | 3,1 | 3 | 8 | >= 50 employees
2,8 | 3,8 | 8,9 | 6,9 | 3,3 | 3,2 | 8,2 | < 50 employees
5,2 | 2 | 9,3 | 5,9 | 3,7 | 2,4 | 4,6 | < 50 employees
3,4 | 3,7 | 6,4 | 5,7 | 3,5 | 3,4 | 8,4 | >= 50 employees
2,4 | 1 | 7,7 | 3,4 | 1,7 | 1,1 | 6,2 | >= 50 employees
1,8 | 3,3 | 7,5 | 4,5 | 2,5 | 2,4 | 7,6 | >= 50 employees
3,6 | 4 | 5,8 | 5,8 | 3,7 | 2,5 | 9,3 | >= 50 employees
4 | 0,9 | 9,1 | 5,4 | 2,4 | 2,6 | 7,3 | < 50 employees
0 | 2,1 | 6,9 | 5,4 | 1,1 | 2,6 | 8,9 | >= 50 employees
2,4 | 2 | 6,4 | 4,5 | 2,1 | 2,2 | 8,8 | >= 50 employees
1,9 | 3,4 | 7,6 | 4,6 | 2,6 | 2,5 | 7,7 | >= 50 employees
5,9 | 0,9 | 9,6 | 7,8 | 3,4 | 4,6 | 4,5 | < 50 employees
4,9 | 2,3 | 9,3 | 4,5 | 3,6 | 1,3 | 6,2 | < 50 employees
5 | 1,3 | 8,6 | 4,7 | 3,1 | 2,5 | 3,7 | < 50 employees
2 | 2,6 | 6,5 | 3,7 | 2,4 | 1,7 | 8,5 | >= 50 employees
5 | 2,5 | 9,4 | 4,6 | 3,7 | 1,4 | 6,3 | < 50 employees
3,1 | 1,9 | 10 | 4,5 | 2,6 | 3,2 | 3,8 | < 50 employees
3,4 | 3,9 | 5,6 | 5,6 | 3,6 | 2,3 | 9,1 | >= 50 employees
5,8 | 0,2 | 8,8 | 4,5 | 3 | 2,4 | 6,7 | < 50 employees
5,4 | 2,1 | 8 | 3 | 3,8 | 1,4 | 5,2 | < 50 employees
Fuwil 4
The Fuwil 4 Excel sheet gives the main statistics of each class with respect to the
explanatory variables.
The Within-group mean column displays the means of each explanatory variable for
groups 1 and 2 respectively. By default, group 1 is the first category (in the list) of the
endogenous variable. In this example, group 1 contains the small suppliers (< 50
employees) and group 2 the bigger suppliers (50 or more employees).
The General mean column displays the mean of each variable observed over the whole set.
Missing data handling for exogenous variables: missing values are replaced by
within-group means.

Group | Variable label | Within-group mean | General mean | Number of missing values
1 | Delivery Time | 4,192 | 3,515 | 0
1 | Prices Level | 1,948 | 2,364 | 0
1 | Prices Flexibility | 8,622 | 7,894 | 0
1 | Image | 5,213 | 5,248 | 0
1 | Services | 3,050 | 2,916 | 0
1 | Commercial Image | 2,692 | 2,665 | 0
1 | Product Quality | 6,090 | 6,971 | 0
2 | Delivery Time | 2,500 | 3,515 | 0
2 | Prices Level | 2,988 | 2,364 | 0
2 | Prices Flexibility | 6,803 | 7,894 | 0
2 | Image | 5,300 | 5,248 | 0
2 | Services | 2,715 | 2,916 | 0
2 | Commercial Image | 2,625 | 2,665 | 0
2 | Product Quality | 8,293 | 6,971 | 0
This table is useful to detect the variables with the largest average differences between a
class and the overall sample.
For example, class number 2 (suppliers with 50 or more employees) obtains an
average product quality score of 8.293, while class number 1 obtains a score of 6.090.
The Image variable does not differentiate small suppliers from bigger ones.
With the DEMOD procedure (Descriptive statistics), we would get these results:
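The within-group and general means shown in the Fuwil sheet can be recomputed from the raw data; a minimal sketch (the helper name is illustrative):

```python
def within_and_general_means(values, groups):
    """Within-group means and the general mean of one explanatory
    variable, given parallel lists of values and group labels."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    within = {g: sum(vs) / len(vs) for g, vs in by_group.items()}
    general = sum(values) / len(values)
    return within, general
```

On a toy variable with values [1.0, 3.0, 5.0] and groups [1, 1, 2], the within-group means are 2.0 and 5.0 and the general mean is 3.0.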
The R² Criterion
Curve of R² according to the number of explanatory variables
This graph displays the evolution of the R² criterion according to the number of
explanatory variables included in the model. The higher the R², the better the adjustment.
The R² increases automatically with the number of explanatory variables.
Therefore, it is recommended to find a compromise between the best R² and the smallest
model in terms of explanatory variables.
Some other criteria are available in the Parameters tab, such as the adjusted R² and the
Mallows Cp.
The graph below shows that the R² increases until the entry of the 4th explanatory variable;
adding further variables does not increase the R² or the quality of the adjustment:
these variables are redundant.
The R² can be interpreted as the part of the variance explained by the linear discriminant
function. It goes from 0 to 1.
[Graph: value of R² (y-axis from 0.43 to 0.67) against the number of model variables.]
The Excel sheets 1 var to 7 vars display the 3 best adjustments with respect to the R² for
models with 1 to 7 explanatory variables.
1 var
This table lists the 3 best adjustments (R²) with one single explanatory variable.

Adjustments with 1 variable + constant, DDL(Student) = 98
Adjustment 1 (Full printout): R**2 = 0.4680, Fisher = 86.2000, Probability = 0.0000, Test-Value = 7.845

Variable label | Coefficient | Student | Probability | Test-Value
Product Quality | -0,4337 | 9,28 | 0,000 | 7,85

The next two adjustments (variable labels not shown in this extract):
0,4683 | 8,38 | 0,000 | 7,26
0,4799 | 8,04 | 0,000 | 7,03
6 vars
The following adjustments each contain 6 explanatory variables.

Adjustments with 6 variables + constant, DDL(Student) = 93
Adjustment 1 (Full printout): R**2 = 0.6718, Fisher = 31.7290, Probability = 0.0000, Test-Value = 9.210

Variable label | Coefficient | Student | Probability | Test-Value
Delivery Time | 0,3005 | 1,12 | 0,264 | 1,12
Prices Level | 0,1242 | 0,45 | 0,656 | 0,44
Prices Flexibility | 0,2418 | 4,40 | 0,000 | 4,18
Services | -0,2308 | 0,45 | 0,657 | 0,44
Commercial Image | 0,1516 | 1,85 | 0,067 | 1,83
Product Quality | -0,2812 | 5,90 | 0,000 | 5,42

The next adjustments (variable labels not shown in this extract):

Coefficient | Student | Probability | Test-Value
0,1863 | 3,27 | 0,002 | 3,17
0,0070 | 0,11 | 0,910 | 0,11
0,2383 | 4,33 | 0,000 | 4,12
-0,0328 | 0,37 | 0,711 | 0,37
0,1833 | 1,44 | 0,152 | 1,43
-0,2790 | 5,87 | 0,000 | 5,40

Coefficient | Student | Probability | Test-Value
0,1844 | 2,35 | 0,021 | 2,31
0,2368 | 4,34 | 0,000 | 4,13
-0,0317 | 0,36 | 0,722 | 0,36
0,0029 | 0,02 | 0,980 | 0,02
0,1831 | 1,44 | 0,153 | 1,43
-0,2779 | 5,88 | 0,000 | 5,41
For the first adjustment, the variables Prices Flexibility and Product Quality are the
only ones significant at the 5% level (the probability that the related coefficient is null is
lower than 5%).
3 Vars
Finally, we should search for the best adjustments among the models with 3 or 4
explanatory variables, where all the coefficients are significant and the model test values
are the highest.

Adjustments with 3 variables + constant, DDL(Student) = 96
Adjustment 1 (Full printout): R**2 = 0.6591, Fisher = 61.8789, Probability = 0.0000, Test-Value = 9.660

Variable label | Coefficient | Student | Probability | Test-Value
Delivery Time | 0,2031 | 3,64 | 0,000 | 3,51
Prices Flexibility | 0,2370 | 4,55 | 0,000 | 4,32
Product Quality | -0,2592 | 5,79 | 0,000 | 5,35

The next adjustments (variable labels not shown in this extract):

Coefficient | Student | Probability | Test-Value
0,3016 | 6,06 | 0,000 | 5,56
0,2206 | 2,68 | 0,009 | 2,63
-0,3097 | 7,12 | 0,000 | 6,36

Coefficient | Student | Probability | Test-Value
0,3018 | 6,02 | 0,000 | 5,53
0,1953 | 2,38 | 0,019 | 2,34
-0,3323 | 7,46 | 0,000 | 6,61
[Graph: value of adjusted R² (y-axis from 0.42 to 0.66) against the number of model variables (up to 7).]
4 vars
The first adjustment with 4 explanatory variables is the following:

Adjustments with 4 variables + constant, DDL(Student) = 95
Adjustment 1 (Full printout): R2AJ = 0.6574, Fisher = 48.4911, Probability = 0.0000, Test-Value = 9.612

Variable label | Coefficient | Student | Probability | Test-Value
Delivery Time | 0,1840 | 3,28 | 0,001 | 3,18
Prices Flexibility | 0,2390 | 4,64 | 0,000 | 4,40
Commercial Image | 0,1476 | 1,86 | 0,066 | 1,84
Product Quality | -0,2788 | 6,13 | 0,000 | 5,61

The adjusted R² is about 0.6574, very close to the standard R² of 0.6711. The explanatory
variables are meaningful, thus the penalty related to the adjusted R² is very small.
[Graph: value of Mallows Cp (y-axis from 0.00 to 0.54) against the number of model variables (3 to 7).]
4 vars
Adjustments with 4 variables + constant, DDL(Student) = 95
Adjustment 1 (Full printout): C(P) = 2.2916, Fisher = 48.4607, Probability = 0.0000, Test-Value = 9.610

Variable label | Coefficient | Student | Probability | Test-Value
Delivery Time | 0,1840 | 3,28 | 0,001 | 3,18
Prices Flexibility | 0,2390 | 4,64 | 0,000 | 4,40
Commercial Image | 0,1476 | 1,86 | 0,066 | 1,84
Product Quality | -0,2788 | 6,13 | 0,000 | 5,61
4. R²:
R² = 1 − SSE / SST
SSE: Error Sum of Squares
SST: Total Sum of Squares.

5. Adjusted R²:
The adjusted R² criterion is based on the standard R², but it imposes a penalty for each
additional explanatory variable that is used to build the model.
Adjusted R² = 1 − (n − 1)(1 − R²) / (n − p)
n: the number of observations,
p: the number of variables used for the model plus one.

6. Mallows C(p):
The Mallows C(p) is positively related to the error (SSE) and to the number of
explanatory variables in the model: a model with many variables or with a high error
will be penalized by this criterion.
C(p) = SSE / SST + 2p − n
References:
Furnival, G.M. and Wilson, R.W. (1974), "Regression by Leaps and Bounds",
Technometrics, 16, 499-511.
Dis2g 3
The following table describes the differences observed between the two classes with
respect to the input explanatory variables.

Linear discriminant analysis on the BASE sample
Description of the samples
G1: < 50 employees [60]    G2: >= 50 employees [40]

Variable label | Group | Mean | Std deviation | Minimum | Maximum | Student's T | Probability
Delivery time | G1 | 4.192 | 1.029 | 2.100 | 6.100 | 8.045 | 0.000
Delivery time | G2 | 2.500 | 1.006 | 0.000 | 4.900 | |
Prices flexibility | G1 | 8.622 | 1.154 | 5.100 | 10.000 | 8.378 | 0.000
Prices flexibility | G2 | 6.803 | 0.879 | 5.000 | 8.500 | |
Product quality | G1 | 6.090 | 1.282 | 3.700 | 8.500 | 9.284 | 0.000
Product quality | G2 | 8.293 | 0.918 | 6.200 | 10.000 | |
The first group G1 corresponds to the suppliers with fewer than 50 employees; there are
60 of them in the sample.
The second group G2 corresponds to the suppliers with 50 or more employees; there are
40 of them.
SPAD displays the means, standard deviations, minima and maxima of each explanatory
variable by group.
The Student's T column corresponds to the test that the means of the two groups are
equal for each explanatory variable. We reject this hypothesis for the three variables
because the associated probabilities are lower than 1/10000.
Product quality is perceived as significantly higher for the suppliers with 50 or more
employees (average score of 8.29 against 6.09).
Conversely, delivery times and prices flexibility are better for the smaller suppliers.
Dis2g 4
This table displays all the correlation matrices associated with the discriminant analysis.
Correlation matrix on group 1: < 50 employees (Count = 60)
                     Delivery Time | Prices Flexibility | Product Quality
Delivery Time             1,00
Prices Flexibility        0,32             1,00
Product Quality          -0,17             0,04               1,00

Correlation matrix on group 2: >= 50 employees (Count = 40)
                     Delivery Time | Prices Flexibility | Product Quality
Delivery Time             1,00
Prices Flexibility       -0,12             1,00
Product Quality           0,07            -0,16               1,00

Pooled within-group correlation matrix
                     Delivery Time | Prices Flexibility | Product Quality
Delivery Time             1,00
Prices Flexibility        0,17             1,00
Product Quality          -0,09            -0,01               1,00

Total correlation matrix
                     Delivery Time | Prices Flexibility | Product Quality
Delivery Time             1,00
Prices Flexibility        0,51             1,00
Product Quality          -0,48            -0,45               1,00
The first two correlation matrices display the correlations between the explanatory
variables inside each group. For example, the correlation between delivery time and
prices flexibility is 0.32 for group 1 and -0.12 for group 2.
These two matrices allow us to detect redundancies between explanatory variables;
this is not the case in this example.
Dis2g 6
Classification table of the discriminant analysis

Result of the FISHER linear discriminant analysis on sample: TRAIN
Table of group counts

Observed group | Assigned: < 50 employees | Assigned: >= 50 employees | Misclassified | Total
< 50 employees | 50 | 10 | 10 (16,67%) | 60 (100,00%)
>= 50 employees | 4 | 36 | 4 (10,00%) | 40 (100,00%)
Total | | | 14 (14,00%) | 100 (100,00%)
The adjustment presents a good classification rate on the current set: 50 of the 60 small
suppliers and 36 of the 40 big suppliers are well classified, i.e. 83% and 90% respectively.
Globally, the good classification rate is 86% = (50 + 36) / 100.
Dis2g 9
This table displays the characteristics of the linear discriminant function:

Linear discriminant function
R² = 0.65913   Fisher = 61.87877   Probability = 0.0000
D² (Mahalanobis) = 7.89599   T² (Hotelling) = 189.50369   Probability = 0.0000

Variable label | Correlations with D.L.F. (Threshold = 0.201) | D.L.F. coefficients | Regression coefficients | Standard deviation (regression) | Student's T (regression) | Probability
Delivery Time | 0,632 | 1,191760 | 0,203073 | 0,0558 | 3,6373 | 0,0004
Prices Flexibility | 0,648 | 1,390700 | 0,236972 | 0,0521 | 4,5482 | 0,0000
Product Quality | -0,686 | -1,521000 | -0,259174 | 0,0448 | 5,7880 | 0,0000
CONSTANT | | -3,774790 | -0,777758 | 0,5981 | 1,3005 | 0,1966
The R² is 0.659; it means that the between-group variance (which expresses the differences
between the two groups) represents 65.9% of the total variance.
The Fisher statistic corresponds to the global model validation.
The higher the between-group variance, the higher the Fisher statistic. This criterion
follows a Fisher distribution with 3 and 96 degrees of freedom.
The Fisher statistic of 61.88 corresponds to a probability lower than 1/10000 (0.0000).
The model is acceptable.
D² is the Mahalanobis distance between the two groups. This distance takes into account
the relationships between the explanatory variables (the common correlation matrix).
Hotelling's T² is a generalization of the Student test to the case of more than one
explanatory variable. It tests the hypothesis that all the means are equal.
In this example, Hotelling's T² is 189.503; the associated probability is lower than 1/1000:
the differences between the means are significant.
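For two groups, Hotelling's T² follows from the Mahalanobis distance as T² = (n1·n2 / (n1 + n2))·D², and the printed values are consistent with this relation:

```python
def hotelling_t2(d2, n1, n2):
    """Two-group Hotelling T2 from the Mahalanobis distance D2
    and the group sizes n1, n2."""
    return (n1 * n2) / (n1 + n2) * d2
```

With D² = 7.89599, n1 = 60 and n2 = 40, this gives about 189.504, matching the printed 189.50369 up to the rounding of D².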
For each explanatory variable, SPAD displays its correlation with the D.L.F. (linear
discriminant function). The threshold of 0.201 corresponds to the limit above which we
consider a correlation as significant (the threshold applies in absolute value).
The correlations between each explanatory variable and the linear discriminant function
are significant and quite close: the linear discriminant function is a well-balanced
compromise between these three variables.
The D.L.F. coefficients give the model equation; the best linear combination of
the 3 explanatory variables to separate the two groups is the following:
S1(x) = 1.191 x Delivery Time + 1.39 x Prices Flexibility - 1.52 x Product Quality - 3.77.
This equation gives high scores to suppliers that provide good delivery times and prices flexibility
(group 1, < 50 employees), and low scores to suppliers that have good product quality (group 2,
>= 50 employees).
Of course, the following equation is equivalent to the previous one but reverses the sign of
the scores:
S2(x) = -1.191 x Delivery Time - 1.39 x Prices Flexibility + 1.52 x Product Quality + 3.77.
The ranking of the suppliers is not modified.
The Regression coefficients column is redundant with the discriminant function
coefficients column: they are proportional.
Linear discriminant analysis based on two groups is a particular case of multiple regression.
Dis2g - 12
Discriminant analysis by bootstrap estimations: 250 random samples
Classification table (counts and percentages)

Original group | Training sample - Well classified | Training sample - Misclassified | Bootstrap - Well classified | Bootstrap - Misclassified | Total
< 50 employees | 50,00 (83,33%) | 10,00 (16,67%) | 49,53 (82,55%) | 10,47 (17,45%) | 60,00 (100,00%)
>= 50 employees | 36,00 (90,00%) | 4,00 (10,00%) | 35,78 (89,45%) | 4,22 (10,55%) | 40,00 (100,00%)
Total | 86,00 (86,00%) | 14,00 (14,00%) | 85,31 (85,31%) | 14,69 (14,69%) | 100,00 (100,00%)
Dis2g 13
Bootstrap estimations for linear discriminant function

| Variable label     | Corr. with D.L.F. (Mean) | Std. deviation | D.L.F. coefficients (Mean) | Std. deviation | Mean / Std. deviation |
| Delivery Time      | 0.637                    | 0.051          | 1.296                      | 0.379          | 3.418                 |
| Prices Flexibility | 0.648                    | 0.064          | 1.500                      | 0.513          | 2.924                 |
| Product Quality    | -0.691                   | 0.038          | -1.633                     | 0.327          | 4.996                 |
| CONSTANT           |                          |                | -4.163                     | 4.680          | 0.889                 |
Dis2g 11
In this Excel sheet, SPAD displays for each case its observed group, its assigned
group, the probability of being assigned to this group by the model, and its discriminant
score.
The Group of origin column has to be compared, for each case, with the Group
assignment column. If the model is right, SPAD prints '=='.
The Fisher function, or score, is calculated by the model with the following equation:
S(x) = 1.191 x Delivery Time + 1.39 x Prices Flexibility - 1.52 x Product Quality - 3.77.
For example, for the case n79 (Delivery Time 1.00, Prices Flexibility 7.1, Product Quality 9.9),
the score -7.767 is calculated this way:
-7.767 = 1.191 x 1.00 + 1.39 x 7.1 - 1.52 x 9.9 - 3.77.
Cases are listed by increasing scores. Thus the case n79 gets the lowest score and
therefore the highest probability of assignment to group 2 (50 employees and more).
Conversely, cases with high scores have a higher probability of assignment to group 1
(fewer than 50 employees).
For each case, SPAD calculates the probability of being assigned to each group and assigns
the case to the group with the highest probability. The indifference point (equal
probabilities for the two groups) corresponds here to a zero Fisher score; it does not
appear in this example.
The assignment probability is obtained from the Fisher score S(x):

P(G1 | x) = exp(S(x)) / (1 + exp(S(x)))   and then   P(G2 | x) = 1 - P(G1 | x)
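The logistic transform above can be sketched directly (a minimal illustration of the formula, not SPAD code):

```python
import math

def assignment_probabilities(score):
    """Convert the Fisher score S(x) into the two assignment
    probabilities via the logistic transform given above."""
    p_g1 = math.exp(score) / (1.0 + math.exp(score))
    return p_g1, 1.0 - p_g1

# Case n79 (score -7.767): P(G1|x) is near 0, so the case is
# assigned to group 2 with a probability close to 1.
p1, p2 = assignment_probabilities(-7.767)
```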
Sample: TRAINING
List of group assignments and related probabilities (excerpt)

| Case identifier | Original group |    | Assignment | Assignment probability | Fisher function |
| Individu n 79   | >= 50          | == | >= 50      | 1.000                  | -7.767          |
| Individu n 39   | >= 50          | == | >= 50      | 1.000                  | -7.716          |
| Individu n 65   | >= 50          | == | >= 50      | 1.000                  | -7.661          |
| ...             |                |    |            |                        |                 |
General principles
This procedure outputs a linear discriminant analysis with two groups on the factorial
coordinates from a NOT NORMED principal components analysis using the classical
Fisher method.
It provides bootstrap estimates of the bias and the precision of the principal results of the
discrimination: coefficients, case classification probabilities, global classification
percentages. It also allows the modification of the a priori costs and probabilities of the
classification in the groups. It provides the management of the base cases, of the test cases
and of the anonymous cases.
The procedure offers a print preview of the descriptive statistics of the model variables in
each of the two groups. Next the results of the discriminant analysis are shown:
classification tables, discriminant function, and output of the assignment of cases.
The decision rule is finally expressed as a function of the original variables. The results of
the regression equivalent are only indicative, since the classical hypotheses of normality
are meaningless in this context.
If a bootstrap validation is requested, the results of the discrimination are repeated with
the bootstrap estimates. In particular, the bias and the precision of the global classifications
are shown with the direct classifications. For anonymous cases, the procedure calculates
their bootstrap probability assignment.
If an evaluation of the test cases is required, the procedure outputs the results of the
discrimination relative to these cases. If the assignment of anonymous cases is required,
only the assignments are output.
The procedure can archive the rules for the discriminant function so they can be applied
later to another file of the same structure.
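The bootstrap idea behind these estimates can be sketched as follows (a generic illustration with made-up numbers, not the SPAD implementation, which resamples the cases and re-runs the whole discriminant analysis):

```python
import random
import statistics

def bootstrap_statistic(sample, stat, n_replicates=250, seed=0):
    """Re-estimate `stat` on 250 resamples drawn with replacement,
    as the procedure does for the discriminant coefficients, and
    return the mean and standard deviation of the replicate values."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(sample) for _ in sample])
            for _ in range(n_replicates)]
    return statistics.mean(reps), statistics.stdev(reps)

# Toy example: bootstrap mean and precision of a small sample.
mean, sd = bootstrap_statistic([4.2, 3.5, 5.1, 4.8, 3.9, 4.4], statistics.mean)
```

The standard deviation of the replicates is exactly the kind of precision figure reported in the bootstrap tables above.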
Dis2g 1
This first Excel sheet displays the studied model: the variable to explain is the same as in
the previous methods (Supplier's Company Size); the explanatory variables are the
principal factors obtained from the principal component analysis based on all the continuous
variables available in the dataset except Satisfaction index.
By default, SPAD names each factor with the prefix F and the factor number:
F1, F2, ... We instructed SPAD to run this analysis on the first 7 factors, that is to say
99.99% of the total inertia.
Model: V8 = F1 + F2 + F3 + F4 + F5 + F6 + F7

| Variable number | Variable label          |
| 8               | Supplier's Company Size |
| 1               | F 1                     |
| 2               | F 2                     |
| 3               | F 3                     |
| 4               | F 4                     |
| 5               | F 5                     |
| 6               | F 6                     |
| 7               | F 7                     |
EIGENVALUES
COMPUTATIONS PRECISION SUMMARY : TRACE BEFORE DIAGONALISATION.. 89.9375
SUM OF EIGENVALUES............ 89.9375
HISTOGRAM OF THE FIRST 8 EIGENVALUES
+--------+------------+------------+------------+
| NUMBER | EIGENVALUE | PERCENTAGE | CUMULATED  |
|        |            |            | PERCENTAGE |
+--------+------------+------------+------------+
|   1    |  81.8822   |   91.04    |   91.04    |
|   2    |   4.0759   |    4.53    |   95.58    |
|   3    |   1.4053   |    1.56    |   97.14    |
|   4    |   1.2298   |    1.37    |   98.51    |
|   5    |   0.7842   |    0.87    |   99.38    |
|   6    |   0.3903   |    0.43    |   99.81    |
|   7    |   0.1617   |    0.18    |   99.99    |
|   8    |   0.0081   |    0.01    |  100.00    |
+--------+------------+------------+------------+
| Original group  | Assignment group: < 50 employees | Assignment group: >= 50 employees | Total         | Misclassified |
| < 50 employees  | 54                               | 6                                 | 60 (100.00%)  | 6 (10.00%)    |
| >= 50 employees | 0                                | 40                                | 40 (100.00%)  | 0 (0.00%)     |
| Total           |                                  |                                   | 100 (100.00%) | 6 (6.00%)     |
The model presents a good classification rate on this sample: it correctly assigns 54 of
the 60 small suppliers and all the big ones, respectively 90% and 100%.
The global good classification rate is 94% = (54+40)/100.
Comparison with the model of the previous chapter
We can notice that this model obtains better results than the previous one, which only used
three predictors (Delivery Time, Prices Flexibility and Product Quality).
Classification table of the previous model:
Result of the FISHER linear discriminant analysis on sample: TRAIN
Table of groups counts

| Original group  | Assignment group: < 50 employees | Assignment group: >= 50 employees | Total         | Well classified | Misclassified |
| < 50 employees  | 50                               | 10                                | 60 (100.00%)  | 50 (83.33%)     | 10 (16.67%)   |
| >= 50 employees | 4                                | 36                                | 40 (100.00%)  | 36 (90.00%)     | 4 (10.00%)    |
| Total           |                                  |                                   | 100 (100.00%) | 86 (86.00%)     | 14 (14.00%)   |
Since our current model keeps almost all the available information (nearly all the
explanatory variables, through all the factors), it is natural that it achieves better results.
| Axis label | Corr. with D.L.F. (Threshold = 0.201) | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Student's T (Regression) | Probability |
| F 1        | -0.380                                | -0.291099           | -0.041896               | 0.0062                      | 6.7769                   | 0.0000      |
| F 2        | -0.651                                | -2.234360           | -0.321575               | 0.0277                      | 11.6055                  | 0.0000      |
| F 3        | 0.240                                 | 1.403290            | 0.201965                | 0.0472                      | 4.2798                   | 0.0000      |
| F 4        | 0.036                                 | 0.226999            | 0.032670                | 0.0504                      | 0.6477                   | 0.5188      |
| F 5        | 0.028                                 | 0.221504            | 0.031879                | 0.0632                      | 0.5046                   | 0.6150      |
| F 6        | 0.278                                 | 3.090510            | 0.444793                | 0.0895                      | 4.9676                   | 0.0000      |
| F 7        | -0.101                                | -1.747540           | -0.251510               | 0.1391                      | 1.8078                   | 0.0739      |
| CONSTANT   |                                       | 1.009960            | 0.000000                | 0.0559                      | 0.0000                   | 1.0000      |
The R² is 0.7120; it means that the between-group variance represents 71.20% of the total
variance.
The Fisher statistic is 32.50, corresponding to a probability lower than 1/10000 (0.0000).
Thus, the model is accepted.
All the statistics displayed in the above table are described in the previous section (page 19).
We can see that factors 4 and 5 present coefficients not significantly different from
zero (probabilities 0.5188 and 0.6150). Factor 7 also presents a probability greater than
0.05 (0.0739).
The coefficients of the Linear Discriminant Function give the following equation:
S1(x) = - 0.291 x F1 - 2.23 x F2 + 1.40 x F3 + 0.226 x F4 + 0.221 x F5 + 3.09 x F6
- 1.75 x F7 + 1.0099.
| Variable label     | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Student's T (Regression) | Probability |
| Delivery Time      | 2.018560            | 0.290515                | 0.0588                      | 4.9366                   | 0.0000      |
| Prices Level       | 0.590870            | 0.085039                | 0.0573                      | 1.4851                   | 0.1410      |
| Prices Flexibility | 2.771660            | 0.398903                | 0.0684                      | 5.8360                   | 0.0000      |
| Image              | -0.198512           | -0.028570               | 0.0831                      | 0.3437                   | 0.7319      |
| Services           | 1.257900            | 0.181039                | 0.0430                      | 4.2083                   | 0.0001      |
| Commercial Image   | 1.674080            | 0.240937                | 0.1206                      | 1.9979                   | 0.0487      |
| Product Quality    | -1.764960           | -0.254017               | 0.0453                      | 5.6085                   | 0.0000      |
| Frequency of use   | -0.327111           | -0.047079               | 0.0130                      | 3.6132                   | 0.0005      |
| CONSTANT           | -9.065840           | -1.450130               |                             |                          |             |
The variables Image and Prices Level are not significant (respective probabilities of
0.7319 and 0.1410). The small contribution of the variable Image is not surprising: we get
the same result as the one obtained with the automatic characterization (see tables below).
Concerning the variable Prices Level, it is surprising to find it not significant in the model
while it appears significant in the automatic characterization.
This is due to the correlations existing between the explanatory variables: the prices level
is related to the variables Delivery Time, Prices Flexibility, etc. These variables tend to
reduce the specific effect due to the prices level.
< 50 employees (Weight = 60.00, Count = 60)

| Characteristic variables | Category mean | Overall mean | Category Std. deviation | Overall Std. deviation | Test-value | Probability |
| Prices Flexibility       | 8.622         | 7.894        | 1.154                   | 1.380                  | 6.43       | 0.000       |
| Delivery Time            | 4.192         | 3.515        | 1.029                   | 1.314                  | 6.27       | 0.000       |
| Frequency of use         | 48.767        | 46.100       | 8.724                   | 8.944                  | 3.63       | 0.000       |
| Services                 | 3.050         | 2.916        | 0.584                   | 0.747                  | 2.18       | 0.014       |
| Commercial Image         | 2.692         | 2.665        | 0.859                   | 0.767                  | 0.42       | 0.336       |
| Image                    | 5.213         | 5.248        | 1.281                   | 1.126                  | -0.38      | 0.354       |
| Prices Level             | 1.948         | 2.364        | 1.018                   | 1.190                  | -4.26      | 0.000       |
| Product Quality          | 6.090         | 6.971        | 1.282                   | 1.577                  | -6.81      | 0.000       |

>= 50 employees (Weight = 40.00, Count = 40)

| Characteristic variables | Category mean | Overall mean | Category Std. deviation | Overall Std. deviation | Test-value | Probability |
| Product Quality          | 8.293         | 6.971        | 0.918                   | 1.577                  | 6.81       | 0.000       |
| Prices Level             | 2.988         | 2.364        | 1.156                   | 1.190                  | 4.26       | 0.000       |
| Image                    | 5.300         | 5.248        | 0.838                   | 1.126                  | 0.38       | 0.354       |
| Commercial Image         | 2.625         | 2.665        | 0.601                   | 0.767                  | -0.42      | 0.336       |
| Services                 | 2.715         | 2.916        | 0.905                   | 0.747                  | -2.18      | 0.014       |
| Frequency of use         | 42.100        | 46.100       | 7.690                   | 8.944                  | -3.63      | 0.000       |
| Delivery Time            | 2.500         | 3.515        | 1.006                   | 1.314                  | -6.27      | 0.000       |
| Prices Flexibility       | 6.803         | 7.894        | 0.879                   | 1.380                  | -6.43      | 0.000       |
| Axis label | Corr. with D.L.F. (Threshold = 0.201) | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Student's T (Regression) | Probability |
| F 1        | -0.380                                | -0.279138           | -0.041896               | 0.0062                      | 6.7436                   | 0.0000      |
| F 2        | -0.651                                | -2.142560           | -0.321575               | 0.0278                      | 11.5484                  | 0.0000      |
| F 3        | 0.240                                 | 1.345630            | 0.201965                | 0.0474                      | 4.2588                   | 0.0000      |
| F 6        | 0.279                                 | 2.963530            | 0.444793                | 0.0900                      | 4.9432                   | 0.0000      |
| CONSTANT   |                                       | 0.951688            | 0.000000                | 0.0562                      | 0.0000                   | 1.0000      |
Since the factors are orthogonal, the Student statistics hardly change (up to rounding
errors): we keep the same hierarchy and the same relative importance of the factors.
The new linear discriminant function is now written:
S1(X) = - 0.28 x F1 - 2.14 x F2 + 1.34 x F3 + 2.96 x F6 + 0.95.
| Variable label     | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Student's T (Regression) | Probability |
| Delivery Time      | 2.043930            | 0.306772                | 0.0353                      | 8.6795                   | 0.0000      |
| Prices Level       | 0.472917            | 0.070980                | 0.0461                      | 1.5395                   | 0.1272      |
| Prices Flexibility | 2.460700            | 0.369324                | 0.0605                      | 6.1044                   | 0.0000      |
| Image              | 0.572879            | 0.085983                | 0.0340                      | 2.5320                   | 0.0131      |
| Services           | 1.261200            | 0.189292                | 0.0390                      | 4.8564                   | 0.0000      |
| Commercial Image   | 0.103654            | 0.015557                | 0.0196                      | 0.7920                   | 0.4304      |
| Product Quality    | -1.669530           | -0.250579               | 0.0300                      | 8.3423                   | 0.0000      |
| Frequency of use   | -0.297403           | -0.044637               | 0.0128                      | 3.4890                   | 0.0007      |
| CONSTANT           | -8.387220           | -1.401670               |                             |                          |             |
We find the same opposition between the characteristic variables of the small suppliers
(Delivery Time and Prices Flexibility) and the bigger ones (Product Quality).
The variable Commercial Image is still not significant, but the variable Image
becomes significant. Moreover, its positive coefficient indicates a characteristic of the small
companies. However, it is recommended to interpret this result with care, because the
automatic characterization shows that small suppliers have a lower image score than the
big ones (average of 5.21 compared to 5.3). This is due to the correlations existing between
variables: working on a restricted number of factors was not sufficient to erase them.
Finally, by eliminating non-significant variables, principal factors, or variables whose
coefficient sign is not coherent, we get back to the model of the previous chapter with the
variables Delivery Time, Prices Flexibility and Product Quality.
Even if it discriminates less well than the other models studied in this chapter, we may
keep this one because of its coherence with regard to the relative contributions and the
signs of the effects.
Steps 2 and 3 are implemented in the DISCO procedure of the scoring chain.
The SPAD scoring method performs a multiple correspondence analysis (MCA) for the
following reasons:
- Linear discriminant analysis is a method that accepts only continuous input variables.
- The MCA transforms qualitative variables into continuous factorial coordinates that
can be used for the discriminant analysis.
- The factorial coordinates are orthogonal, which frees us from multicollinearity
problems.
- Finally, the selection of factorial coordinates optimizes the results.
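The first of these steps can be pictured with a rough stand-in: qualitative variables must become numeric columns before the discriminant analysis. Below is plain one-hot coding as a simplified illustration (SPAD actually uses the MCA factorial coordinates, which are additionally orthogonal; the data here are invented):

```python
# Rough stand-in for turning qualitative variables into numeric inputs.
def one_hot(rows, categories):
    """Code each qualitative value as a 0/1 indicator vector."""
    index = {c: i for i, c in enumerate(categories)}
    coded = []
    for value in rows:
        vec = [0] * len(categories)
        vec[index[value]] = 1
        coded.append(vec)
    return coded

coded = one_hot(["Single", "Married", "Single"],
                ["Single", "Married", "Divorced"])
# Each row is now numeric and could feed a method that, like linear
# discriminant analysis, requires continuous inputs.
```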
| AGE               | MARITAL  | SENIORITY        | SALARY             | SAVINGS       | JOB       |
| GE 50 years       | Single   | GT 12 years      | SALARY AT THE BANK | No saving     | Employee  |
| LT 23 years       | Single   | LE 1 year        | SALARY AT THE BANK | No saving     | Employee  |
| GE 23 LT 40 years | Widowed  | GT 6 LT 12 years | SALARY AT THE BANK | No saving     | Employee  |
| GE 23 LT 40 years | Divorced | GT 1 LE 4 years  | SALARY AT THE BANK | LT 10 KF Sav. | Employee  |
| LT 23 years       | Single   | GT 6 LT 12 years | NO SALARY          | No saving     | Employee  |
| GE 23 LT 40 years | Single   | LE 1 year        | SALARY AT THE BANK | No saving     | Employee  |
| GE 50 years       | Married  | GT 6 LT 12 years | SALARY AT THE BANK | No saving     | Executive |
| GE 50 years       | Married  | GT 12 years      | SALARY AT THE BANK | No saving     | Executive |
| GE 40 LT 50 years | Single   | GT 1 LE 4 years  | SALARY AT THE BANK | No saving     | Employee  |
| GE 50 years       | Single   | GT 4 LE 6 years  | SALARY AT THE BANK | No saving     | Employee  |
| GE 50 years       | Married  | GT 12 years      | SALARY AT THE BANK | No saving     | Employee  |
| GE 40 LT 50 years | Married  | LE 1 year        | NO SALARY          | LT 10 KF Sav. | Executive |
| GE 23 LT 40 years | Single   | GT 4 LE 6 years  | NO SALARY          | No saving     | Other     |
| GE 23 LT 40 years | Married  | GT 6 LT 12 years | SALARY AT THE BANK | No saving     | Employee  |
| GE 40 LT 50 years | Divorced | GT 4 LE 6 years  | NO SALARY          | LT 10 KF Sav. | Executive |
| GE 40 LT 50 years | Divorced | GT 6 LT 12 years | SALARY AT THE BANK | No saving     | Employee  |
| GE 50 years       | Single   | GT 12 years      | SALARY AT THE BANK | No saving     | Other     |
| GE 50 years       | Widowed  | GT 12 years      | SALARY AT THE BANK | No saving     | Other     |
GOOD_BAD ( 2 categories )
AGE ( 4 categories )
MARITAL STATUS ( 4 categories )
SENIORITY ( 5 categories )
SALARY ( 2 categories )
SAVINGS ( 4 categories )
JOB ( 3 categories )
CHECKING ACCOUNT ( 3 categories )
AVERAGE TRANSACTIONS ( 4 categories )
WITHDRAWALS ( 3 categories )
NEGATIVE ACCOUNT BALANCE ( 2 categories )
CHEQUE AUTHORIZATION ( 2 categories )
In the favourites tab, select the Scoring rubric and double click on
Discriminant analysis on categorical variables and scoring.
136
DISCO PARAMETERS
The configuration of the DISCO procedure starts by defining the model to build: the
endogenous variable and the qualitative exogenous variables:
DISCO RESULTS
Linear Discriminant Function
Model
V1=F1+F2+F3+F4+F5+F6+F7+F8+F9+F10+F11+F12+F13+F14+F15+F16+F17+F18+F19+F20
+F21+F22+F23+F24+F25
Linear discriminant function
R2 = 0.41398 Fisher = 12.48967 Probability = 0.0000
D2 (Mahalanobis) = 2.81410 T2 (Hotelling) = 329.19614 Probability =
| Axis label | Corr. with D.L.F. (Threshold = 0.093) | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Ratio Coefficient / Std. deviation |
| F 1        | -0.475                                | -3.228700           | -0.950022               | 0.0729                      | -13.0262                           |
| F 2        | 0.290                                 | 2.342510            | 0.689267                | 0.0867                      | 7.9474                             |
| F 3        | 0.104                                 | 0.897833            | 0.264181                | 0.0925                      | 2.8551                             |
| F 4        | 0.170                                 | 1.532160            | 0.450828                | 0.0967                      | 4.6611                             |
| F 5        | -0.007                                | -0.072457           | -0.021320               | 0.1057                      | -0.2018                            |
| F 6        | -0.057                                | -0.571836           | -0.168259               | 0.1077                      | -1.5617                            |
| F 7        | -0.022                                | -0.227015           | -0.066797               | 0.1099                      | -0.6076                            |
| F 8        | 0.061                                 | 0.641800            | 0.188845                | 0.1130                      | 1.6705                             |
| F 9        | 0.139                                 | 1.515070            | 0.445797                | 0.1173                      | 3.8017                             |
| F 10       | -0.045                                | -0.502921           | -0.147981               | 0.1192                      | -1.2411                            |
| F 11       | 0.004                                 | 0.051269            | 0.015086                | 0.1224                      | 0.1233                             |
| F 12       | -0.028                                | -0.319744           | -0.094082               | 0.1237                      | -0.7605                            |
| F 13       | -0.030                                | -0.356309           | -0.104841               | 0.1279                      | -0.8197                            |
| F 14       | -0.070                                | -0.847106           | -0.249255               | 0.1300                      | -1.9170                            |
| F 15       | 0.045                                 | 0.567041            | 0.166848                | 0.1350                      | 1.2364                             |
| F 16       | 0.002                                 | 0.023938            | 0.007043                | 0.1359                      | 0.0518                             |
| F 17       | -0.017                                | -0.219652           | -0.064631               | 0.1405                      | -0.4599                            |
| F 18       | -0.105                                | -1.389350           | -0.408807               | 0.1425                      | -2.8691                            |
| F 19       | 0.049                                 | 0.676453            | 0.199041                | 0.1487                      | 1.3381                             |
| F 20       | -0.008                                | -0.119744           | -0.035234               | 0.1546                      | -0.2279                            |
| F 21       | -0.074                                | -1.071810           | -0.315374               | 0.1553                      | -2.0303                            |
| F 22       | 0.024                                 | 0.367523            | 0.108141                | 0.1624                      | 0.6659                             |
| F 23       | 0.068                                 | 1.151150            | 0.338719                | 0.1819                      | 1.8622                             |
| F 24       | -0.061                                | -1.190570           | -0.350316               | 0.2089                      | -1.6768                            |
| F 25       | 0.019                                 | 0.608556            | 0.179063                | 0.3351                      | 0.5343                             |
| CONSTANT   |                                       | 0.018039            | 0.000000                | 0.0364                      | 0.0000                             |
The factors whose ratio has an absolute value greater than 1.96 are displayed in bold:
here F1, F2, F3, F4, F9, F18 and F21. These factors are to be included in the optimal model.
To build this model, we need to return to the DISCO configuration and click on the
Calculation Options button.
Now that the optimal model is available, we want to partition the dataset into two subsets:
one to perform the analysis, the other to confirm and validate it. This part is called
validation. We talk about the learning set and the testing set (or test cases) in the
following tab.
In this example, we choose to select randomly 25% of the cases to test the model based on
the 75% remaining cases. Validation is very useful for checking that the model does not
overfit the data and has a good predictive power.
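The random 25% / 75% partition can be sketched as follows (a generic illustration, not the SPAD sampler; with the 468 cases of this file it yields the 351-case learning set and 117-case testing set used below):

```python
import random

def split_cases(case_ids, test_fraction=0.25, seed=1):
    """Randomly reserve a fraction of the cases as the testing set,
    as in the 25% / 75% partition chosen in the dialog."""
    rng = random.Random(seed)
    ids = list(case_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    # learning set, testing set
    return ids[n_test:], ids[:n_test]

learning, testing = split_cases(range(468))
```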
Learning sample (351 cases):

| Original group | Assignment group: GOOD | Assignment group: BAD | Total         | Misclassified |
| GOOD           | 150                    | 28                    | 178 (100.00%) | 28 (15.73%)   |
| BAD            | 35                     | 138                   | 173 (100.00%) | 35 (20.23%)   |
| Total          |                        |                       | 351 (100.00%) | 63 (17.95%)   |

Test sample (117 cases):

| Original group | Assignment group: GOOD | Assignment group: BAD | Total         | Misclassified |
| GOOD           | 50                     | 9                     | 59 (100.00%)  | 9 (15.25%)    |
| BAD            | 21                     | 37                    | 58 (100.00%)  | 21 (36.21%)   |
| Total          |                        |                       | 117 (100.00%) | 30 (25.64%)   |
| Axis label | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Ratio Coefficient / Std. deviation |
| F 1        | -3.070890           | -0.950022               | 0.0733                      | -12.9600                           |
| F 2        | 2.228010            | 0.689267                | 0.0872                      | 7.9070                             |
| F 3        | 0.853949            | 0.264181                | 0.0930                      | 2.8406                             |
| F 4        | 1.457270            | 0.450828                | 0.0972                      | 4.6374                             |
| F 9        | 1.441010            | 0.445797                | 0.1179                      | 3.7824                             |
| F 18       | -1.321450           | -0.408807               | 0.1432                      | -2.8545                            |
| F 21       | -1.019430           | -0.315374               | 0.1561                      | -2.0200                            |
| CONSTANT   | 0.015909            | 0.000000                | 0.0366                      | 0.0000                             |
| Category label        | D.L.F. coefficients | Regression coefficients | Std. deviation (Regression) | Ratio Coefficient / Std. deviation |
| Age of client         |                     |                         |                             |                                    |
| Less than 23 years    | -4.413170           | -1.365270               | 0.4690                      | -2.9112                            |
| From 23 to 40 years   | 1.944030            | 0.601412                | 0.2750                      | 2.1873                             |
| From 40 to 50 years   | 1.169940            | 0.361938                | 0.2642                      | 1.3698                             |
| Over 50 years         | -0.425731           | -0.131706               | 0.3968                      | -0.3319                            |
| Family Situation      |                     |                         |                             |                                    |
| Single                | 0.009629            | 0.002979                | 0.3449                      | 0.0086                             |
| Married               | 1.427100            | 0.441492                | 0.2180                      | 2.0249                             |
| Divorced              | -3.502810           | -1.083640               | 0.1992                      | -5.4407                            |
| Widow                 | -6.459620           | -1.998370               | 0.9437                      | -2.1177                            |
| Seniority             |                     |                         |                             |                                    |
| 1 year or less        | -6.138460           | -1.899020               | 0.3039                      | -6.2495                            |
| From 1 to 4 years     | -8.631250           | -2.670200               | 0.3826                      | -6.9787                            |
| From 4 to 6 years     | 9.017780            | 2.789770                | 0.5665                      | 4.9243                             |
| From 6 to 12 years    | 2.082220            | 0.644162                | 0.4131                      | 1.5592                             |
| Over 12 years         | 9.972050            | 3.084990                | 0.6592                      | 4.6798                             |
| Salary domiciliation  |                     |                         |                             |                                    |
| Sal. domiciliated     | 4.923760            | 1.523230                | 0.1359                      | 11.2049                            |
| Sal. not domicil.     | -10.236200          | -3.166720               | 0.2826                      | -11.2049                           |
| Size of savings       |                     |                         |                             |                                    |
| No saving             | -1.401220           | -0.433488               | 0.0864                      | -5.0190                            |
| Less than 10 KF       | 3.659600            | 1.132150                | 0.2840                      | 3.9865                             |
| From 10 to 100 KF     | 6.438820            | 1.991940                | 0.6526                      | 3.0524                             |
| More than 100 KF      | 12.519100           | 3.872970                | 1.1221                      | 3.4515                             |
| Profession            |                     |                         |                             |                                    |
| executive             | 3.238490            | 1.001870                | 0.3962                      | 2.5287                             |
| employee              | 2.657760            | 0.822213                | 0.1786                      | 4.6033                             |
| other                 | -5.709430           | -1.766290               | 0.2101                      | -8.4074                            |
| Average outstanding   |                     |                         |                             |                                    |
| Less than 2 KF        | -12.409300          | -3.839000               | 0.4235                      | -9.0644                            |
| From 2 to 5 KF        | 2.567540            | 0.794304                | 0.1589                      | 4.9987                             |
| More than 5 KF        | 6.859870            | 2.122200                | 0.4772                      | 4.4468                             |
| Average transactions  |                     |                         |                             |                                    |
| Less than 10 KF       | -3.598390           | -1.113210               | 0.3180                      | -3.5007                            |
| From 10 to 30 KF      | -0.489471           | -0.151425               | 0.2069                      | -0.7318                            |
| From 30 to 50 KF      | 1.643420            | 0.508415                | 0.4045                      | 1.2571                             |
| More than 50 KF       | 3.306170            | 1.022810                | 0.2583                      | 3.9605                             |
| Number of withdrawals |                     |                         |                             |                                    |
| (1st category)        | 6.128640            | 1.895980                | 0.2382                      | 7.9587                             |
| (2nd category)        | -0.076000           | -0.023512               | 0.2219                      | -0.1060                            |
| (3rd category)        | -7.615890           | -2.356080               | 0.3085                      | -7.6361                            |
| Overdraft             |                     |                         |                             |                                    |
| (1st category)        | -1.481820           | -0.458423               | 0.3886                      | -1.1795                            |
| (2nd category)        | 1.125290            | 0.348125                | 0.2951                      | 1.1795                             |
| Checkbook             |                     |                         |                             |                                    |
| (1st category)        | 1.654080            | 0.511713                | 0.0658                      | 7.7814                             |
| (2nd category)        | -12.951800          | -4.006810               | 0.5149                      | -7.7814                            |
| CONSTANT              | 0.015909            | 0.000000                |                             |                                    |
Age of client
| Category label      | Linear discriminant function coefficient | Score function coefficient |
| Less than 23 years  | -4.413                                   | 0.00                       |
| From 23 to 40 years | 1.944                                    | 49.66                      |
| From 40 to 50 years | 1.170                                    | 43.62                      |
| Over 50 years       | -0.426                                   | 31.15                      |

Family Situation
| Category label | Linear discriminant function coefficient | Score function coefficient |
| Single         | 0.010                                    | 50.54                      |
| Married        | 1.427                                    | 61.61                      |
| Divorced       | -3.503                                   | 23.10                      |
| Widow          | -6.460                                   | 0.00                       |

Seniority
| Category label     | Linear discriminant function coefficient | Score function coefficient |
| 1 year or less     | -6.138                                   | 19.47                      |
| From 1 to 4 years  | -8.631                                   | 0.00                       |
| From 4 to 6 years  | 9.018                                    | 137.88                     |
| From 6 to 12 years | 2.082                                    | 83.69                      |
| Over 12 years      | 9.972                                    | 145.33                     |

Salary domiciliation
| Category label    | Linear discriminant function coefficient | Score function coefficient |
| Sal. domiciliated | 4.924                                    | 118.43                     |
| Sal. not domicil. | -10.236                                  | 0.00                       |

Size of savings
| Category label    | Linear discriminant function coefficient | Score function coefficient |
| No saving         | -1.401                                   | 0.00                       |
| Less than 10 KF   | 3.660                                    | 39.54                      |
| From 10 to 100 KF | 6.439                                    | 61.25                      |
| More than 100 KF  | 12.519                                   | 108.75                     |

Profession
| Category label | Linear discriminant function coefficient | Score function coefficient |
| executive      | 3.238                                    | 69.90                      |
| employee       | 2.658                                    | 65.37                      |
| other          | -5.709                                   | 0.00                       |

Average outstanding
| Category label  | Linear discriminant function coefficient | Score function coefficient |
| Less than 2 KF  | -12.409                                  | 0.00                       |
| From 2 to 5 KF  | 2.568                                    | 117.00                     |
| More than 5 KF  | 6.860                                    | 150.53                     |

Average transactions
| Category label   | Linear discriminant function coefficient | Score function coefficient |
| Less than 10 KF  | -3.598                                   | 0.00                       |
| From 10 to 30 KF | -0.489                                   | 24.29                      |
| From 30 to 50 KF | 1.643                                    | 40.95                      |
| More than 50 KF  | 3.306                                    | 53.94                      |

Number of withdrawals
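Comparing these tables suggests how the score coefficients are derived from the D.L.F. coefficients: within each variable, the coefficients are shifted so that the smallest category scores 0, then multiplied by a common factor (about 7.81 here; this factor is inferred from the tables, not stated by the guide):

```python
# Sketch of the apparent shift-and-scale behind the score coefficients.
# The scale value 7.8125 is an assumption read off the tables above.
def score_coefficients(dlf_coefs, scale=7.8125):
    low = min(dlf_coefs)
    return [round((c - low) * scale, 2) for c in dlf_coefs]

# Age of client: -4.413, 1.944, 1.170, -0.426
print(score_coefficients([-4.413, 1.944, 1.170, -0.426]))
```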
The user can define a rate called the Classification Error Tolerance (CET) in the
Parameters tab of the Score method. In this example, we keep the default value of 10%.
This rate drives the calculation of regions on the score function scale:
The low boundary, 528, has been chosen so as to assign 10.0% of the real good customers to
the weak-scores group (misclassified), and the high boundary, 655, so as to assign 9.7% of
the real bad customers to the high-scores group (misclassified).
These boundaries can be adjusted if the user wants to modify the misclassification rates.
A "red" region of low scores, containing most of the cases of the BAD category (therefore
correctly classified) and a percentage of GOOD cases not exceeding the CET (therefore
misclassified).
In this example: 64.5% of the real BAD are well assigned and 9.7% of the real GOOD are
misclassified.
An intermediary "orange" region between the boundaries of the red and green regions,
where the group assignment is left undecided. This region of indecision shrinks when the
user increases the CET.
In this example: 25.5% of the real BAD and 27.8% of the real GOOD are assigned to the
orange region.
Sometimes it is not necessary to keep this intermediary region, for example for direct
marketing campaigns. Then, by clicking on the Single score checkbox, we keep only two
regions (red and green) and one single boundary.
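The three-region decision rule can be sketched in a few lines (an illustration only, using the boundaries 528 and 655 of this example; the color names follow the guide):

```python
def score_region(score, low=528, high=655):
    """Assign a case to the Red / Orange / Green region using the two
    boundaries computed from the CET (528 and 655 in this example)."""
    if score < low:
        return "Red"       # predicted BAD
    if score > high:
        return "Green"     # predicted GOOD
    return "Orange"        # indecision region

print([score_region(s) for s in (400, 600, 700)])
```

With the Single score option, the function would collapse to a single boundary and two regions.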
Modifying the boundaries by using the scores table
This part of the user interface allows us to modify manually the CET and
therefore the boundaries.
Density curves
This graph draws the density curves of the real BAD and of the real GOOD customers.
The ROC curve plots the sensitivity against 1-specificity for every possible score boundary.
The closer the curve is to the upper left corner of the graph, the better the separation
between the two categories of the target variable.
When the densities are equal, the ROC curve coincides with the diagonal of the square.
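The construction of the curve can be sketched as follows (a generic illustration with invented scores; treating high scores as the GOOD, positive category is an assumed convention, the guide does not fix it):

```python
def roc_points(scores_good, scores_bad):
    """One (1-specificity, sensitivity) point per candidate boundary,
    with GOOD (high scores) taken as the positive category."""
    points = []
    for threshold in sorted(set(scores_good + scores_bad)):
        sens = sum(s >= threshold for s in scores_good) / len(scores_good)
        spec = sum(s < threshold for s in scores_bad) / len(scores_bad)
        points.append((1 - spec, sens))
    return points

pts = roc_points([640, 700, 720, 680], [500, 540, 610, 660])
```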
The IDT procedure produces decision trees from a data set. It is a discriminant procedure
for predicting the values of a categorical variable (the variable to explain, with K groups)
from a set of explanatory variables that may be categorical, ordinal or continuous.
The IDT procedure gives the user a choice of three well-established methods in Data
Mining: CHAID, C4.5 and C&RT. The model produced by the method is a decision tree,
which can be evaluated with a test sample or by cross-validation. The procedure
includes additional information that lets you refine the results: adjustment of the
a priori group membership probabilities, and the introduction of a cost matrix for
incorrect assignments.
The IDT procedure lets the user interactively manipulate the decision tree produced by the
method: pruning from the root, interactive segmentation of a node, and description of the
properties of a segmentation. The procedure also offers a fully interactive mode, in which
the construction of the tree is entirely driven by the user. Several supporting tools (a
list of the best segmentations, descriptive statistics, etc.) let you choose the tree which
best corresponds to the problem to be solved.
At all stages of the design conceived by the user, it is possible to output the reports in
HTML format: on the complete decision tree, or locally on each node, including a subset of
the analyzed database.
To illustrate this method, we use the same dataset as for the scoring function: the Credit
English.sba dataset.
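The core of one segmentation step can be sketched as follows (a toy illustration of impurity-based splitting in the spirit of C&RT; CHAID actually uses chi-square tests, and the data here are invented, not the Credit English.sba file):

```python
# Choose the categorical attribute whose split most reduces the
# Gini impurity of the variable to explain (here a toy GOOD_BAD).
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_attribute(rows, target):
    best = None
    for attr in rows[0]:
        groups = {}
        for row, y in zip(rows, target):
            groups.setdefault(row[attr], []).append(y)
        # weighted impurity of the child nodes after splitting on attr
        child = sum(len(g) / len(target) * gini(g) for g in groups.values())
        if best is None or child < best[1]:
            best = (attr, child)
    return best[0]

rows = [{"SALARY": "BANK", "SAVINGS": "NO"},
        {"SALARY": "BANK", "SAVINGS": "NO"},
        {"SALARY": "NONE", "SAVINGS": "NO"},
        {"SALARY": "NONE", "SAVINGS": "LT10KF"}]
print(best_attribute(rows, ["GOOD", "GOOD", "BAD", "BAD"]))
```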
154
Counts
237
231
%
50,64
49,36
468
100,00
JOB
Categories label
Executive
Employee
Other
Overall
AGE
Categories label
LT 23 years
GE 23 LT 40 years
GE 40 LT 50 years
GE 50 years
Overall
MARITAL
Categories label
Single
Married
Divorced
Widowed
Overall
SENIORITY
Categories label
LE 1 year
GT 1 LE 4 years
GT 4 LE 6 years
GT 6 LT 12 years
GT 12 years
Overall
SALARY
Categories label
SALARY AT THE BANK
NO SALARY
Overall
SAVINGS
Categories label
No saving
LT 10 KF Sav.
GE 10 LT 100 KF Sav.
GE 100 KF Sav.
Overall
Counts
77
237
154
%
16,45
50,64
32,91
468
100,00
Counts
98
308
62
%
20,94
65,81
13,25
468
100,00
Counts
88
150
122
108
%
18,80
32,05
26,07
23,08
CHECKIN ACCOUNT
Categories label
LT 2KF Account
GE 2 LT 5KF Account
GE 5KF Account
468
100,00
Overall
Counts
170
221
61
16
%
36,32
47,22
13,03
3,42
AVERAGE TRANSACTIONS
Categories label
Counts
154
LT 10 KF Trans.
71
GE 10 LT 30 KF Trans
129
GE 30 LT 50 KF Trans
114
GE 50 KF Trans.
%
32,91
15,17
27,56
24,36
468
100,00
Overall
468
100,00
Counts
199
47
69
66
87
%
42,52
10,04
14,74
14,10
18,59
Counts
171
161
136
%
36,54
34,40
29,06
468
100,00
468
100,00
Counts
316
152
%
67,52
32,48
468
100,00
Counts
370
58
32
8
%
79,06
12,39
6,84
1,71
468
100,00
WITHDRAWALS
Categories label
LT 40 With.
GE 40 LT 100 With.
GE 100 With.
Overall
155
%
43,16
56,84
Overall
100,00
468
CHEQUE AUTHORIZATION
Categories label
Counts
CHEQUE OK
415
NO CHEQUE
53
%
88,68
11,32
Overall
100,00
468
IDT 1
The IDT1 procedure prepares the data for the construction of the tree (procedure IDT2). In
particular, it handles the missing data of the selected variables and outputs a report of the
treatment of the missing data.
By default, you also get an automatic characterization of the variable to discriminate by
the set of the selected explanatory variables.
This characterization helps you make a better selection of the explanatory variables, for
example by removing all of those that have no connection with the variable to discriminate.
IDT 2
The IDT2 procedure constructs an initial segmentation tree as a function of the chosen
method (CHAID, C&RT, C4.5) and the associated parameters.
After the execution of the procedure, to the right of the method you have available a
graphical icon to handle the initial tree.
The CHAID algorithm is particularly well suited for the analysis of large datasets
and for a first exploration of the data.
Partitioning
By random sampling:
If you did not select a Samples definition variable in the IDT1 method, the choice of
the samples is done by random sampling. You have to define the percentages for the
learning set, the testing set and the pruning set (for C&RT only).
The Random sampling initialization parameter allows you to define different
samples of the same size.
By category value:
If you have chosen a Samples definition variable in the IDT1 method, the categories
are listed in the window to be assigned a specific sample: learning, testing and pruning
(for C&RT only).
IDT2 parameters
Type of analysis:
By default: Automatic
Automatic:
The tree is grown automatically with respect to the stopping criteria defined by the user.
Automatic and cross-validation:
The tree is grown automatically and the procedure evaluates the error by cross-validation.
In this case, you have to define the number of divisions (subsets) to be used for the cross-validation.
Interactive:
The tree is not grown at all. In the graphic interface, the user can develop it manually.
Thresholds:
Characteristics of the Tree: shows the properties of the decision tree produced by the
method: the number of nodes in the tree, the number of terminal nodes (leaves) and its
maximum depth. Also shown is the size of the sample used for training, for the test and,
if required, for the pruning.
Impact of the attributes: shows the role of each attribute in the elaboration of the tree.
The value indicated represents the weighted mean of the impact of each attribute on all
the candidate segmentations; less importance is given to the impacts measured on the
lower parts of the tree.
Confusion Matrix: cross-tabulates the predictions of the tree against the values observed
on the dependent variable. The matrix may be measured on the training sample, on the
test sample, or in cross-validation; these last two options are active if they were requested
during the parameter setting of the procedure.
Profile: presents the current confusion matrix in the form of a row profile (e.g. to
measure sensitivities), or in the form of a column profile (to measure specificities).
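The two profile views can be sketched as follows (a minimal illustration; the counts reuse the learning-sample confusion matrix of the scoring example, not IDT output):

```python
# Row and column profiles of a confusion matrix, as in the Profile view.
def row_profiles(matrix):
    return [[c / sum(row) for c in row] for row in matrix]

def column_profiles(matrix):
    totals = [sum(col) for col in zip(*matrix)]
    return [[c / totals[j] for j, c in enumerate(row)] for row in matrix]

confusion = [[150, 28],   # observed GOOD: predicted GOOD / predicted BAD
             [35, 138]]   # observed BAD
print(row_profiles(confusion)[0][0])  # share of observed GOOD predicted GOOD
```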
You can either save the tree by overwriting the previous version,
or save a new version of the tree for the same problem with a suitable title.
At any time you can reload into IDT a decision tree that you have saved. The different
versions are identified by the titles you have given them.
Warning! Changing the analysis parameters automatically deletes all the saves carried
out for the analyzed problem. If you want to keep your work permanently, you are
advised to use reports or exports.
Save the current tree
On the execution of the IDT procedure, a tree with the title Default Analysis is
automatically created. This is the tree shown when the IDT procedure starts up. The user
can personalize this tree and save the results of their changes permanently.
In general, it is possible to save any tree on which the user is working.
Procedure
1. Click on the File Save menu
2. IDT deletes the old version of the tree and replaces it with the new one.
Save a new version of the Tree
When working on an analysis, the user may wish to work in parallel on several
scenarios corresponding to different trains of thought: it is therefore possible to
save individual versions of the tree under different names.
Procedure
The user wants to save a new version of the Tree, from which they have pruned a branch.
1. Proceed with pruning a part of the Tree
2. Then click on the File -- Save as... menu
3. A dialogue box appears, asking the user to give a new title to this new version of
the Tree. This operation is obligatory, since the different versions are
distinguished by their titles.
4. Click on the OK button
Warning! By clicking on File -- Save, the user overwrites the version in memory.