
Update for 2018/01/20 v1.

Tests Brooks V3.6

The tests were applied using a validation model containing 4 groups and 31 elements
in total.

To use the test model, simply enter the path of the validation file:
original <- "/home/leandro/R/brooks/data/Metrics.xlsx"

Once you have added the validation model, indicate the path of the folder that holds
the files containing the tests used by the Brooks tool:
dir <- "/home/leandro/R/brooks/data/GD2-285/"

The script compares every group of every file in the directory against the validation
template. Statistical calculations are performed to evaluate the probability of success
for each model in the test group. The combinations of groups with the lowest
probabilities are identified by an evolutionary heuristic, and the model with the smallest
sum of probabilities over its groups is selected as the best model, that is, the model
closest to the validation model.

Let’s see an example to illustrate.

Figure 1: The "original" line corresponds to the validation model, and the "Dynamic
Tree Cut" line to one of the strategies produced by the Brooks tool. The colors
indicate the group assigned to each element in the two approaches.

To determine the probability of success, a hypergeometric test was used for each group
identified by the test set.

Take the turquoise group of line 2, for example: what is the probability of it matching
the brown group of line 1?
tma.sty notes Peter McFarlane W2770175

From a mathematical perspective, we have an initial population of 31 elements, 10 brown
and 21 non-brown. From this population a sample of 16 elements (the turquoise set) is
drawn. What is the probability of getting 10 hits?

$$
P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} \tag{0.1}
$$

Here N = 31 is the population size, M = 10 the number of brown elements, n = 16 the
sample size and k = 10 the number of hits. Note that the lower the probability, the less
likely it is that the match occurred at random, which validates the test.
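
The calculation can be reproduced in R with the built-in `dhyper` function, which evaluates the hypergeometric point probability. This is a minimal sketch using only the values stated above; that the original script uses `dhyper` is an assumption.

```r
# Values from the worked example above.
N <- 31   # population size
M <- 10   # brown elements in the population
n <- 16   # size of the turquoise sample
k <- 10   # brown elements found in the sample (hits)

# dhyper(x, m, n, k): x hits, m "success" elements, n "failure" elements, k draws.
p <- dhyper(k, M, N - M, n)

# The same quantity written out term by term from the binomial coefficients.
p_manual <- choose(M, k) * choose(N - M, n - k) / choose(N, n)

print(p)   # about 1.8e-04
```

The result need not coincide with the corresponding entry of Table 1, because the exact inputs used for that table are not given in the text.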

In the case of Figure 1 we have 3 test groups against 4 validation groups. The probability
was calculated for the 12 possible combinations, and the results were arranged in a matrix,
as shown in Table 1.

           Blue          Brown         Green         Red
turquoise  0.11523220    0.0003014622  1.0000000000  0.28804711
blue       0.06746087    1.0000000000  0.0005098257  0.09138415
grey       1.0000000000  1.0000000000  1.0000000000  0.02419355

Table 1: Columns correspond to the groups of the validation set and rows to the groups of
the test set. In this example (highlighted), the turquoise test group is most closely
related to the Brown validation group, blue to Green, and grey to Red.
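
A matrix of this kind can be built from two label vectors, one holding the validation group of each element and one holding its test group. The sketch below is an illustration, not the Brooks script: the function name `prob_matrix` and the six-element data are hypothetical, and assigning 1 to pairs that share no element is an assumption based on the 1.0 entries in Table 1.

```r
# Hypothetical labels for six elements (not data from the report).
valid <- c("A", "A", "A", "B", "B", "B")   # validation groups
test  <- c("x", "x", "x", "y", "y", "y")   # groups from one clustering strategy

prob_matrix <- function(test, valid) {
  stopifnot(length(test) == length(valid))
  N <- length(valid)                 # population size
  overlap <- table(test, valid)      # hits k for every (test, validation) pair
  M <- colSums(overlap)              # size of each validation group
  n <- rowSums(overlap)              # size of each test group
  # Start from 1 everywhere (assumption: no shared element means probability 1).
  p <- matrix(1, nrow(overlap), ncol(overlap), dimnames = dimnames(overlap))
  for (i in seq_len(nrow(overlap))) {
    for (j in seq_len(ncol(overlap))) {
      k <- overlap[i, j]
      if (k > 0) p[i, j] <- dhyper(k, M[j], N - M[j], n[i])
    }
  }
  p
}

p_toy <- prob_matrix(test, valid)
print(p_toy)
```

In this toy case the pair (x, A) shares all three elements, so its entry is 1 / choose(6, 3) = 0.05, while (x, B) shares none and stays at 1.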

A genetic algorithm was used to identify the best pairs (i, j) without repeating elements,
that is, the pairing of test groups and validation groups whose sum of probabilities is
smallest. This minimum sum is used as the classification score for comparing the tests.
One point is added for each missing group, as in the example, which contains 3 test
groups for 4 validation groups.
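
For a matrix as small as Table 1 the best pairing can also be found by trying every assignment. The sketch below uses exhaustive search in place of the genetic algorithm described above, so it illustrates the scoring rule rather than the original implementation. The function name `best_assignment` is hypothetical, and the code assumes there are no more test groups than validation groups.

```r
# Table 1 values: rows are test groups, columns are validation groups.
p <- matrix(c(0.11523220, 0.0003014622, 1.0000000000, 0.28804711,
              0.06746087, 1.0000000000, 0.0005098257, 0.09138415,
              1.0000000000, 1.0000000000, 1.0000000000, 0.02419355),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("turquoise", "blue", "grey"),
                            c("Blue", "Brown", "Green", "Red")))

# Recursively try every assignment of rows to distinct columns
# and keep the one with the smallest sum of probabilities.
best_assignment <- function(p, row = 1, used = integer(0)) {
  if (row > nrow(p)) return(list(score = 0, cols = integer(0)))
  best <- list(score = Inf, cols = integer(0))
  for (j in setdiff(seq_len(ncol(p)), used)) {
    rest <- best_assignment(p, row + 1, c(used, j))
    score <- p[row, j] + rest$score
    if (score < best$score) best <- list(score = score, cols = c(j, rest$cols))
  }
  best
}

res <- best_assignment(p)
penalty <- ncol(p) - nrow(p)              # one point per missing group
score <- res$score + penalty
pairs <- setNames(colnames(p)[res$cols], rownames(p))

print(pairs)   # turquoise -> Brown, blue -> Green, grey -> Red
print(score)   # about 1.025005
```

Exhaustive search grows factorially with the number of groups, which is presumably why a heuristic is used when many strategies and larger group counts have to be scored.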

Results

Among the 48 strategies tested, two shared the best score.

# NUMBER 1

Apply Clustering : Collum
Distance Metric : euclidean
Type of analysis : NbClust analysis
Linkage Algorithm : average
cutting k : 4

group_valid group_teste
"Red" "turquoise"
"Green" "yellow"
"Brown" "brown"
"Blue" "blue"

Hit rate : 0.2331454

# NUMBER 2

Apply Clustering : Collum
Distance Metric : euclidean
Type of analysis : NbClust analysis
Linkage Algorithm : single
cutting k : 4

group_valid group_teste
"Red" "turquoise"
"Green" "yellow"
"Brown" "brown"
"Blue" "blue"

Hit rate : 0.2331454

The two strategies distribute the elements among the groups in exactly the same way.
Figure 2 illustrates the comparison of the groups.

Figure 2: Distribution of the groups of the two best clustering strategies identified in
the test.

           Blue          Brown         Green         Red
turquoise  1.0000000000  1.0000000000  1.0000000000  1.550093e-08
blue       0.01439099    1.0000000000  0.0005098257  1.0000000000
brown      0.25021850    4.433267e-06  0.104792362   1.0000000000
yellow     1.0000000000  1.0000000000  0.218750000   1.0000000000

Table 2: Best arrangement of elements identified by the genetic algorithm (GA).
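
As a quick check, summing the Table 2 entries for the four pairs listed for the two best strategies reproduces the reported value of 0.2331454, which suggests that the "Hit rate" line is the classification score. The sketch below only adds the four numbers; the variable names are illustrative.

```r
# Table 2 entries for the reported pairs (validation group = test group).
pair_prob <- c(
  "Red = turquoise" = 1.550093e-08,
  "Green = yellow"  = 0.218750000,
  "Brown = brown"   = 4.433267e-06,
  "Blue = blue"     = 0.01439099
)

# Four test groups against four validation groups: no missing-group penalty.
score <- sum(pair_prob)
print(signif(score, 7))   # 0.2331454
```

Note that the blue test group is paired with Blue (0.01439099) rather than with its smallest entry, Green (0.0005098257), because the latter would force yellow onto a pair with probability 1 and raise the total.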

Figure 3 shows the score distribution of all 48 tests performed, with the two best
approaches highlighted.

Figure 3: Score distribution of all 48 tests, with the two best clustering strategies
identified in the test highlighted.
