Farhan Hafeez
9-15-2016
Table of Contents
1.1 Rationale
2 Methodology
2.1 Sample
2.2 Measures
2.3 Analysis
3 Results
3.1 Descriptive Statistics
3.4 Clustering
3.4.1 k-Means Clustering
3.4.3 Agglomerative Clustering
4 Conclusions
4.1 Key Findings
4.2 Implication
4.3 Limitations
Bibliography
1.1 Rationale
The motivation behind this research question is to solve a complex classification problem and to ease
the process of investigation. Classical techniques are still in use even though they are impractical
and inefficient. By solving this problem, the accuracy of machine learning algorithms will be
demonstrated and opportunities for further research will be highlighted.
2 Methodology
2.1 Sample
The sample is taken from the UCI Machine Learning Repository [2]. The data comprise real-valued
attributes of N=214 glass samples. The data were made public by B. German of the Home Office Forensic
Science Service and were prepared by Dr J Locke of the Central Research Establishment.
2.2 Measures
A total of 10 variables are available (nine explanatory variables and one response variable), and all
of them are considered in the analysis. The explanatory variables are all quantitative:
1. Refractive Index
2. Sodium
3. Magnesium
4. Aluminum
5. Silicon
6. Iron
7. Potassium
8. Calcium
9. Barium
The concentrations of all elements are expressed as weight percent of their respective oxides.
The response variable is the type of glass, with the following categories:
1. Building windows float processed
2. Building windows non-float processed
3. Vehicle windows float processed
4. Vehicle windows non-float processed
5. Containers
6. Tableware
7. Headlamps
The response variable has the 7 categories listed above, represented by the numbers 1 to 7 in the
order given. The data do not contain any samples of vehicle windows non-float processed glass
(type 4).
2.3 Analysis
Descriptive and bivariate analyses are used to demonstrate the complexity of the problem and to
analyze the relationship between the explanatory variables and the response variable. The association
between the response variable and the explanatory variables is determined from the contribution of
each explanatory variable in the random forest.
Supervised learning is used to predict the response variable, with a random forest as the learning
method. The data are split into training and testing sets. The random forest is trained on the
training set, and the results are evaluated through accuracy scores based on the confusion matrix.
The contribution of each variable to determining the type of glass is also calculated.
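The split and the accuracy score described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the project: the helper names, the 30% test fraction, the seed and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter


def split_indices(n_samples, test_fraction, seed=0):
    """Shuffle the sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples of labels[i] predicted as labels[j]."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Share of samples that lie on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 30% test fraction applied to the 214 samples.
train_idx, test_idx = split_indices(n_samples=214, test_fraction=0.3)

# Toy labels for illustration only; these are not the study's predictions.
labels = [1, 2, 3]
y_true = [1, 1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 1, 2, 2, 2, 1, 3, 1]

matrix = confusion_matrix(y_true, y_pred, labels)
print(len(train_idx), len(test_idx))
print(matrix)
print(f"accuracy = {accuracy(matrix):.3f}")
```

Rows of the matrix are true types and columns are predicted types, so every off-diagonal cell counts one kind of misclassification.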
Clustering is used to examine how accurately a machine learning algorithm can differentiate between
the types discussed. Three clustering methods are used:
1. K-Means Clustering
2. Spectral Clustering
3. Agglomerative Clustering
Results are evaluated against the true values of the response variable. Clustering is used here only
to demonstrate and evaluate the potential for future research and development.
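Comparing cluster assignments with the true glass types amounts to a cross-tabulation of type against cluster. The sketch below builds such a table and summarises it with cluster purity. Purity is an assumption added here for illustration, as the report compares the tables by inspection. The function names and toy data are hypothetical.

```python
from collections import Counter


def cross_tabulate(true_types, cluster_ids):
    """Count the samples for every (glass type, cluster) pair."""
    return Counter(zip(true_types, cluster_ids))


def purity(true_types, cluster_ids):
    """Share of samples that belong to the majority type of their cluster."""
    table = cross_tabulate(true_types, cluster_ids)
    best_per_cluster = {}
    for (_glass_type, cluster), n in table.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), n)
    return sum(best_per_cluster.values()) / len(true_types)


# Toy data: cluster 0 mixes types 1 and 2, cluster 1 isolates type 7.
true_types = [1, 1, 2, 2, 2, 7, 7]
cluster_ids = [0, 0, 0, 0, 0, 1, 1]

table = cross_tabulate(true_types, cluster_ids)
score = purity(true_types, cluster_ids)

for (glass_type, cluster), n in sorted(table.items()):
    print(f"type {glass_type}, cluster {cluster}: {n} samples")
print(f"purity = {score:.3f}")
```

A purity of 1 would mean that every cluster contains a single glass type; mixed clusters such as cluster 0 in the toy data pull the score down.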
3 Results
3.1 Descriptive Statistics
The data contain six types of glass. The distribution of samples across the categories can be seen
in the frequency distribution; the number of samples in each category is shown in Table 1.
Type of Glass    Number of Samples
1                70
2                76
3                17
5                13
6                9
7                29
Table 1: Distribution of Samples
There are 70 samples of type 1 and 76 of type 2, which together account for 146 of the 214 samples.
Type 6 has the fewest samples, only 9, which may affect the accuracy of our machine learning
algorithm.
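The imbalance can be made concrete with a short calculation. The sketch below uses only the counts from Table 1; the variable names are illustrative.

```python
# Sample counts per glass type, copied from Table 1.
counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

total = sum(counts.values())

# Share of each type as a percentage of all samples.
shares = {glass_type: 100 * n / total for glass_type, n in counts.items()}

# Types 1 and 2 together dominate the data set.
majority_share = shares[1] + shares[2]

# The type with the fewest samples.
rarest_type = min(counts, key=counts.get)

for glass_type, share in shares.items():
    print(f"Type {glass_type}: {counts[glass_type]:3d} samples ({share:.1f}%)")
print(f"Types 1 and 2 combined: {majority_share:.1f}% of {total} samples")
print(f"Rarest type: {rarest_type}")
```

Types 1 and 2 make up roughly two thirds of the data, while type 6 contributes only about 4%.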
Figure 1 makes clear that the refractive index of each type overlaps with that of the other five, so
the six categories of glass cannot be separated by refractive index alone. We now look at the other
variables. Figures 2 and 3 show the distribution of each explanatory variable across the different
types of glass.
From Figure 2, we can see that the sodium (Na) content is similar for types 1, 2, 3 and 5, and
likewise similar for types 6 and 7; comparable overlap exists for the other oxide contents. From
Figure 3 we again conclude that the glasses cannot be separated by the composition of their
constituents either.
[Table: confusion matrix of the random forest predictions by type of glass; recoverable cell values: 19, 22]
The table above shows that 12 samples were wrongly classified. Five samples of type 1 glass were
classified as type 2, and three samples of type 2 as type 1. There was only one sample of type 3
glass in the prediction set, and it was wrongly classified as type 1. The importance of each
explanatory variable in determining the type of glass is as follows:
[Figure: Attribute Importance, bar chart with a vertical axis from 0% to 20%. Bar values in the order shown: 18%, 18%, 13%, 13%, 10%, 8%, 8%, 7%, 5%]
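The nine importance values can be checked for consistency and summarised cumulatively. The sketch below uses only the bar values in the order shown; the variable names are illustrative.

```python
from itertools import accumulate

# Bar values from the Attribute Importance chart, in the order shown (percent).
importances = [18, 18, 13, 13, 10, 8, 8, 7, 5]

# The values should add up to 100% if they are shares of total importance.
total = sum(importances)

# Running total: importance covered by the top 1, 2, 3, ... variables.
cumulative = list(accumulate(importances))

# Smallest number of variables that together cover at least half.
n_for_half = next(i + 1 for i, c in enumerate(cumulative) if c >= 50)

print(f"total = {total}%")
print(f"cumulative = {cumulative}")
print(f"variables needed for 50% of the importance: {n_for_half}")
```

The two leading variables account for 36% between them, and no single variable dominates the model.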
3.4 Clustering
3.4.1 k-Means Clustering
Clustering is performed with the k-Means method. The elbow method is used to identify the optimal
number of clusters. There are five main bends in the elbow curve shown in Figure 5, at n = 2, 3, 4, 5
and 6 clusters. Since we know that there are six types of glass, n = 6 is used for clustering.
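The elbow method plots the within-cluster sum of squares against the number of clusters and looks for the point where adding clusters stops paying off. The sketch below is a minimal pure-Python version of Lloyd's algorithm, not the library routine used in the project. The function names, the simplistic initialisation from the first k points and the toy 2-D points are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centroids = list(points[:k])  # simplistic, deterministic initialisation
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Toy data with two obvious groups, near (0, 0) and near (10, 10).
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(f"k = {k}: within-cluster sum of squares = {value:.3f}")
```

On the toy data the sum of squares collapses between k = 1 and k = 2 and then falls only slightly, which is the single clear bend the method looks for. A curve with several bends, as reported for the glass data, gives no such unambiguous choice.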
A scatter plot showing the division into clusters is given in Figure 6. The clusters overlap with
one another.
Clusters by Type of Glass (k-Means clustering), values as extracted:
Type 1: 0 48 0 0 0 22
Unlabelled rows: 0 61 7 0 / 0 13 0 0 / 0 0 3 10 / 0 0 0 / 7 23 / 1 0 2
From the table above it can be concluded that k-Means clustering was only able to separate types 5
and 7. It placed 48 samples of type 1, 61 samples of type 2 and 13 samples of type 3 together in a
single cluster. Hence the algorithm is unable to differentiate between types 1, 2 and 3.
Clusters by Type of Glass (spectral clustering), values as extracted:
Type 1: 18 38 0 14 0 0
Type 2: 17 42 0 15 2 0
Unlabelled rows: 7 0 0 / 0 5 2 / 0 2 0 / 1 21 / 0 0 0
Like k-Means clustering, spectral clustering is unable to distinguish between types 1, 2 and 3. It
has successfully identified only type 7 glass.
Clusters by Type of Glass (agglomerative clustering), values as extracted:
Type 1: 0 54 16 0 0 0
Type 2: 0 61 0 5 6
Type 3: 0 14 0 0 0
Type 5: 3 0 8 0
Type 6: 0 2 3 0
Type 7: 2 3 22 1 0
The same problem is observed here as in the two previous cases. Only type 7 glass has been
identified correctly.
4 Conclusions
4.1 Key Findings
The project used a random forest to determine the type of glass from the composition of 8 oxides and
the refractive index for N=214 samples gathered by Dr J Locke of the Central Research Establishment.
The random forest predicted the type of glass with an accuracy of 81.53%. Refractive index and
magnesium oxide content were the two most important variables in the predictive model, each
contributing 18% to the prediction of glass type. The least important variable was iron oxide
content, with an importance of 5% in the random forest.
Unsupervised learning failed on this problem with all three methods. Each technique failed to
distinguish between the first three types, and only type 7 samples were correctly identified in each
case.
4.2 Implication
A basic model for determining glass types has been established. It can help the field of forensic
science stay up to date with modern techniques. The technique can also speed up and ease the process
of investigation and lead to clear-cut conclusions.
4.3 Limitations
The main limitation is the small number of samples for glass types 3, 5 and 6. With a larger data
set, the random forest could be trained to a higher accuracy and the results evaluated with more
certainty, because more data would be available for validation.
The clustering algorithms are used here with their default settings, as per the guidelines of the
course. Tuning the parameters of these algorithms may produce better results, but doing so is beyond
the scope of this course and requires more time.
Bibliography
[1] I. W. Evett and E. Spiehler, "Rule induction in forensic science," KBS in Government, Online
Publications, pp. 107-118, 1987.
[2] M. Lichman, "UCI Machine Learning Repository," University of California, Irvine, School of
Information and Computer Sciences, 2013.