
Coursera

Predicting types of glass based on their composition
Data Analysis and Interpretation Capstone Project

Farhan Hafeez
9-15-2016

Table of Contents
1 Introduction to Research Question
1.1 Rationale
1.2 Implication of Research Question
2 Methodology
2.1 Sample
2.2 Measures
2.3 Analysis
3 Results
3.1 Descriptive Statistics
3.2 Bi-Variate Analysis
3.3 Random Forest
3.4 Clustering
3.4.1 k-means Clustering
3.4.2 Spectral Clustering
3.4.3 Agglomerative Clustering
4 Conclusions
4.1 Key Findings
4.2 Implication
4.3 Limitations
4.4 Future Directions
Bibliography

1 Introduction to Research Question


This research question is derived from the field of forensic science. Glass recovered from the scene
of a murder or robbery is significant evidence. Glass fragments are often found attached to the
clothes of criminals who have committed a robbery by breaking the window of a house or a car [1].
The refractive index and composition of these fragments can be easily determined. It would be very
helpful to an investigation if these data could be analyzed and used to infer which type of glass the
fragments belong to.
This research question will help in developing a prediction model that facilitates the classical
process of determining glass type.

1.1 Rationale
The motivation behind this research question is to solve a complex classification problem and to ease
the process of investigation. The classical techniques still in use are impractical and inefficient.
By solving this problem, the accuracy of machine learning algorithms will be demonstrated and
opportunities for further research will be highlighted.

1.2 Implication of Research Question


The results will help in determining types of glass with higher confidence and provide a way forward
for real-world classification problems, not only in forensic science but also in other sciences.

2 Methodology
2.1 Sample
The sample is taken from the UCI Machine Learning Repository [2]. The data comprise real-valued
attributes of N=214 glass samples. The data were made public by B. German of the Home Office Forensic
Science Service and were prepared by Dr J Locke of the Central Research Establishment.

2.2 Measures
A total of 10 variables are available, nine explanatory variables and one response variable, and all
of them are considered for analysis. The explanatory variables are all quantitative and are as follows:
1. Refractive Index
2. Sodium
3. Magnesium
4. Aluminum
5. Silicon
6. Iron
7. Potassium
8. Calcium
9. Barium
The concentrations of all the elements are expressed as the weight percent of their respective
oxides.
The response variable is the type of glass, with the following categories:
1. Building windows float processed
2. Building windows non-float processed
3. Vehicle windows float processed
4. Vehicle windows non-float processed
5. Containers
6. Tableware
7. Headlamps
The response variable has the 7 categories listed above, represented by the numbers 1 to 7 in the
order given. The data do not contain any sample of type 4, vehicle windows non-float processed
glass.

2.3 Analysis
Descriptive and bi-variate analyses are used to demonstrate the complexity of the problem and to
analyze the relationship of the explanatory variables with the response variable. The association
between the response variable and the explanatory variables is determined from the contribution of
each explanatory variable in the random forest.
Supervised learning is used to predict the response variable, with random forest as the method. The
data are split into training and testing data sets. The random forest is trained on the training data
set, and the results are evaluated through accuracy scores based on the confusion matrix.
The contribution of each variable in determining the type of glass is also calculated.
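The supervised workflow described above can be sketched as follows. This is a minimal illustration, not the code used in the project: it assumes scikit-learn, and it uses a synthetic stand-in data set with the same shape as the glass data (214 samples, 9 features, 6 classes) instead of the UCI file. The 70/30 split and the 25 trees are taken from the results section; the random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# ASSUMPTION: synthetic stand-in for the glass data (not the real measurements).
X, y = make_classification(
    n_samples=214, n_features=9, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# 70% of the samples for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Random forest with 25 trees, trained on the training set only.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

# Evaluation on the held-out test set.
y_pred = forest.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=np.unique(y))
accuracy = accuracy_score(y_test, y_pred)

# Contribution of each explanatory variable (fractions that sum to 1).
importances = forest.feature_importances_

print("train/test sizes:", len(X_train), len(X_test))
print("accuracy:", round(accuracy, 4))
print("confusion matrix:\n", cm)
print("importances:", np.round(importances, 3))
```

The accuracy is the sum of the diagonal of the confusion matrix divided by the number of test samples, which is why the two evaluation outputs always agree.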
Clustering is used to examine how accurately an unsupervised machine learning algorithm can
differentiate between the discussed types. Three clustering methods are used:
1. K-Means Clustering
2. Spectral Clustering
3. Agglomerative Clustering
Results are evaluated against the true values of the response variable. Here clustering is used only
to demonstrate and evaluate the potential for future research and development.
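A minimal sketch of this clustering comparison is given below. It assumes scikit-learn and uses synthetic blobs as a stand-in for the glass data; the standardization step and the random seeds are assumptions and are not stated in the report. Each method is run with 6 clusters and otherwise default settings, and the result is cross-tabulated against the true labels.

```python
from collections import Counter

from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data with 214 samples, 9 features, 6 groups.
X, y_true = make_blobs(n_samples=214, n_features=9, centers=6, random_state=0)

# ASSUMPTION: features are standardized before clustering.
X = StandardScaler().fit_transform(X)

models = {
    "k-means": KMeans(n_clusters=6, n_init=10, random_state=0),
    "spectral": SpectralClustering(n_clusters=6, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=6),
}

# Cross tab: how many samples of each true type fall into each cluster.
cross_tabs = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    cross_tabs[name] = Counter(zip(y_true.tolist(), labels.tolist()))

for name, tab in cross_tabs.items():
    print(name)
    for (true_type, cluster), count in sorted(tab.items()):
        print(f"  type {true_type} -> cluster {cluster}: {count}")
```

Cluster numbers are arbitrary, so a method is judged by whether each true type ends up concentrated in a single cluster, not by whether the cluster number equals the type number.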

3 Results
3.1 Descriptive Statistics
The data contain six types of glass. The distribution of samples among the categories can be seen in
the frequency distribution; the number of samples for each type of glass is shown in Table 1.
Type of Glass   Number of Samples
1               70
2               76
3               17
5               13
6               9
7               29
Table 1: Distribution of Samples

There are 70 samples of type 1 and 76 of type 2, which together account for 146 of the 214 samples.
Type 6 has the fewest samples, only 9; this may affect the accuracy of our machine learning
algorithm.
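The class imbalance can be quantified directly from the counts in Table 1. The short sketch below uses only those counts; the variable names are illustrative.

```python
# Sample counts per glass type, copied from Table 1.
counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

total = sum(counts.values())
shares = {glass_type: n / total for glass_type, n in counts.items()}

# Types 1 and 2 (building windows) taken together.
building_windows = counts[1] + counts[2]

# Smallest class and the ratio between the largest and smallest class.
smallest_type = min(counts, key=counts.get)
imbalance_ratio = max(counts.values()) / min(counts.values())

print("total samples:", total)
for glass_type, share in shares.items():
    print(f"type {glass_type}: {share:.1%}")
print("types 1 and 2 together:", building_windows, f"({building_windows / total:.1%})")
print("smallest class: type", smallest_type)
print("largest/smallest ratio:", round(imbalance_ratio, 2))
```

Types 1 and 2 make up roughly two thirds of the data, so a 30% test split contains only a handful of samples from the smaller classes.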

3.2 Bi-Variate Analysis


A problem arises when we try to classify types of glass based on a single explanatory variable.
Refractive index ranges from 1.5115 to 1.53393. Figure 1 shows how refractive index varies from one
type of glass to another.

Figure 1: Refractive Index and Glass Type

It is clear from Figure 1 that the refractive index range of each type overlaps with those of the
other five; the six categories of glass are therefore not separable by refractive index alone. Now
we will look at the other variables. Figure 2 and Figure 3 show the distribution of each explanatory
variable with respect to the different types of glass.

Figure 2: Glass Type and Composition 1

Figure 3: Glass Type and Composition 2

From Figure 2, we can see that the sodium (Na) content is similar for types 1, 2, 3 and 5, and
similar for types 6 and 7. Similar overlaps exist for the contents of the other oxides. From Figure 3
we again conclude that the glass types are not separable by the content of individual constituents.

3.3 Random Forest


All explanatory variables are considered for the random forest. 70% of the data (149 samples) is used
for training, and the remaining 30% (65 samples) is used for testing. The forest consists of 25
trees. The accuracy of the model on the test set is found to be 81.53%, that is, 53 of the 65 test
samples are classified correctly. The confusion matrix is as follows:

         Type 1  Type 2  Type 3  Type 5  Type 6  Type 7
Type 1   19
Type 2           22
Type 3
Type 5
Type 6
Type 7
Table 2: Confusion Matrix

The above table shows that 12 samples were wrongly classified. Five samples of type 1 glass were
classified as type 2, and three samples of type 2 as type 1. There was only one sample of type 3
glass in the test set, and it was wrongly classified as type 1. The importance of the explanatory
variables in determining the type of glass is as follows:

Figure 4: Importance of Attributes (bar chart of attribute importance; bar values in descending
order: 18%, 18%, 13%, 13%, 10%, 8%, 8%, 7%, 5%)
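The reported split sizes and accuracy can be cross-checked with a few lines of arithmetic. The sketch below assumes the test set size is obtained by rounding 30% of 214 up to the next whole sample, which is how a common library split behaves; the report does not state the rounding rule.

```python
import math

n_samples = 214

# ASSUMPTION: the 30% test share is rounded up to a whole number of samples.
n_test = math.ceil(0.30 * n_samples)
n_train = n_samples - n_test

# The report states that 12 test samples were wrongly classified.
misclassified = 12
correct = n_test - misclassified
accuracy = correct / n_test

print("training samples:", n_train)
print("test samples:", n_test)
print("correctly classified:", correct)
print(f"accuracy: {accuracy:.4%}")
```

With 65 test samples and 12 errors, the accuracy is 53/65, about 0.8154, which agrees with the reported 81.53% when the percentage is truncated to two decimals.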

3.4 Clustering
3.4.1 k-means Clustering
Clustering is performed with the k-means method. The elbow method is used to identify the optimal
number of clusters. There are 5 main bends in the elbow curve shown in Figure 5, at n=2, 3, 4, 5 and
6 clusters. Since we know that there are 6 types of glass, n=6 is used for clustering.

Figure 5: Elbow Graph
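An elbow curve of this kind can be produced along the following lines. This is a sketch under assumptions: it uses scikit-learn, synthetic blobs in place of the glass data, and the within-cluster sum of squares (`inertia_`) as the quantity on the vertical axis, which may differ from the distance measure plotted in Figure 5.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# ASSUMPTION: synthetic stand-in data with 214 samples, 9 features, 6 groups.
X, _ = make_blobs(n_samples=214, n_features=9, centers=6, random_state=0)

# Fit k-means for k = 1..9 and record the within-cluster sum of squares.
ks = list(range(1, 10))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# The drop between consecutive k values; a bend in the curve is where the drop shrinks sharply.
drops = [before - after for before, after in zip(inertias, inertias[1:])]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
print("drops:", [round(d, 1) for d in drops])
```

When the curve shows several bends, as reported here, the elbow method alone does not settle the number of clusters, which is why the known number of glass types is used instead.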

A scatter plot showing the division into clusters is given in Figure 6. There is overlap among
several of the clusters.

Figure 6: Scatter Plot for K-Means Clustering

The result of k-means clustering is not satisfactory, as shown in Table 3.

Clusters
1 2 3
Type of Glass
1   0 48 0 0 0 22
    0 61 7 0
    0 13 0 0
    0 0 3 10
    0 0 0
    7 23
    1 0 2
Table 3: Cross Tab for K-Means Clustering

From the above table it can be concluded that k-means clustering was only able to separate types 5
and 7. 48 samples of type 1 glass, 61 samples of type 2 and 13 samples of type 3 are grouped
together in a single cluster. Hence the algorithm is unable to differentiate between types 1, 2 and 3.

3.4.2 Spectral Clustering


Spectral clustering is used to determine if unsupervised learning can be used for the addressed
problem. The result of spectral clustering with n=6 clusters is shown in Table 4.
Clusters
3 4 5
Type of Glass
1   18 38 0 14 0 0
2   17 42 0 15 2 0
    7 0 0
    0 5 2
    0 2 0
    1 21
    0 0 0
Table 4: Cross Tab for Spectral Clustering

Just like k-means clustering, spectral clustering is unable to distinguish between types 1, 2 and 3.
It has successfully separated only type 7 glass.

3.4.3 Agglomerative Clustering


For the last clustering method, we will utilize agglomerative clustering. The results can be seen in
Table 5.
Clusters
3 4 5
Type of Glass
1   0 54 16 0 0 0
2   0 61 0 5 6
3   0 14 0 0 0
5   3 0 8 0
6   0 2 3 0
7   2 3 22 1 0
Table 5: Cross Tab for Agglomerative Clustering

The same problem is observed here as in the two previous cases. Only type 7 glass has been separated
correctly.

4 Conclusions
4.1 Key Findings
The project used random forest to determine types of glass based on the content of 8 oxides and the
refractive index for N=214 samples prepared by Dr J Locke of the Central Research Establishment.
Random forest was able to predict the type of glass with an accuracy of 81.53%. Refractive index and
the oxide content of magnesium were found to be the two most important variables in the predictive
model, each contributing 18% to the prediction of glass type. The least important variable is the
oxide content of iron, with an importance of 5% in the random forest.
Unsupervised learning failed on this problem for all three methods. Each technique failed to
distinguish between the first 3 types. Only type 7 samples were correctly identified in each case.

4.2 Implication
A basic model for determining glass types has been established. This will help the field of forensic
science remain up to date and in line with modern advancements and techniques. The technique will
also help speed up and ease the process of investigation and support clear-cut conclusions.

4.3 Limitations
The main problem faced is that there are few samples for glass types 3, 5 and 6. With a larger data
set we could train the random forest to higher accuracy and evaluate the results with more certainty,
as more data would be available for validation.
The clustering algorithms are used here with their default settings, as per the guidelines of the
course. Tuning the parameters of these algorithms may generate better results, but doing so is
outside the scope of this course and requires more time.

4.4 Future Directions


A base case for future research has been developed; this study provides a basis for the application
of machine learning to complex real-world problems. The next step is to train with more types and
samples of glass and to explore further machine learning techniques to increase the accuracy and
confidence of predictions.

Bibliography
[1] I. W. Evett and E. Spiehler, "Rule induction in forensic science," KBS in Government, Online
Publications, pp. 107-118, 1987.
[2] M. Lichman, "UCI Machine Learning Repository," University of California, Irvine, School of
Information and Computer Sciences, 2013.
