one determines the threshold for the percentage of distinct
values out of the total number of samples [3]. The default value
for freqCut is 95/5 and the default for uniqueCut is 10. The user
may provide values other than the defaults if so intended, and
the thresholds are adjusted accordingly.
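To make the parameterization concrete, the sketch below shows how these thresholds can be supplied to caret's nearZeroVar function; the matrix name dtm_matrix is a hypothetical placeholder and the custom threshold values are arbitrary illustrations, not values used in this research.

```r
# A minimal sketch, assuming 'dtm_matrix' (a hypothetical name) is a
# document-term matrix with documents as rows and words as columns.
library(caret)

# Defaults: freqCut = 95/5 (ratio of the most common value to the second
# most common value) and uniqueCut = 10 (percentage of distinct values
# out of the total number of samples).
nzv <- nearZeroVar(dtm_matrix)

# User-specified thresholds shift what counts as near zero variance.
nzv_custom <- nearZeroVar(dtm_matrix, freqCut = 90/10, uniqueCut = 20)

# Drop the flagged columns, guarding against the empty-index case.
if (length(nzv) > 0) {
  dtm_matrix <- dtm_matrix[, -nzv]
}
```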
The other approach to reducing the dimensionality is to
discard terms that do not occur in the whole text corpus at
least as often as a user-specified threshold. The R statistical
language provides the findFreqTerms function in the tm package,
which identifies frequent terms in a document-term or term-document
matrix. The function takes two important parameters,
lowfreq and highfreq, and returns the words that occur at least
lowfreq times and at most highfreq times. The default values for
lowfreq and highfreq are 0 and Inf, which means all words are
returned by the function if the user does not specify thresholds
[4] [5]. As the default values do not suffice, the user is expected
to experiment with various lowfreq and highfreq values in order
to identify values that yield acceptable performance measurements.
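The sketch below illustrates this usage, assuming dtm is a DocumentTermMatrix produced by the tm package; the threshold values shown are arbitrary illustrations.

```r
# A minimal sketch, assuming 'dtm' is a DocumentTermMatrix built with
# the tm package; the thresholds below are arbitrary illustrations.
library(tm)

# With the defaults (lowfreq = 0, highfreq = Inf) every term is returned.
all_terms <- findFreqTerms(dtm)

# Terms occurring at least 5 and at most 100 times across the corpus.
frequent_terms <- findFreqTerms(dtm, lowfreq = 5, highfreq = 100)

# Keep only the frequent terms as features.
dtm_reduced <- dtm[, frequent_terms]
```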
While it is evident that both the nearZeroVar and
findFreqTerms functions can be used for dimensionality
reduction in a text corpus, the research covered in this paper
focuses on identifying the better technique of the two
from a user standpoint. In order to compare the functions, the
research takes as performance metrics the reduction in the number of
features; the accuracy, kappa, Mean Absolute Error (MAE), F
Score, and Area Under the Curve (AUC) obtained with models
built using the reduced feature sets; and the ease of
implementing the functions. The goal
of the research is to identify the more quickly implementable and
effective function of the two in order to accomplish
optimal dimensionality of the text corpus with respect to the
models' performance metrics.
The rest of this paper is organized as follows. Section 2
details the previous work undertaken in this area of research
and the research motivation that makes the research covered
in this paper unique. Section 3 discusses the
data used, the experimental setup, and the preliminaries of the
tools and techniques used. Section 4 describes the models built
and the measurements made during the experiments. Analysis of
the experimental results and conclusions are dealt with in Section
5.
II. RELATED WORK AND RESEARCH MOTIVATION
While there is an enormous amount of literature available on
the details of the nearZeroVar and findFreqTerms R language
functions and their applications in the text mining area, it is
interesting that there is no literature available that compares
these two techniques. This is despite the fact that both
techniques rely on the same underlying idea, namely how often
features are present in the text corpus. It is important to
derive a general principle as to which of these two
techniques to use for dimensionality reduction purposes in text
classification. Deriving a general principle eliminates the need
to repeat the testing of both these dimensionality reduction
techniques every time there is a need to reduce the dimensions
of a text corpus. Also, it is observed that there is no literature
3.3. Hardware and software used for the research
All experiments were executed on a Dell
Latitude E5430 laptop. The laptop is configured with an Intel
Core i5-3350M CPU @ 2.70 GHz and 4 GB of RAM;
however, the Weka workbench is configured to use a maximum of
1 GB. The laptop runs the Windows 7 64-bit operating system.
For the purpose of designing and evaluating the
experiments, the R statistical language and the Weka machine learning
workbench are used. R is used for all pre-processing tasks,
including dimensionality reduction, whereas Weka is used
for building the models from the pre-processed data and for
obtaining the model performance statistics.
The R statistical language is used in this research especially for
computing near zero variance. R is a GNU project that offers a language
and environment for statistical computing and graphics. R
provides a wide variety of statistical and graphical techniques
such as linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, and clustering, and is
highly extensible as well [10].
A machine learning workbench called Weka is used for the
experiments. Weka stands for Waikato Environment for
Knowledge Analysis, and it is a free offering from the University
of Waikato, New Zealand. The workbench has a user-friendly
interface and incorporates numerous options to develop and
evaluate machine learning models [11] [12]. These models can
be utilized for a variety of purposes, including automated essay
scoring.
3.4. Data pre-processing
To build the models required for the research, the
raw datasets, which consist of the answers and scores, need to be
transformed into a format that can be used by Weka. The steps
involved in transforming the raw data into a usable form are
termed pre-processing. All pre-processing required for the
experimentation is done using an R language text mining
library called tm [13].
The first step is to read the raw data file into an R vector.
After reading, the vector is used to create a corpus, which refers
to a collection of text documents. Each text document in this
case refers to an answer in the raw data file. The corpus is
then passed through a series of R functions, again part
of the tm package, to change the text to lower case, strip
white space, remove numbers, and remove stop words and
punctuation.
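A minimal sketch of this cleaning chain, assuming the raw answers have already been read into a character vector with the hypothetical name answers, might look as follows.

```r
# A minimal sketch of the pre-processing chain described above, assuming
# 'answers' is a character vector read from the raw data file.
library(tm)

corpus <- Corpus(VectorSource(answers))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, stripWhitespace)                    # strip white space
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stop words
corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
```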
From the corpus, a sparse matrix is created using the
DocumentTermMatrix function, in which the rows of the matrix
indicate documents and the columns indicate features, i.e.,
words. Each cell in the matrix contains a value that represents
the number of times the word in the column heading appears in
the document specified in the row. The last step is to convert
the sparse matrix cell values from numeric to categorical
values. This is required because the number of times a
word appears in a sentence does not relate to the score
obtained; rather, it is the presence of the word in the sentence that is
more relevant to scoring. A custom function is written to
convert the numeric values into Yes or No categorical
values. A cell with a value equal to 1 or more is replaced with
Yes, and a cell with a value of 0 is replaced with No.
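A sketch of these last two steps might look as follows; convert_counts is an illustrative name for the custom function, not necessarily the one used in the research, and the cutoff of 1 is taken from the text above.

```r
# A minimal sketch of the last two steps, assuming the 'corpus' produced
# by the pre-processing chain above.
dtm <- DocumentTermMatrix(corpus)

# Illustrative custom function: counts of 1 or more become "Yes",
# zero counts become "No".
convert_counts <- function(x) {
  ifelse(x >= 1, "Yes", "No")
}

# apply() works column by column on a dense copy of the sparse matrix
# and returns a character matrix of Yes/No values.
dtm_categorical <- apply(as.matrix(dtm), 2, convert_counts)
```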
retained in the data to be used in model building. The onus is on
the user to provide values for the parameters so as to define the
word retention threshold. A general approach to this
problem is to search the value space to identify suitable values for the
parameters. Since the value space ranges from 0 to the maximum
number of words in the corpus, it is not possible to evaluate
each value to identify the optimal one; therefore a sample of the value
space is chosen for experimentation. For the purpose of this
experiment, the values 2, 5, 10, 20, 50 and 100 are chosen as the
candidates in the value space to be used as the word frequency
threshold (lowfreq) parameter of the findFreqTerms function. The
findFreqTerms function is applied to the sparse matrix with each of
the candidate values, and the terms not returned by the function are
eliminated from the sparse matrix to create a final reduced feature
set that is used to build models. These models are then tested
by repeating 10-fold cross validation 10 times. Table 5
shows the various model performance measurements
from the reduced feature sets of the findFreqTerms function.
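A sketch of this search over the candidate value space might look as follows, assuming the document-term matrix dtm from the pre-processing step; the names are illustrative.

```r
# A minimal sketch of the threshold search described above. For each
# candidate, the matrix is reduced to only the terms findFreqTerms
# returns before model building.
library(tm)

candidates <- c(2, 5, 10, 20, 50, 100)

reduced_sets <- lapply(candidates, function(threshold) {
  keep <- findFreqTerms(dtm, lowfreq = threshold)
  dtm[, keep]  # keep frequent terms; everything else is eliminated
})
names(reduced_sets) <- paste0("lowfreq_", candidates)
```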
V. RESULTS DISCUSSION AND CONCLUSIONS
From the measurements, it is evident that the number of
features is reduced multi-fold through the application of the
nearZeroVar and findFreqTerms techniques. With the
application of nearZeroVar using default values for the
parameters, it is observed that at a minimum 94.9% of the features are
eliminated (for instance, a drop from 1307 features to 66 is a
reduction of roughly 95%). Even with the application of findFreqTerms
with the word frequency threshold set to 2, the datasets are reduced by
at least 50% of their dimensions. This confirms that both
dimensionality reduction techniques are effective in terms
of reducing the number of features. However, given the
amount of effort involved in performing the iterations,
nearZeroVar with default parameters is quick to
implement and very effective in reducing the number of
features when compared to the trial and error approach that is
taken with findFreqTerms.
Though the number of features is reduced, the accuracy
obtained from the models built with both techniques generally
surpasses the baseline accuracy. There is an accuracy dip observed for
dataset 2 with nearZeroVar; however, the difference
between the accuracy from the baseline model and that from nearZeroVar
is 0.4% and can therefore be considered negligible. With the accuracy
measurements from the findFreqTerms technique, it is observed
that the accuracy dipped in 14 instances spread across all the
datasets. However, only three instances
recorded an accuracy dip of more than 1%, with the
maximum dip being 2.34%. It is observed that the
highest accuracy for a dataset is not yielded by any single word
frequency value; it appears random across the datasets, and
therefore no clear pattern is observed in terms of the word
frequency value to be used with the findFreqTerms function in
order to obtain high accuracy. For each dataset, when the
highest accuracy measurement obtained from the findFreqTerms
feature sets is compared with the corresponding accuracy
measurement from the nearZeroVar feature sets, it is observed
that for datasets 1, 2 and 4 findFreqTerms yielded better
accuracy than nearZeroVar, while for datasets 3 and 5 the
nearZeroVar technique yielded better accuracy than findFreqTerms.
[Table fragment not recoverable from extraction: per-dataset summary score statistics (minimum score, 1st quartile, median, mean, 3rd quartile, maximum score).]
[Table caption not recovered in extraction; the following are measurements from models built with the full, unreduced feature sets.]

Dataset | Number of features | Accuracy | Kappa | MAE  | F Measure | AUC
1       | 1348               | 52.41    | 0.36  | 0.31 | 0.69      | 0.9
2       | 1307               | 47.27    | 0.1   | 0.39 | 0.31      | 0.58
3       | 1671               | 70.33    | 0.46  | 0.29 | 0.71      | 0.8
4       | 1597               | 81.06    | 0.47  | 0.27 | 0.9       | 0.79
5       | 1738               | 87       | 0.54  | 0.27 | 0.95      | 0.85
TABLE 4. MEASUREMENTS FROM MODELS BUILT FROM DATA POST DIMENSIONALITY REDUCTION THROUGH THE NEARZEROVAR FUNCTION

Dataset | Number of features | Accuracy | Kappa | MAE  | F Measure | AUC
1       | 66                 | 55.9     | 0.4   | 0.3  | 0.71      | 0.89
2       | 66                 | 51.62    | 0.01  | 0.37 | 0.02      | 0.57
3       | 70                 | 71.36    | 0.48  | 0.29 | 0.73      | 0.82
4       | 45                 | 80.66    | 0.45  | 0.27 | 0.9       | 0.77
5       | 32                 | 87.06    | 0.53  | 0.27 | 0.95      | 0.84
TABLE 5. MEASUREMENTS FROM MODELS BUILT FROM DATA POST DIMENSIONALITY REDUCTION THROUGH THE FINDFREQTERMS FUNCTION

Dataset | Word Frequency Threshold | Number of features | Accuracy | Kappa | MAE  | F Measure | AUC
1       | 2                        | 607                | 52.27    | 0.36  | 0.31 | 0.69      | 0.9
1       | 5                        | 319                | 52.96    | 0.37  | 0.31 | 0.68      | 0.9
1       | 10                       | 217                | 54.52    | 0.39  | 0.3  | 0.7       | 0.9
1       | 20                       | 140                | 56.64    | 0.41  | 0.3  | 0.72      | 0.91
1       | 50                       | 71                 | 55.29    | 0.39  | 0.3  | 0.7       | 0.9
1       | 100                      | 41                 | 54.07    | 0.38  | 0.3  | 0.73      | 0.9
2       | 2                        | 616                | 47.36    | 0.11  | 0.39 | 0.32      | 0.58
2       | 5                        | 321                | 47.66    | 0.1   | 0.39 | 0.32      | 0.58
2       | 10                       | 199                | 46.91    | 0.08  | 0.39 | 0.3       | 0.59
2       | 20                       | 119                | 50.31    | 0.1   | 0.38 | 0.31      | 0.58
2       | 50                       | 70                 | 51.41    | 0.02  | 0.37 | 0.07      | 0.57
2       | 100                      | 40                 | 52.52    |       | 0.37 |           | 0.58
3       | 2                        | 836                | 69.72    | 0.45  | 0.29 | 0.7       | 0.8
3       | 5                        | 443                | 70.3     | 0.46  | 0.29 | 0.71      | 0.81
3       | 10                       | 261                | 69.41    | 0.45  | 0.3  | 0.7       | 0.8
3       | 20                       | 153                | 68.75    | 0.44  | 0.3  | 0.7       | 0.8
3       | 50                       | 78                 | 70.76    | 0.47  | 0.29 | 0.73      | 0.82
3       | 100                      | 39                 | 67.99    | 0.39  | 0.3  | 0.68      | 0.79
4       | 2                        | 606                | 81.4     | 0.48  | 0.27 | 0.9       | 0.79
4       | 5                        | 287                | 80.78    | 0.46  | 0.27 | 0.9       | 0.78
4       | 10                       | 160                | 81.08    | 0.46  | 0.27 | 0.9       | 0.77
4       | 20                       | 114                | 81.35    | 0.47  | 0.27 | 0.91      | 0.78
4       | 50                       | 57                 | 81.31    | 0.47  | 0.27 | 0.91      | 0.79
4       | 100                      | 27                 | 81.23    | 0.47  | 0.27 | 0.9       | 0.78
5       | 2                        | 629                | 86.3     | 0.52  | 0.27 | 0.95      | 0.86
5       | 5                        | 299                | 86.12    | 0.51  | 0.27 | 0.94      | 0.86
5       | 10                       | 178                | 85.7     | 0.5   | 0.27 | 0.94      | 0.85
5       | 20                       | 92                 | 86.32    | 0.52  | 0.27 | 0.94      | 0.85
5       | 50                       | 43                 | 86.83    | 0.53  | 0.27 | 0.95      | 0.85
5       | 100                      | 25                 | 86.99    | 0.53  | 0.27 | 0.95      | 0.83

(The Kappa and F Measure values for dataset 2 at threshold 100 were not recovered in extraction.)
REFERENCES
[1] Brank J., Grobelnik M., Milic-Frayling N., Mladenic D., Interaction of Feature Selection Methods and Linear Classification Models, Proc. of the 19th International Conference on Machine Learning, Australia, 2002.
[2] Thiago G. Martins, Near-zero variance predictors. Should we remove them?, http://tgmstat.wordpress.com/2014/03/06/near-zero-variance-predictors/#ref1, Accessed: 17 September 2014.
[3] Max Kuhn, Allan Engelhardt, Identification of near zero variance predictors, http://www.inside-r.org/packages/cran/caret/docs/nearZeroVar, Accessed: 17 September 2014.
[4] Find Frequent Terms, http://www.inside-r.org/packages/cran/tm/docs/findFreqTerms, Accessed: 17 September 2014.
[5] Ingo Feinerer, Introduction to the tm Package: Text Mining in R, http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, Accessed: 17 September 2014.
[6] The Hewlett Foundation: Automated Essay Scoring, http://www.kaggle.com/c/asap-aes, Accessed: 17 September 2014.
[7] Code for evaluation metric and benchmarks, https://www.kaggle.com/c/asap-aes/data?Training_Materials.zip, 10 February 2012, Accessed: 17 September 2014.
[8] Sunil Kumar, R J Rama Sree, Assessment of Performances of Various Machine Learning Algorithms During Automated Evaluation of Descriptive Answers, in: ICTACT Journal on Soft Computing, Special Issue on Soft Computing in System Analysis, Decision and Control, 04 (04) (2014), pp. 781-786.
[9] Sunil Kumar, R J Rama Sree, Experiments towards determining best training sample size for automated evaluation of descriptive answers through sequential minimal optimization, in: ICTACT Journal on Soft Computing, 04 (02) (2014), pp. 710-714.
[10] The R Project for Statistical Computing, http://www.r-project.org/, Accessed: 17 September 2014.