
IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO)2015

Dimensionality Reduction in Automated Evaluation of Descriptive Answers through Zero Variance, Near Zero Variance and Non Frequent Words Techniques:
A Comparison

Sunil Kumar C
Research and Development Center
Bharathiar University
Coimbatore, India
sunil_sixsigma@yahoo.com

R.J.RamaSree
Department of Computer Science
Rashtriya Sanskrit Vidyapeetha
Tirupati, India
rjramasree@yahoo.com

Abstract: In this paper, we study the performance of models for the automated evaluation of descriptive answers, which can be viewed as a text classification problem, when uncommon features are excluded from training. Two different techniques, nearZeroVar and findFreqTerms, were independently employed on the text corpus to identify non-common features and eliminate them, attaining dimensionality reduction. The implementation details of both techniques are discussed. Models were built from the reduced datasets, and 10-fold cross validation was repeated 10 times in order to obtain the performance measurements. The usefulness of these feature selection techniques was analysed quantitatively using ease of implementation, number of features retained post dimensionality reduction, accuracy, kappa, mean absolute error, F score and area under the curve as performance metrics. Based on the measurements, it is concluded that both the nearZeroVar and findFreqTerms techniques help eliminate uncommon features from the training set, and that doing so improves model performance during the automated evaluation of descriptive answers. Another significant conclusion is that the nearZeroVar technique with its default values is quick and easy to use compared with the findFreqTerms technique, which requires a trial and error approach to set the parameter values that yield optimal performance measurements. Though the nearZeroVar technique is easy to implement, it does not compromise on performance when compared with models built using the findFreqTerms technique. A final inference is that for dimensionality reduction in text classification, the nearZeroVar technique is the better of the two.

Index Terms: Text classification, dimensionality reduction, feature selection, nearZeroVar, findFreqTerms, sequential minimal optimization.

I. INTRODUCTION


Automated evaluation of descriptive answers can be viewed as a text classification problem. One particular trait of the text classification task is that, depending on the document size, the number of features can be very large, sometimes spanning into thousands. The huge number of features is a major problem for the training algorithm to perform effective learning and execution. This high-dimensionality issue can be successfully addressed by employing various feature selection techniques. Feature selection essentially means that from the set of features, only the subset of features that contribute well to the model's performance is chosen and used during the classification process [1]. Multiple statistical techniques are available to perform feature selection; however, the focus in this paper is on two techniques that are based on the frequency of features' appearance in the text corpus. One of the techniques is feature selection through identifying and eliminating features that have zero variance or near zero variance. The other is identifying and eliminating features that do not meet a specific frequency threshold. Both techniques share the preconception that features are not valuable for model building if they are extremely uncommon. A feature that appears only once or twice in a thousand examples is not going to lead to a principle that can be generalized in the model.
Predictors with a constant value across samples are termed zero variance predictors, and predictors with a value that is almost constant across samples are termed near zero variance predictors. Both kinds of predictors are most often non-informative and at times can break the models as well [2]. A technique that comes to the rescue with the dimensionality problem is to identify such zero variance and near zero variance predictors in the training datasets and eliminate them prior to training the models. The R statistical language provides the nearZeroVar function in the caret package, which not only identifies and removes predictors with only one unique value across samples, but also identifies and removes predictors that have few unique values relative to the number of samples and a large ratio of the frequency of the most common value to the frequency of the second most common value [2]. freqCut and uniqueCut are important parameters used by the nearZeroVar function. The former determines the threshold for the ratio of the most common value to the second most common value, and the latter determines the threshold for the percentage of distinct values out of the total number of samples [3]. The default value for freqCut is 95/5 and for uniqueCut it is 10. The user may provide values other than the defaults if so intended, and the thresholds are adjusted accordingly.
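The filtering rule described above can be sketched as follows. This is an illustrative Python re-implementation of the logic, not the caret function itself; the helper name near_zero_var is hypothetical.

```python
from collections import Counter

def near_zero_var(columns, freq_cut=95/5, unique_cut=10):
    """Return indices of zero/near-zero variance columns, following the
    rule described for caret's nearZeroVar: flag a column when the ratio
    of its most common value to its second most common value exceeds
    freq_cut AND the percentage of unique values is below unique_cut."""
    flagged = []
    for j, col in enumerate(columns):
        counts = Counter(col).most_common()
        if len(counts) == 1:                 # zero variance: a single constant value
            flagged.append(j)
            continue
        freq_ratio = counts[0][1] / counts[1][1]
        pct_unique = 100.0 * len(counts) / len(col)
        if freq_ratio > freq_cut and pct_unique < unique_cut:
            flagged.append(j)
    return flagged

# Columns of a toy document-term matrix (one list per feature):
cols = [
    ["No"] * 99 + ["Yes"],    # 99:1 ratio, 2% unique -> near zero variance
    ["No"] * 100,             # constant -> zero variance
    ["Yes", "No"] * 50,       # balanced -> kept
]
print(near_zero_var(cols))    # -> [0, 1]
```

With the defaults shown, the balanced third column is retained while the constant and almost-constant columns are flagged for removal.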
The other approach to reducing dimensionality is discarding terms that do not occur in the whole text corpus at least as often as a user-specified threshold. The R statistical language provides the findFreqTerms function in the tm package, which identifies frequent terms in a document-term or term-document matrix. The function takes the important parameters lowfreq and highfreq, and returns words that occur at least lowfreq times and at most highfreq times. The default values for lowfreq and highfreq are 0 and Inf, which means all words are returned by the function if the user does not specify thresholds [4] [5]. As the default values do not suffice, the user is expected to experiment with various lowfreq and highfreq values in order to identify values that yield optimal performance measurements.
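The frequency-based selection can be sketched similarly. This is an illustrative Python analogue of the behaviour described for findFreqTerms; find_freq_terms is a hypothetical helper, not the tm API, and whitespace tokenisation is assumed.

```python
from collections import Counter

def find_freq_terms(docs, lowfreq=0, highfreq=float("inf")):
    """Return terms whose total occurrence count across the corpus lies
    in [lowfreq, highfreq], mirroring the selection rule described for
    tm's findFreqTerms."""
    totals = Counter()
    for doc in docs:
        totals.update(doc.split())
    return sorted(t for t, n in totals.items() if lowfreq <= n <= highfreq)

docs = ["cell wall cell", "cell membrane", "osmosis"]
print(find_freq_terms(docs))             # default thresholds keep every term
print(find_freq_terms(docs, lowfreq=2))  # -> ['cell']
```

Raising lowfreq shrinks the retained vocabulary, which is exactly the trial and error knob discussed above.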
While it is evident that both the nearZeroVar and findFreqTerms functions can be used for dimensionality reduction in a text corpus, the research covered in this paper focuses on identifying the better technique of the two from a user standpoint. In order to compare the functions, the research takes as performance metrics the reduction in the number of features; the accuracy, kappa, Mean Absolute Error (MAE), F Score and Area Under the Curve (AUC) obtained with models built using the reduced feature sets; and the ease of implementing the functions. The goal of the research is to identify the more quickly implementable and effective function of the two in order to accomplish optimal dimensionality reduction of the text corpus with respect to the model performance metrics.
The rest of this paper is organized as follows. Section 2 details previous work in this area of research and the research motivation that makes the present work unique. Section 3 discusses the data used, the experimental setup, and the preliminaries of the tools and techniques used. Section 4 describes the models built and the measurements made during the experiments. Analysis of the experimental results and conclusions are dealt with in Section 5.
II. RELATED WORK AND RESEARCH MOTIVATION
While there is an enormous amount of literature available on the details of the nearZeroVar and findFreqTerms R language functions and their applications in the text mining area, it is interesting that no literature is available that compares these two techniques. This is despite the fact that both techniques rely on the same philosophy: a feature must be present a certain number of times in the text corpus to be useful. It is important to derive a general principle as to which of these two techniques to use for dimensionality reduction purposes in text classification. Deriving a general principle eliminates the need to test both dimensionality reduction techniques every time there is a need to reduce the dimension of a text corpus. Also, no literature is available demonstrating the application of the nearZeroVar and findFreqTerms techniques to the automated evaluation of descriptive answers domain. These identified gaps are addressed by the research covered in this paper, making it distinct from the existing literature and a significant contribution to the existing knowledge.
III. EXPERIMENTAL SETUP
The setup in which the experiments were conducted for this paper is specified in this section.
3.1. Data collection
In February 2012, The William and Flora Hewlett Foundation (Hewlett) sponsored the Automated Student Assessment Prize (ASAP) [6], challenging machine learning specialists and data scientists to develop an automated scoring algorithm for student-written essays. As part of this competition, competitors were provided with hand-scored essays under 8 different prompts, which are questions to which answers were obtained from students. These answers are the datasets. 5 of the 8 essay prompts are used for the purpose of this research.
3.2. Data characteristics
All the graded essays from ASAP conform to specific data characteristics. All responses were written by students of Grade 10. On average, each essay is approximately 50 words in length. Some are more dependent upon source materials than others [6]. All the documents are in ASCII text, followed by a human score; a resolved final score was given in cases where there was a variance between the scores provided by the two human scorers [7]. For the purpose of evaluating the performance of the model, the score predicted by the model needs to comply with the resolved human score in the training example.

The data used for training and validation of the models are answers written by students to 5 different questions. The data for a question is considered one unique dataset, so we have a total of 5 datasets. The questions that students were asked to respond to are from Chemistry, English Language Arts and Biology.

In each of the 5 datasets used for the research, the training set is 1000 samples in size; however, only 900 samples were used for training, as 10-fold cross validation is applied for performance measurement. This essentially means 900 samples were used in each fold for training and 100 for testing. The datasets have two fields: one named EssayText and the other Scores. Table 1 shows a summary of the Scores field in each of the datasets. Previous research on determining the appropriate sample size for automated essay scoring using Sequential Minimal Optimization (SMO) [8] revealed that using 900 samples for training yields slightly better results than other sample sizes [9], hence the decision to use 900 samples as the training sample size. Table 2 shows the percentages of the score distribution in each of the datasets.

IEEE Sponsored 9th International Conference on Intelligent Systems and Control (ISCO)2015
3.3. Hardware and the software used for the research
All experiments were executed on a Dell Latitude E5430 laptop. The laptop is configured with an Intel Core i5-3350M CPU @ 2.70 GHz and 4 GB RAM; however, the Weka workbench is configured to use a maximum of 1 GB. The laptop runs the Windows 7 64-bit operating system.

For the purpose of designing and evaluating the experiments, the R statistical language and the Weka machine learning workbench are used. R is used for all pre-processing tasks, including dimensionality reduction, whereas Weka is used for building the models from the pre-processed data and obtaining the model performance statistics.
The R statistical language is used in this research especially for computing the near zero variance predictors. R is a GNU project that offers a language and environment for statistical computing and graphics. R provides a wide variety of statistical and graphical techniques such as linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering, and is highly extensible as well [10].
The Weka machine learning workbench is used for the experiments. Weka stands for Waikato Environment for Knowledge Analysis, and it is a free offering from the University of Waikato, New Zealand. The workbench has a user-friendly interface and incorporates numerous options to develop and evaluate machine learning models [11] [12]. These models can be utilized for a variety of purposes, including automated essay scoring.
3.4. Data pre-processing
To build the models required for the research, the raw datasets consisting of the answers and scores need to be transformed into a format that can be used by Weka. The steps involved in transforming the raw data into a usable form are termed pre-processing. All pre-processing required for the experimentation is done using an R language text mining library called tm [13].

The first step is to read the raw data file into an R vector. After reading, the vector is used to create a corpus, which refers to a collection of text documents. Each text document in this case refers to an answer in the raw data file. The corpus is then passed through a series of R functions, again part of the tm package, to change the case to lower case, strip white space, remove numbers, and remove stop words and punctuation.
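The cleaning steps above can be sketched as follows. This is a minimal Python stand-in for the tm transformations, not the tm package itself; the stop word list shown is a tiny illustrative subset.

```python
import re
import string

STOP_WORDS = {"the", "a", "is", "of", "and", "in", "to"}  # tiny stand-in for tm's stopword list

def preprocess(answer):
    """Apply the pre-processing steps described above: lower-case the text,
    remove numbers, remove punctuation, drop stop words, and collapse
    white space."""
    text = answer.lower()
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)                               # split/join strips extra white space

print(preprocess("The cell wall is 1 of 2 rigid layers!"))  # -> "cell wall rigid layers"
```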
From the corpus, a sparse matrix is created using the DocumentTermMatrix function, in which the rows of the matrix represent documents and the columns represent features, i.e., words. Each cell in the matrix contains a value that represents the number of times the word in the column heading appears in the document specified in the row. The last step is to convert the sparse matrix cell values from numeric to categorical type. This is required because the number of times a word appears in a sentence does not relate to the score obtained; rather, it is the presence of the word in the sentence that is more relevant to scoring. A custom function is written to convert the numeric values into Yes or No categorical values: a cell with a value of 1 or more is replaced with Yes, otherwise it is replaced with No. The custom function is called for each cell in the sparse matrix, after which all the cells contain only the factors Yes or No. At this stage, models are built to obtain measurements for baseline purposes. Once the baseline measurements are obtained, the dimensionality reduction techniques are applied to the matrix. Models are built using the output matrices obtained from applying the dimensionality reduction techniques, and the measurements obtained from these models are compared with the baseline to arrive at the conclusions of the research.
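The matrix construction and the Yes/No conversion described above can be sketched as follows. This is an illustrative Python version of the custom conversion; binary_dtm is a hypothetical name, and whitespace-tokenised input is assumed.

```python
def binary_dtm(docs):
    """Build a document-term matrix and convert counts to the 'Yes'/'No'
    factors described above: any count of 1 or more becomes 'Yes',
    otherwise 'No'."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    rows = []
    for doc in docs:
        words = set(doc.split())                       # presence, not counts
        rows.append(["Yes" if w in words else "No" for w in vocab])
    return vocab, rows

vocab, rows = binary_dtm(["cell wall cell", "cell membrane"])
print(vocab)    # -> ['cell', 'membrane', 'wall']
print(rows[0])  # -> ['Yes', 'No', 'Yes']
```

Note that "cell" appearing twice in the first answer still yields a single "Yes", matching the rationale that presence, not frequency, drives the score.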
3.5. Model building
All models are built using John Platt's Sequential Minimal Optimization (SMO) algorithm [14] for training a support vector classifier; a polynomial kernel is used along with the other default parameters available in Weka.
3.6. Model performance evaluation metrics
The measurements obtained from 10-fold cross validation repeated 10 times are used as an indicator of the models' future performance on unseen data. The measurements recorded are the size of the reduced datasets after the application of the dimensionality reduction techniques, accuracy, Cohen's kappa (kappa), mean absolute error (MAE), area under the receiver operating characteristic curve (AUC), F score, and the ease of use of the dimensionality reduction technique.
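The repeated cross validation scheme can be sketched as follows. This is an illustrative Python version of the split generation only; the model training and metric computations performed in Weka are not reproduced here, and repeated_kfold is a hypothetical helper.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=10, seed=42):
    """Yield (train_idx, test_idx) splits for k-fold cross validation
    repeated `repeats` times. With 1000 samples and k=10, each split
    trains on 900 samples and tests on 100, as described above."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)                    # fresh shuffle per repeat
        fold = n_samples // k
        for i in range(k):
            test = idx[i * fold:(i + 1) * fold]
            train = idx[:i * fold] + idx[(i + 1) * fold:]
            yield train, test

splits = list(repeated_kfold(1000))
print(len(splits))                           # -> 100 (10 folds x 10 repeats)
print(len(splits[0][0]), len(splits[0][1]))  # -> 900 100
```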
IV. MEASUREMENTS
The various models built during the experiments and the measurements obtained are described in this section.
4.1. Baseline measurements
The sparse matrix obtained from the raw data after application of the pre-processing techniques, but prior to the application of any dimensionality reduction technique, is used for building the baseline models. The baseline measurements for the datasets are shown in Table 3.
4.2. Measurements from models built from data post dimensionality reduction through nearZeroVar function output
The nearZeroVar function was applied to the sparse matrix, and all features identified as having zero variance or near zero variance were eliminated from the sparse matrix. Models were then built with the resultant sparse matrix, and measurements were taken by repeating 10-fold cross validation 10 times. Table 4 shows the model performance measurements from the reduced feature sets produced by the nearZeroVar function.
4.3. Measurements from models built from data post dimensionality reduction through findFreqTerms function output
Unlike the nearZeroVar function, the findFreqTerms function carries no useful default value for the key parameter that defines the threshold, i.e., the number of times a word has to occur in the corpus to qualify for retention in the data used for model building. The onus is on the user to provide a value for the parameter so as to define the word retention threshold. A general approach to this problem is to search the value space to identify suitable parameter values. Since the value space ranges from 0 to the maximum number of word occurrences in the corpus, it is not feasible to evaluate every value to identify the optimal one; therefore a sampled value space is chosen for experimentation. For the purpose of this experiment, the values 2, 5, 10, 20, 50 and 100 are chosen as candidates to be used as the parameter for the findFreqTerms function. The findFreqTerms function is applied to the sparse matrix with each of the candidate values, and the terms not returned by the function are eliminated from the sparse matrix to create a final reduced feature set that is used to build models. These models are then tested by repeating 10-fold cross validation 10 times. Table 5 shows the model performance measurements from the reduced feature sets produced by the findFreqTerms function.
V. RESULTS DISCUSSION AND CONCLUSIONS
From the measurements, it is evident that the number of features is reduced multi-fold through the application of the nearZeroVar and findFreqTerms techniques. With the application of nearZeroVar using the default parameter values, at a minimum 94.9% of the features are eliminated. Even with the application of findFreqTerms with the word frequency threshold set to 2, the datasets are reduced by at least 50% of their dimensions. This confirms that both dimensionality reduction techniques are effective in reducing the number of features. However, considering the effort involved in performing the iterations, nearZeroVar with default parameters is quick to implement and very effective in reducing the number of features compared with the trial and error approach taken with findFreqTerms.
Though the number of features is reduced, the accuracy obtained from models built with both techniques surpasses the baseline accuracy. There is an accuracy dip observed for dataset 2 with nearZeroVar; however, the difference between the baseline accuracy and the nearZeroVar accuracy is 0.4% and can therefore be considered negligible. With the accuracy measurements from the findFreqTerms technique, the accuracy dipped in 14 instances spread across all datasets; however, only three instances recorded an accuracy dip of more than 1%, the maximum dip being 2.34%. The highest accuracy for a dataset is not yielded by a single word frequency value; it appears random across the datasets, so no clear pattern is observed as to which word frequency value to use with the findFreqTerms function in order to obtain high accuracy. For each dataset, when the highest accuracy obtained from the findFreqTerms datasets is compared with the corresponding accuracy from the nearZeroVar datasets, findFreqTerms yielded better accuracy for datasets 1, 2 and 4, and nearZeroVar yielded better accuracy for datasets 3 and 5. However, in all cases the difference between the accuracies is less than 1%. Given the effort involved in identifying the appropriate word frequency value for the findFreqTerms function, the nearZeroVar function with default parameter values proves to be the better option, as it yields almost the same accuracy as findFreqTerms. Given all this background, the nearZeroVar function is a clear winner from an accuracy measurement perspective.
The lower the MAE relative to the baseline MAE, the better the model. Reviewing the measurements reveals that both the nearZeroVar and findFreqTerms models measured the same as the baseline MAEs. With findFreqTerms, and only for dataset 3, the MAE dipped by 0.01 relative to the baseline MAE; however, the dip is insignificant. Considering the measurements, MAE is not a significant measure for comparing the nearZeroVar and findFreqTerms techniques; nevertheless, the nearZeroVar technique can be considered the better technique given the effort invested in implementing it.
With models built from the nearZeroVar technique, the F score measurements improved on, or at least matched, the baseline F scores, except for dataset 2, where a 93.55% dip is recorded, which is significant. With models built from the findFreqTerms technique, the F scores recorded for datasets 1, 3, 4 and 5 improved on or at least matched the baseline F scores; there are dips in 4 cases, but each dip is less than 1% and therefore insignificant. For dataset 2, significant dips in F score are observed: with a word frequency threshold of 100, the model recorded an F score of 0, confirming that the accuracy cannot be relied on at all, and a similar case is observed with a word frequency threshold of 50. Considering these measurements and conclusions, the F score does not appear to be a significant measure for comparing the nearZeroVar and findFreqTerms techniques.
The AUC measurements from models created using both techniques show that the area under the curve either improved or remained the same as the baseline AUC. There are some cases where a dip in AUC is seen, but the dip is less than 2% and therefore insignificant. Considering these measurements and conclusions, AUC does not appear to be a significant measure for comparing the nearZeroVar and findFreqTerms techniques.
Given the five different measurements and conclusions, the nearZeroVar technique proves to be the better dimensionality reduction technique compared with the findFreqTerms technique. nearZeroVar allows a quick implementation by adopting default parameters, which is not the case with the findFreqTerms technique. Though less effort is required with nearZeroVar compared with findFreqTerms, one can obtain fewer dimensions in the data while retaining the accuracy, kappa, MAE, F score and AUC that can be obtained with findFreqTerms.


TABLE 1. SUMMARY OF SCORES FIELD IN THE DATASETS
(summary properties per dataset: minimum score, 1st quartile, median, mean, 3rd quartile, maximum score; only the mean values are recoverable from the extraction)

Dataset   1       2       3       4       5
Mean      1.493   1.012   0.711   0.296   0.271

TABLE 2. PERCENTAGES OF SCORES DISTRIBUTION IN THE DATASETS

Dataset   Score 0   Score 1   Score 2   Score 3
1         21.7      27        31.6      19.7
2         23        52.8      24.2      Not Applicable
3         37.2      54.5      8.3       Not Applicable
4         76.8      18.8      2.4       2.0
5         83.2      9.2       4.9       2.7
TABLE 3. BASELINE MEASUREMENTS

Dataset   Number of features   Accuracy   Kappa   MAE    F Measure   AUC
1         1348                 52.41      0.36    0.31   0.69        0.9
2         1307                 47.27      0.1     0.39   0.31        0.58
3         1671                 70.33      0.46    0.29   0.71        0.8
4         1597                 81.06      0.47    0.27   0.9         0.79
5         1738                 87         0.54    0.27   0.95        0.85

TABLE 4. MEASUREMENTS FROM MODELS BUILT FROM DATA POST DIMENSIONALITY REDUCTION THROUGH THE NEARZEROVAR FUNCTION

Dataset   Number of features   Accuracy   Kappa   MAE    F Measure   AUC
1         66                   55.9       0.4     0.3    0.71        0.89
2         66                   51.62      0.01    0.37   0.02        0.57
3         70                   71.36      0.48    0.29   0.73        0.82
4         45                   80.66      0.45    0.27   0.9         0.77
5         32                   87.06      0.53    0.27   0.95        0.84

TABLE 5. MEASUREMENTS FROM MODELS BUILT FROM DATA POST DIMENSIONALITY REDUCTION THROUGH THE FINDFREQTERMS FUNCTION

Dataset   Word Frequency Threshold   Number of features   Accuracy   Kappa   MAE    F Measure   AUC
1         2                          607                  52.27      0.36    0.31   0.69        0.9
1         5                          319                  52.96      0.37    0.31   0.68        0.9
1         10                         217                  54.52      0.39    0.3    0.7         0.9
1         20                         140                  56.64      0.41    0.3    0.72        0.91
1         50                         71                   55.29      0.39    0.3    0.7         0.9
1         100                        41                   54.07      0.38    0.3    0.73        0.9
2         2                          616                  47.36      0.11    0.39   0.32        0.58
2         5                          321                  47.66      0.1     0.39   0.32        0.58
2         10                         199                  46.91      0.08    0.39   0.3         0.59
2         20                         119                  50.31      0.1     0.38   0.31        0.58
2         50                         70                   51.41      0.02    0.37   0.07        0.57
2         100                        40                   52.52      n/a     0.37   0           0.58
3         2                          836                  69.72      0.45    0.29   0.7         0.8
3         5                          443                  70.3       0.46    0.29   0.71        0.81
3         10                         261                  69.41      0.45    0.3    0.7         0.8
3         20                         153                  68.75      0.44    0.3    0.7         0.8
3         50                         78                   70.76      0.47    0.29   0.73        0.82
3         100                        39                   67.99      0.39    0.3    0.68        0.79
4         2                          606                  81.4       0.48    0.27   0.9         0.79
4         5                          287                  80.78      0.46    0.27   0.9         0.78
4         10                         160                  81.08      0.46    0.27   0.9         0.77
4         20                         114                  81.35      0.47    0.27   0.91        0.78
4         50                         57                   81.31      0.47    0.27   0.91        0.79
4         100                        27                   81.23      0.47    0.27   0.9         0.78
5         2                          629                  86.3       0.52    0.27   0.95        0.86
5         5                          299                  86.12      0.51    0.27   0.94        0.86
5         10                         178                  85.7       0.5     0.27   0.94        0.85
5         20                         92                   86.32      0.52    0.27   0.94        0.85
5         50                         43                   86.83      0.53    0.27   0.95        0.85
5         100                        25                   86.99      0.53    0.27   0.95        0.83

REFERENCES
[1] Brank J., Grobelnik M., Milic-Frayling N., Mladenic D., "Interaction of Feature Selection Methods and Linear Classification Models", Proc. of the 19th International Conference on Machine Learning, Australia, 2002.
[2] Thiago G. Martins, "Near-zero variance predictors. Should we remove them?", http://tgmstat.wordpress.com/2014/03/06/near-zero-variance-predictors/#ref1, Accessed: 17 September 2014.
[3] Max Kuhn, Allan Engelhardt, "Identification of near zero variance predictors", http://www.insider.org/packages/cran/caret/docs/nearZeroVar, Accessed: 17 September 2014.
[4] "Find Frequent Terms", http://www.insider.org/packages/cran/tm/docs/findFreqTerms, Accessed: 17 September 2014.
[5] Ingo Feinerer, "Introduction to the tm Package: Text Mining in R", http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, Accessed: 17 September 2014.
[6] "The Hewlett Foundation: Automated Essay Scoring", http://www.kaggle.com/c/asap-aes, Accessed: 17 September 2014.
[7] "Code for evaluation metric and benchmarks", https://www.kaggle.com/c/asap-aes/data?Training_Materials.zip, 10 February 2012, Accessed: 17 September 2014.
[8] Sunil Kumar, R J Rama Sree, "Assessment of Performances of Various Machine Learning Algorithms During Automated Evaluation of Descriptive Answers", ICTACT Journal on Soft Computing: Special Issue on Soft Computing in System Analysis, Decision and Control, 04 (04) (2014), pp. 781-786.
[9] Sunil Kumar, R J Rama Sree, "Experiments towards determining best training sample size for automated evaluation of descriptive answers through sequential minimal optimization", ICTACT Journal on Soft Computing, 04 (02) (2014), pp. 710-714.
[10] The R Project for Statistical Computing, http://www.r-project.org/, Accessed: 17 September 2014.
[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten (2009), "The WEKA Data Mining Software: An Update", SIGKDD Explorations, Volume 11, Issue 1.
[12] Ian H. Witten and Eibe Frank (2005), Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco.
[13] I. Feinerer, K. Hornik, and D. Meyer, "Text mining infrastructure in R", Journal of Statistical Software, 25(5):1-54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05, Accessed: 17 September 2014.
[14] J. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization", in B. Schoelkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998.
