An Educational Model Based On Knowledge Discovery in Databases (KDD) To Predict Learner's Behavior Using Classification Techniques

Benilda Eleonor V. Comendador, Lorena W. Rabago and Bartolome T. Tanguilig III

Graduate Programs Office
Technological Institute of the Philippines
Quezon City, Philippines
email: bennycomendador@yahoo.com, lwr823@yahoo.com, bttanguilig_3@yahoo.com
Abstract— This paper examines students' history of accessing data in the university Learning Management System (LMS). Classification techniques are used to build an educational model based on Knowledge Discovery in Databases (KDD) to predict learners' behavior. The study identified the most valuable influencer for the learning outcomes of the learners; generated prediction models using the J48 decision tree algorithm and multiple linear regression; and determined how likely a Distance Education (DE) learner is to get a mark of "Passed" in a certain course, which may offer vital information to teachers and university administrators for program planning and learner support strategies. The proponents conducted experiments to predict the students' final rating based on their history of accessing the data in the university LMS. Based on the derived model, the score obtained from participation in the online activities was the most valuable influencer for the learning outcomes of the DE learners. Thus, the successful completion of the program depends on how the students interact with the activities posted in the LMS. As such, the generated model may be utilized to identify DE learners who need early intervention for better academic achievement and a meaningful online learning environment.

Keywords—classification techniques; educational data mining; learning management system; predictive indicator

I. INTRODUCTION
Nowadays, research on knowledge discovery in databases (or data mining) and e-Learning technologies is becoming popular in academia because of its potential to improve services in the education domain. These innovations can provide a ubiquitous tool and a powerful platform for higher educational institutions [1], [2]. Educational data mining develops methods that discover knowledge from data originating from educational environments. It uses many techniques such as Decision Trees, Neural Networks, Naïve Bayes, K-Nearest Neighbor, and many others [3]. It is interesting to note that classification predicts categorical class labels (discrete or nominal). It classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data. Prediction, on the other hand, models continuous-valued functions, i.e., it predicts unknown or missing values.

Meanwhile, some educational institutions are using software such as a Learning Management System (LMS) or Course Management System (CMS) to manage the courses offered on the Internet. Currently, there are more than 200 LMS products. One study revealed that Blackboard is one of the leading commercial LMS (or CMS) software packages used by North American and European universities. Nevertheless, the Modular Object-Oriented Dynamic Learning Environment (Moodle) is the most recommended Open Source Software (OSS) [4].

With this innovation, enrollment in distance education courses using an LMS became attractive for most learners. However, reported course dropout and failure rates are steadily increasing in this mode of learning. In the study done for the Greek Distance Education University, the most significant reason cited by the respondents for not completing the course was the underestimation of the actual time required for studying versus their other obligations; dropout numbers were more evident at the undergraduate level than at the postgraduate level [5]. The dropout rate is one measure of the effectiveness of a course program, and its quality can be determined, in part, by calculating student completion rates [6]. Other researchers have found that computer literacy and confidence, reading ability and time management skills play a role in successful course completion [7], [8], [9].

Less research has been done on methods to improve the success and retention of nontraditional students who are enrolled in courses offered in distance education. The process of understanding and analyzing the factors for poor performance is challenging because they are hidden in historical and current educational information [10].
978-1-5090-2708-8/16/$31.00 ©2016 IEEE ICSPCC2016

Thus, the ultimate goal of this study is to explore Knowledge Discovery in Databases (KDD) based on the learners' history of accessing the data in the university Learning Management System (LMS). Specifically, it aims to achieve the following objectives: (1) categorize the typical online behaviors of the Distance Education (DE) learners, as discovered through data mining techniques; and (2) identify the most valuable influencer for the learning outcomes of the learners, which may offer vital information to teachers and university administrators for program planning and learner support strategies.

To cite a few studies presented in educational data mining conferences, there is a college completion model based on k-means clustering data mining techniques [11], [13]. The said model was based on high school general weighted average, college entrance test, gender, scholarship grant, and grade point average in the first and second semesters. Its authors mentioned that future researchers may use additional predictor variables related to students' college completion. The inclusion of the records of currently enrolled students is highly recommended to monitor their progression and to allow early intervention for those who may be considered at-risk [12].

Much of the work surveyed in the literature has utilized enrollment data on conventional secondary and tertiary education. As such, this study focuses on the improvement of online teaching and learning activities using classification techniques that work with both numerical and categorical attributes.

II. THE EDUCATIONAL KDD PROCESS
Fig. 1 depicts the process of the educational Knowledge Discovery in Databases (KDD) used in this study. It consists of three major phases: (a) data pre-processing; (b) data mining; and (c) pattern analysis.

Fig. 1. The educational Knowledge Discovery in the Databases (KDD) process.

A. Data Pre-Processing Phase
In this study, the proponents used the data set from the students' history of accessing the Polytechnic University of the Philippines (PUP) eMabini Learning Portal. The said portal is powered by the Modular Object-Oriented Dynamic Learning Environment (Moodle) system. They initially extracted the 99,233 records of the users' log file and other student-related activities from the Moodle database. The authors chose the 248 student records of the 3 programs in the PUP Open University System: 185 students of Bachelor in Entrepreneurship (BSEM), 32 students of Master of Science in Information Technology (MSIT) and 31 students of Post-Baccalaureate Diploma in Information Technology. These students were enrolled during the First Semester 2015-2016. The student performance prediction model was generated using the data set based on the derived variables of the students' interaction in the eLearning platform, as shown in Table I. Afterwards, the authors converted the extracted data into the format required by the Waikato Environment for Knowledge Analysis (WEKA) tool and performed steps to preprocess the data.

B. Data Mining Phase
During the data mining phase, the proponents conducted experiments to evaluate the appropriate classification algorithms that can be utilized for predicting students' final rating based on their usage data in the LMS. The objective is to classify students with equal final marks into different groups depending on the activities carried out in a web-based course. Within the research community, decision trees are often used in classification and prediction. They are a simple yet powerful way of representing knowledge.

TABLE I. STUDENTS RELATED VARIABLES

Attribute Name          Attribute Description
Student year of birth   (e.g. 1988) – to calculate the age of the student
Gender                  The gender of the student (male or female)
Department              Course program where the student is enrolled
Study Year              Year when the student was admitted to the University
Type of study           Type of student (full time, part time)
Employment              Status of the student if employed (no, yes)
Registration            Old Student, New Student
Log Freq                Total frequency of the student's login in the LMS within the rating period
Mat Access Freq         Total frequency of the student's access to course materials
Exam condition          (no, yes)
Exam points             (0 – 100)
Activity points         (0 – 100)
Final Grade             Student's final grade in numerical form, which was used as classifier
Final Mark              Student's final grade remarks, either passed or withdrawn
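The conversion of the Table I variables into a WEKA-readable file could look like the following minimal sketch. It assumes the target is ARFF, WEKA's native text format; the paper does not state which format or tooling was used. The attribute names, the nominal value sets, the chosen subset of Table I, and the sample record are hypothetical and serve only as an illustration.

```python
# Hypothetical schema mirroring part of Table I (names and value sets are assumptions).
ATTRIBUTES = [
    ("age", "NUMERIC"),
    ("gender", "{male,female}"),
    ("employment", "{no,yes}"),
    ("log_freq", "NUMERIC"),
    ("mat_access_freq", "NUMERIC"),
    ("exam_points", "NUMERIC"),
    ("activity_points", "NUMERIC"),
    ("final_mark", "{Passed,Withdrawn}"),
]


def age_from_birth_year(birth_year, reference_year):
    """Derive the age attribute from the year of birth, as Table I suggests."""
    return reference_year - birth_year


def to_arff(relation, rows):
    """Serialize a list of dict records into ARFF text.

    Missing values are written as '?', the ARFF marker for unknown values.
    """
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} {kind}" for name, kind in ATTRIBUTES]
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join(str(row.get(name, "?")) for name, _ in ATTRIBUTES))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "age": age_from_birth_year(1988, 2015),
        "gender": "female",
        "employment": "yes",
        "log_freq": 12,
        "mat_access_freq": 5,
        "exam_points": 85,
        "activity_points": 80,
        "final_mark": "Passed",
    }
    print(to_arff("de_learners", [sample]))
```

Declaring Final Mark as a nominal attribute keeps the class label categorical, which is what tree classifiers such as J48 expect; the numeric Final Grade would be declared NUMERIC instead when fitting the regression model.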
In this study, the authors utilized the Reduced Error Pruning Tree (REPTree), Classification and Regression Tree (CART) and J48 tree algorithms, all of which work with both numerical and categorical attributes. They also used Multiple Linear Regression (MLR) to predict the Final_grade, which involves more than one predictor variable and was solved by an extension of the least squares method.

1) J48 is a decision tree classifier that implements C4.5, the successor of ID3. It works well with numeric attributes, missing data and noisy data, and with generating rules from the tree. The algorithm uses heuristic-based reasoning in which a candidate cut on a numeric attribute must separate at least a small number of instances. Based on the heuristic observation of Quinlan (1986), if there are S candidate splits on a certain numeric attribute at the node being considered for splitting, log2(S)/N is subtracted from the information gain, where N is the number of instances at the node; this prevents overfitting [14]. The algorithm uses the gain ratio, a normalization of the information gain, to overcome the problem, using the following formula:

\[ \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right) \]

2) The Classification and Regression Tree (CART) algorithm was introduced by Breiman et al. in 1984. It handles both categorical and continuous attributes to build a decision tree, and it handles missing values. CART uses the Gini Index as an attribute selection measure and uses cost complexity pruning to remove unreliable branches from the decision tree in order to improve accuracy [12]. The CART algorithm, one of the best-known decision tree methods, creates binary branches based on one field, which means that each group is divided into two other groups.

3) Multiple Linear Regression (MLR) uses multiple explanatory variables to create a model that predicts the specific outcome being researched. Multiple linear regression works in a very similar way to simple linear regression.

C. The Pattern Analysis Phase
During the pattern analysis phase, the authors identified interesting rules and patterns for decision-making. If the results were not considered remarkable, the data mining and pattern analysis phases were repeated.

III. RESULTS AND DISCUSSIONS

A. Descriptive Analysis
The data used in this study were gathered through the PUP-LMS eMabini Learning Portal. The file consists of 99,233 user log records of the 248 students. Based on the result, forty percent (40%) of the students are male while the other sixty percent (60%) are female. The students are from the age group 18 to 58 years. A total of 195/248 DE learners, or around 78.62%, have a greater chance of completing their respective program on time. Consequently, 53/248, or about 21.37%, need intervention to complete the program on time.

B. Key Influencer for Student's Performance
The WEKA tool supports different attribute selection techniques that can be applied to an educational database. It is essential to conduct a test for a better understanding of the importance of the initial input attributes. The authors chose a subset of input variables by eliminating features that are irrelevant or carry no predictive information, using three feature selection techniques, namely: (a) Chi-Squared Attribute Evaluation (CH); (b) Gain-Ratio Attribute Evaluation (GR); and (c) Information-Gain Attribute Evaluation (IG). After applying these three techniques, the authors obtained the final attributes used for the classification of students' performance in an eLearning environment.

Table II shows the average rank of each attribute according to its importance, as follows: the students' obtained score in the online activity, examination rating and its condition, year of admission, frequency of logging into the portal, frequency of accessing the material in the portal, department, age, type of study, gender and registration. The result shows that the score obtained from participation in the online activities was the most valuable influencer of successful completion of the program, with an average of 71.8. However, department, age, type of study, gender and registration were not identified as significant in the final classification of students' performance in an eLearning environment, as these variables obtained an average of 2.60 and below after applying the three feature selection techniques.

C. Student Performance Predictive Model
The authors applied discretization during the preprocessing tasks and tested the REPTree, CART and J48 algorithms on the selected data sets using 10-fold cross validation in WEKA.

TABLE II. RESULT OF TESTS AND AVERAGE RANK

Attributes        Chi-Square  Info Gain  Gain Ratio  Average
Activity_Points   214.5       0.6        0.2         71.8
Exam_Points       155.1       0.4        0.2         51.9
Exam_Condition    113.4       0.3        0.6         38.1
Study_Year        21.2        0.1        0.1         7.1
Log_Freq          18.6        0.0        0.0         6.2
Mat_Access_Freq   16.4        0.0        0.0         5.5
Department        7.8         0.0        0.0         2.6
Age               4.3         0.0        0.0         1.4
Type_Study        3.8         0.0        0.0         1.3
Gender            1.2         0.0        0.0         0.4
Registration      0.1         0.0        0.0         0.0

Table III describes the performance of these three (3) classifiers in terms of: (a) time to build the model; (b) Correctly Classified Instances (CCI); (c) Incorrectly Classified Instances (ICI); and (d) accuracy.
TABLE III. PERFORMANCE OF THE CLASSIFIERS

Evaluation Criteria                       REPTree  CART   J48
Time to Build the Model (in sec)          0.01     0.9    0.11
Correctly Classified Instances (CCI)      237      237    241
Incorrectly Classified Instances (ICI)    11       11     7
Accuracy (%)                              95.56    95.56  97.17

Based on the experiment, J48 (C4.5) in WEKA produced 97.17% Correctly Classified Instances (CCI) and 2.82% Incorrectly Classified Instances (ICI). This indicates that the model is accurate and can handle unknown data or any changes that may be applied to it in the future. Meanwhile, the REPTree and CART algorithms each obtained a CCI of 95.56% and an ICI of 4.44% (Table III), which indicates that these models are more vulnerable when handling unknown data or future data. However, in terms of time to build the model, REPTree is the fastest algorithm among the three (3) classifiers: it builds a decision tree based on the information gain and took 0.01 second to build the model.

The authors used the top five attributes generated by the feature selection techniques (see Table II) to construct the predictive decision tree model. These are the students' obtained score in the online activity, examination rating and its condition, year of admission, and frequency of logging into the portal, which served as valuable indicators for the predictive decision tree model. The identified variables may help the course specialists (teachers) to predict students' learning performance.

Fig. 2 depicts the J48 (C4.5) decision tree model that was constructed in the training phase.

Fig. 2. J48 or C4.5 decision tree model (constructed in the train phase)

Fig. 3. LMS prediction model using C4.5 or J48 algorithm

Considering the selected value, below is the predictive J48 model:

Final_rating (predicted)
= Withdrawn: [Activity_Points ≤ 1];
= Passed: [Activity_Points > 2];
= Passed: [Activity_Points ≤ 2] + [Mat_Access_Count > 4];
= Passed: [Activity_Points ≤ 2] + [Mat_Access_Count ≤ 4] + [Exam_Points > 80];
= Withdrawn: [Activity_Points ≤ 2] + [Mat_Access_Count ≤ 4] + [Exam_Points ≤ 80].

Based on the attributes selected, as shown in Fig. 3, the activity points become the root of the generated prediction model, followed by the material access count and the exam points.

The model generated 97.17% Correctly Classified Instances (CCI), as shown in Table IV. There are only 7 Incorrectly Classified Instances (ICI), which indicates that the model is incorrect for 2.82% of the cases in the data set. Out of the 52 students who withdrew, 5 instances were misclassified as "Passed", and from the 189 students, 2 were misclassified as "Withdrawn".

TABLE IV. CONFUSION MATRIX OF C4.5 OR J48 DECISION TREE MODEL

                     Predicted Class
Actual Class         Passed   Withdrawn   Percent Correct
Passed               189      5           99.97%
Withdrawn            2        52          99.96%
Overall Percentage   99.98%   99.90%      97.17%
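The J48 rule set and the confusion matrix counts reported in this section can be restated as a short executable sketch. This is only an illustration and not the WEKA-generated model itself: the function and variable names are hypothetical, and the nesting of the conditions assumes that the rules with Activity_Points ≤ 2 apply to learners whose activity points fall above 1 and up to 2.

```python
def predict_final_mark(activity_points, mat_access_count, exam_points):
    """Apply the reported J48 rules in root-to-leaf order."""
    if activity_points <= 1:
        return "Withdrawn"
    if activity_points > 2:
        return "Passed"
    # Remaining case: 1 < activity_points <= 2.
    if mat_access_count > 4:
        return "Passed"
    return "Passed" if exam_points > 80 else "Withdrawn"


def overall_accuracy(confusion):
    """Share of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Counts as printed in Table IV (rows and columns ordered Passed, Withdrawn).
TABLE_IV = [[189, 5], [2, 52]]

if __name__ == "__main__":
    print(predict_final_mark(2, 3, 85))        # learner with 2 activities and a high exam score
    print(f"{overall_accuracy(TABLE_IV):.4f}")  # 241 of 248 instances on the diagonal
```

The overall accuracy depends only on the diagonal counts (189 + 52 = 241 out of 248), so it does not change with the reading of rows versus columns in Table IV.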
The authors also used multiple linear regression to predict the Final_grade, which involves more than one predictor variable. It was solved by an extension of the least squares method. The training data is of the form:

(X1, y1), (X2, y2), …, (X|D|, y|D|)

\[ w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2} \]

\[ w_0 = \bar{y} - w_1 \bar{x} \]

\[ y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k \]

where w0 is called the intercept and w1, …, wk are called slopes or coefficients.

Considering the selected value, the predictive model using the multiple regression algorithm is the following:

Final_Grade (predicted) = w0 + w1 * Exam_pts + w2 * Activity_pts
Final_Grade (predicted) = 17.3806 + 0.369863 * Exam_pts + 0.45284881 * Activity_pts

Thus, if we apply the generated multiple regression model to the LMS data sets, a student with scores of 80 in Exam_pts and 85 in Activity_pts obtains a predicted Final_grade of approximately 85.46 (17.3806 + 0.369863 × 80 + 0.45284881 × 85).

IV. CONCLUSION
In this study, the proponents examined the students' history of accessing the data of the PUP-LMS eMabini Learning Portal, which is powered by the Moodle system. The authors obtained the final attributes used for the classification of students' performance in an eLearning environment after applying three feature selection techniques: CH, GR and IG. The average rank of each attribute according to its importance is as follows: the students' obtained score in the online activity, examination rating and its condition, year of admission, frequency of logging into the portal, frequency of accessing the material in the portal, department, age, type of study, gender and registration.

The result shows that the score obtained from participation in the online activities was the most valuable influencer of successful completion of the program, with an average of 71.8. Thus, the successful completion of the program depends on how the students interact with the activities posted in the Learning Management System.

Based on the derived model, a DE learner is predicted to withdraw from the course if he participates in only 1 online activity within 6 weeks. It is further predicted that if he participates in only 2 activities, he has to access the LMS materials more than 4 times or obtain a rating of more than 80% in the major examination in order to get a mark of "Passed" in a certain course. The model indicates that a total of 195/248 DE learners, or around 78.62%, have a greater chance of completing their respective program on time. Consequently, 53/248, or about 21.37%, need intervention to complete the program on time. The results demonstrate that the J48 decision tree algorithm and multiple linear regression can be harnessed to build an LMS prediction model, which may provide a powerful educational tool that can analyze and predict the performance of DE learners scientifically.

FUTURE WORKS
In the future, the authors would like to extend their work by applying and testing the generated LMS prediction model on a more complex student database that may involve all online course activities. Furthermore, school administrators should involve more faculty members in the study so that more data can be generated and analyzed. Also, educational Knowledge Discovery in Databases (KDD) can be performed on other data for the improvement of students' services, such as the data produced in students' admission, guidance, and other online services of the University.

References
[1] V. Kumar and A. A. Chadha, "An empirical study of the applications of data mining techniques in higher education", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 3, March 2011.
[2] D. G. Bayyou, "Cloud computing implementation in higher educational institutions using thin client", Book of Abstracts, International Research Conference in Higher Education, pp. 87-88, ISBN 978-971-781-037-9, 2013.
[3] S. K. Yadav and S. Pal, "Data mining: a prediction for performance improvement of engineering students using classification", World of Computer Science and Information Technology Journal (WCSIT), Vol. 2, No. 2, pp. 51-56, 2012.
[4] J. A. Itmazi and M. G. Megías, "Survey: comparison and evaluation studies of Learning Content Management Systems", 2005.
[5] C. Pierrakeas, M. Xenos, C. Panagiotakopoulos and D. Vergidis, "A comparative study of dropout rates and causes for two different distance education courses", International Review of Research in Open and Distance Learning, 2004. Retrieved from http://files.eric.ed.gov/fulltext/EJ853871.pdf
[6] D. M. Gabrielle, "Distance Learning: An examination of perceived effectiveness and student satisfaction in higher education", Proc. SITE 2001, Orlando, FL: AACE, pp. 183-188, 2001.
[7] M. D. Miller, R. K. Rainer and J. K. Corley, "Predictors of engagement and participation in an online course", Online Journal of Distance Learning Administration, 2003. Retrieved from http://www.westga.edu/%7Edistance/ojdla/spring61/miller61.htm
[8] V. Osborn, "Identifying at-risk students in videoconferencing and web-based distance education", The American Journal of Distance Education, 15(1), pp. 41-54, 2001.
[9] A. Rovai, "In search of higher persistence rates in distance education online programs", Internet and Higher Education, 6(1), pp. 1-16, 2003.
[10] R. D. Nash, "Course completion rates among distance learners: identifying possible methods to improve retention".
[11] A. M. Paz, B. D. Gerardo and B. T. Tanguilig III, "Development of college completion model based on K-Means clustering algorithm", International Journal of Computer and Communication Engineering, Vol. 3, No. 3, May 2014.
[12] S. Pal, "Mining educational data using classification to decrease dropout rate of students", International Journal of Multidisciplinary Sciences and Engineering, Vol. 3, No. 5, May 2012.
[13] B. K. Baradwaj and S. Pal, "Mining educational data to analyze student's performance", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 6, 2011.
[14] S. A. Abaya, B. D. Gerardo and B. T. Tanguilig III, "Comparison of classification techniques in education marketing", Proceedings of the International MultiConference of Engineers and Computer Scientists 2014, Vol. I, IMECS 2014, March 12-14, 2014, Hong Kong.