

A Proposed Predictive Model for Student Population of Taguig City University
by: Meljun P. Cortes

Institutional Research Paper

Presented at the

6th Metro South Universities and Colleges (MSUC) National Research Congress

July 7, 2018

TCU Auditorium, Taguig City University

Central Bicutan, Taguig City



Introduction
The Pamantasang Taguig, also known as Taguig City University (TCU), was established through
Ordinance No. 29, series of 2004, crafted by the City Sangguniang Bayan. During the school year
2006-2007, Vice Mayor George Elias was appointed its first president; he organized the university and
initiated the construction of the main building. The next president was Bro. Rolando Dizon, who served
from 2010 to 2011. In 2011, Mayor Ma. Laarni L. Cayetano appointed Atty. Lutgardo B. Barbo as the third
president. The fourth president of the university, Hon. Aurelio Paulo R. Bartolome, was appointed in
November 2013. Dr. Juan C. Birion was appointed as the present university president in 2018 (Birion
& Tolentino, 2017).

The university envisions itself to be an eminent center of excellence in higher education
towards societal advancement. The City Mayor, Ma. Laarni L. Cayetano, declared that under her
education-based administration, TCU continues to be fully locally funded by the City Government of
Taguig (Birion & Tolentino, 2017).

Since the university offers free tuition to graduates of secondary education institutions
within Taguig City, the student population is increasing now that the first batch of K-12 students
graduated in 2018. The main problem of the university is the accommodation of classrooms with respect to
the student population.

The assignment of classrooms is one of the major problems of a university as the
student population grows. In this paper, the researcher used predictive analytics to analyze this
main problem. Predictive analytics is the branch of advanced analytics that is used to make
predictions about unknown future events. Predictive analytics uses many techniques from data mining,
statistics, modeling, machine learning and artificial intelligence to analyze current data and make
predictions about the future. The patterns found in historical and transactional data can be used to
identify risks and opportunities for the future. (predictiveanalyticstoday.com)

Background of the Study


Predictive analytics is the branch of advanced analytics that predicts future events.
It encompasses data collection and modeling, statistics and deployment. Predictive
analytics draws on data mining and machine learning, which are used to analyze current and
historical facts in order to make predictions about the future. The main process of data mining is to
collect, extract and store valuable information. The two main objectives of predictive analytics are
regression and classification. It is composed of various analytical and statistical techniques used for
developing models that predict future occurrences, probabilities or events. Predictive analytics deals
with both continuous and discontinuous changes in data.

[Figure: Phases of the predictive modeling process: Data Collection, Data Modeling, Statistics, Deployment, Predictive Analytics]

Regression is a data mining function that predicts a number. In this paper, the researcher applied
regression analysis in order to build the proposed model. Regression techniques are used to
predict quantities such as age, weight, distance, and temperature. A regression task starts with a dataset
in which the target values are known. Common applications of regression are trend analysis, biomedical
and financial forecasting. Regression algorithms include generalized linear models and support
vector machines.

Regression analysis is a predictive modeling technique. It estimates the relationship between a
dependent (target) variable and an independent (predictor) variable. The researcher used linear
regression, which applies when there is a linear relationship between the independent and dependent
variables.

Objective of the Study


The main goal of this study is to devise a model that predicts the number of classrooms based
on a given data attribute, the student population, using the selected algorithm.

Statement of the Problem


This study determined the best proposed model for predicting the number of classroom assignments
with regard to the student population.

Specifically, the study aimed to answer the following:

1. How can the number of classrooms be predicted in relation to the student population?
2. What attributes of the data should be considered in feature selection?
3. What is the best mathematical algorithm for the variables mentioned in problem
number 2?
4. What is the best proposed model to deploy for prediction?

Theoretical Framework of the Study


Malthusian Theory of Population Growth

In his 1798 work, An Essay on the Principle of Population, Malthus examined the relationship between
population growth and resources. From this, he developed the Malthusian theory of population growth in
which he wrote that population growth occurs exponentially, so it increases according to birth rate.

The researcher adopted this theory to relate the increase in student population to the
resources needed, namely the number of classrooms of Taguig City University.
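
As a brief illustration, exponential growth of the kind described above is commonly written in the following form. The symbols are assumptions introduced here and do not appear in the paper: P(t) is the population at time t, P_0 the initial population, and r a constant growth rate.

```latex
P(t) = P_0 \, e^{r t}
```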

Conceptual Framework of the Study


[Figure: Conceptual framework. Independent variable (predictor, IV): student population from the enrollment report data set, 2013-2017 (historic data). Algorithm: linear regression (proposed predicted model). Dependent variable (target, DV): number of classroom assignments (predicted data).]

Model Assumptions
1. The predictor variable X is non-random.
2. The error term E is random.
3. The error term follows a normal distribution.
4. The standard deviation of the error is independent of X.
5. The data used to estimate the parameters should be independent of each other.
6. If any of the above assumptions is violated, the modeling procedure must be modified.
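
Assumptions 1 to 5 can be stated compactly for the simple linear regression model used in this paper. The notation below is a sketch under stated assumptions: the index i, the sample size n and the constant variance sigma^2 are introduced here for illustration.

```latex
Y_i = B_0 + B_1 x_i + E_i, \qquad
E_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently for } i = 1, \dots, n,
\qquad x_i \ \text{fixed (non-random)}, \quad \sigma^2 \ \text{not depending on } x_i
```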

Significance of the Study


This study would be of great benefit to the following:

Taguig City University Registrar. This study will help the registrar understand enrollment trends and
analyze the student population, which will be useful for determining the allocation of classrooms needed
for the upcoming year.

Taguig City University Administration. This study will guide the administration in preparing for the next
three or five years in terms of readiness and the strategic planning of the university.

Future Researchers. This study will guide them in their future research, especially in the
enhancement of this study, and serve as a future reference.

Scope and Limitation of the Study


1. The study focuses only on the proposed predictive modeling.
2. The study uses as its dataset the enrollment reports of the last five years from the university
registrar of Taguig City University.
3. The study does not develop web-based software to predict the number of classrooms against the
student population of Taguig City University.
4. The study employs the proposed mathematical model to predict the number of classrooms
against the student population of Taguig City University.
5. The study uses the R Studio of Anaconda Software to determine the confusion matrix with the
accuracy, sensitivity and statistical results of the proposed model.

Definition of Terms
TCU Enrollment Data Set is the set of enrollment data reports from the TCU Registrar for the last five
years, from 2013 to 2017.

TCU Number of Classroom Data Set is the set of data on the number of classrooms from TCU Monitoring
Services for the whole campus of the university.

Confusion Matrix is a statistical technique for evaluating the accuracy, sensitivity, and efficiency of the
proposed predictive model.

R Studio Software is a software tool that runs R scripts, used here for the validation and computation of
the statistical measurements for evaluating the proposed predictive model and its goodness of fit.

Linear Regression is a statistical technique where the score of a variable Y is predicted from the score of a
second variable X. X is referred to as the predictor variable and Y as the criterion variable.

Regression Model is the equation that represents how an independent variable is related to a dependent
variable and an error term: Y = B0 + B1x + E, where B0 and B1 are called the
parameters of the model and E is a random variable called the error term.

Regression Analysis mainly focuses on finding a relationship between a dependent variable and one or
more independent variables. It predicts the value of a dependent variable based on one or more independent
variables. A coefficient explains the impact of changes in an independent variable on the dependent variable.

Predictive Analytics is the branch of advanced analytics used to make predictions about unknown future
events.

Regression and Classification are the two main objectives of predictive analytics. It is composed of various
analytical and statistical techniques used for developing models that predict future occurrences,
probabilities or events.

Related Literature

• Regression analysis is a predictive modeling technique.
• It estimates the relationship between a dependent (target) variable and an independent (predictor) variable.
• Linear regression is a statistical technique where the score of a variable Y is predicted from the
score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

[Figure: Linear regression example; X-axis from 5.50 to 7.00, Y-axis from 100 to 124. Input value = 7.00; predicted outcome = 123.9]

(E-book: The Predictive Retailer, by Andrew Pearson, 2017, first published by Intelligentsia,
ISBN-13: 978-1979079525)
Related Works
1. V. Smith and D. Huston proposed predictive modeling to forecast student outcomes and drive
effective interventions in online community college courses. This case study from a community
college used learning analytics and the development of predictive models to identify at-risk
students based on dozens of key variables.
(Journal of Asynchronous Learning Networks, Volume 16: Issue 3, 2014)
2. N. Mishra and S. Silakari conducted a study entitled "Predictive Analytics: A Survey, Trends,
Applications, Opportunities & Challenges". The study focuses on predictive analytics that uses data
mining techniques to make predictions about future events and to make recommendations
based on these predictions. The process involves an analysis of historic data, from which a model can be
created using predictive analytics modeling techniques. The form of these predictive
models varies depending on the data they use. Regression is one of the techniques employed in
predictive analytics. (International Journal of Computer Science and Information Technologies, Vol. 3 (3), 2012)
3. M. Rahman developed a predictive model in his paper entitled "Dengue Epidemic Prediction
with Regression Model", in which he found the relationships between climate and
dengue epidemics and came up with a prediction model using the linear regression algorithm.
(International Journal of Computer Science and Information Technologies, Vol. 8 (3), 2010, Md. Muminur
Rahman, University of Derby)

Methodology

A quantitative research method was used, utilizing data analysis and a documentary
analysis approach. Descriptive and experimental research designs were applied
throughout the collection, preparation and analysis of the dataset, up to the building and evaluation
of the model.

Data Set and Data Sources


The data were collected from the university registrar and consist of the enrollment reports for
the last five school years, 2013 to 2017.

                    Over-all Total (Enrollees)
School Year         1st Semester    2nd Semester    Number of Classrooms    Remarks
2013-2014           15,288          13,821          69
2014-2015           12,072          11,500          69
2015-2016           11,693          11,723          69
2016-2017            9,185           8,349          69                      No first year
2017-2018            6,031           5,752          63                      No first year

The data source for the number of classrooms of the whole campus is the Monitoring Services
department headed by Mr. Dionico Jurada.
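
To make the table easier to work with, the following minimal Python sketch encodes its rows and computes the students per classroom and the Pearson correlation between enrollment and classroom count. Using first-semester enrollment as the yearly student population is an assumption made for this illustration, and the names records and pearson_r are hypothetical. The paper's own analysis was carried out in R Studio.

```python
from math import sqrt

# Rows copied from the enrollment table: (school year, 1st semester enrollees,
# 2nd semester enrollees, number of classrooms).
records = [
    ("2013-2014", 15288, 13821, 69),
    ("2014-2015", 12072, 11500, 69),
    ("2015-2016", 11693, 11723, 69),
    ("2016-2017", 9185, 8349, 69),
    ("2017-2018", 6031, 5752, 63),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Assumption: first-semester enrollment stands in for the student population.
population = [first_sem for _, first_sem, _, _ in records]
classrooms = [rooms for _, _, _, rooms in records]

r = pearson_r(population, classrooms)
for year, first_sem, _, rooms in records:
    print(f"{year}: {first_sem / rooms:.1f} students per classroom")
print(f"Pearson r = {r:.3f}")
```

With only five rows, and four of them sharing the same classroom count, the correlation is indicative at best.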

Predictive Models Classification


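[Figure: Classification of regression models. Regression models are divided into univariate and multivariate; each of these is divided into linear and non-linear; linear models are further divided into simple and multiple.]

The researcher adopted the simple linear regression model, and the
dataset was entered into the regression equation.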

Regression Model
A regression model is the equation that represents how an independent variable is related to a
dependent variable and an error term: Y = B0 + B1x + E.
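
For reference, the parameters of the regression model Y = B0 + B1x + E are usually estimated by ordinary least squares. The closed-form estimates below are the standard textbook result; the hats, bars and index i are notation introduced here, with \bar{x} and \bar{Y} denoting sample means over the n observations.

```latex
\hat{B}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{B}_0 = \bar{Y} - \hat{B}_1 \, \bar{x}
```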

Research Process
The research process is modeled on the CRISP-DM model. The CRISP-DM model
is a machine learning process model that describes the approaches commonly used by machine learning
experts to tackle problems. A review and critique of machine learning process models in 2009 called
CRISP-DM the "de facto standard for developing machine learning and knowledge discovery projects".
The researcher intends to apply this model in developing a web-based application in which the
linear regression algorithm is integrated into PHP and Python
source code. The development of this software is future work, to follow after the proposed model has been tested.

Process:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Modeling
5. Model Evaluation
6. Model Deployment

1. Business Understanding. The selection of a predictive model should start by defining its goals in terms of
business requirements, which here are the trends of the student population. This specification should then be
converted into a problem definition for the proposed predictive model.
2. Data Understanding. To operate effectively on the data in the later phases, some knowledge has to be
obtained about the characteristics of the data itself. It is very important to understand the enrollment report
data and the number of classrooms from the Registrar and Monitoring Services of Taguig City University.
3. Data Preparation. This is the process of producing the enrollment data as training data, which is the
independent variable. Typical pre-processing tasks are noise cleaning, feature extraction, feature reduction
and feature selection on the selected data.
4. Data Modeling. In this phase, a number of statistical techniques are proposed and their parameters are
adjusted to the specific problem.
5. Model Evaluation. This stage involves further evaluation of the techniques that are of sufficient quality.
Particular attention has to be directed to possible problems that have not been previously considered. It is
also necessary to be confident that the methods will actually meet the original goals of the proposed
predictive model.
6. Model Deployment. This last phase involves the steps necessary to enable the user to exploit the
predictive algorithm developed in the previous steps.

Simple Linear Regression Model


The simple linear regression method was selected to predict the number of classrooms
for the upcoming year. Linear regression is a statistical method with a long history of research
and application. The method is commonly cited for its accuracy, robustness, and efficiency [15, 16]. A linear
regression model was employed to generate estimates of the number of classrooms of Taguig City
University for the next three or five years, which can then be flagged when the warning level for the
student population is high. Ultimately, the linear regression model was chosen because it offered
significant advantages in several key areas compared to other models or methods. For instance, the linear
regression algorithm is computationally inexpensive, which is an important consideration given the large
student population at Taguig City University. Also, linear regression is very scalable, meaning that the
addition of more students or input variables will not cause a dramatic increase in the computation
required. Finally, as mentioned previously, linear regression has demonstrated a
strong record of accuracy in a variety of domains over many years of academic research.

Model Building Process


The following is the series of steps in building the proposed predicted model.

• Data Acquisition
• Divide Dataset
• Exploratory Analysis
• Implement Model
• Optimize Model
• Model Validation
• Prediction

1.) Data Acquisition. During this step, the researcher acquired the data from the university registrar and
the office of monitoring services. The dataset consists of the student population, taken from the enrollment
reports of the last five years, and the number of classrooms. These are the datasets for the linear
regression analysis.

Independent Variable : student population
Dependent Variable : number of classrooms

The researcher loaded the file enrollment.csv in the R Studio IDE of Anaconda Software.

2.) Divide Dataset. During this step, the researcher divided the entire dataset into two subsets:

Training Dataset – to train the proposed predictive model
Testing Dataset – to validate the model and make predictions

The data is split in a 70:30 ratio, such that 70% serves as the training set and the remaining 30% as the
testing set.

3.) Exploratory Analysis. The researcher used a scatter plot matrix to determine the positive
linear trend between student population and number of classrooms. The researcher checked
the correlation, which is an important factor in identifying dependencies. The correlation
analysis gives the researcher insight into the mutual relationship between the two variables.
4.) Implement Model. The researcher described the model using the summary() function, that
is, summary(model). Two important values are the R-squared value and the P-value.
5.) Optimize Model. The researcher executed the R script
in R Studio to see the residual standard error, multiple R-squared and p-value.
6.) Model Validation. The researcher used the proposed predicted model to predict the output of
the testing dataset, using R script code such as
> prediction <- predict(model, test)

7.) Prediction. The researcher implemented the prediction by executing R script code
such as
> prediction <- predict(model, number_room)
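
The steps above (divide dataset, implement model, validate and predict) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the researcher's R script: the function names split_dataset, fit_simple_linear and predict are hypothetical, the rows reuse the first-semester enrollment and classroom figures from the data table, and the shuffle seed is arbitrary.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and split them into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_simple_linear(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(model, xs):
    """Apply the fitted line to new values of the independent variable."""
    b0, b1 = model
    return [b0 + b1 * x for x in xs]


# (student population, number of classrooms); first-semester figures are an
# assumption about which enrollment column feeds the model.
rows = [(15288, 69), (12072, 69), (11693, 69), (9185, 69), (6031, 63)]

train, test = split_dataset(rows)  # 70:30 split
model = fit_simple_linear([x for x, _ in train], [y for _, y in train])
predicted = predict(model, [x for x, _ in test])

for (x, actual), estimate in zip(test, predicted):
    print(f"population={x}: actual={actual}, predicted={estimate:.1f}")
```

With five rows the 70:30 split yields three training rows and two testing rows, so the fitted line is illustrative only.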

Statistical Treatment of the Proposed Predicted Model


The researcher adopted the confusion matrix and its statistics to test the accuracy of the
proposed predicted model. To validate the model, the researcher computed the sensitivity, specificity,
prevalence and P-value. The R Studio of Anaconda Software was used as the tool to determine all
necessary statistical measures.

The researcher also computed the root mean square error (RMSE) to determine the prediction error.
The lower the root mean square error, the better the chosen model, which is the linear
regression model.
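
As a small illustration of the root mean square error described above, the sketch below computes it for a handful of values. The function name rmse and the sample numbers are hypothetical and are not taken from the study's data.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical classroom counts, for illustration only.
actual_rooms = [69, 69, 63]
predicted_rooms = [68.0, 70.0, 65.0]

print(f"RMSE = {rmse(actual_rooms, predicted_rooms):.4f}")
```

The RMSE is expressed in the same units as the dependent variable, so it is read relative to the scale of the values being predicted.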

Result and Discussion


The proposed predicted model was written as an R script.
The following R code builds the linear regression model using the lm function.

Proposed Predicted Model:

> lm(number_room ~ stud_population, data = train_reg) -> mod_regress   # fit the model with the lm function
# number_room is the dependent variable; stud_population is the independent variable

> predict(mod_regress, test_reg) -> result_regress   # prediction using the test data

# store the actual and predicted values side by side
> cbind(actual = test_reg$number_room, predicted = result_regress) -> final_data

> as.data.frame(final_data) -> final_data

> View(final_data)   # view the final data with the actual and predicted values

> (final_data$actual - final_data$predicted) -> error   # prediction error per row

> cbind(final_data, error) -> final_data

> View(final_data)

> summary(mod_regress)

# confusion matrix (confusionMatrix is provided by the caret package);
# predicted_class and actual_class are placeholder names for the Yes/No labels
> confusionMatrix(table(predicted_class, actual_class))

The researcher measured the accuracy of the proposed predicted model
using the confusion matrix and its statistics. The accuracy, or overall success rate, is a metric
defining the rate at which a model has classified the records correctly. A good model should have a high
accuracy score.

Confusion Matrix and Statistics


pred_number_classroom:    No    Yes
No                        54     66
Yes                       63     59

Accuracy : 0.7113 (71.13%)
95% CI : (0.6494, 0.7679)
No Information Rate : 0.6151
P-Value [Acc > NIR] : 0.001172
Kappa : 0.3631
McNemar's Test P-Value : 0.016053
Sensitivity : 0.8367
Specificity : 0.5109
Positive Predicted Value : 0.7321
Negative Predicted Value : 0.6620
Prevalence : 0.6151
Detection Rate : 0.5146
Detection Prevalence : 0.7029
Balanced Accuracy : 0.6738
Positive Class : No
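
To show how the listed statistics follow from a 2x2 confusion matrix, the sketch below applies the standard formulas. The cell counts are hypothetical: they were chosen because they reproduce the reported statistics to four decimal places, and they are not the counts printed in the table above. The function name confusion_metrics is also an assumption.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard statistics derived from the four cells of a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts (positive class "No"), selected only because they are
# consistent with the reported statistics.
metrics = confusion_metrics(tp=123, fn=24, fp=45, tn=47)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy on its own can hide an imbalance between the classes, which is why sensitivity, specificity and balanced accuracy are reported alongside it.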

Interpretation:

Based on the above calculation, the proposed predicted model has an accuracy of
71.13%, which means that the model has a high accuracy score. The McNemar's test P-value is 0.016053,
which is lower than 0.05, meaning that the proposed predicted model performs better in predicting the
estimated number of classrooms.

The root mean square error for the prediction has a value of 1020.616. This showed that the
proposed predicted model has a lower RMSE and therefore better performance. The researcher also
found that the implementation of the simple linear regression model that was built has a good impact.

Findings and Solutions

Statement of the Problem


1. How can the number of classrooms be predicted in relation to the student population?

Solution: By using the simple linear regression model and following the series of steps in the model
building process, the researcher arrived at a final model to determine the estimated number of
classrooms. The model is based on the regression equation that represents how an independent variable is
related to a dependent variable and an error term.

Finding: The proposed predictive model has a high accuracy score of 71.13%, with a P-value of
0.016053, which is less than 0.05.

2. What attributes of the data should be considered in feature selection?

Solution: The dataset is composed of a training dataset and a testing dataset in the ratio of 7:3. The
selection of data features concentrated on the student population, taken from the enrollment reports of
the last five years, and the number of classrooms from the Monitoring Services department of Taguig City
University. The data were used to train the model and also to test the model for validation.

Finding: The root mean square error, known as the error of prediction, is 1020.616; the lower the
RMSE, the better the model. The actual data were tested using R script code with the lm function.

3. What is the best mathematical algorithm for the variables mentioned in problem
number 2?

Solution: Using the lm function for the linear regression model, the proposed model was validated and tested
during the simulation in the R Studio of Anaconda Software. The best mathematical equation is the simple linear
regression, where the score of a variable Y is predicted from the score of a second variable X. X is referred
to as the predictor variable and Y as the criterion variable.

Finding: From the result of the confusion matrix, the detection rate has a value of 0.5146 and the
detection prevalence has a value of 0.7029. This means that the proposed predicted model showed
better performance and a high accuracy of 71.13%.

4. What is the best proposed model to deploy for prediction?

Solution: The simple linear regression model was written in R script code to deploy for prediction.
> lm(number_room ~ stud_population, data = train_reg) -> mod_regress

> predict(mod_regress, test_reg) -> result_regress



Conclusion and Future Work


In this work, the researcher carried out the linear regression process and analysis on the given
dataset from the university registrar and monitoring services of Taguig City University, namely the student
population and number of classrooms, as an experimental simulation of the proposed model. The results of the
confusion matrix and statistics, including the root mean square error, do not guarantee excellent
performance of the proposed predictive model. More simulations with other data are needed
as input to the suggested model, which has been coded in R Studio.

The researcher plans to go further by developing a web-based application in which the proposed
predictive model is integrated using the PHP and Python programming languages; this will be the future work.

References
Birion & Tolentino. 2017. Proposed Intervention Measures on the Scholastic Deficiency in the General Education Subjects of Taguig City
University Students: Implications on the City Scholarships Program. Action Research.

David F Andrews. 1974. A robust method for multiple linear regression. Technometrics 16, 4 (1974), 523–531.

Irvan Bastian Arief Ang, Flora Dilys Salim, and Margaret Hamilton. 2016. Human occupancy recognition with multivariate ambient sensors.
In Pervasive Computing and Communication Workshops (PerCom Workshops), 2016 IEEE International Conference on. IEEE, 1–6.

Study.com. Malthusian Theory of Population Growth: Definition & Overview. (2016, Jan 5). Retrieved from
https://study.com/academy/lesson/malthusian-theory-of-population-growth-definition-lesson-quiz.html

What is Predictive Modeling? - Editor Review, User Reviews, Features, Pricing and Comparison in 2018 -
Predictive Analytics Today.

E-book: The Predictive Retailer, by Andrew Pearson, 2017, first published by Intelligentsia,
ISBN-13: 978-1979079525.

Journal of Asynchronous Learning Networks, Volume 16: Issue 3, 2014

International Journal of Computer Science and Information Technologies, Vol. 3 (3) , 2012

International Journal of Computer Science and Information Technologies, Vol. 8 (3) , 2010 Md. Muminur Rahman,University of Derby

Baepler, P., and Murdoch, C. Academic Analytics and Data Mining in Higher Education. International Journal for the
Scholarship of Teaching and Learning 4(2) (2010).

Anton Bezuglov and Gurcan Comert. 2016. Short-term freeway traffic parameter prediction: Application of grey system theory models. Expert
Systems with Applications 62 (2016), 284–292.

A multi-agent system architecture for smart grid management and forecasting of energy demand in virtual power plants. IEEE Communications Magazine 51, 1
(2013), 106–113.

Smith, V., and Lange, A. Predictive Modeling to Forecast Student Outcomes and Drive Effective Interventions. Research Paper
Presentation, Council for the Study of Community Colleges, Seattle, Washington, April 16-17, 2010.

Lange, A., Corona, S., and Ushveridze, A. Improving Student Persistence and Success through Predictive Modeling Analytics.
WCET Annual Conference, Phoenix, AZ (November 2008).
Macfadyen, L., and Dawson, S. Mining LMS data to develop an "early warning system" for educators: A proof of
concept. Computers & Education 54: 588-599 (2010).
Auguste, B., Cota, A., Jayaram, K., and Laboissiere, M. Winning by degrees: the strategies of highly productive higher- education institutions.
Education 66 (2010).
S. K. Tso, et al., "Data mining for detection of sensitive buses and influential buses in a power system subjected to disturbances," IEEE Trans.
Power Syst., vol. 19, no. 1, pp. 563-568, Feb. 2004.