MELJUN CORTES Research Paper 6 Metro South JULY 9 2018


of Taguig City University

by: Meljun P. Cortes

Presented to the

6th Metro South Universities and Colleges (MSUC) National Research Congress

July 7, 2018


Introduction

Pamantasang Taguig, also known as Taguig City University (TCU), was established through Ordinance No. 29, series of 2004, crafted by the City Sangguniang Bayan. During school year 2006-2007, Vice Mayor George Elias was appointed its first president; he organized the university and initiated the construction of the main building. The next president was Bro. Rolando Dizon, who served from 2010 to 2011. In 2011, Mayor Ma. Laarni L. Cayetano appointed Atty. Lutgardo B. Barbo as the third president. The fourth president of the university, Hon. Aurelio Paulo R. Bartolome, was appointed in November 2013. Dr. Juan C. Birion was appointed in 2018 and is the present university president (Birion & Tolentino, 2017).

towards societal advancement. The City mayor, Ma. Laarni L. Cayetano, declared that under her

education-based administration, TCU continues to be fully locally funded by the City Government of

Taguig. (Birion & Tolentino, 2017)

Because the university offers free tuition to graduates of secondary education institutions within Taguig City, the student population is increasing now that the first batch of K-12 students graduated in 2018. The main problem of the university is classroom accommodation with respect to the student population.

The assignment of classrooms is one of the major problems of a university as its student population grows. In this paper, the researcher used predictive analytics to analyze this problem. Predictive analytics is the branch of advanced analytics used to make predictions about unknown future events. It uses many techniques from data mining, statistics, modeling, machine learning, and artificial intelligence to analyze current data and make predictions about the future. The patterns found in historical and transactional data can be used to identify risks and opportunities for the future (predictiveanalyticstoday.com).

Predictive analytics, as a form of advanced analytics, aims to predict future events. It encompasses data collection, data modeling, statistics, and deployment. It also draws on data mining and machine learning, which are used to analyze current and historical facts in order to make predictions about the future. The main process of data mining is to collect, extract, and store valuable information. The two main objectives of predictive analytics are regression and classification. It is composed of various analytical and statistical techniques used for developing models that predict future occurrences, probabilities, or events. Predictive analytics deals with data that change continuously as well as data that change discontinuously.

Figure: Predictive Analytics (Data Collection, Data Modeling, Statistics, Deployment)

Regression is a data mining function that predicts a number. In this paper, the researcher applied regression analysis to build the proposed model. Regression techniques are used to predict quantities such as age, weight, distance, and temperature. A regression task starts with a dataset in which the target values are known. Common applications of regression are trend analysis and biomedical and financial forecasting. Various regression algorithms exist, including generalized linear models and support vector machines.

Regression estimates the relationship between a dependent (target) and an independent variable (predictor). The researcher used linear regression, which applies when there is a linear relationship between the independent and dependent variables.

The main goal of this study is to devise a model that predicts the number of classrooms from a given set of data attributes, namely the student population, using the selected algorithm.

This study sought to determine the best proposed model for predicting the number of classroom assignments with regard to student population. Specifically, it addressed the following questions:

1. How can the number of classrooms be predicted in relation to the student population?

2. What attributes of the data should be considered in feature selection?

3. What is the best mathematical algorithm for the variables identified in problem number 2?

4. What is the best proposed model to deploy for prediction?

Malthusian Theory of Population Growth

In his 1798 work, An Essay on the Principle of Population, Malthus examined the relationship between population growth and resources. From this, he developed the Malthusian theory of population growth, in which he wrote that population growth occurs exponentially, increasing according to the birth rate.

The researcher adopted this theory to relate the increase in student population to the resources needed, namely the number of classrooms of Taguig City University.
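As a small illustration of what exponential growth means, the following Python sketch evaluates P(t) = P0 * exp(r * t). The function names, the starting population, and the growth rate are assumptions made up for the example; they are not figures from the study.

```python
import math


def malthusian_population(p0: float, rate: float, years: float) -> float:
    """Exponential growth: population after `years` at a constant growth `rate`."""
    return p0 * math.exp(rate * years)


def doubling_time(rate: float) -> float:
    """Number of years for the population to double at a constant growth `rate`."""
    return math.log(2) / rate


# Hypothetical inputs, for illustration only.
initial_students = 6000
growth_rate = 0.05  # 5% per year, an assumed value

for year in range(0, 6):
    print(year, round(malthusian_population(initial_students, growth_rate, year)))
print("doubling time (years):", round(doubling_time(growth_rate), 2))
```

Under a constant rate, the population doubles after a fixed interval regardless of its starting size, which is why fixed resources such as classrooms can be outpaced.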

Figure (IV-DV): Independent variable (predictor, IV): student population from the Enrollment Report Data Set (2013-2017), historic data. Proposed algorithm: linear regression. Dependent variable (target, DV): number of classroom assignment, predicted data from the predicted model.

Model Assumptions

1. The predictor variable X is non-random.

2. The error term E is random.

3. The error term follows a normal distribution.

4. The standard deviation of the error is independent of X.

5. The data used to estimate the parameters should be independent of each other.

6. If any of the above assumptions is violated, the modeling procedure must be modified.

This study would be of great benefit to the following:

Taguig City University Registrar. This will help them understand the trends of enrollment and analyze the student population, and it will be useful for determining the allocation of classrooms needed for the upcoming year.

Taguig City University Administration. This study will guide them in preparing for the next three or five years in terms of readiness and in the strategic planning of the university.

Future Researchers. This study will guide them in their future research, especially in enhancing the study, and will serve as a future reference.

1. The study focuses only on proposed predictive modeling.

2. The study uses a dataset consisting of the last five years of enrollment reports from the university registrar of Taguig City University.

3. The study does not develop web-based software to predict the number of classrooms relative to the student population of Taguig City University.

4. The study employs the proposed mathematical model to predict the number of classrooms relative to the student population of Taguig City University.

5. The study uses R Studio of Anaconda Software to determine the confusion matrix with the accuracy, sensitivity, and statistical results of the proposed model.

Definition of Terms

TCU Enrollment Data Set is the set of enrollment data reports from the TCU Registrar for the last five years, from 2013 to 2017.

TCU Number of Classroom Data Set is the set of data on the number of classrooms from TCU Monitoring Services for the whole campus of the university.

Confusion Matrix is the statistical technique for evaluating the accuracy, sensitivity, and efficiency of the proposed predictive model.

R Studio Software is a software tool, used here with R script code, for simulating the validation and computation of the statistical measurements for evaluating the proposed predictive model and its goodness of fit.

Linear Regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

Regression Model is the equation that represents how an independent variable is related to a dependent variable and an error term: Y = B0 + B1x + E, where B0 and B1 are called the parameters of the model and E is a random variable called the error term.

Regression Analysis mainly focuses on finding a relationship between a dependent variable and one or more independent variables. It predicts the value of a dependent variable based on one or more independent variables. The coefficient explains the impact of changes in an independent variable on the dependent variable.

Predictive Analytics is the form of advanced analytics used to make predictions about unknown future events.

Regression and Classification are the two main objectives of predictive analytics. Predictive analytics is composed of various analytical and statistical techniques used for developing models that predict future occurrences, probabilities, or events.

Related Literature

Regression analysis is a predictive modeling technique. It estimates the relationship between a dependent (target) and an independent variable (predictor). Linear Regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable. (ISBN-13: 978-1979079525)

Figure: Linear regression plot (X-axis from 5.50 to 7.00, Y-axis from 100 to 124); predicted outcome = 123.9.

Related Works

1. V. Smith and D. Huston proposed predictive modeling to forecast student outcomes and drive effective interventions in online community college courses. This case study from a community college utilized learning analytics and the development of predictive models to identify at-risk students based on dozens of key variables. (Journal of Asynchronous Learning Networks, Volume 16: Issue 3, 2014)

2. N. Mishra and S. Silakari conducted a study entitled “Predictive Analytics: A Survey, Trends, Applications, Opportunities & Challenges”. The study focuses on predictive analytics, which uses data mining techniques to make predictions about future events and to make recommendations based on these predictions. The process involves an analysis of historic data, from which a model can be created using predictive analytics modeling techniques. The form of these predictive models varies depending on the data they use. Regression is among the techniques employed in predictive analytics. (International Journal of Computer Science and Information Technologies, Vol. 3 (3), 2012)

3. M. Rahman developed a predictive model for dengue fever in his paper entitled “Dengue Epidemic Prediction with Regression Model”, in which he found the relationships between climate and dengue epidemics and came up with a prediction model using the linear regression algorithm. (International Journal of Computer Science and Information Technologies, Vol. 8 (3), 2010; Md. Muminur Rahman, University of Derby)

Methodology

A quantitative research method was used, utilizing data analysis and a documentary analysis approach. Descriptive and experimental research designs were applied from the analysis, collection, and preparation of the dataset up to the building and evaluation of the model.

The data were collected from the university registrar and consist of the last five years of enrollment reports, from 2013 to 2017.

School Year    Over-all Total (Enrollees)       Number of     Remarks
               1st Semester    2nd Semester     Classrooms
2014-2015      12,072          11,500           69
2015-2016      11,693          11,723           69
2016-2017      9,185           8,349            69            * No first year
2017-2018      6,031           5,752            63            * No first year

The data source for the number of classrooms of the whole campus is the monitoring services department, headed by Mr. Dionico Jurada.

Figure: Regression models, classified as univariate or multivariate, linear or non-linear, and simple or multiple.

The researcher adopted the simple linear regression model, and the dataset was entered into its regression equation.

Regression Model

A regression model is the equation that represents how an independent variable is related to a dependent variable and an error term: Y = B0 + B1x + E, where B0 and B1 are the parameters of the model and E is a random variable called the error term.
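To make the regression equation concrete, the following Python sketch estimates B0 and B1 by ordinary least squares and reports R-squared. It is an illustration under stated assumptions, not the study's own script (which was written in R): the function names are hypothetical, and pairing the first-semester enrollment totals with the classroom counts from the enrollment table is an assumption, since the paper does not state which columns were stored in enrollment.csv.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimise the squared error of y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the variance of y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Assumed pairing: first-semester totals and classroom counts, 2014-2015 to 2017-2018.
stud_population = [12072, 11693, 9185, 6031]
number_room = [69, 69, 69, 63]

b0, b1 = fit_simple_linear_regression(stud_population, number_room)
print(f"number_room = {b0:.3f} + {b1:.6f} * stud_population")
print(f"R-squared = {r_squared(stud_population, number_room, b0, b1):.4f}")
```

With only four observations the fit is illustrative at best; the closed-form estimates are the same quantities that R's lm function reports as the intercept and slope.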

Research Process

The research process is modeled on CRISP-DM. CRISP-DM is a machine learning process model that describes approaches commonly used by machine learning experts to tackle problems. A 2009 review and critique of machine learning process models called CRISP-DM the “de facto standard for developing machine learning and knowledge discovery projects”. The researcher intends to apply this model in developing a web-based application in which the machine learning algorithm, the linear regression algorithm, is integrated into PHP and Python source code. The development of the software is future work, to follow after the proposed model has been tested.

Process:

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Data Modeling

5. Model Evaluation

6. Model Deployment

1. Business Understanding. The selection of a predictive model should start by defining its goals in terms of business requirements, which here are the trends of student population. This specification should then be converted into a predictive modeling problem definition.

2. Data Understanding. To operate effectively on the data in later phases, some knowledge of the characteristics of the data is needed. It is very important to understand the enrollment report data and the number of classrooms obtained from the Registrar and the Monitoring Services of Taguig City University.

3. Data Preparation. This is the process of preparing the enrollment data, the independent variable, as training data. Typical pre-processing tasks are noise cleaning, feature extraction, feature reduction, and feature selection.

4. Data Modeling. In this phase, a number of statistical techniques are proposed and their parameters are adjusted to the specific problem.

5. Model Evaluation. This stage involves further evaluation of the techniques that are of sufficient quality. Particular attention has to be directed to possible problems that have not been previously considered. It is also necessary to be confident that the methods actually address the original goals of the proposed predictive model.

6. Model Deployment. This last phase involves the steps necessary to enable the user to make use of the predictive algorithm developed in the previous steps.

The simple linear regression method was selected to predict the number of classrooms for the upcoming year. Linear regression is a statistical method with a long history of research and application. The method is commonly cited for its accuracy, robustness, and efficiency [15, 16]. A linear regression model was employed to generate estimates of the number of classrooms of Taguig City University for the next three or five years, which were then mapped to a warning level when the student population is high. Ultimately, the linear regression model was chosen because it offered significant advantages in several key areas compared to other models or methods. For instance, the linear regression algorithm is computationally inexpensive, which is an important consideration given the large student population at Taguig City University. Also, linear regression is very scalable, meaning that the addition of more students or input variables will not cause a dramatic increase in the computing resources required. Finally, as mentioned previously, linear regression has demonstrated a strong record of accuracy in a variety of domains over many years of academic research.

The following is the series of steps in building the proposed predictive model: Data Acquisition, Divide Dataset, Exploratory Analysis, Implement Model, Optimize Model, Model Validation, and Prediction.

1.) Data Acquisition. During this step, the researcher acquired the data from the university registrar and the office of monitoring services. The datasets are the student population, taken from the enrollment reports of the last five years, and the number of classrooms. These are the datasets for the linear regression analysis.

Independent Variable: student population

Dependent Variable: number of classrooms

The researcher loaded the file enrollment.csv in the R Studio IDE of Anaconda Software.

2.) Divide Dataset. During this step, the researcher divided the entire dataset into two subsets:

Training Dataset: to train the model

Testing Dataset: to validate and make predictions

The data are split in a 7:3 ratio, such that 70% serves as the training set and the remaining 30% as the testing set.

3.) Exploratory Analysis. The researcher used a scatter plot matrix to determine the positive linear trend between student population and number of classrooms. The researcher also checked the correlation, which is an important factor in identifying dependencies. The correlation analysis gives insight into the mutual relationship between the two variables.

4.) Implement Model. The researcher described the model using the summary() function, that is, summary(model). Two important values are the R-squared value and the P-value.

5.) Optimize Model. The researcher executed R script code in R Studio to see the residual standard error, multiple R-squared, and P-value.

6.) Model Validation. The researcher used the proposed predictive model to predict the output of the testing dataset, using R script code such as

> predict <- predict(model, test)

7.) Prediction. The researcher implemented the prediction by executing R script code such as

> predic <- predict(model, number_room)
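The divide, fit, and predict steps above can be sketched in Python as follows. This is a minimal illustration, not the study's R script: the helper names, the fixed random seed, and the ten data rows are all hypothetical and were invented so that the example runs on its own.

```python
import random
from statistics import mean


def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly, then cut them into training and testing subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit(rows):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    return y_bar - slope * x_bar, slope


def predict(model, xs):
    """Apply the fitted line to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


# Hypothetical (stud_population, number_room) rows, invented for illustration only.
rows = [(5000 + 800 * i, 60 + i) for i in range(10)]

train, test = split_dataset(rows)            # 7:3 split
model = fit(train)                           # train on 70% of the rows
predicted = predict(model, [x for x, _ in test])
for (x, actual), estimate in zip(test, predicted):
    print(f"stud_population={x}: actual={actual}, predicted={estimate:.2f}")
```

Fixing the seed keeps the split reproducible, so the same rows land in the testing set each time the model is re-evaluated.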

The researcher adopted the confusion matrix and its statistics to test the accuracy of the proposed predictive model. To validate the model, the researcher computed the sensitivity, specificity, prevalence, and P-value. R Studio of Anaconda Software was used as the tool to determine all necessary statistical measures.

The researcher also computed the root mean square error to determine the prediction error. The lower the root mean square error, the better the chosen model, which is the linear regression model.
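For reference, the root mean square error is the square root of the average squared difference between actual and predicted values. The Python sketch below shows the computation; the function name and the three actual and predicted classroom counts are hypothetical values chosen only to demonstrate the arithmetic.

```python
import math


def root_mean_square_error(actual, predicted):
    """Square root of the mean squared difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical classroom counts, for illustration only.
actual_rooms = [69, 69, 63]
predicted_rooms = [70.0, 68.0, 65.0]

rmse = root_mean_square_error(actual_rooms, predicted_rooms)
print(f"RMSE = {rmse:.4f}")
```

The RMSE is expressed in the same units as the dependent variable, so a value computed on classroom counts is read as a number of classrooms.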


The proposed predictive model was written as an R script.

The following R script code builds the linear regression model using the lm function.

> View(final_data)

> # number_room is the dependent variable; stud_population is the independent variable

> model <- lm(number_room ~ stud_population, data = train_reg)

> summary(model)

> # confusionMatrix() is provided by the caret package

> confusionMatrix(table(pred_$number_room, test$number_room))

The researcher adopted the confusion matrix and its statistics as the measure of the accuracy of the proposed predictive model. The accuracy, or overall success rate, is a metric defining the rate at which a model has classified the records correctly. A good model should have a high accuracy score.

pred_number_classroom: No Yes

No 54 66

Yes 63 59

Accuracy : 0.7113

95% CI : (0.6494, 0.7679)

No Information Rate : 0.6151

P-Value [Acc > NIR ] : 0.001172

Accuracy = 71.13%

Kappa : 0.3631

Mcnemar’s Test P-Value : 0.016053

Sensitivity : 0.8367

Specificity : 0.5109

Positive Predicted Value : 0.7321

Negative Predicted Value : 0.6620

Prevalence : 0.6151

Detection Rate : 0.5146

Detection Prevalence : 0.7029

Balanced Accuracy : 0.6738

Positive Class : No
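The statistics listed above all derive from the four cells of a two-class confusion matrix. The Python sketch below shows the formulas. The cell counts passed to the function (123, 24, 45, 47) are assumptions chosen for illustration because they reproduce the reported statistics to four decimal places; they are not the counts printed in the table above, and the function name is hypothetical.

```python
def confusion_statistics(tp, fn, fp, tn):
    """Summary statistics of a 2x2 confusion matrix for the positive class."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, used by Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Assumed counts (true positives, false negatives, false positives, true negatives).
stats = confusion_statistics(tp=123, fn=24, fp=45, tn=47)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity and specificity are computed relative to the class declared positive, which in the output above is "No".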

Interpretation:

Based on the above calculation, the proposed predictive model has an accuracy of 71.13%, which the researcher regards as a high accuracy score. The McNemar's test P-value of 0.016053 is lower than 0.05, which the researcher takes to mean that the proposed predictive model performs well in estimating the number of classrooms.

The root mean square error, which measures the error in prediction, has a value of 1020.616. The researcher considers this RMSE low and therefore takes the model to perform well. The researcher also found that implementing the simple linear regression model had a good impact.

1. How can the number of classrooms be predicted in relation to the student population?

Solution: By using the simple linear regression model and following the series of steps in the model-building process, the researcher arrived at a final model to determine the estimated number of classrooms. The regression model is the equation that represents how an independent variable is related to a dependent variable and an error term.

Finding: The proposed predictive model has an accuracy score of 71.13%, with a McNemar's test P-value of 0.016053, which is less than 0.05.

2. What attributes of the data should be considered in feature selection?

Solution: The dataset is composed of a training dataset and a testing dataset in a 7:3 ratio. The selection of data features concentrated on the student population, taken from the enrollment reports of the last five years, and the number of classrooms from the monitoring services department of Taguig City University. The data were used to train the model and to test it for validation.

Finding: The root mean square error, the error of prediction, is 1020.616; the lower the RMSE, the better the model. The actual data were tested using R script code with the lm function.

3. What is the best mathematical algorithm for the variables identified in problem number 2?

Solution: Using the lm function for the linear regression model, the proposed model was validated and tested during simulation in R Studio of Anaconda Software. The best mathematical equation is the simple linear regression, where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

Finding: The confusion matrix results show a detection rate of 0.5146 and a detection prevalence of 0.7029. The researcher takes this to mean that the proposed predictive model performs well, with an accuracy of 71.13%.

4. What is the best proposed model to deploy for prediction?

Solution: The simple linear regression was written in R script code to deploy for prediction.

> lm(number_room ~ stud_population, data = train_reg) -> mod_regress

In this work, the researcher carried out the linear regression process and analysis on the dataset from the university registrar and monitoring services of Taguig City University, namely student population and number of classrooms, as an experimental simulation of the proposed model. The results of the confusion matrix and statistics, including the root mean square error, do not guarantee excellent performance of the proposed predictive model. More simulations with other data are needed as input to the suggested model that has been coded in R Studio.

The researcher plans, as future work, to develop a web-based application in which the proposed predictive model is integrated using the PHP and Python programming languages.

References

Birion & Tolentino. 2017. Proposed Intervention Measures on the Scholastic Deficiency in the General Education Subjects of Taguig City University Students: Implications on the City Scholarships Program. Action Research.

David F. Andrews. 1974. A robust method for multiple linear regression. Technometrics 16, 4 (1974), 523–531.

Irvan Bastian Arief Ang, Flora Dilys Salim, and Margaret Hamilton. 2016. Human occupancy recognition with multivariate ambient sensors. In Pervasive Computing and Communication Workshops (PerCom Workshops), 2016 IEEE International Conference on. IEEE, 1–6.

Study.com. Malthusian Theory of Population Growth: Definition & Overview. (2016, Jan 5). Retrieved from https://study.com/academy/lesson/malthusian-theory-of-population-growth-definition-lesson-quiz.html

Predictive Analytics Today. What is Predictive Modeling? - Editor Review, User Reviews, Features, Pricing and Comparison in 2018.

Andrew Pearson. 2017. The Predictive Retailer (e-book). First published by Intelligentsia. ISBN-13: 978-1979079525.

N. Mishra and S. Silakari. 2012. Predictive Analytics: A Survey, Trends, Applications, Opportunities & Challenges. International Journal of Computer Science and Information Technologies, Vol. 3 (3).

Md. Muminur Rahman (University of Derby). 2010. Dengue Epidemic Prediction with Regression Model. International Journal of Computer Science and Information Technologies, Vol. 8 (3).

Baepler, P., and Murdoch, C. Academic Analytics and Data Mining in Higher Education. International Journal for the

Scholarship of Teaching and Learning 4(2) (2010).

Anton Bezuglov and Gurcan Comert. 2016. Short-term freeway traffic parameter prediction: Application of grey system theory models. Expert

Systems with Applications 62 (2016), 284–292.

A multi-agent system architecture for smart grid management and forecasting of energy demand in virtual power plants. IEEE Communications Magazine 51, 1

(2013), 106–113.

Smith, V., and Lange, A. Predictive Modeling to Forecast Student Outcomes and Drive Effective Interventions. Research Paper

Presentation, Council for the Study of Community Colleges, Seattle, Washington, April 16-17, 2010.

Lange, A., Corona, S., and Ushveridze, A. Improving Student Persistence and Success through Predictive Modeling Analytics.

WCET Annual Conference, Phoenix, AZ (November 2008).

Macfadyen, L., Dawson, S., Mining LMS data to develop an ‘‘early warning system” for educators: A proof of

concept. Computers & Education 54: 588-599 (2010).

Auguste, B., Cota, A., Jayaram, K., and Laboissiere, M. Winning by degrees: the strategies of highly productive higher- education institutions.

Education 66 (2010).

S. K. Tso, et al., "Data mining for detection of sensitive buses and influential buses in a power system subjected to disturbances," IEEE Trans.

Power Syst., vol. 19, no. 1, pp. 563-568, Feb. 2004.

