* Corresponding author
Structured Abstract
How can the best institutions of higher education be distinguished from the worst? This
question occupies the minds and hearts of parents, students, and teachers, because education
is an investment in both the individual's and the nation's future. One source of information
for answering it is the Ranking Universitário da Folha (RUF). Known for its
traditional evaluation, the Folha ranking is considered an independent evaluation tool
and provides a ranking of the best Brazilian universities. 74% of the data are related to
research areas and postgraduate programs. Who regulates and supervises the postgraduate
Proceedings IFKAD 2018
Delft, Netherlands, 4-6 July 2018
ISBN 978-88-96687-11-6
ISSN 2280787X
Purpose – The use of Machine Learning techniques to predict the Ranking Universitário
da Folha (RUF), using the previous year's history to train the Naïve Bayes algorithm.
1 Introduction
The quality of higher education, its assessment, and the ranking of the best universities
are topics addressed in this article. Despite the relevance of the theme, the central
point explored is the use of data mining algorithms to predict this ranking. It is
nonetheless necessary to contextualize the emergence and importance of these rankings.
The preliminary topic, which highlights the relevance of the theme, is the quality of
higher education.
The quality of education in Brazilian universities has become a central theme for the
country. In the educational area, the term "quality" is not consolidated and there is no
common ground on its meaning. For all practical purposes, however, the lack of a shared
understanding of the concept is not treated as a problem; the idea of quality is not even
brought into the focus of discussion. Together with the quality theme, questions about
quality assurance and accreditation arise (Sobrinho, 2008).
Since the 1990s, most Latin American countries have set up their own bodies for
evaluating the quality of university education (Sobrinho, 2006). In Brazil,
accreditation, which ultimately means "operating authorization", is a
governmental assignment regulated by the Sistema Nacional de Avaliação da Educação
Superior (SINAES), the National System for the Evaluation of Higher Education (Sobrinho
and Vessuri, 2006; Rish, 2001).
Since all universities in Brazil must have accreditation, that is, government
authorization to operate, how can the best be distinguished from the worst? This question
pervades the minds and hearts of parents, students, and teachers, as education is an
investment in both the individual's and the nation's future. Some resources are
available, but, being governmental, they share the same origin and do not offer the
independence of an evaluation that originates in society itself.
Precisely in this "information vacuum", the Ranking Universitário da Folha (Folha's
University Ranking, RUF) appears. Known for its traditional evaluation, the Folha
ranking is considered an independent evaluation tool and provides a ranking of the
best Brazilian universities. The RUF is developed under the responsibility of Folha de
São Paulo (founded in 1921) and uses several mechanisms to rank the 195 best
universities in the country, public or private. Datafolha is in charge of its execution.
According to the Folha de São Paulo website (2016), the RUF evaluates the 195 Brazilian
universities based on 5 indicators: Scientific research; Quality of Teaching;
Internationalization; Labour market; Innovation.
Data are obtained from a variety of sources, including two annual surveys
encompassing thousands of respondents. The sources are:
a. Inep-MEC
b. Web of Science
c. SciELO
d. Inpi
e. FAPs
f. CNPq
g. Capes
h. Two Datafolha surveys conducted annually
2 Theoretical Construction
Research in universities is usually associated with research groups, led in most cases
by Ph.D.s. Thus, it is plausible to hypothesize that the structure and functioning of
postgraduate programs have a strong influence on the RUF, all the more so because
research and teaching quality are relevant parts of the RUF, as an analysis of its
construction and structure shows. In this section, we look at how the RUF is built and
at the Data Mining tools that support the experiment.
We briefly describe what the RUF is and how it is composed.
published by the RUF. We also tried to build a decision tree with the J48 algorithm,
but the high number of branches made it impractical, and after pruning the tree became
uninformative. As a tool, we use WEKA.
2.3 Classification
Classification is a process that we have carried out constantly throughout history and
in our daily lives. Some examples: we classify means of transportation as air, land, or
sea; people as of legal age or minors; and the population into economic classes.
1 ranks based on decision trees, such as "if it is a sunny day, then it will not rain"
2 verifies the probability of an event occurring
3 available at http://www.cs.waikato.ac.nz/ml/weka/
3 Methodology
Methodologically, this research is classified as applied, since it produces immediate
results; it is also basic, in that it serves as a foundation for other research
(Marconi and Lakatos, 2010). As for its objectives, it is descriptive, insofar as it
describes characteristics of a phenomenon and establishes relations between variables. In
seeking to establish limits and approaches for new research, delimiting an unknown area,
it is also exploratory. It also has an explanatory objective, since it "deepens the
knowledge of reality because it explains the reason, the why of things" (Gil, 2002).
Its approach is qualitative, as the researchers attribute meanings to the data; it is
also quantitative, because it follows statistical rigor, using not only samples but the
whole universe of universities. We used bibliographic, documentary, and experimental
procedures (Gil, 2008).
The research followed these steps:
1. Obtain the CAPES open data for the years 2014 and 2015 regarding students, teachers,
and courses. These data were processed and prepared, composing a relational database.
Next, the RUF data for 2015 and 2016 were obtained, treated, and loaded into
relational database tables.
2. Prepare a correlation and conversion table matching the university acronyms of both
systems (CAPES and RUF). This was an exhaustive task; 2 incompatibilities could not
be resolved and remain part of the general analysis of the data.
3. Summarize the CAPES data for each university: Masters and Ph.D. courses, number of
teachers, and final students. Each of these summarized fields was a predictor, and the
decision (class) attribute was whether or not the university belonged to the RUF, thus
making the RUF and CAPES data compatible.
4. Following machine learning practice, use the 2014 data as the "training" set to
train the machine. To do so, we used the WEKA 4 software, applying the Naïve Bayes
algorithm.
5. Submit the 2015 data to predict which universities would be in the RUF 2016, and
compare the prediction with the actual result.
6. Build a confusion matrix from this comparison, indicating both false positives and
false negatives. The results are interesting because a significant result was obtained
using only open data, without surveys, interviews, or other non-open data (such as the
number of publications and citations, among others).
4 WEKA is open source software produced at the University of Waikato (NZ); it is a
collection of machine learning algorithms for data mining tasks.
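The train-then-predict flow of steps 4 and 5 can be sketched in a few lines. The code below is a minimal Gaussian Naïve Bayes written in plain Python; it is not WEKA's implementation, and the feature order (Masters courses, Ph.D. courses, teachers, students), the variance floor, and the toy numbers are illustrative assumptions rather than the study's data.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature mean/variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one row."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training rows: (masters courses, phd courses, teachers, students),
# labelled "S" (in the following RUF) or "N" (not in it).
train_rows = [(40, 30, 900, 5000), (35, 25, 800, 4200), (2, 0, 30, 120), (1, 0, 20, 80)]
train_labels = ["S", "S", "N", "N"]

model = train_gaussian_nb(train_rows, train_labels)
prediction_large = predict(model, (38, 28, 850, 4600))
prediction_small = predict(model, (1, 0, 25, 100))
```

Working in log space keeps the product of many small densities from underflowing; the independence assumption between features is what makes the per-feature sum valid.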
Next, the data were standardized, i.e., prepared so that they could be handled
by the software tools.
The CAPES open data are provided in CSV format, which is more suitable for processing
than the RUF data, which are published as a web page. Because they are in CSV,
processing them is simpler than processing the RUF.
From the CAPES open data site, the following files for postgraduate programs were
downloaded for the years 2014 and 2015: courses, teachers, and students.
We imported the raw data into a PostgreSQL database to facilitate normalization and
extraction in the format expected by the WEKA software.
Although the data follow the CSV standard, this does not mean that they are sanitized 5.
For sanitization and import into the database, we used an Extract, Transform and Load
(ETL) tool, of the kind used in the KDD process, more specifically in the data
preprocessing phase. The tool chosen is Pentaho's Data-integration 6 (Figure 2).
5 a process that standardizes the data while maintaining its validity
6 can be downloaded from the Pentaho Community at http://community.pentaho.com/projects/data-integration/
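As an illustration of what sanitizing a CSV involves, the sketch below trims whitespace, uppercases institution acronyms, and drops rows whose numeric field cannot be parsed. It uses Python's standard csv module rather than Pentaho Data-integration, and the column names (`sigla`, `uf`, `docentes`), the semicolon delimiter, and the sample rows are assumptions made for the example, not the actual CAPES layout.

```python
import csv
import io


def sanitize_rows(csv_text, delimiter=";"):
    """Yield cleaned dict rows; skip rows with a non-numeric teacher count."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for raw in reader:
        row = {key.strip(): (value or "").strip() for key, value in raw.items()}
        row["sigla"] = row["sigla"].upper()
        row["uf"] = row["uf"].upper()
        try:
            row["docentes"] = int(row["docentes"])
        except ValueError:
            continue  # unusable row: leave it out rather than guess a value
        yield row


# Hypothetical sample with stray spaces, mixed case, and one broken row.
sample = (
    "sigla;uf;docentes\n"
    " ufaa ;sc; 120\n"
    "UFBB;SP;n/a\n"
    "Ufcc;sp;95 \n"
)

clean = list(sanitize_rows(sample))
```

Dropping unparseable rows is only one possible policy; a real pipeline might instead log them for manual correction.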
With the Data-integration tool, all data of interest to the research were imported. Even
though both data sources deal with the same domain (universities), the acronyms of the
institutions diverge between the databases. Even after sanitization, approximately 50
universities that are part of the RUF were not located in the CAPES data.
To minimize this difference, a manual analysis of the data was required.
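The amount of manual work in this matching step can be reduced by normalizing the acronyms on both sides and joining on the pair (state, acronym). The sketch below shows one way to do that; the normalization rules and the sample records are assumptions for illustration, and the unmatched list is what would be passed on to manual analysis.

```python
import unicodedata


def normalize(acronym):
    """Uppercase, strip accents, and drop anything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", acronym)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalnum())


def match_institutions(ruf_entries, capes_entries):
    """Split RUF entries into those found in CAPES and those left for manual review."""
    capes_keys = {(uf.upper(), normalize(sigla)) for uf, sigla in capes_entries}
    matched, unmatched = [], []
    for uf, sigla in ruf_entries:
        key = (uf.upper(), normalize(sigla))
        (matched if key in capes_keys else unmatched).append((uf, sigla))
    return matched, unmatched


# Hypothetical (state, acronym) pairs; not taken from the real datasets.
ruf = [("SC", "UFAA"), ("SP", "Uni-BB"), ("RJ", "UCÇ")]
capes = [("sc", "ufaa"), ("SP", "UNIBB"), ("MG", "UDD")]

matched, unmatched = match_institutions(ruf, capes)
```

Including the state in the key avoids merging two institutions that happen to share an acronym in different states.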
What was provided for machine learning in the case under study is precisely the union
of the CAPES 2014 data and the RUF 2015; this is called the "training mass". Later, the
new CAPES data are passed to the already "trained" algorithm, which classifies them and
makes predictions based on the knowledge acquired in training. The data used were from
CAPES 2014, with state (UF) and university acronyms made compatible with the RUF.
The training file was then imported into WEKA, via its graphical interface, to apply
the Naïve Bayes algorithm and analyze its precision (Figure 2).
Several analyses and tests were performed to identify the configuration with the best
possible result. This activity is extremely important for the knowledge extraction
process and depends on the required interdisciplinarity, in which a subject-matter
expert contributes to these adjustments.
Applying the algorithm to the training data yielded a success rate of 78.95%. The
confusion matrix generated is as follows:
   a    b   <- classified as
 191   33 |  a = N
  51  124 |  b = S
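The success rate can be checked directly from the confusion matrix: it is the sum of the diagonal divided by the total number of instances. The short Python sketch below reproduces that arithmetic and also derives the per-class recall, which the text does not report; the variable names are ours.

```python
# Rows are actual classes, columns are predicted classes, as in WEKA's output.
confusion = {
    "N": {"N": 191, "S": 33},
    "S": {"N": 51, "S": 124},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Recall per class: correctly classified instances over all actual instances.
recall = {label: confusion[label][label] / sum(confusion[label].values())
          for label in confusion}

print(f"accuracy = {accuracy:.4f}")      # 315 / 399
print(f"recall N = {recall['N']:.4f}")   # 191 / 224
print(f"recall S = {recall['S']:.4f}")   # 124 / 175
```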
With this level of accuracy, the training data were exported via WEKA to the ARFF
format.
After training, the next step is to predict the RUF from the 2015 data. For this, an
ARFF file was generated with the 2015 CAPES data, to be classified by the algorithm
learned by the machine. This ARFF file contains the same columns as the training file;
however, the decision column contains "?", which tells the algorithm to predict its value.
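To make the layout concrete, the sketch below writes a small ARFF file in which the class attribute of every instance is "?". The relation name, attribute names, and values are hypothetical placeholders; only the ARFF structure (`@relation`, `@attribute`, `@data`, and "?" for a missing value) follows the format WEKA reads.

```python
def to_arff(relation, numeric_attributes, class_values, rows):
    """Build ARFF text; each row gets '?' as its (unknown) class value."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_attributes:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute in_ruf {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row) + ",?")
    return "\n".join(lines) + "\n"


# Hypothetical attributes and values, for illustration only.
arff_text = to_arff(
    relation="capes_2015",
    numeric_attributes=["masters_courses", "phd_courses", "teachers", "students"],
    class_values=["N", "S"],
    rows=[(12, 8, 310, 1450), (1, 0, 22, 90)],
)
print(arff_text)
```

The class attribute must be declared with the same nominal values as in the training file, otherwise the two files are not compatible for classification.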
Using a terminal 7 in the OS X operating system, the following command was
executed to instruct WEKA to perform the classification based on the training data:
java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t training.arff -T sort.arff -p 3-8 -D
As a result, WEKA presents the classification performed for each instance of the file
to be classified. Figure 3 shows part of the output of WEKA's processing.
The last column, "predicted", presents the prediction for each entry (data in
parentheses), thus indicating whether an institution with those characteristics of
courses, teachers, and students tends to be part of the RUF ranking. The prediction
results were normalized and imported into the PostgreSQL database. RUF 2016 was then
compared with the predictions, with the following result: of the 195 institutions that
make up the RUF 2016, 120 were predicted by the Naïve Bayes process, a success rate of 61.5%.
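The comparison in this last step amounts to intersecting two sets of institutions: those listed in the RUF 2016 and those the classifier predicted. A minimal sketch, using synthetic identifiers in place of the real acronyms, reproduces the reported figure:

```python
def hit_rate(actual, predicted):
    """Share of actual RUF members that the classifier also predicted."""
    return len(actual & predicted) / len(actual)


# Synthetic identifiers standing in for the 195 RUF 2016 institutions.
ruf_2016 = {f"inst_{i:03d}" for i in range(195)}
# 120 of them predicted, plus a few predictions outside the ranking
# (the number of outside predictions here is arbitrary).
predicted = {f"inst_{i:03d}" for i in range(120)} | {"other_1", "other_2"}

rate = hit_rate(ruf_2016, predicted)
print(f"{rate:.1%}")  # 120 / 195
```

Measured this way, the figure is a recall over the actual RUF members; it does not penalize institutions that were predicted but are absent from the ranking.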
7 also known as the command line or shell; it allows the user to ask the operating system to
perform actions such as listing files, creating directories, and running an application, among others
5 Conclusions
Sobrinho and Ristoff (2005) emphasize the importance of objective criteria for
institutional evaluation, asserting that objective criteria and procedures that
prioritize quantitative and comparable aspects are required.
The publication of linked open data expands this comparison process, allowing human
and non-human agents to process and analyze information. Berners-Lee (1989) suggests
that data should be open, especially data that can eventually be classified with 5 stars.
Classification  Description
1 star          Available on the web (whatever format) but with an open licence, to be Open Data
2 stars         Available as machine-readable structured data (e.g. Excel instead of an image scan of a table)
3 stars         As 2 stars, plus a non-proprietary format (e.g. CSV instead of Excel)
4 stars         All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff
5 stars         All the above, plus: link your data to other people's data to provide context
Source: Berners-Lee (1989, 2006)
The act of measuring, although it is part of society's evaluation of universities,
cannot be considered in isolation (Vianna, 2014):
The evaluation will express the actions, attitudes, and values of both individuals and
communities or the science itself; if possible it should contemplate its multiple
dimensions and interrelationships. It will always produce effects over time, be they
political or pedagogical. An important part of the evaluation refers to the tests
applied, the questionnaires to be answered and the results obtained - this is what is
called the technical part of the evaluation; therefore, measurement is part of the
evaluation, but the evaluation is not exhausted in the measurement. This means that it
is not enough to assign grades, weights, and ratings.
Correctly predicting more than 60% of the RUF ranking shows that it is possible, with
more detailed study and analysis of the techniques, to predict it with a certain degree
of confidence. It should be noted that, according to the RUF, Scientific Research
(mostly postgraduate) carries a 42% weight in the ranking.
Another hypothesis is to make a cut, selecting only the first 60 universities. An
algorithm to predict the 40, 50, or 60 best Brazilian universities, based strictly on
open CAPES data, may present a higher degree of confidence.
It is also observed that the CAPES processes have positive effects (above 60%) on the
quality management of the postgraduate programs of universities, which is
intrinsically linked to the quality of higher education.
References
Berners-Lee, T. (1989) Information management: A proposal.
Witten, I. H. and Frank, E. (2011) Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann
Sobrinho, J. D. and Vessuro, H. (2008) Quality, evaluation: from SINAES to indexes. In Avaliação da
Educação Superior Magazine.
Vianna, C. T. (2014) Avaliação institucional e o desafio da cultura da autoavaliação e cpa, In
conference's publications of regional seminar about institutional self-evaluation and
evaluations committees
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) From data mining to knowledge discovery
in databases. AI magazine, Vol. 17, No. 3, p. 37
Fulmari, A. and Chandak, M. B. (2014) An approach for word sense disambiguation using modified
naïve bayes classifier. International Journal of Innovative Research in Computer and
Communication Engineering, Vol. 2
Gil, A. C. (2002) Como elaborar projetos de pesquisa. São Paulo, Vol. 5
Gil, A. C. (2008) Métodos e técnicas de pesquisa social. In: Métodos e técnicas de pesquisa social.
Atlas
Gonçalves, A. L. (2006) Um modelo de descoberta de conhecimento baseado na correlação de
elementos textuais e expansão vetorial aplicado à engenharia e gestão do conhecimento. 196 f.
Tese (Doutorado) — Tese (Doutorado em Engenharia de Produção)-Programa de Pós-
Graduação em Engenharia de Produção, Universidade Federal de Santa Catarina, Florianópolis
Linoff, G. S. and Berry, M. J. (2011) Data mining techniques: for marketing, sales, and customer
relationship management. John Wiley & Sons
Marconi, M. d. A. and Lakatos, E. M. (2010) Fundamentos de metodologia científica. In:
Fundamentos de metodologia científica. Atlas
Mitchell, T. M. (1997) Machine learning. New York
Rish, I. (2001) An empirical study of the naive bayes classifier. In: IBM NEW YORK. IJCAI 2001
workshop on empirical methods in artificial intelligence. Vol. 3, No. 22, pp. 41–46.
Sobrinho, J. D. (2006) Acreditación de la educación superior en américa latina y el caribe. In:
TRES, J.; SANYK, B. C. (Ed.). La educación superior en el Mundo 2007. Acreditación para la
garantía de la calidad: ¿Qué está en juego? Global University Network for Innovation
Sobrinho, J. D. and Ristoff, D. I. (2005) Avaliação como instrumento da formação cidadã e do
desenvolvimento da sociedade democrática: por uma ético-epistemologia da avaliação. Ristoff,
Dilvo & Almeida JR, Vicente (organizadores). Avaliação Participativa, Perspectivas e
Debates, série Educação Superior em Debate, No. 1, pp. 15–38
Sobrinho, J. D. and Vessuri, H. (2006) Paradigmas e políticas de avaliação da educação superior.
autonomia e heteronomia. Universidad e investigación científica: convergências y tensiones.
Vessuri H, org. Buenos Aires: CLACSO, Consejo Latinoamericano de Ciencias Sociales, pp.
169–191