
Research Data Mart in an Academic System

Francesco Di Tria, Ezio Lefons, and Filippo Tangorra


Dipartimento di Informatica
Università degli Studi di Bari Aldo Moro
Via Orabona 4, 70125 Bari
ITALY
{francescoditria, lefons, tangorra}@di.uniba.it
Abstract - Data warehousing is an activity that is receiving more and more attention in several contexts. Universities, too, are adopting data warehousing solutions for business intelligence purposes. In these contexts, there are specific aspects to be considered, such as the evaluation of Didactics and Research. Indeed, these are the main factors affecting the importance and the quality level of every University. In this paper, we present the architecture of a Business Intelligence system in an academic organization and we illustrate the design of a data mart devoted to the evaluation of the Research activities.

Keywords - Business intelligence system, decision making, data warehouse architecture, multidimensional modeling.

I. INTRODUCTION

Data warehouses are the most important component of Business Intelligence systems devoted to producing significant information to be used in strategic decision making. The aim is to improve the business processes of an information system, also using web-based environments [1]. In recent years, Universities too have started to take advantage of implementing Business Intelligence systems [2, 3, 4]. Nonetheless, these organizations have specific purposes that differ very much from those of enterprises and companies. Indeed, typical objectives affecting the management of a University are: offering a better quality of instruction; managing employees and human resources; managing the economic-financial administration; avoiding waste; and increasing scientific publications and research projects. Given these business goals, university decision makers are always interested in the possibility of timely making the best business decisions on the basis of historical data available in a unique and updated source of information.

The main problem to be faced in the realization of an Academic Business Intelligence system is that each University has its own legacy databases of historical data. Moreover, these databases are independent of one another, producing data redundancy and inconsistency. For example, the list of professors is included in the database concerning the didactics offer; the same list is repeated in the database related to the university staff, and the two lists cannot be aligned. Hence the need for data integration in order to provide useful information for analytic purposes. Our University also meets the conditions described above. Therefore, we decided to develop a data warehouse integrating all the present and historical data resources. The data warehouse covers different departmental areas and, therefore, is composed of several data marts.

In this paper, we present the Research data mart for the analysis of the scientific activities and publications in our academic environment. This data mart covers the needs related to the evaluation of the results achieved by the university staff involved in the Research activity. The paper is structured as follows. Section 2 shows our academic Business Intelligence system. Section 3 presents the design of the Research data mart. Section 4 reports the Business Intelligence applications for the evaluation of the Research activities. At last, Section 5 contains concluding remarks.
II. BUSINESS INTELLIGENCE SYSTEM

Business Intelligence consists of methodologies and technologies that support companies in obtaining information about their own business processes. A Business Intelligence system in the University context aims to understand the students' performance, the teaching staff's productivity, and the quality of Research and Didactics. So, a University data warehouse represents a unique system of analysis available to the supervisory staff of the Athenaeum and to the single organizational and administrative structures, such as departments and student secretariats. Moreover, such systems are also able to supply data in real time to external agencies devoted to checking the results reached by the University.

Figure 1 shows the Business Intelligence system we developed for analytic purposes in our University. It consists of four levels: the source relational databases, the ETL procedures, the data warehouse, and the applications [5]. The source relational databases contain transactional data. ESSE3 (Secretary and Services for Students) is the database that manages the students' didactic curricula. NOGE (NOt manaGEd) is a secondary old database that stores legacy historical data about students enrolled before the introduction of ESSE3. Other databases, such as CIA (Athenaeum Integrated Accounting) and CSA (Careers and Wages of Athenaeum), are devoted to the financial and economic management. At last, SAPERI is the database of the scientific research competence of the University; it also includes publications and patents of researchers. SINBAD is the database storing data about the Athenaeum research projects. ETL procedures are used to populate the data warehouse with cleaned and historical data. The data warehouse generates a set of data marts to cover specific academic analytical needs: (a) Didactics is the data mart devoted to the analysis of the students' performance, such as examinations and degrees, and even the enrolment trend per year [6]; (b) Finance is the data mart developed for the analysis of financial and analytic economic movements;
(c) Research is the data mart containing synthetic and integrated data about the scientific publications; (d) Human Resources is the data mart that allows investigating the legal-economic careers and wages of the academic personnel. At the bottom level, there are the Business Intelligence applications. These applications usually consist of reports developed for decision makers. MIUR (Italian Ministry of the University and Research) is the national committee for the evaluation of the university system. CRUI (Conference of Rectors of the Italian Universities) is the association for the evaluation of the Didactics and Research areas. The Academic Supervisory Staff includes the Academic Senate, which is the governing body in matters of programming the development of the Athenaeum and coordinating Didactics and Research, and the Administration Council, which supervises the administrative, financial, and economic-patrimonial management of the Athenaeum. Organizational Structures are the Faculties, which organize the Didactic activities, and the Departments, which deal with the management of the Research activities. The Administrative structures are the student secretariats and the data processing centres, whose tasks are the production of data for Alma Laurea, the national registry of graduate students, and the preparation of documents to support the decisional processes.
[Figure 1 depicts the four levels of the system: the Source Relational Databases (ESSE3, NOGE, CSA, CIA, SAPERI, SINBAD), the ETL procedures feeding the Academic Data Warehouse with its Didactics, Research, Finance, and Human Resources data marts, the Business Intelligence Applications, and the Decision Makers (MIUR, CRUI, Academic Supervisory Staff, Organizational structures, Administrative structures).]

Figure 1. Academic Business Intelligence System.

III. THE RESEARCH DATA MART

Because of the complexity of the data mart to realize, we used a hybrid design methodology that aims to produce fact schemas by considering both the user needs and the data sources at the same time. In this way, it is possible to obtain a data mart that is consistent with the available data without missing business goals. Moreover, hybrid methodologies usually also provide algorithms to define conceptual schemas automatically, in order to reduce implementation time and design errors.

We adopted our hybrid design methodology, the Graph-oriented Hybrid Multidimensional Model (GrHyMM, for short) [7], whose lifecycle includes the following steps:

1. Requirement analysis, to define business goals using the i* for data warehousing framework [8, 9].

2. Source analysis, to understand, normalize, and integrate the different source databases.

3. Conceptual design. This is based on a graph-oriented representation of the integrated source schema. In detail, the conceptual design consists in a reengineering of the data source by performing traditional operations on graphs (such as pruning, grafting, and deleting nodes) using constraints derived from the Requirement analysis.

4. Logical design. The conceptual schema, formed of a set of one or more independent graphs, is transformed into a logical one based on the relational model, where the root node of each graph is a fact table and the root's children are dimension tables.

5. Physical design. The implementation of the logical schema ends the design process with the definition of the physical database properties.

6. Feeding plan. This is the step in which the Extraction, Transformation, and Loading (ETL) [10] procedures are developed in order to populate the data mart(s) with cleaned data.
A. Requirement analysis

Here, we present the requirement analysis to identify the business goals the decision makers need to achieve. The aim of the decision makers is to evaluate the Research activities. To do so, they are mainly interested in knowing the number of scientific publications produced in the University.

In this context, the fact of interest is the publication. This fact is an event that occurs each time one or more authors of the University receive an approval for the publication of a paper. The most important points of view from which to analyze a publication are the year of publication, the author, the typology, and the editor. In order to obtain reliable statistical data, the decision makers must be allowed to analyze the publications related to the last ten years. Therefore, ten years represent the historicity level of the data mart. At last, we define a workload containing the typical analytical queries the decision makers intend to issue. Part of the workload for the publication fact is the following:

- mean number of publications per author,
- count of publications per author and per typology in a given period of time,
- count of publications per Faculty,
- count of publications per Department, and
- count of publications per Sector.
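For concreteness, the following sketch expresses two of these workload queries in SQL, using the table and column names of the star schema derived later in this section (publication, author, typology, time). These are illustrations of the intended analyses, not the queries of the actual BI tool.

```python
# Illustrative SQL for two of the workload queries, written against the star schema
# that will be derived in Section III.D (tables publication, author, typology, time).
# Sketches of the intended analyses, not the queries of the actual BI tool.

# Count of publications per author and per typology in a given period of time.
count_per_author_and_typology = """
SELECT a.AuthorLastName, a.AuthorFirstName, ty.TypoDescr, COUNT(*) AS n_publications
FROM publication p
JOIN author   a  ON p.AuthorID = a.AuthorID
JOIN typology ty ON p.TypoID   = ty.TypoID
JOIN time     t  ON p.TimeID   = t.TimeID
WHERE t.Year BETWEEN 2007 AND 2009
GROUP BY a.AuthorID, a.AuthorLastName, a.AuthorFirstName, ty.TypoDescr;
"""

# Mean number of publications per author: fact rows (author-publication pairs)
# divided by the number of distinct authors.
mean_publications_per_author = """
SELECT COUNT(*) * 1.0 / COUNT(DISTINCT p.AuthorID) AS mean_publications_per_author
FROM publication p;
"""

print(count_per_author_and_typology)
print(mean_publications_per_author)
```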

B. Source analysis

The source databases used for this data mart are CSA and SAPERI. CSA aims at managing the legal and economic data of the university personnel, while SAPERI stores data about scientific publications. As concerns the CSA database, we are interested in knowing, for each person, the first and the last name, the institutional role in the University (which can be, for example, assistant professor, associate professor, or full professor), the Sector representing the scientific area of membership, the Department, and the Faculty of affiliation. From the SAPERI database, we want to know the list of scientific publications and, for each of them, the title, a brief description, its typology (that is, book chapter, journal paper, and so on), the publication year, the author(s), the language, the editor(s), the pages, and the ISBN and DOI codes. The first step in the design process is the integration of the parts of interest of these two databases, in order to obtain a global and reconciled database. This integration is done by mapping the concepts of the first database onto those of the second one. In this case, we note that the concept of person affiliated to the University in CSA corresponds to the concept of author in SAPERI. So, we can build an integrated source relational schema where the publications are authored by persons affiliated to the University structures according to their specific institutional role.
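As a minimal illustration of this concept mapping, the sketch below builds the reconciled authoring relation by joining SAPERI authorship data with CSA personnel data. All table and column names (saperi_authorship, saperi_publication, csa_person) are hypothetical, since the real legacy schemas are not reported here.

```python
# Hypothetical sketch of the CSA-SAPERI concept mapping: the SAPERI "author" is
# reconciled with the CSA "person affiliated to the University". Table and column
# names are assumptions made only for this example.
integration_view = """
CREATE VIEW authoring AS
SELECT pub.PublicationID,
       per.PersonID AS AuthorID        -- the CSA person plays the role of SAPERI author
FROM saperi_authorship  sa
JOIN saperi_publication pub ON sa.PublicationID = pub.PublicationID
JOIN csa_person         per
  ON UPPER(sa.AuthorName) = UPPER(per.LastName || ' ' || per.FirstName);
"""

# The exact-name join above is the ideal case only; misspelled, shortened, and reversed
# names are handled by the cleaning procedures of the feeding plan (Section III.F).
print(integration_view)
```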
In Fig. 2, we partially report the integrated source schema.
In detail, the figure shows only the tables of interest on the
basis of the requirement analysis. Other tables can be safely
ignored and, therefore, they have not been reported.

Figure 2. Integrated source schema.

C. Conceptual design

The GrHyMM methodology establishes the creation of an attribute tree having its root in the fact [11]. So, we built the attribute tree starting from the publication table, by navigating through foreign keys and many-to-many relationships. After the creation, we performed some operations in order to modify the source schema and to adapt it to the user requirements. The operations we executed are as follows (a sketch of this remodeling is given at the end of this subsection).

- Unnecessary children of the publication node have been pruned. These nodes are Language, Relevance, InsertDate, Title, Description, ISBN, and DOI.
- Unnecessary children of the editor node have been pruned. These nodes are EditorCity and EditorNation.
- Unnecessary children of the author node have been pruned. The node is TaxCode.
- The publication node and the authoring node have been merged, because facts represent many-to-many relationships among dimensions.
- The Year node has been defined as a dimension of the publication fact.
- The author node has been defined as a dimension of the publication fact. This dimension has institutionalRole as a hierarchy level.

The conceptual design ends with the validation process. To this aim, the workload is always used to validate the final conceptual schema of the data mart [12]. Indeed, a conceptual schema is assumed to be validated if all the queries included in the workload are executable against that schema.
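The following sketch illustrates the kind of attribute-tree remodeling described above. The tree fragment and the prune operation are simplified assumptions for illustration only and do not reproduce the actual GrHyMM implementation.

```python
# Illustrative sketch of the attribute-tree remodeling of the conceptual design step.
# The tree fragment and the prune operation are simplified assumptions; they do not
# reproduce the actual GrHyMM implementation.

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])

def prune(node, unwanted):
    """Remove, at every level, the children whose names appear in `unwanted`."""
    node.children = [c for c in node.children if c.name not in unwanted]
    for c in node.children:
        prune(c, unwanted)

# Part of the attribute tree rooted in the publication fact (already merged with authoring).
publication = Node("Publication", [
    Node("Language"), Node("Title"), Node("ISBN"), Node("DOI"),
    Node("Year"), Node("Typology"),
    Node("Editor", [Node("EditorName"), Node("EditorCity"), Node("EditorNation")]),
    Node("Author", [Node("TaxCode"), Node("InstitutionalRole"),
                    Node("Department"), Node("Sector"), Node("Faculty")]),
])

# Prune the unnecessary descendants listed above.
prune(publication, {"Language", "Title", "ISBN", "DOI",
                    "EditorCity", "EditorNation", "TaxCode"})

# The surviving children of the root become the dimensions of the publication fact.
print([c.name for c in publication.children])   # ['Year', 'Typology', 'Editor', 'Author']
```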
D. Logical design

In Fig. 3, the logical schema of the Research Data Mart, obtained from the remodeled attribute tree, is depicted. The fact table is called publication. It represents a four-dimension cube with no measures. The first dimension is editor, the second dimension is typology, and the third one is time; the minimum aggregation level for the time dimension is the year. The last dimension is author, which is structured in four hierarchies: author-department, author-sector, author-faculty, and author-institutionalRole. Notice that the one-to-many relationships between the dimension tables and the fact table allow roll-up and drill-down operations.
[Figure 3 shows the star schema: the fact table publication has foreign keys EditorID, TimeID, TypoID, and AuthorID towards the dimension tables editor (EditorID, EditorName), time (TimeID, Year), typology (TypoID, TypoDescr), and author (AuthorID, AuthorFirstName, AuthorLastName), with the author dimension linked to department (DeptID, DeptName), sector (SectID, SectName), faculty (FacID, FacName), and institutionalRole (RoleID, RoleName).]

Figure 3. The Research Data Mart.
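For illustration, the sketch below creates the star schema of Fig. 3 in an in-memory SQLite database; the deployed system actually uses MySQL (see the physical design below), so this is only an assumption-level rendering of the table structure, not the production DDL.

```python
# Sketch of the Research Data Mart star schema of Fig. 3, created here in an in-memory
# SQLite database for illustration; the deployed system uses MySQL (Section III.E).
import sqlite3

DDL = """
CREATE TABLE editor   (EditorID INTEGER PRIMARY KEY, EditorName TEXT);
CREATE TABLE time     (TimeID   INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE typology (TypoID   INTEGER PRIMARY KEY, TypoDescr TEXT);

CREATE TABLE department        (DeptID INTEGER PRIMARY KEY, DeptName TEXT);
CREATE TABLE sector            (SectID INTEGER PRIMARY KEY, SectName TEXT);
CREATE TABLE faculty           (FacID  INTEGER PRIMARY KEY, FacName  TEXT);
CREATE TABLE institutionalRole (RoleID INTEGER PRIMARY KEY, RoleName TEXT);

-- Author dimension with its four hierarchies (department, sector, faculty, role).
CREATE TABLE author (
    AuthorID        INTEGER PRIMARY KEY,
    AuthorFirstName TEXT,
    AuthorLastName  TEXT,
    DeptID INTEGER REFERENCES department(DeptID),
    SectID INTEGER REFERENCES sector(SectID),
    FacID  INTEGER REFERENCES faculty(FacID),
    RoleID INTEGER REFERENCES institutionalRole(RoleID)
);

-- Fact table: a four-dimensional cube with no measures; each row records that an
-- author published something of a given typology, with a given editor, in a given year.
CREATE TABLE publication (
    EditorID INTEGER REFERENCES editor(EditorID),
    TimeID   INTEGER REFERENCES time(TimeID),
    TypoID   INTEGER REFERENCES typology(TypoID),
    AuthorID INTEGER REFERENCES author(AuthorID),
    PRIMARY KEY (EditorID, TimeID, TypoID, AuthorID)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.close()
```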

E. Physical design

The Research Data Mart has been implemented using MySQL as the relational Database Management System, since it currently represents a valid solution for data warehousing environments [13]. This system provides several storage engines. We chose MyISAM, since it is a high-performance engine: it is not transaction-oriented and it does not enforce foreign key constraints. Indeed, in data warehousing systems, data consistency is more important than referential integrity [14].
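Under this design choice, the deployed table definitions would simply carry the MyISAM storage-engine clause; the fragment below is illustrative, as the actual scripts are not reproduced in the paper.

```python
# In the MySQL deployment, the table definitions simply carry the MyISAM storage engine
# clause; MyISAM enforces neither transactions nor foreign keys, which fits the
# read-mostly, bulk-loaded data mart. Illustrative fragment, not the deployed script.
mysql_fact_table = """
CREATE TABLE publication (
    EditorID INT, TimeID INT, TypoID INT, AuthorID INT,
    PRIMARY KEY (EditorID, TimeID, TypoID, AuthorID)
) ENGINE=MyISAM;
"""
print(mysql_fact_table)
```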
F. Feeding plan

During the design and the implementation of the ETL procedures, several problems arose due to inconsistencies between the source databases. The most important problem we faced is related to the author(s) of a publication. In fact, in the SAPERI database no constraints are defined on the author's name. As a consequence, users that insert data are allowed to associate any author's name with a paper, even if that name does not match any person affiliated to the University. So, typos and misspellings are very frequent. Moreover, in the same datasets, we also found incomplete names (without the first name), shortened names, and reversed names, because of the lack of standardization in the name representation. Of course, these errors created homonymy problems and difficulties in the identification of the right author(s) of a paper. Some of these inconsistencies have been solved by comparing the names with those in CSA and by using a data mining algorithm for string matching [15]. Only a small part of the publications (5.8%) could not be associated with any person and has been lost. Furthermore, 26.38% of the publications have been only partially associated with legitimate authors; that is, not all their authors have been correctly identified.
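The name-cleaning step can be sketched as follows with Python's standard difflib; the real ETL procedure relies on a string-matching data mining algorithm [15], so the normalization rules and the 0.75 similarity threshold used here are assumptions made only for this example.

```python
# Illustrative sketch of the author-name cleaning performed by the ETL procedures:
# SAPERI author strings are matched against the CSA personnel list. The real system
# uses a string-matching data mining algorithm [15]; difflib and the 0.75 threshold
# below are assumptions made only for this example.
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case and strip punctuation so that 'Rossi, M.' and 'M Rossi' compare better."""
    return " ".join(name.replace(",", " ").replace(".", " ").lower().split())

def best_match(author_name, csa_persons, threshold=0.75):
    """Return the CSA person most similar to the SAPERI author name, or None if unmatched.
    Reversed names ('first last' vs 'last first') are also tried."""
    target = normalize(author_name)
    reversed_target = " ".join(reversed(target.split()))
    best, best_score = None, 0.0
    for person in csa_persons:
        p = normalize(person)
        score = max(SequenceMatcher(None, target, p).ratio(),
                    SequenceMatcher(None, reversed_target, p).ratio())
        if score > best_score:
            best, best_score = person, score
    # Names below the threshold stay unmatched and are reported as lost publications.
    return best if best_score >= threshold else None

if __name__ == "__main__":
    staff = ["Rossi Mario", "Bianchi Anna"]           # hypothetical CSA personnel
    print(best_match("Rossi M.", staff))              # shortened first name -> Rossi Mario
    print(best_match("Anna Bianchi", staff))          # reversed order       -> Bianchi Anna
```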
IV. BUSINESS INTELLIGENCE APPLICATIONS

In order to allow university decision makers to evaluate the Research activities, we developed an ad-hoc Business Intelligence tool. This software tool is a web application whose main interface is shown in Fig. 4. The interface is divided into four frames:

1. Elaboration types,
2. Search parameters,
3. Grouping pattern, and
4. Visualization.

In frame 1, it is possible to choose between two types of measures: the count of the publications and the mean number of publications in a given period of time. In frame 2, search parameters can be set. This corresponds to slice-and-dice OLAP operations that allow users to reduce the data cardinality. Selection patterns can be defined over year, typology, editor, and author affiliation. As concerns the latter, it is possible to filter data over Faculty, Sector, or Department. After this choice, users can further reduce the data cardinality by selecting a specific author by name. Notice that all the parameters are optional. In frame 3, the grouping pattern can be established and the typical roll-up/drill-down OLAP operations are allowed. In detail, this frame provides aggregation over up to five dimensions: year, editor, typology, author affiliation, and author. The order of the dimensions chosen for aggregation determines the report output, as each dimension corresponds to a column of the report. In frame 4, the user is allowed to choose the report output, which can be a table, a crosstab based on the pivoting OLAP operator, or a chart.
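The way frames 2 and 3 drive the report can be illustrated by a small query builder in which slice-and-dice filters become a WHERE clause and the chosen grouping pattern becomes the GROUP BY columns. Column names follow the schema of Fig. 3, and the sketch is not the code of the actual web application.

```python
# Sketch of how the search parameters (frame 2) and the grouping pattern (frame 3)
# could be translated into a SQL query over the star schema of Fig. 3. Illustrative
# only; real code should use parameterized queries instead of string interpolation.

COLUMNS = {                       # dimension name -> column exposed by the joins below
    "year": "t.Year",
    "editor": "e.EditorName",
    "typology": "ty.TypoDescr",
    "faculty": "f.FacName",
    "author": "a.AuthorLastName",
}

def build_report_query(group_by, filters):
    """group_by: ordered list of dimension names; filters: dict of dimension -> value."""
    select_cols = [COLUMNS[g] for g in group_by]
    where = [f"{COLUMNS[dim]} = '{value}'" for dim, value in filters.items()]
    sql = (
        "SELECT " + ", ".join(select_cols) + ", COUNT(*) AS n_publications\n"
        "FROM publication p\n"
        "JOIN time t      ON p.TimeID   = t.TimeID\n"
        "JOIN editor e    ON p.EditorID = e.EditorID\n"
        "JOIN typology ty ON p.TypoID   = ty.TypoID\n"
        "JOIN author a    ON p.AuthorID = a.AuthorID\n"
        "JOIN faculty f   ON a.FacID    = f.FacID\n"
    )
    if where:
        sql += "WHERE " + " AND ".join(where) + "\n"
    return sql + "GROUP BY " + ", ".join(select_cols) + ";"

# Publications grouped by Faculty and year, restricted to 2008 (cf. the report of Fig. 6).
print(build_report_query(["faculty", "year"], {"year": 2008}))
```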

Figure 4. Business Intelligence tool.

For example, Figure 4 shows an analysis devoted to evaluating the number of publications from 2007 to 2009, where the sector is L-FIL-LET/04 (viz., Literature and Philosophy), grouping by year, typology, editor, and author. The result is shown in Fig. 5 in tabular form. Another example of Business Intelligence application is depicted in Fig. 6, which shows the number of publications from 2007 to 2008, grouped by Faculty and year. Therefore, this report allows decision makers to identify the Faculties that have the highest number of publications in that period of time. In this case, the output format is a crosstab showing the grouping elements as coordinates of a bi-dimensional cube. At last, Figure 7 shows the number of publications per Department and typology, using a histogram as the output format of the report.
V. CONCLUSIONS

In the last years, we have been involved in the realization of an Academic Business Intelligence system in order to allow the several university decision makers to evaluate the quality of the Didactics and the Research activities in our University. In this paper, we have presented the design of the Research Data Mart, which is mainly devoted to storing historical data about the scientific publications of the persons affiliated to the University. The data mart is updated yearly with new data. However, we noted that almost 6% of the publications are lost, since these incomplete records cannot be attributed to any author. For multidimensional analysis, we also developed a Business Intelligence tool as a web application for remote access by legitimate users. The software tool generates reports that show the numbers of scientific publications per year, editor, typology, author, and author's affiliation.

Future work will address the necessity to complete the realization of the Academic data warehouse, by extending the current Research data mart to also consider the evaluation of the research projects and the exploitation of funds, and by developing the remaining two data marts. The former will be devoted to the analysis of the financial management of the University, in order to detect and avoid waste of money. The latter will be used for analyses about staff employment, in order to evaluate the performance of the university personnel.
Figure 5. BI application: number of publications per year, typology, editor, and author.

Figure 6. BI application: number of publications per Faculty and year.

Figure 7. BI application: number of publications per Department and typology.

REFERENCES

[1] C. Fernandes and M. Whalen, "Data warehousing from the web," Proc. of the 2004 American Society for Engineering Education Annual Conference & Exposition, Salt Lake City, 2004, pp. 1-11.
[2] M. C. Lin, "University data warehouse design issues: case study," Proc. of the 2001 American Society for Engineering Education Annual Conference & Exposition, Albuquerque, 2001, pp. 1-9.
[3] D. Wierschem, J. McMillen, and R. McBroom, "What Academia Can Gain from Building a Data Warehouse," EDUCAUSE Quarterly, Vol. 26, No. 1, 2003, pp. 41-46.
[4] G. L. Donhardt and D. M. Keel, "The Analytical Data Warehouse: Empowering Institutional Decision Makers," EDUCAUSE Quarterly, Vol. 24, No. 4, 2001, pp. 56-58.
[5] R. Kimball and M. Ross, The Data Warehouse Toolkit, 2nd edition, New York: John Wiley & Sons, 2002.
[6] C. dell'Aquila, F. Di Tria, E. Lefons, and F. Tangorra, "An Academic Data Warehouse," Proc. of the 7th WSEAS Int. Conf. on Applied Informatics and Communications (AIC '07), Athens, Greece, August 24-26, 2007, pp. 229-235, WSEAS Press.
[7] F. Di Tria, E. Lefons, and F. Tangorra, "GrHyMM: A Graph-Oriented Hybrid Multidimensional Model," in Advances in Conceptual Modeling. Recent Developments and New Directions, Brussels, Belgium, October 31 - November 3, 2011, Lecture Notes in Computer Science, vol. 6999, pp. 86-97.
[8] J. N. Mazón, J. Trujillo, M. Serrano, and M. Piattini, "Designing Data Warehouses: from Business Requirement Analysis to Multidimensional Modeling," in Cox, K., Dubois, E., Pigneur, Y., Bleistein, S.J., Verner, J., Davis, A.M., and Wieringa, R. (Eds.), Requirements Engineering for Business Need and IT Alignment, University of New South Wales Press, 2005.
[9] F. Di Tria, E. Lefons, and F. Tangorra, "Hybrid Methodology for Data Warehouse Conceptual Design by UML Schemas," Information and Software Technology, Vol. 54, No. 4, 2012, pp. 360-379.
[10] R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit, 1st edition, New York: John Wiley & Sons, 2004.
[11] C. dell'Aquila, F. Di Tria, E. Lefons, and F. Tangorra, "Dimensional Fact Model Extension via Predicate Calculus," The 24th International Symposium on Computer and Information Sciences, ISCIS 2009, 14-16 September 2009, North Cyprus, IEEE Press, pp. 211-217.
[12] C. dell'Aquila, F. Di Tria, E. Lefons, and F. Tangorra, "Logic Programming for Data Warehouse Conceptual Schema Validation," Data Warehousing and Knowledge Discovery, 12th International Conference, DaWaK 2010, Bilbao, Spain, August/September 2010, Springer, Lecture Notes in Computer Science, vol. 6263, pp. 1-12.
[13] D. Feinberg and M. A. Beyer, Magic Quadrant for Data Warehouse DBMS Servers, Gartner, 2011.
[14] M. Golfarelli and S. Rizzi, Data Warehouse Design: Modern Principles and Methodologies, McGraw-Hill/Osborne Media, 2009.
[15] D. Holmes and M. C. McCabe, "Improving Precision and Recall for Soundex Retrieval," ITCC '02 Proceedings of the International Conference on Information Technology: Coding and Computing, IEEE Computer Society, Washington, DC, USA, 2002, pp. 22-26.
