
Decision Support Systems 37 (2004) 151 – 173

www.elsevier.com/locate/dsw

Metadata management: past, present and future


Arun Sen *
Department of Information and Operations Management, Mays Business School, Texas A&M University, College Station, TX 77843, USA
Received 1 October 2002; accepted 4 December 2002

Abstract

In the past, metadata has always been a second-class citizen in the world of databases and data warehouses. Its main purpose
has been to define the data. However, the current emphasis on metadata in the data warehouse and software repository
communities has elevated it to a new prominence. The organization now needs metadata for tool integration, data integration
and change management. The paper presents a chronological account of this evolution—both from conceptual and management
perspectives.
Repository concepts are currently being used to manage metadata for tool integration and data integration. As a final chapter in this evolution process, we point out the need for a concept called "metadata warehouse." A real-life data warehouse project called the TAMUS Information Portal (TIP) is used to describe the types of metadata needed in a data warehouse and the changes that the metadata go through. We propose that the metadata warehouse needs to be designed to store the metadata and manage its changes. We propose several architectures that can be used to develop a metadata warehouse.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Metadata; Data warehouse; Decision support; Metadata warehouse; Repository

1. Introduction

Metadata is a term that has been used and misused many times in the past. Webster defines "meta" as a more comprehensive term needed to "describe a new and related discipline designed to deal critically with the original one." Metadata consequently then describes a discipline that fosters the study of data about data.

The origin of metadata can be traced back to how we use measurement units. The purpose of a unit is to describe a property of an object. For example, the length (a physical property) of a stick (an object) is 5 ft (a measurement unit). This example uses, for one object, a data item (the number 5) and two metadata items (length and a measuring unit).

In the past, the metadata has often been treated as a second-class citizen. With the advent of computers and our incessant need for data, we have introduced techniques to store data permanently on secondary storage. These data can then be retrieved and used by application programs. File managers are used to store and retrieve data from the secondary storage. To accomplish their job, file managers use such metadata as field names and filenames. This use of metadata, along with the actual data, has now been extensively ingrained in database management technology.

* Tel.: +1-979-845-8370; fax: +1-979-845-5653.
E-mail address: Asen@cgsb.tamu.edu (A. Sen).

0167-9236/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0167-9236(02)00208-7
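The stick example can be made concrete in a few lines of code. The following sketch is our own illustration (the field names are invented for the example): it stores the data item together with the two metadata items that give it meaning.

```python
# A measurement pairs one data item with the metadata that describes it.
# Without "length" and "ft", the bare value 5 is meaningless.
measurement = {
    "object": "stick",
    "value": 5,                      # the data item
    "metadata": {
        "property": "length",        # what is being measured
        "unit": "ft",                # the measuring unit
    },
}

def describe(m):
    """Render the data item using its metadata."""
    return f'{m["metadata"]["property"]} of {m["object"]} is {m["value"]} {m["metadata"]["unit"]}'

print(describe(measurement))  # length of stick is 5 ft
```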

As a result, in the last 30 years, we have witnessed a tremendous growth in the use of metadata in developing information systems. The purpose of this paper is to study this field and see how it helps us in decision support. To do this, we first describe a 40-year chronological development of the metadata concept (see Section 2). Several management tools were also designed to manage metadata in the last 40 years. In Section 3, we categorize these developments. We argue that the most neglected area in metadata management is the notion of managing changes in metadata. Section 4 emphasizes the changes in metadata and describes a real-life case study where the changes are of utmost importance. To manage these changes, we propose a new management tool called metadata warehouse. Unlike other tools that focus on tool integration and data integration, this tool manages the changes in metadata for organizational decision support. We conclude the paper in Section 5.

2. Evolution of the metadata concept

To describe the evolution of the concept of metadata, we look at each decade starting with the 1960s.

2.1. The 1960s

Early work with files presumed that files were on tapes. Access was sequential and the cost of access grew in direct proportion to the size of the file. Simple indexes were used to speed up the access. However, as the indexes grew, they too became difficult to manage. For this reason, in the early 1960s, the idea of applying tree structures emerged as a potential solution. In the late 1960s, using the work on B trees and B+ trees, many commercial vendors created file systems that were faster and were not sequential. For all file systems, the format of the records is first determined. The field names of the records and their data types are essentially metadata used in the file.

During this period, the use of metadata in program development was also fairly discrete. For example, in attempting to reuse code, programming languages allow applications to include codes from their software libraries. We discuss these libraries in Section 3.

2.2. The 1970s

The 1970s could be called the decade that started the metadata phenomenon. With the advent of the database management system (DBMS), the use of metadata increased tremendously. For example, in relational DBMSs, metadata is extensively used to define data. These metadata include relation names, attribute names, key and domain information. The collection of metadata is used to define schemas and subschemas. In 1976, with the introduction of the entity relationship (ER) data model, and later with the advent of other semantic data models, higher level metadata were being used. As the semantic data models (like ER) did not have a supporting DBMS, a translational mechanism was deployed to translate the higher level (like ER) metadata to relational metadata.

2.3. The 1980s

With the success of database management systems in storing and retrieving business data (which are essentially flat), efforts were made to look into non-business data types. These data types came from diverse application areas such as Computer-Aided Design/Computer-Aided Manufacturing (CAD/CAM), Computer-Aided Software Engineering (CASE), Geographic Information Systems (GIS), document storage and retrieval, science and medicine. The notion of data, at this time, got replaced by a term called asset. An asset is a useful item that is a product or by-product of the application development process. An asset can be more tangible, like data (as before), designs and software code, or more intangible, such as knowledge and methodologies. A tangible asset can take the form of a large-grained component like a framework or a complete application; it can also be a fine-grained component like a subroutine, a class, or an encapsulated component. It can include patterns [6] and algorithms [6]. Examples of intangible assets include programming knowledge, programming plans, software system architectures, project plans, design documents, user documents and other relevant knowledge sources.

Just like data, assets need to be stored on disks to be reused. Assets are, however, much more complicated than simple business data and are difficult to store and access. Several new database paradigms were proposed at this time that deal with assets and

their metadata semantics. These include the Complex Object Model [3], the Nested Relation Data Model [4,20] and the Object-Oriented Data Model [1].

Accessing these assets was mainly done through object queries. An asset is stored as an object or a collection of objects. The queries use metadata that are usually class definitions and class hierarchies. Class definitions are very much like table definitions, with some exceptions. They include attribute definitions along with method definitions for the class. Three predominant relationships used in building class hierarchies are aggregation [3], part–whole [3] and generalization [3]. The aggregation relationship dealt with relationships among classes that are related. For example, a course class is related to a student class and an instructor class. Part–whole, on the other hand, dealt with composition. For example, a vehicle class is composed of a chassis class, an engine class and others. The generalization (also known as "is-a") relationship was used to show classification among classes. For example, motor-vehicle and airplane classes are subclasses of the vehicle class. The generalization relationship allows the subclasses to inherit the properties of the superclass.

As database management systems access data through indexing, the query to access an asset eventually finds all parts of the asset in the database and combines them to create the asset. However, it is sometimes difficult to store an asset as a collection of objects in a database. This is because the asset may be a package (like a code package or a document) that cannot be decomposed into objects. To help in this process, metadata was again utilized. A classification scheme based on metadata was proposed by Prieto-Diaz [18]. The idea of the classification scheme was to show relationships by collection, that is, to keep related classes more or less together according to the closeness of the relationship. He introduced the notion of the faceted classification scheme. A facet is an arranged group of descriptors. For example, in the Unix domain, a facet could be {by action} [18]. A facet takes on terms or values. The {by action} facet can take terms like get, put, update, append, check and others.

2.4. The 1990s

In the 1990s, we see three separate research paradigms emerge that were responsible for moving the metadata technology forward. We describe them one by one as follows.

2.4.1. Metadata in code reusability
The notion of code reusability in software development, which started in the 1980s with Japanese software factories [13], became very important in the 1990s. To facilitate code reusability, research on software development environments flourished. A software development environment (SDE) is a collection of software and hardware tools explicitly tailored to support the production of software systems in a particular application domain [22]. An SDE uses metadata in all of its operations. The objective of the use of metadata is to support the selection of the tools used in the SDE. Many kinds of SDEs are currently in practice. These environments are classified into three major groups [22]: Programming Environments (PE), CASE and Software Engineering Environments (SEE). A PE is an environment that is principally intended to support the process of programming, testing and debugging. A CASE is an environment that supports software specification and design. It can also be used with a programming environment. An SEE is intended to support the production of large, long-lifetime software systems, the maintenance costs of which typically exceed development costs, and which are produced by a team rather than individual programmers. REuse Based on Object-Oriented Techniques (REBOOT) by Morel and Faget [14] is an example of a Programming Environment tool that supports storage and retrieval of code components. It uses a facet-based classification scheme to create the metadata of the components. Four such facets are used: abstraction, operations, operates-on and dependencies. In systems like this, the metadata is stored in a database. If the code library gets updated, corresponding entries in the database are changed.

2.4.2. Metadata in asset repository
According to Bernstein [2], an asset repository (also called a repository) is a shared database of information about engineered artifacts, such as software, documents, maps and other things. In other words, a repository is a metadata manager. For example, a repository that supports software development and deployment tools could store metadata such as

database descriptions, form definitions, controls, documents, interface definitions, source code, help text and others. The objective of the use of metadata in a repository context is to emphasize the selection and integration of diverse tools that support the different kinds of data.

2.4.3. Metadata in data warehouse
Metadata took on a significant role in the 1990s due to the advent of the data warehouse concept. Kimball et al. (Ref. [8], p. 22) define metadata in the context of the data warehouse as "all of the information in the data warehouse environment that is not the actual data itself." The metadata in the data warehouse context are basically of two kinds: back room metadata and front room metadata. The back room metadata is process related and guides the extraction, cleaning and loading processes. Examples include specifications typically related to source data, such as source schemas, old formats for archived mainframe data, ownership description of the source, automated extract tool settings and others; data staging metadata such as data cleaning specifications, slowly changing dimension policies and others; data transform logs; and DBMS system table contents. The front room metadata is more descriptive, and it helps query tools and report-writers function smoothly. Examples include join specifications, network security user privilege profiles, usage and access maps, network security usage statistics and others.

The need for asset repositories became crucial in the late 1990s as software projects increasingly focussed on integrating and reusing codes, classes, components, patterns, frameworks and applications. The need is also fueled by the massive use of data warehouse techniques in the brick and mortar industry

Fig. 1. The time line for metadata management.



along with the web world to develop decision support using huge operational data.

2.5. 2000 and beyond

Fig. 1 illustrates the evolution of the metadata concept in chronological order. Since the 1960s, the concept of metadata has grown quite a bit. Starting from simple filenames, field names and field types, metadata in the 1970s described data definitions as modeled by various data models. In the 1980s, with the advent of object-oriented programming, the metadata started to include class definitions and class hierarchies (aggregation and generalization). A special kind of classification scheme called faceted classification was introduced in the 1980s for classifying assets. In the early 2000s, however, a major effort in dealing with metadata is the recognition of the need for a metadata standard. This standard was fueled by the fact that, unlike software development, a data warehouse project needs heterogeneous tool and data environments. For example, in the data warehouse world, data quality tools, data modeling tools, ETL tools and end-user tools are developed by different vendors with entirely different specifications. Integration using a metadata standard allows them to communicate with each other. The data in a data warehouse can also be of various types and formats. A standardization effort will also help in data integration.

3. Evolution of metadata management

Metadata, although it started as information to describe data or an asset, is now competing for equal attention with the asset or data it defines. With the phenomenal use of metadata in software development, in database management and in data warehouse design and implementation, it is important to look into the techniques used to manage the metadata.

To properly narrate the evolution of metadata managers, we start with a set of tools that implicitly manage data with metadata. From this stage, we describe how metadata management has progressively moved away from implicit management to co-management with asset and data, and then into explicit management with repository managers.

3.1. Stage I. Implicit metadata management

Even before the advent of the database management system, the software library manager was an early attempt to discretely use metadata to reuse codes. It is designed to co-manage data and metadata. It has been so successful that it still exists today. Most programming languages allow applications to include codes from their software libraries. A software library is a collection of programs that can be reused in building larger program modules. In such a programming environment, a language compiler converts a program into an object module. The linker, using underlying metadata, then combines all object modules that make up a program, including the object modules that are obtained from the library. Some example libraries are the C++ Class Library [15], the Standard C Library [16] and Microsoft's MFC Library.

Although the joint management of metadata and data is not complicated, libraries like these are fairly static. Once a library has been developed and released to the external world, its structure and interface cannot change without great difficulty for the user base [19]. Using a library is equivalent to using a programming language. Changes in a class library are often more difficult to deal with than changes in a programming language.

3.2. Stage II. Co-management of metadata

With the advent of the database management system, efforts were made to manage metadata more explicitly. This idea has been exploited to make library managers more dynamic, where users need a mechanism to insert, delete, modify and retrieve the content of the library easily.

To allow easy manipulation of the library, we use metadata along with the library. In this technique, the metadata of codes need to be captured and used along with the actual library. The metadata of a code contain data about the code. The idea is to store the metadata in a database. If the software library gets updated, corresponding entries in the database are changed.

Evidence of this approach can be seen in the software maintenance literature. Leiter et al. [9] describe a software tool that entails a relational database with an interactive interface that supports queries about programs written in object-oriented languages.
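The co-management idea can be sketched in a few lines. The following is our illustration, not the design of any tool cited here; the table layout and component names are assumptions. A library's code metadata sits in a relational database, and an update to the library changes the corresponding entries:

```python
import sqlite3

# A minimal metadata database for a software library: one row of
# metadata (name, kind, signature, version) per reusable component.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE code_metadata (
    name      TEXT PRIMARY KEY,   -- component name
    kind      TEXT,               -- 'function', 'class', ...
    signature TEXT,               -- interface description
    version   INTEGER             -- bumped when the library changes
)""")

def register(name, kind, signature):
    """Capture a component's metadata when it enters the library."""
    db.execute("INSERT INTO code_metadata VALUES (?, ?, ?, 1)",
               (name, kind, signature))

def update(name, signature):
    """The library got updated: change the corresponding entry."""
    db.execute("""UPDATE code_metadata
                  SET signature = ?, version = version + 1
                  WHERE name = ?""", (signature, name))

register("sort_records", "function", "sort_records(records) -> list")
update("sort_records", "sort_records(records, key=None) -> list")

row = db.execute("SELECT signature, version FROM code_metadata "
                 "WHERE name = 'sort_records'").fetchone()
print(row)  # ('sort_records(records, key=None) -> list', 2)
```

Retrieval then becomes an ordinary database query, which is what makes such a library manager dynamic.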

Linos and Courtois [10] describe a tool set that includes tools to detect a mixture of procedural and object-oriented program information from C++ code. Examples include files (both source and include files), data types, functions, constants, variables, parameters, classes, objects, etc. These are then used to populate a database. Different browsing tools are available to display metadata on the screen. IBM has an information retrieval tool called ReDiscovery, based on OS/2, that can be used to create and search databases of information about virtually any kind of database or file system.

3.3. Stage III. Explicit metadata management

In this stage, metadata and their management come of age. Instead of being a side issue, metadata is now looked upon as the glue that binds many enterprise resources such as applications, ERP activities, Internet technologies and data warehouses. The metadata management tool (also known as a repository) now becomes crucial.

The first attempt in this area was the introduction of the data dictionary. The data dictionary typically was very data focused. It provided a centralized repository of information about data such as meaning, relationships, origin, domain, usage and format [11,12]. The purpose of the data dictionary was to assist the DBAs in planning, controlling and evaluating the collection, storage and use of data.

To allow the library to hold other items that are not software, a second mechanism, called the repository (see Section 2), was introduced. According to Bernstein [2], a repository is a shared database of information about engineered artifacts, such as software, documents, maps and other things. For example, a repository that supports software development and deployment tools could store database descriptions, form definitions, controls, documents, interface definitions, source code, help text and others.

In the late 1980s, IBM offered to build a universal repository, a repository that would store the outputs of a wide variety of CASE tools and make the stored elements available to any tool that requested them. Unfortunately, IBM's repository was never delivered due to the software industry's shift from COBOL orientation to C/C++ orientation and also from mainframe to client–server [23]. In the late 1990s, however, this idea got picked up again by other vendors, who have now developed several repositories. These vendors include Microsoft (with Microsoft Repository), Unisys (with Universal Repository), Computer Associates (with Platinum Repository), ViaSoft (with Rochade Repository), Softlab (with Enabler Repository) and Oracle (with Oracle Repository).

To understand what constitutes a repository, let us describe the important features of any repository. The first feature of a repository is the types of metadata it manages. The types include database metadata, data model metadata, data movement metadata, business rules metadata, application component metadata, data access metadata and data warehouse related metadata. Other features of a repository include an information model that describes the core metadata types of the repository; a specification language that is the formal specification language for the tool; a language that supports tool interoperability across different products; a standard query language to query the metadata; and others. As an example, we provide the features of the Microsoft Repository Service (see Table 1). From this table, it is apparent that the repository architecture must include four layers. The top-most layer is usually the user access layer. The objectives of this layer are to provide support for browsing through the repository data in a client–server setup or in the web environment; to provide an interface with CASE tools; and to help component-based development (CBD) tools develop and manage components.

The second layer is called the common infrastructure support layer. The objective of this layer is to support common issues among assets such as integration of tools, cross-platform support, event management and others. The next lower layer is called the repository engine layer. This layer is responsible for creating and managing repository objects, their versions and configurations. It is designed and developed using an object-oriented paradigm. Finally, at the bottom level (called the data services layer), a repository has a data server and supports insertion, deletion, modification and retrieval of the repository data. The technology effectively manages the information assets of the enterprise, provides a metaview of the developmental process across all software development tools, uses a common data store for development tools and allows version control of any object.
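The repository engine layer's duty of creating and managing repository objects and their versions can be sketched as follows. This is a minimal illustration of ours under assumed semantics, not the internals of any product named in this section:

```python
# A toy repository engine: named objects, each with a full version history.
class Repository:
    def __init__(self):
        self._store = {}          # name -> list of versions (dicts)

    def put(self, name, properties):
        """Create a new version of a repository object."""
        self._store.setdefault(name, []).append(dict(properties))
        return len(self._store[name])      # version number, 1-based

    def get(self, name, version=None):
        """Retrieve a version of an object (latest by default)."""
        history = self._store[name]
        return history[-1 if version is None else version - 1]

repo = Repository()
repo.put("Customer_table", {"columns": ["id", "name"]})
repo.put("Customer_table", {"columns": ["id", "name", "email"]})

print(repo.get("Customer_table", 1)["columns"])  # ['id', 'name']
print(repo.get("Customer_table")["columns"])     # ['id', 'name', 'email']
```

A real engine would add configurations, relationships among objects and a persistence layer, but the version-history idea is the core.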

Table 1
Microsoft's Repository 2.1 (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/reposit/htm/reconthearchitectureofmicrosoftrepository.asp)

Date repository was introduced: Repository 1.0 was introduced with Microsoft's VB5.0 in 1997.

Major objectives of the repository:
• To provide a metaview of the development process across all development tools like VB, C++, etc.
• To study the impact of changes of any object.
• To provide a common data store for all development tools.

Types of metadata: Component definitions, development and deployment models, reusable software components, data warehouse descriptions, web pages, etc.

Information model that describes the core metadata types: Supports the Meta Data Coalition's Open Information Model (OIM). OIM is a set of metadata specifications to facilitate sharing and reuse between tools and systems. The OIM consists of over 200 types and 100 relationships, described in UML and organized in easy-to-use and easy-to-extend subject areas.

Specification language: Unified Modeling Language (UML)—an object-oriented language.

Repository architecture: Repository Services 2.1 has four layers:
• Top Layer—Tools and Application Layer. This layer supports integrated tools, metadata-driven applications and other utilities and browsers.
• Layer III—OIM, which supports shared metadata objects and structures, relationships among objects, etc.
• Layer II—Repository object manager, responsible for life cycle management; it maps objects to tables.
• Bottom layer—persistence service with the Jet database engine or SQL Server.

Environment particulars: Windows-based system.

Development kit: A systems development kit (SDK) is available.

Metadata browsing facility including web browsing: Yes.

Exchange metadata between multiple heterogeneous repositories: Supports the XML interchange format.

Object management service support: (a) Version Control, (b) Composite Object Service, (c) Collection Service (e.g. relationship service), etc.

Component-based framework support: Component Object Model (COM) based.

Query language support: SQL.
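Table 1 notes that the repository exchanges metadata with other repositories through an XML interchange format. As a rough sketch (ours; the element names are invented and are not the actual OIM interchange vocabulary), serializing repository metadata to XML is what lets two heterogeneous repositories share it:

```python
import xml.etree.ElementTree as ET

# Invented element names for illustration -- not the real OIM/XIF schema.
def export_metadata(objects):
    """Serialize repository objects to an XML interchange document."""
    root = ET.Element("repository")
    for obj in objects:
        el = ET.SubElement(root, "object",
                           name=obj["name"], type=obj["type"])
        for col in obj.get("columns", []):
            ET.SubElement(el, "column", name=col)
    return ET.tostring(root, encoding="unicode")

def import_metadata(xml_text):
    """Parse the interchange document back into plain records."""
    root = ET.fromstring(xml_text)
    return [{"name": el.get("name"),
             "type": el.get("type"),
             "columns": [c.get("name") for c in el.findall("column")]}
            for el in root.findall("object")]

catalog = [{"name": "Customer", "type": "table", "columns": ["id", "name"]}]
xml_text = export_metadata(catalog)
print(import_metadata(xml_text) == catalog)  # True
```

The receiving repository only needs to understand the agreed-upon interchange schema, not the sender's internal storage format.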

3.4. Stage IV. Metadata integration management

The focus of this stage is to manage the integration of diverse metadata in an application. There are two kinds of integration: tool integration and data integration. Tools typically generate tool-specific metadata. To establish communication among multiple tools, these metadata need to be integrated. Similarly, data can be of different formats and can come from different worlds. These data also need to be integrated to help decision support. Repository technology (of Stage III), which was originally created to manage metadata, can be used for this integration.

The repository technology described above supports metadata integration as long as the metadata follow a common model, like Microsoft's Component Object Model (COM) architecture. In the world of diverse metadata integration, such as is currently found in the data warehouse [17] domain, the tools do not have a common model. This is because different tools from different vendors with different specifications (or metadata) are brought into the project. Tools can be "best-of-breed," "most compatible," or "most economic." This is also true for data. Repository-based metadata management is not very effective in this kind of metadata integration.

Noticing this kind of integration problem, two standardization groups, the Meta Data Coalition (MDC) and the Object Management Group (OMG), started to work on a metadata standard in the 1990s. They called the metadata standard a metamodel. The MDC group, along with Microsoft, in the late 1990s developed and adopted a metadata standard called Open

Fig. 2. The data warehousing process for the TIPS project with changed statistics.

Information Model (OIM). The other group, OMG, created a model called the Common Data Warehouse Metamodel (CWM) [17] in early 2000. In June 2000, MDC members decided to join OMG and support one standard in metadata for integration.

4. Proposing metadata warehouse—the final frontier

In the earlier sections, we have seen how the concept of metadata has evolved and how tools to

Fig. 3. A partial load script used in TIP data warehouse.



Fig. 4. A partial transform script used in TIP data warehouse.



Fig. 4 (continued ).

manage them have matured from simple library managers to techniques that help integrate tool metadata. The integration is accomplished through a repository manager supporting a common model of metadata under a single environment, or with a diverse set of metadata that follow a metamodel, called CWM.

The study of CWM reveals that it not only supports integration of tool metadata, it also supports the integration of metadata created in a data warehouse project. The metadata in a data warehouse project are of two kinds—back-room types and front-room types. The back room metadata is process related and guides the extraction, cleaning and loading processes. Examples include specifications related to source data, such as source schemas, old formats for archived mainframe data, ownership description of the source, automated extract tool settings and others; data staging metadata such as data cleaning specifications, slowly changing dimension policies and others; data transform logs; and DBMS system table contents. The front room metadata is more descriptive, and it helps query tools and report-writers function smoothly. Examples include join specifications, network security user privilege profiles, usage and access maps, network security usage statistics and others. It is not only important to capture all these metadata, but we also need to worry about the changes that happen to them.

Unfortunately, at the present time, no single robust tool is available that completely supports the CWM. A few vendors, like Microsoft (with Metadata Services 2000) and Teradata (with MDS), have offered extended repository-based tools. These tools support repository features but fail to cover the decision support questions

on metadata changes. We propose a new management tool as the final stage of the evolution of metadata management. We call this last stage the metadata warehouse. We adopt this term from the data warehouse literature to emphasize the need for creating a warehouse full of metadata only. We need this warehouse to manage the different kinds of changes that happen in a regular data warehouse. These changes include organizational changes, technological changes, physical changes, process changes, end-user requirement changes and website changes [21].

The best way to demonstrate the need for a metadata warehouse is to look at a real-life data warehouse project. Once we describe the development of the data warehouse, we need to see if we can answer the following metadata questions: What kinds of metadata do we need to store? What are the changes in this metadata? How frequent are these changes? Where do they happen most?

4.1. Developing an academic data warehouse

To answer the above questions, we look at a working data warehouse. With the popularity of the data warehouse in industry, several academic institutions (like Arizona State University, the University of Wisconsin at Milwaukee and Texas A&M University) are currently developing data warehouses to meet their decision support needs. In this paper, we use one such data warehouse project to develop the concept of the metadata warehouse.

The goal of the Texas A&M University System's (TAMUS) Data Warehouse Project, called the TAMUS Information Portal (TIP), was to develop an automated information system. This system uses decision support analysis of financial, personnel and academic data for the Board of Regents and the management of the TAMUS and its member institutions and agencies [5]. The current on-line transaction processing systems are unable to meet the growing need for management information. Before the data warehouse, the ability to perform analysis and cross-functional analysis was a manual process. In this environment, satisfying information requests was problematical for several reasons:

• requests are often unpredictable, important and urgent, which means that key information technology resources have to be redirected, making other project work discontinuous and extending project completion dates;
• developing answers is laborious, because custom queries or report programs have to be analyzed, designed, developed and tested;
• answers can be inconsistent, because answers developed by different analysts—or even by the same analyst at different times—have the potential of producing results that do not agree with other or previous results.

The TAMUS, spurred by a state management control audit, embarked on a process to define its information needs and to develop a system to address those needs. The resulting data warehouse project was defined and managed by business users and supported by an NCR Teradata system.

The primary objective of the project was to develop an enterprise data warehouse environment to:

• provide the ability to access historical information currently stored in existing systems;
• provide interactive and ad hoc data exploration;
• provide information on a secure intranet website;
• provide access to information for ad hoc query, analysis and report writing;
• serve as a credible source of information;
• support the decision-making process without hindering the performance of the operational systems.

Fig. 2 describes the data warehousing process used in the TIP project. The project was successfully completed in the beginning of 2001 and has realized some initial benefits. They are as follows:

• integrating disparate administrative system data into a common system for all management reporting;
• providing a "single version of the truth";
• serving as a source system for data mart projects undertaken by various members;
• providing end-user capabilities to users such as "drill down" analysis of electronic reports; reports that can include graphs and/or tables; reports that are easily modified; reports that can be requested anytime by anyone (with appropriate security and access privileges); reports that can be automatically
A. Sen / Decision Support Systems 37 (2004) 151–173 163

run when the data warehouse is updated and "pushed" to the users.

4.2. Details of metadata in the TIP warehouse

Various types of metadata are at play in the TIP data warehouse. We start with the data sets that are currently sourced from the Texas Higher Education Coordinating Board. There are nine such flat files. Each file contains summarized data elements that are not related to data elements in the other files. For example, the Student-Report data source (cbm001) has 34 attributes, such as record code, institution code, student identification number, gender, classification and others. In the ETL process, we have Teradata load scripts and transform scripts. Figs. 3 and 4 show a partial picture of these scripts.

The logical and physical diagrams also form a source of metadata. In TIP, these diagrams were created in the Erwin development tool and can be reproduced in XML. Fig. 5 shows a partial XML description of the physical ERD. The TIP data warehouse runs on the Teradata DBMS, which provides another set of metadata called target data table metadata (see Fig. 6 for a sample DDL). The views are also a source of metadata (see Fig. 7 for a sample user view DDL). Finally, the end-user application uses the Business Objects end-user tool. A sample of this metadata, called a Business Objects Universe, is shown in Fig. 8.

4.3. Changes in the TIP metadata

As business needs or conditions change over time, a data warehouse must be responsive, continuously evaluating the effectiveness of the changes. Change management spans all components of a data warehouse and thus plays a vital role in its ongoing development and overall success. To manage the data warehouse, it is necessary to balance two conflicting environment goals: maximizing the use of the data warehouse asset while consistently achieving user expectations by continuously monitoring the effect of business changes.

To study the need for a metadata warehouse, we focus on the academic data of TIP. Changes happen in two ways: push and pull. In push-oriented changes, the changes happen at the data source level. Changes from these sources can occur each reporting period, typically a semester, and are the primary reason the academic subject area of the TIP is the most volatile. Because this data source is outside the control of TAMUS, changes are pushed to the TIP with limited opportunity for feedback and thus are implemented after the fact.

On the other hand, in pull-oriented changes, the changes originate at the end-user level. Any changes in the end-user requirements will pull changes into the different metadata of the TIP data warehouse.

A sample of these changes is shown in Fig. 9. These changes can happen at any level of the schema (a metadata). For example, they can happen at the domain level, at the attribute level or at the entity-relationship level. Therefore, having a change management tool will greatly improve the identification of the effect of the changes and aid in automating the implementation of the changes throughout the process.

4.4. Why do we need a metadata warehouse for the TIP project?

Just as the data warehouse project was initiated in response to business decision support needs, so must the development of a metadata warehouse be driven by business needs. Some of these business needs, related to the changes in TIP metadata, are described below:

• Tracking the life cycle for each data element. The ability to provide life cycle information for each data element, from data source to end-user representation, such as source field name, ETL processing, target table definition and representation to the end-user (including transformations and derived columns).
• Analyzing the effect of changes in TIP. This change information can be used to analyze the impact of a change. Knowing the source field name aids in the analysis of the effect of a change to the TIP whenever changes to the source system occur. In addition, knowing the ETL processing can help determine the impact of changing a source field. We provide two examples to describe the effects of changes:
  • Change of a two-character field to a four-character field, as was the case with the "year" attribute in the cbm001 data source in September 1999, only affected the load elements of the TIP (load scripts, staging tables and transform
Fig. 5. A partial XML description of physical ERD in TIP data warehouse.

scripts) as the target table column was already defined as an integer data type.
  • Deletion of the "Nursing Program Acceptance" data element in the cbm001 data source caused two major changes. It affected the load scripts and the transform scripts. However, the target table schema and view definition remain unchanged, because the historical values need to be maintained. This change is then classified as having an average impact.
Fig. 6. A partial target table DDL in TIP data warehouse.

• Merging of current data with historical data. As source data elements slowly change over time, there is a need to merge the changing data definitions with the existing historical data definitions in a meaningful way for the user. This merge should include strategies for handling null values, because new attributes will not have values in the historical data. For example, the "Appointment 0.1%—Salary" data element in the cbm008 data source was deleted and replaced with other salary data elements at a completely different level; that is, instead of relating salary to the faculty appointment, salary became related to the funding source (e.g. State Appropriations, Auxiliary Enterprises, Restricted, etc.). In this case, because there was no basis for comparing the existing historical salary data with the new salary definition, columns containing the new data were added while the existing salary columns were retained to facilitate historical comparisons.
• Identifying and archiving obsolete data elements. Some data elements become obsolete and do not provide any business value after a given period of time. These elements need to be identified and archived to historical tables. The archived obsolete elements can then be used for comparative historical trend analysis. For example, the "Nursing Program Acceptance" data element in the cbm001 data source is no longer being used in analysis and is eligible for archival.
Fig. 7. A partial user view DDL in TIP data warehouse.

• Identifying required data elements to be used in analysis of new data sources. As new data sources are incorporated into the TIP, the data warehouse needs some ability to analyze gaps between the required data elements and the data elements provided by the new data source. The gaps can occur at the source (the new data source) or at the target (the TIP). Identification of the gaps can be used to drive changes to meet necessary business requirements. This is especially true for areas of the TIP where there are foreign key relationships and required attributes, such as the account data in the financial subject area. Incorporating additional data sources into the existing data model has uncovered gaps both in the new data sources (data elements that are not stored in the source systems) and in the TIP (data elements not originally modeled).

4.5. Metadata warehouse architecture

Using Inmon's data warehouse definition [7], we define a metadata warehouse as a subject-oriented, integrated, time-variant and somewhat volatile collection of metadata in support of the change management process. Like a repository, a metadata warehouse stores
the warehouse metadata in a database, needs to have a common information model and provides version control. However, unlike repositories, the main focus of a metadata warehouse is to manage the changes that continuously happen in the metadata. Typically, these changes ripple throughout the entire metadata base.

Fig. 8. A partial business object universe in TIP data warehouse.

From the earlier case study, we observe two things. First, the size of the metadata used by the data warehouse is not small. Currently in the TIP project, the academic part of the data warehouse is 331 MB, while its metadata is about 1 MB. This is very typical of data warehouses. Second, the metadata in the TIP project can grow as it goes through a lot of changes. This kind of volatility is somewhat uncharacteristic of a regular data warehouse. These changes are important and need to be managed. This section discusses three possible architectures for a metadata warehouse. The first type uses existing data warehouse techniques, the second type extends the repository technique and the third type combines the two. We now describe these architectures with their advantages and disadvantages.
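The ripple effect described in the case study can be made concrete by modeling the lineage of a data element (source field, load script, staging table, transform script, target column, end-user objects) as a small directed graph and walking it to find everything a change can touch. This is only an illustrative sketch: the element names below are hypothetical, not the actual TIP metadata.

```python
from collections import deque

# Downstream lineage edges: each metadata element maps to the
# elements derived from it (hypothetical names, TIP-like layers).
LINEAGE = {
    "cbm001.year": ["load_script_01"],
    "load_script_01": ["staging.year"],
    "staging.year": ["transform_script_01"],
    "transform_script_01": ["target.academic.year"],
    "target.academic.year": ["view.enrollment", "bo_universe.year"],
    "view.enrollment": [],
    "bo_universe.year": [],
}

def impacted(element):
    """Breadth-first walk of the lineage graph, returning every
    metadata element that a change to `element` can ripple into."""
    seen, queue = set(), deque([element])
    while queue:
        for nxt in LINEAGE.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For a change like the 1999 widening of the "year" field, such a walk flags every downstream load, staging, transform and end-user element for inspection, even though, as the case study notes, only the load elements ultimately required modification.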

Fig. 9. A partial list of changes in TIP data warehouse metadata.
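Fig. 9 lists such changes informally. As an illustrative sketch (the field names and impact labels are assumptions, not the actual TIP change log format), each entry could be captured as a small structured record so that the change history can be filtered by schema level or impact:

```python
from dataclasses import dataclass

@dataclass
class MetadataChange:
    element: str      # e.g. a source attribute or target column
    level: str        # "domain", "attribute" or "entity-relationship"
    description: str
    impact: str       # e.g. "low", "average", "high"

# Two entries modeled on the changes discussed in the case study.
log = [
    MetadataChange("cbm001.year", "domain",
                   "widened from two to four characters", "low"),
    MetadataChange("cbm001.nursing_acceptance", "attribute",
                   "deleted from the data source", "average"),
]

# Filter the change log the way a change manager would.
attribute_changes = [c for c in log if c.level == "attribute"]
```

Filtering and aggregating such a log by schema level or impact classification is exactly the kind of change-management query the proposed metadata warehouse is meant to support.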

4.5.1. Type 1 approach (using data warehouse architecture)

As in a regular data warehouse, the metadata are gathered from various metadata sources. The metadata sources include descriptions of traditional data sources, load scripts, logical and physical data models, staging table definitions, data warehouse table definitions, view definitions and definitions used in end-user applications. The metadata come in different formats and are extracted from these sources by using a specialized extract, transform and load (ETL) tool. They are then temporarily stored in an area called the metadata staging area, which is in effect the construction site of the metadata warehouse. Flat files or relations are typically used in the metadata staging area. After the extraction is done, the metadata sets need to be cleaned, standardized, de-duplicated, integrated and transformed by using the ETL tool. Once we are
satisfied with the quality of the metadata, we again use the ETL tool to load the metadata warehouse (see Fig. 10). Currently, not much is available in the metadata ETL tools domain.

The metadata warehouse needs to be located on a metadata warehouse server. A metadata warehouse server uses a relational database management system (like NCR's Teradata, IBM's DB2, Microsoft's SQL Server or Oracle's Oracle 9i). The logical and physical models for the metadata warehouse must support the metadata descriptions and their changes. The models are translated to third normal form (3NF) relations before being stored in the DBMS. The models can be designed from scratch or can use some generic template. These relations are used with specialized query-processing and indexing capabilities that support decision support activities, and reside on very fast processors. The metadata warehouse typically spans the entire enterprise. We feel that managing version control can become a serious problem in this kind of metadata warehouse.

Finally, a metadata warehouse provides end-user facilities for various types of metadata access and reporting. These tools need to allow decision support queries with respect to change management, provide the mechanisms to manage metadata changes and, finally, allow the metadata to be evaluated. The end-user tools for the metadata warehouse are also not well developed. Most vendors, like NCR and others, provide simplistic browsing facilities through the metadata.
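As a sketch of what such normalized metadata relations and a change-management decision support query might look like, the following uses SQLite; the table names, columns and sample rows are assumptions for illustration, not the actual TIP or Teradata schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per metadata element, normalized to 3NF.
    CREATE TABLE element (id INTEGER PRIMARY KEY, name TEXT, layer TEXT);
    -- One row per recorded change to an element.
    CREATE TABLE element_change (
        element_id INTEGER REFERENCES element(id),
        changed_on TEXT,
        description TEXT);
""")
conn.executemany("INSERT INTO element VALUES (?, ?, ?)", [
    (1, "year", "source"), (2, "year", "target"), (3, "enrollment", "view")])
conn.executemany("INSERT INTO element_change VALUES (?, ?, ?)", [
    (1, "1999-09-01", "widened to four characters"),
    (1, "2000-01-15", "recode of values"),
    (3, "2000-06-30", "new drill-down added")])

# A change-management decision support query: how many changes per layer?
rows = conn.execute("""
    SELECT e.layer, COUNT(*) AS n
    FROM element e JOIN element_change c ON c.element_id = e.id
    GROUP BY e.layer ORDER BY n DESC
""").fetchall()
```

Even this toy schema supports the kind of question raised above (where do the changes happen most?), which simple metadata browsers do not answer.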

Fig. 10. The metadata warehouse architecture (Type 1 approach).



4.5.2. Type 2 approach (using repository-based architecture)

The advantage of using a repository-based architecture is that it makes use of some of the repository's salient features, such as version control (to manage changes), an inherent information model with which to design the metadata warehouse model, cross-platform support (i.e. the integration of tool metadata), event management and others. The architecture (see Fig. 11) is very similar to that of Fig. 10, but uses a repository engine as the main tool. The metadata sources use specialized ETL-like tools to bring the metadata into the staging or holding area. The staging area is very much like the one we see in a warehouse: it again uses temporary storage and does not use any repository infrastructure. The metadata warehouse is designed using an information model and then implemented on top of the repository tables. The end-user tools in this case use standard repository tools to provide repository-styled decision support capabilities.

4.5.3. Type 3 approach (using integrated architecture)

Both techniques described above have their advantages and disadvantages. We propose here an integration framework that takes the best of both worlds. We take the ETL tools, the concept of data staging and some end-user tool ideas from the Type 1 architecture. We then connect these with the benefits of the Type 2 architecture to get the use of the repository engine, the version control mechanism and the integration framework for tool metadata that is inherent in the repository world. The

Fig. 11. The metadata warehouse architecture (Type 2 approach).



Fig. 12. The metadata warehouse integrated architecture (Type 3 approach).

emphasis on CWM can also be achieved through repository technology. The result of this kind of integration can be seen in Fig. 12.

5. Conclusion and future research

While the TIP data warehouse is a specific data warehouse, the need for a metadata warehouse is universal. We see its need in web data warehouses also. In fact, changes are much more prevalent in the web world than in the regular data warehouse world [21]. The current research provides a window to show how metadata has come of age and now demands its own warehouse. Several conclusions and directions for future research can be drawn from the metadata warehouse proposal for the TIP project.

First, the major challenge in the metadata warehouse is to develop a design methodology that can capture the metadata scattered across data warehouse projects and put it in the metadata warehouse. It is unclear at this time how that can be done. Will the design methodology follow an existing data warehouse design technique, or will it be a new one?

Second, the design and development of ETL tools for the metadata warehouse need to be done. It is
somewhat unusual to make use of the current ETL tools for the metadata. We feel that the metadata is quite different from regular data and needs to be treated differently.

Third, end-user tools need to be developed for the metadata warehouse. As one can see in Fig. 12, end-user activities for a metadata warehouse span from version control (a repository activity) to change management decision support (a warehouse-type activity). We may need to drill down or drill up the data for end-users, very much like we do in a regular data warehouse. Current repository-based tools (see Fig. 11) do not offer these choices.

Fourth, it is obvious that the Type 3 architecture is the best of both worlds. Unfortunately, none of the current data warehouse and repository vendors implement this architecture.

Fifth, as indicated before, the metadata warehouse becomes a big problem in the web world. The reason is the incessant changes that happen in a website, from site structure changes and page content changes to click-record log format changes and others. To manage all these changes, the metadata warehouse seems to be the answer.

Finally, the question of integrating the metadata warehouse with the existing data warehouse is very important. As the metadata warehouse is a new phenomenon, we need to figure out how to incorporate it with the existing data warehouse. Will the metadata warehouse be kept apart from the data warehouse, or will it be integrated with it?

Acknowledgements

The research was partially funded by a grant from Teradata, a division of NCR. The author acknowledges the help of Debbie Doran from the Texas A&M University System, and David Riegel and Mary Gros from Teradata.

References

[1] F. Bancilhon, C. Delobel, P. Kanellakis, Building an Object-Oriented Database System, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[2] P.A. Bernstein, Repositories and object-oriented databases, SIGMOD Record 27 (1) (1998 March) 88–96.
[3] C.J. Date, An Introduction to Database Systems, Addison-Wesley Publishing, New York, 1995.
[4] A. Deshpande, D. Van Gucht, An implementation for nested relational databases, Tech Report Number 234, Department of Computer Science, Indiana University, February 1988.
[5] D. Doran, TIP management change repository, An Internal Report, Texas A&M University System, 2001.
[6] E. Gamma, R. Helm, R. Johnson, J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley Publishing, New York, 1995.
[7] W.H. Inmon, Building the Data Warehouse, Wiley, New York, 1992 and 1996.
[8] R. Kimball, L. Reeves, M. Ross, W. Thornthwaite, The Data Warehouse Lifecycle Toolkit, Wiley, New York, 1998.
[9] M. Lejter, S. Meyers, S.P. Reiss, Support for maintaining object-oriented programs, IEEE Transactions on Software Engineering 18 (12) (1992) 1045–1052.
[10] P.K. Linos, V. Courtois, A toolset for maintaining hybrid C++ programs, Software Maintenance: Research and Practice 8 (1996) 389–419.
[11] D. Marco, Building and Managing the Meta Data Repository: A Full Life Cycle Guide, Wiley, New York, 2000.
[12] D. Marco, Meta data repositories: where we've been and where we're going, DM Review, February 2002. http://www.dmreview.com/master.cfm?NavID=198&EdID=4612.
[13] Y. Matsumoto, A software factory: an overall approach to software production, in: P. Freeman (Ed.), Software Reusability, Computer Society Press, Los Alamitos, CA, 1987, pp. 155–178.
[14] J.-M. Morel, J. Faget, The REBOOT environment, Proceedings of Advances in Software Reuse, Lucca, Italy, March 24–26, IEEE Computer Society Press, Los Alamitos, CA, 1993, pp. 80–88.
[15] M. Pace, Evaluation and Comparison of C++ Class Libraries, http://www.desy.de/user/projects/C++/Projects.html.
[16] P.J. Plauger, The Standard C Library, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[17] J. Poole, D. Chang, D. Tolbert, D. Mellor, Common Warehouse Metamodel, Wiley, New York, 2002.
[18] R. Prieto-Diaz, A software classification scheme, PhD Thesis, University of California, Irvine, 1985.
[19] D. Reed, Tools for software reuse, Object Magazine (1995 February) 63–67.
[20] H.-J. Schek, M.H. Scholl, The relational model with relation-valued attributes, Information Systems 11 (4) (1986).
[21] A. Sen, Metadata warehousing: a methodological support to data warehouse change management, Working Paper, Texas A&M University, 2002.
[22] I. Sommerville, Software Engineering, Addison-Wesley Publishing, New York, 1992.
[23] K. Watterson, More than databases: repositories should hold the corporate IS jewels. Why don't they? Byte Magazine (1998 May) 1–9.
Arun Sen is a full professor and Mays Fellow in the Department of Information and Operations Management at Texas A&M University. Before joining Texas A&M University in 1986, he was an assistant and then a tenured associate professor in the Department of Management Science, University of South Carolina. He holds an MTech in Electronics (from Calcutta University, India, 1971), an MS in Computer Science (from Penn State University, 1976) and a PhD in Information Systems (from Penn State University, 1979).

He has published 42 research papers in journals such as MIS Quarterly, Information Systems Research, IEEE Transactions on Systems, Man and Cybernetics, IEEE Transactions on Software Engineering, IEEE Transactions on Engineering Management, Decision Sciences, Communications of the ACM, Information Systems, Computers and OR, Omega, European Journal of Operational Research, Decision Support Systems, Journal of MIS, Information and Management and others. His research interests include decision support systems, database management, repository management and software reuse, case-based reasoning, technical and behavioral aspects of data warehouses and e-commerce.

He was an associate editor of the Journal of Database Management and a special issue editor for Decision Support Systems, Communications of the ACM, and Database and Expert Systems with Applications. He was the chair of the INFORMS College on Information Systems, a program chair for the 1996 Workshop on Information Technology and Systems (WITS) Conference and a track chair (Decision Support Systems and AI track) for the 1996 National DSI Conference.
