
Chapter 21.

Metadata Management
In addition to managing data, DBAs need to be able to manage and control the definition of the data elements used in databases. Without an understanding of the structure, limitations, definition, and description of data, it is likely that data will be misinterpreted or misused. Furthermore, data that is not well defined can cause database integrity problems.

What Is Metadata?
Have you ever watched the "Antiques Roadshow" program on television? In this show, people bring items to professional antiques dealers to have them examined and evaluated. The participants hope to learn that their items are long-lost treasures of immense value. The antiques dealers always spend a lot of time talking to the owners about their items. They always ask questions like "Where did you get this item?" and "What can you tell me about its history?" Why? Because these details provide knowledge about the authenticity and nature of the item. The dealer also carefully examines the item, looking for markings and dates that provide clues to the item's origin.

Users of data must be able to put it into context before the data becomes useful as information. Information about data is referred to as metadata. The simplest definition of metadata is "data about data." To be a bit more precise, metadata describes data, providing information like type, length, textual description, and other characteristics. For example, metadata allows the user to know that the customer number is a five-digit numeric field, whereas the data itself might be 56789.

So, using our "Antiques Roadshow" example, the item being evaluated is the "data." The answers to the antiques dealer's questions and the markings on the item are the "metadata." Value is assigned to an item only after the metadata about that item is discovered and evaluated.

Metadata characterizes data. It is used to provide documentation such that data can be understood and more readily consumed by your organization. Metadata answers the who, what, when, where, why, and how questions for users of the data.
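The customer number example above can be made concrete with a small sketch. The FieldMetadata class, its attribute names, and the conforms check are hypothetical and invented purely for illustration; only the "five-digit numeric field" and the value 56789 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMetadata:
    """Hypothetical metadata record describing one data element."""
    name: str
    data_type: str    # for example "numeric"
    length: int       # number of characters or digits
    description: str

    def conforms(self, value: str) -> bool:
        """Check a raw value against the type and length recorded here."""
        if self.data_type == "numeric":
            return value.isascii() and value.isdigit() and len(value) == self.length
        return len(value) <= self.length


# The metadata: what the text says about the customer number.
customer_number = FieldMetadata(
    name="customer_number",
    data_type="numeric",
    length=5,
    description="Identifier assigned to each customer (assumed wording)",
)

# The data: a bare value that only becomes meaningful next to its metadata.
print(customer_number.conforms("56789"))
print(customer_number.conforms("5678X"))
```

The value "56789" on its own says nothing; paired with the record it can be validated, documented, and explained to a user.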

From Data to Knowledge and Beyond

The basic building block of knowledge is data. Data is a fact represented as an item or event out of context and with no relation to other facts. Examples of data are 27, JAN, and 010110. Without additional details, we know nothing about any of these three pieces of data. Consider the following:

• Is 27 a number in base ten, or is it in octal (which would translate to 23 in base ten)?
• If 27 is a number in base ten, what does it represent? Is it an age, a dollar amount, an IQ, a shoe size, or something else entirely?
• What does JAN represent? Is it a woman's name (or a man's name)? Or does it represent the first month of the year? Or perhaps it is something else entirely?
• Finally, what about 010110? Is it a binary number? Or is it a representation of a date, perhaps January 1, 1910? January 1, 2010? Or something else entirely?
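A few lines of Python make the ambiguity tangible. This is only a sketch: the variable names are illustrative, and the MMDDYY date layout is just one assumed reading of 010110 among several possible ones.

```python
from datetime import datetime

raw = "27"
as_decimal = int(raw, 10)   # read as base ten
as_octal = int(raw, 8)      # read as octal: 2 * 8 + 7

bits = "010110"
as_binary = int(bits, 2)    # read as a binary number
# Read as a date, ASSUMING a month-day-year (MMDDYY) layout.
as_date = datetime.strptime(bits, "%m%d%y").date()

print(as_decimal, as_octal)   # 27 23
print(as_binary)              # 22
print(as_date.isoformat())    # 2010-01-01
```

Note that the date parser silently chooses a century for the two-digit year (2010 rather than 1910). That choice is itself a piece of metadata baked into the library, not something the six characters can tell us.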

Because of the lack of context, these are all examples of data. Information, on the other hand, adds context by specifying relationships between data, and possibly other information. Data in context with metadata makes information. The relationships may represent information, yet the relations do not actually constitute information until they are understood. In addition, the relationships that represent data have a tendency to be limited in context, mostly about the past or present, with little if any implication for the future.

Webster's New Collegiate Dictionary defines knowledge as "the fact or condition of knowing something with familiarity gained through experience or association." Knowledge adds understanding and retention to information. It is the next natural progression after information. To have "knowledge" requires information in conjunction with patterns between data, information, and other knowledge. Therefore, knowledge couples information with understanding and cognition.

The final step would be to move from knowledge to wisdom. Wisdom can be thought of as applied knowledge. You may have the knowledge that fatty foods are bad for you, but if you eat them anyway, you are not wise.

In order for data to be anything more than simply data, metadata is required. Without metadata, data has no identifiable meaning; it is merely a collection of digits, characters, or bits. Metadata gives data its form and makes it usable by information professionals.

Metadata Strategy
A wise organization will develop a metadata strategy to collect, manage, and provide a vehicle for accessing metadata. A sound metadata strategy should address the following:

• A policy for how metadata is used in the organization
• Procedures for identifying and defining data ownership and stewardship
• Identification of the types of metadata that need to be collected
• A description of the purpose for each type of metadata that is identified: a clear and concise reason why each piece of metadata is required by the organization
• Methods for the collection and storage of metadata (typically using a repository)
• Methods for accessing the metadata
• Policies to enforce data stewardship procedures and security for metadata access
• Identification of metadata sources, both internal and external
• Measurements to gauge the quality and usability of metadata

Metadata publicizes and supports the data your organization produces and maintains. By assembling and managing metadata, your organization will have access to relevant facts about your data, making your systems more usable and your databases more useful. DBAs should participate in the team that develops the metadata strategy, but the data administration organization, if one exists, should be the leader of the metadata effort.

Data Stewardship
A data steward is accountable for actions taken using a defined set of data. A data stewardship policy will identify individuals within the organization whose responsibility it is to create, maintain, and delete data. A data steward is not necessarily the data owner. A comprehensive data stewardship policy will also define the consumers of the data, that is, those who directly use the data during the course of their jobs.

Data Warehousing and Metadata

Companies implementing data warehousing systems are more likely than other companies to have embarked on a metadata management strategy. Users require accurate information about the data contained in a warehouse before the data can be used appropriately for business. Therefore, such businesses have a critical need for readily available high-quality metadata. Frequently, though, little if any metadata is captured and managed prior to the onset of a data warehousing effort.

Types of Metadata
Even though all metadata describes data, there are many different types and sources of metadata. At a fundamental level, though, all metadata is one of two types: technology metadata or business metadata. Technology metadata describes the technical aspects of the data as it relates to storing and managing the data in computerized systems. Business metadata, on the other hand, describes aspects of how the data is used by the business, and is needed for the data to have value to the organization.

Knowing, for instance, that the LICNO column is a positive integer between 1 and 9,999,999 is an example of technology metadata. Of course, the business user also requires this information. Knowing that a number referred to as a LICNO is the practitioner license number for certified course instructors, that it must be unique, and that every instructor can have one and only one license number is an example of business metadata. (These details are also useful to the DBA, though, in order to create the database appropriately and effectively.)

For DBAs, the DBMS itself is a good source of metadata. The system catalog used to store information about database objects is a vital store of DBA metadata, that is, technology metadata. DBAs and developers make regular use of the metadata in the DBMS system catalog to help them better understand database objects and the data contained therein. Depending on the DBMS, the user can write queries against the system catalog tables or views, or execute system-provided stored procedures to return metadata from the system catalog tables.

Just about any type of descriptive information about the composition of the data may be found in the system catalog. For example, most DBMSs store all of the following metadata in the system catalog:

• Names of every database, table, column, index, view, relationship, stored procedure, trigger, and so on
• Primary key for each table and any foreign keys that refer back to that primary key
• Which tables are in which views
• Data type, length, and constraints for each column of every table
• Names of the physical files used to store database data, as well as information about file storage, extents, and disk volumes
• Authorization and security information detailing which users have what type of authority on which database objects
• Date and time of the last database definition change, as well as the ID of the user who implemented the DDL for the change
• Database organization information
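As a rough illustration of retrieving a few of the items listed above, the following sketch queries the catalog of SQLite, chosen only because it ships with Python. Other DBMSs expose different catalog tables, views, or stored procedures. The instructor and course tables are hypothetical; only the LICNO column and its 1 to 9,999,999 range come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (
        licno INTEGER PRIMARY KEY CHECK (licno BETWEEN 1 AND 9999999),
        name  TEXT NOT NULL
    );
    CREATE TABLE course (
        course_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        licno     INTEGER NOT NULL REFERENCES instructor(licno)
    );
    CREATE INDEX course_licno_ix ON course(licno);
""")

# Names of database objects: SQLite records them in the sqlite_master table.
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# Column metadata. Each row is (cid, name, type, notnull, dflt_value, pk).
columns = conn.execute("PRAGMA table_info(course)").fetchall()

# Foreign keys. Each row is
# (id, seq, table, from, to, on_update, on_delete, match).
fks = conn.execute("PRAGMA foreign_key_list(course)").fetchall()

for obj_type, name in objects:
    print(obj_type, name)
for _cid, name, col_type, notnull, _default, pk in columns:
    print(name, col_type, "NOT NULL" if notnull else "", "PK" if pk else "")
for fk in fks:
    print(f"course.{fk[3]} references {fk[2]}.{fk[4]}")
```

None of this metadata was entered by hand; the DBMS recorded it as a side effect of the CREATE statements.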

The DBMS system catalog is a particularly effective source of metadata because it is active, integrated, and nonsubvertible. The system catalog is active because the metadata is automatically built and maintained as database objects are created and modified. As the DBA creates databases, the DBMS automatically collects and populates metadata in the system catalog. The integration of the system catalog and the DBMS, coupled with the active nature of the system catalog, keeps the technology metadata in the system catalog accurate and up-to-date.

Additionally, the DBMS system catalog is nonsubvertible, meaning that normal DBMS operations are the only mechanism for populating the system catalog. Of course, the subvertibility of the system catalog will differ from DBMS to DBMS. Some DBMSs provide options to enable direct updates to the system catalog, but such an option is to be used only in emergencies and generally under the direction of the DBMS vendor's technical support personnel.

Although a wealth of metadata can be found in the system catalog, this DBMS metadata is usually insufficient to fully describe data. For example, descriptions of database objects are not commonly found in the DBMS system catalog. Some DBMSs provide system catalog description columns that can be populated at the DBA's discretion. However, many DBAs avoid this for fear of disorganizing the system catalog. It's also possible that descriptions for the database objects were not available when the objects were created. Additional metadata that is useful, but not found in the system catalog, includes

• Metadata for nondatabase files (flat or sequential files)
• Modification information regarding when and by whom data in the database was last changed
• Copybook information for the database table (or nondatabase file), as well as which programs use that information
• Information on batch jobs and transactions that access the data
• Operational metadata on IT infrastructure components
• Data model metadata describing the logical database design and how it maps to the physical database implementation
• Data warehousing and ETL metadata defining data source(s), system of record, and other analytical information
• Data ownership and stewardship metadata

Of course, this is an incomplete list. A myriad of different metadata types and purposes exists that can be cataloged and managed. Capturing and maintaining metadata better documents databases and systems, thereby making them easier to use. The more metadata that you make available to business users, the more value they will be able to extract from their information systems.

Repositories and Data Dictionaries

A repository stores information about an organization's data assets. In other words, repositories are used to store metadata. A properly implemented repository stores all pertinent metadata for the corporation. It can act as a single, centralized mechanism to assist in the migration of data from the multiple sources to a data warehouse. In choosing a repository, base your decision on the metadata storage and retrieval needs of your entire organization, not just the databases you wish to support. Typically, a repository can

• Store information about your data, processes, and environment.
• Support multiple ways of looking at the same data. An example of this concept is the three-schema approach, in which data is viewed at the conceptual, logical, and physical levels.
• Store in-depth documentation, and produce detail and management reports from that documentation.
• Support data model creation and administration. Integration with popular ETL, data modeling, and CASE tools is also an important evaluation criterion.
• Support versioning and change control. Versioning helps to synchronize application development, eliminating rework and increasing flexibility.
• Enforce naming conventions.
• Parse and extract metadata from multiple sources. For example, if your site is a big COBOL shop, the repository vendor should offer tools that automatically examine your COBOL source code to extract metadata.
• Generate copybooks from data element definitions.

These are some of the more common functions of a repository. When choosing a repository for database development, the following features generally are desirable.

• The repository can keep its data stores in database tables in your DBMS. This enables your applications to directly read the data dictionary tables. For example, if you are primarily an Oracle shop, you should favor using a repository that stores its metadata in Oracle tables. Some repository products support multiple DBMSs and allow the user to choose the DBMS to be used.
• The repository should be capable of directly reading the system catalog, or views on the system catalog, for each DBMS you use. This ensures that the repository will have current information on database objects.
• If the repository does not directly read the system catalog, an interface should be provided to simplify the task of populating the repository using the system catalog information.
• The repository provides an interface to any modeling and design tools used for the generation of database objects.
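To illustrate what "directly reading the system catalog" might look like, here is a minimal sketch that harvests column definitions from a SQLite catalog and stores them in a repository table, where a steward can then add the descriptions the catalog lacks. The harvest_columns function, the data_element table, and its columns are hypothetical and do not reflect the design of any particular repository product.

```python
import sqlite3


def harvest_columns(source: sqlite3.Connection) -> list[tuple[str, str, str]]:
    """Return (table, column, declared type) for every table in the source catalog."""
    harvested = []
    tables = source.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA arguments cannot be bound as parameters, so the name is
        # quoted inline; it comes from the catalog itself, not from user input.
        for _cid, column, col_type, *_rest in source.execute(
            f'PRAGMA table_info("{table}")'
        ):
            harvested.append((table, column, col_type))
    return harvested


# A source database with one table (names are illustrative).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE instructor (licno INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# A toy repository, itself stored in database tables.
repository = sqlite3.connect(":memory:")
repository.execute("""
    CREATE TABLE data_element (
        table_name  TEXT,
        column_name TEXT,
        data_type   TEXT,   -- technology metadata, harvested automatically
        description TEXT,   -- business metadata, supplied by a data steward
        PRIMARY KEY (table_name, column_name)
    )
""")
repository.executemany(
    "INSERT INTO data_element (table_name, column_name, data_type) VALUES (?, ?, ?)",
    harvest_columns(source),
)
repository.execute(
    "UPDATE data_element SET description = ? WHERE table_name = ? AND column_name = ?",
    ("Practitioner license number for certified course instructors", "instructor", "licno"),
)

elements = repository.execute(
    "SELECT table_name, column_name, data_type, description "
    "FROM data_element ORDER BY column_name"
).fetchall()
for row in elements:
    print(row)
```

The split between the harvested data_type column and the hand-maintained description column mirrors the distinction between technology metadata and business metadata.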

Most of the popular repository products are mainframe-based and rely on a centralized metadata "database," or repository. This approach is usually better suited for documenting OLTP-based systems. Such a repository may be more difficult to use in a data warehouse environment because a mainframe focus can present challenges when managing metadata in a distributed, state-of-the-art data warehouse implementation. Many ETL tools used in data warehousing projects also contain a repository that is geared toward the needs of the data warehouse. Organizations needing to manage metadata for both OLTP and data warehouses should make sure that the data in their ETL repositories can be migrated successfully to the OLTP repository.

Other repository products are application-centric. Such repository technology focuses on application development metadata, which is useful, but not comprehensive. For example, the Microsoft Repository focuses on Visual Studio and Microsoft computing assets. Microsoft has partnered with Computer Associates, makers of the market-leading PLATINUM Repository, to provide additional enterprisewide capabilities for the Microsoft repository technology.

Repository Benefits
Repository technology provides many benefits to organizations that properly exploit its capabilities. The metadata in the repository can be used to integrate views of multiple systems, helping developers to understand how the data is used by those systems. Usage patterns can be analyzed to determine how data is related in ways that may not be formally understood within the organization. Discovery of such patterns can lead to business process innovation.

In general, the primary benefit of a repository is the consistency it provides in documenting data elements and business rules. The repository helps to unify the "islands of independent data" inherent in many legacy systems. The repository enables organizations to recognize the value in their legacy systems by documenting program and operational metadata that can be used to integrate the legacy systems with new application development.

Furthermore, a repository can support a rapidly changing environment such as those imposed by Internet development efforts on organizations. The metadata in the repository can be examined to produce impact analysis reports to quickly determine how changes in one area will impact others.

Reusability is a big time saver. If something can be reused instead of being developed again from scratch, not only will time be saved but also valuable resources can be deployed on more crucial projects. Repositories facilitate reuse by documenting application components and making this metadata available to the organization. Finally, repositories are an invaluable aid to data warehousing initiatives.
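The impact analysis mentioned above amounts to walking "is used by" relationships recorded in the repository. The sketch below assumes those relationships are available as a simple mapping; the object names (the view, program, report, and job) and the impacted_by function are hypothetical.

```python
from collections import deque

# Hypothetical "is used by" relationships, as a repository might record them.
used_by = {
    "instructor.licno": ["view:active_instructors", "program:BILLING01"],
    "view:active_instructors": ["report:monthly_roster"],
    "program:BILLING01": ["job:NIGHTLY_BILLING"],
}


def impacted_by(changed: str, used_by: dict[str, list[str]]) -> list[str]:
    """List everything that directly or indirectly depends on `changed`.

    Breadth-first traversal, so nearer dependents are reported first.
    """
    seen = {changed}
    impacted = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in used_by.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                impacted.append(dependent)
                queue.append(dependent)
    return impacted


report = impacted_by("instructor.licno", used_by)
for item in report:
    print(item)
```

The traversal is only as good as the relationships recorded, which is one reason keeping the repository current matters so much.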

Repository Challenges
One of the biggest challenges in implementing and using repository technology is keeping the repository up-to-date. The repository must be populated using data from multiple sources, all of which can change at any time. When the composition or structure of source data changes, its metadata most likely will need to change, too. The process for populating the repository is complicated and should be made as automated as possible. Refer to Figure 21-1. Metadata sources come from multiple areas and locations within an organization and can include

• Application component metadata from program development tools, application programs, and code libraries
• Business metadata from business user input, documents, and memos
• Data modeling metadata from data modeling tools
• Database metadata from the DBMS system catalog
• ETL metadata from data warehousing tools
• Operational metadata from automated operations and job scheduling tools
• Other types of metadata such as data usage metadata from query tools

Figure 21-1. Populating the repository

To be successful, this information needs to be collected, parsed, and recorded in the corporate metadata repository. The integration process must take into account the frequency of change for each metadata source. Whenever metadata changes at the source, the metadata in the repository will be out of sync until the source metadata is scanned, captured, and integrated into the repository again.

Many shops do not own a repository. More accurately, very few shops own a centralized metadata repository. Furthermore, many organizations that do own a repository do not always implement the proper integration and usage procedures, causing the repository to be neglected. As soon as the metadata in the repository becomes outdated, inaccurate, or nonexistent, the repository will cease to be of value. Of course, the fault does not necessarily lie with the repository technology; more likely the fault lies with the organization that does not implement procedures for keeping the metadata in the repository up-to-date. Such an effort requires a significant budget, commitment, and the effort of skilled data management professionals, including DAs and DBAs.
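One simple way to notice the out-of-sync condition described above is to store a fingerprint of the source metadata at each scan and compare it later. The sketch below uses SQLite's catalog as the source; the catalog_fingerprint function and the idea of storing a hash are illustrative assumptions, not a feature of any specific repository product.

```python
import hashlib
import sqlite3


def catalog_fingerprint(conn: sqlite3.Connection) -> str:
    """Hash the DDL recorded in the catalog so any definition change alters the result."""
    rows = conn.execute(
        "SELECT type, name, sql FROM sqlite_master ORDER BY type, name"
    ).fetchall()
    return hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE instructor (licno INTEGER PRIMARY KEY, name TEXT)")

# Fingerprint captured when the repository last scanned this source.
recorded = catalog_fingerprint(source)
in_sync_before = catalog_fingerprint(source) == recorded

# A definition change at the source: the repository copy is now stale.
source.execute("ALTER TABLE instructor ADD COLUMN hired_on TEXT")
in_sync_after = catalog_fingerprint(source) == recorded

print(in_sync_before)  # True
print(in_sync_after)   # False
```

A check like this only reveals that a rescan is needed; how often to run it depends on how frequently each metadata source changes.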

Data Dictionaries
Data dictionaries were the precursors to repository technology. Data dictionaries were popular in the 1980s. The purpose of a data dictionary was to manage data definitions. In general, they offered little automation: the user had to manually key in the definitions. In some cases, the data dictionary was integrated into the DBMS and databases could be defined using the metadata in the data dictionary, but this was prerelational, before DBMS products had system catalogs.

As more and more types of metadata were identified and organizations desired to accumulate and manage such metadata, the data dictionary was transformed into the repository. Use of CASE tools, such as Excelerator and Advantage Gen, for application and database development enabled more metadata to be captured and maintained during the development process. As developers became more sophisticated over time, data dictionaries evolved to provide more than just data attribute descriptions. The products became capable of tracking which applications accessed what databases. Developers who used the data dictionary properly were able to maintain their systems and applications more easily.

Truthfully, IBM's AD/Cycle and Repository Manager initiatives caused much of this transformation. Even though both initiatives ultimately failed in the marketplace, repository technology was forever changed by IBM's ventures into this field. For more information on IBM's initiatives in this area, consult IBM's Repository Manager/MVS by Henry C. Lefkovits, the definitive book on the topic.

This chapter on metadata management has been necessarily brief. As a DBA, you will need to understand the role of metadata as it impacts the DBMS, databases, and database users. Organizations that spend a lot of time managing and maintaining metadata will likely have a data administrator on staff. Alternatively, the data warehouse administrator or architect might focus on metadata management. DBAs may become involved in certain aspects of metadata management, such as repository selection, installation, and maintenance. However, most DBAs will use metadata far more than they will be called upon to store, manage, and maintain metadata.