
Data Management Task Force Final Report

CAUSE INFORMATION RESOURCES LIBRARY

The attached document is provided through the CAUSE Information Resources Library. As part of the CAUSE Information Resources Program, the Library provides CAUSE members access to a collection of information related to the development, use, management, and evaluation of information resources--technology, services, and information--in higher education. Most of the documents have not been formally published and thus are not in general distribution.

Statements of fact or opinion in the attached document are made on the responsibility of the author(s) alone and do not imply an opinion on the part of the CAUSE Board of Directors, officers, staff, or membership.

This document was contributed by the named organization to the CAUSE Information Resources Library. It is the intellectual property of the author(s). Permission to copy or disseminate all or part of this material is granted provided that the copies are not made or distributed for commercial advantage, that the title and organization that submitted the document appear, and that notice is given that this document was obtained from the CAUSE Information Resources Library. To copy or disseminate otherwise, or to republish in any form, requires written permission from the contributing organization.

For further information: CAUSE, 4840 Pearl East Circle, Suite 302E, Boulder, CO 80301; 303-449-4430; e-mail info@cause.colorado.edu. To order a hard copy of this document, contact CAUSE or send e-mail to orders@cause.colorado.edu.

Administrative Information Services (AIS)
Academic Information Systems (AcIS)

Data Management Task Force: Final Report
31 January 1992

SECTION 1: OVERVIEW AND BACKGROUND

OBJECTIVES

The task force on Data Management was commissioned by Michael Marinaccio, Deputy Vice President for Administrative Information Services, and by Vace Kundakci, Deputy Vice President for Academic Information Systems. The group was given a two-fold objective: to recommend (1) a technical environment to support data demands, and (2) an organizational structure that can support data ownership, custodianship, and administration. The orientation of the task force was to be future projects and technology rather than the current reengineering of existing systems.

The task force members included:
- Alan Crosswell
- Terry Davidson
- Bob Juckiewicz (Chair)
- Ken Lee
- David Millman
- Lou Proyect
- Steve Rosenthal
- Fred Trickey

Interviews and presentations were held with:
- David Bloom, AMS
- Joe Judenberg, Computer Assistants
- Mike Titmus, AMS

CURRENT ENVIRONMENT IS FUNCTIONALLY ORIENTED

Our current portfolio of applications was designed over many years with differing technologies, making it difficult to meet new requirements or to answer managerial questions. Today we maintain redundant data in functionally oriented systems, making it difficult to link data between applications.

THE TIME IS RIGHT

AIS is presented with the task of reengineering its application portfolio within a relatively short period of five years. This provides an opportunity to put into place a strong organization and process so that data requirements become the basis of the architecture. As a beginning step, AIS is developing a high-level entity model of the University's administrative data needs. This model will be the starting point for integrating data across new application systems. As the University implements these systems, AIS will work with the vendor(s)--most likely AMS--to ensure that the new systems correlate to the model and that data redundancy is eliminated.

GUIDING PRINCIPLES

The following major principles guided the task force as it explored the issues of data management:

- Data is a University resource, and its use for decision-making will be made available to all those with a need to view the information.

- The architecture for administrative information systems will be driven by the data model. A data architecture is the foundation by which the University can quickly respond to changing regulatory and University requirements. New applications must follow the model, with every attempt made to eliminate data redundancy. (The task force, however, realizes that there may be a need for "planned redundancy" of data, particularly in decision support systems.)

- There are three classes of data--University, departmental, and personal--that must be considered when defining new systems and access privileges.

- The data model and associated tools will be made available to all, with appropriate safeguards and procedures to protect the integrity of the model.

- Standard access methods will be employed for connectivity and interoperability.

SECTION 2: TECHNICAL ISSUES

RELATIONAL DATABASE MANAGEMENT SOFTWARE ON A HOST COMPUTER

For the next couple of years AIS will be implementing new systems by acquiring packages. American Management Systems (AMS), the leading application software vendor, has designed computerized systems for large, complex universities, primarily on IBM mainframe platforms. With the recent installation of the ES 9000 computer, Columbia will, for the next several years, be an IBM mainframe-based shop.

Higher education is coming under increasing financial constraints and additional regulatory requirements. This environment requires AIS to respond quickly to the demands for management information for decision making. We believe that a relational database management system will provide the underlying technology to permit quicker response to these changing needs. In our IBM environment, DB2 is the logical relational database of choice, and the task force worked under this assumption. DB2 has won wide acceptance in the marketplace, and many third-party vendors have developed tools to support the product.

A major attraction of a relational database is that users will find it easier to understand their data and reports. Users can readily grasp the idea of a table made up of rows and columns. For the technical staff we expect to see increased productivity, as the staff will be able to alter the system, to add fields, and to define new relations without affecting production programs.

A drawback to relational systems is that they require more CPU resources than VSAM file structures or an IMS database. Relational systems, because of their nature, will always require more resources than other file structures, and therefore DB2 may not be suitable for high-volume transaction systems. However, IBM has made important strides to address this issue, and each new release has seen an improvement in performance. In Columbia's environment, there is the luxury of not having a tremendous need for high-performance, high-volume transaction processing systems. An occasional high volume, for example during peak processing of registration, can be accommodated through prudent machine tuning and timing of other work.

The use of a relational database on a mainframe will not necessarily marry the University to this solution forever. Through AMS's layered approach to software development, it is possible to migrate from this environment to another. The current idea of distributing data across processors is attractive because it offers the prospect of exploiting the cheap MIPS available on workstations and UNIX hardware. However, the task force's research has found that the distributed database software needed to build large, complex systems will probably not be available until 1993-94 at the earliest (see the section on distributed databases).
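To make the flexibility argument concrete, the sketch below shows the kind of change a relational system allows: a new field is added to a table without disturbing the programs that already use it. The table and column names are invented for illustration and are not drawn from any Columbia system.

    -- Hypothetical table: rows and columns that a user can readily picture.
    CREATE TABLE student_charge
        (student_id   CHAR(9)       NOT NULL,
         term         CHAR(5)       NOT NULL,
         charge_amt   DECIMAL(9,2)  NOT NULL);

    -- A later requirement adds the date a charge was posted. Production
    -- programs that name only the original columns continue to run unchanged.
    ALTER TABLE student_charge
        ADD dte_posted DATE;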

OLTP vs. DSS: HOW MANY SUBSYSTEMS?

It is necessary to isolate operational DB2 for On-line Transaction Processing (OLTP) in its own subsystem. If a Decision Support System (DSS) or any other SQL application is to be developed, a separate DB2 development subsystem is required. Each subsystem requires at least 15M of real storage for the DB2 address spaces and additional real storage for the "allied agent" address space for the host language program. An additional CICS region would require approximately 5M of real storage.

Two DB2 runtime environments should be established, controlled by separate catalogs. One can be used for development and test, and perhaps as a DSS region for a pilot application. The production environment would be tuned exclusively for OLTP. We will need procedures, and probably tools, to facilitate migration of application plans and packages between the test and production DB2 subsystems. This raises the question of the relationship between database administration and tech support, which will have to be empowered (and trained) to perform this task.

It would seem that the concurrency effects described below might require yet another DB2 subsystem for a DSS to be usable by its customer base. The package bind feature of DB2 V2.3 makes the plan rebind described in case 3 below unnecessary, so that application developers hold exclusive locks on the catalog for shorter periods; this, of course, allows more development to take place. OLTP, DSS, and application development make different and contradictory demands of DB2 subsystems. We recommend that for DSS, AIS explore distributing data to a functional server. However, we do not know the capacity constraints of such a distributed system. A pilot project supporting actual DSS requirements should be started with the particular task of exploring the performance envelope of different approaches to DSS.

What about standard reports? Reporting can be handled as it is today: it is a matter of scheduling reports to run at a time when the concurrency effects noted below are not a problem. There is no DB2 batch mode as such, although DB2 can be configured for batch processing (it must be brought down to accomplish this, or else an expensive tool must change the Zparms on the fly). Reports that reference the same file may run at the same time: the buffer pool will then contain a large number of the file's pages, which are a shared resource in the context of batch reporting.

The following cases illustrate concurrency effects within a DB2 subsystem:

1) It would be unusual to allow DB2 data that is participating in OLTP to be referenced by queries from outside the transaction processing system. An OLTP system supports quick access to a large database by a large population of concurrent users. The transactions often include updates. A user process that wishes to update a database must first lock some portion of the database so that no one else may update, or even read, the data that is about to be changed. In DB2 the granule of locking is typically the 4K page containing the data, although locks can be obtained (and sometimes promoted dynamically) at the table or tablespace level. If the column to be updated is an indexed field, then the index page(s) to be changed must be locked as well. In this way, updates are not lost, and the integrity of the data is ensured.

2) To satisfy concurrent requests for access to shared data resources, application programs must be coded to hold locks for the shortest time possible. While many readers may share access to data, a reader will prevent a writer from acquiring the exclusive lock she needs to update a page. If the DB2 optimizer detects that a user process is reading sequentially through a table, as would often be the case for a batch report, or that the predicate in a DSS query uses an index that is not selective enough, the optimizer determines that it is wasteful to lock each page in turn, scan the page, and release the lock. Instead, an S (share) lock is obtained on the entire table or tablespace (depending on the storage characteristics of the tablespace). No further updates are then allowed to any page of the table or tablespace until the sequential scan is complete.

3) An application developer binding a new version of an embedded SQL Data Base Request Module (DBRM) causes the plan catalog (SYSIBM.SYSPLAN) to be X (exclusive) locked so that the new plan can be written to the catalog. Under the current version of DB2, V2.2, all of the SQL associated with the CICS transaction is automatically rebound. While this process is going on, the plan table is locked, so that queries cannot read their plans and cannot execute.

4) A DSS might support dynamic SQL, so that a query needs to be optimized in a "mini-bind" before it can execute. The optimization process reads certain catalog tables to obtain the statistics it needs to generate an application plan for the query to navigate through the database. In addition to the S locks on these tables, the optimizer obtains IX (intent exclusive) locks on the plan table (SYSIBM.SYSPLAN), which has the effect of making it impossible for OLTP processes to read their plans.

5) Finally, a static SQL application plan that does not lock the catalog and merely seeks to scan a table that is of no interest to anyone else will still flood the buffer pool with its own data. DB2 assumes that data is required as a shared resource. Table or tablespace scans should not be allowed in OLTP environments.
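A short sketch may make case 2 concrete. The table, columns, and values below are invented for illustration and are not drawn from any Columbia system; the point is only that a sequential scan's share lock and an OLTP update's exclusive lock cannot coexist.

    -- A DSS-style query with no selective index forces a scan of the table,
    -- so DB2 escalates to an S (share) lock on the table or tablespace.
    SELECT student_id, balance_due
      FROM student_account
     WHERE balance_due > 0;

    -- While that scan holds its S lock, an OLTP transaction attempting the
    -- update below must wait to acquire an X (exclusive) lock on the page.
    UPDATE student_account
       SET balance_due = balance_due + 150.00
     WHERE student_id = '000123456';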

STRUCTURED QUERY LANGUAGE (SQL)

STANDARD ACCESS THROUGH SQL

Use of a standard dialect of SQL will enable applications and databases to be transported from one system type to another, or to participate in distributed databases among heterogeneous database server and client platforms. In an ideal world, this would allow a database residing on DB2, for example, to be moved (or extracted) to a Unix or OS/2 SQL server with no changes required to the SQL statements embedded within applications.

STATUS OF SQL STANDARDIZATION

IBM invented the SQL language for a research predecessor to SQL/DS and DB2. ANSI adopted an early version of SQL as an American National Standard, warts and all. The ANSI SQL standard has a number of failings, including not covering enough ground. ANSI SQL also carries some mistakes inherited from the IBM version which IBM has since fixed. For the past several years, ANSI has been working on a SQL2 standard; however, it is still in the draft stage. SQL2 addresses many of the shortcomings of the original standard. Many database products do provide ANSI SQL compatibility, but always as a subset of the product's full capabilities. Many experts say that ANSI SQL is simply too limited a subset to be useful in real applications.

IBM has an SAA standard for SQL. None of IBM's SQL products currently implements it fully; each (DB2, SQL/DS, AS/400, OS/2 EE) has its own idiosyncrasies.

The SQL Access Group (SAG) is a consortium of about 30 major SQL vendors. SAG has been working in conjunction with X/Open, an open-systems vendor consortium, to come up with its own standard for SQL, as well as a standard for distributed SQL based on the OSI Remote Database Access protocol (RDA). This standard is expected to appear in the next release (4) of the X/Open Portability Guide, due out in mid-January 1992. SAG's distributed SQL standard is a competitor to IBM's Distributed Relational Database Architecture (DRDA). DRDA is an IBM-only architecture, designed for compatibility across IBM's several SQL platforms. One drawback of DRDA is that it requires the application to know which SQL server platform it is talking to, so that it can take advantage of that particular server's flavor of SQL. The SAG-X/Open solution is to use a common SQL dialect but to allow explicit extensions to the dialect when needed.

RECOMMENDATIONS

While SQL standardization is still in a state of flux, database and SQL application designers should take the following into account:

- To plan for the possibility that the database and/or application may be re-homed to a different SQL platform in the future.

- To keep abreast of standardization efforts and vendor compliance. The work of the SQL Access Group and X/Open especially should be tracked, since they represent just about every SQL vendor other than IBM. The ANSI SQL2 draft should also be reviewed.

- To use portable data types, not vendor extension data types, when defining database schemas. If a vendor extension data type is used, it should be justified, and how that data will be transported to a different SQL platform should be documented.

- To adhere to standard constructs when writing SQL data manipulation statements. For example, while an outer-join operator is handy, there may be problems transporting code to a new platform, since many SQL servers don't implement it (see the sketch following this list).

- To ask SQL software vendors what their approach to SQL standards and portability is.
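The following sketch illustrates the portable-schema and standard-construct recommendations above. The table and column names are invented and are not part of the report; the data types shown are among those common to most SQL implementations of this period.

    -- Hypothetical schema restricted to widely supported data types.
    CREATE TABLE degree_award
        (student_id   CHAR(9)       NOT NULL,
         degree_code  CHAR(6)       NOT NULL,
         award_year   INTEGER       NOT NULL,
         honors_flag  CHAR(1),
         gpa          DECIMAL(4,2));
    -- Date/time and long-text types vary by vendor; if one is used, document
    -- how the column would be carried to another SQL platform.

    -- Prefer standard joins expressed through the WHERE clause; vendor
    -- outer-join notations may not transport to another server.
    SELECT d.student_id, d.degree_code, s.last_name
      FROM degree_award d, student s
     WHERE d.student_id = s.student_id;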

DISTRIBUTED DATABASE MANAGEMENT SYSTEMS

INTRODUCTION: WHY WE CARE ABOUT DISTRIBUTED DBMS

Ideally, a distributed DBMS is a set of independent database management systems, connected via a network, on different hardware and software platforms, perhaps also in physically disparate places, which appear to any client as a single DBMS. Benefits achieved from such a configuration include:

(1) Conformance to a data model which assumes that individual units of the enterprise have distinct data interests which they do not necessarily share with all other units.

(2) Savings in communications costs, especially for units which have a high transaction volume on specific data. In such cases network traffic remains localized within the unit physically possessing either the "primary" copy of the data (OLTP) or the copy representing the particular research interests of the unit (DSS, or Decision Support Systems).

(3) Savings in processing costs. The cost per unit of processing (MIPS, CPU cycles, etc.) is lower for decentralized computing. Incremental costs can be more accurately estimated, and particular departments may independently adjust their processing requirements.

(4) Increased reliability. Decentralization limits the effects of a single point of failure, either in the communications network or in a central point of processing.

(5) Encouragement of managerial research (DSS). By providing the mechanisms for separate summary information to be removed from the more performance-sensitive daily transactions, a distributed system suggests taking advantage of these mechanisms, which are often targeted to common desktop applications (e.g., Paradox, DBase/XBase) familiar to management and sufficient for their processing requirements.

(6) Conformance to the strategic directions of both IBM and everyone else who sells us things. The IBM DRDA specification, its SAA strategy, the ability of DB2 (release 2.2) to cooperate as a unified database with other peers, the plans for DB2 (2.3) to cooperate in that way with SQL/DS, and the future plans to include OS/2 and AS/400 platforms in the same scheme all indicate that IBM has taken this architecture quite seriously. That their plans include only their own products is, as usual, simply a bargaining chip against the rest of the market, which will permit them a certain edge for a limited amount of time. Most other vendors have also undertaken a strategy of interoperability between database systems, but they have generally decided to cooperate with each other as well. IBM sends a double message: cooperation among disparate platforms--but just our disparate platforms. This kind of unfortunate hypocrisy is marketing alone; it will solve few of the real problems we face.

(7) Vendor independence. Despite the IBM problem above, distributed database systems are, by nature, designed to permit an individual department to select the most appropriate database software for its particular needs, while also maintaining accurate copies of "other people's" data. Several quite active standards bodies, which include almost all vendors, are involved in ensuring this capacity (OSI, ANSI, X/Open, SQL Access Group).

MARKET SURVEY: WHY WE CAN'T HAVE DISTRIBUTED DBMS NOW

At the risk of undermining this entire endeavor, it remains very important to note that there are currently no commercial products capable of satisfying the desires above. Surprisingly enough, we find ourselves at the very edge of technology. While, as mentioned, virtually all database vendors are actively attempting to provide the above benefits, there have been, as yet, no real products. A "real" product is lately measured by its ability to meet the now-famous standards set out by Chris Date in "Twelve Rules for a Distributed Data Base" (Computerworld, June 8, 1987, p. 75). These standards mostly specify the independence and the mutual transparency of the various database software participants over the network. A few reasons why there are no commercial products available right now:

(1) Protocol Standardization. In an arena where high-level cooperation will be the key to a distributed DBMS, standardization has not been finalized. Many vendors appear to be waiting on the outcome of that process, and it appears worthwhile to permit them the time to make their best efforts to work with each other. While an IBM-only alternative is perhaps more readily promised, it requires substantial reinvestment in hardware, which is perhaps inappropriate and is certainly premature at this time.

(2) Two-Phase Commit. The "two-phase commit" is a network protocol enabling cooperating database management systems to perform distributed "update" transactions. (An "update" is the ability to change information, rather than just look at it.) While this process is well understood and perfectly practical within the database research community, it does not adequately address the performance of a production network, nor does it comprehensively specify the behavior of the cooperating systems (especially if some are down).

(3) Network and Processor Reliability. While this is really a special case of the problem in (2) above, it also holds the potential to solve it. Server redundancy and guaranteed network performance could make the two-phase commit a realistic possibility and thereby make the distributed DBMS happen.

Because large technological strides have recently been made in these areas, they may well be viewed as the current bottom-line bottlenecks to the solution.

(4) Design. Many of our current development and design methods retain built-in assumptions of a centralized database. If our modeling and "CASE" efforts can establish a broader constituency--for client enterprise units and for computer products, respectively--then we will be in a better position to accommodate changes in either.

CRITERIA SUMMARY: WE WOULD LIKE A SINGLE DBMS

We believe we want to realize the model, variously described above, of a single DBMS. It should be distributed across a network so that it is optimized for transaction performance in the units which perform the largest number of transactions, and for reporting performance in the units which do the most reporting--quite independently of each other. A case has perhaps been made that a distributed DBMS is the best possible solution. Without currently reasonable choices, we should remain "shoppers" in this market while making note of the following issues:

(1) Protocol standardization. Our vendors need to know that we are interested in this, and further, that we will not be willing to work with vendors who aren't actively pursuing standardization. A couple of levels should be watched carefully: the high-level syntax, e.g., SQL, and the lower-level transport mechanisms, like IBM DRDA and OSI RDA. Conformance, commitments, and gateways between different protocols should be part of any vendor strategy.

(2) Query Optimization. A principal design issue in the distributed DBMS products now available is the method of query optimization. Vendors' strategies must account for the network traffic incurred by their DBMS query optimization methods.

(3) Recovery Control. The two-phase commit, above, may not be the most effective way of ensuring data integrity. Vendors must demonstrate rollback procedures, including procedures that cooperate across standard protocols.

(4) Security. How is authorization information passed across the network between applications? Thus far, this work has been quite primitive. We are analyzing this question, and it also needs to be answered by our vendors. Vendors should never require the "manual" maintenance of user authentication information; rather, they should be in a position to batch-load this data from whatever central security database is adopted here.

(5) Design Methodology. How does the distributed database product work with CASE tools? Products should offer distributed data dictionaries to properly support distributed application development. And in what ways do our own DBMS modeling efforts include assumptions about data distribution? Do our "CASE" data management tools assume a single, centralized database? Will they generate database dictionaries which can be used across the potentially distributed environment described above?

RECOMMENDATIONS: WHAT WE CAN DO IN THE MEANTIME

While waiting for a reliable set of products to implement the ideal distributed DBMS, we can certainly take some proactive steps now. We can immediately benefit from a distributed data methodology by:

(1) Ensuring that our data-modeling and data-management efforts account for potential distribution of the data. For example, our CASE tools should support a wide variety of DBMSs and offer distributed data dictionaries.

(2) Designing replicated-data applications (a small sketch follows this list). As an intermediate step, physical copies of data may be used to achieve most of the benefits of a truly distributed system. This experience is also necessary for us to explore the implications of the more widely distributed systems we will be required to support in the future, and it will enable us to evaluate network performance and to recommend a network strategy.

(3) Continuing our pursuit of a network-based security system. Such a system must be in stable production before we can embark on any real distributed DBMS.
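As an illustration of item (2), the sketch below shows one simple replicated-data pattern: a read-only reporting copy refreshed on a schedule from the operational table. The table names, columns, and refresh scheme are invented for the example and do not describe any existing Columbia system.

    -- Reporting copy, kept on the server where the DSS work is done.
    CREATE TABLE rpt_acct_balance
        (acct_no    CHAR(15)       NOT NULL,
         balance    DECIMAL(11,2)  NOT NULL,
         as_of_dte  DATE           NOT NULL);

    -- Periodic refresh, run when the operational system is quiet.
    DELETE FROM rpt_acct_balance;
    INSERT INTO rpt_acct_balance
        SELECT acct_no, balance, as_of_dte
          FROM account_balance;    -- the operational ("primary") table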

SECTION 3: ORGANIZATIONAL ISSUES

A STRONG ORGANIZATION NEEDS TO BE IN PLACE

In a previous recommendation, the task force stated that the architecture for administrative information systems will be driven by the data requirements of the University. To ensure that our data can answer the required managerial questions, a strong, diverse organization with formalized responsibilities needs to be in place. The purpose of this section is to cover the fundamental issues, control objectives, and techniques related to the functions of data administration, database administration, data ownership, and data custodianship, which are vital to the implementation and maintenance of a database at Columbia University. To adequately support a database environment at Columbia, an organizational infrastructure must be created where none currently exists.

DATA MANAGEMENT RESPONSIBILITY HIERARCHY

The recommended organizational responsibilities include the following:

- Data Administration: Based in the Provost's Office, the data administrator defines data requirements and definitions at the highest level and maintains data integrity across organizational lines. Data administration's objectives are strategic and organization-wide.

- Database Administration: Based in AIS, the database administrator is responsible for the tactical implementation of the corporate database model developed within data administration. He or she is also responsible for the operational integrity of the physical database.

- Data Ownership: The owner of data is a senior manager within the University who is ultimately responsible for the data created and maintained within his or her department.

- Data Custodianship: The data custodian is a liaison between the data owner and the database administrator and is responsible for authorizing access to data.

ORGANIZATIONAL RESPONSIBILITY

Level         Department                                  Responsibility
STRATEGIC     Provost                                     Data Administrator
TACTICAL      AIS                                         Database Administrator
              Personnel Management & Human Resources      Data Ownership
              Student Financial & Information Services
              University Development & Alumni Relations
              Treasurer & Controller
OPERATIONAL   AIS                                         Database Administrator
              Personnel Management & Human Resources      Data Custodian
              Student Financial & Information Services
              University Development & Alumni Relations
              Treasurer & Controller

These responsibilities only describe general functional requirements. We do not indicate whether each responsibility is to be carried out on a part-time basis by one individual or by a staff of people on a full-time basis.

DATA MANAGEMENT RESPONSIBILITIES AND FUNCTIONS

DATA ADMINISTRATION

In organizations that have adopted a corporate database model--as Columbia has done explicitly through the high-level information architecture project carried out in conjunction with AMS--the responsibility of data administration becomes critical. If the model is to be successful in helping to implement mission-critical systems, the data administrator must be the ultimate guarantor of the integrity and reliability of the database model.

The model is a definition of data; it is metadata. Not only must the data administrator ensure the accuracy of the data definition, he or she must understand how to make the definition of the data available to the broad user and developer community. The data definition is like a road map. Without such a map, it becomes difficult to exploit the data contained on DASD or other media, or to get from one point to another. The data administrator is responsible for the accuracy of the map and its dissemination.

Based on these considerations, we recommend that the responsibility of the data administrator be defined for Columbia University. The data administrator will set policies and plans for the definition, organization, protection, and efficient utilization of University-wide data. The data administrator will function at a high level, since he or she must have a corporate-wide view of the data. Not only must this person be in a position to determine the logical view of the corporate data, the data administrator must also be in a position to arbitrate between different functional areas of the University whenever conflicts over ownership or interpretation arise. Consequently, we recommend that the responsibility for the data administrator be lodged in the Provost's Office of the University.

The tasks of the data administrator include the following:

- To determine the scope of data to be contained in the database (i.e., administrative, Health Sciences, academic, etc.). This task generally is associated with strategic planning and should be carried out prior to any full-scale systems implementation.

- To create an entity-relationship data model of the organization's data, ideally using an automated modelling tool such as Excelerator or Bachman DBA. The model must be verified against the users' view of the business and modified when the business changes. The E-R model supports creation of the logical schema within the DBMS.

- To define data elements and their synonyms, preferably using an automated data dictionary or repository (Excelerator, DB Excel, etc.). Data elements must be reviewed to ensure that no redundant definitions have been entered. Standards must be established to ensure that element names follow certain conventions (e.g., accounts must be prefixed with acct, dates with dte, etc.); a brief sketch of such a convention follows this list.

- To develop standard reports from the dictionary or repository showing relationships between various entities (programs, data elements, records, etc.). Reports should show cross-referencing between entities in order to reveal the impact of proposed database modifications. Reports can be made available on-line through QMF, SPUFI, or other facilities, and in batch with embedded SQL statements.

- To coordinate and plan for compatibility between DBMSs and existing data structures, and to set plans for conversion of existing data structures to the DBMS.
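As an illustration of the element-naming convention mentioned above, the sketch below shows what such a standard might look like in a table definition. The table and element names are invented for the example; the actual conventions would be set by the data administrator.

    -- Hypothetical table following the naming standard: account elements
    -- carry the acct prefix, date elements the dte prefix.
    CREATE TABLE tuition_charge
        (acct_ledger        CHAR(15)      NOT NULL,
         dte_charge_posted  DATE          NOT NULL,
         student_id         CHAR(9)       NOT NULL,
         charge_amt         DECIMAL(9,2)  NOT NULL);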

MAINTAINING THE DATA MODEL

The data model is an evolving one. As new initiatives are undertaken, AIS must examine new data requirements and make sure that there is congruency with the model. If not, decisions must be made to determine whether the data entities are required for University, departmental, or personal needs. If an entity is required for University needs, then the model must reflect this and appropriate changes must be made to the systems. During the requirements phase of implementation, departmental and personal data needs will be examined; during this phase of a project, the DA will be responsible for deciding whether or not to include these specific data needs in the model.

DATABASE ADMINISTRATION

In contrast to the data administrator, who is concerned with the strategic direction of how data is defined and utilized, the database administrator is more concerned with the day-to-day tactical implementation and maintenance of physical databases. The database administrator must work closely with the data administrator to ensure that the high-level logical model of data is accurately reflected in the schemas created by the specific database management system. By implication, the database administrator must have technical mastery of the DBMS being used, whether it is DB2, Ingres, or some other package.

The DBA is also primarily concerned with the operational aspects of the database. Both the users and the applications developers must have access to databases on a continuous basis if mission-critical activities are to be successful. When an organization relies on project-oriented flat-file systems, it may be possible to allow a certain amount of latitude. In a full-scale database environment such as the one we are projecting, there is little room for latitude. Correct tuning parameters for memory or DASD allocations can make the difference between 3-second and 30-second response times.

The tasks of the database administrator include the following:

- To be responsible for database design. The database administrator analyzes the entity-relationship diagram and constructs a schema for the DBMS. The schema takes one-to-many and many-to-many relationships into account; it defines primary, secondary, and foreign keys, and it defines record and record-element relationships in detail. The resulting logical structure should be reviewed with the application team.

- To be responsible for physical design and access methods (DB2, VSAM, etc.). This includes specification of database size (number of tracks, blocks, etc.), amount of free space, indexes, data clustering, data compression, controlled redundancy, distribution of files across volumes, etc.

- To assist application staff in using the database (SQL syntax, etc.). The DBA ensures that they make the most efficient use of database resources (record locking vs. table locking, etc.) and controls access to the database to prevent excessive, time-consuming search operations through QMF, SPUFI, or other uncontrolled on-line operations.

- To establish restart and recovery procedures. This includes the timing and scope of periodic backups, methods of partial recovery, cold-start vs. warm-start procedures, etc. The task must include a plan for recovery of data in the event of catastrophic destruction of major portions of the database and backup media.

- To monitor database performance through on-line and batch tools, recommending database or application modifications to enhance performance. The database administrator should primarily be familiar with tools that concentrate on DBMS statistics (e.g., average I/O per execution of a transaction), but should have some knowledge of MVS and CICS performance monitors. This facilitates a holistic view of system performance and makes tuning more effective.

- To monitor space utilization, ensuring that the database is expanded in a timely fashion so that applications can operate without interruption. The DBA must have expertise with the various methods of expanding available space (reorganizing the database, increasing the size of blocks, etc.).

- To determine how the database is to be distributed, if need be, and how deadlocking and integrity problems can be avoided within the distribution architecture. This responsibility would entail some expertise with non-mainframe products and technologies such as LANs, client-server DBMSs, micro DBMSs, etc.

DATA OWNER AND DATA CUSTODIAN

In a distributed processing environment, data will be maintained on various platforms and in various technologies. Whether maintained in a mainframe DBMS or on a local platform, however, responsibility for the data must be clearly defined. Responsibility for the data in University administrative systems should reside with the appropriate administrative division of the University, not with AIS. We recommend that the Data Owner and Data Custodian functions described below be clearly defined and assigned within the client community. Both of these functions, in fact, already exist and have been in practice within the University for some time, although not specifically defined or spelled out. AIS has always operated on the assumption that each functional user area owns the data for that area's applications. Individuals within client areas (e.g., Don Burd for Student Systems, Nick Goudoras for Financial Accounting Systems, etc.) have served as Data Custodians. We recommend that these responsibilities be formally assigned to each functional area.

DATA OWNER FUNCTION

All administrative information is a University corporate resource and as such is owned by the University. Data (the representation of that information) should be owned at the highest appropriate administrative level. Assuming that the proposed data model is implemented (dividing data into three categories: University-wide, departmental, and personal), ownership of the last two categories is self-defining. Personal data will be owned by the individual creating and maintaining such data. Departmental data will be owned by the director, department chairman, or other University officer responsible for the department under which the data is created and maintained. For University-wide data, ownership resides with the senior University officer responsible for the functional area which that data primarily serves:

- Financial data is owned by the Controller of the University.
- Alumni data is owned by the Vice President for University Development and Alumni Relations.
- Student data is owned by the Director of Student Financial and Information Services.
- Human Resources data is owned by the Vice President for Personnel Services.

(Note: The University Data Administrator would be responsible for arbitrating any disagreements about who "owns" specific data or data elements.)

"Data" in this context includes both the data itself (e.g., name and address information for students) and the application programs, data dictionaries, etc., which are created and used to maintain such data. Operating systems software is owned by the Data Center (that is, by the DVP for Administrative Information Services), but application software is not.

Specific responsibilities of the Data Owner:

- To approve data elements included in the application and their classification.
- To sign off on access and security policies for the data.
- To approve general-use policy for the data (including cross-application access).
- To be responsible for identification and enforcement of statutory and other external controls on the use and maintenance of the data.

The Data Owner's responsibilities are at the level of approving and/or determining general policies. For example, the Controller of the University would not be expected to identify and define every data element to be included in financial systems, but would be responsible for reviewing any proposed scheme of data elements to ensure its completeness and for approving it.

DATA CUSTODIAN FUNCTION

The Data Custodian is essential to the accurate and timely maintenance of University data. Responsibility for custodianship of specific data should be delegated by the owner of the data. The data owner is ultimately responsible for setting or approving policies for the definition and classification of data elements, for authorization for use of the data, and for authorization for access to the data. The data custodian is responsible for the implementation and administration of those policies to ensure the validity, consistency, and accuracy of the data.

Specific responsibilities of the Data Custodian:

- To assign data classification to specific data elements.
- To authorize cross-application use of data.
- To participate in establishment of general policies on access to the data for creation/modification/retrieval.
- To participate in design of access security profiles.
- To approve specific access requests for individuals or functional groups of individuals.
- To determine policies for retention, deletion, and archiving of data.

The Data Custodian will be the primary liaison between the Data Owner and the Database Administrator, the Security Officer, and other Data Center staff supporting the application and tasked with the day-to-day maintenance and administration of the database. Many of the above policies, and the procedures to enforce them, will be developed in cooperation with the Database Administrator, the Information Security Officer, and other Data Center staff. The Data Custodian will provide client-based knowledge of business requirements for the data, and Data Center staff will provide technical knowledge of how best to meet those business requirements.

TOOLS ARE AN IMPORTANT FACTOR FOR SUCCESS

As part of the original charge to the task force, data management tools were to be examined and recommended. The task force agrees that automated tools are an important ingredient in the success of data administration. Our data models, even at this early stage, are simply too complex to design and administer by hand. Some members of the task force have spent considerable time examining tools. It became obvious, however, that it would be impossible to arrive at specific recommendations within the timeframe the Task Force members had to complete their tasks. Therefore, the Task Force has two recommendations:

1) The data model currently being developed should be placed into Excelerator. AMS is currently using this product for their SIS and CUFS products.

2) One of the first tasks for the Data Administrator and the Database Administrator should be to examine their requirements carefully and to decide on the exact toolset that will be required to manage the University's data.

As a starting point for the second recommendation, the Task Force has developed guidelines and a matrix for selecting data management tools (see appendix).

REQUIREMENTS TO INSTALL DB2 V2.2

For each DB2 subsystem, IBM recommends:

- 9M + (1.1M * tmax) of real memory, where tmax corresponds to the maximum allowable number of CICS transactions per second; tmax should be construed as the anticipated number of transactions per second. This storage is for the DBMS nucleus and associated data structures.

- 4-6M of real storage for buffer pools (IBM recommends that BP0 be used exclusively).

- Additional real storage for the EDM pool (where plans go), if possible.

The default amount of real storage is 15,575K for a test DB2 subsystem. IBM estimates 43M of real storage for a production system. The estimates given above are for a "medium-sized site"; a "small" site has 100 application plans, 50 application databases, and 500 tables. Note that a CICS transaction is associated with exactly one (large) plan. A database, in the sense of DB2, is a collection of tables related in an application. In the near term, AIS would have perhaps one application database (for AMS CUFS), as many (large) plans as there are CICS transactions in CUFS, and perhaps 50 tables. So the estimate of required space is generous for the establishment of a test DB2 subsystem. However, it is clear that a production DB2 subsystem will require more real storage than is currently available in the ES/9120; adding this memory will mean doubling the installed real storage from 256M to 512M.

IBM also recommends 700 cylinders of 3380 DASD for DB2 libraries, internal work files (the DSNDB07 database, where result tables are materialized), and the two logs. The DASD should be distributed across 2 actuators for a test DB2 subsystem and 4-6 actuators for production. In addition, cache control for the existing control unit is considered a requirement. Application databases should reside on other DASD. IBM recommends that libraries not be shared across subsystems; DB2 is considered a fairly volatile piece of software, and a test and a production subsystem might be at different maintenance levels, or even different releases. Should there be multiple production subsystems (say, an operational OLTP system and a reporting system running at the same level of maintenance), DB2 could share the same libraries.

We do not know at this time to what extent DB2 is itself reentrant, so that real storage could be shared across subsystems. The requirement for dual logging in a DB2 development subsystem seems excessive. However, if there is to be only one DB2 subsystem, then dual logging and all recovery procedures should be fully implemented as part of the development of an operational system.

DB2 V2.2 (the current release) and V2.3 (which is scheduled to be fully supported on March 27, 1992) both require CICS 2.1.1 or higher. They are incompatible with CICS 1.7, which we are currently running. It is not anticipated that existing CICS 1.7 production applications at AIS will be migrated to CICS 2.1.1 in the near term, so we will be running multiple versions of CICS, which requires approximately 5M of additional real storage.

In order to access DB2, the TSO, batch, and CICS attachments must be installed. The batch interface can be invoked by submitting a PL/I program provided with the product, with dynamic SQL in-stream. Should a more convenient batch interface be required, it would have to be constructed on the basis of this utility. The AIS PL/I compiler and macro preprocessor may require maintenance. TSO access needs to be provided at least to database administrators, and possibly to developers. TSO security profiles need to be created (in TSS) to restrict staff involved with DB2 to just those TSO resources necessary to connect to DB2 (and any associated software tools). This is just the beginning of the security issue.

SECURITY ISSUES

The TSS/DB2 interface is now in beta test. It requires an upgrade to TSS/MVS 4.3, which is now in ESP. The latter is compatible only with version 2.2 of TSS/VM, which is in beta. Assuming that AIS will not acquire these products until they become generally available, DB2 will have to be maintained by way of its own internal security (GRANT/REVOKE). This should be a function of central database administration, authorized by the central security office. When the TSS/DB2 interface (and prerequisite products) are generally available, the DB2 user authorization exit for TSS/DB2 will need to be defined, and the TSS tables populated from the DB2 catalog. The administration of DB2 security should still be carried out by central database administration: effective administration of external DB2 security requires some understanding and manipulation of the DB2 environment. Authority to create DB2 objects needs to be set up, and a local database administrator identified for each application.
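A brief sketch of what internal DB2 security looks like may be useful here. The authorization IDs, database, and table names below are invented for illustration; the actual objects and authorities would be determined by central database administration.

    -- Grant a local database administrator authority over one application database.
    GRANT DBADM ON DATABASE cufsdb TO cufsdba;

    -- Grant a developer only the specific table privileges needed.
    GRANT SELECT, INSERT, UPDATE ON TABLE cufs.voucher TO appdev01;

    -- Privileges are withdrawn the same way; note that a REVOKE cascades to
    -- any grants the revoked authorization ID made in turn.
    REVOKE INSERT, UPDATE ON TABLE cufs.voucher FROM appdev01;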

CUFS IMPLEMENTATION

Assuming the University purchases the AMS CUFS DB2 application package, AMS should be consulted as to the appropriate settings for DB2 install-time parameters (Zparms), as well as operational issues and scripts for the creation and recreation of CUFS DB2 objects. This needn't wait for DB2 to be installed. AIS may wish to change these settings in due course, but a good place to start making DB2 operational is to understand the AMS parameter settings and their rationale.

We are told by AMS that implementation of CUFS at Columbia will not require the development of additional SQL. All of the SQL that is required comes with the DB2 flavor of a Core Foundation Software application package, embedded in external COBOL subroutines. The embedding programs need to be preprocessed, and the resultant Data Base Request Module needs to be bound exactly once. There is no need for dynamic SQL access to DB2, and so the requirement for TSO connectivity is exactly as it is now, with the exception of TSO authorization for a database administrator local to the application system. Thereafter, access to DB2 by CUFS developers is by way of plan execution in the context of CICS transaction processing. CUFS developers do not require dynamic SQL access to DB2.

There are at least two exceptions to this scenario. First, any bridgeback conversion system, in which changes to data in the new system must be applied back to the old, would require the use of SQL; in that case developers would require dynamic SQL and TSO connectivity. Secondly, a new, native SQL development project, such as a Decision Support System (DSS), would also require that the programmer workbench connect to the DB2 development environment.

UPGRADING TO VERSION 2.3

Assuming that DB2 V2.2 is to be installed this spring, AIS should set aside time during the summer for the migration to DB2 V2.3. The new version is easier to administer, due to the package bind feature. It is also considerably more difficult to understand, thanks to optimizer enhancements. Without package binds, any change to the SQL of a CICS/DB2 application in development requires that all of the SQL associated with a CICS transaction be rebound. Any rational version control of CICS/DB2 development requires package binds. AIS should migrate to DB2 V2.3 and require that CUFS development at Columbia use package binds.

INSTALLATION SCHEDULE

The amount of time required to install DB2, as estimated by IBM, is as follows:

- 2 days to plan installation parameters.
- 1 to 5 days to install via IPO tape (customized by IBM).

After the product is installed in the MVS/ESA test system (which at the time of this writing does not exist), the DB2 subsystem needs to be migrated to the live MVS/ESA system. This involves "flipping" the system residence packs, which takes 1 to 2 days, and the scheduling of systems time to verify; an IPL of the live MVS/ESA system is required. With the change control process, DB2 can be migrated from MVS/ESA test to the live system in about one week.

IBM estimates for setting up a DB2 environment include the following:

Establish Secure Environment                   3 days

Operations and Recovery:
  Startup/Shutdown procedures                  2-3 days
  Backup/Recovery Strategy                     2-3 weeks
  Monitoring/Control procedures                1-2 weeks
  Recovery Administrative procedures           1 week
  System Backup/Recovery procedures            4 weeks
  Application Backup/Recovery procedures       2-3 weeks

INSTALLATION-TIME TRAINING

DB2 System Programming Workshop (D2SP)
Vendor: Amdahl
Class location: Columbia, MD
Time: 4 days (sessions beginning Feb 11, 1992, and Mar 10, 1992)
Tuition: $1,260
Audience: systems personnel assigned to install the product; database administration personnel assigned to assist in (and know about) the initial configuration.

The workshop covers:
- installation and management of DB2 as an MVS subsystem.
- connecting to DB2 via TSO and batch.
- the CICS Call Attach Facility.
- establishing DBA access to DB2, presumably using TSO.
- establishing procedures for operation and recovery.
- students perform an installation.

Issues to resolve: determination of the division of responsibility between systems and DBA staff (e.g., who has the power to direct MVS console activity as it relates to the DB2 subsystem?); conformance of the AIS DB2 environment with the requirements for CUFS; establishment of the number of DB2 subsystems, considering that sufficient real storage may be unavailable.

DB2 AND ASSOCIATED RESOURCE COSTS

The cost for DB2 itself is as follows, including the 15% educational institution discount:

- DB2 V2.2, monthly licence, Group 40 processor: $3,506.00
- DB2 Performance Monitor (online monitor), monthly licence, Group 38 processor: $368.05
- DB2 Performance Monitor (batch monitor), monthly licence, Group 38 processor: $952.00

The cost of the associated hardware (256M of memory) is itemized in the production cost estimates below.

DB2 DATABASE ADMINISTRATION TOOLS

There are a number of contending products to assist in DBA tasks. Some give improved performance over the utilities that come with DB2, and we should not acquire these products until we have evaluated the DB2 utilities in V2.3 and determined that we need the enhanced performance of the utilities in question.

DB2PM uses the same statistics that other monitors use; unlike other monitors, DB2PM can be leased. We can use DB2PM until we find out why we need something better.

Other DBA tools assist in crucial tasks for which there is no utility provided with DB2. The core example is dropping and recreating an object. For example, say a table has a clustering index (so that the data in the table is in the same order as the index), and after a number of insertions the table goes out of cluster. Existing application plans will continue to use the index, although not as efficiently; new plans, including any subsequent rebinds, might not use the index at all. Now the table needs to be reorganized. To reorganize the table, it must be unloaded, dropped, recreated, and reloaded. When it is dropped, all entries in SYSIBM.SYSCOLUMNS for the table's columns are dropped; all views based (at all) on the table are dropped, as are the corresponding columns; all views based on those views are dropped; and so on. All catalog records of permissions granted with respect to any of these objects are dropped. Any plan that refers to any of these objects is invalidated (but not dropped). To recreate this state of affairs requires constructing the Data Definition Language necessary to recreate the objects, including the Data Manipulation Language clauses included in the view definitions, and the Data Control Language to restore the grants. Similarly, if an authid is removed from the system, then any privileges granted by that authid (possibly with the grant option) are removed, which may not be what is wanted. It is not difficult to write a program to record these dependencies. Different products (like Platinum's RC/Update) will generate the necessary code. It is possible to do without a product, but not without some utility to perform this common task. Again, consultation with AMS as to their methodology for reorganization of DB2 tables should prove instructive.

These products are bundles of functions, some of which we will not need. For example, generating DCL to recover from the cascading REVOKE just described is not a problem if we use external TSS/DB2 security; yet we shall have to pay for this function if we purchase RC/Update or something like it. The other side of this story is that the products are not extendible to include user-written functions or products from other vendors. RC/Update lists for $43,125.

One criterion for the acquisition of database administration tools is that a single functional area in AIS should not have too many different products. In this sense, the Candle OMEGAMON for DB2 monitor ($23,000) has the inside track in the systems area, and we could consider acquiring the suite of Candle products ($90,000, list).

Some tools can be used by systems programmers, central database administrators, local database administrators, and application developers. In that case, it is necessary that the utilities be securable so that users are restricted to the correct scope of application: a local database administrator should only have access to her own database.

Notice that if we use external (TSS/DB2) security, the DB2 catalog may lack the information these tools need to distinguish among users.

In the area of application development, a debugging tool--something that can explain the DB2 EXPLAIN tables to the developer--may serve as an adequate mode of online access to dynamic SQL.

There is a remarkable difference in price among tools that advertise similar functionality. Before purchasing a product, we should show that we need the functionality provided. Besides, none of this stuff is magic, and we could always write something ourselves.

DB2 DATABASE ADMINISTRATION TRAINING

DB2 Database Administration Workshop (U4066)
Vendor: IBM
Location: NYC
Time: 4 1/2 days; offered Mar 2, 1992, and May 4, 1992
Tuition: $1,620
Audience: all central database administration staff; if there is a particular application in question, then the DBA local to that application should also receive this training.

The workshop covers:
- creation of DB2 objects: databases and other data objects, plans and packages, authids.
- internal DB2 security.

Issues to resolve: custodianship of data and of plans and packages; the relationship of the DBA to the Security Office; determination of the division of responsibility among central and local DBAs; the relation between the data administration function (including data modeling) and database administration (the implementation of a model in a specific software environment).

Note that the March 2 class conflicts with the CICS/DB2 course below. All IBM DB2 classes have now been upgraded to include DB2 V2.3.

TRAINING IN DB2 APPLICATION PROGRAMMING AND SUPPORT

SQL Application Programming for DB2
Vendor: Platinum
Location: Shearson Lehman, 390 Greenwich, NYC
Time: February 24-28, 1992 (5 days)
Tuition: $1,250; $950 if more than 2 students
Audience: DBA local to the application. For other SQL development, we may want to bring training in-house.

CICS/DB2
Vendor: Platinum
Location: Shearson Lehman, 390 Greenwich, NYC
Time: March 2-6, 1992 (5 days)
Tuition: $1,250; $950 if more than 2 students
Audience: systems and DBA staff.

- CICS will be the preferred access method for all DB2-attached users.
- Database administration personnel must understand the proper design of CICS transactions for use with DB2. Remote access to DB2 may use CICS threads; Sybase accesses the DB2 "server" in this way.
- Note that Platinum will not upgrade their courses for V2.3 until June at the earliest.

COST TO INSTALL DB2 IN PRODUCTION ENVIRONMENT (ESTIMATES)

ES 9121 ENHANCEMENTS (DB2-ASSOCIATED HARDWARE COSTS)

256M memory for ES/9121                                     $537,600
  less 15% educational discount                             $456,960
  co-terminous lease (42 months left on ES/9121
  lease @ 10% interest)                                      $11,969/month

3990-002 control unit w/o cache                              $77,700
  less 26% discount on state contract                        $57,498

Upgrade of 3990-002 to 3990-G03 with 32M cache              $126,200
  less 15% educational discount                             $102,270

One of the following 3390 DASD units is needed:
  3390 A18                                                  $156,750
    less 26% discount on state contract                     $115,954*
  -or- 3390 A28                                             $154,660
  -or- 3390 A28                                             $185,999

SOFTWARE

  DB2                                                         $3,506 monthly
  Online monitor                                                $368 monthly
  Batch monitor                                                 $952 monthly
                                                              $4,826 monthly; $57,912 annual

TOOLS

  OMEGAMON for DB2                                           $23,000
  -or- Candle DB2 products (list price, one-time
  charge; includes OMEGAMON)                                 $90,000

TRAINING

  Installation Training                  2 @ $1,260           $2,520
  DBA Training                           3 @ $1,620           $4,860
  SQL Application Programming for DB2    3 @ $950             $2,850
  CICS/DB2                               3 @ $950             $2,850
                                                             $13,080

GUIDELINES FOR SELECTION OF DATA/DATABASE ADMINISTRATION TOOLS

"REVERSE ENGINEERING" CAPABILITIES:

* Must provide a capability to reverse engineer existing DB2 and VSAM data structures into a conceptual model.

* Must be able to use a conceptual model in order to forward engineer relational and physical DB2 designs.

* Does the product support development of a relational database model based on existing VSAM or non-relational technology (Model 204, FOCUS, etc.)? (While the underlying assumption is that databases will be designed from a "forward engineering" perspective--building a database based on newly defined requirements--there may be some use in having a reverse engineering capability in reserve.)

* Can the product scan COBOL copybooks or copylibs and generate DB2 table definitions?

GRAPHICAL USER INTERFACE FEATURES:

* Does the product support verification of the structure of diagrams?

* Will it support decomposition of a diagram into subordinate views?

* Does it include a palette of tools for drawing diagrams?

* Does the product include object-oriented pop-up menus that are related to the respective modelling objects in a diagram? For example, will double-clicking a relationship object pop up a menu that allows modification of mandatory/optional properties, etc.?

* Do the object-oriented pop-up menus allow entry of a text description? This is critical for populating the repository with meaningful descriptions of database entities.

* Are multiple windows allowed? Can a user view different aspects of the model simultaneously?

RELATIONAL DESIGN TOOLS:

* Can primary and foreign keys be identified separately? Potential primary key identification: checking for a unique clustering index, checking for a unique index that corresponds to a foreign key, and checking for keys that subsume unique keys. Potential foreign key identification: identifying candidates based on exact name matches to primary keys, utilizing non-identical name-matching algorithms to find additional primary key match-ups, and utilizing a number-of-distinct-values test to further qualify potential primary key candidates.

* Does the product define referential integrity constraints? Does it identify:
  * Tables without primary keys
  * Tables that are delete-connected to themselves
  * Tables that are delete-connected through multiple paths
(A short sketch of the kind of constraint in question follows this list.)
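For reference, the sketch below shows the kind of referential integrity constraint such a tool would need to understand; the tables are invented for illustration. The second table is "delete-connected" to the first: deleting a department row affects the rows that reference it through the foreign key.

    -- Hypothetical parent/child tables with a referential constraint.
    CREATE TABLE department
        (dept_code  CHAR(4)   NOT NULL,
         dept_name  CHAR(30)  NOT NULL,
         PRIMARY KEY (dept_code));

    CREATE TABLE employee
        (emp_id     CHAR(9)   NOT NULL,
         dept_code  CHAR(4)   NOT NULL,
         PRIMARY KEY (emp_id),
         FOREIGN KEY (dept_code) REFERENCES department
             ON DELETE RESTRICT);   -- a department with employees cannot be deleted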

REFERENCES

Arnold, Dean, et al. "SQL Access: An Implementation of the ISO Remote Database Access Standard." Computer, December 1991, pp. 74-78.

Balboni, Jeff. "SQL Access: A Cure for the Nonstandard Standard." Data Communications 20(3):89, March 1991.

Date, C.J. A Guide to the SQL Standard. Addison-Wesley, Reading, Mass., 1989.

Newman, Scott, and Jim Gray. "Which Way to Remote SQL?" Database Programming and Design, December 1991, pp. 46-54.

Gartner Group, 1991 Symposium on Information Technology.

Note: There are three: the Data Manager, the Relational Data System, and the IMS Resource Lock Manager.