
1. Explain the following normal forms with a suitable example demonstrating the reduction of a sample table into the said normal forms: A) First Normal Form B) Second Normal Form C) Third Normal Form

Ans: 1NF
A relation R is in first normal form (1NF) if and only if all underlying domains contain atomic values only.

Example (1NF but not 2NF):
FIRST (supplier_no, status, city, part_no, quantity)

Functional Dependencies:
(supplier_no, part_no) → quantity
supplier_no → status
supplier_no → city
city → status (a supplier's status is determined by its location)

Comments: Non-key attributes are not mutually independent (city → status). Non-key attributes are not fully functionally dependent on the primary key (i.e., status and city are dependent on just part of the key, namely supplier_no).

Anomalies:
INSERT: We cannot enter the fact that a given supplier is located in a given city until that supplier supplies at least one part (otherwise, we would have to enter a null value for a column participating in the primary key, which violates the definition of a relation).
DELETE: If we delete the last (only) row for a given supplier, we lose the information that the supplier is located in a particular city.
UPDATE: The city value appears many times for the same supplier. This can lead to inconsistency, or to the need to change many city values if a supplier moves.

Decomposition (into 2NF):
SECOND (supplier_no, status, city)
SUPPLIER_PART (supplier_no, part_no, quantity)

2NF
A relation R is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key.

Example (2NF but not 3NF):
SECOND (supplier_no, status, city)

Functional Dependencies:
supplier_no → status

supplier_no → city
city → status

Comments: The relation lacks mutual independence among its non-key attributes. The mutual dependence is reflected in the transitive dependencies: supplier_no → city, city → status.

Anomalies:
INSERT: We cannot record that a particular city has a particular status until we have a supplier in that city.
DELETE: If we delete a supplier which happens to be the last row for a given city value, we lose the fact that the city has the given status.
UPDATE: The status for a given city occurs many times, leading to multiple updates and possible loss of consistency.

Decomposition (into 3NF):
SUPPLIER_CITY (supplier_no, city)
CITY_STATUS (city, status)

3NF
A relation R is in third normal form (3NF) if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. An attribute C is transitively dependent on attribute A if there exists an attribute B such that A → B and B → C. Note that 3NF is concerned with transitive dependencies which do not involve candidate keys. A 3NF relation with more than one candidate key will clearly have transitive dependencies of the form:
primary_key → other_candidate_key → any_nonkey_column

An alternative (and equivalent) definition for relations with just one candidate key is: A relation R having just one candidate key is in third normal form (3NF) if and only if the non-key attributes of R (if any) are 1) mutually independent, and 2) fully dependent on the primary key of R. A non-key attribute is any column which is not part of the primary key. Two or more attributes are mutually independent if none of the attributes is functionally dependent on any of the others. Attribute Y is fully functionally dependent on attribute X if X → Y, but Y is not functionally dependent on any proper subset of the (possibly composite) attribute X. For relations with just one candidate key, this is equivalent to the simpler: A relation R having just one candidate key is in third normal form (3NF) if and only if no non-key column (or group of columns) determines another non-key column (or group of columns).

Example (3NF but not BCNF):

SUPPLIER_PART (supplier_no, supplier_name, part_no, quantity)

Functional Dependencies: We assume that supplier_names are always unique to each supplier. Thus we have two candidate keys: (supplier_no, part_no) and (supplier_name, part_no). This gives us the following dependencies:
(supplier_no, part_no) → quantity
(supplier_no, part_no) → supplier_name
(supplier_name, part_no) → quantity
(supplier_name, part_no) → supplier_no
supplier_name → supplier_no
supplier_no → supplier_name

Comments: Although supplier_name → supplier_no (and vice versa), supplier_no is not a non-key column; it is part of the primary key! Hence this relation technically satisfies the definition(s) of 3NF (and likewise 2NF, again because supplier_no is not a non-key column).

Anomalies:
INSERT: We cannot record the name of a supplier until that supplier supplies at least one part.
DELETE: If a supplier temporarily stops supplying and we delete the last row for that supplier, we lose the supplier's name.
UPDATE: If a supplier changes name, that change will have to be made to multiple rows (wasting resources and risking loss of consistency).
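The decompositions above can also be written as table definitions. The following is a minimal SQL sketch of the 2NF and 3NF results; the data types are assumed for illustration, only the table and column names come from the example:

CREATE TABLE SUPPLIER_PART (
    supplier_no  INTEGER,
    part_no      INTEGER,
    quantity     INTEGER,
    PRIMARY KEY (supplier_no, part_no)   -- quantity depends on the whole key (2NF)
);

CREATE TABLE SUPPLIER_CITY (
    supplier_no  INTEGER PRIMARY KEY,
    city         VARCHAR(30)             -- status is no longer stored here ...
);

CREATE TABLE CITY_STATUS (
    city         VARCHAR(30) PRIMARY KEY,
    status       INTEGER                 -- ... so the transitive dependency supplier_no → city → status is removed (3NF)
);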

2. Explain the concept of a Query. How does a Query Optimizer work?
Ans: Queries are essentially powerful filters. Queries allow you to decide what fields or expressions are to be shown and what information is to be sought. Queries are usually based on tables but can also be based on an existing query. Queries allow you to seek anything from very basic information through to much more complicated specifications. They also allow you to list information in a particular order, such as listing all the resulting records in surname order, for example. Queries can select records that fit certain criteria. If you had a list of people and had a gender field, you could use a query to select just the males or females in the database. The gender field would have a criterion set to "male", which means that when the query is run, only records with "male" in the gender field would be listed. For each record that

meets the criteria, you could choose to list other fields that may be in the table, such as first name, surname, phone number, date of birth, or whatever you may have in the database. Queries can do much more than just listing out records. It is also possible to list out totals, averages, etc. from the data and do various other calculations. Queries can also be used to do other tasks, such as deleting records, updating records, adding new records, creating new tables, and creating tabulated reports.

The query optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans for a given input query and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan and choose the plan with the smallest cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU requirements, and other factors determined from the data dictionary. The set of query plans examined is formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loop join). The search space can become quite large depending on the complexity of the SQL query. Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer, where optimization occurs. However, some database engines allow guiding the query optimizer with hints.

Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree, in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes; those are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes which produce results by scanning the disk, for example by performing an index scan or a sequential scan.

Join ordering: The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders of magnitude more time to execute than one that joins A and C first. Most query optimizers determine join order via a dynamic programming algorithm pioneered by IBM's System R database project. This algorithm works in the following stages:
1. First, all ways to access each relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an index on a relation that can be used to answer a predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.
2. The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order.
3. Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query.

In this manner, a query plan is eventually produced that joins all the relations in the query. Note that the algorithm keeps track of the sort order of the result set produced by a query plan, also called an interesting order. During dynamic programming, one query plan is considered to beat another query plan that produces the same result only if they produce the same sort order. This is done for two reasons. First, a particular sort order can avoid a redundant sort operation later on in processing the query. Second, a particular sort order can speed up a subsequent join because it clusters the data in a particular way.

Historically, System-R-derived query optimizers would often consider only left-deep query plans, which first join two base tables together, then join the intermediate result with another base table, and so on. This heuristic reduces the number of plans that need to be considered (n! instead of 4^n), but may result in not considering the optimal query plan. This heuristic is drawn from the observation that join algorithms such as nested loops only require a single tuple (aka row) of the outer relation at a time. Therefore, a left-deep query plan means that fewer tuples need to be held in memory at any time: the outer relation's join plan need only be executed until a single tuple is produced, and then the inner base relation can be scanned (this technique is called "pipelining"). Subsequent query optimizers have expanded this plan space to consider "bushy" query plans, where both operands to a join operator could be intermediate results from other joins. Such bushy plans are especially important in parallel computers because they allow different portions of the plan to be evaluated independently.
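Most SQL engines let you inspect the plan, and hence the join order, the optimizer has chosen. The sketch below is illustrative only: the tables a, b, c and their join columns are invented, and the EXPLAIN syntax shown follows PostgreSQL/MySQL conventions (output formats vary by DBMS):

-- Three tables of very different sizes, as in the example above:
--   a: ~10 rows, b: ~10,000 rows, c: ~1,000,000 rows
EXPLAIN
SELECT a.name, c.amount
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.id
WHERE a.region = 'North';
-- The resulting plan tree shows which pair of tables is joined first.
-- A cost-based optimizer will normally avoid joining b and c first,
-- because that intermediate result would be far larger than one built
-- from the small, filtered relation a.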

Q.3. Explain the following with respect to Heuristics of Query Optimizations: A) Equivalence of Expressions B) Selection Operation C) Projection Operation D) Natural Join Operation
Ans. Equivalent expressions
We often want to replace a complicated expression with a simpler one that means the same thing. For example, the expression x + 4 + 2 obviously means the same thing as x + 6, since 4 + 2 = 6. More interestingly, the expression x + x + 4 means the same thing as 2x + 4, because 2x is x + x when you think of multiplication as repeated addition. (Which of these is simpler depends on your point of view, but usually 2x + 4 is more convenient in algebra.) Two algebraic expressions are equivalent if they always lead to the same result when you evaluate them, no matter what values you substitute for the variables. For example, if you substitute x := 3 in x + x + 4, then you get 3 + 3 + 4, which works out to 10; and if you substitute it in 2x + 4, then you get 2(3) + 4, which also works out to 10. There's nothing special about 3 here; the same thing would happen no matter what value we used, so x + x + 4 is equivalent to 2x + 4. (That's really what I meant when I said that they mean the same thing.) When I say that you get the same result, this includes the possibility that the result is undefined. For example, 1/x + 1/x is equivalent to 2/x; even when you substitute x := 0, they both come out the same (in this case, undefined). In contrast, x²/x is not equivalent to x; they usually come out the same, but they are different when x := 0. (Then x²/x is undefined, but x is 0.) To deal with this situation, there is a sort of trick you can play, forcing the second expression to be undefined in certain cases. Just add the words "for x ≠ 0" at the end of the expression to make a new expression; then the new expression is undefined unless x ≠ 0. (You can put any other condition you like in place of x ≠ 0, whatever is appropriate in a given situation.) So x²/x is equivalent to x for x ≠ 0. To symbolise equivalent expressions, people often simply use an equals sign. For example, they might say x + x + 4 = 2x + 4. The idea is that this is a statement that is always true, no matter what x is. However, it isn't really correct to write 1/x + 1/x = 2/x to indicate an equivalence of expressions, because this statement is not correct when x := 0. So instead, I will use the symbol ≡, which you can read as "is equivalent to" (instead of "is equal to" for =). So I'll say, for example,

x + x + 4 ≡ 2x + 4, 1/x + 1/x ≡ 2/x, and x²/x ≡ x for x ≠ 0.

The textbook, however, just uses = for everything, so you can too.

Selection Operation
1. Consider the query to find the assets and branch-names of all banks that have depositors living in Port Chester. In relational algebra, this is
   Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER ⋈ DEPOSIT ⋈ BRANCH))
2. This expression constructs a huge relation, CUSTOMER ⋈ DEPOSIT ⋈ BRANCH, of which we are only interested in a few tuples. We are also only interested in two attributes of this relation.
3. We can see that we only want tuples for which CCITY = "Port Chester". Thus we can rewrite our query as:
   Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT ⋈ BRANCH)
4. This should considerably reduce the size of the intermediate relation.
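The same query and the same rewriting can be seen at the SQL level. The sketch below is only illustrative: the table and column names follow the CUSTOMER/DEPOSIT/BRANCH example, but the exact schema and join columns are assumed rather than given in the text, and a heuristic optimizer performs the rewriting automatically:

SELECT DISTINCT b.bname, b.assets
FROM customer c
JOIN deposit  d ON d.cname = c.cname   -- assumed join column
JOIN branch   b ON b.bname = d.bname   -- assumed join column
WHERE c.ccity = 'Port Chester';
-- The optimizer pushes the WHERE condition down to the CUSTOMER scan
-- before performing the joins, exactly as in the rewritten relational
-- algebra expression above.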

Projection Operation
1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:
   Π BNAME, ASSETS ((σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT) ⋈ BRANCH)
2. When we compute the subexpression
   σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT
   we obtain a relation whose scheme is (CNAME, CCITY, BNAME, ACCOUNT#, BALANCE).
3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that appear in the result of the query or are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.
5. In our example, the only attribute we need is BNAME (to join with BRANCH). So we can rewrite our expression as:
   Π BNAME, ASSETS ((Π BNAME (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT)) ⋈ BRANCH)

6. Note that there is no advantage in doing an early project on a relation before it is needed for some other operation:
   - We would access every block of the relation to remove attributes.
   - Then we access every block of the reduced-size relation when it is actually needed.
   - We do more work in total, rather than less!

Natural Join Operation
1. Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations.
2. Natural join is associative:
   (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
3. Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression
   Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT ⋈ BRANCH)
   We see that we can compute DEPOSIT ⋈ BRANCH first and then join the result with the first part. However, DEPOSIT ⋈ BRANCH is likely to be a large relation, as it contains one tuple for every account. The other part, σ CCITY = "Port Chester" (CUSTOMER), is probably a small relation (comparatively). So, if we compute
   σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT
   first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than DEPOSIT ⋈ BRANCH.
4. Natural join is commutative:
   r1 ⋈ r2 = r2 ⋈ r1
5. Thus we could rewrite our relational algebra expression as:
   Π BNAME, ASSETS ((σ CCITY = "Port Chester" (CUSTOMER) ⋈ BRANCH) ⋈ DEPOSIT)
   But there are no common attributes between CUSTOMER and BRANCH, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform it into the more efficient expression we derived earlier (join with DEPOSIT first, then with BRANCH).

Q 4. There are a number of historical, organizational, and technological reasons that explain the lack of an all-encompassing data management system. Discuss a few of them with appropriate examples.
Ans. Most current data management systems (DMS) have been built on the assumption that the data collection, or database, to be administered consists of a single media type: structured tables of "fact" data, or unstructured strings of bits representing such media objects as text documents, images, or video. The result is that most DMSs store and index a specific type of media data and provide a query (data access) language that is specialized for efficient access to and retrieval of this data type. A further assumption that has frequently been made is that the information requirements of the system users are known and can be used for structuring the data collection and tuning the data management system. It has also been assumed that the users would only infrequently require information/data from some other type of data management system.

These assumptions have been criticized since the early 1980s by researchers who have pointed out that, almost from the point of creation, a database would not (nor could) contain all of the data required by the user community (Gligor & Luckenbaugh, 1984; Landers & Rosenberg, 1982; Litwin et al., 1982; among many others). A number of historical, organizational, and technological reasons explain the lack of an all-encompassing data management system. Among these are:

- The sensible advice to build small systems, with the plan to extend their scope in later implementation phases, allows a core system to be implemented relatively quickly, but has led to a proliferation of relatively small systems.
- Department autonomy has led to construction of department-specific rather than organization-wide systems, again leading to many small, overlapping, and often incompatible systems within an organization.
- The continual evolution of the organization and its interactions, both within and to its external environment, prohibits complete understanding of future information requirements.
- Parallel development of data management systems for particular applications has led to different and incompatible systems for management of tabular/administrative data, text/document data, historical/statistical data, spatial/geographic data, and streamed/audio and visual data.

The result is that only a portion of an organization's data is administered by any one data management system and most organizations have a multitude of special purpose databases, managed by different, and often incompatible, data management system types. The growing need to retrieve data from multiple databases within an organization, as well as the rapid dissemination of data through the Internet, has given rise to the requirement of providing integrated access to both internal and external data of multiple types. A major challenge and critical practical and research problem for the information, computer, and communication technology communities is to develop data management systems that can provide efficient access to the data stored in multiple private and public databases (Brodie, 1993; Hurson & Bright, 1996; Nordbotten, 1988a, 1988b and Nordbotten, 1994a).

Problems to be resolved include:


1. Interoperability among systems (Fox & Sornil, 1999; Litwin & Abdellatif, 1986),
2. Incorporation of legacy systems (Brodie, 1993), and
3. Integration of management techniques for structured and unstructured data (Stonebraker & Brown, 1999).

Each of the above problems entails an integration of concepts, methods, techniques, and tools from separate research and development communities that have existed in parallel but independently and have had rather minimal interaction. One consequence of this is that overlapping and conflicting terminology exists between these communities. In the previous chapter, a database was defined as a COLLECTION OF RELATED DATA REPRESENTING SOME LOGICALLY COHERENT ASPECT OF THE REAL WORLD. With this definition, NO limitations are given as to the type of:

- Data in the collection,
- Model used to structure the collection, or
- Architecture and geographic location of the database.

The focus of this text is on on-line (electronic and web-accessible) databases containing multiple media data, thus restricting our interest/focus to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Examples of such databases include the image collections of the Hermitage Museum, the catalog and full-text materials of the ACM digital library, and the customer records for the 7 sites of Amazon.com. Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status, of the owner organization. They are also a source of information for the organization's employees and customers/users. However, databases cannot be used effectively unless there exist efficient and secure data management systems (DMS) for the data in the databases.

Q5. Describe the Structural Semantic Data Model (SSM) with relevant examples.
Ans: Modelling Complex and Multimedia Data
Data modelling addresses a need in information system analysis and design to develop a model of the information requirements as well as a set of viable database structure proposals. The data modelling process consists of:
1. Identifying and describing the information requirements for an information system,
2. Specifying the data to be maintained by the data management system, and
3. Specifying the data structures to be used for data storage that best support the information requirements.

A fundamental tool used in this process is the data model, which is used both for specification of the information requirements at the user level and for specification of the data structure for the database. During implementation of a database, the data model guides construction of the schema or data catalog, which contains the metadata that describe the DB structure and the data semantics that are used to support database implementation and data retrieval. Data modelling, using a specific data model type, and as a unique activity during information system design, is commonly attributed to Charles Bachman (1969), who presented the Data Structure Diagram as one of the first widely used data models, for network database design. Several alternative data model types were proposed shortly thereafter, the best known of which are the:

- Relational model (Codd, 1970) and the
- Entity-relationship (ER) model (Chen, 1976).

The relational model was quickly criticized for being 'flat' in the sense that all information is represented as a set of tables with atomic cell values. The definition of well-formed relational models requires that complex attribute types (hierarchic, composite, multi-valued, and derived) be converted to atomic attributes and that relations be normalized. Inter-entity (inter-relation) relationships are difficult to visualize in the resulting set of relations, making control of the completeness and correctness of the model difficult. The relational model maps easily to the physical characteristics of electronic storage media, and as such, is a good tool for design of the physical database.

The entity-relationship approach to modelling, proposed by Chen (1976), had two primary objectives: first, to visualize inter-entity relationships, and second, to separate the DB design process into two phases:
1. Record, in an ER model, the entities and inter-entity relationships required "by the enterprise", i.e. by the owner/user of the information system or application. This phase and its resulting model should be independent of the DBMS tool that is to be used for realizing the DB.
2. Translate the ER model to the data model supported by the DBMS to be used for implementation.

This two-phase design supports modification at the physical level without requiring changes to the enterprise or user view of the DB content. Chen's ER model also quickly came under criticism, particularly for its lack of ability to model classification structures. In 1977, Smith & Smith presented a method for modelling generalization and aggregation hierarchies that underlies the many extended/enhanced entity-relationship (EER) model types proposed and in use today.
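As a small illustration of phase 2, an ER fragment can be translated into relational tables. The entities and attributes below (a Supplier entity, a City entity, and a "located in" relationship) are invented for illustration and are not taken from the text or from SSM itself:

-- Entity types become tables; the 1:N "located in" relationship is
-- represented by a foreign key on the N-side.
CREATE TABLE city (
    city_name  VARCHAR(30) PRIMARY KEY,
    status     INTEGER
);
CREATE TABLE supplier (
    supplier_no   INTEGER PRIMARY KEY,
    supplier_name VARCHAR(40),
    city_name     VARCHAR(30) REFERENCES city(city_name)   -- "located in"
);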

6. What are the differences between Global and Local Transactions in a distributed database system? What are the roles of the Transaction Manager and the Transaction Coordinator in managing transactions in a distributed database?
Ans: A distributed database system consists of a collection of sites, each of which maintains a local database system. Each site is able to process local transactions, those transactions that access data only in that single site. In addition, a site may participate in the execution of global transactions, those transactions that access data in several sites. The execution of global transactions requires communication among the sites.

The sites in the system can be connected physically in a variety of ways. The various topologies are represented as graphs whose nodes correspond to sites. An edge from node A to node B corresponds to a direct connection between the two sites. Some of the most common configurations are depicted in Figure 1. The major differences among these configurations involve:
- Installation cost: the cost of physically linking the sites in the system.
- Communication cost: the cost in time and money to send a message from site A to site B.
- Reliability: the frequency with which a link or site fails.
- Availability: the degree to which data can be accessed despite the failure of some links or sites.
As we shall see, these differences play an important role in choosing the appropriate mechanism for handling the distribution of data.

The sites of a distributed database system may be distributed physically either over a large geographical area (such as all the Indian states) or over a small geographical area (such as a single building or a number of adjacent buildings). The former type of network is referred to as a long-haul network, while the latter is referred to as a local-area network. Since the sites in long-haul networks are distributed physically over a large geographical area, the communication links are likely to be relatively slow and less reliable as compared with local-area networks. Typical long-haul links are telephone lines, microwave links, and satellite channels. In contrast, since all the sites in local-area networks are close to each other, communication links are of higher speed and lower error rate than their counterparts in long-haul networks. The most common links are twisted pair, baseband coaxial, broadband coaxial, and fiber optics.

Let us illustrate these concepts by considering a banking system consisting of four branches located in four different cities. Each branch has its own computer with a database consisting of all the accounts maintained at that branch. Each such installation is thus a site. There also exists one single site which maintains information about all the branches of the bank. Suppose that the database systems at the various sites are based on the relational model. Thus, each branch maintains (among others) the relation deposit (Deposit-scheme), where Deposit-scheme = (branch-name, account-number,

customer-name, balance). The single site containing information about the four branches maintains the relation branch (Branch-scheme), where Branch-scheme = (branch-name, assets, branch-city). There are other relations maintained at the various sites which are ignored for the purpose of our example.

A local transaction is a transaction that accesses accounts only at the single site at which the transaction was initiated. A global transaction, on the other hand, is one which either accesses accounts at a site different from the one at which the transaction was initiated, or accesses accounts at several different sites. To illustrate the difference between these two types of transactions, consider the transaction to add $50 to account number 177, located at the Delhi branch. If the transaction was initiated at the Delhi branch, then it is considered local; otherwise, it is considered global. A transaction to transfer $50 from account 177 to account 305, which is located at the Bombay branch, is a global transaction, since accounts at two different sites are accessed as a result of its execution. A sketch of such a transfer written in SQL is shown below.

What makes the above configuration a distributed database system are the facts that:
- The various sites are aware of each other.
- Each site provides an environment for executing both local and global transactions.

There are several reasons for building distributed database systems, including sharing of data, reliability and availability, and speedup of query processing. However, along with these advantages come several disadvantages, including software development cost, greater potential for bugs, and increased processing overhead. The primary disadvantage of distributed database systems is the added complexity required to ensure proper coordination among the sites. There are several issues involved in storing a relation in the distributed database, including replication and fragmentation. It is essential that the system minimise the degree to which a user needs to be aware of how a relation is stored.
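A minimal SQL sketch of the global transfer described above. The column names follow Deposit-scheme with hyphens written as underscores; how the distributed DBMS routes each update to the correct site is assumed here, since that routing is handled by the system, not by the application:

-- Global transaction: move $50 from account 177 (Delhi) to account 305 (Bombay).
-- Both updates must commit together at their respective sites, or neither must.
BEGIN;
UPDATE deposit
   SET balance = balance - 50
 WHERE account_number = 177;   -- row stored at the Delhi site
UPDATE deposit
   SET balance = balance + 50
 WHERE account_number = 305;   -- row stored at the Bombay site
COMMIT;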

The storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system, and it is responsible for the interaction with the file manager. The raw data are stored on the disk using the file system, which is usually provided by a conventional operating system. The storage manager translates the various DML statements into low-level file-system commands. Thus, the storage manager is responsible for storing, retrieving, and updating data in the database. The storage manager components include:

- Authorization and integrity manager, which tests for the satisfaction of integrity constraints and checks the authority of users to access data.
- Transaction manager, which ensures that the database remains in a consistent (correct) state despite system failures, and that concurrent transaction executions proceed without conflicting.
- File manager, which manages the allocation of space on disk storage and the data structures used to represent information stored on disk.
- Buffer manager, which is responsible for fetching data from disk storage into main memory, and deciding what data to cache in main memory. The buffer manager is a critical part of the database system, since it enables the database to handle data sizes that are much larger than the size of main memory.

The storage manager implements several data structures as part of the physical system implementation:

1. Data files, which store the database itself.
2. Data dictionary, which stores metadata about the structure of the database, in particular the schema of the database.
3. Indices, which provide fast access to data items that hold particular values.
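For example, an index can be declared on a column that is frequently used in selections. The table and column names below are assumed for illustration, following the banking example:

-- An index on account_number lets the storage manager locate a single
-- deposit row without scanning the whole data file.
CREATE INDEX deposit_account_idx ON deposit (account_number);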
