
MCA Sem. IV: Advanced Database Systems

Assignment Set - 1

Q. 1. Describe the following:

A. Dimensional Model

The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in a way that allows data to be easily summarized using OLAP queries. In the dimensional model, a database consists of a single large table of facts that are described using dimensions and measures. A dimension provides the context of a fact (such as who participated, when and where it happened, and its type) and is used in queries to group related facts together. Dimensions tend to be discrete and are often hierarchical; for example, the location might include the building, state, and country. A measure is a quantity describing the fact, such as revenue. It is important that measures can be meaningfully aggregated; for example, the revenue from different locations can be added together. In an OLAP query, dimensions are chosen and the facts are grouped and added together to create a summary. The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions. Particularly complicated dimensions might be represented using multiple tables, resulting in a snowflake schema.

A data warehouse can contain multiple star schemas that share dimension tables, allowing them to be used together. Coming up with a standard set of dimensions is an important part of dimensional modeling.

B. Object Database Model

In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example, as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases. A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities. Object databases suffered because of a lack of standardization: although standards were defined by ODMG, they were never implemented well enough to ensure interoperability between products. Nevertheless, object databases have been used successfully in many applications: usually specialized applications such as engineering databases or molecular biology databases rather than mainstream commercial data processing. However, object database ideas were picked up by the relational vendors and influenced extensions made to these products and indeed to the SQL language.

C. Post-Relational Database Model

Several products have been identified as post-relational because their data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model. These models might be characterized as a directed graph with trees on the nodes. Post-relational databases could be considered a subset of object databases, as there is no need for object-relational mapping when using a post-relational data model. In spite of many attacks on this class of data models, with designations of being hierarchical or legacy, the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays below the relational database radar. Examples of models that could be classified as post-relational are PICK, a.k.a. MultiValue, and MUMPS, a.k.a. M.

Q. 2. Explain the concept of a query. How does a query optimizer work?

Ans. The aim of query processing is to find information in one or more databases and deliver it to the user quickly and efficiently. Traditional techniques work well for databases with standard, single-site relational structures, but databases containing more complex and diverse types of data demand new query processing and optimization techniques. Most real-world data is not well structured. Today's databases typically contain much non-structured data such as text, images, video, and audio, often distributed across computer networks. In this complex milieu (typified by the World Wide Web), efficient and accurate query processing becomes quite challenging. Principles of Database Query Processing for Advanced Applications teaches the basic concepts and techniques of query processing and optimization for a variety of data forms and database systems, whether structured or unstructured.

Query Optimizer

The query optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans (discussed below) for a given input query, and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan, and choose the plan with the least cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU requirements, and other factors.

Query Plan

A query plan (or query execution plan) is a set of steps used to access information in a SQL relational database management system. This is a specific case of the relational model concept of access plans. Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best alternative. Because query optimizers are imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance. The set of query plans examined is formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loops). The search space can become quite large depending on the complexity of the SQL query. The query optimizer cannot be accessed directly by users. Instead, once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer, where optimization occurs.

Implementation

Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree, in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes; these are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes which produce results by scanning the disk, for example by performing an index scan or a sequential scan. A small sketch of such a plan tree appears below.
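The following minimal Python sketch illustrates such a plan-node tree. The class, the operation names, and the string-based "execution" are illustrative assumptions for exposition only, not any particular DBMS's internals:

    class PlanNode:
        def __init__(self, operation, children=()):
            self.operation = operation      # e.g. "hash join", "index scan"
            self.children = list(children)  # children feed output to parent

        def execute(self):
            # Intermediate results flow bottom-up: evaluate children first.
            if not self.children:
                return self.operation       # leaves scan the disk
            inputs = [child.execute() for child in self.children]
            return f"{self.operation}({', '.join(inputs)})"

    # A join node has two children (its operands); a sort node has one.
    plan = PlanNode("sort",
                    [PlanNode("hash join",
                              [PlanNode("index scan: customer"),
                               PlanNode("sequential scan: deposit")])])
    print(plan.execute())
    # -> sort(hash join(index scan: customer, sequential scan: deposit))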
Q. 3. Explain the following with respect to Heuristics of Query Optimization:

A. Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We'll use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

B. Selection Operation

1. Consider the query to find the assets and branch names of all banks that have depositors living in Port Chester. In relational algebra, this is

   Π bname, assets (σ ccity = "Port Chester" (customer ⋈ deposit ⋈ branch))

- This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples.
- We also are only interested in two attributes of this relation.
- We can see that we only want tuples for which ccity = "Port Chester".
- Thus we can rewrite our query as:

   Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

- This should considerably reduce the size of the intermediate relation.

2. Suggested Rule for Optimization:

- Perform selection operations as early as possible.
- If our original query was restricted further to customers with a balance over $1000, the selection cannot be applied directly to the customer relation above.
- The new relational algebra query is

   Π bname, assets (σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit ⋈ branch))

- The selection cannot be applied to customer, as balance is an attribute of deposit. We can still rewrite the query as

   Π bname, assets ((σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit)) ⋈ branch)

- If we look further at the subquery (the inner join expression above), we can split the selection predicate in two:

   σ ccity = "Port Chester" (σ balance > 1000 (customer ⋈ deposit))

- This rewriting gives us a chance to use our "perform selections early" rule again.
- We can now rewrite our subquery as:

   σ ccity = "Port Chester" (customer) ⋈ σ balance > 1000 (deposit)

3. Second Transformational Rule:

- Replace expressions of the form σ P1 ∧ P2 (e) by σ P1 (σ P2 (e)), where P1 and P2 are predicates and e is a relational algebra expression.
- Generally, σ P1 (σ P2 (e)) = σ P2 (σ P1 (e)) = σ P1 ∧ P2 (e)

C. Projection Operation

1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

   Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

   (σ ccity = "Port Chester" (customer)) ⋈ deposit

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).

3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
- appear in the result of the query, or
- are needed to process subsequent operations.

4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

   Π bname, assets ((Π bname ((σ ccity = "Port Chester" (customer)) ⋈ deposit)) ⋈ branch)

6. Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation:
- We would access every block of the relation to remove attributes.
- Then we access every block of the reduced-size relation when it is actually needed.
- We do more work in total, rather than less!

D. Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

   (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression

   Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

We see that we could compute deposit ⋈ branch first and then join the result with the first part. However, deposit ⋈ branch is likely to be a large relation, as it contains one tuple for every account. The other part, σ ccity = "Port Chester" (customer), is probably a comparatively small relation. So, if we instead compute

   (σ ccity = "Port Chester" (customer)) ⋈ deposit

first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than deposit ⋈ branch. Natural join is commutative:

   r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

   Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform this into the more efficient expression we have derived earlier (join with deposit first, then with branch).
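The "perform selections early" heuristic at the heart of this section can be illustrated with a minimal, runnable Python sketch, using small lists as stand-ins for the relations (the tuple contents are invented for the example):

    customer = [("Smith", "Main St", "Port Chester"),
                ("Jones", "North St", "Rye")]
    deposit = [("Downtown", 101, "Smith", 500),
               ("Uptown", 102, "Jones", 1500)]

    def join_customer_deposit(cust, dep):
        # natural join on cname (field 0 of customer, field 2 of deposit)
        return [c + d for c in cust for d in dep if c[0] == d[2]]

    # Join first, select later: the intermediate join is as large as possible.
    late = [t for t in join_customer_deposit(customer, deposit)
            if t[2] == "Port Chester"]

    # Select first, join later: the join input is already small.
    early = join_customer_deposit(
        [c for c in customer if c[2] == "Port Chester"], deposit)

    assert late == early  # the two expressions are equivalent

Both expressions return the same tuples, but the second never materializes the large intermediate join, which is exactly the point of the heuristic.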

Q. 4. There are a number of historical, organizational, and technological reasons that explain the lack of an all-encompassing data management system. Discuss a few of them with appropriate examples.

Ans. Models of Failures

Failures can be classified as:

1) Transaction Failures:
a) Error in a transaction due to incorrect data input.
b) Present or potential deadlock.
c) Abort of transactions due to non-availability of resources or deadlock.

2) Site Failures: From a recovery point of view, a failure has to be judged from the viewpoint of loss of memory. So failures can be classified as:
a) Failure with Loss of Volatile Storage: In these failures, the content of main memory is lost; however, all the information recorded on disks is not affected by the failure. Typical failures of this kind are system crashes.

b) Media Failures (Failures with Loss of Nonvolatile Storage): In these failures the content of disk storage is lost. Failures of this type can be reduced by replicating the information on several disks having independent failure modes. Stable storage is the most resilient storage medium available in the system. It is implemented by replicating the same information on several disks with (i) independent failure modes, and (ii) the so-called careful replacement strategy: at every update operation, first one copy of the information is updated, then the correctness of the update is verified, and finally the second copy is updated (a sketch of this strategy is given after this failure classification).

3) Communication Failures: There are two basic types of possible communication errors: lost messages and partitions. When a site X does not receive an acknowledgment of a message from a site Y within a predefined time interval, X is uncertain about the following things:
i) Did a failure occur at all, or is the system simply slow?
ii) If a failure occurred, was it a communication failure, or a crash of site Y?
iii) Has the message been delivered at Y or not? (The communication failure or the crash can happen before or after the delivery of the message.)

Thus all failures can be regrouped as:
i) Failure of a site.
ii) Loss of message(s), with or without site failures but no partitions.
iii) Network partition: Dealing with network partitions is a harder problem than dealing with site crashes or lost messages.
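The careful replacement strategy mentioned under media failures above can be sketched in a few lines of Python. This is a simplified illustration, assuming two local files stand in for disks with independent failure modes; the file names and the read-back verification are assumptions of the sketch:

    import os

    COPIES = ["stable_copy_a.dat", "stable_copy_b.dat"]

    def careful_write(data: bytes) -> None:
        # 1. Update the first copy only.
        with open(COPIES[0], "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # 2. Verify the first update before touching the second copy;
        #    a failure here leaves the second copy intact and usable.
        with open(COPIES[0], "rb") as f:
            if f.read() != data:
                raise IOError("first copy failed verification")
        # 3. Only now update the second copy.
        with open(COPIES[1], "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())

Because the two copies are never updated at the same instant, at least one consistent copy survives a crash at any point in the sequence.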

Q. 5. Describe the Structural Semantic Data Model (SSM) with relevant examples.

Ans. The Structural Semantic Model, SSM, first described in Nordbotten (1993a & b), is an extension and graphic simplification of the EER modeling tool first presented in the 1989 edition of Elmasri & Navathe (2003). SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects.

SSM Concepts

The current version of SSM belongs to the class of semantic data model types extended with concepts for the specification of user-defined data types and functions, UDTs and UDFs. It supports the modeling concepts defined and compared below. The following diagram (not reproduced here) shows the concepts and graphic syntax of SSM, which include:

Data Modeling Concepts

1. Three types of entity specifications: base (root), subclass, and weak;
2. Four types of inter-entity relationships: n-ary associative, and 3 types of classification hierarchies;
3. Four attribute types: atomic, multi-valued, composite, and derived;
4. Domain type specifications in the graphic model, including standard data types, binary large objects (blob, text, image, ...), and user-defined types (UDT) and functions (UDF);
5. Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types; and
6. Data value constraints.

Q. 6. Describe the following with respect to fuzzy querying to relational databases:

A. Proposed Model

The easiest way of introducing fuzziness in the database model is to use classical relational databases and formulate a front end to them that allows fuzzy querying of the database. A limitation imposed on the system is that, because we are neither extending the database model nor defining a new model in any way, the underlying database model is crisp, and hence the fuzziness can only be incorporated in the query. To incorporate fuzziness we introduce fuzzy sets / linguistic terms on the attribute domains / linguistic variables; e.g., on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD, each defined by a membership function over the Age axis. [Figure: membership functions over Age, not reproduced here.]

For this we take the example of a student database which has a table STUDENTS. [The attribute list of STUDENTS is not reproduced here.]

[Table: a snapshot of the data existing in the database, not reproduced here.]

B. Meta Knowledge

At the level of meta knowledge we need to add only a single table, LABELS, with the following structure:

   LABELS(Label, Column_Name, Alpha, Beta, Gamma, Delta)

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

Label: This is the primary key of this table and stores the linguistic term associated with the fuzzy set.
Column_Name: Stores the linguistic variable associated with the given linguistic term.
Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set, i.e. the four points that define its (typically trapezoidal) membership function; a sketch follows.
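A minimal Python sketch of such a four-point membership function follows. The interpretation of Alpha..Delta as a trapezoid and the sample numbers for YOUNG are assumptions for illustration:

    def membership(x, alpha, beta, gamma, delta):
        """Degree in [0, 1] to which x belongs to the fuzzy set."""
        if beta <= x <= gamma:                  # plateau: full membership
            return 1.0
        if x <= alpha or x >= delta:            # outside the support
            return 0.0
        if x < beta:                            # rising edge
            return (x - alpha) / (beta - alpha)
        return (delta - x) / (delta - gamma)    # falling edge

    # e.g. a hypothetical YOUNG set on the AGE domain:
    print(membership(20, 0, 0, 25, 35))  # 1.0 -> fully YOUNG
    print(membership(30, 0, 0, 25, 35))  # 0.5 -> partly YOUNG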

C. Implementation

The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query will not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts:

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.

2. Result Attributes: The attributes that are to be displayed; used only in the case of the SELECT query.
3. Source Tables: The tables on which the query is to be applied.
4. Conditions: The conditions that have to be satisfied before the operation is performed. Each condition is further sub-divided into the query attribute (i.e. the attribute on which the condition applies) and the linguistic term. If a condition is not fuzzy, i.e. it does not contain a linguistic term, then it need not be subdivided. A sketch of how a fuzzy condition can be translated into a crisp one is given below.
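As an illustration, a fuzzy condition such as AGE = YOUNG can be translated into a crisp range predicate using the LABELS entries. The sketch below assumes a trapezoidal membership function and a 0.5 threshold (an alpha-cut); the LABELS values and the query text are invented for the example:

    labels = {("AGE", "YOUNG"): (0, 0, 25, 35)}  # Alpha, Beta, Gamma, Delta

    def to_crisp(column, term, cut=0.5):
        alpha, beta, gamma, delta = labels[(column, term)]
        # Solve membership(x) >= cut on both edges of the trapezoid.
        low = alpha + cut * (beta - alpha)
        high = delta - cut * (delta - gamma)
        return f"{column} BETWEEN {low} AND {high}"

    # Fuzzy:  SELECT * FROM STUDENTS WHERE AGE = YOUNG
    # Crisp:
    print(f"SELECT * FROM STUDENTS WHERE {to_crisp('AGE', 'YOUNG')}")
    # -> SELECT * FROM STUDENTS WHERE AGE BETWEEN 0.0 AND 30.0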

Master of Computer Application (MCA) Semester 4
MC0077 Advanced Database Systems 4 Credits (Book ID: B0882)
Assignment Set 2 (60 Marks)

1. How are costs computed for the execution of a query? Discuss the method of measuring index selectivity.

Ans 1: Heuristics of Query Optimization

Equivalence of Expressions

The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We'll use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)
Branch-scheme = (bname, assets, bcity)

Selection Operation

1. Consider the query to find the assets and branch names of all banks that have depositors living in Port Chester. In relational algebra, this is

   Π bname, assets (σ ccity = "Port Chester" (customer ⋈ deposit ⋈ branch))

o This expression constructs a huge relation, customer ⋈ deposit ⋈ branch, of which we are only interested in a few tuples.
o We also are only interested in two attributes of this relation.
o We can see that we only want tuples for which ccity = "Port Chester".
o Thus we can rewrite our query as:

   Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

o This should considerably reduce the size of the intermediate relation.

2. Suggested Rule for Optimization:
o Perform selection operations as early as possible.

o If our original query was restricted further to customers with a balance over $1000, the selection cannot be applied directly to the customer relation above.
o The new relational algebra query is

   Π bname, assets (σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit ⋈ branch))

o The selection cannot be applied to customer, as balance is an attribute of deposit. We can still rewrite as

   Π bname, assets ((σ ccity = "Port Chester" ∧ balance > 1000 (customer ⋈ deposit)) ⋈ branch)

o If we look further at the subquery (the inner join expression above), we can split the selection predicate in two:

   σ ccity = "Port Chester" (σ balance > 1000 (customer ⋈ deposit))

o This rewriting gives us a chance to use our "perform selections early" rule again.
o We can now rewrite our subquery as:

   σ ccity = "Port Chester" (customer) ⋈ σ balance > 1000 (deposit)

3. Second Transformational Rule:
o Replace expressions of the form σ P1 ∧ P2 (e) by σ P1 (σ P2 (e)), where P1 and P2 are predicates and e is a relational algebra expression.
o Generally, σ P1 (σ P2 (e)) = σ P2 (σ P1 (e)) = σ P1 ∧ P2 (e)

Projection Operation

1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

   Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ deposit) ⋈ branch)

2. When we compute the subexpression

   (σ ccity = "Port Chester" (customer)) ⋈ deposit

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).

3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
o appear in the result of the query, or
o are needed to process subsequent operations.

4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

   Π bname, assets ((Π bname ((σ ccity = "Port Chester" (customer)) ⋈ deposit)) ⋈ branch)

6. Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation:
o We would access every block of the relation to remove attributes.

o Then we access every block of the reduced-size relation when it is actually needed.
o We do more work in total, rather than less!

Natural Join Operation

Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

   (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression

   Π bname, assets ((σ ccity = "Port Chester" (customer)) ⋈ deposit ⋈ branch)

We see that we could compute deposit ⋈ branch first and then join the result with the first part. However, deposit ⋈ branch is likely to be a large relation, as it contains one tuple for every account. The other part, σ ccity = "Port Chester" (customer), is probably a comparatively small relation. So, if we compute

   (σ ccity = "Port Chester" (customer)) ⋈ deposit

first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than deposit ⋈ branch. Natural join is commutative:

   r1 ⋈ r2 = r2 ⋈ r1

Thus we could rewrite our relational algebra expression as:

   Π bname, assets (((σ ccity = "Port Chester" (customer)) ⋈ branch) ⋈ deposit)

But there are no common attributes between customer and branch, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform it into the more efficient expression we derived earlier (join with deposit first, then with branch).

2. Describe the following with respect to SQL3 DB specification:
A) Complex Structures
B) Hierarchical Structures
C) Relationships
D) Large Objects, LOBs
E) Storage of LOBs

Ans 2: (A) Complex Structures

1. Create row type Address_t defines the address structure that is used in line 8.

2. Street#, Street, ... are regular SQL2 specifications for atomic attributes.
3. PostCode and Geo-Loc are both defined as having user-defined data types, Pcode and Point respectively. Pcode is typically locally defined as a list or table of valid postal codes, perhaps with the post office name.
4. Create function Age_f defines a function for the calculation of an age, as a decimal value, given a start date as the input argument and using a simple algorithm based on the current date. This function is used as the data type in line 9 and will be activated each time the Person.age attribute is retrieved. The function can also be used in a condition clause of a SELECT statement.
5. Create table PERSON initiates the specification of the implementation structure for the Person entity type.
6. Id is defined as the primary key. The not null phrase only controls that some (not null) value is given. The primary key phrase indicates that the DBM is to guarantee that the set of values for Id is unique.
7. Name has a data type, PersName, defined as a row type similar to the one defined in lines 1-3. BirthDate is a date that can be used as the argument for the function Age_f defined in line 4.
8. Address is defined using the row type Address_t, defined in lines 1-3. Picture is defined as a BLOB, or Binary Large Object. Note that there are no functions for content search, manipulation or presentation that support BLOB data types. These must be defined either by the user as user-defined functions, UDFs, or by the ORDBMS vendor in a supplementary subsystem. In this case, we need functions for image processing.
9. Age is defined as a function, which will be activated each time the attribute is retrieved. This costs processing time (though this algorithm is very simple), but gives a correct value each time the attribute is used.
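The numbered commentary above refers to a DDL figure that is not reproduced in this text. A hypothetical reconstruction, following the commentary, is kept below as a string in a Python snippet for reference; the names, types and exact SQL3 syntax are assumptions, and vendor dialects differ:

    ddl_person = """
    CREATE ROW TYPE Address_t (              -- lines 1-3
        Street#   INTEGER,
        Street    VARCHAR(30),
        PostCode  Pcode,                     -- user-defined type
        Geo_Loc   Point                      -- user-defined type
    );
    CREATE FUNCTION Age_f (start DATE) RETURNS DECIMAL ...;  -- line 4
    CREATE TABLE PERSON (                    -- line 5
        Id        INTEGER NOT NULL PRIMARY KEY,  -- line 6
        Name      PersName,                  -- line 7
        BirthDate DATE,
        Address   Address_t,                 -- line 8
        Picture   BLOB,
        Age       Age_f(BirthDate)           -- line 9
    );
    """
    print(ddl_person)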

(B) Hierarchical Structures

1. Create table STUDENT initiates the specification of the implementation of a subclass entity type.
2. GPA, Level, ... are the attributes for the subclass, here with simple SQL2 data types.
3. under PERSON specifies the table as a subclass of the table PERSON. The DBM thus knows that when the STUDENT table is requested, all attributes and functions in PERSON are also relevant. An OR-DBMS will store and use the primary key of PERSON as the key for STUDENT, and execute a join operation to retrieve the full set of attributes.
4. Create table COURSE specifies a new table, as done for the statements in lines 5 and 10 above.
5. Id, Name, and Level are standard atomic attribute types with SQL2 data types. Id is defined as requiring a unique, non-null value, as specified for PERSON in line 6 above.
6. Note that attributes must have unique names within their tables, but a name may be reused, with different data domains, in different tables. Both Id and Name are such attribute names, appearing in both PERSON and COURSE, as is Level, used in STUDENT and COURSE.
7. Course.Description is defined as a character large object, CLOB. A CLOB data type has the same defined character-string functions as char, varchar, and long char, and can be compared to these. User_id is defined as Ucode, which is the name of a user-defined data type, presumably a list of acceptable user codes. The DB implementer must define both the data type and the appropriate functions for processing this type.
8. User_Id is also specified as a foreign key which links the Course records to their "user" record, modeled as a category sub-entity type, through the primary key in the User table.

(C) Relationships

The relationship TakenBy is defined in Figure b (not reproduced here). This definition needs only SQL2 specifications. Note that:
- {Sid, Cid, Term} form the primary key, PK. Since the key is composite, a separate primary key clause is required (as compared with the single-attribute PK specifications for PERSON.Id and COURSE.Id).
- The 2 foreign key attributes in the PK must be defined separately.
- TakenBy.Report is a foreign key to a report entity type, forming a ternary relationship as modeled in Figure a.
- The ON DELETE trigger is activated if the Report relation is deleted and assures that the FK link has a valid value, in this case 'null'.
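A sketch of a TakenBy definition matching this description follows, again as a hypothetical reconstruction held in a Python string (the referenced figures are not reproduced here, so column types are assumptions):

    ddl_takenby = """
    CREATE TABLE TakenBy (
        Sid    INTEGER REFERENCES STUDENT(Id),
        Cid    INTEGER REFERENCES COURSE(Id),
        Term   VARCHAR(10),
        Report INTEGER REFERENCES REPORT(Id) ON DELETE SET NULL,
        PRIMARY KEY (Sid, Cid, Term)   -- composite PK needs its own clause
    );
    """
    print(ddl_takenby)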

(D) Large Objects, LOBs

The SSM syntax includes data types for potentially very long media types, such as text, image, audio and video, as shown in Figure 6.8. If this model is to be realized in a single database, the DMS will have to have the capability to manage (store, search, retrieve, and manipulate) different media types. Object-relational DBMS vendors claim to be able to do this.

Figure: Media objects as attributes

SQL3 provides support for the storage of Binary Large OBjects, BLOBs. A BLOB is simply a very long bit string, limited in many systems today to 2 or 4 GB. Several OR-DBMS vendors differentiate BLOBs into data types that give more information about the format of the content and provide basic/primitive manipulation functions for these large object (LOB) types. For example, IBM's DB2 has 3 LOB types: BLOB for long bit strings, CLOB for long character strings, and DBCLOB for double-byte character strings. Oracle data types for large objects are BLOB, CLOB, NCLOB (fixed-width multi-byte CLOB) and BFILE (binary file stored outside the DB). Note that the first 3 are equivalent to the DB2 LOBs, while the last is really not a data type, but rather a link to an externally stored media object.

SQL3 has no functions for processing (for example, indexing) the content of a BLOB, and provides only functions to store and retrieve it given an external identifier. For example, if the BLOB is an image, SQL3 does not 'know' how to display it, i.e. it has no functions for image presentation. DBMS vendors who provide differentiated BLOB types have also extended the basic SQL string comparison operators so that they will function for LOBs, or at least CLOBs. These operators include the pattern match function "LIKE", which gives a true/false response if the search string is found/not found in the *LOB attribute. Note: "LIKE" is a standard SQL predicate that simply has been extended to search very long data domains.

(E) Storage of LOBs

There are 3 strategies for storing LOBs in an or-DB:
1. Embedded in a column of the defining relation, or

2. Stored in a separate table within the DB, linked from the *LOB column of the defining relation, or
3. Stored on an external (local or geographically distant) medium, again linked from the *LOB column of the defining relation.

Embedded storage in the defining relation closely maps the logical view of the media object to its physical storage. This strategy is best if the other attributes of the table are primarily structural metadata used to specify display characteristics, for example length, language, and format. The problem with embedded storage is that a DMS must transfer at least a whole tuple, more commonly a block of tuples, from storage for processing. If BLOBs are embedded in the tuples, a great deal of data must be transmitted even if the LOB objects are not part of the query selection criteria or the result. For example, a query retrieving the name and address of persons living in Bergen, Norway, would also retrieve large quantities of image data if the data for the Person.Picture attribute of Figure 8 were stored as an embedded column in the Person table.

Separate table storage gives indirect access via a link in the defining relation and delays retrieval of the LOB until it is to be part of the query result set. Though this gives a two-step retrieval, for example when requesting an image of Joan Nordbotten, it will reduce the general or average transfer time for the query processing system. A drawback of this storage strategy is a likely fragmentation of the DB area, as LOBs can be stored 'anywhere'. This will decrease the efficiency of any algorithm searching the content of a larger set of LOBs, for example to find images that are similar to or contain a given image segment. As usual, the storage structure chosen for a DB should be based on an analysis of anticipated user queries.

External storage is useful if the DB data is 'connected' to established media databases, either locally on CD, DVD, ..., or on other computers in a network, as will most likely be the case when sharing media data stored in autonomous applications, such as cooperating museums, libraries, archives, or government agencies. This storage structure eliminates the need for duplication of large quantities of data that are normally offered in read-only mode. The cost is in access time, which may currently be nearly unnoticeable. A good multimedia DMS should support each of these storage strategies.

3. Explain:
A) Data Warehouse Architecture
B) Data Storage Methods

Ans 3: A. Data Warehouse Architecture

The term Data Warehouse Architecture is primarily used today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others. The Data Warehouse Architecture describes the overall system from various perspectives, such as data, process, and infrastructure, needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall system. The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

B. Data Storage Methods

In OLTP (Online Transaction Processing) systems, relational database design uses the discipline of data modeling and generally follows the Codd rules of data normalization in order to ensure absolute data integrity. Complex information is broken down into its simplest structures (tables), where all of the individual atomic-level elements relate to each other and satisfy the normalization rules. Codd defines 5 increasingly stringent rules of normalization, and typically OLTP systems achieve third normal form. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables, and the result is very fast insert/update performance, because only a small amount of data is affected in each relational transaction.

OLTP databases are efficient because they typically deal only with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.

Designing the data warehouse data architecture is the realm of Data Warehouse Architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides the most useful and flexible basis for use in reporting and information analysis. However, because of different focuses on specific requirements, there can be alternative methods for designing and implementing data warehouses.

There are two leading approaches to organizing the data in a data warehouse. In the "dimensional" approach, transaction data is partitioned into either "facts", which are generally numeric data that capture specific values, or "dimensions", which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered and the price paid, and dimensions such as date, customer, product, geographical location and salesperson (a sketch of such a star schema follows).
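A minimal sketch of a star schema for this sales example, written as SQL DDL inside a Python string; the table and column names are illustrative assumptions, not taken from any specific product:

    star_schema = """
    CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day DATE);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name VARCHAR(40));
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name VARCHAR(40));
    CREATE TABLE fact_sales (              -- one row per sales transaction
        date_id     INTEGER REFERENCES dim_date(date_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        quantity    INTEGER,               -- measured facts
        price_paid  DECIMAL(10, 2)
    );
    """
    print(star_schema)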

The main advantages of a dimensional approach are that the data warehouse is easy for business staff with limited information technology experience to understand and use, and that, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to change later if the company changes the way in which it does business.

In the other, "normalized" approach, the data in the warehouse is stored following, to a degree, the database normalization rules. The main advantage of this approach is that it is quite straightforward to add new information into the database; the primary disadvantage is that, because of the number of tables involved, it can be rather slow to produce information and reports. Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.

4. Discuss how the process of retrieving text data differs from the process of retrieving an image.

Text Retrieval Using SQL3/TextRetrieval

SQL3 supports storage of multimedia data, such as text documents, in an or-database using the blob/clob data types. However, the standard SQL3 specification does not include support for such media content processing functions as indexing or searching using elements of the media content. For example, SQL3's support for a query to retrieve documents about famous Norwegian artists is limited to using a serial search of all documents using the pattern match operator 'LIKE'. Queries using this operator are likely to miss, for example, the Web sites dedicated to the composer Edvard Grieg.

Seekers of information from text-based documents commonly use 'free text' queries, i.e. queries that consist of a set of selection terms, as illustrated above. Depending on the underlying query processing system, the input can vary from a single search term to a longer document. This is a 'normal' input format for Information Retrieval (IR) systems, such as the web search engines, but not for systems based on SQL, which do not have a specific free-text query format. Therefore, most of the larger or-dbms vendors (IBM, Oracle, Ingres, Postgres, etc.) have used SQL3's UDT/UDF support to extend their or-dbms with sub-systems for the management of media data. The approach used has been to add on their own, or purchased, specialized media management systems to the basic or-dbms. Basically, the new (to SQL3) functionality includes:

Indexing Routines for the various types of media data, as discussed in Ch. 6, for example using:
o Content terms for text data, and

o Color, shape, and texture features for image data.
Selection Operators for the SQL3 WHERE clause, for specification of selection criteria for media retrieval.
Text Processing Sub-Systems for similarity evaluation and result ranking.

Unfortunately, the result of this 'independent' activity is non-standard or-dbms/mm (multimedia) systems that differ in the functionality included and limit data retrieval from multiple or-dbms system types. For example, unified access to data stored in Oracle and DB2 systems is difficult, both in query formulation and result presentation. Since the syntax of the SQL3 extensions varies between or-dbms/mm implementations, the examples used in the following are given in generic SQL3/TextRetrieval (or sql3/tr) statements.

Text Document Retrieval

Text-based documents are basically unstructured and can be complex. They can consist of the raw text only, have a tagged structure (such as for html documents), include embedded images, and can have a number of fixed attributes containing the metadata describing aspects of the document. They may also include links to supplementary materials. For example, a news report for an election could include the following components (where n, m, k, and x are the number of occurrences of each component type):

1. Identifier, date, and author(s) of the report,
2. n* text blocks (titles, abstract, content text),
3. m* images (example: image_of_candidate),
4. k* charts, and
5. x* maps.

Note that the document elements listed in pt. 1 above function as context metadata for the report, while the text itself can function as semantic metadata for both the text (through indexing) and the image materials. The Web document shown in the figure (not reproduced here) illustrates elements of a semi-structured document. Since an OR-DB can contain text documents such as web pages, SQL3 should be extended with processing operators that support access to each of the element types listed above.

Retrieval using Context Metadata

In an OR-DB, document descriptors such as Document ID, Date, and Author(s) function as context metadata. The metadata can be implemented as standard atomic attributes and relationships, thus enabling use of standard SQL queries for retrieval of the document(s). For example, an SQL query to find recent articles on database management by Joan Nordbotten could be expressed as:

   SELECT R.*
   FROM   Person P, Author A, Report R
   WHERE  P.id = A.Pid
     AND  A.Rid = R.id
     AND  Name = 'Joan Nordbotten'
     AND  A.Date > 1999-12-31
     AND  Title LIKE '%Database%';

Note that this query assumes that there could be reports on different topics, and it therefore requires the use of a semantic descriptor to select only those documents that indicate that the report has something to do with databases. The Title attribute was used in this query, but other semantic metadata, such as the summary and/or keyword attributes, could also have been chosen, alone or in combination. Execution optimization of this query will place the LIKE operator 'last', so that its time-consuming serial search of the Report.title attribute will be restricted to those reports that satisfy the Author.name and date conditions. However, as noted previously, no term index functionality for multiple-term attributes has been included in standard SQL3; thus there is no alternative to the serial search for the LIKE operator.

Information retrieval using the standard SQL exact match operators functions well for the context metadata of all media types, and moderately well for the semantic content metadata attributes. The problem is that the user must know the DB structure, the attribute names and the DB values in order to form a query. This will not be the case for Internet searchers.

Text Retrieval by Semantic Content

Researchers and developers of document collections strongly recommend that the semantic information content of the documents be described using such semantic content metadata attributes as a title, (a list of) subject keywords, and a content description, all multiple-term descriptors. This information can be stored with the document as standard SQL attributes using variable-length character data types. For example, an OR-DB for web-site maintenance could be developed to contain Web documents described using Dublin Core metadata elements. If the DB contained the Web page, it could be retrieved using the following SQL statement based on the semantic metadata and the text itself:

   SELECT * FROM Document
   WHERE (Title LIKE '%Edvard Grieg%'
      OR  Text  LIKE '%Edvard Grieg%');

In this case, the document was selected by a match in the title, since Edvard Grieg is not mentioned by full name in the text of the article. However, the following SQL3 query will not return this document, though it is relevant to the intent of the query, unless the phrase Norwegian composer has been defined in the Keywords list:

   SELECT * FROM Document
   WHERE (Title    LIKE '%Norwegian composer%'
      OR  Keywords LIKE '%Norwegian composer%'
      OR  Text     LIKE '%Norwegian composer%');

The most obvious problems using a standard SQL3 system for text search include:
- the lack of utilization of the document structure,
- the dependency on the serial search of the LIKE operator for the multiple-term semantic metadata attributes and the text body, and
- the potential mismatch between the user query terms and the terms in the document descriptors.

As noted earlier, SQL3 has no concept of a document or words, and therefore there are no search operators for specification of the placement of search terms in a document (adjacent, near, before, after, ...). Since data retrieval in SQL3 is based on an exact match of the query terms and the DB values, no support is provided for similarity evaluation between the query terms and the document content. Obviously, more powerful operators are needed for text retrieval. Ideally, a query language that supports text search and retrieval by the semantic content of text documents must provide at least the following functionality:

   Search Criteria       Example
   List of terms         Norwegian, composer, Grieg
   Term proximity        Edvard NEAR Grieg
   Synonym concepts      ABOUT "Norwegian composers"
   Similar documents     LIKE <this document>

To help avoid problems with the use of various term forms, a root extraction function must be available for both document indexing and query pre-processing. Using the above examples, some elements in the root-term table could be:

   Root Term    Variations
   Norway       Norwegian, Norsk, Norge, ...
   Compose      composer, composers, composes, ...
   Music        song, songs, tune, tunes, ...

Note that there exist numerous electronic dictionaries, thesauri, taxonomies and ontologies that can be incorporated into a text query processor.

SQL3/Text

Information Retrieval Systems (IRS) have been under development since the mid-1950s. They provide search and retrieval functions for text document collections based on document structure, concepts of words, and grammar. It is functionality from these systems that has been added by or-DBMS vendors to support management of multimedia data. The resulting ORDBMS/MM (Multimedia) systems conform (to some degree) to the Multimedia Information Retrieval Systems, MIRS, envisioned by Lu (1999).

Basic ORDBMS/MM text retrieval functionality includes the generation of multiple types of term indexes, as well as a CONTAINS operator, with sub-operators, for the WHERE clause. The CONTAINS operator differs from an exact match query in that it gives a probability for a match (a similarity score) between the query search terms and the documents in the database, rather than a true/false result. This operator can be used with multiple search terms and with operators that specify relationships between the search terms, for example the Boolean operators AND, OR, NOT, and location operators such as adjacent or within the same sentence or paragraph, as illustrated in the following table:

   Term combination          AND, OR, NOT
   Term location             ADJACENT, NEAR, WITHIN, ...
   Concept                   ABOUT, SIMILAR
   Various other operators   FUZZY, LIKE, ...

Assuming that whole Web pages are stored in an OR-DB attribute Document.text, the following examples will retrieve the document, in addition to other documents containing the search terms.

   1) SELECT * FROM Document
      WHERE Text CONTAINS ('Edvard' AND 'Grieg');

   2) SELECT * FROM Document
      WHERE Text CONTAINS ('Edvard' ADJACENT 'Grieg');

   3) SELECT * FROM Document
      WHERE Text ABOUT ('composers');

In processing the above queries, the SQL3/Text processing system utilizes the term indexes generated for the document set, as well as a thesaurus for query 3. Note that a term location index is required for query 2, while query 1 needs a frequency index if the retrieved documents are to be ranked/ordered by the frequency of the search terms within the documents.

Image Retrieval

Popular knowledge claims that an image is worth 1000 words. Unfortunately, these 1000 words may differ from one individual to another, depending on their perspective and/or knowledge of the image context. For example, Figure 6 (not reproduced here) gives a familiar demonstration that an image can have multiple, quite different interpretations. Thus, even if a 1000-word image description were available, it is not certain that the image could be retrieved by a user with a different description.

The problem is fundamentally one of communication between an information/image seeker/user and the image retrieval system. Since users may have differing needs and knowledge about the image collection, an image retrieval system must support various forms of query formulation. In general, image retrieval queries can be classified as:

1. Attribute-Based Queries, which use context and/or structural metadata values to retrieve images, for example:
o Find image number 'x', or
o Find images from the 17th of May (the Norwegian national holiday).

2. Textual Queries, which use a term-based specification of the desired images that can be matched to textual image descriptors, for example:
o Find images of Hawaiian sunsets, or
o Find images of President Bush delivering a campaign speech.

3. Visual Queries, which give visual characteristics (color, texture) or an image that can be compared to visual descriptors. Examples include:
o Find images where the dominant color is blue and gold, or
o Find images like <this one>.

These query types utilize different image descriptors and require different processing functions. Image descriptors can be classified into:

Metadata Descriptors: those that describe the image, as recommended in the numerous metadata standards, such as Dublin Core, CIDOC/CRM and MPEG-7, from the library, museum and motion picture communities respectively. These metadata can again be classified as:
1. Attribute-based context and structural metadata, such as creator, dates, genre, (source) image type, size, file name, ..., or
2. Text-based semantic metadata, such as title/caption, subject/keyword lists, free-text descriptions and/or the text surrounding embedded images, for example as used in an html document. Note that for embedded images, content indexing can be generated using the nearby text.

5. What are the differences between centralized and distributed database systems? List the relative advantages of data distribution.

Ans 5: Features of Distributed vs. Centralized Databases (Differences between Distributed and Centralized Databases)

Centralized Control vs. Decentralized Control
In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", having the central responsibility for the whole of the data, along with "local database administrators", who have responsibility for the local databases.

Data Independence
In central databases this means the actual organization of data is transparent to the application programmer. The programs are written with a "conceptual" view of the data (the "conceptual schema"), and the programs are unaffected by the physical organization of data. In distributed databases, another aspect, "distribution dependency", is added to the notion of data independence as used in centralized databases. Distribution independence (distribution transparency) means programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another; however, their speed of execution is affected.

Reduction of Redundancy
In centralized databases, redundancy was reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable, because (a) the locality of applications can be increased if data is replicated at all sites where applications need it, and (b) the availability of the system can be increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.

Complex Physical Structures and Efficient Access
In centralized databases, complex accessing structures like secondary indexes and inter-file chains are used. All these features provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be generated either by the programmer or produced automatically by an optimizer. Problems faced in the design of such an optimizer can be classified in two categories:
a) Global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites.

b) Local optimization consists of deciding how to perform the local database accesses at each site.

Integrity, Recovery and Concurrency Control
A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

Privacy and Security
In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problems, as well as two new aspects: (a) security (protection) problems intrinsic to the use of communication networks; and (b) sites in databases with a high degree of "site autonomy" may feel more protected because they can enforce their own protections instead of depending on a central database administrator.

Distributed Query Processing
The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location.

Distributed Directory (Catalog) Management
Catalogs for distributed databases contain information like fragmentation descriptions, allocation descriptions, mappings to local names, access method descriptions, statistics on the database, and protection and integrity constraints (consistency information), which are more detailed as compared to centralized databases.

Relative Advantages of Distributed Databases over Centralized Databases

Organizational and Economic Reasons

Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. In organizations already having several databases and feeling the necessity of global applications, distributed databases are the natural choice.

Incremental Growth
In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

Reduced Communication Overhead
Many applications are local, and these applications do not have any communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

Performance Considerations
Data localization reduces the contention for CPU and I/O services and simultaneously reduces the access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions were submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or by breaking up a query into a number of subqueries that execute in parallel. This contributes to improved performance.

Reliability and Availability
Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

Management of Distributed Data with Different Levels of Transparency
In a distributed database, the following types of transparency are possible:

Distribution or Network Transparency
This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.

Replication Transparency

Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.

Fragmentation Transparency
The two main types of fragmentation are horizontal fragmentation, which distributes a relation into sets of tuples (rows), and vertical fragmentation, which distributes a relation into sub-relations where each sub-relation is defined by a subset of the columns of the original relation. A global query by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments. A small sketch of the two fragmentation types follows.
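The sketch below illustrates horizontal vs. vertical fragmentation on a toy relation represented as a Python list of dicts; the relation and attribute names are invented for the example:

    account = [
        {"account#": 101, "bname": "Downtown", "balance": 500},
        {"account#": 102, "bname": "Uptown",   "balance": 1500},
    ]

    # Horizontal fragmentation: subsets of tuples (rows), e.g. by branch site.
    downtown_fragment = [t for t in account if t["bname"] == "Downtown"]

    # Vertical fragmentation: subsets of columns; each fragment keeps the key
    # so the original relation can be reconstructed by a join.
    balances_fragment = [{"account#": t["account#"], "balance": t["balance"]}
                         for t in account]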

6. What are commit protocols? Explain how the two-phase commit protocol responds to the following types of failures: i) failure of a participating site, ii) failure of the coordinator.

Ans 6:

Commit Protocols:

In distributed database and transaction systems, a distributed commit protocol is required to ensure that the effects of a distributed transaction are atomic; that is, either all the effects of the transaction persist or none persist, whether or not failures occur. Several commit protocols have been proposed in the literature. These are variations of what has become a standard, known as the two-phase commit (2PC) protocol.

Two-phase commit protocol


In transaction processing, databases, and computer networking, the two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol). The protocol achieves its goal even in many cases of temporary system failure (involving process, network node, or communication failures, among others), and is thus widely utilized. However, it is not resilient to all possible failure configurations, and in rare cases user intervention (e.g., by a system administrator) is needed to remedy an outcome. To accommodate recovery from failure (automatic in most cases), the protocol's participants use logging of the protocol's states. Log records, which are typically slow to generate but survive failures, are used by the protocol's recovery procedures. Though usually intended to be used infrequently, recovery procedures comprise a substantial portion of the protocol, due to the many possible failure scenarios to be considered and supported.

(i) Failure of a Participating Site:

The protocol consists of two phases. In the commit-request phase (or voting phase), a coordinator process attempts to prepare all the transaction's participating processes (named participants, cohorts, or workers) to take the necessary steps for either committing or aborting the transaction, and to vote either "Yes": commit (if the transaction participant's local portion of the execution has ended properly) or "No": abort (if a problem has been detected with the local portion). In the commit phase, based on the voting of the cohorts, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the cohorts. The cohorts then follow with the needed actions (commit or abort) with their local transactional resources (also called recoverable resources, e.g., database data) and their respective portions in the transaction's other output (if applicable). If a participating site fails before voting, the coordinator receives no vote within its timeout, treats this as a "No" vote, and aborts the transaction; a participant that fails after voting "Yes" consults its log on recovery and asks the coordinator for the transaction's outcome.

(ii) Failure of the Coordinator:

If any cohort votes "No" during the commit-request phase (or the coordinator's timeout expires):
(1) The coordinator sends a rollback message to all the cohorts.
(2) Each cohort undoes the transaction using the undo log, and releases the resources and locks held during the transaction.
(3) Each cohort sends an acknowledgement to the coordinator.
(4) The coordinator undoes the transaction when all acknowledgements have been received.
If the coordinator itself fails after the cohorts have voted "Yes" but before the decision has been delivered, the cohorts are blocked: they must hold their locks and wait until the coordinator recovers and re-reads its log (or until a recovery procedure establishes a new coordinator). This blocking behavior is the main weakness of 2PC.
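A minimal Python sketch of the coordinator's decision logic across the two phases described above; participants are modeled as plain objects, and networking, timeouts and persistent logging are deliberately omitted, so this is an illustration rather than a usable implementation:

    class Participant:
        def __init__(self, vote):
            self.vote = vote
        def prepare(self):       # phase 1: vote "Yes" or "No"
            return self.vote
        def commit(self):        # phase 2: make local effects durable
            pass
        def rollback(self):      # phase 2: undo via the local undo log
            pass

    def two_phase_commit(participants, log):
        # Phase 1: commit-request (voting) phase.
        votes = [p.prepare() for p in participants]
        # Phase 2: commit only if every cohort voted "Yes".
        if all(v == "Yes" for v in votes):
            log.append("commit")             # decision logged before sending
            for p in participants:
                p.commit()
            return "committed"
        log.append("abort")                  # any "No" (or timeout) -> abort
        for p in participants:
            p.rollback()
        return "aborted"

    print(two_phase_commit([Participant("Yes"), Participant("No")], []))
    # -> aborted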
