
Master of Computer Application (MCA) Semester 4
MC0077 Advanced Database Systems, 4 Credits (Book ID: B0882)
Assignment Set 1

HARVINDER SINGH (MCA 4th Semester), Roll Number - 511025273
http://www.scribd.com/Harvinder_chauhan
July 2011

1. Describe the following:

Dimensional Model
Ans: The dimensional model is a specialized adaptation of the relational model used to represent data in data warehouses in such a way that the data can be easily summarized using OLAP queries. In the dimensional model, a database consists of a single large table of facts that are described using dimensions and measures. A dimension provides the context of a fact (such as who participated, when and where it happened, and its type) and is used in queries to group related facts together. Dimensions tend to be discrete and are often hierarchical; for example, the location might include the building, state, and country. A measure is a quantity describing the fact, such as revenue. It is important that measures can be meaningfully aggregated; for example, the revenue from different locations can be added together. In an OLAP query, dimensions are chosen and the facts are grouped and added together to create a summary.

The dimensional model is often implemented on top of the relational model using a star schema, consisting of one table containing the facts and surrounding tables containing the dimensions. Particularly complicated dimensions might be represented using multiple tables, resulting in a snowflake schema. A data warehouse can contain multiple star schemas that share dimension tables, allowing them to be used together. Coming up with a standard set of dimensions is an important part of dimensional modeling.
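As a rough sketch of a star schema and an OLAP-style summary query, the following uses Python's built-in sqlite3 module with a hypothetical sales fact table and a location dimension; the table and column names are invented for illustration only:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: gives the context of a fact (where the sale happened).
cur.execute("CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT)")
# Fact table: one row per sale, with a measure (revenue) and a foreign key to the dimension.
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, location_id INTEGER, revenue REAL)")

cur.executemany("INSERT INTO dim_location VALUES (?, ?, ?, ?)",
                [(1, "Delhi", "Delhi", "India"), (2, "Mumbai", "Maharashtra", "India")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 500.0), (2, 1, 250.0), (3, 2, 900.0)])

# OLAP-style query: group the facts by a dimension attribute and aggregate the measure.
for row in cur.execute("""
        SELECT d.state, SUM(f.revenue)
        FROM fact_sales f JOIN dim_location d ON f.location_id = d.location_id
        GROUP BY d.state"""):
    print(row)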

Object Database Models


Ans: In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases.


A variety of ways have been tried for storing objects in a database. Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities.

Object databases suffered because of a lack of standardization: although standards were defined by ODMG, they were never implemented well enough to ensure interoperability between products. Nevertheless, object databases have been used successfully in many applications: usually specialized applications such as engineering databases or molecular biology databases rather than mainstream commercial data processing. However, object database ideas were picked up by the relational vendors and influenced extensions made to these products and indeed to the SQL language.
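A minimal sketch of the impedance mismatch that object databases try to avoid: with a conventional relational store, the application has to hand-translate between objects and rows. The class, table, and field names below are hypothetical and the relational store is plain sqlite3:

import sqlite3
from dataclasses import dataclass

@dataclass
class Account:          # application-side representation: an object
    number: str
    balance: float

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (number TEXT PRIMARY KEY, balance REAL)")

def save(acc: Account) -> None:
    # object -> row: the conversion an object database would make unnecessary
    conn.execute("INSERT INTO account VALUES (?, ?)", (acc.number, acc.balance))

def load(number: str) -> Account:
    # row -> object: the reverse conversion
    row = conn.execute("SELECT number, balance FROM account WHERE number = ?", (number,)).fetchone()
    return Account(*row)

save(Account("A-101", 1200.0))
print(load("A-101"))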

Post Relational Database Models


Ans: Several products have been identified as post-relational because their data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations. Products using a post-relational data model typically employ a model that actually pre-dates the relational model; it might be described as a directed graph with trees on the nodes. Post-relational databases could be considered a subset of object databases, as there is no need for object-relational mapping when using a post-relational data model. In spite of many attacks on this class of data models, with designations such as "hierarchical" or "legacy", the post-relational database industry continues to grow as a multi-billion dollar industry, even if the growth stays below the relational database radar. Examples of models that could be classified as post-relational are PICK (aka MultiValue) and MUMPS (aka M).

2. Explain the concept of a Query. How does a Query Optimizer work?

Ans: Queries are essentially powerful filters. Queries allow you to decide what fields or expressions are to be shown and what information is to be sought. Queries are usually based on tables but can also be based on an existing query. Queries allow you to seek anything from very basic information through to much more complicated specifications. They also allow you to list information in a particular order, such as listing all the resulting records in surname order. Queries can select records that fit certain criteria. If you had a list of people with a gender field, you could use a query to select just the males or females in the database. The gender field would have a criterion set as "male", which means that when the query is run, only records with "male" in the gender field would be listed.
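A rough illustration of such a criteria-based query, again using Python's sqlite3 module; the people table, its fields and the sample rows are invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, surname TEXT, gender TEXT, phone TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)",
                 [("Ravi", "Kumar", "male", "555-0101"),
                  ("Anita", "Sharma", "female", "555-0102"),
                  ("Sunil", "Verma", "male", "555-0103")])

# Criterion: only records with "male" in the gender field, listed in surname order.
for row in conn.execute(
        "SELECT first_name, surname, phone FROM people WHERE gender = ? ORDER BY surname",
        ("male",)):
    print(row)

# Queries can also update or delete records, not just list them.
conn.execute("DELETE FROM people WHERE gender = ?", ("female",))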


For each record that meets the criteria, you could choose to list other fields that may be in the table, such as first name, surname, phone number, date of birth or whatever you may have in the database. Queries can do much more than just listing out records. It is also possible to list totals, averages etc. from the data and do various other calculations. Queries can also be used to do other tasks, such as deleting records, updating records, adding new records, creating new tables and creating tabulated reports.

The query optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans for a given input query, and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan, and choose the plan with the smallest cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU requirements, and other factors determined from the data dictionary. The set of query plans examined is formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loop join). The search space can become quite large depending on the complexity of the SQL query. Generally, the query optimizer cannot be accessed directly by users: once queries are submitted to the database server and parsed by the parser, they are passed to the query optimizer where optimization occurs. However, some database engines allow guiding the query optimizer with hints.

Most query optimizers represent query plans as a tree of "plan nodes". A plan node encapsulates a single operation that is required to execute the query. The nodes are arranged as a tree, in which intermediate results flow from the bottom of the tree to the top. Each node has zero or more child nodes; those are nodes whose output is fed as input to the parent node. For example, a join node will have two child nodes, which represent the two join operands, whereas a sort node would have a single child node (the input to be sorted). The leaves of the tree are nodes which produce results by scanning the disk, for example by performing an index scan or a sequential scan.

Join ordering: The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders of magnitude more time to execute than one that joins A and C first. Most query optimizers determine join order via a dynamic programming algorithm pioneered by IBM's System R database project. This algorithm works in stages:

1. First, all ways to access each relation in the query are computed. Every relation in the query can be accessed via a sequential scan. If there is an index on a relation that can be used to answer a predicate in the query, an index scan can also be used. For each relation, the optimizer records the cheapest way to scan the relation, as well as the cheapest way to scan the relation that produces records in a particular sorted order.


2. The optimizer then considers combining each pair of relations for which a join condition exists. For each pair, the optimizer will consider the available join algorithms implemented by the DBMS. It will preserve the cheapest way to join each pair of relations, in addition to the cheapest way to join each pair of relations that produces its output according to a particular sort order.

3. Then all three-relation query plans are computed, by joining each two-relation plan produced by the previous phase with the remaining relations in the query.

In this manner, a query plan is eventually produced that joins all the relations in the query. Note that the algorithm keeps track of the sort order of the result set produced by a query plan, also called an interesting order. During dynamic programming, one query plan is considered to beat another query plan that produces the same result only if they produce the same sort order. This is done for two reasons. First, a particular sort order can avoid a redundant sort operation later on in processing the query. Second, a particular sort order can speed up a subsequent join because it clusters the data in a particular way.

Historically, System R-derived query optimizers would often consider only left-deep query plans, which first join two base tables together, then join the intermediate result with another base table, and so on. This heuristic reduces the number of plans that need to be considered (n! instead of 4^n), but may result in not considering the optimal query plan. This heuristic is drawn from the observation that join algorithms such as nested loops only require a single tuple (aka row) of the outer relation at a time. Therefore, a left-deep query plan means that fewer tuples need to be held in memory at any time: the outer relation's join plan need only be executed until a single tuple is produced, and then the inner base relation can be scanned (this technique is called "pipelining"). Subsequent query optimizers have expanded this plan space to consider "bushy" query plans, where both operands to a join operator could be intermediate results from other joins. Such bushy plans are especially important in parallel computers because they allow different portions of the plan to be evaluated independently.
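A minimal sketch of this bottom-up dynamic programming over left-deep join orders, using the three table sizes from the example above; the 1% join selectivity and the cost formula are invented stand-ins for real catalog statistics, and interesting orders and access-path choices are not modelled:

from itertools import combinations

# Hypothetical base-table cardinalities (from the A, B, C example above).
sizes = {"A": 10, "B": 10_000, "C": 1_000_000}
SEL = 0.01  # toy join selectivity

def dp_join_order(tables):
    # best[subset] = (cost, rows, left-deep join order) for that subset of relations
    best = {frozenset([t]): (0, sizes[t], (t,)) for t in tables}
    for k in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, k)):
            candidates = []
            for t in subset:  # left-deep: the last join is (subset - {t}) joined with base table t
                cost, rows, order = best[subset - {t}]
                out_rows = max(1, int(rows * sizes[t] * SEL))
                candidates.append((cost + out_rows, out_rows, order + (t,)))
            best[subset] = min(candidates)   # keep only the cheapest plan per subset
    return best[frozenset(tables)]

cost, rows, order = dp_join_order(list(sizes))
print("cheapest left-deep order:", " join ".join(order), "estimated cost:", cost)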

3. Explain the following with respect to Heuristics of Query Optimization:

Equivalence of Expressions
Ans: The first step in selecting a query-processing strategy is to find a relational algebra expression that is equivalent to the given query and is efficient to execute. We will use the following relations as examples:

Customer-scheme = (cname, street, ccity)
Deposit-scheme = (bname, account#, cname, balance)


Branch-scheme = (bname, assets, bcity)

Selection Operation
Ans: 1. Consider the query to find the assets and branch names of all branches that have depositors living in Port Chester. In relational algebra, this is

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester"}(customer \bowtie deposit \bowtie branch))$

This expression constructs a huge relation, $customer \bowtie deposit \bowtie branch$, of which we are only interested in a few tuples. We are also only interested in two attributes of this relation. We can see that we only want tuples for which ccity = "Port Chester". Thus we can rewrite our query as:

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester"}(customer) \bowtie deposit \bowtie branch)$

This should considerably reduce the size of the intermediate relation.

2. Suggested Rule for Optimization: Perform selection operations as early as possible. If our original query was restricted further to customers with a balance over $1000, the selection could not be applied directly to the customer relation above. The new relational algebra query is

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester" \wedge balance > 1000}(customer \bowtie deposit \bowtie branch))$

The selection cannot be applied to customer, as balance is an attribute of deposit. We can still rewrite this as

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester" \wedge balance > 1000}(customer \bowtie deposit) \bowtie branch)$

If we look further at the subquery (the selection over $customer \bowtie deposit$), we can split the selection predicate in two:

$\sigma_{ccity="Port Chester"}(\sigma_{balance > 1000}(customer \bowtie deposit))$


This rewriting gives us a chance to use our "perform selections early" rule again. We can now rewrite our subquery as:

$\sigma_{balance > 1000}(\sigma_{ccity="Port Chester"}(customer) \bowtie deposit)$

3. Second Transformational Rule: Replace expressions of the form $\sigma_{P_1 \wedge P_2}(e)$ by $\sigma_{P_1}(\sigma_{P_2}(e))$, where $P_1$ and $P_2$ are predicates and $e$ is a relational algebra expression. Generally,

$\sigma_{P_1}(\sigma_{P_2}(e)) = \sigma_{P_2}(\sigma_{P_1}(e)) = \sigma_{P_1 \wedge P_2}(e)$

Projection Operation


Ans: 1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester"}(customer) \bowtie deposit \bowtie branch)$

2. When we compute the subexpression

$\sigma_{ccity="Port Chester"}(customer) \bowtie deposit$

we obtain a relation whose scheme is (cname, ccity, bname, account#, balance).

3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that appear in the result of the query or are needed to process subsequent operations.

4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.

5. In our example, the only attribute we need is bname (to join with branch). So we can rewrite our expression as:

$\Pi_{bname, assets}(\Pi_{bname}(\sigma_{ccity="Port Chester"}(customer) \bowtie deposit) \bowtie branch)$

Note that there is no advantage in doing an early projection on a relation before it is needed for some other operation:


We would access every block for the relation to remove attributes. Then we access every block of the reduced-size relation when it is actually needed. We do more work in total, rather than less!

Natural Join Operation


Ans: Another way to reduce the size of temporary results is to choose an optimal ordering of the join operations. Natural join is associative:

$(r_1 \bowtie r_2) \bowtie r_3 = r_1 \bowtie (r_2 \bowtie r_3)$

Although these expressions are equivalent, the costs of computing them may differ. Look again at our expression

$\Pi_{bname, assets}(\sigma_{ccity="Port Chester"}(customer) \bowtie deposit \bowtie branch)$

We see that we can compute $deposit \bowtie branch$ first and then join it with the first part. However, $deposit \bowtie branch$ is likely to be a large relation, as it contains one tuple for every account. The other part, $\sigma_{ccity="Port Chester"}(customer)$, is probably a comparatively small relation. So, if we compute $\sigma_{ccity="Port Chester"}(customer) \bowtie deposit$ first, we get a reasonably small relation. It has one tuple for each account held by a resident of Port Chester. This temporary relation is much smaller than $deposit \bowtie branch$.

Natural join is commutative:

$r_1 \bowtie r_2 = r_2 \bowtie r_1$

Thus we could rewrite our relational algebra expression as:

$\Pi_{bname, assets}((\sigma_{ccity="Port Chester"}(customer) \bowtie branch) \bowtie deposit)$

But there are no common attributes between customer and branch, so this is a Cartesian product. Lots of tuples! If a user entered this expression, we would want to use the associativity and commutativity of natural join to transform it into the more efficient expression we derived earlier (join with deposit first, then with branch).


4. There are a number of historical, organizational, and technological reasons that explain the lack of an all-encompassing data management system. Discuss a few of them with appropriate examples.

Ans: Most current data management systems (DMS) have been built on the assumption that the data collection, or database, to be administered consists of a single media type: structured tables of "fact" data, or unstructured strings of bits representing such media objects as text documents, images, or video. The result is that most DMSs store and index a specific type of media data and provide a query (data access) language that is specialized for efficient access to and retrieval of this data type. A further assumption that has frequently been made is that the information requirements of the system users are known and can be used for structuring the data collection and tuning the data management system. It has also been assumed that the users would only infrequently require information/data from some other type of data management system. These assumptions have been criticized since the early 1980s by researchers who have pointed out that almost from the point of creation, a database would not (nor could) contain all of the data required by the user community (Gligor & Luckenbaugh, 1984; Landers & Rosenberg, 1982; Litwin et al., 1982; among many others).

A number of historical, organizational, and technological reasons explain the lack of an all-encompassing data management system. Among these are:

1. The sensible advice to build small systems, with the plan to extend their scope in later implementation phases, allows a core system to be implemented relatively quickly, but has led to a proliferation of relatively small systems.
2. Department autonomy has led to construction of department-specific rather than organization-wide systems, again leading to many small, overlapping, and often incompatible systems within an organization.
3. The continual evolution of the organization and its interactions, both within and to its external environment, prohibits complete understanding of future information requirements.
4. Parallel development of data management systems for particular applications has led to different and incompatible systems for management of tabular/administrative data, text/document data, historical/statistical data, spatial/geographic data, and streamed audio and visual data.

The result is that only a portion of an organization's data is administered by any one data management system, and most organizations have a multitude of special-purpose databases, managed by different, and often incompatible, data management system types. The growing need to retrieve data from multiple databases within an organization, as well as the rapid dissemination of data through the Internet, has given rise to the requirement of providing integrated access to both internal and external data of multiple types.


A major challenge and critical practical and research problem for the information, computer, and communication technology communities is to develop data management systems that can provide efficient access to the data stored in multiple private and public databases (Brodie, 1993; Hurson & Bright, 1996; Nordbotten, 1988a, 1988b; Nordbotten, 1994a). Problems to be resolved include:

1. Interoperability among systems (Fox & Sornil, 1999; Litwin & Abdellatif, 1986),
2. Incorporation of legacy systems (Brodie, 1993), and
3. Integration of management techniques for structured and unstructured data (Stonebraker & Brown, 1999).

Each of the above problems entails an integration of concepts, methods, techniques and tools from separate research and development communities that have existed in parallel but independently and have had rather minimal interaction. One consequence of this is that there exists overlapping and conflicting terminology between these communities.

In the previous chapter, a database was defined as a COLLECTION OF RELATED DATA REPRESENTING SOME LOGICALLY COHERENT ASPECT OF THE REAL WORLD. With this definition, NO limitations are given as to the type of: data in the collection, model used to structure the collection, or architecture and geographic location of the database. The focus of this text is on on-line (electronic and web-accessible) databases containing multiple media data, thus restricting our interest to multimedia databases stored on one or more computers (DB servers) and accessible from the Internet. Examples of such databases include the image collections of the Hermitage Museum, the catalog and full-text materials of the ACM digital library, and the customer records for the 7 sites of Amazon.com.

Electronic databases are important since they contain data recording the products and services, as well as the economic history and current status, of the owner organization. They are also a source of information for the organization's employees and customers/users. However, databases cannot be used effectively unless there exist efficient and secure data management systems (DMS) for the data in the databases.


5. Describe the Structural Semantic Data Model (SSM) with relevant examples.

Ans: Modeling Complex and Multimedia Data

Data modeling addresses a need in information system analysis and design to develop a model of the information requirements as well as a set of viable database structure proposals. The data modeling process consists of:

1. Identifying and describing the information requirements for an information system,
2. Specifying the data to be maintained by the data management system, and
3. Specifying the data structures to be used for data storage that best support the information requirements.

A fundamental tool used in this process is the data model, which is used both for specification of the information requirements at the user level and for specification of the data structure for the database. During implementation of a database, the data model guides construction of the schema or data catalog, which contains the metadata that describe the DB structure and the data semantics that are used to support database implementation and data retrieval.

Data modeling, using a specific data model type and as a unique activity during information system design, is commonly attributed to Charles Bachman (1969), who presented the Data Structure Diagram as one of the first widely used data models for network database design. Several alternative data model types were proposed shortly thereafter, the best known of which are the Relational model (Codd, 1970) and the Entity-Relationship (ER) model (Chen, 1976).

The relational model was quickly criticized for being 'flat' in the sense that all information is represented as a set of tables with atomic cell values. The definition of well-formed relational models requires that complex attribute types (hierarchic, composite, multi-valued, and derived) be converted to atomic attributes and that relations be normalized. Inter-entity (inter-relation) relationships are difficult to visualize in the resulting set of relations, making control of the completeness and correctness of the model difficult. The relational model maps easily to the physical characteristics of electronic storage media and, as such, is a good tool for design of the physical database.

The entity-relationship approach to modeling, proposed by Chen (1976), had two primary objectives: first, to visualize inter-entity relationships, and second, to separate the DB design process into two phases:

1. Record, in an ER model, the entities and inter-entity relationships required "by the enterprise", i.e. by the owner/user of the information system or application. This phase and its resulting model should be independent of the DBMS tool that is to be used for realizing the DB.
2. Translate the ER model to the data model supported by the DBMS to be used for implementation.


This two-phase design supports modification at the physical level without requiring changes to the enterprise or user view of the DB content. Chen's ER model also quickly came under criticism, particularly for its lack of ability to model classification structures. In 1977, Smith & Smith presented a method for modeling the generalization and aggregation hierarchies that underlie the many extended/enhanced entity-relationship (EER) model types proposed and in use today.
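As a rough sketch of the second phase (translating an ER/EER model into the data model of a concrete DBMS), the following maps a hypothetical generalization hierarchy, a PERSON superclass with STUDENT and EMPLOYEE subclasses, onto relational tables using Python's sqlite3; the entity and attribute names are invented and this is only one common translation strategy:

import sqlite3

conn = sqlite3.connect(":memory:")

# Superclass entity: shared attributes live in one table.
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT)")

# Subclass entities: one table each, sharing the superclass key.
conn.execute("""CREATE TABLE student (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    roll_number TEXT)""")
conn.execute("""CREATE TABLE employee (
    person_id INTEGER PRIMARY KEY REFERENCES person(person_id),
    salary REAL)""")

conn.execute("INSERT INTO person VALUES (1, 'Asha')")
conn.execute("INSERT INTO student VALUES (1, 'MCA-042')")

# Reassembling a subclass instance requires a join back to the superclass table.
print(conn.execute("""SELECT p.name, s.roll_number
                      FROM person p JOIN student s ON p.person_id = s.person_id""").fetchone())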

6. Describe the following with respect to Fuzzy querying to relational databases:

Proposed Model
Ans: The easiest way of introducing fuzziness in the database model is to use a classical relational database and formulate a front end to it that allows fuzzy querying of the database. A limitation imposed on the system is that, because we are neither extending the database model nor defining a new model in any way, the underlying database model is crisp, and hence the fuzziness can only be incorporated in the query. To incorporate fuzziness we introduce fuzzy sets (linguistic terms) on the attribute domains (linguistic variables); for example, on the attribute domain AGE we may define the fuzzy sets YOUNG, MIDDLE and OLD. These are defined as follows:

Fig. 1: Age

For this we take the example of a student database which has a table STUDENTS with the following attributes:

Fig. 2: A snapshot of the data existing in the database


Meta knowledge
Ans: At the level of meta knowledge we need to add only a single table, LABELS, with the following structure:

Fig. 3: Meta Knowledge

This table is used to store the information of all the fuzzy sets defined on all the attribute domains. A description of each column in this table is as follows:

Label: This is the primary key of the table and stores the linguistic term associated with the fuzzy set.
Column_Name: Stores the linguistic variable associated with the given linguistic term.
Alpha, Beta, Gamma, Delta: Store the range of the fuzzy set.
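A small sketch of how such a LABELS row could be used, assuming (this is an assumption, not stated above) that Alpha, Beta, Gamma and Delta are the corner points of a trapezoidal membership function:

def membership(x: float, alpha: float, beta: float, gamma: float, delta: float) -> float:
    """Trapezoidal membership: 0 outside (alpha, delta), 1 on [beta, gamma],
    linear on the ramps in between (assumed interpretation of the LABELS columns)."""
    if x <= alpha or x >= delta:
        return 0.0
    if beta <= x <= gamma:
        return 1.0
    if x < beta:                          # rising edge
        return (x - alpha) / (beta - alpha)
    return (delta - x) / (delta - gamma)  # falling edge

# Hypothetical LABELS row: ('YOUNG', 'AGE', 0, 0, 25, 35)
print(membership(22, 0, 0, 25, 35))   # 1.0, fully YOUNG
print(membership(30, 0, 0, 25, 35))   # 0.5, partially YOUNG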

Implementation
Ans: The main issue in the implementation of this system is the parsing of the input fuzzy query. As the underlying database is crisp, i.e. no fuzzy data is stored in the database, the INSERT query does not change and need not be parsed; it can therefore be presented to the database as it is. During parsing, the query is divided into the following parts:

1. Query Type: Whether the query is a SELECT, DELETE or UPDATE.
2. Result Attributes: The attributes that are to be displayed; used only in the case of a SELECT query.
3. Source Tables: The tables on which the query is to be applied.
4. Conditions: The conditions that have to be satisfied before the operation is performed. Each condition is further sub-divided into the query attribute (i.e. the attribute on which the condition applies) and the linguistic term. If a condition is not fuzzy, i.e. it does not contain a linguistic term, then it need not be subdivided.
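A rough sketch of the front-end idea: rewrite a fuzzy condition into a crisp range over the underlying column using the stored label, then hand an ordinary SQL query to the database. The students table, the rewriting rule and the example query below are invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Asha", 21), ("Vikram", 29), ("Meena", 40)])

# Hypothetical LABELS content: label -> (column, alpha, beta, gamma, delta).
labels = {"YOUNG": ("age", 0, 0, 25, 35)}

def rewrite_fuzzy_condition(column: str, label: str) -> str:
    """Replace 'column = LABEL' by a crisp range over the label's support (alpha, delta).
    Beta and gamma could additionally be used to rank results by membership degree."""
    col, alpha, beta, gamma, delta = labels[label]
    assert col == column
    return f"{column} > {alpha} AND {column} < {delta}"

# The fuzzy query 'SELECT name FROM students WHERE age = YOUNG' becomes:
crisp_where = rewrite_fuzzy_condition("age", "YOUNG")
for row in conn.execute(f"SELECT name, age FROM students WHERE {crisp_where}"):
    print(row)   # Asha and Vikram fall inside the support of YOUNG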
