
Master of Computer Application (MCA) Semester 4 MC0077 Advanced Database Systems 4 Credits

(Book ID: B0882) Assignment Set 1 1. Explain the theory of database internals. Answer: A DBMS is a set of software programs that controls the organization, storage, management, and retrieval of data in a database. DBMSs are categorized according to their data structures or types. The DBMS accepts requests for data from an application program and instructs the operating system to transfer the appropriate data. The queries and responses must be submitted and received according to a format that conforms to one or more applicable protocols. When a DBMS is used, information systems can be changed more easily as the organization's information requirements change. New categories of data can be added to the database without disruption to the existing system. It also manages the data and the information related to the data.

Database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions. 2. Describe the following with respect to Query processing: A) Query Optimizer B) Query Plan C) Implementation

Answer:
Query Optimizer A SQL statement can be executed in many different ways, such as full table scans, index scans, nested loops, and hash joins. The query optimizer determines the most efficient way to execute a SQL statement after considering many factors related to the objects referenced and the conditions specified in the query. This determination is an important step in the processing of any SQL statement and can greatly affect execution time. Note:

The optimizer might not make the same decisions from one version of Oracle Database to the next. In recent versions, the optimizer might make different decisions, because better information is available. The output from the optimizer is an execution plan that describes an optimum method of execution. The plan shows the combination of the steps Oracle Database uses to execute a SQL statement. Each step either retrieves rows of data physically from the database or prepares them in some way for the user issuing the statement. For any SQL statement processed by Oracle, the optimizer performs the operations listed below.

Evaluation of expressions and conditions: The optimizer first evaluates expressions and conditions containing constants as fully as possible.
Statement transformation: For complex statements involving, for example, correlated subqueries or views, the optimizer might transform the original statement into an equivalent join statement.
Choice of optimizer goals: The optimizer determines the goal of optimization. See "Choosing an Optimizer Goal".
Choice of access paths: For each table accessed by the statement, the optimizer chooses one or more of the available access paths to obtain table data. See "Understanding Access Paths for the Query Optimizer".
Choice of join orders: For a join statement that joins more than two tables, the optimizer chooses which pair of tables is joined first, and then which table is joined to the result, and so on. See "How the Query Optimizer Chooses Execution Plans for Joins".
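As an illustration of inspecting the plan the optimizer produces, a minimal Oracle-style sketch follows; the emp/dept tables and their columns are assumptions taken from the common sample schema, not part of the original assignment.

EXPLAIN PLAN FOR
  SELECT e.ename, d.dname
  FROM   emp e JOIN dept d ON e.deptno = d.deptno
  WHERE  e.sal > 3000;

-- Display the plan chosen by the optimizer (access paths, join method and
-- join order appear as separate plan steps); uses the default PLAN_TABLE.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);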

Query Plan A query plan (or query execution plan) is an ordered set of steps used to access or modify information in a SQL relational database management system. This is a specific case of the relational model concept of access plans. Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best alternative. Because query optimizers are imperfect, database users and administrators sometimes need to manually examine and tune the plans produced by the optimizer to get better performance. Implementation Implementation is the carrying out, execution, or practice of a plan, a method, or any design for doing something. As such, implementation is the action that must follow any preliminary thinking in order for something to actually happen. In an information technology context, implementation encompasses all

the processes involved in getting new software or hardware operating properly in its environment, including installation, configuration, running, testing, and making necessary changes. The word deployment is sometimes used to mean the same thing. 3. Explain the following with respect to Heuristics of Query Optimizations: A) Equivalence of Expressions B) Selection Operation C) Projection Operation D) Natural Join Operation Answer: Equivalent expressions We often want to replace a complicated expression with a simpler one that means the same thing. For example, the expression x + 4 + 2 obviously means the same thing as x + 6, since 4 + 2 = 6. More interestingly, the expression x + x + 4 means the same thing as 2x + 4, because 2x is x + x when you think of multiplication as repeated addition. (Which of these is simpler depends on your point of view, but usually 2x + 4 is more convenient in Algebra.) Two algebraic expressions are equivalent if they always lead to the same result when you evaluate them, no matter what values you substitute for the variables. For example, if you substitute x := 3 in x + x + 4, then you get 3 + 3 + 4, which works out to 10; and if you substitute it in 2x + 4, then you get 2(3) + 4, which also works out to 10. There's nothing special about 3 here; the same thing would happen no matter what value we used, so x + x + 4 is equivalent to 2x + 4. (That's really what I meant when I said that they mean the same thing.) When I say that you get the same result, this includes the possibility that the result is undefined. For example, 1/x + 1/x is equivalent to 2/x; even when you substitute x := 0, they both come out the same (in this case, undefined). In contrast, x²/x is not equivalent to x; they usually come out the same, but they are different when x := 0. (Then x²/x is undefined, but x is 0.) To deal with this situation, there is a sort of trick you can play, forcing the second expression to be undefined in certain cases. Just add the words "for x ≠ 0" at the end of the expression to make a new expression; then the new expression is undefined unless x ≠ 0. (You can put any other condition you like in place of x ≠ 0, whatever is appropriate in a given situation.) So x²/x is equivalent to x for x ≠ 0. To symbolise equivalent expressions, people often simply use an equals sign. For example, they might say x + x + 4 = 2x + 4. The idea is that this is a statement that is always true, no matter what x is. However, it isn't really correct to write 1/x + 1/x = 2/x to indicate an equivalence of expressions, because this statement is not correct when x := 0. So instead, I will use the symbol ≡, which you can read "is equivalent to" (instead of "is equal to" for =). So I'll say, for example, x + x + 4 ≡ 2x + 4, 1/x + 1/x ≡ 2/x, and x²/x ≡ x for x ≠ 0. The textbook, however, just uses = for everything, so you can too, if you want.
Selection Operation
1. Consider the query to find the assets and branch-names of all banks who have depositors living in Port Chester. In relational algebra, this is
Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER ⋈ DEPOSIT ⋈ BRANCH))
2. This expression constructs a huge relation, CUSTOMER ⋈ DEPOSIT ⋈ BRANCH, of which we are only interested in a few tuples.
3. We also are only interested in two attributes of this relation.
4. We can see that we only want tuples for which CCITY = "Port Chester".
5. Thus we can rewrite our query as:
Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT ⋈ BRANCH)
This should considerably reduce the size of the intermediate relation.
Projection Operation
1. Like selection, projection reduces the size of relations. It is advantageous to apply projections early. Consider this form of our example query:
Π BNAME, ASSETS (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT ⋈ BRANCH)

2. When we compute the subexpression
σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT
we obtain a relation whose scheme is (CNAME, CCITY, BNAME, ACCOUNT#, BALANCE).
3. We can eliminate several attributes from this scheme. The only ones we need to retain are those that
o appear in the result of the query or
o are needed to process subsequent operations.
4. By eliminating unneeded attributes, we reduce the number of columns of the intermediate result, and thus its size.
5. In our example, the only attribute we need is BNAME (to join with BRANCH). So we can rewrite our expression as:
Π BNAME, ASSETS (Π BNAME (σ CCITY = "Port Chester" (CUSTOMER) ⋈ DEPOSIT) ⋈ BRANCH)
(A SQL version of these rewrites is sketched after this list.)
6. Note that there is no advantage in doing an early project on a relation before it is needed for some other operation:
o We would access every block for the relation to remove attributes.
o Then we access every block of the reduced-size relation when it is actually needed.
o We do more work in total, rather than less!
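The same heuristics can be illustrated in SQL. The sketch below is illustrative only: it assumes the classic CUSTOMER(CNAME, CCITY), DEPOSIT(CNAME, BNAME, ACCOUNT#, BALANCE) and BRANCH(BNAME, ASSETS) schema implied by the text, and the join columns are assumptions.

-- Direct formulation: the full three-way join is built before filtering on CCITY.
SELECT b.bname, b.assets
FROM   customer c
       JOIN deposit d ON d.cname = c.cname
       JOIN branch  b ON b.bname = d.bname
WHERE  c.ccity = 'Port Chester';

-- Heuristically rewritten form: the selection (and an early projection) are
-- pushed into derived tables, so only Port Chester customers, and only the
-- columns actually needed, take part in the joins.
SELECT b.bname, b.assets
FROM   (SELECT cname FROM customer WHERE ccity = 'Port Chester') c
       JOIN (SELECT cname, bname FROM deposit) d ON d.cname = c.cname
       JOIN branch b ON b.bname = d.bname;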

4. Explain the following: A) Data Management Functions

B) Database Design & Creation

Answer: Data Management Functions
Administer Database: responsible for maintaining the integrity of the Data Management database.
Perform Queries: receives a query request from Access and Dissemination and executes the query to generate a result set that is transmitted to the requester.
Generate Report: receives a report request and executes any queries or other processes necessary to generate the report, then supplies the report to the requester.
Receive Database Update: adds, modifies or deletes information in the Data Management persistent storage.
Activate Request: maintains a record of subscription requests and periodically compares it to the contents of the archive to determine if all needed data is available. If needed data is available, this function generates a Dissemination Request which is sent to Access. This function can also generate Dissemination Requests on a periodic basis.
Database Design & Creation
This project will explore, among other things, getting data into and out of a SQL Database. In fact, we'll explore a number of ways of doing so. To get started, however, we need to create a database that is sufficient for the project and sufficiently illustrative of working with relational databases, without being overly complex. Fortunately, the project requirements are of about the right complexity. Returning to the spec (such as it is), we can see that we need to capture the following information each time an entry is made in a participating blog:
The name of the blog
The title of the entry
The URL of the entry
Info about the blogger (first name, last name, email & phone)
It would also be good to know when the blog entry was first created and when it was last modified.
Flat or Normalized? You certainly could create a flat file that has all the information we need:
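As a rough sketch of the flat design just described (the column names and types below are assumptions based on the prose, not part of the original project):

CREATE TABLE BlogEntryFlat (
    BlogName     varchar(100),   -- the name of the blog
    Title        varchar(200),   -- the title of the entry
    Url          varchar(400),   -- the URL of the entry
    DateCreated  timestamp,
    DateModified timestamp,
    FirstName    varchar(50),    -- blogger details repeated in every row
    LastName     varchar(50),
    Email        varchar(100),
    Phone        varchar(25)
);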

The advantage to this approach is that it is quick, easy, and simple to understand. Putting in a few records, however, quickly reveals why this kind of flat-file database is now reserved for nonprogrammers creating simple projects such as an inventory of their music: even with just a few records, you can see that a great deal of data is duplicated in every record.

This duplication can make the database very hard to maintain and subject to corruption. To prevent this, databases are normalized, a complex subject that boils down to eliminating the duplication by factoring out common elements into separate tables and then reaching into those tables by using unique values (foreign keys). Thus, without bogging ourselves down in db theory, we can quickly redesign this flat database into three tables:
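A hedged sketch of what those three tables might look like in SQL; the table names (BlogEntry, Blog, Blogger) follow the description in the next paragraph, while the column names, types and key definitions are assumptions.

CREATE TABLE Blog (
    BlogId    integer PRIMARY KEY,
    BlogName  varchar(100)
);

CREATE TABLE Blogger (
    BloggerId integer PRIMARY KEY,
    FirstName varchar(50),
    LastName  varchar(50),
    Email     varchar(100),
    Phone     varchar(25)
);

CREATE TABLE BlogEntry (
    EntryId      integer PRIMARY KEY,
    Title        varchar(200),
    Url          varchar(400),
    DateCreated  timestamp,
    DateModified timestamp,
    ShortDesc    varchar(400),
    BlogId       integer REFERENCES Blog(BlogId),       -- foreign key to Blog
    BloggerId    integer REFERENCES Blogger(BloggerId)  -- foreign key to Blogger
);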

In this normalized database, each entry has only the information that is unique to the particular BlogEntry: Title, URL, Date Created, Date Modified, Short Description. The entry also has the ID of the Blogger who wrote the entry and the ID of the Blog that the entry belongs to. A second table holds the information about each Blog, and a third table holds the information about each Blogger. Thus, a given Blogger's first and last name, alias (email address at Microsoft.com) and phone number are entered only ONCE for each blogger, and referenced in each entry by ID (known as a foreign key). 5. Describe the Structural Semantic Data Model (SSM) with relevant examples. Answer: Data modelling addresses a need in information system analysis and design to develop a model of the information requirements as well as a set of viable database structure proposals. The data modelling process consists of: 1. Identifying and describing the information requirements for an information system, 2. Specifying the data to be maintained by the data management system, and 3. Specifying the data structures to be used for data storage that best support the information requirements. A fundamental tool used in this process is the data model, which is used both for specification of the information requirements at the user level and for specification of the data structure for the database. During implementation of a database, the data model guides construction of the schema or data catalog which contains the metadata that describe the DB structure and data semantics that are used

to support database implementation and data retrieval. Data modelling, using a specific data model type, and as a unique activity during information system design, is commonly attributed to Charles Bachman (1969) who presented the Data Structure Diagram as one of the first, widely used data models for network database design. Several alternative data model types were proposed shortly thereafter, the best known of which are the: Relational model (Codd, 1970) and the Entity-relationship, ER, model (Chen, 1976). The relational model was quickly criticized for being 'flat' in the sense that all information is represented as a set of tables with atomic cell values. The definition of well-formed relational models requires that complex attribute types (hierarchic, composite, multi-valued, and derived) be converted to atomic attributes and that relations be normalized. Inter-entity (inter-relation) relationships are difficult to visualize in the resulting set of relations, making control of the completeness and correctness of the model difficult. The relational model maps easily to the physical characteristics of electronic storage media, and as such, is a good tool for design of the physical database. The entity-relationship approach to modelling, proposed by Chen (1976), had two primary objectives: first to visualize inter-entity relationships and second to separate the DB design process into two phases: 1. Record, in an ER model, the entities and inter-entity relationships required "by the enterprise", i.e. by the owner/user of the information system or application. This phase and its resulting model should be independent of the DBMS tool that is to be used for realizing the DB. 2. Translate the ER model to the data model supported by the DBMS to be used for implementation. This two-phase design supports modification at the physical level without requiring changes to the enterprise or user view of the DB content. Also Chen's ER model quickly came under criticism, particularly for its lack of ability to model classification structures. In 1977, (Smith & Smith) presented a method for modelling generalization and aggregation hierarchies that underlie the many extended/enhanced entity-relationship, EER, model types proposed and in use today. 6. Explain the following concepts with respect to SQL3: A) Result Presentation B) Image Retrieval Answer Result Presentation

While there are no new presentation operators in SQL3, both complex and derived attributes can be used as presentation criteria in the standard clauses "group by, having, and order by". However, large objects, LOBs, cannot be used, since two LOBs are unlikely to be identical and have no logical order. SQL3 expands embedded attributes, displaying them in one 'column' or as multiple rows. Depending on the OR-DBMS implementation, the result set is presented either in total, as the first 'n' rows, or one tuple at a time. If an attribute of a relation in the result set is defined as a large object, LOB, its presentation may fill one or more screens/pages for each tuple. SQL3, as a relational language using exact match selection criteria, has no concept of degrees of relevance and thus no support for ranking the tuples in the result set by semantic

nearness to the query. Providing this functionality will require user defined output functions, or specialized document processing subsystems as provided by some OR-DBMS vendors. Image Retrieval Popular knowledge claims that an image is worth a thousand words. Unfortunately, these words are generally not available for image retrieval. In addition, the 1000-word image description may differ from one individual to another depending on their perspective and/or knowledge of the image context. Finding images from an image collection depends on the system being able to 'understand' the query specifications and match these specifications to the images. Matching each stored image to the query specifications at query request time can be very time consuming. Therefore, researchers and developers of information retrieval systems recommend the use of predefined image descriptors or metadata as the basis for image retrieval. Image descriptors can be classified into 2 types: TEXTUAL DESCRIPTORS, such as those recommended by the database community and in the numerous metadata standards, such as Dublin Core, MPEG-7 and CIDOC/CRM. These metadata can again be classified as: 1. Attribute-based metadata, such as creator, dates, genre, (source) image type, size, file name, ..., or 2. Text-based metadata, such as title/caption, subject/keyword lists, free-text descriptions and/or the text surrounding embedded images, for example as used in a html document. VISUAL DESCRIPTORS that can be extracted from the image implementation as recommended and used by the image interpretation community. These descriptors include: 1. The object set, identified within an image, possibly supplemented with a shape thesaurus in a way similar to that done for the terms in text document collections, and 2. Low/pixel level content features describing the color, texture, and/or (primitive) shape within the image. An image retrieval system needs to be able to utilize each of the descriptor types listed above. Most of these systems support text based content indexing as well as some bit-level pattern matching. For example, a search for documents containing images that look like Edvard Grieg assumes that the query processor can use: 1. A picture of Edvard Grieg, retrieved from the DB or given as input through the query language, to search the DB for images containing similar images, and/or 2. A text-based search in the titles and text sections of the document collection and extract associated images, and/or

3. An attribute-based search using the metadata describing the image data. Image descriptors are used to form the basis for one or more image signatures that can be indexed. An image query is analyzed using the same descriptor technique(s), giving a query signature, which is then compared to the image signature(s) to determine similarity between the query specification and the DB image signatures. In general, queries to image collections can be classified as Textual queries, which use a term-based specification of the desired images which can be matched to textual image descriptors, or Visual queries, giving an image example which can be compared to visual descriptors. Each query type has a typical form and requires different processing functions, which is discussed further in the following. Using Extended SQL3 for Image Retrieval SQL3, as defined by the ISO standard, primarily supports attribute-based search, though the string comparison operator LIKE has been extended to search in very long texts stored as CLOBs (character large objects). SQL3 'knows' little about, and has no operators for analyzing/indexing, the content of media objects (including texts) stored as large binary objects (LOBs). Therefore, additional functions for indexing and retrieval of media data have been added to many OR-DBM systems, such as IBM's DB2 or Oracle's InterMedia system. This 'new' functionality is apparent in extensions to the SQL3 language, particularly the create index and the select ... where statements. The result is, in this text, called SQL3/mm (multimedia) or SQL3/Text and SQL3/Image for these specific media types. 1) Attribute-based search Attribute-based queries can be formulated as standard SQL3 queries and processed using a standard OR-DBMS. For example, a query to retrieve 4th of July pictures taken by Joan Nordbotten could be expressed as: select * from image_table where Date_taken = "July 4" and Name='Joan Nordbotten'; This query assumes that the SQL3 query processor can: Transform the verbal date form to an internal representation, Concatenate structured attributes: Name.First, Name.Last, Traverse the link from Image_table.creator to the creator.name attribute, and Output image (blob) data. If not, then relatively simple UDFs can be defined for the DB structure to perform these functions. 2) Using text retrieval Often, the image requester is able to give a verbal description/specification of the content of the required images, for example images showing "humpback whales at play" or "wedding parties at a New England church". These text-based queries can be formulated as free-text

or a term list that can be compared to such text descriptors as description, subjects, title and/or the text surrounding an embedded image, using the text retrieval techniques. Google/Image utilizes this search strategy, returning 248 thumb-nail links (on April 2, 2006) for the query: humpback whales at play. (Note that the top ranked image is actually not of a whale.) If "Google's" whale images (i.e. the Web pages containing whale images) were collected in an extended (with multimedia functionality) or-DB, it would also be possible to use an extended SQL3/Text query such as: select images from image_DB where contains (Web-pages, 'humpback' and 'whales' and 'play'); Both Google/Image and SQL3/Text use term indexes developed on text fields to facilitate query execution, as well as the contains and similar to operators. Query by Image Content, CBIR The term 'image content' is used in the image retrieval community to refer to the implementation content of the image, i.e. to its pixel content. An analysis of this low-level image content produces image signatures of color and texture distributions, as well as identification of primitive shapes. To date, there is little ability to connect this information to the semantics that the image depicts. Perhaps the most well known CBIR image retrieval system (or the one most easily found) is IBM's QBIC (query by image content) for DB2. QBIC indexes images based on their visual characteristics of color, texture and shape, as well as their descriptive keywords. A QBIC query is given as a sketch of the principal shapes and colors required. QBIC is used as one method for image retrieval in the picture galleries of the virtual Hermitage Museum. 9.2.1 Using content descriptors: color and texture An image is basically a long string of pixels for which each pixel is identified by its place in the image matrix, its color and its intensity. An analysis of the pixel set can give information about the distribution of dominant colors, the image texture, and the shapes formed by marked change in neighboring colors. Though there are many techniques used for color and texture extraction, they are variations and/or refinements of a basic color profile calculation that: Determines the number of colors to use and divides the color spectrum into related color sets. Determines the grid characteristics for the image, i.e. the number, shape and placement of the grid cells to be used for the image analysis. Sums the number of pixels of each color set of each grid cell into a feature vector or image signature or feature histogram. Indexes the feature vector/signature for each image in the DB. An image query is given as an example or seed image or sketch for which a feature vector is calculated in the same way as that used for the DB image set. The query signature is then

compared to the DB image signatures, using a distance measure. Low/pixel level feature-based queries tend to compare whole images and may be formulated as find images similar to "this one"; The VISI prototype can be used to explore the effects of weighting low level features on image retrieval quality. 9.2.2 Identifying shapes image objects Shape identification and recognition is the most difficult challenge to image analysis since it relies on: 1. Isolation of the different objects/shapes within the image, which may not necessarily be whole or standardized, but are likely 'hidden' in the perspectives of the image and then 2. Normalizing the object's size and rotation 3. Identification and possible connection of object parts, for example 'completing' the car which has a person standing in front of it. 4. Semantic identification of the image components/objects. To date, automatic object recognition has only been accomplished for well-defined domains where objects of interest are well known and well defined within the image, such as: Police images of faces and fingerprints, Medical images from domain x-rays, mri and cat scans, and Industrial surveillance of building structures, such as bridges, tunnels or pipelines. An object-based query requires a query language that accepts a visual object/image as an example, asking a question like: find images that contain "this image"; The contains operator for shape/object identification needs to be adapted for image retrieval such that in addition to the Boolean operators AND, OR and NOT, the location/spatial operators should include near, overlapping, within, in foreground/background, .... Use of a shape thesaurus can also expand the search example and thus the likelihood of a good/relevant result list. IBM's QBIC system, included in the Hermitage Museum Web-site, provides some support for this query functionality, as does the VISI prototype, which is based on Oracle/Intermedia's image retrieval system.
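To make the query forms above concrete, a hedged SQL3-style sketch follows. The function IMAGE_SIMILARITY, the bind variable :example_image and the column names are hypothetical illustrations (such operators are supplied by OR-DBMS extenders or written as UDFs, not by standard SQL3); the image_DB table name reuses the example above.

-- Content-based retrieval: rank stored images by the distance between their
-- signature and that of an example image (smaller distance = more similar).
SELECT i.id, i.title
FROM   image_DB i
WHERE  IMAGE_SIMILARITY(i.image, :example_image) < 0.2   -- assumed 0..1 distance scale
ORDER BY IMAGE_SIMILARITY(i.image, :example_image);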

Master of Computer Application (MCA) Semester 4 MC0077 Advanced Database Systems 4 Credits
(Book ID: B0882) Assignment Set 2 1. Explain the following with respect to Object Oriented databases: A) Query Processing Architecture B) Object Relational Database Implementation Answer: Query Processing Architecture There are several views on architecture of query processing. Figure below presents basic dependencies between three fundamental data structures involved in SBA: an object store, an environment stack and a query results stack. An object store contains volatile (non-shared) objects and persistent (shared) objects. Both stacks contain references to objects. Non-algebraic operators act on the query result stack and the object store and affect the environment stack. Query evaluation takes the state of the environment stack and the state of the objects store and puts query results on the query result stack.

Fig. Dependencies between object store, environment stack and query result stack The next figure presents a more detailed view on the architecture, which involves more data structures (figures with dashed lines) and program modules (grey boxes). This architecture is implemented in our newest project ODRA. The architecture takes into account the subdivision of the storage and processing between client and server, strong typing and query optimization (by rewriting and by indices).

Fig. Architecture of query processing with strong type checking and query optimization Below we present a short description of the architecture elements presented in Fig.25. On the side of

the client application we have the following elements. A source code of a query is created within the software development environment, which includes an editor, a debugger, storage of source programs, storage of compiled programs, etc. A parser of queries and programs takes a query source as input, makes syntactic analysis and returns a query/program syntactic tree. A query/program syntactic tree is a data structure which keeps the abstract query syntax in a well-structured form, allowing for easy manipulation (e.g. inserting new nodes or subtrees, moving some subtree to another part of the tree, removing some subtrees, etc.). Each node of the tree contains a free space for writing various query optimization information. The strong type checker takes a query/program syntactic tree and checks if it conforms to the declared types. Types are recorded within a client local metabase and within the metabase of persistent objects that is kept on the server. The metabases contain information from declarations of volatile object types (that are a part of source programs) and from a database schema. The module that organizes the metabases is not shown. The strong type checker uses two stacks, static ENVS and static QRES, which simulate actual execution of a query during compile time. Static stacks contain signatures of environments and signatures of results (will be explained much later). The static type checker has several other functions. In particular, it changes the query syntactic tree by introducing new nodes that allow, in particular, for automatic dereferences, automatic coercions, for resolving ellipses and for dynamic type checks (if static checks are impossible). The checker introduces also additional information to the nodes of the query syntactic tree that is necessary further for query optimization. The most important information is the level of the ENVS stack on which a particular name will be bound. Static ENVS (S_ENVS) - static environment stack (will be explained much later). Static QRES (S_QRES) - static result stack (will be explained much later). Local metabase - a data structure containing information of types introduced in source programs. Optimization by rewriting - this is a program module that changes the syntactic tree that is already decorated by the strong type checker. There are several rewriting methods that are developed for SBA, in particular: Performing calculations on literals. Changing the order of execution of algebraic operators. Application of the query modification technique, which changes invocations of views into view bodies. To this end, the optimization module refers to the register of views that is kept on the server. Removing dead subqueries, i.e. subqueries that do not influence the final query result. Factoring out independent subqueries, that is, subqueries whose result is not changed within some loop. Shifting conditions as close as possible to the proper operator, e.g. shifting selection condition before a join. Methods based on distributivity property of some query operators. Perhaps other rewriting methods that are currently under investigation. Optimization by indices - this is a program module that changes the syntactic tree that is already decorated by the strong type checker. Changes concerns some subtrees that can be substituted by invocation of indices. To this end, the optimization module refers to the register of indices that is kept on the server. Changes depend on the kind of an index. The module

can also be extended to deal with cached queries. Interpreter of queries/programs. It processes the optimized query syntactic tree and performs execution of the query. To this end it uses two run-time stacks, ENVS and QRES, referes to volatile (non-shared) objects that are kept on the client and to persistent (shared) objects that are kept on the server. Object on the server are available through object manager, i.e. some API that performs everything on persistent objects that is needed. On the side of the database server we have the following architectural elements: Persistent (shared) objects - this is a part of the object store commonly known as a database. Object manager - this is a low-level API that performs everything on persistent objects that is needed. Note that unlike SQL this API does not involve queries, but more atomic operations like get first object Emp, get next object Emp, etc. Metabase of persistent objects - this is a compiled database schema plus some additional information, e.g. necessary for optimization. Processing persistent abstractions (views, stored procedures, triggers) - essentially, this module contains all basic elements of the client side and extends them by additional functionalities. Register of indices and register of views are data structures that contain and externalize the information of created indices and created views. The information is used by the client for query optimization. Internally, this information is fulfilled by the administration module. Administration module - makes all operations that are necessary on the side of the server, e.g. introducing a new index, removing an index, introducing a new view, changing the database schema, etc. In this architecture we assume that the query/program interpreter (which will be further formalized as the procedure eval) acts directly on a syntactic tree. Syntactic trees are the most convenient way to represent and to process query evaluation plans, perhaps, involving some optimizations. Cost-based query optimizers can generate several such query evaluation plans to take out one of them that is the most promising in terms of the anticipated query evaluation cost. According to our implementation experience, the interpreter acting directly on a query syntactic tree (transformed by strong typing and optimization modules) is the most convenient, flexible and easy to implement solution. However, we can also assume that this tree will be further compiled to some bytecode and the interpreter will act on this bytecode. This solution was implemented in Loqis. It is also possible that the tree will be further converted to the machine code. This solution was implemented in Linda, however, actually we see no essential advantages of it and a lot of disadvantages. This view on the query processing architecture, although quite detailed, still can be augmented by new architectural elements, e.g. by a cost-based query optimizer, by introducing user sub-schemas, and many others. They will be refined in further parts of the SBA description. Object Relational Database Implementation

An object database (also object-oriented database management system) is a database management system in which information is represented in the form of objects as used in object-oriented programming. Object databases are a niche field within the broader database management system (DBMS) market dominated by relational database

management systems. Object databases have been considered since the early 1980s and 1990s, but they have made little impact on mainstream commercial data processing, though there is some usage in specialized areas. When database capabilities are combined with object-oriented programming language capabilities, the result is an object-oriented database management system (OODBMS). OODBMS allow object-oriented programmers to develop the product, store them as objects, and replicate or modify existing objects to make new objects within the OODBMS. Because the database is integrated with the programming language, the programmer can maintain consistency within one environment, in that both the OODBMS and the programming language will use the same model of representation. Relational DBMS projects, by way of contrast, maintain a clearer division between the database model and the application. As the usage of web-based technology increases with the implementation of Intranets and extranets, companies have a vested interest in OODBMS to display their complex data. Using a DBMS that has been specifically designed to store data as objects gives an advantage to those companies that are geared towards multimedia presentation or organizations that utilize computer-aided design (CAD). Some object-oriented databases are designed to work well with object-oriented programming languages such as Delphi, Ruby, Python, Perl, Java, C#, Visual Basic .NET, C++, Objective-C and Smalltalk; others have their own programming languages. OODBMSs use exactly the same model as object-oriented programming languages. Most object databases also offer some kind of query language, allowing objects to be found by a more declarative programming approach. It is in the area of object query languages, and the integration of the query and navigational interfaces, that the biggest differences between products are found. An attempt at standardization was made by the ODMG with the Object Query Language, OQL. Access to data can be faster because joins are often not needed (as in a tabular implementation of a relational database). This is because an object can be retrieved directly without a search, by following pointers. (It could, however, be argued that "joining" is a higher-level abstraction of pointer following.) Another area of variation between products is in the way that the schema of a database is defined. A general characteristic, however, is that the programming language and the database schema use the same type definitions. Multimedia applications are facilitated because the class methods associated with the data are responsible for its correct interpretation. Many object databases, for example VOSS, offer support for versioning. An object can be viewed as the set of all its versions. Also, object versions can be treated as objects in their own right. Some object databases also provide systematic support for triggers and constraints which are the basis of active databases. The efficiency of such a database is also greatly improved in areas which demand massive amounts of data about one item. For example, a banking institution could get the user's account information and provide them efficiently with extensive information such as

transactions, account information entries etc. The Big O Notation for such a database paradigm drops from O(n) to O(1), greatly increasing efficiency in these specific cases.
2. Describe the following with respect to SQL3 DB specification: A) Complex Structures B) Hierarchical Structures C) Relationships D) Large Objects, LOBs E) Storage of LOBs

Answer: Complex Structures: 1. Create row type Address_t defines the address structure that is used in line 8. 2. Street#, Street, ... are regular SQL2 specifications for atomic attributes. 3. PostCode and Geo-Loc are both defined as having user defined data types, Pcode

and Point respectively. 4. Pcode is typically locally defined as a list or table of valid postal codes, perhaps with the post office name. Point would typically be a data type defined in a spatial data management subsystem supplied by an OR-DBMS vendor (as an extender, data blade, or cartridge from IBM, Informix, or Oracle respectively). 5. Create function Age_f defines a function for calculation of an age, as a decimal value, given a start date as the input argument and using a simple algorithm based on the current date. This function is used as the data type in line 9 and will be activated each time the Person.age attribute is retrieved. The function can also be used as a condition clause in a SELECT statement. 6. Create table PERSON initiates specification of the implementation structure for the Person entity-type. 7. Id is defined as the primary key. The not null phrase only controls that some 'not null' value is given. The primary key phrase indicates that the DBM is to guarantee that the set of values for Id are unique. 8. Name has a data-type, PersName, defined as a Row type similar to the one defined in lines 1-3. BirthDate is a date that can be used as the argument for the function Age_f defined in line 4. 9. Address is defined using the row type Address_t, defined in lines 1-3. Picture is defined as a BLOB, or Binary Large OBject. Note that there are no functions for content search, manipulation or presentation, which support BLOB data types. These must be defined either by the user as user-defined functions, UDFs, or by the OR-DBMS vendor in a supplementary subsystem. In this case, we need functions for image processing. 10. Age is defined as a function, which will be activated each time the attribute is retrieved. This costs processing time (though this algorithm is very simple), but gives a correct value each time the attribute is used.
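The numbered notes above refer to a DDL listing that is not reproduced in this text. A hedged reconstruction of what such a listing might look like is sketched below in SQL3-style pseudo-DDL; attribute names and sizes are assumptions based on the notes, and the exact syntax for row types and function-valued attributes varies between OR-DBMS products.

create row type Address_t (
    Street#   varchar(8),
    Street    varchar(40),
    PostCode  Pcode,         -- user-defined type listing valid postal codes
    Geo_Loc   Point );       -- spatial type from an OR-DBMS extender

create function Age_f (BirthDate date) returns decimal(5,2)
    -- simple algorithm based on the current date (syntax varies by system)
    return (current_date - BirthDate) / 365.25;

create table PERSON (
    Id        integer not null primary key,
    Name      PersName,              -- row type defined like Address_t
    BirthDate date,
    Address   Address_t,
    Picture   blob,
    Age       Age_f(BirthDate) );    -- attribute whose "data type" is the function (note 10)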

Hierarchical structures
1. Create table STUDENT initiates specification of the implementation of a subclass

entity type. 2. GPA, Level, ... are the attributes for the subclass, here with simple SQL2 data types. 3. under PERSON specifies the table as a subclass of the table PERSON. The DBM thus knows that when the STUDENT table is requested, all attributes and functions in PERSON are also relevant. An OR-DBMS will store and use the primary key of PERSON as the key for STUDENT, and execute a join operation to retrieve the full set of attributes. 4. Create table COURSE specifies a new table specification, as done for statements in lines 5 and 10 above. 5. Id, Name, and Level are standard atomic attribute types with SQL2 data types. Id is defined as requiring a unique, non null value, as specified for PERSON in line 6 above. Note that attributes must have unique names within their tables, but the name may be reused, with different data domains in different tables. Both Id and Name are such attribute-names, appearing in both PERSON and COURSE, as is Level used in STUDENT and COURSE. 6. Course.Description is defined as a character large object, CLOB. A CLOB data type has the same defined character-string functions as char, varchar, and long char, and can be compared to these. User_id is defined as a Ucode, which is the name of a user defined data type, presumably a list of acceptable user codes. The DB implementer must define both the data type and the appropriate functions for processing this type. 7. User_Id is also specified as a foreign key which links the Course records to their "user" record, modelled as a category subentity-type, through the primary key in the User table. Relationships This definition needs only SQL2 specifications. Note that: {Sid, Cid, and Term} form the primary key, PK. Since the key is composite, a separate Primary key clause is required. (As compared with the single attribute PK specifications for PERSON.Id and COURSE.Id.) The 2 foreign key attributes in the PK, must be defined separately. TakenBy.Report is a foreign key to a report entity-type, forming a ternary relationship as modelled in Figure 3.2a. The ON DELETE trigger is activated if the Report relation is deleted and assures that the FK link has a valid value, in this case 'null'. Large OBjects, LOBs The SSM syntax includes data types for potentially very long media types, such as text, image, audio and video. If this model is to be realized in a single database, the DMS will have to have the capability to manage - store, search, retrieve, and manipulate different media types. Object-relational dbms vendors claim to be able to do this.

SQL3 provides support for storage of Binary Large OBjects, BLOBs. A BLOB is simply a very long bit string, limited in many systems today to 2 or 4GB. Several or-dbms vendors differentiate BLOBs into data-types that give more information about the format of the content and provide basic/primitive manipulation functions for these large object, LOB, types. For example, IBM's DB2 has 3 LOB types: BLOB for long bit strings, CLOB for long character strings, and DBCLOB for double-byte character strings. Oracle data types for large objects are BLOB, CLOB, NCLOB (fixed-width multi-byte CLOB) and BFILE (binary file stored outside the DB). Note that the 1st 3 are equivalent to the DB2 LOBs, while the last is really not a media data-type, but rather is a link to an externally stored media object. SQL3 has no functions for processing, f.ex. indexing the content of a BLOB, and provides only functions to store and retrieve it given an external identifier. For example, if the BLOB is an image, SQL3 does not 'know' how to display it, i.e. it has no functions for image presentation. DBMS vendors who provide differentiated blob types have also extended the basic SQL string comparison operators so that they will function for LOBs, or at least CLOBs. These operators include the pattern match function "LIKE", which gives a true/false response if the search string is found/not found in the *LOB attribute. Note: "LIKE" is a standard SQL predicate that simply has been extended to search very long data domains. Storage of LOBs There are 3 strategies for storing LOBs in an or-DB: 1. Embedded in a column of the defining relation, or 2. Stored in a separate table within the DB, linked from the *LOB column of the defining relation. 3. Stored on an external (local or geographically distant) medium, again linked from the *LOB column of the defining relation. Embedded storage in the defining relation closely maps the logical view of the media object with its physical storage. This strategy is best if the other attributes of the table are primarily structural metadata used to specify display characteristics, for example length, language, format. The entity-types "Text" and "Image" are candidates for this storage form. The problem with embedded storage is that a DMS must transfer at least a whole tuple, more commonly a block of tuples, from storage for processing. If blobs are embedded in the tuples, a great deal of data must be transmitted even if the LOB objects are not part of the query selection criteria or the result. For example, a query retrieving the name and address of persons living in Bergen, Norway, would also retrieve large quantities of image data if the data for the Person.Picture attribute were stored as an embedded column in the Person table.

Separate table storage gives indirect access via a link in the defining relation and delays retrieval of the LOB until it is to be part of the query result set. Though this gives a two-step retrieval, for example when requesting an image of Joan Nordbotten, it will reduce general or average transfer time for the query processing system. A drawback of this storage strategy is a likely fragmentation of the DB area, as LOBs can be stored 'anywhere'. This will decrease the efficiency of any algorithm searching the content of a larger set of LOBs, for example to find images that are similar to or contain a given image segment. As usual, the storage structure chosen for a DB should be based on an analysis of anticipated user queries. External storage is useful if the DB data is 'connected' to established media databases, either locally on CD, DVD, ... or on other computers in a network, as will most likely be the case when sharing media data stored in autonomous applications, such as cooperating museums, libraries, archives, or government agencies. This storage structure eliminates the need for duplication of large quantities of data that are normally offered in read-only mode. The cost is in access time, which may currently be nearly unnoticeable. A good multimedia DMS should support each of these storage strategies. 3. Explain: A) Data Dredging B) Data Mining Techniques Answer: Data Dredging Data dredging, sometimes referred to as "data fishing", is a data mining practice in which large volumes of data are analyzed seeking any possible relationships between data. The traditional scientific method, in contrast, begins with a hypothesis and follows with an examination of the data. Sometimes conducted for unethical purposes, data dredging often circumvents traditional data mining techniques and may lead to premature conclusions. Data dredging is sometimes described as "seeking more information from a data set than it actually contains." Data dredging sometimes results in relationships between variables announced as significant when, in fact, the data require more study before such an association can legitimately be determined. Many variables may be related through chance alone; others may be related through some unknown factor. To make a valid assessment of the relationship between any two variables, further study is required in which isolated variables are contrasted with a control group. Data dredging is sometimes used to present an unexamined concurrence of variables as if they led to a valid conclusion, prior to any such study. Although data dredging is often used improperly, it can be a useful means of finding surprising relationships that might not otherwise have been discovered. However, because


the concurrence of variables does not constitute information about their relationship (which could, after all, be merely coincidental), further analysis is required to yield any useful conclusions. Data Mining Techniques Data mining is sorting through data to identify patterns and establish relationships. Data mining parameters include:
Association - looking for patterns where one event is connected to another event
Sequence or path analysis - looking for patterns where one event leads to another later event
Classification - looking for new patterns (may result in a change in the way the data is organized, but that's OK)
Clustering - finding and visually documenting groups of facts not previously known
Forecasting - discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics)
Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.
4. Describe the following Data Mining Functions: A) Classification B) Associations C) Sequential/Temporal patterns D) Clustering/Segmentation Answer:


Classification Data mine tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple, and these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class. When learning classification rules the system has to find the rules that predict the class from the predicting attributes, so firstly the user has to define conditions for each class; the data mine system then constructs descriptions for the classes. Basically, given a case or tuple with certain known attribute values, the system should be able to predict what class this case belongs to. Once classes are defined, the system should infer rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should only refer to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where LHS is true, RHS is also true or very probable. The categories of rules are: exact rule - permits no exceptions, so each object of LHS must be an element of RHS; strong rule - allows some exceptions, but the exceptions have a given limit; probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability P(RHS). Other types of rules are classification rules, where LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS. Associations Given a collection of items and a set of records, each of which contain some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule. A typical application, identified by IBM, that can be built using an association function is Market Basket Analysis. This is where a retailer runs an association operator over the point of sales transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand toaster is sold, customers also buy a set of kitchen gloves and matching cover sets." Another example of the use of associations is the analysis of the claim forms submitted by patients to a medical insurance company. Every claim form contains a set of medical procedures that were performed on a given patient during one visit. By defining the set of items to be the collection of all medical procedures that can be performed on a patient and the records to correspond to each claim form, the application can find, using the association function, relationships among medical procedures that are often performed together. Sequential/Temporal patterns Sequential/temporal pattern functions analyse a collection of records over a period of time, for example to identify trends. Where the identity of a customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who did the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has the information, for each customer, of the sets of products that the customer buys in every

purchase order. A sequential pattern function will analyse such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven. Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns. Use of these functions on, for example, a set of insurance claims can lead to the identification of frequently occurring sequences of medical procedures applied to patients, which can help identify good medical practices as well as to potentially detect some medical insurance fraud. Clustering/Segmentation Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters. Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then it has to find descriptions that describe each of these subsets. There are a number of approaches for forming clusters. One approach is to form rules which dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition.
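As an illustration of the association function described above (the market basket case), a minimal SQL sketch follows; the SALES table, its transaction_id/product_id columns and the thresholds are assumptions chosen for illustration only.

-- Pairwise item affinities with support (pair_cnt) and a confidence
-- estimate: how often item_b appears in transactions that contain item_a.
WITH pair_counts AS (
    SELECT a.product_id AS item_a, b.product_id AS item_b, COUNT(*) AS pair_cnt
    FROM   sales a
           JOIN sales b ON a.transaction_id = b.transaction_id
                       AND a.product_id <> b.product_id
    GROUP BY a.product_id, b.product_id
),
item_counts AS (
    SELECT product_id, COUNT(DISTINCT transaction_id) AS trans_cnt
    FROM   sales
    GROUP BY product_id
)
SELECT p.item_a, p.item_b, p.pair_cnt,
       100.0 * p.pair_cnt / i.trans_cnt AS confidence_pct
FROM   pair_counts p
       JOIN item_counts i ON i.product_id = p.item_a
WHERE  p.pair_cnt >= 10               -- minimum support threshold
ORDER BY confidence_pct DESC;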
5. Explain the following concepts in the context of Fuzzy Databases: A) Need for Fuzzy Databases B) Techniques for implementation of Fuzziness in Databases C) Classification of Data Answer: Need for Fuzzy Databases

As the application of database technology moves outside the realm of a crisp mathematical world to the realm of the real world, the need to handle imprecise information becomes important, because a database that can handle imprecise information stores not only raw data but also related information that allows us to interpret the data in a much deeper context. For example, the query "Which student is young and has sufficiently good grades?" captures the real intention of the user better than a crisp query such as SELECT * FROM STUDENT WHERE AGE < 19 AND GPA > 3.5.
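As a minimal sketch of how such a soft query could be evaluated on top of an ordinary relational table (the STUDENT columns and the membership functions for "young" and "sufficiently good grades" are illustrative assumptions, not part of any standard), each row can be given a degree of match between 0 and 1 instead of a hard accept/reject:

-- Illustrative only: STUDENT(name, age, gpa) with arbitrary piecewise-linear membership functions.
SELECT name, age, gpa, match_degree
FROM  (SELECT name, age, gpa,
              LEAST(                                        -- fuzzy AND taken as the minimum
                GREATEST(0, LEAST(1, (25 - age) / 7)),      -- "young": 1 at age 18 or below, 0 at 25 or above
                GREATEST(0, LEAST(1, (gpa - 2.5)))          -- "good grades": 0 at GPA 2.5, 1 at 3.5 or above
              ) AS match_degree
       FROM   student)
WHERE  match_degree > 0
ORDER  BY match_degree DESC;

A student aged 20 with a GPA of 3.4, whom the crisp query would reject outright, now simply receives a lower degree of match and still appears in the ranked answer.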

Such a technology has wide applications in areas such as medical diagnosis, employment, investment, etc., because in such areas subjective and uncertain information is not only common but also very important. Techniques for implementation of Fuzziness in Databases
One of the major concerns in the design and implementation of fuzzy databases is efficiency, i.e. these systems must be fast enough to make interaction with human users feasible. In general, there are two feasible ways to incorporate fuzziness in databases:
1. Making fuzzy queries to classical databases.
2. Adding fuzzy information to the system.
Classification of Data

The information data can be classified as follows:
1. Crisp: There is no vagueness in the information, e.g., X = 13, Temperature = 90.
2. Fuzzy: There is vagueness in the information, and this can be further divided into two types:
a) Approximate Value: The information is not totally vague; there is some approximate value which is known, and the data lies near that value, e.g., 10 ≤ X ≤ 15, Temperature ≈ 85. These are considered to have a triangular-shaped possibility distribution, as shown below.

Figure: Possibility Distribution for an approximate value (APPROX X); the membership is 1 at X and falls to 0 at X - d and X + d. The parameter d gives the range around which the information value lies.
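A minimal sketch of such a triangular possibility distribution, expressed as a membership degree derived in a query, is shown below; the table READINGS(TEMPERATURE) and the values X = 85, d = 5 are illustrative assumptions only.

-- Triangular membership for "Temperature is approximately 85" with spread d = 5:
--   mu(v) = max(0, 1 - |v - 85| / 5), which is 1 at 85 and falls to 0 at 80 and 90.
SELECT temperature,
       GREATEST(0, 1 - ABS(temperature - 85) / 5) AS mu_approx_85
FROM   readings;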

b) Linguistic Variable: A linguistic variable is a variable that, apart from representing a fuzzy number, also represents linguistic concepts interpreted in a particular context. Each linguistic variable is defined in terms of a base variable which either has a physical interpretation (speed, weight, etc.) or is any other numerical variable (salary, absences, GPA, etc.). A linguistic variable is fully characterized by a quintuple <v, T, X, g, m> where:
v is the name of the linguistic variable,
T is the set of linguistic terms that apply to this variable,
X is the universal set of values of the base variable,
g is a grammar for generating the linguistic terms, and
m is a semantic rule that assigns to each term t ∈ T a fuzzy set on X.
The information in this case is totally vague and we associate a fuzzy set with the information. A linguistic term is the name given to the fuzzy set, e.g., X is SMALL, Temperature is HOT.

These are considered to have a trapezoidal-shaped possibility distribution, as shown in the figure described below.
Figure: Possibility Distribution for the Linguistic Term SMALL for the Linguistic Variable HEIGHT. There are four parameters, α, β, γ and δ, associated with a linguistic term, as shown in the figure: over the range [β, γ] the membership value is 1.0, over the ranges [α, β] and [γ, δ] the membership value lies between 0.0 and 1.0, and outside [α, δ] it is 0.0. A minimal SQL sketch of such a membership function is given below.
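The following is a hedged sketch of such a trapezoidal membership function evaluated in a query; the table PERSON(HEIGHT) and the parameter values (α, β, γ, δ) = (100, 120, 150, 165) cm are purely illustrative assumptions.

-- Trapezoidal membership for the linguistic term SMALL of HEIGHT (values in cm are illustrative).
SELECT height,
       CASE
         WHEN height <= 100 OR height >= 165 THEN 0                -- outside [alpha, delta]
         WHEN height >= 120 AND height <= 150 THEN 1               -- plateau [beta, gamma]
         WHEN height <  120 THEN (height - 100) / (120 - 100)      -- rising edge [alpha, beta]
         ELSE                    (165 - height) / (165 - 150)      -- falling edge [gamma, delta]
       END AS mu_small
FROM   person;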
6. Explain the following concepts with respect to Distributed Database Systems: A) Data Replication B) Options for Multi Master Replication Answer:

Data Replication
Replication is the process of copying and maintaining database objects, such as tables, in multiple databases that make up a distributed database system. Changes applied at one site are captured and stored locally before being forwarded and applied at each of the remote locations. Replication supports a variety of applications that often have different requirements. For example, sales force automation, field service, retail, and other mass deployment applications typically require data to be periodically synchronized between central database systems and a large number of small, remote sites, which are often disconnected from the central database. Members of a sales force must be able to complete transactions regardless of whether they are connected to the central database. In this case, remote sites must be autonomous.

A replication object is a database object existing on multiple servers in a distributed database system. In a replication environment, any updates made to a replication object at one site are applied to the copies at all other sites. Oracle Replication enables you to replicate the following types of objects: tables, indexes, views and object views, packages and package bodies, procedures and functions, user-defined types and type bodies, triggers, synonyms, index types, and user-defined operators.

Oracle manages replication objects using replication groups. A replication group is a collection of replication objects that are logically related. A replication group can contain objects from multiple schemas, and a single schema can have objects in multiple replication groups. However, each replication object can be a member of only one replication group. A replication group can exist at multiple replication sites.

Replication environments support two basic types of sites: master sites and materialized view sites. The differences between master sites and materialized view sites are the following:
A replication group at a master site is more specifically referred to as a master group. A replication group at a materialized view site is based on a master group and is more specifically referred to as a materialized view group. Additionally, every master group has exactly one master definition site. A replication group's master definition site is a master site serving as the control center for managing the replication group and the objects in the group.
A master site maintains a complete copy of all objects in a replication group, while materialized views at a materialized view site can contain all or a subset of the table data within a master group. All master sites in a multimaster replication environment communicate directly with one another to continually propagate data changes in the replication group. Materialized view sites contain an image, or materialized view, of the table data from a certain point in time. Typically, a materialized view is refreshed periodically to synchronize it with its master site.

You can organize materialized views into refresh groups. Materialized views in a refresh group can belong to one or more materialized view groups, and they are refreshed at the same time to ensure that the data in all materialized views in the refresh group corresponds to the same transactionally consistent point in time.
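A hedged sketch of how a materialized view site pulls data from a master site follows; the schema, the database link name MASTER_DB, and the hourly refresh interval are illustrative assumptions, and a FAST refresh additionally requires a materialized view log on the master table.

-- Illustrative: a materialized view of the master table hr.employees, refreshed every hour.
-- REFRESH FAST needs a materialized view log at the master site; otherwise use REFRESH COMPLETE.
CREATE MATERIALIZED VIEW hr.employees_mv
  REFRESH FAST
  START WITH SYSDATE NEXT SYSDATE + 1/24
  AS SELECT * FROM hr.employees@master_db;

Refresh groups are then built on top of such views so that a set of related materialized views is always refreshed to the same consistent point in time.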
Options for Multi Master Replication
Asynchronous replication is the most common way to implement multi-master replication. However, you have two other options: synchronous replication and procedural replication.

Synchronous Replication
A multi-master replication environment can use either asynchronous or synchronous replication to copy data. With asynchronous replication, changes made at one master site occur at a later time at all other participating master sites. With synchronous replication, changes made at one master site occur immediately at all other participating master sites. When you use synchronous replication, an update of a table results in the immediate replication of the update at all participating master sites. In fact, each transaction includes all master sites. Therefore, if one master site cannot process a transaction for any reason, then the transaction is rolled back at all master sites. Although you avoid the possibility of conflicts when you use synchronous replication, it requires a very stable environment to operate smoothly. If communication to one master site is not possible because of a network problem, for example, then users can still query replicated tables, but no transactions can be completed until communication is reestablished. Also, it is possible to configure asynchronous replication so that it simulates synchronous replication.

Procedural Replication
Batch processing applications can change large amounts of data within a single transaction. In such cases, typical row-level replication might load a network with many data changes. To avoid such problems, a batch processing application operating in a replication environment can use Oracle's Procedural Replication to replicate simple stored procedure calls to converge data replicas. Procedural replication replicates only the call to a stored procedure that an application uses to update a table. It does not replicate the data modifications themselves. To use procedural replication, you must replicate the packages that modify data in the system to all sites. After replicating a package, you must generate a wrapper for the package at each site. When an application calls a packaged procedure at the local site to modify data, the wrapper ensures that the call is ultimately made to the same packaged procedure at all other sites in the replication environment. Procedural replication can occur asynchronously or synchronously.

Conflict Detection and Procedural Replication
When procedural replication is used, the procedures that replicate data are responsible for ensuring the integrity of the replicated data. That is, you must design such procedures to either avoid or detect replication conflicts and to resolve them appropriately. Consequently, procedural replication is most typically used when databases are modified only with large batch operations. In such situations, replication conflicts are unlikely because numerous transactions are not contending for the same data.
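As a hedged sketch of how a master group might be set up and how the synchronous option is chosen (the group, schema, table, and site names are illustrative; the DBMS_REPCAT procedures shown belong to Oracle Advanced Replication, but their full parameter lists vary by release and should be checked against the documentation):

BEGIN
  -- Create a master replication group and add one table to it.
  DBMS_REPCAT.CREATE_MASTER_REPGROUP(gname => 'hr_repg');
  DBMS_REPCAT.CREATE_MASTER_REPOBJECT(gname => 'hr_repg',
                                      type  => 'TABLE',
                                      sname => 'hr',
                                      oname => 'employees');
  -- Add a second master site; propagation_mode selects between the default
  -- asynchronous behaviour and the synchronous replication described above.
  DBMS_REPCAT.ADD_MASTER_DATABASE(gname            => 'hr_repg',
                                  master           => 'orc2.example.com',
                                  propagation_mode => 'SYNCHRONOUS');
END;
/

Procedural replication is configured in a similar way by replicating the packages themselves (and generating wrappers for them) rather than the row changes they produce.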
