
Question Paper

Data Warehousing and Data Mining (MC332) : January 2007


Section A : Basic Concepts (30 Marks)
This section consists of questions with serial number 1 - 30. Answer all questions. Each question carries one mark. Maximum time for answering Section A is 30 Minutes.

1. Which of the following forms the logical subset of the complete data warehouse?
(a) Dimensional model (b) Fact table (c) Dimensional table (d) Operational Data Store (e) Data Mart.

2. Which of the following is not included in Modeling Applications?
(a) Forecasting models (b) Behavior scoring models (c) Allocation models (d) Data mining models (e) Metadata driven models.

3. Which of the following is a dimension that means the same thing with every possible fact table to which it can be joined?
(a) Permissible snowflaking (b) Conformed Dimensions (c) Degenerate dimensions (d) Junk Dimensions (e) Monster Dimensions.

4. Which of the following is not a managing issue in the modeling process?
(a) Content of primary units column (b) Document each candidate data source (c) Do regions report to zones (d) Walk through business scenarios (e) Ensure that the transaction edit flag is used for analysis.

5. Which of the following criteria is not used for selecting the data sources?
(a) Data Accessibility (b) Platform (c) Data accuracy (d) Longevity of the feed (e) Project scheduling.

6. Which of the following does not relate to the data modeling tool?
(a) Link to the dimension table designs (b) Business user documentation (c) Helps assure consistency in naming (d) Length of the logical column (e) Generates physical object DDL.

7. Which of the following is true on building a matrix for Data Warehouse bus architecture?
(a) Data marts as columns and dimensions as rows (b) Dimensions as rows and facts as columns (c) Data marts as rows and dimensions as columns (d) Data marts as rows and facts as columns (e) Facts as rows and data marts as columns.

8. Which of the following should not be considered for each dimension attribute?
(a) Attribute name (b) Rapidly changing dimension policy (c) Attribute definition (d) Sample data (e) Cardinality.

9. Which of the following form the set of data created to support a specific short-lived business situation?
(a) Personal Data Marts (b) Application Models (c) Downstream systems (d) Disposable Data Marts (e) Data mining models.

10. Which of the following does not form future access services?
(a) Authentication (b) Report linking (c) Push toward centralized services (d) Vendor consolidation (e) Web based customer access.

11. What is the special kind of clustering that identifies events or transactions that occur simultaneously?
(a) Affinity grouping (b) Classifying (c) Clustering (d) Estimating (e) Predicting.

12. Of the following team members, who do not form the audience for data warehousing?
(a) Data architects (b) DBAs (c) Business Intelligence experts (d) Managers (e) Customers/users.

13. The precalculated summary values are called
(a) Assertions (b) Triggers (c) Aggregates (d) Schemas (e) Indexes.

14. OLAP stands for
(a) Online Analytical Processing (b) Online Attribute Processing (c) Online Assertion Processing (d) Online Association Processing (e) Online Allocation Processing.

15. Which of the following employs data mining techniques to analyze the intent of a user query and provide additional generalized or associated information relevant to the query?
(a) Iceberg Query Method (b) Data Analyzer (c) Intelligent Query Answering (d) DBA (e) Query Parser.

16. Of the following clustering algorithms, which method initially creates a hierarchical decomposition of the given set of data objects?
(a) Partitioning Method (b) Hierarchical Method (c) Density-based Method (d) Grid-based Method (e) Model-based Method.

17. Which one of the following can be performed using attribute-oriented induction in a manner similar to concept characterization?
(a) Analytical characterization (b) Concept Description (c) OLAP based approach (d) Concept Comparison (e) Data Mining.

18. Which one of the following is an efficient association rule mining algorithm that explores level-wise mining?
(a) FP-tree algorithm (b) Apriori Algorithm (c) Level-based Algorithm (d) Partitioning Algorithm (e) Base Algorithm.

19. What allows users to focus the search for rules by providing metarules and additional mining constraints?
(a) Correlation rule mining (b) Multilevel Association rule mining (c) Single level Association rule mining (d) Constraint based rule mining (e) Association rule mining.

20. Which of the following can be used in describing central tendency and data description from the descriptive statistics point of view?
(a) Concept measures (b) Statistical measures (c) T-weight (d) D-weight (e) Generalization.

21. Which of the following is the collection of data objects that are similar to one another within the same group?
(a) Partitioning (b) Grid (c) Cluster (d) Table (e) Data source.

22. In which of the following binning strategies does each bin have approximately the same number of tuples assigned to it?
(a) Equiwidth binning (b) Equidepth binning (c) Homogeneity-based binning (d) Equilength binning (e) Frequent predicate set.

23. Which of the following binning strategies has the same interval size for each bin?
(a) Equiwidth binning (b) Ordinary binning (c) Heterogeneity-based binning (d) Un-Equaling binning (e) Predicate Set.

24. Which of the following associations shows relationships between discrete objects?
(a) Quantitative (b) Boolean (c) Single Dimensional (d) Multidimensional (e) Bidirectional.

25. Which of the following algorithms attempts to improve accuracy by removing tree branches reflecting noise in the data?
(a) Partitioning (b) Apriori (c) Clustering (d) FP tree (e) Pruning.

26. Which of the following processes includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation?
(a) KDD Process (b) ETL Process (c) KTL Process (d) MDX Process (e) DW&DM.

27. What is the target physical machine on which the data warehouse is organized and stored for direct querying by end users, report writers, and other applications?
(a) Presentation server (b) Application server (c) Database server (d) Interface server (e) Data staging server.

28. Which of the following cannot form a category of queries?
(a) Simple constraints (b) Correlated subqueries (c) Simple behavioral queries (d) Derived Behavioral queries (e) Clustering queries.

29. Which of the following is not related to dimension table attributes?
(a) Verbose (b) Descriptive (c) Equally unavailable (d) Complete (e) Indexed.

30. Type 1: Overwriting the dimension record, thereby losing the history; Type 2: Creating a new additional dimension record using a new value of the surrogate key; and Type 3: Creating an "old" field in the dimension record to store the immediate previous attribute value, belong to:
(a) Slowly changing Dimensions (b) Rapidly changing Dimensions (c) Artificial Dimensions (d) Degenerate Dimensions (e) Caveats.

END OF SECTION A

Section B : Problems (50 Marks)


This section consists of questions with serial number 1 - 5. Answer all questions. Marks are indicated against each question. Detailed workings should form part of your answer. Do not spend more than 110 - 120 minutes on Section B.

1. One of the most important assets of an organization is its information, and the data warehouse forms one of these assets. To deliver the data to the end users, build a dimensional model for a Sales application, starting from the ER diagram. (10 marks)

2. a. How do you optimize the backup process for a data warehouse?
   b. Compare and contrast Naïve Bayesian classification and Bayesian belief networks. (4 + 6 = 10 marks)

3. Discuss why analytical characterization is needed and how it can be performed, with an example. (10 marks)

4. Briefly outline how to compute the dissimilarity between objects described by the following types of variables:
i. Asymmetric binary variables.
ii. Nominal variables.
iii. Ratio-scaled variables.
iv. Interval-scaled variables. (10 marks)

5. Analyze and give the benefits of having a data warehouse architecture. (10 marks)

END OF SECTION B

Section C : Applied Theory (20 Marks)


This section consists of questions with serial number 6 - 7. Answer all questions. Marks are indicated against each question. Do not spend more than 25 - 30 minutes on Section C.

6. Describe the Data Warehouse architecture framework. (10 marks)

7. Write short notes on any two of the following:
a. Factless fact tables.
b. Web mining.
c. Market Basket Analysis. (5 + 5 = 10 marks)

END OF SECTION C

END OF QUESTION PAPER

Suggested Answers

Data Warehousing and Data Mining (MC332) : January 2007


Section A : Basic Concepts

1. Answer : (e)
Reason : Data Mart forms the logical subset of the complete data warehouse.

2. Answer : (e)
Reason : Metadata driven models are not included in Modeling Applications.

3. Answer : (b)
Reason : A conformed dimension means the same thing with every possible fact table to which it can be joined.

4. Answer : (e)
Reason : "Ensure that the transaction edit flag is used for analysis" is not a managing issue in the modeling process.

5. Answer : (b)
Reason : Platform is not a criterion used for selecting the data sources.

6. Answer : (d)
Reason : Length of the logical column does not relate to the data modeling tool.

7. Answer : (c)
Reason : Data marts as rows and dimensions as columns is true on building a matrix for Data Warehouse bus architecture.

8. Answer : (b)
Reason : Rapidly changing dimension policy should not be considered for each dimension attribute.

9. Answer : (d)
Reason : Disposable data marts form the set of data created to support a specific short-lived business situation.

10. Answer : (b)
Reason : Report linking does not form future access services.

11. Answer : (a)
Reason : Affinity grouping is the special kind of clustering that identifies events or transactions that occur simultaneously.

12. Answer : (e)
Reason : Customers/users do not form the audience for data warehousing.

13. Answer : (c)
Reason : Aggregates are the precalculated summary values.

14. Answer : (a)
Reason : OLAP stands for Online Analytical Processing.

15. Answer : (c)
Reason : Intelligent Query Answering employs data mining techniques to analyze the intent of a user query and provide additional generalized or associated information relevant to the query.

16. Answer : (b)
Reason : The hierarchical method is a clustering approach which first creates a hierarchical decomposition of the given set of data objects.

17. Answer : (d)
Reason : Concept comparison can be performed using attribute-oriented induction in a manner similar to concept characterization.

18. Answer : (b)
Reason : The Apriori algorithm is an efficient association rule mining algorithm that explores level-wise mining.
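As a sketch of that level-wise exploration, a minimal Apriori pass in Python (toy transactions; frequent itemsets only, no rule generation, and support is re-counted per candidate rather than using a hash tree):

```python
def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets seed the (k+1)-candidates."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    level = {s for s in items
             if sum(s <= t for t in transactions) >= min_support}
    k = 2
    while level:
        for s in level:
            frequent[s] = sum(s <= t for t in transactions)
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only candidates with sufficient support.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support}
        k += 1
    return frequent

txns = [frozenset(t) for t in (["milk", "bread"], ["milk", "bread", "eggs"],
                               ["bread", "eggs"], ["milk", "eggs"])]
freq = apriori(txns, min_support=2)
print(freq[frozenset(["milk", "bread"])])  # 2
```

With min_support = 2, all three pairs survive level 2, but the triple {milk, bread, eggs} appears in only one transaction and is pruned, ending the level-wise search.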

19. Answer : (d)
Reason : Constraint based rule mining allows users to focus the search for rules by providing metarules and additional mining constraints.

20. Answer : (b)
Reason : Statistical measures can be used in describing central tendency and data description from the descriptive statistics point of view.

21. Answer : (c)
Reason : A cluster is a collection of data objects that are similar to one another within the same group.

22. Answer : (b)
Reason : Equidepth binning is a strategy where each bin has approximately the same number of tuples assigned to it.

23. Answer : (a)
Reason : Equiwidth binning is the binning strategy where the interval size of each bin is the same.
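The contrast between the two strategies in answers 22 and 23 can be sketched in Python (a minimal illustration with made-up data, not tied to any particular mining library):

```python
def equiwidth_bins(values, k):
    """Equiwidth: split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Clamp the top edge so max(values) falls in the last bin.
        i = min(int((v - lo) / width), k - 1)
        bins[i].append(v)
    return bins

def equidepth_bins(values, k):
    """Equidepth: give each bin roughly the same number of tuples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equiwidth_bins(data, 3))  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equidepth_bins(data, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```

Note that equiwidth bins have equal intervals but unequal counts, while equidepth bins have equal counts but unequal intervals.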

24. Answer : (b)
Reason : Boolean association shows relationships between discrete objects.

25. Answer : (e)
Reason : Pruning attempts to improve accuracy by removing tree branches reflecting noise in the data.

26. Answer : (a)
Reason : The KDD process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

27. Answer : (a)
Reason : The presentation server is the target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications.

28. Answer : (e)
Reason : Clustering queries cannot form a category of queries.

29. Answer : (c)
Reason : Equally unavailable is not related to dimension table attributes.

30. Answer : (a)
Reason : Slowly changing dimensions are handled by Type 1: overwriting the dimension record, thereby losing the history; Type 2: creating a new additional dimension record using a new value of the surrogate key; and Type 3: creating an "old" field in the dimension record to store the immediate previous attribute value.
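The three strategies in answer 30 can be sketched as operations on a small customer dimension (a hypothetical, in-memory illustration; the row layout and column names are invented, and real implementations work against dimension tables):

```python
# One row of a customer dimension; surrogate_key is the warehouse key.
dim = [{"surrogate_key": 1, "customer_id": "C1",
        "city": "Pune", "prev_city": None}]

def scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite the record in place, losing the history."""
    for r in rows:
        if r["customer_id"] == customer_id:
            r["city"] = new_city

def scd_type2(rows, customer_id, new_city):
    """Type 2: add a new record under a fresh surrogate key, keeping history."""
    new_key = max(r["surrogate_key"] for r in rows) + 1
    rows.append({"surrogate_key": new_key, "customer_id": customer_id,
                 "city": new_city, "prev_city": None})

def scd_type3(rows, customer_id, new_city):
    """Type 3: keep the immediate previous value in an "old" field."""
    for r in rows:
        if r["customer_id"] == customer_id:
            r["prev_city"] = r["city"]
            r["city"] = new_city
```

For example, scd_type3(dim, "C1", "Mumbai") leaves one row with city "Mumbai" and prev_city "Pune", whereas scd_type2 would instead append a second row with surrogate key 2.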

Section B : Problems

1. Central Data Warehouse Design

This represents the wholesale level of the data warehouse, which is used to supply data marts with data. The most important requirement of the central data warehouse is that it provides a consistent, integrated and flexible source of data. We argue that traditional data modeling techniques (Entity Relationship models and normalization) are most appropriate at this level. A normalized database design ensures maximum consistency and integrity of the data. It also provides the most flexible data structure: new data can be easily added to the warehouse in a modular way, and the database structure will support any analysis requirements. Aggregation or denormalization at this stage will lose information and restrict the kind of analyses which can be carried out. An enterprise data model, if one exists, should be used as the basis for structuring the central data warehouse.

Data Mart Design

Data marts represent the retail level of the data warehouse, where data is accessed directly by end users. Data is extracted from the central data warehouse into data marts to support particular analysis requirements. The most important requirement at this level is that data is structured in a way that is easy for users to understand and use. For this reason, dimensional modeling techniques are most appropriate at this level. This ensures that data structures are as simple as possible in order to simplify user queries. The following describes an approach for developing dimensional models from an enterprise data model.

DATA WAREHOUSE DESIGN

A simple example is used to illustrate the design approach. The following figure shows an operational data model for a sales application. The highlighted attributes indicate the primary keys of each entity.

Such a model is typical of data models that are used by operational (OLTP) systems, and is well suited to a transaction processing environment. It contains no redundancy, thus maximizing efficiency of updates, and explicitly shows all the data and the relationships between them. Unfortunately, most decision makers would find this schema incomprehensible: even quite simple queries require multi-table joins and complex subqueries. As a result, end users will be dependent on technical specialists to write queries for them.

Step 1. Classify Entities

The first step in producing a dimensional model from an Entity Relationship model is to classify the entities into three categories:

Transaction Entities
Transaction entities record details about particular events that occur in the business, for example, orders, insurance claims, salary payments and hotel bookings. Invariably, it is these events that decision makers want to understand and analyze. The key characteristics of a transaction entity are:
- It describes an event that happens at a point in time.
- It contains measurements or quantities that may be summarized, e.g. dollar amounts, weights, volumes.
For example, an insurance claim records a particular business event and (among other things) the amount claimed. Transaction entities are the most important entities in a data warehouse, and form the basis for constructing fact tables in star schemas. Not all transaction entities will be of interest for decision support, so user input will be required in identifying which transactions are important.

Component Entities
A component entity is one which is directly related to a transaction entity via a one-to-many relationship. Component entities define the details or components of each business transaction, and answer the who, what, when, where, how and why of a business event. For example, a sales transaction may be defined by a number of components:
- Customer: who made the purchase
- Product: what was sold
- Location: where it was sold
- Period: when it was sold
An important component of any transaction is time, as historical analysis is an important part of any data warehouse. Component entities form the basis for constructing dimension tables in star schemas.

Classification Entities
Classification entities are entities which are related to component entities by a chain of one-to-many relationships; that is, they are functionally dependent on a component entity (directly or transitively). Classification entities represent hierarchies embedded in the data model, which may be collapsed into component entities to form dimension tables in a star schema. The figure shows the classification of the entities in the example data model.

In the diagram:
- Black entities represent Transaction entities
- Grey entities indicate Component entities
- White entities indicate Classification entities

Resolving Ambiguities

In some cases, entities may fit into multiple categories. We therefore define a precedence hierarchy for resolving such ambiguities:
1. Transaction entity (highest precedence)
2. Classification entity
3. Component entity (lowest precedence)
For example, if an entity can be classified as either a classification entity or a component entity, it should be classified as a classification entity. In practice, some entities will not fit into any of these categories. Such entities do not fit the hierarchical structure of a dimensional model, and cannot be included in star schemas; this is where real world data sometimes does not fit the star schema mould.

Step 2. Identify Hierarchies

Hierarchies are an extremely important concept in dimensional modelling, and form the primary basis for deriving dimensional models from Entity Relationship models. As mentioned, most dimension tables in star schemas contain embedded hierarchies. A hierarchy in an Entity Relationship model is any sequence of entities joined together by one-to-many relationships, all aligned in the same direction. The figure shows a hierarchy extracted from the example data model, with State at the top and Sale Item at the bottom.

In hierarchical terminology:
- State is the parent of Region
- Region is the child of State
- Sale Item, Sale, Location and Region are all descendants of State
- Sale, Location, Region and State are all ancestors of Sale Item

Maximal Hierarchy

A hierarchy is called maximal if it cannot be extended upwards or downwards by including another entity. In all, there are 14 maximal hierarchies in the example data model:
- Customer Type - Customer - Sale - Sale Fee
- Customer Type - Customer - Sale - Sale Item
- Fee Type - Sale Fee
- Location Type - Location - Sale - Sale Fee
- Location Type - Location - Sale - Sale Item
- Period (posted) - Sale - Sale Fee
- Period (posted) - Sale - Sale Item
- Period (sale) - Sale - Sale Fee
- Period (sale) - Sale - Sale Item
- Product Type - Product - Sale Item
- State - Region - Customer - Sale - Sale Fee
- State - Region - Customer - Sale - Sale Item
- State - Region - Location - Sale - Sale Fee
- State - Region - Location - Sale - Sale Item

An entity is called minimal if it is at the bottom of a maximal hierarchy and maximal if it is at the top of one. Minimal entities can be easily identified as they are entities with no one-to-many relationships (leaf entities in hierarchical terminology), while maximal entities are entities with no many-to-one relationships (root entities). In the example data model there are:
- Two minimal entities: Sale Item and Sale Fee
- Six maximal entities: Period, Customer Type, State, Location Type, Product Type and Fee Type.

Step 3. Produce Dimensional Models

Operators for Producing Dimensional Models

We use two operators to produce dimensional models from Entity Relationship models. Higher level entities can be collapsed into lower level entities within hierarchies. The figure shows the State entity being collapsed into the Region entity. The Region entity contains its original attributes plus the attributes of the collapsed table. This introduces redundancy in the form of a transitive dependency, which is a violation of third normal form. Collapsing a hierarchy is therefore a form of denormalisation.

Figure 8. State Entity Collapsed into Region

The next figure shows Region being collapsed into Location. We can continue doing this until we reach the bottom of the hierarchy, and end up with a single table (Sale Item).

Aggregation

The aggregation operator can be applied to a transaction entity to create a new entity containing summarized data. A subset of attributes is chosen from the source entity to aggregate (the aggregation attributes) and another subset of attributes is chosen to aggregate by (the grouping attributes). Aggregation attributes must be numerical quantities. For example, we could apply the aggregation operator to the Sale Item entity to create a new entity called Product Summary, as in the figure. This aggregated entity shows, for each product, the total sales amount (quantity * price), the average quantity per order and the average price per item on a daily basis. The aggregation attributes are quantity and price, while the grouping attributes are Product ID and Date. The key of this entity is the combination of the grouping attributes. Note that aggregation loses information: we cannot reconstruct the details of individual sale items from the product summary table.
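The aggregation operator can be illustrated with a small sketch (the Sale Item rows and column names below are hypothetical stand-ins for the entities in the figure):

```python
from collections import defaultdict

# Hypothetical Sale Item rows: (product_id, date, quantity, price).
sale_items = [
    ("P1", "2007-01-05", 2, 10.0),
    ("P1", "2007-01-05", 1, 10.0),
    ("P2", "2007-01-05", 4, 2.5),
]

# Group by the grouping attributes (product_id, date) and
# aggregate the numerical attributes quantity and price.
groups = defaultdict(list)
for product_id, date, qty, price in sale_items:
    groups[(product_id, date)].append((qty, price))

product_summary = {}
for key, rows in groups.items():
    total_sales = sum(q * p for q, p in rows)        # total sales amount
    avg_qty = sum(q for q, _ in rows) / len(rows)    # average quantity
    avg_price = sum(p for _, p in rows) / len(rows)  # average price
    product_summary[key] = (total_sales, avg_qty, avg_price)

print(product_summary[("P1", "2007-01-05")])  # (30.0, 1.5, 10.0)
```

The key of product_summary is the combination of the grouping attributes, and the two P1 sale items can no longer be reconstructed from it, which is exactly the information loss noted above.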

Figure 10. Aggregation Operator

Dimensional Design Options

There is a wide range of options for producing dimensional models from an Entity Relationship model. These include:
- Flat schema
- Terraced schema
- Star schema
- Snowflake schema
- Star cluster schema
Each of these options represents a different trade-off between complexity and redundancy. Here we discuss how the operators previously defined may be used to produce different dimensional models.

Option 1: Flat Schema

A flat schema is the simplest schema possible without losing information. It is formed by collapsing all entities in the data model down into the minimal entities. This minimizes the number of tables in the database and therefore the possibility that joins will be needed in user queries. In a flat schema we end up with one table for each minimal entity in the original data model. Figure 11 shows the flat schema which results from the example data model.

Figure 11. Flat Schema

Such a schema is similar to the flat files used by analysts using statistical packages such as SAS and SPSS. Note that this structure does not lose any information from the original data model. It contains redundancy, in the form of transitive and partial dependencies, but does not involve any aggregation. One problem with a flat schema is that it may lead to aggregation errors when there are hierarchical relationships between transaction entities. When we collapse numerical amounts from a higher level transaction entity into a lower level one, they will be repeated. In the example data model, if a Sale consists of three Sale Items, the discount amount will be stored in three different rows in the Sale Item table. Adding the discount amounts together then results in double counting (or in this case, triple counting). Another problem with flat schemas is that they tend to result in tables with large numbers of attributes, which may be unwieldy. While the number of tables (system complexity) is minimised, the complexity of each table (element complexity) is greatly increased.

Option 2: Terraced Schema

A terraced schema is formed by collapsing entities down maximal hierarchies, but stopping when they reach a transaction entity. This results in a single table for each transaction entity in the data model. The figure shows the terraced schema that results from the example data model. This schema is less likely to cause problems for an inexperienced user, because the separation between levels of transaction entities is explicitly shown.

Figure 12. Terraced Schema

Option 3: Star Schema

A star schema can be easily derived from an Entity Relationship model. Each star schema is formed in the following way:
- A fact table is formed for each transaction entity. The key of the table is the combination of the keys of its associated component entities.
- A dimension table is formed for each component entity, by collapsing hierarchically related classification entities into it.
- Where hierarchical relationships exist between transaction entities, the child entity inherits all dimensions (and key attributes) from the parent entity. This provides the ability to drill down between transaction levels.
- Numerical attributes within transaction entities should be aggregated by key attributes (dimensions). The aggregation attributes and functions used depend on the application.
The figure shows the star schema that results from the Sale transaction entity. This star schema has four dimensions, each of which contains embedded hierarchies. The aggregated fact is Discount amount.

Figure 13. Sale Star Schema

The figure shows the star schema which results from the Sale Item transaction entity. This star schema has five dimensions: four dimensions inherited from its parent transaction entity (Sale) and one of its own (Product). The aggregated facts are quantity and item cost (quantity * price).
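As a concrete sketch of this fact/dimension structure, a fragment of it can be expressed in SQL, run here through Python's sqlite3 (the table and column names, such as dim_product and fact_sale_item, are illustrative assumptions rather than taken from the figures, and only one dimension is shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: the Product Type hierarchy collapsed into Product.
cur.execute("""CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    product_type TEXT)""")

# Fact table: its key is the combination of its dimension keys;
# quantity and item_cost are the aggregated facts.
cur.execute("""CREATE TABLE fact_sale_item (
    product_id INTEGER REFERENCES dim_product(product_id),
    period_id  INTEGER,
    quantity   INTEGER,
    item_cost  REAL,
    PRIMARY KEY (product_id, period_id))""")

cur.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
cur.execute("INSERT INTO fact_sale_item VALUES (1, 200701, 3, 30.0)")

# A typical star-join query: facts summarized through a dimension hierarchy.
cur.execute("""SELECT d.product_type, SUM(f.item_cost)
               FROM fact_sale_item f
               JOIN dim_product d ON f.product_id = d.product_id
               GROUP BY d.product_type""")
print(cur.fetchall())  # [('Hardware', 30.0)]
```

The final query is the pattern the star schema is optimized for: one join per dimension, then grouping by an attribute of the collapsed hierarchy.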

Figure 14. Sale Item Star Schema

A separate star schema is produced for each transaction table in the original data model.

Constellation Schema

Instead of a number of discrete star schemas, the example data model can be transformed into a constellation schema. A constellation schema consists of a set of star schemas with hierarchically linked fact tables. The links between the various fact tables provide the ability to drill down between levels of detail (e.g. from Sale to Sale Item). The constellation schema which results from the example data model is shown in Figure 15; links between fact tables are shown in bold.

Figure 15. Sales Constellation Schema

Galaxy Schema

More generally, a set of star schemas or constellations can be combined together to form a galaxy. A galaxy is a collection of star schemas with shared dimensions. Unlike a constellation schema, the fact tables in a galaxy do not need to be directly related.

Option 4: Snowflake Schema

In a star schema, hierarchies in the original data model are collapsed or denormalized to form dimension tables. Each dimension table may contain multiple independent hierarchies. A snowflake schema is a star schema with all hierarchies explicitly shown. A snowflake schema can be formed from a star schema by expanding out (normalizing) the hierarchies in each dimension. Alternatively, a snowflake schema can be produced directly from an Entity Relationship model by the following procedure:
- A fact table is formed for each transaction entity. The key of the table is the combination of the keys of the associated component entities.
- Each component entity becomes a dimension table.
- Where hierarchical relationships exist between transaction entities, the child entity inherits all relationships to component entities (and key attributes) from the parent entity.
- Numerical attributes within transaction entities should be aggregated by the key attributes. The attributes and functions used depend on the application.
The figure shows the snowflake schema which results from the Sale transaction entity.

Option 5: Star Cluster Schema

Kimball (1996) argues that snowflaking is undesirable, because it adds complexity to the schema and requires extra joins. Clearly, expanding all hierarchies defeats the purpose of producing simple, user friendly database designs; in the example above, it more than doubles the number of tables in the schema. Here, we argue that neither a pure star schema (fully collapsed hierarchies) nor a pure snowflake schema (fully expanded hierarchies) results in the best solution. As in many design problems, the optimal solution is a balance between two extremes.

The problem with fully collapsing hierarchies occurs when hierarchies overlap, leading to redundancy between dimensions when they are collapsed. This can result in confusion for users, increased complexity in extract processes and inconsistent results from queries if hierarchies become inconsistent. For these reasons, we require that dimensions should be orthogonal. Overlapping dimensions can be identified via forks in hierarchies. A fork occurs when an entity acts as a parent in two different dimensional hierarchies. This results in the entity and all of its ancestors being collapsed into two separate dimension tables. Fork entities can be identified as classification entities with multiple one-to-many relationships. The exception to this rule occurs when the hierarchy converges again lower down; Dampney (1996) calls this a commuting loop.

In the example data model, a fork occurs at the Region entity. Region is a parent of Location and Customer, which are both components of the Sale transaction. In the star schema representation, State and Region would be included in both the Location and Customer dimensions when the hierarchies are collapsed. This results in overlap between the dimensions.

Figure 17. Intersecting Hierarchies in Example Data Model

We define a star cluster schema as one which has the minimal number of tables while avoiding overlap between dimensions. It is a star schema which is selectively snowflaked to separate out hierarchical segments or subdimensions which are shared between different dimensions. Subdimensions effectively represent the highest common factor between dimensions. A star cluster schema may be produced from an Entity Relationship model using the following procedure. Each star cluster is formed by:
- A fact table is formed for each transaction entity. The key of the table is the combination of the keys of the associated component entities.
- Classification entities should be collapsed down their hierarchies until they reach either a fork entity or a component entity. If a fork is reached, a subdimension table should be formed, consisting of the fork entity plus all its ancestors; collapsing should begin again after the fork entity. When a component entity is reached, a dimension table should be formed.
- Where hierarchical relationships exist between transaction entities, the child entity should inherit all dimensions (and key attributes) from the parent entity.
- Numerical attributes within transaction entities should be aggregated by the key attributes (dimensions). The attributes and functions used depend on the application.
The figure shows the star cluster schema that results from the model fragment of the figure above.

Figure 18. Star Cluster Schema

Figure shows how entities in the original data model were clustered to form the star cluster schema. The overlap between hierarchies has now been removed.

Figure 19. Revised Clustering

If required, views may be used to reconstruct a star schema from a star cluster schema. This gives the best of both worlds: the simplicity of a star schema while preserving consistency between dimensions. As with star schemas, star clusters may be combined together to form constellations or galaxies.

Step 4. Evaluation and Refinement

In practice, dimensional modelling is an iterative process. The clustering procedure described in Step 3 is useful for producing a first cut design, but this will need to be refined to produce the final data mart design. Most of these modifications have to do with further simplifying the model and dealing with non-hierarchical patterns in the data.

Combining Fact Tables
Fact tables with the same primary keys (i.e. the same dimensions) should be combined. This reduces the number of star schemas and facilitates comparison between related facts (e.g. budget and actual figures).

Combining Dimension Tables
Creating dimension tables for each component entity often results in a large number of dimension tables. To simplify the data mart structure, related dimensions should be consolidated together into a single dimension table.

Many-to-Many Relationships
Most of the complexities which arise in converting a traditional Entity Relationship model to a dimensional model result from many-to-many relationships or intersection entities. Many-to-many relationships cause problems in dimensional modelling because they represent a break in the hierarchical chain, and cannot be collapsed. There are a number of options for dealing with many-to-many relationships:
(a) Ignore the intersection entity (eliminate it from the data mart).
(b) Convert the many-to-many relationship to a one-to-many relationship, by defining a primary relationship.
(c) Include it as a many-to-many relationship in the data mart; such entities may be useful to expert analysts but will not be amenable to analysis using an OLAP tool.
For example, in the model below, each client may be involved in a number of industries. The intersection entity Client Industry breaks the hierarchical chain and cannot be collapsed into Client.

Figure 20. Multiple Classification

The options are (a) to exclude the industry hierarchy, (b) to convert it to a one-to-many relationship, or (c) to include it as a many-to-many relationship.

Figure 21. Design Options

Handling Subtypes: Supertype/subtype relationships can be converted to a hierarchical structure by removing the subtypes and creating a classification entity to distinguish between subtypes. This can then be converted to a dimensional model in a straightforward manner.

Figure 22. Conversion of subtypes to Hierarchical Form

2. a. The following approaches can be used to optimize the backup process of a data warehouse:
- Partitioning can be used to increase operational flexibility.
- Incremental backup can be used to reduce the elapsed time to complete an operation.
- Parallel processing can be used to divide and conquer large data volumes.
- Concurrent backup can be allowed to extend availability.
- RAID can be used to recover from media failure.

b. Naïve Bayesian classification and Bayesian belief networks are both based on Bayes' theorem of posterior probability. Unlike naïve Bayesian classification, which assumes class conditional independence of all attributes, Bayesian belief networks allow class conditional independencies to be defined between subsets of variables.

3. Example: Analytical Characterization

Task: mine general characteristics describing graduate students using analytical characterization.

Given: attributes name, gender, major, birth_place, birth_date, phone#, and gpa
Gen(ai) = concept hierarchies on attribute ai
Ui      = attribute analytical thresholds for ai
Ti      = attribute generalization thresholds for ai
R       = attribute relevance threshold

2. Analytical generalization using Ui:
- Attribute removal: remove name and phone#
- Attribute generalization: generalize major, birth_place, birth_date and gpa
- Accumulate counts
- Candidate relation: gender, major, birth_country, age_range and gpa

Example: Analytical characterization (2)


gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Candidate relation for Target class: Graduate students (total count = 120)


gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24

Candidate relation for Contrasting class: Undergraduate students (total count = 130)

Example: Analytical Characterization (4)

Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:

E(major) = (126/250)*I(s11,s21) + (82/250)*I(s12,s22) + (42/250)*I(s13,s23) = 0.7873

Calculate information gain for each attribute


Gain(major) = I(s1,s2) - E(major) = 0.9988 - 0.7873 = 0.2115

Information gain for all attributes


Gain(gender)        = 0.0003
Gain(birth_country) = 0.0407
Gain(major)         = 0.2115
Gain(gpa)           = 0.4490
Gain(age_range)     = 0.5971

Example: Analytical characterization (3)

3. Relevance analysis: calculate the expected information required to classify an arbitrary tuple:

I(s1,s2) = I(120,130) = -(120/250)*log2(120/250) - (130/250)*log2(130/250) = 0.9988

Calculate the entropy of each attribute, e.g. major:

For major = Science:     s11 = 84 (grad students in Science), s21 = 42 (undergrad students in Science); I(s11,s21) = 0.9183
For major = Engineering: s12 = 36, s22 = 46; I(s12,s22) = 0.9892
For major = Business:    s13 = 0,  s23 = 42; I(s13,s23) = 0

Example: Analytical characterization (5)


4. Initial working relation (W0) derivation

R = 0.1: remove irrelevant/weakly relevant attributes from the candidate relation => drop gender and birth_country; remove the contrasting-class candidate relation.

major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18

Initial target class working relation W0: Graduate students
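The derivation of W0 can be sketched in Python (standard library only); the tuples mirror the target-class candidate relation above, and the helper names are our own:

```python
from collections import Counter

# Target-class candidate relation: (gender, major, birth_country, age_range, gpa, count)
target = [
    ("M", "Science",     "Canada",  "20-25", "Very_good", 16),
    ("F", "Science",     "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science",     "Foreign", "25-30", "Excellent", 25),
    ("M", "Science",     "Canada",  "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada",  "20-25", "Excellent", 18),
]

# Drop gender and birth_country (gain below R = 0.1) and merge identical tuples.
w0 = Counter()
for gender, major, country, age, gpa, count in target:
    w0[(major, age, gpa)] += count

print(w0[("Science", "25-30", "Excellent")])  # 47  (22 + 25 merged)
```

Dropping the weakly relevant attributes makes two tuples identical, which is why W0 has five rows instead of six.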

5. Perform attribute-oriented induction on W0 using Ti


4. Interval-valued variables: Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, and latitude and longitude coordinates. The measurement unit used can affect the clustering analysis.

For example, changing the measurement unit from meters to inches for height can lead to a very different clustering structure. How can the data for a variable be standardized? To standardize measurements, one choice is to convert the original measurements to unitless variables. Given measurements x1f, ..., xnf for a variable f, this can be performed as follows:

Calculate the mean absolute deviation:
    sf = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|)
where the mean is mf = (1/n)(x1f + x2f + ... + xnf).

Calculate the standardized measurement (z-score):
    zif = (xif - mf) / sf

Using the mean absolute deviation is more robust than using the standard deviation.

Binary Variables: Dissimilarity between two objects i and j described by binary variables is computed from a 2x2 contingency table that counts, over all variables, how often the pair of values for (i, j) is (1,1), (1,0), (0,1) and (0,0).
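A minimal Python sketch of this standardization (the function name and sample heights are invented for illustration) shows that the resulting z-scores are unit-free:

```python
# Sketch of z-score standardization using the mean absolute deviation,
# as described above (standard library only).
def standardize(values):
    n = len(values)
    m = sum(values) / n                       # mean m_f
    s = sum(abs(x - m) for x in values) / n   # mean absolute deviation s_f
    return [(x - m) / s for x in values]

heights_m = [1.5, 1.6, 1.7, 1.8]
z = standardize(heights_m)
# The z-scores are unit-free: converting to inches gives the same result.
heights_in = [h * 39.3701 for h in heights_m]
assert all(abs(a - b) < 1e-9 for a, b in zip(z, standardize(heights_in)))
print([round(v, 2) for v in z])  # [-1.5, -0.5, 0.5, 1.5]
```

The assertion demonstrates the point of the example: whether heights are measured in meters or inches, the standardized values are identical, so the clustering result no longer depends on the unit.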

Simple matching coefficient (invariant if the binary variable is symmetric):
    d(i,j) = (r + s) / (q + r + s + t)
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
    d(i,j) = (r + s) / (q + r + s)
Here q is the number of variables equal to 1 for both objects, t the number equal to 0 for both, and r and s the numbers of variables on which the two objects disagree. These coefficients measure the dissimilarity between binary variables.
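The two coefficients can be illustrated with a small Python sketch (the function name and sample vectors are our own):

```python
# Sketch: simple matching vs. Jaccard dissimilarity from two binary vectors.
def binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    simple = (r + s) / (q + r + s + t)   # symmetric variables
    jaccard = (r + s) / (q + r + s)      # asymmetric variables: ignore t
    return simple, jaccard

i = [1, 0, 1, 1, 0, 0]
j = [1, 1, 0, 1, 0, 0]
print(binary_dissim(i, j))  # (0.3333333333333333, 0.5)
```

For these vectors q = 2, r = 1, s = 1, t = 2; the Jaccard coefficient is larger because the (0,0) matches, which dominate sparse asymmetric data, are excluded from the denominator.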

Nominal Variables: A generalization of the binary variable in that it can take more than two states, e.g. red, yellow, blue, green.

Method 1: Simple matching: d(i,j) = (p - m) / p, where m is the number of matches and p the total number of variables.
Method 2: Use a large number of binary variables, creating a new binary variable for each of the M nominal states.

Ratio-Scaled Variables: A ratio-scaled variable is a positive measurement on a nonlinear, approximately exponential, scale. Options for handling them:
- Treat them like interval-scaled variables: not a good choice, since the scale is distorted.
- Apply a logarithmic transformation: yif = log(xif).
- Treat them as continuous ordinal data and treat their rank as interval-scaled.
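A minimal sketch of simple matching for nominal variables (the function name and sample objects are invented for illustration):

```python
# Sketch: simple-matching dissimilarity for nominal variables, d(i,j) = (p - m) / p.
def nominal_dissim(a, b):
    p = len(a)                                  # total number of variables
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p

obj_i = ["red", "circle", "large"]
obj_j = ["red", "square", "large"]
print(nominal_dissim(obj_i, obj_j))  # 0.3333333333333333 (one mismatch out of three)
```

With p = 3 variables and m = 2 matches, the dissimilarity is (3 - 2)/3, i.e. one third.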

5. Benefits of having a data warehouse architecture:
- Provides an organizing framework: the architecture draws the lines on the map in terms of what the individual components are, how they fit together, who owns what parts, and priorities.
- Improved flexibility and maintenance: allows you to quickly add new data sources; interface standards allow plug and play; and the model and metadata allow impact analysis and single-point changes.
- Faster development and reuse: warehouse developers are better able to understand the data warehouse process, database contents, and business rules more quickly.
- Management and communications tool: define and communicate direction and scope to set expectations, identify roles and responsibilities, and communicate requirements to vendors.
- Coordinate parallel efforts: multiple, relatively independent efforts have a chance to converge successfully. Also, data marts built without an architecture become the stovepipes of tomorrow.

Section C: Applied Theory

6. In the information systems world, an architecture adds value in much the same way a blueprint adds value to a construction project. An effective architecture will increase the flexibility of the system, facilitate learning, and improve productivity. For data warehousing, the architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. Like the house analogy, the warehouse architecture is a set of documents, plans, models, drawings, and specifications, with separate sections for each key component area and enough detail to allow their implementation by skilled professionals.

Key Component Areas: A complete data warehouse architecture includes data and technical elements.

Thornthwaite breaks down the architecture into three broad areas. The first, data architecture, is centered on business processes. The next area, infrastructure, includes hardware, networking, operating systems, and desktop machines. Finally, the technical area encompasses the decision-making technologies that will be needed by the users, as well as their supporting structures.

Data Architecture (Columns): The data architecture portion of the overall data warehouse architecture is driven by business processes. For example, in a manufacturing environment the data model might include orders, shipping, and billing. Each area draws on a different set of dimensions. Where dimensions intersect in the data model, the data items should have a common structure and content, and involve a single process to create and maintain.

Business requirements essentially drive the architecture, so talk to business managers, analysts, and power users. From your interviews, look for major business issues, as well as indicators of business strategy, direction, frustrations, business processes, timing, availability, and performance expectations. Document everything well.

From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP application groups, and DBAs, as well as networking, OS, and desktop support staff. Also speak with architecture and planning professionals. Here you want to get their opinions on data warehousing considerations from the IT viewpoint. Learn if there are existing architecture documents, IT principles, organizational power centers, etc.

Not many standards exist for data warehousing, but there are standards for a lot of the components. The following are some to keep in mind:
- Middleware: ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
- Database connectivity: ODBC, JDBC, OLE DB, and others.
- Data management: ANSI SQL and FTP.
- Network access: DCE, DNS, and LDAP.

Technical Architecture: When you develop the technical architecture model, draft the architecture requirements document first. Next to each business requirement, write down its architecture implications. Group these implications according to architecture areas (remote access, staging, data access tools, etc.). Understand how each fits in with the other areas. Capture the definition of the area and its contents. Then refine and document the model.
Technical Architecture covers the processes and tools we apply to the data. This area answers the question "how": how do we get the data at its source, put it in a form that meets the business requirements, and move it to a place that is accessible? The technical architecture is made up of the tools, utilities, code, and so on that bring the warehouse to life. Two main subsets of the technical architecture area have requirements different enough to warrant independent consideration. These two areas are the back room and the front room. The back room is the part responsible for gathering and preparing the data. Another common term for the back room is data staging.

The front room is the part responsible for delivering data to the user community. Another common term for the front room is data access.

Infrastructure Architecture Area: Infrastructure is about the platforms that host the data and processes. The infrastructure is the physical plant of the data warehouse.

Defining the Levels of Detail (the Rows)

Business Requirements Level: The business requirements level is explicitly non-technical. The systems planner must understand the major business forces and boundary conditions that affect the data warehouse project.

Architectural Level: Architecture models are the first level of serious response to the requirements. An architecture model proposes the major components of the architecture that must be available to address the requirements. At this level, the system perspective addresses whether the various technical components can communicate with each other or not.

Detailed Models Level: Detailed models are the functional specifications of each of the architectural components, at a significant level of detail. The detailed models must include enough information to serve as a reliable implementation guide for the team members. Detailed models must also be enough to create a legal contract, so that when the work is done it can be held up to the functional specification to see whether the implementation is complete and conforms to the specification.

Implementation Level: The implementation level is the response to the detailed models. For a software deliverable, it is the code itself. For the data area, it is the data definition language used to build the database. At the implementation level, all of the above must be documented.

7. a. A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. Factless fact tables are often used to record events or coverage information. Common examples include:
- Identifying product promotion events (to determine promoted products that didn't sell)
- Tracking student attendance or registration events
- Tracking insurance-related accident events
- Identifying building, facility, and equipment schedules for a hospital or university

b. Web mining, when looked upon in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages, etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). As in most real-world problems, the clusters and associations in Web mining do not have crisp boundaries and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, due to a wide variety of reasons inherent to web browsing and logging. Thus, Web mining and personalization require modeling of an unknown number of overlapping sets in the presence of significant noise and outliers (i.e., bad exemplars). Moreover, the data sets in Web mining are extremely large.

c. Market basket analysis studies the buying habits of customers by searching for sets of items that are frequently purchased together or in sequence.
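Market basket analysis in its simplest form can be sketched as frequent-pair counting; the baskets and support threshold below are invented for illustration, and this is a sketch rather than the full Apriori algorithm:

```python
from collections import Counter
from itertools import combinations

# Invented example transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count how often each pair of items is purchased together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3  # keep pairs appearing in at least 3 of the 5 baskets
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

Pairs that meet the support threshold (here, bread+milk, bread+diapers, diapers+milk, beer+diapers) are candidate association rules; a full implementation would extend the counting to larger itemsets and compute confidence for each rule.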