
Assignment on Data Mining

Submitted to: Ms. Bobby Thomas, Dept. of MCA, SJCET, Palai

Submitted by: Praveen P, S5 MCA, No: 23, SJCET, Palai

1) Briefly compare the following concepts, with examples: a) snowflake schema, fact constellation, starnet query model

Snowflake schema: The snowflake schema is a variant of the star schema model, in which some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake. The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such tables are easy to maintain and also save storage space, because a dimension table can become extremely large when the dimensional structure is included as columns; since much of this space holds redundant data, normalizing the structure reduces the overall space requirement. However, the snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query. Consequently, system performance may be adversely affected. Performance benchmarking can be used to determine what is best for your design.

An example of a snowflake schema for AllElectronics sales is given in the figure. Here, the sales fact table is identical to that of the star schema. The main difference between the two schemas is in the definition of the dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two tables: a new location table and a city table. The location key of the new location table now links to the city dimension. Notice that further normalization can be performed on province or state and country in the snowflake schema, when desirable.

A compromise between the star schema and the snowflake schema is to adopt a mixed schema where only the very large dimension tables are normalized. Normalizing large dimension tables saves storage space, while keeping small dimension tables unnormalized may reduce the cost and performance degradation due to joins on multiple dimension tables. Doing both may lead to an overall performance gain. However, careful performance tuning could be required to determine which dimension tables should be normalized and split into multiple tables.
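As a rough illustration of the extra join that normalization introduces, the sketch below answers the same question ("total dollars sold per supplier type") against a star-style item table and against a snowflake-style item/supplier pair. The table names, keys, and values are invented for the example and are not taken from any of the figures referenced above.

```python
import pandas as pd

# A tiny sales fact table (shared by both layouts).
sales = pd.DataFrame({
    "item_key":     [1, 2, 1],
    "time_key":     [10, 10, 11],
    "dollars_sold": [250.0, 80.0, 120.0],
})

# Star layout: one denormalized item dimension (supplier info kept as columns).
item_star = pd.DataFrame({
    "item_key":      [1, 2],
    "item_name":     ["TV", "USB drive"],
    "supplier_type": ["wholesale", "retail"],
})

# Snowflake layout: item is normalized; supplier attributes move to their own table.
item_snow = pd.DataFrame({
    "item_key":     [1, 2],
    "item_name":    ["TV", "USB drive"],
    "supplier_key": [100, 200],
})
supplier = pd.DataFrame({
    "supplier_key":  [100, 200],
    "supplier_type": ["wholesale", "retail"],
})

# The same query needs one join in the star layout but two joins in the snowflake layout.
star_result = (sales.merge(item_star, on="item_key")
                    .groupby("supplier_type")["dollars_sold"].sum())
snow_result = (sales.merge(item_snow, on="item_key")
                    .merge(supplier, on="supplier_key")
                    .groupby("supplier_type")["dollars_sold"].sum())
print(star_result.equals(snow_result))  # True: same answer, extra join in the snowflake case
```

The extra merge is exactly the additional join cost that the text above attributes to the snowflake design.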

Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.

An example of a fact constellation schema is shown in the figure. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys: time key, item key, shipper key, from location, and to location, and two measures: dollars cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.

Star schema: The star schema is a modelling paradigm in which the data warehouse contains (1) a large central table (fact table) and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
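A minimal sketch of the fact constellation idea, with made-up keys and values: two fact tables, each a small star, reference the same item dimension table, so the dimension is stored once and shared.

```python
import pandas as pd

# Shared dimension table.
item = pd.DataFrame({"item_key": [1, 2], "item_name": ["TV", "USB drive"]})

# Two fact tables sharing the item dimension (a "collection of stars").
sales = pd.DataFrame({"item_key": [1, 2, 2], "dollars_sold": [250.0, 80.0, 90.0]})
shipping = pd.DataFrame({"item_key": [1, 2], "units_shipped": [3, 12]})

# Each star is queried through the same shared dimension table.
print(sales.merge(item, on="item_key").groupby("item_name")["dollars_sold"].sum())
print(shipping.merge(item, on="item_key").groupby("item_name")["units_shipped"].sum())
```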

An example of a star schema for AllElectronics sales is shown in the figure. Sales are considered along four dimensions, namely time, item, branch, and location. The schema contains a central fact table for sales, which contains keys to each of the four dimensions, along with two measures: dollars sold and units sold.

b) Data cleaning, data transformation

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. The following are basic methods for data cleaning.

Missing values: Imagine that you need to analyse AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Consider the following methods.

1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification or description). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or -∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, namely "Unknown". Hence, although this method is simple, it is not recommended.

4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of AllElectronics customers is $28,000. Use this value to replace the missing value for income.

5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

6. Use the most probable value to fill in the missing value: This may be determined with inference-based tools using a Bayesian formalism or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.

The filled-in value may not be correct. Method 6, however, is a popular strategy; in comparison to the other methods, it uses the most information from the present data to predict missing values.

Noisy data: What is noise? Noise is a random error or variance in a measured variable. Given a numeric attribute such as, say, price, how can we "smooth" out the data to remove the noise? Consider the following data smoothing techniques.

1. Binning methods: Binning methods smooth a sorted data value by consulting its "neighbourhood", that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Because binning methods consult the neighbourhood of values, they perform local smoothing. For example, the data for price are first sorted and partitioned into equi-depth bins (of depth 3). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin; for example, the mean of the values 4, 8, and 15 in Bin 1 is 9, so each original value in this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equi-width, where the interval range of values in each bin is constant. Binning is also used as a discretization technique and is discussed further under association rule mining.

2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or "clusters". Intuitively, values that fall outside of the set of clusters may be considered outliers.
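A minimal sketch of methods 4 and 5 and of smoothing by bin means. The customer table is invented for illustration, and the sorted price list below extends the values 4, 8, and 15 mentioned in the text with six further made-up values so that three equi-depth bins can be formed.

```python
import pandas as pd
import numpy as np

# Made-up customer data with missing income values (NaN).
customers = pd.DataFrame({
    "income":      [28000.0, np.nan, 35000.0, np.nan, 52000.0, 61000.0],
    "credit_risk": ["low",   "low",  "high",  "high", "high",  "low"],
})

# Method 4: fill with the overall attribute mean.
fill_mean = customers["income"].fillna(customers["income"].mean())

# Method 5: fill with the mean income of the same credit_risk class.
fill_class_mean = (customers.groupby("credit_risk")["income"]
                            .transform(lambda s: s.fillna(s.mean())))

# Smoothing by bin means: sort the prices, split into equi-depth bins of 3,
# and replace every value by its bin mean (bin 1 -> 9, as in the text).
prices = pd.Series(sorted([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.arange(len(prices)) // 3                 # equi-depth partition of depth 3
smoothed = prices.groupby(bins).transform("mean")

print(fill_mean.tolist())
print(fill_class_mean.tolist())
print(smoothed.tolist())
```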

Data transformation: In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:

1. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0.

2. Smoothing, which works to remove the noise from data. Such techniques include binning, clustering, and regression.

3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.

4. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or county. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.

c) Enterprise warehouse, data mart, virtual warehouse

Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.

Virtual warehouse: In the data warehousing sense, a virtual warehouse is a set of views over operational databases, of which only some summary views may be materialized. In retail, a virtual warehouse also gives retailers the opportunity to advertise a whole new line of items online that they would not otherwise have room for on their own shelves. Through the distributor's virtual warehouse services, customers can order products from the retailer's website; the order is sent to the distributor's warehouse, where it is picked, packed, and shipped directly to the customer. The benefits: customers appreciate a fully stocked inventory, with multiple ordering options and fast shipments. Taking advantage of a virtual warehouse gives retailers the ability to expand their customer base with new products, while increasing customer loyalty through superior service. Because the distributors provide the inventory space, in addition to the picking, packing, and shipping labour, retailers can cut costs significantly while improving profits. Distributors expand business while reducing inventories, and with the ability to continually update prices online, distributors no longer have to honour outdated prices that are often found in catalogues.

2) Present an example where data mining is critical to the success of a business. What data mining functions does this business need? Can they be performed?

Data mining is used today in a wide range of applications, from tracking down criminals to brokering information for supermarkets, from developing community knowledge for a business to cross-selling, routing warranty claims, holding on to good customers, and weeding out bad customers. Some applications include marketing, financial investment, fraud detection, manufacturing and production, and network management. Data mining is not limited to the business environment; it is also useful for data analysis of sky survey cataloguing, mapping the datasets of Venus, biosequence databases, and geoscience systems. Data mining is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions. In essence, data mining is distinguished by the fact that it is aimed at the discovery of information without a previously formulated hypothesis. The field of data mining addresses the question of how best to use historical data to discover general regularities and improve the process of making decisions.

Data mining in customer relationship management applications can contribute significantly to the bottom line. Rather than randomly contacting a prospect or customer through a call centre or by sending mail, a company can concentrate its efforts on prospects that are predicted to have a high likelihood of responding to an offer. More sophisticated methods may be used to optimize resources across campaigns so that one may predict to which channel and to which offer an individual is most likely to respond, across all potential offers. Additionally, sophisticated applications could be used to automate the mailing: once the results from data mining (potential prospect/customer and channel/offer) are determined, this "sophisticated application" can automatically send an e-mail or regular mail. Finally, in cases where many people will take an action without an offer, uplift modelling can be used to determine which people will have the greatest increase in responding if given an offer. Data clustering can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also recognize that the number of predictive models can quickly become very large. Rather than one model to predict how many customers will churn, a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to loyal customers. Finally, it may want to determine which customers are going to be profitable over a window of time, and only send the offers to those that are likely to be profitable. In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining.

Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as the universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

Another example of data mining, often called market basket analysis, relates to its use in retail sales. If a clothing store records the purchases of customers, a data mining system could identify those customers who favour silk shirts over cotton ones. Although some explanations of relationships may be difficult, taking advantage of them is easier. This example deals with association rules within transaction-based data; not all data are transaction based, and logical or inexact rules may also be present within a database. Market basket analysis has also been used to identify the purchase patterns of the Alpha consumer. Alpha consumers are people that play a key role in connecting with the concept behind a product, then adopting that product, and finally validating it for the rest of society. Analysing the data collected on this type of user has allowed companies to predict future buying trends and forecast supply demands.

Data mining is a highly effective tool in the catalogue marketing industry. Cataloguers have a rich history of customer transactions on millions of customers dating back several years.
Data mining tools can identify patterns among customers and help identify the most likely customers to respond to upcoming mailing campaigns.
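As a small, hedged illustration of the clustering step mentioned above (automatically discovering customer segments before deciding whom to target), the sketch below groups an invented customer table into three segments with k-means; the feature names and values are assumptions made purely for the example, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customer features: [annual_spend, visits_per_month].
X = np.array([
    [200.0,  1], [250.0,  2], [300.0,  1],    # low-activity customers
    [1200.0, 8], [1100.0, 9], [1300.0, 7],    # frequent, mid-spend customers
    [5000.0, 3], [5200.0, 4], [4800.0, 2],    # high-spend customers
])

# Discover three segments; in practice the number of clusters and the
# features would be chosen and scaled with more care.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # each customer assigned to one of the 3 discovered segments
```

Offers or mailing campaigns can then be tailored per segment rather than sent uniformly to every customer.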

Data mining for business applications is a component that needs to be integrated into a complex modelling and decision-making process. Reactive business intelligence (RBI) advocates a holistic approach that integrates data mining, modelling, and interactive visualization into an end-to-end discovery and continuous innovation process powered by human and automated learning. In the area of decision making, the RBI approach has been used to mine the knowledge that is progressively acquired from the decision maker and to self-tune the decision method accordingly.

Related to an integrated-circuit production line, an example of data mining is described in the paper "Mining IC Test Data to Optimize VLSI Testing". In this paper, the application of data mining and decision analysis to the problem of die-level functional test is described. Experiments mentioned in the paper demonstrate the ability of a system that mines historical die-test data to create a probabilistic model of patterns of die failure. These patterns are then utilized to decide, in real time, which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

A department store, for example, can use data mining to assist with its targeted marketing mail campaign. Using data mining functions such as association, the store can use the mined strong association rules to determine which products bought by one group of customers are likely to lead to the buying of certain other products. With this information, the store can then mail marketing material only to those kinds of customers who exhibit a high likelihood of purchasing additional products. Data query processing is used for data or information retrieval and does not have the means for finding association rules. Similarly, simple statistical analysis cannot handle large amounts of data such as those of customer records in a department store. Without data mining, many businesses may not be able to perform effective market analysis, compare customer feedback on similar products, discover the strengths and weaknesses of their competitors, retain highly valuable customers, and make smart business decisions. Data mining, together with online analytical processing, is at the core of business intelligence.

3) Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and the measures count and avg_grade. At the lowest conceptual level (for a given student, course, semester, and instructor combination), avg_grade stores the actual grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.

a) Draw a snowflake schema diagram.

Snowflake schema (attribute lists reconstructed from the diagram):

Fact table: stud_id, course_id, sem_id, inst_id, stud_mark, stud_grade, stud_percentage, avg_grade, count

Student dimension: stud_id, stud_name, stud_age, stud_qualification, branch_key
Branch dimension (normalized out of the student dimension): branch_key, branch_name, branch_type
Course dimension: course_id, course_duration
Semester dimension: sem_id, start_date, end_date
Instructor dimension: inst_id, inst_name, inst_type
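A small sketch of how the count and avg_grade measures behave at different conceptual levels of this schema. The keys and grades below are invented, and the column names only loosely follow the attribute lists reconstructed above: at the lowest level each fact row stores the actual grade, and rolling up (here, dropping the student dimension) produces count and avg_grade per remaining combination.

```python
import pandas as pd

# Fact table at the lowest conceptual level: one row per
# (student, course, semester, instructor) with the actual grade.
fact = pd.DataFrame({
    "stud_id":   [1, 1, 2, 2],
    "course_id": ["CS1", "CS2", "CS1", "CS2"],
    "sem_id":    ["S1", "S1", "S1", "S1"],
    "inst_id":   [10, 11, 10, 11],
    "grade":     [8.0, 9.0, 7.0, 6.0],
})

# Roll-up to a higher conceptual level: aggregate over students,
# storing count and avg_grade for each (course, semester, instructor).
rollup = (fact.groupby(["course_id", "sem_id", "inst_id"])["grade"]
              .agg(count="count", avg_grade="mean")
              .reset_index())
print(rollup)
```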

4) A database has four transactions. Let min_sup = 60% and min_conf = 80%.

Cust_ID   TID     Items bought
01        T100    {King's-Crab (i1), Sunset-Milk (i2), Dairyland-Cheese (i3), Best-Bread (i4)}
02        T200    {Best-Cheese (i5), Dairyland-Milk (i6), Goldenfarm-Apple (i7), Tasty-Pie (i8), Wonder-Bread (i9)}
01        T300    {Westcoast-Apple (i10), Tasty-Pie (i8), Dairyland-Milk (i6), Wonder-Bread (i9)}
03        T400    {Wonder-Bread (i9), Sunset-Milk (i2), Dairyland-Cheese (i3)}

For the rule template ∀x ∈ transaction, buys(x, item1) ∧ buys(x, item2) ⇒ buys(x, item3), list the frequent k-itemset for the largest k, and all of the strong association rules (matching this template) that contain the frequent k-itemset.

Transaction database D:

TID     List of item IDs
T100    i1, i2, i3, i4
T200    i5, i6, i7, i8, i9
T300    i10, i9, i6, i8
T400    i9, i2, i3

Step 1: Scan D for the support count of each candidate itemset.

Candidate 1-itemsets C1:

Itemset   Support count
{i1}      1
{i2}      2
{i3}      2
{i4}      1
{i5}      1
{i6}      2
{i7}      1
{i8}      2
{i9}      3
{i10}     1

Step 2: Compare each candidate's support count with the minimum support count (min support count = 2).

Frequent 1-itemsets L1:

Itemset   Support count
{i2}      2
{i3}      2
{i6}      2
{i8}      2
{i9}      3

Step 3: Generate candidate 2-itemsets C2 from L1 and scan D for their support counts.

Candidate 2-itemsets C2:

Itemset    Support count
{i2,i3}    2
{i2,i6}    0
{i2,i8}    0
{i2,i9}    1
{i3,i6}    0
{i3,i8}    0
{i3,i9}    1
{i6,i8}    2
{i6,i9}    2
{i8,i9}    2

Step 4: Compare each candidate's support count with the minimum support count (2).

Frequent 2-itemsets L2:

Itemset    Support count
{i2,i3}    2
{i6,i8}    2
{i6,i9}    2
{i8,i9}    2

Step 5: Generate candidate 3-itemsets C3 from L2 and scan D for their support counts.

Candidate 3-itemsets C3:

Itemset       Support count
{i2,i3,i6}    0
{i2,i3,i8}    0
{i2,i3,i9}    1
{i2,i6,i8}    0
{i2,i6,i9}    0
{i3,i6,i8}    0
{i3,i6,i9}    0
{i6,i8,i9}    2

Step 6: Compare each candidate's support count with the minimum support count (2).

Frequent 3-itemsets L3:

Itemset       Support count
{i6,i8,i9}    2

No frequent 4-itemset can be generated, so the largest k is 3 and the frequent 3-itemset is {i6, i8, i9}, i.e. {Dairyland-Milk (i6), Tasty-Pie (i8), Wonder-Bread (i9)}.
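The result can be checked with a short brute-force frequent-itemset sketch. It enumerates all candidate itemsets rather than using Apriori's join-and-prune candidate generation, which is fine at this size; the transactions are the ones from the table in question 4, and the support count threshold of 2 matches the tables above.

```python
from itertools import combinations

# Transactions from question 4, encoded with the item IDs used above.
transactions = [
    {"i1", "i2", "i3", "i4"},
    {"i5", "i6", "i7", "i8", "i9"},
    {"i10", "i9", "i6", "i8"},
    {"i9", "i2", "i3"},
]
min_count = 2  # support count threshold used in the tables above

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Enumerate frequent itemsets of every size (acceptable for a 4-transaction toy set).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    level = {frozenset(c): support(set(c))
             for c in combinations(items, k) if support(set(c)) >= min_count}
    if not level:
        break
    frequent.update(level)

largest = max(frequent, key=len)
print(sorted(largest), frequent[largest])   # ['i6', 'i8', 'i9'] 2

# Strong rules of the form buys(X, a) ^ buys(X, b) => buys(X, c), min_conf = 80%.
for a, b in combinations(sorted(largest), 2):
    c = (largest - {a, b}).pop()
    conf = frequent[largest] / support({a, b})
    if conf >= 0.8:
        print(f"{a} ^ {b} => {c}  (confidence {conf:.0%})")
```

Using the counts in the tables above, the candidate rules of the required form are i6 ∧ i8 ⇒ i9, i6 ∧ i9 ⇒ i8, and i8 ∧ i9 ⇒ i6; each has confidence 2/2 = 100%, which satisfies the 80% minimum confidence threshold, so all three are strong association rules containing the frequent 3-itemset.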

5) State why, for the integration of multiple heterogeneous information sources, many companies in industry prefer the update-driven approach rather than the query-driven approach. Describe situations where the query-driven approach is preferable over the update-driven approach.

A data warehouse (DW) is a database used for reporting and analysis. The data stored in the warehouse are uploaded from the operational systems. The data may pass through an operational data store for additional operations before they are used in the DW for reporting. The typical data warehouse uses staging, integration, and access layers to house its key functions: the staging layer stores raw data, the integration layer integrates the data and moves it into hierarchical groups, and the access layer helps users retrieve data. Data warehouses can be subdivided into data marts, which store subsets of data from a warehouse. This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research, and decision support. However, the means to retrieve and analyse data, to extract, transform, and load data,

and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata.

Online analytical processing, or OLAP, is an approach to swiftly answering multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term OLTP. OLAP tools enable users to interactively analyse multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions; for example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. In contrast, drill-down is a technique that allows users to navigate through the details; for instance, users can drill down to the sales of the individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out a specific set of data of the OLAP cube and view the slices from different viewpoints. Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with rapid execution time. They borrow aspects of navigational databases, hierarchical databases, and relational databases.
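A compact sketch of the three OLAP operations on a toy sales "cube" held in a pandas DataFrame; the office, quarter, and product values are invented for the example.

```python
import pandas as pd

# Toy sales cube: one row per (office, quarter, product) cell.
cube = pd.DataFrame({
    "office":  ["NY", "NY", "LA", "LA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "product": ["TV", "TV", "TV", "PC"],
    "sales":   [100,  120,  90,   200],
})

# Consolidation (roll-up): aggregate offices up to company-level quarterly totals.
rollup = cube.groupby("quarter")["sales"].sum()

# Drill-down: navigate back to finer detail, e.g. per quarter, office, and product.
drill = cube.groupby(["quarter", "office", "product"])["sales"].sum()

# Slice: fix one dimension (quarter == "Q1"); dice: select a sub-cube.
slice_q1 = cube[cube["quarter"] == "Q1"]
dice = cube[(cube["quarter"] == "Q1") & (cube["office"] == "NY")]

print(rollup, drill, slice_q1, dice, sep="\n\n")
```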

The traditional database approach to heterogeneous database integration is to build wrappers and integrators (mediators) on top of the multiple heterogeneous databases. When a query is posed to a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors, and the results returned from the different sites are integrated into a global answer set. This query-driven approach requires complex information filtering and integration processing, and competes with local sites for processing resources. It is inefficient and potentially expensive for frequent queries, especially queries requiring aggregations.

Data warehousing provides an interesting alternative to this traditional approach. Rather than using a query-driven approach, data warehousing employs an update-driven approach, in which information from multiple heterogeneous sources is integrated in advance and stored in a warehouse for direct querying and analysis. Unlike online transaction processing databases, data warehouses do not contain the most current information. However, a data warehouse brings high performance to the integrated heterogeneous database system, because data are copied, pre-processed, integrated, and summarized into one semantic data store, and query processing in the data warehouse does not interfere with the processing at local sources.
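To make the contrast concrete, here is a minimal sketch in which the two "sources", their differing column names, and the translation step are all invented for the example: the query-driven mediator translates and dispatches the query to each source at query time, whereas the update-driven warehouse integrates the sources in advance and answers from its own store.

```python
import pandas as pd

# Two heterogeneous "sources" with different column names.
source_a = pd.DataFrame({"cust": ["x", "y"], "amount": [10.0, 20.0]})
source_b = pd.DataFrame({"customer_id": ["y", "z"], "total": [5.0, 7.0]})

def query_driven_total(customer):
    """Query-driven: translate and dispatch the query to each source at query time."""
    a = source_a.loc[source_a["cust"] == customer, "amount"].sum()
    b = source_b.loc[source_b["customer_id"] == customer, "total"].sum()
    return a + b

# Update-driven: integrate the sources in advance into one semantic store
# (the warehouse), then answer queries directly from it.
warehouse = pd.concat([
    source_a.rename(columns={"cust": "customer_id", "amount": "total"}),
    source_b,
])

print(query_driven_total("y"))                                   # 25.0
print(warehouse.groupby("customer_id")["total"].sum().loc["y"])  # 25.0
```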

Moreover, a data warehouse can store and integrate historical information and support complex multidimensional queries. As a result, data warehousing has become popular in industry. The query-driven approach remains preferable in situations where the source data change very frequently or where users require the most current information, so that integrating and copying the data into a warehouse in advance is impractical.

6) Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time and arrives in the form of a stream of data.

a) Present three application examples of spatiotemporal data streams.

b) Discuss what kinds of interesting knowledge can be mined from such data streams with limited time and resources.

Spatiotemporal data are those that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include discovering the evolutionary history of cities and lands, uncovering weather patterns, and tracking the movements of mobile objects; correspondingly, GPS traces of moving vehicles, streams of remote sensing or weather readings, and cell phone location data are three application examples of spatiotemporal data streams.

Mining multimedia data: Multimedia data mining is the discovery of interesting patterns from multimedia databases that store and manage large collections of multimedia objects, including image, video, and audio data. It is an interdisciplinary field that integrates image processing and understanding, computer vision, data mining, and pattern recognition. Issues in multimedia data mining include content-based retrieval and similarity search, and generalization and multidimensional analysis.

Mining text data: Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics.

Mining web data: The web contains a rich and dynamic collection of information about web page contents with hypertext structures and multimedia, hyperlink information, and access and usage information, providing fertile sources for data mining. Web mining is the set of techniques used to discover patterns, structure, and knowledge from the web.
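For part (b), one kind of interesting knowledge that can be mined from a spatiotemporal stream under limited time and memory is a window-based summary, for example moving counts of objects per spatial grid cell, maintained in a single pass. A minimal sketch, assuming made-up (timestamp, x, y) readings and an invented window size and grid resolution:

```python
from collections import Counter, deque

WINDOW = 60          # keep only the last 60 seconds of readings
CELL = 10.0          # spatial grid resolution

window = deque()     # (timestamp, cell) pairs inside the current window
counts = Counter()   # objects observed per grid cell within the window

def observe(t, x, y):
    """Consume one stream element and maintain per-cell counts in one pass."""
    cell = (int(x // CELL), int(y // CELL))
    window.append((t, cell))
    counts[cell] += 1
    # Expire readings that have fallen out of the time window.
    while window and window[0][0] <= t - WINDOW:
        _, old = window.popleft()
        counts[old] -= 1

# Made-up stream of (timestamp, x, y) GPS-like readings.
for t, x, y in [(0, 3, 4), (10, 12, 4), (30, 13, 5), (75, 3, 4)]:
    observe(t, x, y)

print(counts.most_common(2))   # densest grid cells within the last 60 seconds
```

Summaries of this kind can reveal, within the most recent window, which regions are densest, how object density shifts over time, and (with per-object state) frequently travelled routes, without storing or rescanning the full stream.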
