UNIT I DATA WAREHOUSING

Data Warehouse Introduction

A data warehouse is a collection of data marts representing historical data from different operations in the company, stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports and queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.

The term Data Warehouse was coined by Bill Inmon in 1990, who defined it as follows: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." He defined the terms in the sentence as follows:

Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.

Time-variant: All data in the data warehouse is identified with a particular time period.

Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.

A data warehouse is thus a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.
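The time-variant and non-volatile properties can be sketched in a few lines of Python. The account records and period values below are invented purely for illustration:

```python
from datetime import date

# Operational store: one updateable row per account (volatile, current only).
operational = {"acct-1": {"balance": 500}}
operational["acct-1"]["balance"] = 350  # an update overwrites the old value

# Warehouse: snapshots are only ever appended, each tagged with a time period,
# so management can reconstruct a consistent picture for any past month.
warehouse = []
warehouse.append({"acct": "acct-1", "period": date(2024, 1, 31), "balance": 500})
warehouse.append({"acct": "acct-1", "period": date(2024, 2, 29), "balance": 350})

# The January snapshot survives even though the operational value changed.
january = [r for r in warehouse if r["period"] == date(2024, 1, 31)]
print(january[0]["balance"])  # 500
```

The operational store answers only "what is the balance now"; the warehouse answers "what was the balance in any period", which is exactly the historical perspective the definition calls for.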
It can be:
- Used for decision support
- Used to manage and control the business
- Used by managers and end users to understand the business and make judgments

Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.

Other important terminology

Enterprise Data warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.

Data Mart: a departmental subset that focuses on selected subjects. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses, usually smaller than the corporate data warehouse.

CS2032 DATA WAREHOUSING AND DATA MINING

Decision Support System (DSS): Information technology that helps the knowledge worker (executive, manager, analyst) make faster and better decisions.

Drill-down: Traversing the summarization levels from highly summarized data down to the underlying current or old detail.

Metadata: Data about data. It contains the location and description of warehouse system components: names, definitions, structure.

Benefits of data warehousing

Data warehouses are designed to perform well with aggregate queries running on large amounts of data. The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases primarily designed to handle lots of transactions. Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.
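The production-versus-inventory comparison just described can be sketched with an in-memory SQLite database. The table layouts and figures are hypothetical:

```python
import sqlite3

# Hypothetical production and inventory tables that originally lived in
# separate operational databases, consolidated here for a cross-segment query.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE production (product TEXT, units_built INTEGER);
CREATE TABLE inventory  (product TEXT, units_on_hand INTEGER);
INSERT INTO production VALUES ('widget', 120), ('gadget', 80);
INSERT INTO inventory  VALUES ('widget', 45),  ('gadget', 70);
""")

# One aggregate query spanning both segments of the business.
rows = db.execute("""
    SELECT p.product, SUM(p.units_built) - SUM(i.units_on_hand) AS shipped
    FROM production p JOIN inventory i ON p.product = i.product
    GROUP BY p.product
    ORDER BY p.product
""").fetchall()
print(rows)  # [('gadget', 10), ('widget', 75)]
```

Once both data sets sit in one consistently structured store, the cross-segment comparison is a single join and aggregate rather than an exercise in reconciling two incompatible systems.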
Queries that would be complex in highly normalized databases can be easier to build and maintain in data warehouses, decreasing the workload on transaction systems. Data warehousing is an efficient way to manage and report on data that comes from a variety of sources, is non-uniform, and is scattered throughout a company. Data warehousing is an efficient way to manage demand for lots of information from lots of users. Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can give an organization a competitive advantage.

Operational and informational Data

Operational Data: focuses on transactional functions such as bank card withdrawals and deposits. It is:
- Detailed
- Updateable
- Reflects current data

Informational Data: focuses on providing answers to problems posed by decision makers. It is:
- Summarized
- Non-updateable

Data Warehouse Characteristics

A data warehouse can be viewed as an information system with the following attributes:
- It is a database designed for analytical tasks
- Its content is periodically updated
- It contains current and historical data to provide a historical perspective of information

Operational data store (ODS)
- ODS is an architectural concept to support day-to-day operational decision support; it contains current value data propagated from operational applications
- ODS is subject-oriented, similar to a classic definition of a data warehouse
- ODS is integrated

ODS                    DATA WAREHOUSE
Volatile               Non-volatile
Very current data      Current and historical data
Detailed data          Pre-calculated summaries

1. Data warehouse Architecture and its seven components

1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system
A data warehouse is an environment, not a product. It is based on a relational database management system that functions as the central repository for informational data. The central repository is surrounded by a number of key components designed to make the environment functional, manageable and accessible. The data sources for the data warehouse come from operational applications. Data entered into the data warehouse is transformed into an integrated structure and format. The transformation process involves conversion, summarization, filtering and condensation. The data warehouse must be capable of holding and managing large volumes of data, as well as different data structures, over time.

1. Data warehouse database

This is the central part of the data warehousing environment (item number 2 in the architecture diagram above). It is implemented on RDBMS technology.

2. Sourcing, Acquisition, Clean up, and Transformation Tools

This is item number 1 in the architecture diagram above. These tools perform conversions, summarization, key changes, structural changes and condensation. The data transformation is required so that the information can be used by decision support tools. The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple operational systems.
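As a rough sketch of what such a transformation step does, consider the following Python fragment. The record layout, field names and rules are invented, not those of any real tool:

```python
# A toy transformation pass over extracted operational records: filtering,
# defaults for missing data, conversion to common key names, summarization.
raw = [
    {"cust_no": 1, "amt": "10.50", "region": "N"},
    {"cust_no": 2, "amt": "4.25",  "region": None},   # missing region
    {"cust_no": 1, "amt": "-1",    "region": "N"},    # unwanted test record
]

def transform(records):
    summary = {}
    for r in records:
        amount = float(r["amt"])            # conversion: text -> numeric
        if amount < 0:                      # filtering: drop unwanted data
            continue
        region = r["region"] or "UNKNOWN"   # default for missing data
        key = (r["cust_no"], region)        # common, integrated key
        summary[key] = summary.get(key, 0.0) + amount  # summarization
    return summary

print(transform(raw))  # {(1, 'N'): 10.5, (2, 'UNKNOWN'): 4.25}
```

Real tools generate equivalent logic as JCL, COBOL, UNIX scripts or SQL rather than hand-written code, but the conversion, filtering, defaulting and summarization steps are the same in spirit.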
The functionalities of these tools are listed below:
- Removing unwanted data from operational databases
- Converting to common data names and attributes
- Calculating summaries and derived data
- Establishing defaults for missing data
- Accommodating source data definition changes

Issues to be considered during data sourcing, cleanup, extraction and transformation:

Database heterogeneity: refers to the differing nature of DBMSs; they may use different data models, different access languages, and different data navigation methods, operations, concurrency, integrity and recovery processes.

Data heterogeneity: refers to the different ways the data is defined and used in different modules.

Some vendors involved in the development of such tools: Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.

3. Meta data

Meta data is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two types:

Technical Meta data: contains information about warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes:
- Info about data stores
- Transformation descriptions, i.e. mapping methods from operational databases to the warehouse database
- Warehouse object and data structure definitions for target data
- The rules used to perform clean up and data enhancement
- Data mapping operations
- Access authorization, backup history, archive history, info delivery history, data acquisition history, data access, etc.

Business Meta data: contains info that describes the data stored in the warehouse for users. It includes:
- Subject areas and info object types, including queries, reports, images, video, audio clips, etc.
- Internet home pages
- Info related to the info delivery system
- Data warehouse operational info such as ownerships, audit trails, etc.

Meta data helps users understand the content and find the data.
Meta data is stored in a separate data store known as the informational directory or meta data repository, which helps to integrate, maintain and view the contents of the data warehouse. The following lists the characteristics of the info directory / meta data:
- It is the gateway to the data warehouse environment
- It supports easy distribution and replication of content for high performance and availability
- It should be searchable by business-oriented keywords
- It should act as a launch platform for end users to access data and analysis tools
- It should support the sharing of info
- It should support scheduling options for requests
- It should support and provide interfaces to other applications
- It should support end user monitoring of the status of the data warehouse environment

4. Access tools

Their purpose is to provide info to business users for decision making. There are five categories:
- Data query and reporting tools
- Application development tools
- Executive info system (EIS) tools
- OLAP tools
- Data mining tools

Query and reporting tools are used to generate queries and reports. There are two types of reporting tools:
- Production reporting tools, used to generate regular operational reports
- Desktop report writers, inexpensive desktop tools designed for end users

Managed Query tools: used to generate SQL queries. They use meta layer software between users and databases which offers point-and-click creation of SQL statements. These tools are a preferred choice of users for segment identification, demographic analysis, territory management, preparation of customer mailing lists, etc.

Application development tools: graphical data access environments which integrate OLAP tools with the data warehouse and can be used to access all database systems.

OLAP Tools: used to analyze the data in multidimensional and complex views.
To enable multidimensional properties they use MDDBs and MRDBs, where MDDB refers to a multidimensional database and MRDB refers to a multirelational database.

Data mining tools: used to discover knowledge from data warehouse data; they can also be used for data visualization and data correction purposes.

5. Data marts

Data marts are departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following situations:
- Extremely urgent user requirements
- The absence of a budget for a full-scale data warehouse strategy
- The decentralization of business needs
- The attraction of easy-to-use tools and a mid-sized project

Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while designing it the organization has to pay close attention to system scalability, consistency and manageability issues.
2. Data integration

6. Data warehouse admin and management

The management of a data warehouse includes:
- Security and priority management
- Monitoring updates from multiple sources
- Data quality checks
- Managing and updating meta data
- Auditing and reporting data warehouse usage and status
- Purging data
- Replicating, subsetting and distributing data
- Backup and recovery
- Data warehouse storage management, which includes capacity planning, hierarchical storage management, purging of aged data, etc.

7. Information delivery system
- It is used to enable the process of subscribing for data warehouse info
- Delivery to one or more destinations according to a specified scheduling algorithm

2. Building a Data warehouse

There are two reasons why organizations consider data warehousing a critical need. In other words, there are two factors that drive you to build and use a data warehouse.
They are:

Business factors: Business users want to make decisions quickly and correctly using all available data.

Technological factors: To address the incompatibility of operational data stores. IT infrastructure is changing rapidly; its capacity is increasing and its cost is decreasing, so building a data warehouse is feasible.

There are several things to be considered while building a successful data warehouse.

Business considerations: Organizations interested in the development of a data warehouse can choose one of the following two approaches:
1. Top-Down Approach (suggested by Bill Inmon)
2. Bottom-Up Approach (suggested by Ralph Kimball)

1. Top-Down Approach

In the top-down approach suggested by Bill Inmon, we build a centralized repository to house corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy. The central repository for corporate-wide data helps us maintain one version of the truth of the data. The data in the EDW is stored at the most detailed level. The reasons to build the EDW at the most detailed level are:
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.

The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.

Once the EDW is implemented we start building subject-area-specific data marts, which contain data in a denormalized form, also called a star schema. The data in the marts is usually summarized based on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide faster access to the data for the end users' analytics.
If we were to query a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than the equivalent query on the denormalized schema.

We should implement the top-down approach when:
1. The business has complete clarity on all or multiple subject areas' data warehouse requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top-Down approach is that we build a centralized repository to cater for one version of the truth of the business data. This is very important for the data to be reliable and consistent across subject areas, and for reconciliation in case of data-related contention between subject areas.

The disadvantage of using the Top-Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented, followed by building the data marts, before they can access their reports.

2. Bottom-Up Approach

The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data warehouse. Here we build the data marts separately at different points in time, as and when the specific subject area requirements are clear. The data marts are then integrated or combined to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension or a conformed fact is one that can be shared across data marts.

A conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. A conformed dimension means exactly the same thing with every fact table it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and the same granularity across data marts.

The bottom-up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements are clear.
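A conformed dimension can be sketched as two marts' fact tables sharing one dimension table. The schema and rows below are illustrative only, not from any real warehouse:

```python
import sqlite3

# Two data marts' fact tables sharing one conformed date dimension, so their
# results can be combined on identical dimension keys.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales   (date_key INTEGER, revenue REAL);
CREATE TABLE fact_returns (date_key INTEGER, refunds REAL);
INSERT INTO dim_date VALUES (20240101, '2024-01'), (20240201, '2024-02');
INSERT INTO fact_sales   VALUES (20240101, 900.0), (20240201, 1100.0);
INSERT INTO fact_returns VALUES (20240101, 50.0);
""")

# Because both marts conform to the same dim_date keys, a cross-mart
# query is a pair of joins on the shared dimension.
rows = db.execute("""
    SELECT d.month, s.revenue - COALESCE(r.refunds, 0) AS net
    FROM dim_date d
    JOIN fact_sales s ON s.date_key = d.date_key
    LEFT JOIN fact_returns r ON r.date_key = d.date_key
    ORDER BY d.month
""").fetchall()
print(rows)  # [('2024-01', 850.0), ('2024-02', 1100.0)]
```

If each mart had minted its own incompatible date keys, this cross-mart result could not be computed with a simple join; conforming the dimension up front is what makes the marts composable into a warehouse.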
We don't have to wait to know the overall requirements of the warehouse. We should implement the bottom-up approach when:
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity on only one data mart.

The advantage of using the Bottom-Up approach is that it does not require high initial costs and has a faster implementation time; hence the business can start using the marts much earlier than with the top-down approach.

The disadvantage of using the Bottom-Up approach is that it stores data in a denormalized format, hence high space usage for detailed data. There is also a tendency not to keep detailed data in this approach, losing the advantage of having detail data (i.e. the flexibility to easily cater to future requirements). The bottom-up approach is more realistic, but the complexity of the integration may become a serious obstacle.

DESIGN CONSIDERATIONS

To be successful, a data warehouse designer must adopt a holistic approach: consider all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements. Most successful data warehouses that meet these requirements have these common characteristics:
- Are based on a dimensional model
- Contain historical and current data
- Include both detailed and summarized data
- Consolidate disparate data from multiple sources while retaining consistency

A data warehouse is difficult to build for the following reasons:
- Heterogeneity of data sources
- Use of historical data
- Growing nature of the database

The data warehouse design approach must be business driven, continuous and iterative. In addition to the general considerations, the following specific points are relevant to data warehouse design:

Data content

The content and structure of the data warehouse are reflected in its data model.
The data model is the template that describes how information will be organized within the integrated warehouse framework. The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit the warehouse data model.

Meta data

Meta data defines the location and contents of data in the warehouse. It is searchable by users to find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and decision support applications.

Data distribution

One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know how the data should be divided across multiple servers, and which users should get access to which types of data. The data can be distributed based on subject area, location (geographical region), or time (current, month, year).

Tools

A number of tools are available that are specifically designed to help in the implementation of a data warehouse. All selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able to use a common meta data repository.

Design steps

The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and query models

TECHNICAL CONSIDERATIONS

A number of technical issues are to be considered when designing a data warehouse environment.
These issues include:
- The hardware platform that would house the data warehouse
- The DBMS that supports the warehouse data
- The communication infrastructure that connects data marts, operational systems and end users
- The hardware and software to support the meta data repository
- The systems management framework that enables administration of the entire environment

IMPLEMENTATION CONSIDERATIONS

The following logical steps are needed to implement a data warehouse:
- Collect and analyze business requirements
- Create a data model and a physical design
- Define data sources
- Choose the database technology and platform
- Extract the data from operational databases, transform it, clean it up and load it into the warehouse
- Choose database access and reporting tools
- Choose database connectivity software
- Choose data analysis and presentation software
- Update the data warehouse

Access tools

Data warehouse implementation relies on selecting suitable data access tools. The best way to choose a tool is based on the type of data that can be selected with it and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:
- Simple tabular form data
- Ranking data
- Multivariable data
- Time series data
- Graphing, charting and pivoting data
- Complex textual search data
- Statistical analysis data
- Data for testing of hypotheses, trends and patterns
- Predefined repeatable queries
- Ad hoc user-specified queries
- Reporting and analysis data
- Complex queries with multiple joins, multi-level subqueries and sophisticated search criteria

Data extraction, clean up, transformation and migration

Proper attention must be paid to data extraction, which represents a success factor for a data warehouse architecture.
When implementing a data warehouse, the following selection criteria, which affect the ability to transform, consolidate, integrate and repair the data, should be considered:
- Timeliness of data delivery to the warehouse
- The tool must have the ability to identify the particular data so that it can be read by the conversion tool
- The tool must support flat files and indexed files, since much corporate data is still stored in these formats
- The tool must have the capability to merge data from multiple data stores
- The tool should have a specification interface to indicate the data to be extracted
- The tool should have the ability to read data from data dictionaries
- The code generated by the tool should be completely maintainable
- The tool should permit the user to extract the required data
- The tool must have the facility to perform data type and character set translation
- The tool must have the capability to create summarization, aggregation and derivation of records
- The data warehouse database system must be able to load data directly from these tools

Data placement strategies

As a data warehouse grows, there are at least two options for data placement. One is to move some of the data in the data warehouse onto another storage medium. The second option is to distribute the data in the data warehouse across multiple servers.

User levels

The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are three classes of users:

Casual users: are most comfortable retrieving info from the warehouse in predefined formats and running pre-existing queries and reports. These users do not need tools for building standard and ad hoc reports.

Power users: can use predefined as well as user-defined queries to create simple and ad hoc reports. These users can engage in drill-down operations. These users may have experience using reporting and query tools.
Expert users: tend to create their own complex queries and perform standard analysis on the info they retrieve. These users have knowledge about the use of query and report tools.

Benefits of data warehousing

Data warehouse usage includes:
- Locating the right info
- Presentation of info
- Testing of hypotheses
- Discovery of info
- Sharing the analysis

The benefits can be classified into two kinds:

Tangible benefits (quantifiable / measurable) include:
- Improvement in product inventory
- Decrease in production cost
- Improvement in selection of target markets
- Enhancement in asset and liability management

Intangible benefits (not easy to quantify) include:
- Improvement in productivity by keeping all data in a single location and eliminating rekeying of data
- Reduced redundant processing
- Enhanced customer relations

3. Mapping the data warehouse architecture to Multiprocessor architecture

The functions of a data warehouse are based on relational database technology, implemented in a parallel manner. There are two advantages of having parallel relational database technology for a data warehouse:

Linear Speed-up: the ability to increase the number of processors in order to reduce response time.

Linear Scale-up: the ability to provide the same performance on the same requests as the database size increases.

Types of parallelism

There are two types of parallelism:

Inter-query parallelism: different server threads or processes handle multiple requests at the same time.

Intra-query parallelism: this form of parallelism decomposes a serial SQL query into lower-level operations such as scan, join and sort. These lower-level operations are then executed concurrently, in parallel.
Intra-query parallelism can be done in either of two ways:

Horizontal parallelism: the database is partitioned across multiple disks, and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.

Vertical parallelism: occurs among different tasks. All query components such as scan, join and sort are executed in parallel in a pipelined fashion. In other words, the output from one task becomes an input into another task.

Data partitioning

Data partitioning is the key component for effective parallel execution of database operations. Partitioning can be done randomly or intelligently.

Random partitioning includes random data striping across multiple disks on a single server. Another option for random partitioning is round-robin partitioning, in which each record is placed on the next disk assigned to the database.

Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not waste time searching for it across all disks. The various intelligent partitioning schemes include:

Hash partitioning: a hash algorithm is used to calculate the partition number based on the value of the partitioning key for each row.

Key range partitioning: rows are placed and located in the partitions according to the value of the partitioning key. For example, all the rows with key values from A to K are in partition 1, L to T in partition 2, and so on.

Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk, etc. This is useful for small reference tables.

User-defined partitioning: allows a table to be partitioned on the basis of a user-defined expression.

Database architectures for parallel processing

There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3.
Shared nothing architecture

Shared Memory Architecture

Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:
- Multiple CPUs share memory.
- Each CPU has full access to all shared memory through a common bus.
- Communication between nodes occurs via shared memory.
- Performance is limited by the bandwidth of the memory bus.

Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the multiple CPUs and is accessible by all the CPUs through a memory bus. Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.

Performance is potentially limited in a tightly coupled system by a number of factors. These include various system components such as the memory bandwidth, CPU-to-CPU communication bandwidth, the memory available on the system, the I/O bandwidth, and the bandwidth of the common bus.

Parallel processing advantages of shared memory systems:
- Memory access is cheaper than inter-node communication. This means that internal synchronization is faster than using a lock manager.
- Shared memory systems are easier to administer than a cluster.

A disadvantage of shared memory systems for parallel processing:
- Scalability is limited by bus bandwidth and latency, and by available memory.

Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the following characteristics:
- Each node consists of one or more CPUs and associated memory.
- Memory is not shared between nodes.
- Communication occurs over a common high-speed bus.
- Each node has access to the same disks and other resources.
- A node can be an SMP if the hardware supports it.
- Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.

The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed Lock Manager (DLM) is required. Examples of loosely coupled systems are VAXclusters and Sun clusters.

Since memory is not shared among the nodes, each node has its own data cache. Cache consistency must be maintained across the nodes, and a lock manager is needed to maintain this consistency. Additionally, instance locks using the DLM at the Oracle level must be maintained to ensure that all nodes in the cluster see identical data. There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance impact depends on the hardware and software components, such as the bandwidth of the high-speed bus through which the nodes communicate, and DLM performance.

Parallel processing advantages of shared disk systems:
- Shared disk systems permit high availability: all data is accessible even if one node dies.
- These systems have the concept of one database, which is an advantage over shared nothing systems.
- Shared disk systems provide for incremental growth.

Parallel processing disadvantages of shared disk systems:
- Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.
- If the workload is not partitioned well, there may be high synchronization overhead.
- There is operating system overhead in running shared disk software.

Shared Nothing Architecture

Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the CPU which owns it. Shared nothing systems can be represented as follows:

Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more CPUs and disks can improve scale-up. Oracle Parallel Server can access the disks on a shared nothing system as long as the operating system provides transparent disk access, but this access is expensive in terms of latency.

Shared nothing systems have advantages and disadvantages for parallel processing:

Advantages
- Shared nothing systems provide for incremental growth.
- System growth is practically unlimited.
- MPPs are good for read-only databases and decision support applications.
- Failure is local: if one node fails, the others stay up.

Disadvantages
- More coordination is required.
- More overhead is required for a process working on a disk belonging to another node.
- If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.

Parallel DBMS features
- Scope and techniques of parallel DBMS operations
- Optimizer implementation
- Application transparency
- A parallel environment which allows the DBMS server to take full advantage of the existing facilities at a very low level
- DBMS management tools that help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS
- Price/Performance: the parallel RDBMS can demonstrate a non-linear speed-up and scale-up at reasonable costs
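The hash, key-range and round-robin partitioning schemes described earlier can be sketched as follows. The four-disk layout, the key ranges, and the sample keys are invented for illustration:

```python
import hashlib
import itertools

N_DISKS = 4

def hash_partition(key):
    # hash partitioning: a stable hash of the key gives the partition number
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % N_DISKS

def key_range_partition(name):
    # key range partitioning: A-F -> 0, G-L -> 1, M-R -> 2, S-Z -> 3
    ranges = ["F", "L", "R", "Z"]
    for part, upper in enumerate(ranges):
        if name[0].upper() <= upper:
            return part
    return N_DISKS - 1

# round-robin (random) partitioning: each new record goes to the next disk
round_robin = itertools.cycle(range(N_DISKS))

print(key_range_partition("Kimball"))  # 1
print(key_range_partition("Smith"))    # 3
print([next(round_robin) for _ in range(5)])  # [0, 1, 2, 3, 0]
```

Round-robin spreads load evenly but gives the DBMS no way to locate a record without searching every disk; hash and key-range partitioning let it compute the disk directly from the key, which is exactly the point of intelligent partitioning.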
Parallel DBMS vendors

Oracle: Parallel Query Option (PQO)
Architecture: shared disk architecture
Data partition: key range, hash, round robin
Parallel operations: hash joins, scan and sort

Informix: eXtended Parallel Server (XPS)
Architecture: shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELETE

IBM: DB2 Parallel Edition (DB2 PE)
Architecture: shared nothing models
Data partition: hash
Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization

SYBASE: SYBASE MPP
Architecture: shared nothing models
Data partition: hash, key range, schema
Parallel operations: horizontal and vertical parallelism

4. DBMS Schemas for Decision Support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a collection of related data items, consisting of measures and context data. It typically represents business items or business transactions. A dimension is a collection of data that describes one business dimension. Dimensions determine the contextual background for the facts; they are the parameters over which we want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. In the relational context, there are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema

Star schema
The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called the star schema.
The basic idea of the star schema is that information can be classified into two groups: facts and dimensions. A star schema has one large central table (the fact table) and a set of smaller tables (the dimensions) arranged in a radial pattern around the central table. Facts are the core data elements being analyzed, while dimensions are attributes about the facts. The determination of which schema model should be used for a data warehouse should be based upon the analysis of project requirements, accessible tools and project team preferences.

What is a star schema?
The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas dimensional tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is the most commonly used nowadays and is recommended by Oracle.

Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys from the primary keys of related dimension tables. A fact table typically has two types of columns: foreign keys to dimension tables, and measures, those that contain numeric facts. A fact table can contain facts at the detail or aggregated level.

Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a Region dimension (profit by country, state, city), or a Product dimension (profit for product1, product2). A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension has no hierarchies and levels, it is called a flat dimension or list.
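A minimal sketch of the star layout described above, using an in-memory SQLite database: one fact table holding foreign keys and a numeric measure, two dimension tables, and an aggregate query that views the measure by a dimension. All table names, column names and data are invented for illustration.

```python
import sqlite3

# A tiny star schema: fact_sales points at two dimension tables and
# carries one measure (units_sold). Names and values are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, units_sold INTEGER,
                          FOREIGN KEY(product_id) REFERENCES dim_product(product_id),
                          FOREIGN KEY(time_id)    REFERENCES dim_time(time_id));
""")
db.executemany("INSERT INTO dim_product VALUES (?,?,?)",
               [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
db.executemany("INSERT INTO dim_time VALUES (?,?,?)",
               [(1, "Jan", 2024), (2, "Feb", 2024)])
db.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
               [(1, 1, 10), (1, 2, 5), (2, 1, 7)])

# An aggregate query over the star: total units per product, joining
# the central fact table to a single dimension table.
rows = db.execute("""
    SELECT p.name, SUM(f.units_sold)
    FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
assert rows == [("Gadget", 7), ("Widget", 15)]
```

Note the shape of the query: one join per dimension of interest, then GROUP BY on dimension attributes, which is the typical star-schema access pattern.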
The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally smaller in size than the fact table. Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times and channels.

Measures
Measures are numeric data based on columns in a fact table. They are the primary data in which end users are interested. E.g. a sales fact table may contain a profit measure which represents profit on each sale. Aggregations are pre-calculated numeric data. By calculating and storing the answers to a query before users ask for it, the query processing time can be reduced. This is key to providing fast query performance in OLAP. Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.

The main characteristics of the star schema:
Simple structure -> easy to understand schema
Great query effectiveness -> small number of tables to join
Relatively long time of loading data into dimension tables -> de-normalization and data redundancy mean that the size of the table can be large
The most commonly used in data warehouse implementations -> widely supported by a large number of business intelligence tools

Snowflake schema:
The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy. For example, consider a Time dimension that consists of 2 different hierarchies: 1. Year -> Month -> Day and 2.
Week -> Day. We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then connected to Day. Week is only connected to Day. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables. The snowflake schema is the result of decomposing one or more of the dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.

Fact constellation schema:
For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into more star schemas, each of them describing facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.

5. Data Extraction, Cleanup, and Transformation Tools
ETL stands for Extract, Transform, Load; it is the data warehouse acquisition process that involves extracting the data from outside sources, transforming the data to fit business needs, and ultimately loading the transformed data into the data warehouse. Example tools: 1. Informatica. 2. DataStage. 3. Oracle Warehouse Builder. 4. Ab Initio. ETL can also be used for integration with legacy systems.
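The extract, transform, load sequence can be pictured as a toy pipeline. A minimal sketch in Python; the source records, the cleanup rules and the target list are all invented for illustration, not taken from any particular ETL tool.

```python
# Toy extract -> transform -> load flow. Everything here is invented
# for illustration: the "source system" is a list of raw dicts.
source = [
    {"name": " alice ", "amount": "120.50"},
    {"name": "BOB",     "amount": "80"},
]

def extract(system):
    # Extraction: pull raw records out of the outside source.
    return list(system)

def transform(records):
    # Transformation: clean values and convert types to fit
    # business needs (trim/casefold names, parse amounts).
    return [{"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in records]

def load(warehouse, records):
    # Load: append the transformed records to the warehouse store.
    warehouse.extend(records)

warehouse = []
load(warehouse, transform(extract(source)))
assert warehouse == [{"name": "Alice", "amount": 120.5},
                     {"name": "Bob", "amount": 80.0}]
```

Real ETL tools add scheduling, error handling and incremental extraction on top of this basic three-stage flow, but the stage boundaries are the same.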
ETL is the data warehouse acquisition process of Extracting, Transforming and Loading data from source systems into the data warehouse.

Extraction
Extraction is the operation of extracting data from a source system for further use in a data warehouse environment. This is the first step of the ETL process. After the extraction, this data can be transformed and loaded into the data warehouse.

Introduction to Extraction Methods in Data Warehouses
The extraction method you should choose is highly dependent on the source system and also on the business needs in the target data warehouse environment. Very often, there is no possibility to add additional logic to the source systems to support incremental extraction of data, due to the performance impact or the increased workload on these systems. Sometimes even the customer is not allowed to add anything to an out-of-the-box application system. You have to decide how to extract data logically and physically.
Logical Extraction: there are two kinds of logical extraction: 1. Full extraction, 2. Incremental extraction.
Physical Extraction: there are two kinds of physical extraction: 1. Online extraction, 2. Offline extraction.

Transformation tools
Their purpose is to provide information to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive info system (EIS) tools
OLAP tools
Data mining tools

Query and reporting tools are used to generate queries and reports. There are two types of reporting tools:
Production reporting tools, used to generate regular operational reports.
Desktop report writers, inexpensive desktop tools designed for end users.
Managed query tools: used to generate SQL queries. They use a meta layer of software between users and the database which offers point-and-click creation of SQL statements.
This tool is a preferred choice of users for performing segment identification, demographic analysis, territory management, preparation of customer mailing lists, etc.
Application development tools: these provide a graphical data access environment which integrates OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: used to analyze the data in multidimensional and complex views. To enable multidimensional properties they use MDDB and MRDB, where MDDB refers to a multidimensional database and MRDB refers to a multirelational database.
Data mining tools: used to discover knowledge from the data warehouse data; they can also be used for data visualization and data correction purposes.

6. Metadata
Metadata: data about data.
Metadata in the Data Warehouse
Metadata is one of the most important aspects of data warehousing. It is the data about the data stored in the data warehouse and its users. Metadata provides decision-support-oriented pointers to warehouse data, and thus provides a logical link between warehouse data and the decision support application. Metadata is the key to providing users and applications with a road map to the information stored in the warehouse. Metadata can define all attributes, data sources and timing, and the rules that govern data use and data transformation of all data elements. Metadata (metacontent) is defined as data providing information about one or more aspects of the data, such as:
Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of the data
Location on a computer network where the data was created
Standards used

Types:
1. Technical metadata: contains information about warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes:
Information about data stores
Transformation descriptions.
That is, the mapping methods from the operational database to the warehouse database
Warehouse object and data structure definitions for target data
The rules used to perform cleanup and data enhancement
Data mapping operations
Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, etc.

2. Business metadata: contains information that describes the data stored in the data warehouse for users. It includes:
Subject areas and information object types, including queries, reports, images, video and audio clips, etc.
Internet home pages
Information related to the information delivery system
Data warehouse operational information such as ownerships, audit trails, etc.

Other Types:
Structural metadata is used to describe the structure of computer systems, such as tables, columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language. According to Ralph Kimball, metadata can be divided into two similar categories: technical metadata and business metadata. Technical metadata corresponds to internal metadata, business metadata to external metadata. Kimball adds a third category named process metadata. On the other hand, NISO distinguishes between three types of metadata: descriptive, structural and administrative. Descriptive metadata is the information used to search for and locate an object, such as title, author, subjects, keywords, publisher; structural metadata gives a description of how the components of the object are organized; and administrative metadata refers to the technical information, including file type. Two sub-types of administrative metadata are rights management metadata and preservation metadata.

Types of Data Warehouse
There are mainly three types of data warehouse:
1). Enterprise Data Warehouse.
2). Operational data store.
3). Data Mart.
An Enterprise Data Warehouse provides a central database for decision support throughout the enterprise.
An Operational data store has a broad, enterprise-wide scope but, unlike a true enterprise data warehouse, its data is refreshed in near real time and used for routine business activity. A Data Mart is a subpart of a data warehouse. It supports a particular region or is designed for a particular line of business such as sales, marketing or finance; in any organization, the documents of a particular department can form a data mart.
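The technical metadata described in this section amounts to a data dictionary: for each warehouse element, where it came from, how it was transformed, and how it is refreshed. A minimal sketch in Python; the column name, source path and entry fields are all hypothetical.

```python
# A toy metadata catalog: "data about data" for one warehouse column.
# Every name and field here is invented for illustration.
metadata = {
    "fact_sales.revenue": {
        "source": "orders_db.order_lines.line_total",
        "transformation": "SUM per (product, day); currency normalized",
        "type": "DECIMAL(12,2)",
        "refresh": "nightly batch",
    }
}

def describe(column):
    """Return a one-line 'road map' entry for a warehouse column."""
    entry = metadata[column]
    return f"{column} <- {entry['source']} ({entry['refresh']})"

assert "orders_db.order_lines.line_total" in describe("fact_sales.revenue")
```

A real repository would also record access authorization, backup and audit history, as listed above, but the lookup pattern is the same.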
UNIT II BUSINESS ANALYSIS

1. Reporting and Query Tools and Applications – Tool Categories – The Need for Applications

Data query and reporting tools
Query and reporting tools are divided into two parts:
Reporting tools
Managed query tools
Reporting tools are further divided into two parts:
Production reporting tools let companies generate regular operational reports or support high-volume batch jobs, such as calculating and printing paychecks.
Report writers, on the other hand, are inexpensive desktop tools designed for end users.
Managed query tools shield end users from the complexities of SQL and database structure by inserting a meta layer between the user and the database. The meta layer is software that provides a subject-oriented view of the database and supports point-and-click creation of SQL. These tools are designed for easy-to-use point-and-click and visual navigation operations that either accept SQL or generate SQL statements to query relational data stored in the warehouse. Some of these tools format the retrieved data into easy-to-read reports.

Data Warehouse Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision making. These users interact with the data warehouse using front-end tools. Although regular reports and custom reports are the primary delivery vehicles for the analysis done in most data warehouses, many development efforts in the data warehouse arena are focusing on exception reporting, also known as alerts.
Example: if the data warehouse is designed for assessing the risk of currency trading, an alert can be activated when a certain currency rate drops below a predefined threshold.
Access tools can be divided into five main groups:
Data query and reporting tools.
Application development tools.
Executive information system (EIS) tools.
OLAP tools.
Data mining tools.

2. Cognos Impromptu
What is Impromptu?
Impromptu is an interactive database reporting tool.
It allows Power Users to query data without programming knowledge. When using the Impromptu tool, no data is written or changed in the database; it is only capable of reading the data.
Impromptu's main features include:
Interactive reporting capability
Enterprise-wide scalability
Superior user interface
Fastest time to result
Lowest cost of ownership

Catalogs
Impromptu stores metadata in subject-related folders. This metadata is what will be used to develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not contain any data. It just contains information about connecting to the database and the fields that will be accessible for reports.
A catalog contains:
Folders: meaningful groups of information representing columns from one or more tables
Columns: individual data elements that can appear in one or more folders
Calculations: expressions used to compute required values from existing data
Conditions: used to filter information so that only a certain type of information is displayed
Prompts: pre-defined selection criteria prompts that users can include in reports they create
Other components, such as metadata, a logical database name, join information, and user classes

You can use catalogs to:
view, run, and print reports
export reports to other applications
disconnect from and connect to the database
create reports
change the contents of the catalog
add user classes

Prompts
You can use prompts to:
filter reports
calculate data items
format data

Picklist Prompts
A picklist prompt presents you with a list of data items from which you select one or more values, so you need not be familiar with the database. The values listed in picklist prompts can be retrieved from a database via a catalog, when you want to select information that often changes, or from
a column in another saved Impromptu report, a snapshot, or a HotFile. A report can include a prompt that asks you to select a product type from a list of those available in the database. Only the products belonging to the product type you select are retrieved and displayed in your report.

Reports
Reports are created by choosing fields from the catalog folders. This process builds a SQL (Structured Query Language) statement behind the scenes. No SQL knowledge is required to use Impromptu. The data in the report may be formatted, sorted and/or grouped as needed. Titles, dates, headers and footers and other standard text formatting features (italics, bolding, and font size) are also available. Once the desired layout is obtained, the report can be saved to a report file. This report can then be run at a different time, and the query will be sent to the database. It is also possible to save a report as a snapshot. This provides a local copy of the data; this data will not be updated when the report is opened. Cross-tab reports, similar to Excel Pivot tables, are also easily created in Impromptu.

Frame-Based Reporting
Frames are the building blocks of all Impromptu reports and templates. They may contain report objects, such as data, text, pictures, and charts. There are no limits to the number of frames that you can place within an individual report or template. You can nest frames within other frames to group report objects within a report.
Different types of frames and their purposes in frame-based reporting:
Form frame: an empty form frame appears.
List frame: an empty list frame appears.
Text frame: the flashing I-beam appears where you can begin inserting text.
Picture frame: the Source tab (Picture Properties dialog box) appears. You can use this tab to select the image to include in the frame.
Chart frame: the Data tab (Chart Properties dialog box) appears. You can use this tab to select the data item to include in the chart.
OLE Object: the Insert Object dialog box appears, where you can locate and select the file you want to insert, or you can create a new object using the software listed in the Object Type box.

Impromptu features:
Unified query and reporting interface: it unifies both query and reporting in a single user interface
Object-oriented architecture: it enables inheritance-based administration so that more than 1000 users can be accommodated as easily as a single user
Complete integration with PowerPlay: it provides an integrated solution for exploring trends and patterns
Scalability: its scalability ranges from a single user to 1000 users
Security and control: security is based on user profiles and their classes
Data presented in a business context: it presents information using the terminology of the business
Over 70 predefined report templates: users can simply supply the data to create an interactive report
Frame-based reporting: it offers a number of objects to create a user-designed report
Business-relevant reporting: it can be used to generate business-relevant reports through filters, preconditions and calculations
Database-independent catalogs: since catalogs are independent in nature, they require minimum maintenance

3. OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data, or enable them to analyze their company's historical data for trends and patterns. Online Analytical Processing (OLAP) applications and tools are those that are designed to ask "complex queries of large multidimensional collections of data." For this reason, OLAP goes hand in hand with data warehousing.

4. Need for OLAP
The key driver of OLAP is the multidimensional nature of the business problem.
These problems are characterized by retrieving a very large number of records (which can reach gigabytes and terabytes) and summarizing this data into a form of information that can be used by business analysts. One limitation of SQL is that it cannot represent these complex problems directly. A query will be translated into several SQL statements. These SQL statements will involve multiple joins, intermediate tables, sorting, aggregations and huge temporary space to store intermediate tables. These procedures require a lot of computation and therefore a long running time. The second limitation of SQL is its inability to use mathematical models in these SQL statements. Even if an analyst could create these complex statements using SQL, a large amount of computation and memory would still be needed. Therefore, the use of OLAP is preferable for solving this kind of problem.

5. Categories of OLAP Tools
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats. That is, data is stored in array-based structures.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
Can perform complex calculations: all calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
Requires additional investment: cube technology is often proprietary and may not already exist in the organization. Therefore, to adopt MOLAP technology, chances are that additional investments in human and capital resources are needed.
Examples: Hyperion Essbase, Fusion (Information Builders)

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement. Data is stored in relational tables.
Advantages:
Can handle large amounts of data: the data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Can leverage functionalities inherent in the relational database: often, the relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: because each ROLAP report is essentially a SQL query (or multiple SQL queries) against the relational database, the query time can be long if the underlying data size is large.
Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions, as well as the ability to allow users to define their own functions.
Examples: MicroStrategy Intelligence Server, MetaCube (Informix/IBM)

HOLAP (MQE: Managed Query Environment)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and aggregations in multidimensional form, while the rest of the data is stored in the relational database.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic Services

6. Multidimensional Versus Multirelational OLAP
These relational implementations of multidimensional database systems are sometimes referred to as multirelational database systems. To achieve the required speed, these products use star or snowflake schemas: specially optimized and denormalized data models that involve data restructuring and aggregation. (The snowflake schema is an extension of the star schema that supports multiple fact tables and joins between them.) One benefit of the star schema approach is reduced complexity in the data model, which increases data "legibility," making it easier for users to pose business questions of an OLAP nature. Data warehouse queries can be answered up to 10 times faster because of improved navigation.

Two types of database activity:
1. OLTP: On-Line Transaction Processing
Short transactions, both queries and updates (e.g., update account balance, enroll in course)
Queries are simple (e.g., find account balance, find grade in course)
Updates are frequent (e.g., concert tickets, seat reservations, shopping carts)
2. OLAP: On-Line Analytical Processing
Long transactions, usually complex queries (e.g., all statistics about all sales, grouped by department and month)
"Data mining" operations
Infrequent updates

OLTP vs OLAP
OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used to facilitate and manage usual business applications.
Most of the applications you see and use are OLTP based. OLTP technology is used to perform updates on operational or transactional systems (e.g., point-of-sale systems). OLAP stands for On-Line Analytical Processing and is an approach to answering multidimensional queries. OLAP was conceived for Management Information Systems and Decision Support Systems. OLAP technology is used to perform complex analysis of the data in a data warehouse.

The following comparison summarizes the major differences between OLTP system design (Online Transaction Processing, the operational system) and OLAP system design (Online Analytical Processing, the data warehouse).

Source of data. OLTP: operational data; OLTP systems are the original source of the data. OLAP: consolidated data; OLAP data comes from the various OLTP databases.
Purpose of data. OLTP: to control and run fundamental business tasks. OLAP: to help with planning, problem solving, and decision support.
What the data reveals. OLTP: a snapshot of ongoing business processes. OLAP: multi-dimensional views of various kinds of business activities.
Inserts and updates. OLTP: short and fast inserts and updates initiated by end users. OLAP: periodic long-running batch jobs refresh the data.
Queries. OLTP: relatively standardized and simple queries returning relatively few records. OLAP: often complex queries involving aggregations.
Processing speed. OLTP: typically very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
Space requirements. OLTP: can be relatively small if historical data is archived. OLAP: larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
Database design. OLTP: highly normalized with many tables. OLAP: typically de-normalized with fewer tables; use of star and/or snowflake schemas.
Backup and recovery. OLTP: back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP: instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

7. The Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time. One way to look at the multidimensional data model is to view it as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with dimensions (product type, market and time), with the unit variables organized as cells in an array. This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially. Dimensions are hierarchical in nature; i.e., the time dimension may contain hierarchies for years, quarters, months, weeks and days. GEOGRAPHY may contain country, state, city, etc. In this cube we can observe that each side of the cube represents one of the elements of the question: the x-axis represents time, the y-axis represents products and the z-axis represents different centers. The cells of the cube represent the number of products sold, or can represent the price of the items. This figure also gives a different understanding of drill-down operations. As the size of the dimensions increases, the size of the cube will also increase exponentially. The time response of the cube depends on the size of the cube.
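One minimal way to picture the cube is as a mapping from dimension coordinates (product, market, time) to a measure value. The sketch below uses invented sales figures; it also shows how fixing one dimension yields a subcube and how summing out a dimension aggregates the measure.

```python
# A toy cube: (product, market, time) -> units sold. Data is invented.
cube = {
    ("soap", "east", "Q1"): 10, ("soap", "east", "Q2"): 12,
    ("soap", "west", "Q1"): 7,  ("shampoo", "east", "Q1"): 4,
}

def slice_(cube, axis, value):
    """Selection (slice): fix one dimension, yielding a subcube."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def roll_up(cube, axis):
    """Aggregation (roll-up): sum the measure over one dimension."""
    out = {}
    for key, v in cube.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

# Slice on time = Q1, then roll up over time and market:
# total Q1 sales per product.
q1 = slice_(cube, 2, "Q1")
per_product = roll_up(roll_up(q1, 2), 1)
assert per_product == {("soap",): 17, ("shampoo",): 4}
```

Note the exponential growth the text mentions: a cube with d dimensions of n members each has up to n**d cells, which is why sparse storage (here, a dict holding only non-empty cells) is the norm.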
Operations in the Multidimensional Data Model:
Aggregation (roll-up): dimension reduction, e.g., total sales by city; summarization over an aggregate hierarchy, e.g., total sales by city and year -> total sales by region and by year
Selection (slice): defines a subcube, e.g., sales where city = Palo Alto and date = 1/15/96
Navigation to detailed data (drill-down): e.g., (sales - expense) by city, top 3% of cities by average income
Visualization operations (e.g., pivot or dice)

8. OLAP Guidelines
Dr. E.F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs to match their business requirements (reference 3). These rules are:
1) Multidimensional conceptual view: the OLAP tool should provide an appropriate multidimensional business model that suits the business problems and requirements.
2) Transparency: the OLAP tool should provide transparency to the input data for the users.
3) Accessibility: the OLAP tool should access only the data required for the analysis needed.
4) Consistent reporting performance: the size of the database should not affect performance in any way.
5) Client/server architecture: the OLAP tool should use a client-server architecture to ensure better performance and flexibility.
6) Generic dimensionality: data entered should be equivalent to the structure and operation requirements.
7) Dynamic sparse matrix handling: the OLAP tool should be able to manage the sparse matrix and so maintain the level of performance.
8) Multi-user support: the OLAP tool should allow several users to work concurrently.
9) Unrestricted cross-dimensional operations: the OLAP tool should be able to perform operations across the dimensions of the cube.
10) Intuitive data manipulation.
"Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface." (Reference 4)
11) Flexible reporting: The ability of the tool to present rows and columns in a manner suitable for analysis.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business; it should be possible to define multiple dimensions and hierarchies.

In addition to these guidelines, an OLAP system should also support:
- Comprehensive database management tools, giving database management control over distributed businesses.
- The ability to drill down to the detail (source record) level, which requires that the OLAP tool allow smooth transitions in the multidimensional database.
- Incremental database refresh: the OLAP tool should provide partial refresh.
- A Structured Query Language (SQL) interface: the OLAP system should integrate effectively into the surrounding enterprise environment.

UNIT III DATA MINING

1. Data Mining (Knowledge Discovery in Databases)

Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases. Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).
The key properties of data mining are: automatic discovery of patterns; prediction of likely outcomes; creation of actionable information; and a focus on large data sets and databases. Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

2. Data Mining Functions

A basic understanding of data mining functions and algorithms is required for using Oracle Data Mining. This section introduces the concept of data mining functions; algorithms are introduced in "Data Mining Algorithms". Each data mining function specifies a class of problems that can be modeled and solved. Data mining functions fall generally into two categories: supervised and unsupervised. The notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence. Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.

Supervised Data Mining: Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute, or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes, or predictors. Supervised learning generally results in predictive models. This is in contrast to unsupervised learning, where the goal is pattern detection. Building a supervised model involves training, a process whereby the software analyzes many cases in which the target value is already known. In the training process, the model "learns" the logic for making the prediction.
For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded, or not responded, to a promotion in the past.

Unsupervised Data Mining: Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes, and no previously known result to guide the algorithm in building the model. Unsupervised learning can be used for descriptive purposes; it can also be used to make predictions.

Data Pre-processing

Data pre-processing is an often neglected but important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, and so on. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of the data come first, before running any analysis. If there is much irrelevant and redundant information, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.

3. Classification of Data Mining Systems

Data mining classification scheme:
1. Decisions in data mining
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
2.
Data mining tasks
- Descriptive data mining
- Predictive data mining

1. Decisions in data mining
- Databases to be mined: relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.; multiple/integrated functions and mining at multiple levels.
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.
- Applications adapted: retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

2. Data mining tasks
- Prediction tasks: use some variables to predict unknown or future values of other variables.
- Description tasks: find human-interpretable patterns that describe the data.

Common data mining tasks:
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classifications of data mining systems:
Supervised learning (classification): Supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations; new data is classified based on the training set.
Unsupervised learning (clustering): The class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
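The supervised/unsupervised distinction above can be made concrete with a small sketch. Below is a minimal supervised learner, a nearest-centroid classifier: the "training" step computes one centroid per known class label, and prediction assigns a new, unlabelled point to the nearest centroid. The feature values and class names are purely illustrative.

```python
# Minimal supervised-learning sketch: a nearest-centroid classifier.
# Training data: (feature vector, class label) pairs with made-up values.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]

def train(examples):
    """'Learning' step: compute one centroid per known class label."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {lbl: (sx / counts[lbl], sy / counts[lbl])
            for lbl, (sx, sy) in sums.items()}

def predict(model, point):
    """Classify a new point by its nearest class centroid."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(model, key=lambda lbl: dist2(model[lbl], point))

model = train(training)
print(predict(model, (1.1, 0.9)))  # -> "A"
print(predict(model, (4.9, 5.1)))  # -> "B"
```

An unsupervised method would receive the same feature vectors without the "A"/"B" labels and would have to discover the two groups itself, which is exactly the clustering task described later in these notes.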
Classification predicts categorical class labels (discrete or nominal): it classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
Numeric prediction models continuous-valued functions, i.e., it predicts unknown or missing values.
Typical applications: credit/loan approval; medical diagnosis (whether a tumor is cancerous or benign); fraud detection (whether a transaction is fraudulent); web page categorization (which category a page belongs to).

4. Data Mining Task Primitives

The set of task-relevant data to be mined: This specifies the portions of the database, or the set of data, in which the user is interested. It includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: Knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in the figure. User beliefs about relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
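Support and confidence, the two interestingness measures named above, can be computed directly. The sketch below uses a handful of invented market-basket transactions: support of an itemset is the fraction of transactions containing it, and the confidence of a rule is the support of the whole rule divided by the support of its left-hand side. Rules falling below the user-specified thresholds would be discarded as uninteresting.

```python
# Illustrative market-basket transactions (invented for the example).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs u rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

# The rule {diapers} -> {beer}:
print(support({"diapers", "beer"}))       # 3 of 5 transactions -> 0.6
print(confidence({"diapers"}, {"beer"}))  # 0.6 / 0.8 -> 0.75

# Apply user-specified thresholds; this rule survives both.
min_sup, min_conf = 0.5, 0.7
print(support({"diapers", "beer"}) >= min_sup
      and confidence({"diapers"}, {"beer"}) >= min_conf)  # True
```

Filtering candidate itemsets by a minimum support in this way is also the core pruning idea behind the Apriori algorithm mentioned later in these notes.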
5. Data Preprocessing

The real-world data to be analyzed by data mining techniques are:
1) Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
2) Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields such as dates. It is therefore necessary to use techniques to replace noisy data.
3) Inconsistent: containing discrepancies between different data items. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. Naming inconsistencies may also occur for attribute values. The inconsistency in the data needs to be removed.
4) Aggregate information: It would be useful to obtain aggregate information, such as the sales per customer region, that is not part of any pre-computed data cube in the data warehouse.
5) Enhancing the mining process: A large number of data sets may make the data mining process slow. Hence, reducing the number of data sets to enhance the performance of the mining process is important.
6) Improving data quality: Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
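The incomplete and noisy data problems described above can be caught by simple screening rules before mining begins. The sketch below checks invented records for missing values and out-of-range values; the field names and the age range are assumptions chosen for illustration only.

```python
# Illustrative raw records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": -1, "income": 48000},   # out-of-range age (noisy data)
    {"age": 45, "income": None},    # missing income (incomplete data)
    {"age": 29, "income": 61000},
]

def screen(rec):
    """Return a list of data-quality problems found in one record."""
    problems = []
    for field, value in rec.items():
        if value is None:
            problems.append(f"missing {field}")
    # A hypothetical sanity range for age, for illustration.
    if rec["age"] is not None and not (0 <= rec["age"] <= 120):
        problems.append("age out of range")
    return problems

# Report only the records that have problems.
report = {i: screen(r) for i, r in enumerate(records) if screen(r)}
print(report)  # {1: ['age out of range'], 2: ['missing income']}
```

In a real pipeline, flagged records would then be cleaned (values imputed or corrected) or dropped before training.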
Different Forms of Data Preprocessing

Data cleaning: Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining applied to it. Also, dirty data can cause confusion for the mining procedure, resulting in unreliable output, and mining routines are not always robust to it. Therefore, a useful preprocessing step is to run some data-cleaning routines.

Data integration: Data integration involves combining data from multiple databases, data cubes, or files. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. Also, some attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.

Data transformation: Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute to the success of the mining process.
Normalization: scaling the data to be analyzed into a specific range, such as [0.0, 1.0], to produce better results.
Aggregation: it is also useful for data analysis to obtain aggregate information, such as the sales per customer region. Since this is not part of any pre-computed data cube, it must be computed; this process is called aggregation.
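The two transformation operations just described can be sketched in a few lines. Min-max normalization rescales values into [0.0, 1.0] as the text specifies, and aggregation sums a measure per group (here, sales per customer region). The sales figures and region names are invented for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

sales = [200.0, 350.0, 500.0]            # illustrative figures
print(min_max_normalize(sales))          # [0.0, 0.5, 1.0]

# Aggregation: total sales per customer region (made-up rows).
rows = [("East", 200.0), ("West", 350.0), ("East", 500.0)]
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
print(totals)                            # {'East': 700.0, 'West': 350.0}
```

Note that min-max scaling assumes the minimum and maximum of the attribute are known and stable; new out-of-range values would fall outside [0.0, 1.0].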
Data reduction: Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction: data aggregation (e.g., building a data cube); attribute subset selection (e.g., removing irrelevant attributes through correlation analysis); dimensionality reduction (e.g., using encoding schemes such as minimum-length encoding or wavelets); numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models); and generalization with the use of concept hierarchies, organizing the concepts into varying levels of abstraction. Data discretization is very useful for the automatic generation of concept hierarchies from numerical data.
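Discretization, mentioned above as a route to concept hierarchies, can be sketched with equal-width binning, one simple strategy among many (equal-frequency binning, entropy-based methods, etc.). The sketch maps numeric values to interval labels, which could then serve as one level of a concept hierarchy such as age groups; the values and the choice of three bins are illustrative.

```python
def equal_width_bins(values, k):
    """Discretize numeric values into k equal-width intervals,
    returning an interval label for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        b = min(int((v - lo) / width), k - 1)  # clamp the max into the last bin
        labels.append(f"[{lo + b * width:g}, {lo + (b + 1) * width:g})")
    return labels

ages = [5, 17, 23, 40, 61, 78]  # illustrative ages
print(equal_width_bins(ages, 3))
```

The raw ages are reduced to three coarse interval labels, a smaller representation that still supports mining at a higher level of abstraction.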
UNIT IV ASSOCIATION RULE MINING AND CLASSIFICATION

1. Frequent Pattern Analysis

A frequent pattern is a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. Frequent pattern mining was first proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data. What products were often purchased together (beer and diapers?), what are the subsequent purchases after buying a PC, what kinds of DNA are sensitive to a new drug, can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Frequent Pattern Mining Important?
It discloses an intrinsic and important property of data sets, and it forms the foundation for many essential data mining tasks: association, correlation, and causality analysis; sequential and structural (e.g., sub-graph) patterns; pattern analysis in spatiotemporal, multimedia, time-series, and stream data; classification (associative classification); cluster analysis (frequent-pattern-based clustering); data warehousing (iceberg cubes and cube gradients); semantic data compression (fascicles); and other broad applications.

Basic Concepts: Frequent Patterns and Association Rules

2. Constraint-Based (Query-Directed) Mining

Finding all the patterns in a database autonomously is unrealistic: the patterns could be too many but not focused. Data mining should instead be an interactive process, in which the user directs what is to be mined using a data mining query language (or a graphical user interface). Constraint-based mining offers user flexibility (the user provides constraints on what is to be mined) and system optimization (the system exploits such constraints for efficient mining).

Constraints in Data Mining
Knowledge type constraint:
classification, association, etc.
Data constraint (using SQL-like queries): e.g., find product pairs sold together in stores in Chicago in Dec. '02.
Dimension/level constraint: e.g., in relevance to region, price, brand, customer category.
Rule (or pattern) constraint: e.g., small sales (price < $10) trigger big sales (sum > $200).
Interestingness constraint: e.g., strong rules with min_support 3% and min_confidence 60%.

Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-based search/reasoning: both aim at reducing the search space, but constrained mining finds all patterns satisfying the constraints, whereas constraint-based search in AI finds some (or one) answer; the former pushes constraints into the mining process, the latter uses heuristic search. How to integrate the two is an interesting research problem.
Constrained mining vs. query processing in a DBMS: database query processing also requires finding all answers, and constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing.

The Apriori Algorithm: Example

3. Decision Tree Induction

Information produced by data mining techniques can be represented in many different ways. Decision tree structures are a common way to organize classification schemes. In classification tasks, decision trees visualize the steps taken to arrive at a classification. Every decision tree begins with what is termed a root node, considered to be the "parent" of every other node. Each node in the tree evaluates an attribute in the data and determines which path it should follow. Typically, the decision test is based on comparing a value against some constant. Classification using a decision tree is performed by routing from the root node until a leaf node is reached. The illustration provided here is a canonical example in data mining, involving the decision to play or not to play based on weather conditions. In this case, outlook is in the position of the root node.
The edges out of a node correspond to attribute values. In this example, the child nodes are tests of humidity and windy, leading to the leaf nodes, which are the actual classifications. The example also includes the corresponding data, also referred to as instances. In our example, there are 9 "play" days and 5 "no play" days.

Decision trees can represent diverse types of data. The simplest and most familiar is numerical data. It is often desirable to organize nominal data as well. Nominal quantities are formally described by a discrete set of symbols. For example, weather can be described in either numeric or nominal fashion: we can quantify the temperature by saying that it is 11 degrees Celsius or 52 degrees Fahrenheit, or we could say that it is cold, cool, mild, warm or hot. The former is an example of numeric data, and the latter is a type of nominal data. More precisely, the cold/cool/mild/warm/hot example is a special type of nominal data, described as ordinal data. Ordinal data carries an implicit assumption of ordered relationships between the values. Continuing with the weather example, we could also have a purely nominal description like sunny, overcast and rainy; these values have no ordering or distance measures.

The type of data organized by a tree is important for understanding how the tree works at the node level. Recalling that each node is effectively a test, numeric data is often evaluated with a simple mathematical inequality; for example, numeric weather data could be tested by checking whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion, in other words, by whether or not it has a particular value. The illustration shows both types of tests. In the weather example, outlook is a nominal data type: the test simply asks which attribute value is present and routes accordingly.
The humidity node reflects a numeric test, with an inequality of less than or equal to 70, or greater than 70.

Decision tree induction algorithms function recursively. First, an attribute must be selected as the root node. In order to create the most efficient (i.e., smallest) tree, the root node must effectively split the data. Each split attempts to pare down a set of instances (the actual data) until they all have the same classification. The best split is the one that provides what is termed the most information gain. Information in this context comes from the concept of entropy in information theory, as developed by Claude Shannon. Although "information" has many meanings, here it has a very specific mathematical meaning relating to certainty in decision making. Ideally, each split in the decision tree should bring us closer to a classification. One way to conceptualize this is to see each step along the tree as removing randomness, or entropy. Information, expressed as a mathematical quantity, reflects this. For example, consider a very simple classification problem that requires creating a decision tree to decide yes or no based on some data. This is exactly the scenario visualized in the decision tree. Each attribute value covers a certain number of yes and no classifications. If there are equal numbers of yeses and noes, then there is a great deal of entropy in that value; in this situation entropy reaches a maximum. Conversely, if there are only yeses or only noes, the entropy is zero, and the attribute value is very useful for making a decision. The entropy of a set S is calculated as entropy(S) = -sum_i p_i log2 p_i, where p_i is the proportion of instances in S belonging to class i; the information gain of a split is the parent's entropy minus the weighted average entropy of the child nodes.

4. Machine Learning

The general problem of machine learning is to search a (usually very large) space of potential hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be labelled or unlabelled.
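The entropy and information-gain calculations described above for decision tree induction can be sketched directly; a 50/50 yes/no split gives the maximum of 1 bit of entropy, a pure node gives 0, and a split into two pure halves recovers the full bit as gain. This is a minimal sketch of the standard formulas, not any particular tool's implementation.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels:
    0 for a pure node, 1 bit for a 50/50 yes/no split."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent, splits):
    """Entropy removed by a split: parent entropy
    minus the weighted entropies of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

print(entropy(["yes"] * 5 + ["no"] * 5))  # 1.0: maximum uncertainty
print(entropy(["yes"] * 10))              # 0.0: a pure node

# Splitting 5 yes / 5 no into two pure halves yields the full 1 bit of gain.
print(information_gain(["yes"] * 5 + ["no"] * 5,
                       [["yes"] * 5, ["no"] * 5]))  # 1.0
```

An induction algorithm would evaluate this gain for every candidate attribute and pick the attribute with the highest value as the next node.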
If labels are given, then the problem is one of supervised learning, in that the true answer is known for a given set of data. If the labels are categorical, then the problem is one of classification, e.g. predicting the species of a flower given petal and sepal measurements. If the labels are real-valued, the problem is one of regression, e.g. predicting property values from crime, pollution, and similar statistics. If labels are not given, then the problem is one of unsupervised learning, and the aim is to characterize the structure of the data, e.g. by identifying groups of examples in the data that are collectively similar to each other and distinct from the other data.

Supervised Learning

Given some examples, we wish to predict certain properties. In the case where a set of examples whose properties have already been characterized is available, the task is to learn the relationship between the two. One common early approach
was to present the examples in turn to a learner. The learner makes a prediction of the property of interest, the correct answer is presented, and the learner adjusts its hypothesis accordingly. This is known as learning with a teacher, or supervised learning. In supervised learning there is necessarily the assumption that the available descriptors are in some way related to the quantity of interest. For instance, suppose that a bank wishes to detect fraudulent credit card transactions. In order to do this, some domain knowledge is required to identify factors that are likely to be indicative of fraudulent use. These may include frequency of usage, amount of transaction, spending patterns, type of business engaging in the transaction, and so forth. These variables are the predictive, or independent, variables x. It would be hoped that these are in some way related to the target, or dependent, variable y. Deciding which variables to use in a model is a very difficult problem in general; this is known as the problem of feature selection and is NP-complete. Many methods exist for choosing the predictive variables; if domain knowledge is available, it can be very useful in this context. Here we assume that at least some of the predictive variables are in fact predictive. Assume, then, that the relationship between x and y is given by their joint probability density.

UNIT V CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING

1. Cluster Analysis

Data clustering is a method in which we make clusters of objects that are somehow similar in characteristics. The criterion for checking the similarity is implementation-dependent. Clustering is often confused with classification, but there is a difference between the two: in classification the objects are assigned to predefined classes, whereas in clustering the classes are also to be defined.
Precisely, data clustering is a technique in which information that is logically similar is physically stored together. In order to increase the efficiency of database systems, the number of disk accesses is to be minimized: in clustering, objects with similar properties are placed in one class of objects, and a single access to the disk makes the entire class available.

1.1 An Example to Illustrate the Idea of Clustering

To elaborate the concept a little, let us take the example of a library. In a library, books on a large variety of topics are available, and they are always kept in the form of clusters: the books that have some kind of similarity among them are placed in one cluster. For example, books on databases are kept on one shelf and books on operating systems are kept in another cupboard, and so on. To further reduce the complexity, the books that cover the same kind of topics are placed on the same shelf, and the shelves and cupboards are labeled with the relevant names. Now when a user wants a book of a specific kind on a specific topic, he or she only has to go to that particular shelf and check for the book, rather than searching the entire library.

2. DEFINITIONS

In this section some frequently used terms are defined.

2.1 Cluster
A cluster is an ordered list of objects which have some common characteristics. The objects belong to an interval [a, b]; in our case [0, 1].

2.2 Distance Between Two Clusters
The distance between two clusters involves some or all elements of the two clusters. The clustering method determines how the distance should be computed.

2.3 Similarity
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents. A typical similarity measure generates values of 0 for documents exhibiting no agreement among the assigned index terms, and 1 when perfect agreement is detected.
Intermediate values are obtained for cases of partial agreement.

2.4 Average Similarity
If the similarity measure is computed for all pairs of documents (Di, Dj) except when i = j, an average value AVERAGE SIMILARITY is obtainable. Specifically, AVERAGE SIMILARITY = CONSTANT * sum of SIMILAR(Di, Dj), where i = 1, 2, ..., n and j = 1, 2, ..., n and i != j.

2.5 Threshold
The lowest possible input value of similarity required to join two objects in one cluster.

2.6 Similarity Matrix
The similarity between objects calculated by the function SIMILAR(Di, Dj), represented in the form of a matrix, is called a similarity matrix.

2.7 Dissimilarity Coefficient
The dissimilarity coefficient of two clusters is defined to be the distance between them. The smaller the value of the dissimilarity coefficient, the more similar the two clusters are.

2.8 Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e., every incoming object's similarity is compared with the initiator. The initiator is called the cluster seed.

3. TYPES OF CLUSTERING METHODS

There are many clustering methods available, and each of them may give a different grouping of a dataset. The choice of a particular method will depend on the type of output desired, the known performance of the method with particular types of data, the hardware and software facilities available, and the size of the dataset. In general, clustering methods may be divided into two categories based on the cluster structure which they produce.

The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap. These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and the less common clumping methods, in which overlap is allowed. Each object is a member of the cluster with which it is most similar; however, the threshold of similarity has to be defined.
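The similarity definitions above can be made concrete with a small sketch. Here documents are represented by sets of index terms and compared with a Jaccard-style measure, which (as 2.3 requires) yields 0 for no agreement and 1 for perfect agreement; the choice of Jaccard similarity and the term sets themselves are assumptions for illustration. Pairwise values form the similarity matrix of 2.6, and averaging them gives the quantity of 2.4.

```python
# Illustrative documents represented by their assigned index terms.
docs = {
    "D1": {"database", "index", "query"},
    "D2": {"database", "query", "sql"},
    "D3": {"neuron", "brain"},
}

def similar(a, b):
    """Jaccard similarity: 0 for no shared terms, 1 for identical term sets."""
    return len(a & b) / len(a | b)

# Similarity matrix over all ordered pairs with i != j.
names = sorted(docs)
matrix = {(i, j): similar(docs[i], docs[j])
          for i in names for j in names if i != j}
print(matrix[("D1", "D2")])  # 2 shared of 4 total terms -> 0.5
print(matrix[("D1", "D3")])  # no agreement -> 0.0

# Average similarity over all pairs i != j.
avg = sum(matrix.values()) / len(matrix)
print(round(avg, 3))
```

A clustering method would then compare entries of this matrix against the threshold of 2.5 to decide which documents join the same cluster.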
The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains. The hierarchical methods can be further divided into agglomerative and divisive methods. In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the un-clustered dataset. The less common divisive methods begin with all objects in a single cluster and, at each of N-1 steps, divide some cluster into two smaller clusters, until each object resides in its own cluster.

Some of the important data clustering methods are described below.

Partitioning Methods
The partitioning methods generally result in a set of M clusters, each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in the cluster. The precise form of this description will depend on the type of object being clustered. Where real-valued data is available, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases. For example, a cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.

Single Pass: A very simple partitioning method, the single pass method creates a partitioned dataset as follows:
1. Make the first object the centroid for the first cluster.
2. For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.
3.
If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re-determine the centroid; otherwise, use the object to initiate a new cluster. If any objects remain to be clustered, return to step 2.
As its name implies, this method requires only one pass through the dataset; the time requirements are typically of order O(N log N) for order O(log N) clusters. This makes it a very efficient clustering method for a serial processor. A disadvantage is that the resulting clusters are not independent of the order in which the documents are processed, with the first clusters formed usually being larger than those created later in the clustering run.

Hierarchical Agglomerative Methods
The hierarchical agglomerative clustering methods are the most commonly used. The construction of a hierarchical agglomerative classification can be achieved by the following general algorithm:
1. Find the 2 closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
Individual methods are characterized by the definition used for identification of the closest pair of points, and by the means used to describe the new cluster when two clusters are merged. There are two general approaches to implementation of this algorithm, the stored matrix approach and the stored data approach, discussed below.
In the stored matrix approach, an N*N matrix containing all pairwise distance values is first created, and updated as new clusters are formed. This approach has at least an O(N^2) time requirement, rising to O(N^3) if a simple serial scan of the dissimilarity matrix is used to identify the points which need to be fused in each agglomeration, a serious limitation for large N.
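The general agglomerative algorithm above can be sketched in a few lines. This version uses the single-link distance (shortest link between any pair of members, anticipating the SLINK method described below) and, purely for the example, stops when two clusters remain rather than continuing to a single cluster; the one-dimensional points in [0, 1] are invented.

```python
# Single-link hierarchical agglomerative clustering, following the general
# algorithm above: repeatedly merge the two closest clusters.
points = [0.05, 0.10, 0.12, 0.80, 0.85]   # illustrative values in [0, 1]

def single_link(a, b):
    """Cluster distance = the shortest link between any pair of members."""
    return min(abs(x - y) for x in a for y in b)

clusters = [[p] for p in points]           # begin with the un-clustered dataset
while len(clusters) > 2:                   # stop at 2 clusters for the example
    pairs = [(single_link(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    _, i, j = min(pairs)                   # the closest pair of clusters
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]                        # merge j into i

print(clusters)  # [[0.05, 0.1, 0.12], [0.8, 0.85]]
```

The naive all-pairs scan inside the loop is exactly the O(N^3) behavior the stored matrix discussion warns about; practical implementations maintain and update a distance matrix instead.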
The stored data approach requires the recalculation of pairwise dissimilarity values for each of the N-1 agglomerations, and its O(N) space requirement is therefore achieved at the expense of an O(N^3) time requirement.
The Single Link Method (SLINK)
The single link method is probably the best known of the hierarchical methods. It operates by joining, at each step, the two most similar objects which are not yet in the same cluster. The name single link thus refers to the joining of pairs of clusters by the single shortest link between them.
The Complete Link Method (CLINK)
The complete link method is similar to the single link method except that it uses the least similar pair between two clusters to determine the inter-cluster similarity (so that every cluster member is more like the furthest member of its own cluster than the furthest item in any other cluster). This method is characterized by small, tightly bound clusters.
The Group Average Method
The group average method relies on the average value of the pairwise similarity within a cluster, rather than the maximum or minimum similarity as with the single link or the complete link methods. Since all objects in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other member of its own cluster than the objects in any other cluster.
Text Based Documents
For text based documents, clusters may be made by considering as the similarity measure the keywords that occur at least some minimum number of times in a document. When a query arrives for a particular word, instead of checking the entire database, only the cluster that has that word in its list of keywords is scanned and the result is returned. The order of the documents in the result depends on the number of times the keyword appears in each document.
APPLICATIONS
Data clustering has an immense number of applications in every field of life.
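The general agglomerative algorithm with the three linkage definitions above (single, complete, group average) can be sketched as follows. This pure-Python version works on 1-D points for brevity; all names are illustrative, not from the text.

```python
# Agglomerative clustering sketch with pluggable linkage (illustrative names).
def dist(a, b):
    return abs(a - b)

def cluster_distance(c1, c2, linkage):
    d = [dist(a, b) for a in c1 for b in c2]
    if linkage == "single":    # SLINK: shortest link between the clusters
        return min(d)
    if linkage == "complete":  # CLINK: least similar (furthest) pair
        return max(d)
    return sum(d) / len(d)     # group average: mean of all pairwise distances

def agglomerate(points, k, linkage="single"):
    """Repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]  # start with each object un-clustered
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

pts = [1.0, 1.2, 1.4, 8.0, 8.3]
print(sorted(map(len, agglomerate(pts, 2, "single"))))    # [2, 3]
print(sorted(map(len, agglomerate(pts, 2, "complete"))))  # [2, 3]
```

The double loop over cluster pairs is exactly the serial scan whose cost the stored matrix discussion above warns about; real implementations keep and update a distance matrix instead of recomputing it.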
One has to cluster many things on the basis of similarity, either consciously or unconsciously, so the history of data clustering is as old as the history of mankind. In the computer field, too, data clustering has its own value; especially in the field of information retrieval, data clustering plays an important role. Some of the applications are listed below.
Similarity Searching in Medical Image Databases
This is a major application of the clustering technique. In order to detect diseases such as tumors, the scanned pictures or x-rays are compared with existing ones and the dissimilarities are recognized. We have clusters of images of different parts of the body; for example, the images of CT scans of the brain are kept in one cluster. To arrange things further, the images in which the right side of the brain is damaged are kept in one sub-cluster. Hierarchical clustering is used: the stored images have already been analyzed, and a record is associated with each image. In this form a large database of images is maintained using hierarchical clustering.
When a new query image arrives, the system first recognizes which particular cluster the image belongs to; then, by similarity matching against a healthy image of that specific cluster, the main damaged or diseased portion is identified. The image is then sent to that specific cluster and matched with all the images in that particular cluster. The image with which the query image has the most similarities is retrieved, and the record associated with that image is also associated with the query image. This means that the disease in the query image has now been detected. Using this technique with sufficiently precise pattern-matching methods, even very fine tumors can be detected. By using clustering, an enormous amount of time in finding the exact match from the database is saved.
Data Mining
Another important application of clustering is in the field of data mining. Data mining is defined as follows.
Definition 1: "Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern recognition technologies as well as statistical and mathematical techniques."
Definition 2: Data mining is a "knowledge discovery process of extracting previously unknown, actionable information from very large databases."
Use of Clustering in Data Mining: Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. This technique supports the development of population segmentation models, such as demographic-based customer segmentation. Additional analyses using standard analytical and other data mining techniques can determine the characteristics of these segments with respect to some desired outcome. For example, the buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
Similarly, a company that sells a variety of products may need to know about the sales of all of its products in order to check which products are selling well and which are lagging. If the system clusters the products with low sales, then only that cluster of products has to be examined, rather than comparing the sales values of all the products. This facilitates the mining process.
Windows NT
Another major application of clustering is in newer versions of Windows NT. Windows NT uses clustering to determine the nodes that are using the same kind of resources and accumulates them into one cluster. This new cluster can then be controlled as one node.
Partitioning methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or very different. There are various other kinds of criteria for judging the quality of partitions. Achieving global optimality in partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods:
1. the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and
2. the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster.
These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and for clustering very large data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are studied in depth later.
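The two requirements above, that each group is non-empty and that each object belongs to exactly one group, can be sketched as a quick validity check. The function and data names below are illustrative, not from the text.

```python
# Sketch: verify the two partitioning requirements (illustrative names).
def is_valid_partition(objects, assignment, k):
    groups = {i: [] for i in range(k)}
    for obj in objects:
        cid = assignment[obj]    # requirement (2): each object maps to
        groups[cid].append(obj)  # exactly one group, since assignment is a map
    # requirement (1): each of the k groups contains at least one object
    return all(len(g) >= 1 for g in groups.values())

objs = ["a", "b", "c", "d"]
print(is_valid_partition(objs, {"a": 0, "b": 0, "c": 1, "d": 1}, 2))  # True
print(is_valid_partition(objs, {"a": 0, "b": 0, "c": 0, "d": 0}, 2))  # False
```

A fuzzy partitioning technique would relax requirement (2) by replacing the single cluster id with a vector of membership degrees per object.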
Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups close to one another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster and successively splits clusters, until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, a major problem of such techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in CURE and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.
Density-based methods: Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes. Other clustering methods have been developed based on the notion of density.
The general idea is to continue growing the given cluster as long as the density (number of objects or data points) in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape. DBSCAN is a typical density-based method that grows clusters according to a density threshold. OPTICS is a density-based method that computes an augmented cluster ordering for automatic and interactive cluster analysis.
Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized space). The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension in the quantized space. STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two clustering algorithms that are both grid-based and density-based.
Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. This also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account, and thus yields robust clustering methods. Model-based clustering methods are studied below.
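The density idea described above can be sketched in a simplified DBSCAN-style procedure: grow a cluster while each point's radius-eps neighborhood holds at least min_pts points, and leave sparse points as noise. This is a rough illustration, not the actual DBSCAN algorithm; it uses 1-D points for brevity, and all names and parameters are illustrative.

```python
# Simplified density-based clustering sketch (illustrative names, 1-D points).
def neighbors(points, i, eps):
    """Indices of all points within radius eps of point i (including i)."""
    return [j for j in range(len(points)) if abs(points[j] - points[i]) <= eps]

def density_cluster(points, eps, min_pts):
    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1         # neighborhood too sparse: mark as noise
            continue               # (may be re-claimed as a border point later)
        labels[i] = cid
        queue = [j for j in seeds if j != i]
        while queue:               # keep growing while the density holds
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid    # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            more = neighbors(points, j, eps)
            if len(more) >= min_pts:
                queue.extend(more)  # dense (core) point: expand further
        cid += 1
    return labels

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
labels = density_cluster(pts, eps=0.5, min_pts=2)
print(labels)  # two dense clusters plus one noise point
```

Unlike the partitioning sketches earlier, nothing here assumes spherical clusters: any chain of mutually dense neighborhoods is grown into one cluster, which is why such methods can discover clusters of arbitrary shape.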
Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes difficult to classify a given algorithm as uniquely belonging to only one clustering method category. Furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. In the following sections, we examine each of the above five clustering methods in detail. We also introduce algorithms that integrate the ideas of several clustering methods. Outlier analysis, which typically involves clustering, is described at the end of this section.
Partitioning Methods
Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the database attributes.
Classical Partitioning Methods: k-means and k-medoids
The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.
Centroid-Based Technique: The k-Means Method
The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's center of gravity.
"How does the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the squared-error criterion is used, defined as
E = sum over i = 1..k of sum over p in Ci of |p - mi|^2
where E is the sum of squared error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional). This criterion tries to make the resulting k clusters as compact and as separate as possible.
The algorithm attempts to determine the k partitions that minimize the squared-error function. It works well when the clusters are compact clouds that are rather well separated from one another. The method is relatively scalable and efficient in processing large data sets because the computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k << n and t << n. The method often terminates at a local optimum.
The k-means method, however, can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The necessity for users to specify k, the number of clusters, in advance can also be seen as a disadvantage. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different size. Moreover, it is sensitive to noise and outlier data points, since a small number of such data can substantially influence the mean value.
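The procedure above can be sketched in a minimal k-means implementation with the squared-error criterion E. As a simplification, initialization here takes the first k objects instead of a random pick; all names are illustrative, not from the text.

```python
# Minimal k-means sketch (illustrative names; first-k initialization
# stands in for the random selection described in the text).
def sq_dist(p, m):
    return sum((a - b) ** 2 for a, b in zip(p, m))

def k_means(points, k, max_iter=100):
    means = [list(p) for p in points[:k]]  # initial cluster means
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each object to its most similar mean
            i = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[i].append(p)
        # compute the new mean for each cluster
        new_means = [
            [sum(col) / len(c) for col in zip(*c)] if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:  # criterion function has converged
            break
        means = new_means
    # squared-error criterion: E = sum_i sum_{p in Ci} |p - mi|^2
    E = sum(sq_dist(p, means[i]) for i, c in enumerate(clusters) for p in c)
    return means, clusters, E

pts = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (9.0, 10.0)]
means, clusters, E = k_means(pts, k=2)
print(sorted(means))  # one mean near (0, 0.5), the other near (9, 9.5)
print(E)              # 1.0 for this toy data
```

Each iteration costs O(nk) distance computations, which over t iterations gives the O(nkt) complexity stated above; the first-k initialization also illustrates why the method can stop at a local optimum, since a different starting pick can converge to a different partition.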
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified into agglomerative and divisive hierarchical clustering, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion. The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. Recent studies have emphasized the integration of hierarchical agglomeration with iterative relocation methods.
Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of inter-cluster similarity.
Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the distance between the two closest clusters being above a certain threshold.
Four widely used measures for distance between clusters are as follows, where |p - p'| is the distance between two objects or points p and p', mi is the mean for cluster Ci, and ni is the number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min |p - p'| over all p in Ci, p' in Cj
Maximum distance: dmax(Ci, Cj) = max |p - p'| over all p in Ci, p' in Cj
Mean distance: dmean(Ci, Cj) = |mi - mj|
Average distance: davg(Ci, Cj) = (1 / (ni * nj)) * sum over p in Ci, p' in Cj of |p - p'|
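The four inter-cluster distance measures just listed can be written out directly. This sketch uses 1-D clusters so that |p - p'| is the ordinary absolute difference; all names are illustrative, not from the text.

```python
# The four inter-cluster distance measures on 1-D clusters (illustrative names).
def d_min(c1, c2):   # minimum distance: closest pair across the two clusters
    return min(abs(p - q) for p in c1 for q in c2)

def d_max(c1, c2):   # maximum distance: furthest pair across the two clusters
    return max(abs(p - q) for p in c1 for q in c2)

def d_mean(c1, c2):  # mean distance: distance between the cluster means
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
    return abs(m1 - m2)

def d_avg(c1, c2):   # average distance: mean of all ni*nj pairwise distances
    return sum(abs(p - q) for p in c1 for q in c2) / (len(c1) * len(c2))

a, b = [1.0, 2.0], [5.0, 9.0]
print(d_min(a, b), d_max(a, b), d_mean(a, b), d_avg(a, b))  # 3.0 8.0 5.5 5.5
```

These correspond to the linkage choices discussed earlier: minimum distance gives single link behavior, maximum distance gives complete link, and average distance gives the group average method.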
DATA MINING APPLICATIONS
Science: Chemistry, Physics, Medicine
o Biochemical analysis
o Remote sensors on a satellite
o Telescopes: star galaxy classification
o Medical image analysis
Bioscience
o Sequence-based analysis
o Protein structure and function prediction
o Protein family classification
o Microarray gene expression
Pharmaceutical companies, Insurance and Health care, Medicine
o Drug development
o Identify successful medical therapies
o Claims analysis, fraudulent behavior
o Medical diagnostic tools
o Predict office visits
Financial Industry, Banks, Businesses, E-commerce
o Stock and investment analysis
o Identify loyal customers vs. risky customers
o Predict customer spending
o Risk management
o Sales forecasting