
Data Warehouse Architecture & Development, Multidimensional Analysis & Data Mining, Data Mining Tools & Techniques.

B.B.Mishra

Databases are built on the idea that data is one of the critical raw materials of the Information Age. Information, which is created from data, becomes the basis for decision making.

Databases

Decision Support Systems


Created to facilitate the decision-making process
So much information exists that it is difficult to extract it all from a traditional database
Need for a more comprehensive data storage facility: the data warehouse
Extract information from data to use as the basis for decision making
Used at all levels of the organization
Tailored to specific business areas
Interactive ad hoc queries to retrieve and display information
Combines historical operational data with business activities

4 Components of DSS
Data Store: the DSS database (business data, business model data, internal and external data)
Data Extraction and Filtering: extract and validate data from the operational database and the external data sources
End-User Query Tool: create queries that access either the operational or the DSS database
End-User Presentation Tools: organize and present the data

Differences with DSS


Operational: stored in a normalized relational database; supports transactions that represent daily operations (not query friendly)

3 Main Differences: time span, granularity, dimensionality

Time Span
Operational: real time; current transactions; short time frame; specific data facts
DSS: historic; long time frame (months/quarters/years); patterns

Granularity

Operational: specific transactions that occur at a given time
DSS: shown at different levels of aggregation, with different summary levels; decompose (drill down) or summarize (roll up)
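The drill-down / roll-up idea can be sketched in plain Python. This is a minimal illustration with hypothetical transaction records, not tied to any particular DSS product:

```python
from collections import defaultdict

# Hypothetical atomic (transaction-level) sales facts: (date, amount)
transactions = [
    ("2007-01-15", 120.0),
    ("2007-01-28", 80.0),
    ("2007-02-03", 200.0),
    ("2007-02-17", 50.0),
]

def roll_up(rows, prefix_len):
    """Summarize facts at a coarser granularity.

    prefix_len=7 groups by YYYY-MM (monthly), prefix_len=4 by YYYY (yearly).
    """
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:prefix_len]] += amount
    return dict(totals)

monthly = roll_up(transactions, 7)  # roll days up into months
yearly = roll_up(transactions, 4)   # roll months up into the year
print(monthly)  # {'2007-01': 200.0, '2007-02': 250.0}
print(yearly)   # {'2007': 450.0}
```

Drilling down is the reverse direction: starting from a yearly total and decomposing it back into the monthly or transaction-level rows that produced it.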

Dimensionality

Dimensionality is the most distinguishing characteristic of DSS data. Operational: represents atomic transactions. DSS: data is related in many ways, developing the larger picture; a multidimensional view of data.

Data Warehouse for Decision Support & OLAP



Putting information technology to work to help the knowledge worker make faster and better decisions:
Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over the last 10 years?

Decision Support
Used to manage and control business Data is historical or point-in-time

Optimized for inquiry rather than update


Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements

DSS Database Requirements


DSS Database Schema
Must support complex, non-normalized data:
Summarized and aggregate data; multiple relationships; redundant data. Queries must extract multidimensional time slices.

Data Extraction and Filtering


DSS databases are created mainly by extracting data from operational databases, combined with data imported from external sources. This creates a need for advanced data extraction and filtering tools that:
Allow batch / scheduled data extraction
Support different types of data sources
Check for inconsistent data against data validation rules
Support advanced data integration and resolve data-formatting conflicts
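A minimal sketch of the validation-rule check during extraction. The rows, field names, and rules here are hypothetical, chosen only to illustrate filtering out inconsistent data before it reaches the DSS database:

```python
# Hypothetical rows pulled from an operational source
raw_rows = [
    {"customer_id": 1, "state": "OR", "amount": "250.00"},
    {"customer_id": 2, "state": "",   "amount": "99.50"},   # missing state
    {"customer_id": 3, "state": "KA", "amount": "-10.00"},  # negative amount
    {"customer_id": 4, "state": "MH", "amount": "480.25"},
]

def is_valid(row):
    """Apply simple consistency rules before loading into the DSS database."""
    if not row["state"]:          # required field must be present
        return False
    try:
        amount = float(row["amount"])
    except ValueError:            # non-numeric amount is inconsistent data
        return False
    return amount >= 0            # business rule: no negative sale amounts

clean = [r for r in raw_rows if is_valid(r)]
rejected = [r for r in raw_rows if not is_valid(r)]
print(len(clean), len(rejected))  # 2 2
```

In a real tool the rejected rows would typically be logged for reconciliation rather than silently dropped.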

DSS Database Requirements


End User Analytical Interface
Must support advanced data modeling and data presentation tools Data analysis tools Query generation Must Allow the User to Navigate through the DSS

Size Requirements
Very large: terabytes of data, requiring advanced hardware (multiple processors, multiple disk arrays, etc.)

Data Warehouse
The DSS-friendly data repository is the data warehouse.
Definition: an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making

Data Warehousing Overview


Data warehousing is open to an almost limitless range of definitions. Simply put, data warehouses store an aggregation of a company's data. Data warehouses are an important asset for organizations seeking to maintain efficiency, profitability, and competitive advantage. Organizations collect data through many sources: online, call center, sales leads, inventory management. The data collected have varying degrees of value and business relevance. As data is collected, it is passed along a 'conveyor belt' called data life cycle management. An organization's data life cycle management policy dictates its data warehousing design and methodology.

Pre-Data Warehouse
The pre-data warehouse zone provides the data for data warehousing. Data warehouse designers determine which data contains business value for insertion.
OLTP databases are where operational data are stored. OLTP databases can reside in transactional software applications such as Enterprise Resource Planning (ERP), supply chain, point of sale, and customer service software. OLTPs are designed for transaction speed and accuracy.
Metadata ensures the sanctity and accuracy of data entering the data lifecycle process: it ensures that data has the right format and relevancy. The commonly used description of metadata is "data about data". Organizations can take preventive action to reduce the cost of the ETL stage by having a sound metadata policy.
Data Cleansing: before data enters the data warehouse, the extraction, transformation, and loading (ETL) process ensures that the data passes the data quality threshold. ETL tools are also responsible for running scheduled tasks that extract data from OLTPs.

Data Repositories
The data warehouse repository is the database that stores active data of business value for an organization. The data warehouse modeling design is optimized for data analysis. There are variants of data warehouses: data marts and ODS. Data marts are not physically any different from data warehouses; they can be thought of as smaller data warehouses built on a departmental rather than a company-wide level. A data warehouse collects data and is the repository for historical data, so it is not always efficient for providing up-to-date analysis. This is where operational data stores (ODS) come in. ODS are used to hold recent data before migration to the data warehouse, and to hold data with a deeper history than OLTPs. Keeping large amounts of data in OLTPs can tie down computer resources and slow down processing: imagine waiting at the ATM for 10 minutes between the prompts for inputs.

Front-End Analysis
The last and most critical portion of data warehousing is the set of front-end applications that business users use to interact with data stored in the repositories.
Data mining is the discovery of useful patterns in data. It is used for predictive analysis and classification, e.g., what is the likelihood that a customer will migrate to a competitor?
OLAP (online analytical processing) is used to analyze historical data and slice out the business information required. OLAP tools are often used by marketing managers. A slice of data useful to a marketing manager might be: how many customers between the ages of 24 and 45, living in Orissa, buy over Rs. 2000 worth of groceries a month?
Reporting tools provide reports on the data. Data are displayed to show business relevancy and to track key performance indicators (KPIs).
Data visualization tools display data from the data repository and are often combined with data mining and OLAP tools; they allow the user to manipulate data to show relevancy and patterns.
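The marketing-manager slice mentioned above (customers aged 24 to 45, living in Orissa, buying over Rs. 2000 of groceries a month) can be sketched as a simple filter. The customer records below are hypothetical:

```python
# Hypothetical customer dimension records
customers = [
    {"name": "A", "age": 30, "state": "Orissa", "monthly_groceries": 2500},
    {"name": "B", "age": 50, "state": "Orissa", "monthly_groceries": 3000},
    {"name": "C", "age": 40, "state": "Kerala", "monthly_groceries": 2200},
    {"name": "D", "age": 25, "state": "Orissa", "monthly_groceries": 1800},
    {"name": "E", "age": 44, "state": "Orissa", "monthly_groceries": 2100},
]

# The OLAP "slice": restrict each dimension to the range of interest
slice_ = [
    c for c in customers
    if 24 <= c["age"] <= 45
    and c["state"] == "Orissa"
    and c["monthly_groceries"] > 2000
]
print([c["name"] for c in slice_])  # ['A', 'E']
```

A real OLAP engine answers the same question against pre-aggregated cube cells rather than scanning raw rows, but the dimensional restriction is the same idea.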

Data Warehouse
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon: 1. Subject oriented 2. Integrated 3. Nonvolatile 4. Time variant

Very Large Data Bases


WAREHOUSES ARE VERY LARGE DATABASES
Terabytes (10^12 bytes): Wal-Mart, 24 terabytes
Petabytes (10^15 bytes): geographic information systems
Exabytes (10^18 bytes): national medical records
Zettabytes (10^21 bytes): weather images
Yottabytes (10^24 bytes): intelligence agency videos

Complexities of Creating a Data Warehouse


Incomplete errors: missing fields; records or fields that, by design, are not being recorded
Incorrect errors: wrong calculations or aggregations; duplicate records; wrong information entered into the source system

Success & Future of Data Warehouse


The data warehouse has successfully supported the increased needs of the State over the past eight years. The need for growth continues, however, as the desire for more integrated data increases.

The Data Warehouse has software and tools in place to provide the functionality needed to support new enterprise Data Warehouse projects.
The future capabilities of the Data Warehouse can be expanded to include other programs and agencies.

Data Warehousing

OLAP Design Cycle


Requirement Analysis

Conceptual Design (implementation independent)

Logical + Physical Design (e.g., product specific)

Implementation

Using the Data Warehouse

Data Warehouse Architecture


Data Analysis: reporting, OLAP, data mining

Data Storage: repository

Data Migration: middleware (population tools)

Operational Data Sources

Contrasting OLTP and Data Warehousing Environments


One major difference between the two types of systems is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments. Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
Workload: data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so it should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations; your applications might be specifically tuned or designed to support only these operations.
Data modifications: a data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques; the end users of a data warehouse do not directly update it. In OLTP systems, end users routinely issue individual data modification statements to the database, so the OLTP database is always up to date and reflects the current state of each business transaction.

Contrasting OLTP and Data Warehousing Environments


Schema design: data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and to guarantee data consistency.
Typical operations: a typical data warehouse query scans thousands or millions of rows; for example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records; for example, "Retrieve the current order for this customer."
Historical data: data warehouses usually store many months or years of data, to support historical analysis. OLTP systems usually store data from only a few weeks or months; the OLTP system stores historical data only as needed to successfully meet the requirements of the current transaction.
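The contrast in typical operations can be sketched on the same (hypothetical) order data: the warehouse query scans and aggregates many rows, while the OLTP operation fetches one record by key:

```python
# Hypothetical order rows shared by both workloads
orders = [
    {"order_id": 1, "customer": "A", "month": "2007-03", "total": 100.0},
    {"order_id": 2, "customer": "B", "month": "2007-03", "total": 250.0},
    {"order_id": 3, "customer": "A", "month": "2007-04", "total": 75.0},
]

# Typical warehouse query: scan many rows, aggregate.
# ("Find the total sales for all customers last month.")
march_sales = sum(o["total"] for o in orders if o["month"] == "2007-03")

# Typical OLTP operation: fetch a handful of records by key.
# ("Retrieve the current order for this customer.")
current_order = next(o for o in orders if o["order_id"] == 3)

print(march_sales)             # 350.0
print(current_order["total"])  # 75.0
```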

Data Warehousing Architectures

Three parts of the data warehouse


The data warehouse itself, which contains the data and associated software
Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse
Client (front-end) software that allows users to access and analyze data from the warehouse

Two Tier Data Architecture

Three Tier Data Architecture

Web Based Data Architecture

Data Mart Architecture

Enterprise Data Architecture

Data Warehousing Architectures


Ten factors that potentially affect the architecture selection decision:
1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors

Evolution architecture of data warehouse

Top-Down Architecture Bottom-Up Architecture


Enterprise Data Mart Architecture
Data Stage/Data Mart Architecture


Data Warehouse Architectures


Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
Data Warehouse Architecture (Basic) Data Warehouse Architecture (with a Staging Area) Data Warehouse Architecture (with a Staging Area and Data Marts)

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process. It is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.

Data Warehouse Architecture (Basic)


In the basic architecture for a data warehouse, end users directly access data derived from several source systems through the data warehouse.
The metadata and raw data of a traditional OLTP system are present, as is an additional type of data: summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance; for example, a typical data warehouse query retrieves something like March sales.

Data Warehouse Architecture (with a Staging Area)


You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

Data Warehouse Architecture (with a Staging Area and Data Marts)


Although this architecture is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business: for example, purchasing, sales, and inventories might be separated. A financial analyst could then analyze historical data for purchases and sales alone.

Data Integration and the Extraction, Transformation, and Load (ETL) Process
Data integration
Integration that comprises three major processes: data access, data federation, and change capture. When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments

Extraction, transformation, and load (ETL)


A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse)
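The three ETL steps just defined can be sketched end to end. This is a minimal illustration with hypothetical source rows and a hypothetical target schema, not any vendor's ETL tool:

```python
def extract(source):
    """Extraction: read rows from the (simulated) operational database."""
    return list(source)

def transform(rows):
    """Transformation: convert the extracted rows into the warehouse's form
    (normalize names, derive a month key, fix the amount's type)."""
    out = []
    for r in rows:
        out.append({
            "customer": r["cust_name"].strip().upper(),
            "month": r["sale_date"][:7],
            "amount": float(r["amount"]),
        })
    return out

def load(rows, warehouse):
    """Load: put the transformed rows into the warehouse table."""
    warehouse.extend(rows)

operational_db = [
    {"cust_name": " acme ", "sale_date": "2007-03-12", "amount": "100.50"},
    {"cust_name": "Globex", "sale_date": "2007-03-20", "amount": "49.50"},
]
warehouse = []
load(transform(extract(operational_db)), warehouse)
print(warehouse[0])  # {'customer': 'ACME', 'month': '2007-03', 'amount': 100.5}
```

Real ETL solutions add scheduling, incremental capture, and error handling around this same extract, transform, load skeleton.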

Several issues affect whether an organization will purchase data transformation tools or build the transformation process itself:
Data transformation tools are expensive
Data transformation tools may have a long learning curve
It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools

The ETL Process

Important criteria in selecting an ETL tool


Ability to read from and write to an unlimited number of data source architectures Automatic capturing and delivery of metadata A history of conforming to open standards An easy-to-use interface for the developer and the functional user

Data Warehouse Development


Direct benefits of a data warehouse
Allows end users to perform extensive analysis
Allows a consolidated view of corporate data
Better and more timely information
Enhanced system performance
Simplification of data access

Indirect benefits result from end users using these direct benefits
Enhance business knowledge
Present competitive advantage
Enhance customer service and satisfaction
Facilitate decision making
Help in reforming business processes

Data warehouse vendors: six guidelines to be considered when developing a vendor list:
Financial strength; ERP linkages; qualified consultants; market share; industry experience; established partnerships

Data warehouse development approaches


Inmon Model: EDW approach Kimball Model: Data mart approach

Which model is best?


There is no one-size-fits-all strategy to data warehousing One alternative is the hosted warehouse

Data warehouse structure: the star schema

Dimensional modeling
A retrieval-based system that supports high-volume query access

Dimension tables
A table that addresses how data will be analyzed

Grain
A definition of the highest level of detail that is supported in a data warehouse

Drill-down
The process of probing beyond a summarized value to investigate each of the detail transactions that comprise the summary
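A star schema, its grain, and drill-down can be sketched together: a fact table keyed to a dimension table, a roll-up to a summarized value, and the detail transactions behind that summary. Tables and keys here are hypothetical:

```python
# Dimension table: describes how facts will be analyzed
dim_product = {
    10: {"name": "Soap", "category": "Household"},
    20: {"name": "Rice", "category": "Grocery"},
}

# Fact table; grain: one row per product per day
fact_sales = [
    {"product_key": 10, "date": "2007-01-05", "amount": 30.0},
    {"product_key": 20, "date": "2007-01-05", "amount": 120.0},
    {"product_key": 10, "date": "2007-01-06", "amount": 45.0},
]

# Summarized value (roll-up): total sales per product category
totals = {}
for f in fact_sales:
    cat = dim_product[f["product_key"]]["category"]
    totals[cat] = totals.get(cat, 0.0) + f["amount"]

# Drill-down: probe beyond the "Household" summary to its detail transactions
household_detail = [f for f in fact_sales
                    if dim_product[f["product_key"]]["category"] == "Household"]

print(totals)                 # {'Household': 75.0, 'Grocery': 120.0}
print(len(household_detail))  # 2
```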

Data warehousing implementation issues


Implementing a data warehouse is generally a massive effort that must be planned and executed according to established methods There are many facets to the project lifecycle, and no single person can be an expert in each area

Eleven major tasks that could be performed in parallel for successful implementation of a data warehouse (Solomon, 2005) :
1. Establishment of service-level agreements and data-refresh requirements
2. Identification of data sources and their governance policies
3. Data quality planning
4. Data model design
5. ETL tool selection
6. Relational database software and platform selection
7. Data transport
8. Data conversion
9. Reconciliation process
10. Purge and archive planning
11. End-user support

Some best practices for implementing a data warehouse (Weir, 2002):

The project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives, managers, and users
It is important to manage user expectations about the completed project
The data warehouse must be built incrementally
Build in adaptability
The project must be managed by both IT and business professionals
Develop a business/supplier relationship
Only load data that have been cleansed and are of a quality understood by the organization
Do not overlook training requirements
Be politically aware

Failure factors in data warehouse projects:


Cultural issues being ignored
Inappropriate architecture
Unclear business objectives
Missing information
Unrealistic expectations
Low levels of data summarization
Low data quality

Issues to consider to build a successful data warehouse:


Starting with the wrong sponsorship chain Setting expectations that you cannot meet and frustrating executives at the moment of truth Engaging in politically naive behavior Loading the warehouse with information just because it is available

Issues to consider to build a successful data warehouse:
Believing that data warehousing database design is the same as transactional database design
Choosing a data warehouse manager who is technology oriented rather than user oriented
Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video
Delivering data with overlapping and confusing definitions
Believing promises of performance, capacity, and scalability
Believing that your problems are over when the data warehouse is up and running
Focusing on ad hoc data mining and periodic reporting instead of alerts

Implementation factors that can be categorized into three criteria: organizational issues, project issues, and technical issues

User participation in the development of data and access modeling is a critical success factor in data warehouse development

Massive data warehouses and scalability: the main issues pertaining to scalability are the amount of data in the warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and the complexity of user queries

Good scalability means that queries and other data-access functions will grow linearly with the size of the warehouse

Data Warehousing Tools


Data Warehouse
SQL Server 2000 DTS Oracle 8i Warehouse Builder

OLAP tools
SQL Server Analysis Services Oracle Express Server

Reporting tools
MS Excel Pivot Chart VB Applications

Data Mining: On What Kind of Data?


Relational databases Data warehouses Transactional databases Advanced DB and information repositories
Object-oriented and object-relational databases Spatial and temporal data Time-series data and stream data Text databases and multimedia databases Heterogeneous and legacy databases WWW

What Is Data Mining?


Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

Alternative names:
Data mining: a misnomer? Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What is not data mining?


(Deductive) query processing. Expert systems or small ML/statistical programs

Multidimensional Analysis And Data Mining


Databases contain information in a series of two-dimensional tables. In a data warehouse or data mart, information is multidimensional: it contains layers of columns and rows.
Dimension: a particular attribute of information
Cube: the common term for the representation of multidimensional information
Data mining: the process of analyzing data to extract information not offered by the raw data alone
To perform data mining, users need data-mining tools. Data-mining tools use a variety of techniques to find patterns and relationships in large volumes of information and infer rules from them that predict future behavior and guide decision making. They include query tools, reporting tools, multidimensional analysis tools, statistical tools, and intelligent agents.

INFORMATION CLEANSING OR SCRUBBING


An organization must maintain high-quality data in the data warehouse. Information cleansing or scrubbing: a process that weeds out and fixes or discards inconsistent, incorrect, or incomplete information.
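A minimal cleansing sketch: discard incomplete records and reconcile inconsistent codes. The records and the country-code mapping are hypothetical:

```python
# Hypothetical raw records with the three classic problems:
# inconsistent codes, stray whitespace, and a missing required field
raw = [
    {"name": "  Alice ", "country": "USA"},
    {"name": "Bob",      "country": "United States"},  # inconsistent code
    {"name": "",         "country": "USA"},            # incomplete: discard
]

COUNTRY_MAP = {"United States": "USA"}  # reconcile inconsistent representations

def scrub(rows):
    out = []
    for r in rows:
        name = r["name"].strip()        # fix stray whitespace
        if not name:
            continue                    # discard incomplete record
        country = COUNTRY_MAP.get(r["country"], r["country"])
        out.append({"name": name, "country": country})
    return out

cleaned = scrub(raw)
print(cleaned)
# [{'name': 'Alice', 'country': 'USA'}, {'name': 'Bob', 'country': 'USA'}]
```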

Data Warehouses Are Multidimensional


A Multidimensional Data Warehouse with Information from Multiple Operational Databases

Multidimensionality
[Figure: a matrix with a Time dimension and a Customer dimension/characteristic; each matrix element holds one or more key figures, e.g., Revenue: 2,000,000]
Data warehouses support multi-dimensional analysis: you can analyze facts at the intersection of any combination of dimensions. Show me the revenue we made from Customer A for Product 1 in January 2007.

Data mining (the analysis step of the knowledge discovery in databases process, or KDD) is a relatively young and interdisciplinary field of computer science. It is the process of discovering new patterns from large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from a data set in a human-understandable structure. Beyond the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data mining

Data mining involves six common classes of tasks:


Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting or might be data errors requiring further investigation.
Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
Regression: attempts to find a function that models the data with the least error.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
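The market basket analysis task above can be sketched with simple pair counting. The baskets are hypothetical, and a real association-rule learner (e.g., Apriori) would also apply support and confidence thresholds rather than just counting:

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer baskets (sets of products bought together)
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count how often each pair of products co-occurs in a basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'milk'), 3)]
```

The most frequent pair (bread and milk, bought together in 3 of 4 baskets) is exactly the kind of relationship a supermarket would exploit for shelf placement or promotions.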

Evolution of Database Technology


1960s:
Data collection, database creation, IMS and network DBMS

1970s:
Relational data model, relational DBMS implementation

1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s-2000s:
Data mining and data warehousing, multimedia databases, and Web databases

Data Mining works with Warehouse Data


Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence



We want to know ...


Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? If I raise the price of my product by Rs. 2, what is the effect on my ROI? If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information

Application Areas
Industry: Application
Finance: credit card analysis
Insurance: claims and fraud analysis
Telecommunication: call record analysis
Transport: logistics management
Consumer goods: promotion analysis
Data service providers: value-added data
Utilities: power usage analysis

Data Mining in Use



Many governments all over the world use data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Warranty claims routing
Holding on to good customers
Weeding out bad customers

What makes data mining possible?

Advances in the following areas are making data mining deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools
the advent of new data mining techniques
(Gartner Group)

Why Separate Data Warehouse?



Performance: operational databases are designed and tuned for known transactions and workloads, and complex OLAP queries would degrade performance for operational transactions. Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Function

Missing data: decision support requires historical data, which operational databases do not typically maintain. Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases and external sources. Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

Why Data Mining? Potential Applications


Database analysis and decision support
Market analysis and management

target marketing, customer relation management, market basket analysis, cross selling, market segmentation
Risk analysis and management

Forecasting, customer retention, improved underwriting, quality control, competitive analysis


Fraud detection and management

Other Applications
Text mining (newsgroups, email, documents), stream data mining, Web mining, DNA data analysis

Market Analysis and Management


Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.

Determine customer purchasing patterns over time


Conversion of single to a joint bank account: marriage, etc.

Cross-market analysis
Associations/co-relations between product sales Prediction based on the association information

Customer profiling
data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements


Identifying the best products for different customers
Use prediction to find what factors will attract new customers

Provides summary information


Various multidimensional summary reports
Statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management


Finance planning and asset evaluation
Cash flow analysis and prediction
Contingent claim analysis to evaluate assets
Cross-sectional and time-series analysis (financial-ratio, trend analysis, etc.)

Resource planning:
summarize and compare the resources and spending

Competition:
Monitor competitors and market directions
Group customers into classes and a class-based pricing procedure
Set pricing strategy in a highly competitive market

Fraud Detection and Management


Detecting telephone fraud
Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
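The "deviate from an expected norm" idea can be sketched with a simple statistical model: build a norm (mean and standard deviation) from a customer's history and flag observations far outside it. The call durations and the 3-sigma threshold are hypothetical; production fraud systems use richer models over destination, duration, and time of day:

```python
import statistics

# Hypothetical call durations (minutes) for one customer; the last call is odd
call_durations = [3.1, 2.8, 3.4, 3.0, 2.9, 3.2, 45.0]

# Build the "expected norm" model from the historical calls (all but the last)
mean = statistics.mean(call_durations[:-1])
stdev = statistics.stdev(call_durations[:-1])

def is_anomalous(duration, threshold=3.0):
    """Flag a call whose duration deviates more than `threshold` standard
    deviations from the customer's expected norm."""
    return abs(duration - mean) / stdev > threshold

flagged = [d for d in call_durations if is_anomalous(d)]
print(flagged)  # [45.0]
```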

Retail
Analysts estimate that 38% of retail shrink is due to dishonest employees.

Applications
widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Approach
use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

Examples
auto insurance: detect a group of people who stage accidents to collect on insurance
money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and rings of doctors and references

Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat

Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid


IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process


Data mining is the core of the knowledge discovery process:

Databases -> (data cleaning, data integration) -> Data Warehouse -> (selection of task-relevant data) -> Data Mining -> Pattern Evaluation -> Knowledge

Steps of a KDD Process


Learning the application domain:
relevant prior knowledge and goals of application

Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining


summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
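The KDD steps above (selection, cleaning, mining, evaluation) can be sketched end-to-end on a toy transaction log; the records, field names, and support threshold are all illustrative assumptions:

```python
from itertools import combinations
from collections import Counter

# Toy transaction log; records and field names are invented for illustration.
raw = [
    {"id": 1, "items": ["milk", "bread", "butter"], "valid": True},
    {"id": 2, "items": ["milk", "bread"], "valid": True},
    {"id": 3, "items": [], "valid": False},          # dirty record
    {"id": 4, "items": ["bread", "butter"], "valid": True},
    {"id": 5, "items": ["milk", "bread", "butter"], "valid": True},
]

# Selection + cleaning: keep only valid, non-empty transactions.
target = [r["items"] for r in raw if r["valid"] and r["items"]]

# Mining: count co-occurring item pairs (a crude association search).
pairs = Counter()
for basket in target:
    pairs.update(combinations(sorted(basket), 2))

# Pattern evaluation: keep pairs appearing in at least half the transactions.
min_support = len(target) / 2
patterns = {p: n for p, n in pairs.items() if n >= min_support}
print(patterns)
```

Each stage maps to a KDD step; a real pipeline would swap the pair counter for a proper mining function (classification, clustering, association rules, etc.).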

Data Mining Tools


Different types of data mining tools are available in the marketplace, each with its own strengths and weaknesses. Internal auditors need to be aware of the different kinds of data mining tools available and recommend the purchase of a tool that matches the organization's current detection needs. This should be considered as early as possible in the project's lifecycle, perhaps even in the feasibility study. Most data mining tools can be classified into one of three categories: traditional data mining tools, dashboards, and text-mining tools. Each is described below.

Data Mining Tools


Traditional Data Mining Tools. Traditional data mining programs help companies establish data patterns and trends by using a number of complex algorithms and techniques. While some concentrate on one database type, most can handle any data using online analytical processing or a similar technology.

Dashboards. Installed on computers to monitor information in a database, dashboards reflect data changes and updates onscreen, often in the form of a chart or table, enabling the user to see how the business is performing. Historical data can also be referenced, enabling the user to see where things have changed (e.g., an increase in sales over the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.

Data Mining Tools


Text-mining Tools. The third type of data mining tool sometimes is called a text-mining tool because of its ability to mine data from different kinds of text. Scanned content can be unstructured (i.e., information is scattered almost randomly across the document, including emails, Internet pages, audio and video data) or structured (i.e., the data's form and purpose is known, such as content found in a database). Capturing these inputs can provide organizations with a wealth of information that can be mined to discover trends, concepts, and attitudes.

Data Mining Techniques & Their Application


In addition to using a particular data mining tool, internal auditors can choose from a variety of data mining techniques. The most commonly used techniques include artificial neural networks, decision trees, and the nearest-neighbor method. Each technique analyzes data in a different way:

Artificial neural networks are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of that power comes at the expense of ease of use and deployment. One area where auditors can readily apply them is in reviewing records to identify fraud and fraud-like actions. Because of their complexity, they are better employed in situations where they can be used and reused, such as reviewing credit card transactions every month to check for anomalies.
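At its very smallest, the neural-network idea is a single trained neuron; the transaction features (amount, velocity), labels, and learning settings below are invented for illustration and stand in for a real multi-layer network:

```python
import math

# A single-neuron "network" trained by gradient descent to flag risky
# transactions. Feature vector: (amount in $1000s, transactions in past hour).
# All data points and labels are made up for illustration.
data = [
    ((0.1, 1), 0), ((0.2, 2), 0), ((0.3, 1), 0), ((0.2, 1), 0),   # routine
    ((5.0, 9), 1), ((4.2, 8), 1), ((6.1, 7), 1), ((5.5, 10), 1),  # risky
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(2000):                      # simple stochastic gradient descent
    for (x1, x2), y in data:
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = p - y                        # gradient of log-loss w.r.t. the logit
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b -= lr * err

def predict(x1, x2):
    return sigmoid(w[0] * x1 + w[1] * x2 + b) > 0.5

print(predict(5.8, 9))    # large, rapid spending
print(predict(0.15, 1))   # routine purchase
```

The monthly-review use case in the text corresponds to retraining on the latest labeled transactions and scoring the new batch with `predict`.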

Data Mining Techniques & Their Application


Decision trees are tree-shaped structures that represent decision sets. These decisions generate rules, which then are used to classify data. Decision trees are the favored technique for building understandable models. Auditors can use them to assess, for example, whether the organization is using an appropriate cost-effective marketing strategy that is based on the assigned value of the customer, such as profit. The nearest-neighbor method classifies dataset records based on similar data in a historical dataset. Auditors can use this approach to define a document that is interesting to them and ask the system to search for similar items.
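The nearest-neighbor method described above can be sketched as a tiny k-NN classifier over a historical dataset; the feature vectors and class labels below are invented for illustration:

```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier; historical records are illustrative.
history = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((8.0, 7.5), "suspect"), ((7.8, 8.2), "suspect"), ((8.5, 7.9), "suspect"),
]

def classify(record, history, k=3):
    """Label `record` by majority vote among its k nearest historical records."""
    nearest = sorted(history, key=lambda h: math.dist(record, h[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((1.1, 0.9), history))   # lands near the "normal" cluster
print(classify((8.1, 8.0), history))   # lands near the "suspect" cluster
```

For the auditor's use case, the "record" would be a vector of features extracted from a document of interest, and the historical dataset would hold previously reviewed documents with their classifications.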
