B.B.Mishra
Databases are developed on the idea that data is one of the critical materials of the Information Age. Information, which is created from data, becomes the basis for decision making.
Databases
4 Components of DSS
Data Store: The DSS database (business data, business model data, internal and external data)
Data Extraction and Filtering: Extract and validate data from the operational database and the external data sources
End-User Query Tool: Create queries that access either the operational or the DSS database
End-User Presentation Tools: Organize and present the data
Time Span
Operational: Real time; current transactions; short time frame; specific data facts
DSS: Historic; long time frame (months/quarters/years); patterns
Granularity
Operational: Specific transactions that occur at a given time
DSS: Shown at different levels of aggregation; different summary levels; decompose (drill down); summarize (roll up)
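A minimal sketch of rolling up and drilling down along a time hierarchy may help make granularity concrete. The transaction data and the function names `roll_up` and `drill_down` are assumptions made for illustration; they do not come from any particular DSS product.

```python
from collections import defaultdict

# Hypothetical atomic sales transactions: (year, quarter, month, amount).
SALES = [
    (2023, "Q1", "Jan", 120.0),
    (2023, "Q1", "Feb", 80.0),
    (2023, "Q2", "Apr", 200.0),
    (2024, "Q1", "Jan", 150.0),
]

def roll_up(rows, level):
    """Summarize amounts at a level of the hierarchy: 1 = year, 2 = quarter, 3 = month."""
    totals = defaultdict(float)
    for row in rows:
        key = row[:level]        # leading part of the time hierarchy
        totals[key] += row[-1]   # the amount is the last field
    return dict(totals)

def drill_down(rows, key):
    """Return the detail rows that sit behind one summarized key."""
    return [row for row in rows if row[:len(key)] == key]

by_year = roll_up(SALES, 1)
by_quarter = roll_up(SALES, 2)
detail_2023_q1 = drill_down(SALES, (2023, "Q1"))
```

Here `by_year` is the coarsest summary, `by_quarter` is one level finer, and `detail_2023_q1` returns the individual transactions behind a single quarterly total.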
Dimensionality
The most distinguishing characteristic of DSS data
Operational: Represents atomic transactions
DSS: Data is related in many ways; develops the larger picture; multidimensional view of data
Putting information technology to work to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
Decision Support
Used to manage and control business
Data is historical or point-in-time
Used by managers and end-users to understand the business and make judgements
Size Requirements
Very large: terabytes
Advanced hardware (multiple processors, multiple disk arrays, etc.)
Data Warehouse
The DSS-friendly data repository is the data warehouse
Definition: An integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making
Pre-Data Warehouse
The pre-data-warehouse zone provides the data for data warehousing. Data warehouse designers determine which data have enough business value to be inserted.
OLTP databases are where operational data are stored. OLTP databases can reside in transactional software applications such as Enterprise Resource Planning (ERP), supply chain, point of sale, and customer service software. OLTP databases are designed for transaction speed and accuracy.
Metadata ensures the integrity and accuracy of data entering the data lifecycle process, and that the data has the right format and relevancy. Organizations can reduce the cost of the ETL stage by having a sound metadata policy. Metadata is commonly described as "data about data".
Data cleansing: Before data enters the data warehouse, the extraction, transformation, and loading (ETL) process cleans the data and ensures that it passes the data quality threshold. ETL processes are also responsible for running scheduled tasks that extract data from OLTP databases.
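The ETL step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `orders` and `fact_orders` tables, their columns, and the quality rules (no missing customer, no negative amount) are hypothetical, and two in-memory SQLite databases stand in for the OLTP source and the warehouse.

```python
import sqlite3

def extract(conn):
    """Extract raw rows from the (hypothetical) operational orders table."""
    return conn.execute("SELECT order_id, customer, amount FROM orders").fetchall()

def transform(rows):
    """Clean rows and drop those that fail the assumed data quality rules."""
    clean = []
    for order_id, customer, amount in rows:
        if customer is None or amount is None or amount < 0:
            continue  # fails the quality threshold, so it is not loaded
        clean.append((order_id, customer.strip().upper(), round(float(amount), 2)))
    return clean

def load(conn, rows):
    """Load cleaned rows into the (hypothetical) warehouse fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

# In-memory stand-ins for the OLTP source and the data warehouse.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 250.0), (2, None, 90.0), (3, "bob", -5.0), (4, "carol", 40.5)],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(oltp)))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
```

Only orders 1 and 4 reach the warehouse; the row with a missing customer and the row with a negative amount are filtered out during the transform step.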
Data Repositories
The data warehouse repository is the database that stores active data of business value for an organization. The data warehouse modeling design is optimized for data analysis.
There are variants of data warehouses: data marts and operational data stores (ODS). Data marts are not physically any different from data warehouses. They can be thought of as smaller data warehouses built at a departmental rather than a company-wide level.
A data warehouse collects data and is the repository for historical data, so it is not always efficient for providing up-to-date analysis. This is where ODS come in. ODS are used to hold recent data before migration to the data warehouse, and to hold data that have a deeper history than OLTP databases. Keeping large amounts of data in OLTP databases can tie down computer resources and slow down processing: imagine waiting at the ATM for 10 minutes between the prompts for inputs.
Front-End Analysis
The last and most critical portion of data warehousing is the set of front-end applications that business users use to interact with data stored in the repositories.
Data mining is the discovery of useful patterns in data. Data mining is used for predictive analysis and classification, e.g. what is the likelihood that a customer will migrate to a competitor?
OLAP, Online Analytical Processing, is used to analyze historical data and slice out the business information required. OLAP tools are often used by marketing managers. A slice of data that is useful to a marketing manager might be: how many customers between the ages of 24 and 45 who live in Orissa buy over Rs. 2000 worth of groceries a month?
Reporting tools are used to provide reports on the data. Data are displayed to show relevancy to the business and to keep track of key performance indicators (KPIs).
Data visualization tools are used to display data from the data repository. Data visualization is often combined with data mining and OLAP tools, and can allow the user to manipulate data to show relevancy and patterns.
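The marketing question above (customers aged 24 to 45, living in Orissa, spending over Rs. 2000 a month on groceries) can be expressed as a simple slice over customer records. The sketch below uses made-up records and a hypothetical `count_slice` helper; a real OLAP engine would answer the same question against a cube or a warehouse table.

```python
# Hypothetical customer records:
# (customer_id, age, state, monthly_grocery_spend_rs)
CUSTOMERS = [
    (1, 30, "Orissa", 2500),
    (2, 50, "Orissa", 3000),
    (3, 28, "Bihar", 2200),
    (4, 45, "Orissa", 2001),
    (5, 24, "Orissa", 1500),
]

def count_slice(customers, min_age, max_age, state, min_spend):
    """Count customers in an age range (inclusive) and state whose spend exceeds min_spend."""
    return sum(
        1
        for _, age, cust_state, spend in customers
        if min_age <= age <= max_age and cust_state == state and spend > min_spend
    )

matching = count_slice(CUSTOMERS, 24, 45, "Orissa", 2000)
```

With this sample data, customers 1 and 4 satisfy all three conditions.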
Data Warehouse
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
1. Subject Oriented
2. Integrated
3. Nonvolatile
4. Time Variant
The Data Warehouse has software and tools in place to provide the functionality needed to support new enterprise Data Warehouse projects.
The future capabilities of the Data Warehouse can be expanded to include other programs and agencies.
Data Warehousing
Implementation
Data Storage
Repository
Data-Migration
Middleware (Populations-Tools)
Top-Down Architecture
Bottom-Up Architecture
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process. A data warehouse is a centralized repository that stores data from multiple information sources and transforms them into a common, multidimensional data model for efficient querying and analysis.
Data Integration and the Extraction, Transformation, and Load (ETL) Process
Data integration
Integration that comprises three major processes: data access, data federation, and change capture. When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments
Issues that affect whether an organization will purchase data transformation tools or build the transformation process itself:
Data transformation tools are expensive
Data transformation tools may have a long learning curve
It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools
Indirect benefits result from end users using these direct benefits
Enhance business knowledge
Present competitive advantage
Enhance customer service and satisfaction
Facilitate decision making
Help in reforming business processes
Data warehouse vendors
Six guidelines to consider when developing a vendor list:
Financial strength
ERP linkages
Qualified consultants
Market share
Industry experience
Established partnerships
Dimension tables
A table that addresses how data will be analyzed
Grain
A definition of the highest level of detail that is supported in a data warehouse
Drill-down
The process of probing beyond a summarized value to investigate each of the detail transactions that comprise the summary
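The three terms above can be illustrated together with a tiny star-schema sketch. Everything here is assumed for the sake of the example: the product dimension, the fact rows, and the helper names. The fact table's grain is taken to be one row per sale.

```python
# Dimension table (hypothetical): product_id -> descriptive attributes
# that say how the facts will be analyzed.
DIM_PRODUCT = {
    1: {"name": "Rice", "category": "Grocery"},
    2: {"name": "Soap", "category": "Household"},
    3: {"name": "Flour", "category": "Grocery"},
}

# Fact table at the grain of one row per sale: (sale_id, product_id, revenue).
FACT_SALES = [
    (101, 1, 500.0),
    (102, 2, 120.0),
    (103, 3, 300.0),
    (104, 1, 200.0),
]

def revenue_by_category(facts, dim):
    """Summarize revenue by the category attribute of the product dimension."""
    totals = {}
    for _, product_id, revenue in facts:
        category = dim[product_id]["category"]
        totals[category] = totals.get(category, 0.0) + revenue
    return totals

def drill_down(facts, dim, category):
    """Return the detail transactions that make up one category's summary value."""
    return [row for row in facts if dim[row[1]]["category"] == category]

summary = revenue_by_category(FACT_SALES, DIM_PRODUCT)
grocery_detail = drill_down(FACT_SALES, DIM_PRODUCT, "Grocery")
```

Drilling down on "Grocery" returns sales 101, 103, and 104, whose revenues add up to the summarized value for that category. Drill-down cannot go below the grain: with one row per sale, there is nothing finer to show.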
Eleven major tasks that could be performed in parallel for successful implementation of a data warehouse (Solomon, 2005):
Establishment of service-level agreements and data-refresh requirements
Identification of data sources and their governance policies
Data quality planning
Data model design
ETL tool selection
Relational database software and platform selection
Data transport
Data conversion
Reconciliation process
Purge and archive planning
End-user support
Project must fit with corporate strategy and business objectives
There must be complete buy-in to the project by executives, managers, and users
It is important to manage user expectations about the completed project
The data warehouse must be built incrementally
Build in adaptability
The project must be managed by both IT and business professionals
Develop a business/supplier relationship
Only load data that have been cleansed and are of a quality understood by the organization
Do not overlook training requirements
Be politically aware
Issues to consider to build a successful data warehouse:
Believing that data warehousing database design is the same as transactional database design
Choosing a data warehouse manager who is technology oriented rather than user oriented
Focusing on traditional internal record-oriented data and ignoring the value of external data and of text, images, and, perhaps, sound and video
Delivering data with overlapping and confusing definitions
Believing promises of performance, capacity, and scalability
Believing that your problems are over when the data warehouse is up and running
Focusing on ad hoc data mining and periodic reporting instead of alerts
User participation in the development of data and access modeling is a critical success factor in data warehouse development
Massive data warehouses and scalability
The main issues pertaining to scalability:
The amount of data in the warehouse
How quickly the warehouse is expected to grow
The number of concurrent users
The complexity of user queries
Good scalability means that queries and other data-access functions will grow linearly with the size of the warehouse
OLAP tools
SQL Server Analysis Services
Oracle Express Server
Reporting tools
MS Excel Pivot Chart
VB Applications
Alternative names:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Multidimensionality
[Figure: multidimensional view of data; recoverable labels: Revenue: 2,000,000; Time]
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young and interdisciplinary field of computer science, is the process of discovering new patterns from large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract knowledge from a data set in a human-understandable structure. Besides the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization, and online updating.
Data mining
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s-2000s:
Data mining and data warehousing, multimedia databases, and Web databases
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry: Application
Finance: Credit card analysis
Insurance: Claims, fraud analysis
Telecommunication: Call record analysis
Transport: Logistics management
Consumer goods: Promotion analysis
Data service providers: Value-added data
Utilities: Power usage analysis
Many governments all over the world use data mining to track fraud
A supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross selling
Warranty claims routing
Holding on to good customers
Weeding out bad customers
Advances in the following areas are making data mining deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools
the advent of new data mining techniques
Gartner Group
Performance
Operational databases are designed and tuned for known transactions and workloads.
Complex OLAP queries would degrade performance for operational transactions.
Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Function
Missing data: Decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
Risk analysis and management
Other Applications
Text mining (news group, email, documents)
Stream data mining
Web mining
DNA data analysis
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
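One common way to find such customer clusters is k-means; the slide does not name a specific algorithm, so k-means is an assumption here. The sketch below uses made-up (income, monthly spend) pairs in thousands of rupees and takes the initial centroids as an explicit argument so the run is deterministic. In practice the features would usually be normalized first.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

# Hypothetical customers: (annual income, monthly spend), both in thousands of rupees.
CUSTOMERS = [(20, 2), (22, 3), (25, 2), (80, 10), (85, 12), (90, 11)]

labels, centers = kmeans(CUSTOMERS, centroids=[CUSTOMERS[0], CUSTOMERS[-1]])
```

The first three customers end up in one cluster and the last three in another; each cluster centre can then be read as a "model customer" profile for targeting.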
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
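Associations between product sales are often quantified with support and confidence, the standard measures in market basket analysis. The baskets below are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of products per transaction.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Fraction of all baskets that contain the item set."""
    return count_containing(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = count_containing(baskets, set(antecedent) | set(consequent))
    return both / count_containing(baskets, antecedent)

bread_butter_support = support(BASKETS, {"bread", "butter"})
bread_to_butter_confidence = confidence(BASKETS, {"bread"}, {"butter"})
```

In this sample, bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain butter. A rule with high confidence can then be used for prediction, for example suggesting butter to a customer who buys bread.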
Customer profiling
data mining can tell you what types of customers buy what products (clustering or classification)
Resource planning:
summarize and compare the resources and spending
Competition:
monitor competitors and market directions
group customers into classes and a class-based pricing procedure
set pricing strategy in a highly competitive market
Retail
Analysts estimate that 38% of retail shrink is due to dishonest employees.
Applications
widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
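As a rough sketch of this approach, the code below builds a very simple model from labeled historical transactions (one centroid per label) and flags a new transaction by its nearest centroid. The features, the data, and the nearest-centroid technique are all assumptions chosen for brevity; production fraud models are considerably more elaborate.

```python
import math

def build_model(history):
    """Build a label -> centroid model from (features, label) pairs of historical data."""
    grouped = {}
    for features, label in history:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(values) / len(rows) for values in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Return the label whose centroid is closest to the new instance."""
    return min(model, key=lambda label: math.dist(features, model[label]))

# Hypothetical history: (amount in thousands, transactions in the last hour), label.
HISTORY = [
    ((1.0, 1), "legit"),
    ((2.0, 1), "legit"),
    ((1.5, 2), "legit"),
    ((9.0, 6), "fraud"),
    ((8.0, 5), "fraud"),
    ((10.0, 7), "fraud"),
]

model = build_model(HISTORY)
suspicious = classify(model, (8.5, 6))
ordinary = classify(model, (1.2, 1))
```

A large transaction following a burst of activity lands near the fraud centroid, while a small isolated one lands near the legitimate centroid.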
Examples
auto insurance: detect a group of people who stage accidents to collect on insurance
money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and ring of doctors and ring of references
Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Data Mining
Task-relevant Data
Selection
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
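The steps of this process can be chained as a small pipeline. The sketch below is only an illustration: the records, the age bands, the choice of frequency counting as the "mining algorithm", and the minimum-support threshold are all assumptions.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "age": 25, "city": "Pune", "product": "tea"},
    {"id": 2, "age": None, "city": "Pune", "product": "tea"},
    {"id": 3, "age": 41, "city": "Delhi", "product": "coffee"},
    {"id": 4, "age": 33, "city": "Pune", "product": "tea"},
    {"id": 5, "age": 58, "city": "Delhi", "product": "tea"},
]

def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Reduction/transformation: replace the exact age with a coarse age band."""
    return [
        {"age_band": "under_40" if r["age"] < 40 else "40_plus", "product": r["product"]}
        for r in records
    ]

def mine(records):
    """Mining: count how often each (age band, product) pair occurs."""
    return Counter((r["age_band"], r["product"]) for r in records)

def evaluate(counts, total, min_support):
    """Pattern evaluation: keep pairs whose relative frequency reaches min_support."""
    return {p: n / total for p, n in counts.items() if n / total >= min_support}

prepared = transform(clean(select(RAW, ["age", "product"])))
patterns = evaluate(mine(prepared), len(prepared), min_support=0.5)
```

One record is dropped during cleaning, and only the pattern "customers under 40 buy tea" survives the evaluation threshold.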