Data Mining and Data Warehousing

K. Keerthi (Y2CS268), keerthi_klce@yahoo.co.in
B. Hima Bindu (L3CS342), bommareddy_1560@yahoo.co.in
Department of Computer Science, Koneru Lakshmaiah College of Engineering, Green Fields, Vaddeswaram, Guntur (Dt), AP.
ABSTRACT
Call data, mail-order addresses, sales histories, POS data, Web transactions, even free-form text notes: if your organization could fully harness and exploit this wealth of information, the potential would be enormous. Data mining is the technology for doing so, and it has been well received by industry experts and users alike since its introduction. Forward-thinking companies today are using data mining to reduce fraud, anticipate resource demand, increase customer acquisition, and curb customer attrition. Data mining reaches across industries and business functions. Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. A data warehouse is "a copy of transaction data specifically structured for query and analysis". Dramatic technological advances are making this vision a reality for many companies, and equally dramatic advances in data analysis software are allowing users to access this data freely. This data analysis software is what supports data mining. In this paper we look at how the concepts of data mining and data warehousing can be used effectively.
INDEX
1. INTRODUCTION
2. FOUR APPROACHES TO DATA MINING
3. DATA MINING TECHNIQUES
4. DATA MINING ON WEB
5. A TAXONOMY OF DATA WAREHOUSE DATA ERRORS
6. ASPECTS OF DATA WAREHOUSE ARCHITECTURE
7. MAINTENANCE ISSUES FOR DATA WAREHOUSING SYSTEMS
8. CONCLUSION
9. BIBLIOGRAPHY
INTRODUCTION
What is data mining?
Data mining is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. More generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

The activities involved in extracting meaningful new information from the data are classification, estimation, prediction, affinity grouping (association rules), clustering, and description and visualization. Classification, estimation, and prediction are examples of directed data mining; the other three are examples of undirected data mining.

Classification: Classification consists of examining the features of a newly presented object and assigning it to a predefined class. The objects to be classified are generally represented by records in a database, and classification updates each record by filling in a field with a class code. Classification deals with discrete outcomes, such as yes or no.

Estimation: Estimation deals with continuously valued outcomes. Given some input data, we use estimation to come up with a value for some unknown continuous variable such as income, height, or credit card balance. In practice, estimation is often used to perform a classification task, for example by estimating a score for each record and then assigning classes according to a threshold.

Prediction: When data mining is used to classify or estimate, we do not usually expect to go back later to see whether the result was correct. The result may be correct or incorrect, but the uncertainty is due only to incomplete knowledge: the facts already exist and, with a certain amount of effort, it is possible to check them. Predictive tasks are different from classification and estimation because the records are classified according to some predicted future behavior or estimated future value. With prediction, the only way to check the accuracy of the classification is to wait and see. Predicting which customers will leave within six months is an example.

Affinity grouping or association rules: The task of affinity grouping is to determine which things go together. The prototypical example is determining what things go together in a shopping cart at a supermarket. Retail chains can use affinity grouping to plan the arrangement of items on store shelves or in a catalog so that items often purchased together will be seen together. Affinity grouping can also be used to identify cross-selling opportunities and to design attractive packages or groupings of products and services.

Clustering: Clustering is the task of segmenting a diverse group into a number of more similar subgroups, or clusters. Unlike classification, clustering does not rely on predefined classes or examples. The records are grouped together on the basis of self-similarity, and it is up to the miner to determine what meaning to attach to the resulting clusters. Clustering is often done as a prelude to some other form of data mining.

Description and visualization: Sometimes the purpose of data mining is simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products, or processes that produced the data. A good description suggests where to start looking for an explanation. Data visualization is one powerful form of descriptive data mining.
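The affinity grouping task described above can be illustrated with a short sketch. It counts how often pairs of items appear in the same basket and reports, for each pair, the support (the share of baskets containing both items) and the confidence (the share of baskets containing the first item that also contain the second). The function name `pair_rules`, the `min_support` threshold, and the sample baskets are assumptions made for illustration; they do not come from a particular data mining product.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore repeated items in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue                         # pair is too rare to be of interest
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
for a, b, support, confidence in pair_rules(baskets):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A retailer would typically look at the high-confidence pairs first when deciding which items to shelve or promote together.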
What is Data Warehousing?

A data warehouse is a copy of transaction data specifically structured for querying and reporting. It can be stored as a relational database, multidimensional database, flat file, hierarchical database, object database, and so on. Data warehouse data often gets changed, and data warehouses often focus on a specific activity or entity. Data warehousing is not necessarily for the needs of "decision makers", nor is it necessarily used in the process of decision making. The overwhelming majority of data warehouse uses are quite mundane, non-decision-making ones rather than grist for decisions with wide-ranging effects (so-called "strategic" decisions). In fact, most data warehouses are used for post-decision monitoring of the effects of decisions.
FOUR APPROACHES TO DATA MINING

Like any technology, data mining requires a great deal of specialized knowledge and experience on the part of its users. There are essentially four ways to bring data mining expertise to bear on a company's business problems and opportunities:
1. By purchasing scores from outside vendors that are related to your business problem.
2. By purchasing software that embodies data mining expertise directed towards a particular application such as credit approval, fraud detection, etc.
3. By hiring outside consultants to perform predictive modeling for you for special projects.
4. By mastering data mining skills within your organization.

Purchasing scores: Scores are a powerful mechanism for reducing complex judgments based on hundreds or thousands of factors to a single number that can be used to assign grades, rank applicants, and even, in the case of IQ scores, attempt to quantify human intelligence. Most predictive models are designed to produce scores.

Purchasing software: Data mining expertise can be embodied in software in one of two ways. The software may embody an actual model, perhaps in the form of a set of decision rules or a fully trained neural network that can be applied directly to a particular problem domain. Or it may embody knowledge of the process of building models appropriate to a particular domain, in the form of a model-creation template.

Purchasing models: Purchasing and applying models developed elsewhere can be an economical choice. The qualification is that such models will work well only to the extent that your products, your customers, and your market conditions match those that were used to develop the models.

Developing in-house expertise: Any business that is serious about converting corporate data into business intelligence should give serious thought to making data mining one of its core competencies by developing in-house expertise. This is especially true for companies that have many products and many customers. By taking control of the data mining process, a company can take full advantage of the fact that, by their actions, its customers are continually teaching it about themselves. Although data mining tools have improved greatly over the last few years, it is still not possible to get good results from data mining without considerable expertise, because the activities that can be automated by a tool form only a small part of the data mining process. Understanding the business problem, selecting relevant data, transforming data to bring the information content to the surface, and interpreting the results are all activities that have not yet been automated and are not likely to be any time soon.
DATA MINING TECHNIQUES

Since data mining is viewed here as a technical subject, let us look at its technical side. The level of understanding needed to make good use of data mining algorithms does not require detailed study of machine learning or statistics.

Different goals call for different techniques: Data mining can be prescriptive or descriptive. This distinction refers to the goal of the data mining exercise. The chief goal of a prescriptive data mining effort is to automate a decision-making process by creating a model capable of making a prediction, assigning a label, or estimating a value. A descriptive effort, by contrast, aims to increase our understanding of what is happening in the data.

Different data types call for different techniques: Data mining algorithms are designed with specific kinds of predictions and specific types of input data in mind. The types of both the input and the output variables should be taken into account when selecting a data mining algorithm.

Here we discuss three data mining techniques: automatic cluster detection, decision trees, and neural networks.

Automatic cluster detection: The algorithm divides a data set into a predetermined number of clusters, k. To form clusters, each record is mapped to a point in record space. The space has as many dimensions as there are fields in the record, and the value of each field is interpreted as a distance from the origin along the corresponding axis. Records are assigned to clusters through an iterative process.

Decision trees: A decision tree cuts the record space into boxes. At each step, a diverse population is split into two subpopulations of greater purity.

Neural networks: A neural network contains a hidden layer made up of hidden units and can be trained using back propagation. The biggest drawback of neural networks is that they cannot explain their results.
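The iterative assignment of records to a predetermined number of clusters can be sketched as follows. The sketch assumes the k-means procedure, which matches the description above but is not named in the text; the function name `k_means`, its parameters, and the sample records are hypothetical. Each record is a tuple of numeric field values, so it is already a point in record space.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Cluster points (tuples of numbers) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # start from k randomly chosen records
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned records.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: leave it in place
        if new_centroids == centroids:
            break                            # assignments have stopped changing
        centroids = new_centroids
    return centroids, labels

# Hypothetical two-field records forming two visibly separate groups.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(records, k=2)
print(labels)
```

The cluster numbers themselves carry no meaning; as noted earlier, it is up to the miner to decide what each resulting group represents. In practice the fields would also need to be scaled to comparable ranges before distances are computed.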
DATA MINING ON WEB

Two goals are predominant for data mining on the Web. The first is to help users find useful information on the Web and discover knowledge about a domain that is represented by a collection of Web documents. The second is to analyze the transactions run in a Web-based system, whether to optimize the system or to find information about the clients using it.

A purely search-centric view misses the point that we might actually want to treat the information on the Web as a large knowledge base from which we can extract new, never-before-encountered information. On the other hand, the results of certain types of text processing can yield tools that indirectly aid the information access process. Examples include text clustering to create thematic overviews of text collections, automatic generation of term associations to aid in query expansion, and co-citation analysis to find general topics within a collection or to identify central Web pages. Aside from providing tools to aid the standard information access process, text data mining can contribute along another dimension: in the future, we expect to see information access systems supplemented with tools for exploratory data analysis.
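As a minimal illustration of generating term associations for query expansion, the sketch below counts how often two terms occur in the same document and suggests the most frequent companions of a query term. The function names, the whitespace tokenization, and the sample documents are assumptions chosen for brevity; a real system would use a proper tokenizer, a stop-word list, and a much larger collection.

```python
from collections import Counter, defaultdict

def term_associations(documents, stop_words=frozenset()):
    """Map each term to a Counter of the terms that share a document with it."""
    associations = defaultdict(Counter)
    for text in documents:
        # Each distinct term counts once per document.
        terms = {w for w in text.lower().split() if w not in stop_words}
        for term in terms:
            for other in terms:
                if other != term:
                    associations[term][other] += 1
    return associations

def expand_query(term, associations, top_n=2):
    """Return up to top_n terms that most often co-occur with the given term."""
    companions = associations.get(term, Counter())
    ranked = sorted(companions.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]

# Hypothetical document collection.
docs = [
    "data warehouse query reporting",
    "data mining clustering",
    "data mining decision trees",
    "web mining clustering",
]
assoc = term_associations(docs)
print(expand_query("mining", assoc))
```

A search for "mining" could then be widened with the suggested terms, which is one simple way such associations can aid the information access process.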
TAXONOMY OF DATA WAREHOUSE DATA ERRORS

If you know that certain errors are likely to exist, you will be more apt to spot them and to plan your project to attack them in a manageable way. The following is a list of common errors, grouped into four categories: incomplete, incorrect, incomprehensible, and inconsistent.

Incomplete errors: These consist of
- Missing records
- Missing fields

Incorrect errors: The data really are incorrect.
- Wrong (but sometimes right) codes
- Wrong calculations or aggregations
- Duplicate records
- Wrong information entered into the source system
- Incorrect pairing of codes

Incomprehensibility errors: These are the types of conditions that make source data difficult to read.
- Multiple fields within one field
- Weird formatting to conserve disk space
- Unknown codes
- Spreadsheets and word processing files

Inconsistency errors: The category of inconsistency errors encompasses the widest range of problems. Obviously, similar data from different systems can easily be inconsistent. However, data within one system can also be inconsistent across locations, reporting units, and time.
- Inconsistent use of different codes
- Inconsistent meaning of a code
- Overlapping codes
- Different codes with the same meaning
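Some of the error categories above can be checked mechanically while loading source data. The following sketch flags missing fields, duplicate records, and unknown codes in a list of records. The function name `audit_records`, the field names, and the list of valid codes are hypothetical; inconsistency errors are left out because detecting them requires comparing several systems, locations, or time periods.

```python
from collections import Counter

def audit_records(records, required_fields, valid_codes, code_field="status"):
    """Return a dict mapping error category to a list of (row_index, detail)."""
    findings = {"incomplete": [], "incorrect": [], "incomprehensible": []}
    seen = Counter()
    for i, rec in enumerate(records):
        # Incomplete: a required field is absent or empty.
        for field in required_fields:
            if rec.get(field) in (None, ""):
                findings["incomplete"].append((i, f"missing {field}"))
        # Incorrect: the same record appears more than once.
        key = tuple(sorted(rec.items()))
        seen[key] += 1
        if seen[key] > 1:
            findings["incorrect"].append((i, "duplicate record"))
        # Incomprehensible: a code that is not in the known code list.
        code = rec.get(code_field)
        if code not in (None, "") and code not in valid_codes:
            findings["incomprehensible"].append((i, f"unknown code {code!r}"))
    return findings

# Hypothetical source rows.
rows = [
    {"id": "1", "name": "Asha", "status": "A"},
    {"id": "2", "name": "", "status": "I"},
    {"id": "1", "name": "Asha", "status": "A"},
    {"id": "3", "name": "Ravi", "status": "Q7"},
]
report = audit_records(rows, required_fields=["id", "name"], valid_codes={"A", "I"})
for category, items in report.items():
    print(category, items)
```

Checks of this kind only find errors that leave a visible trace; a wrong but valid code, for example, still has to be caught by comparing the data with the source system or with the business.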
ASPECTS OF DATA WAREHOUSE ARCHITECTURE

This section lists the different aspects of data warehouse architecture. Architecture is a rather nebulous term. We think of architecture as a system design decision that is usually not easily changed, because of the amount of work, money, and politics involved in doing so.

Data consistency architecture: This is the choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision.

Reporting data store and staging data store architecture: The main reasons we store data in a data warehousing system are so that they can be 1) reported against, 2) cleaned up, and 3) transported to another data store where they can be reported against and/or cleaned up. Determining where we hold data to report against is called the reporting data store architecture. All other data storage decisions are called the staging data store architecture.

Data modeling architecture: This is the choice of whether to use denormalized, normalized, object-oriented, proprietary multidimensional, or other data models. It makes perfect sense for an organization to use a variety of models.

Tool architecture: This is the choice of tools you are going to use for reporting and for infrastructure.

Processing tiers architecture: This is the choice of which physical platforms will do which pieces of the concurrent processing that takes place when using a data warehouse. This architecture can be simple or complicated.

Security architecture: If there is a need to restrict access down to the row or field level, we will have to use some means other than the usual security mechanisms to accomplish this.
MAINTENANCE ISSUES FOR DATA WAREHOUSING SYSTEMS

Another important aspect of data warehousing and decision support systems is their maintenance. The challenge is to learn about business and feeder system changes that will affect the data warehouse/decision support systems. The measures to be taken in maintaining these systems are:
- Figure out if, when, and how to purge data.
- Determine which queries and reports should be written by the IS department and which should be written by users.
- Balance the need for building aggregate structures for processing efficiency with the desire not to build a maintenance nightmare.
- Interactively correct data in the data warehouse and send corrections back to the transaction processing system.
- Figure out how to test the effect of structure changes on queries and reports written by end users.
- Determine how problems with feeder system update processing affect data warehouse/decision support system update processing.
- Rework how security is implemented. Most firms, if their data warehousing systems are used for ad hoc reporting, will find their security schemes are either too loose or too tight.
- Perform euthanasia on (that is, retire) some data warehouse/decision support systems.
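The first and third measures, purging data and building aggregate structures, are often handled together. The sketch below shows one possible policy, assumed here for illustration and not prescribed by the text: detail rows older than a cutoff date are rolled up into monthly totals before they are dropped, so that summary reports remain possible after the purge. The function name, the row layout, and the dates are hypothetical.

```python
from collections import defaultdict
from datetime import date

def purge_with_rollup(detail_rows, cutoff):
    """Keep rows dated on or after cutoff; roll older rows up into monthly totals."""
    kept = []
    monthly_totals = defaultdict(float)
    for row in detail_rows:
        if row["sale_date"] >= cutoff:
            kept.append(row)                 # recent detail stays queryable
        else:
            month = (row["sale_date"].year, row["sale_date"].month)
            monthly_totals[month] += row["amount"]
    return kept, dict(monthly_totals)

# Hypothetical sales detail.
sales = [
    {"sale_date": date(2001, 1, 5), "amount": 100.0},
    {"sale_date": date(2001, 1, 20), "amount": 50.0},
    {"sale_date": date(2001, 2, 3), "amount": 25.0},
    {"sale_date": date(2002, 6, 1), "amount": 10.0},
]
recent, archive = purge_with_rollup(sales, cutoff=date(2002, 1, 1))
print(len(recent), archive)
```

Every aggregate kept this way is one more structure to maintain when the feeder systems change, which is the balance the list above asks for.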
CONCLUSION
The issue of privacy is a reminder that data mining exists in the context of a larger society. When we use data mining for customer relationship management, we are bringing the weight of technology to bear on the challenge of understanding other people. We are trying to predict what their actions are likely to be in the future, learning from what people did in the past to predict what they will need. In principle, this activity is no different from the personal relationships that once permeated a nostalgic past of corner stores, friendly banks, and helpful insurance agents. Data mining is about extending this learning culture to companies that are big enough to reap economies of scale.
Data mining helps a business focus on serving its customers and on providing efficient business processes. The form of the stored data has nothing to do with whether something is a data warehouse. Use of data warehousing systems is optional for their users. This means you have to identify the potential users of the systems, help them understand the benefits of the system, and then make them want to keep coming back to use it. If the data is rich in metadata, you can build your warehouse in a scalable, intelligent environment, so that your efforts are readily adaptable to platform changes, database additions, and changing business requirements. We conclude in the hope that companies will use data mining and data warehousing effectively in order to focus on serving their customers and, in doing so, serve themselves.
BIBLIOGRAPHY
Text book: Mastering Data Mining, by Michael J. A. Berry and Gordon S. Linoff
Website: http://www.dwinfocenter.org/