
Decision Tree Definition

Decision trees are graphs people use to see all the possible outcomes of a decision. Also called "tree diagrams," they help you make a decision.

Function

Decision trees include each decision's chances, resource costs, benefits and drawbacks. They include controllable alternatives and uncontrollable uncertainties. Because the decisions and their outcomes are laid out in one graph, it is easier to see all the consequences and, therefore, make an informed decision based on the diagram.

Fields

Decision trees are sometimes used when teaching certain subjects, such as business, health economics and public health. Decision trees are most commonly used in operations research and specifically in decision analysis. They are also used in data mining and machine learning. In these fields, the decision tree acts as a predictive model, helping decision makers to come to conclusions about an item's target value based on observations. Here, a decision tree is more commonly known as a classification or regression tree. The leaves represent classifications of items and the branches represent features that lead to those classifications.

Features

Decision trees are drawn from left to right, so they can be read like a book. They start with the object or initial decision. Branches lead to decision nodes, which show other decisions to be made and are enclosed in squares; chance nodes, which show chances the person or company will be taking and are enclosed in circles; and end nodes, which illustrate the end result of decisions and are enclosed in triangles.
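To make the three node types concrete, here is a minimal Python sketch of a decision tree evaluated by "rolling back" expected values from the end nodes to the initial decision. The class names, payoffs and probabilities below are invented for illustration; they are not from the original text.

from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class EndNode:                       # triangle: the final result of a path
    payoff: float

@dataclass
class ChanceNode:                    # circle: uncertain outcomes with probabilities
    branches: List[Tuple[float, "Node"]] = field(default_factory=list)

@dataclass
class DecisionNode:                  # square: alternatives the decision maker controls
    options: List[Tuple[str, "Node"]] = field(default_factory=list)

Node = Union[EndNode, ChanceNode, DecisionNode]

def rollback(node: Node) -> float:
    """Expected value of a node, computed from the end nodes back to the root."""
    if isinstance(node, EndNode):
        return node.payoff
    if isinstance(node, ChanceNode):
        return sum(p * rollback(child) for p, child in node.branches)
    return max(rollback(child) for _, child in node.options)   # pick the best alternative

# Launch a product (60% chance of +100, 40% chance of -30) or do nothing (0).
tree = DecisionNode(options=[
    ("launch", ChanceNode(branches=[(0.6, EndNode(100)), (0.4, EndNode(-30))])),
    ("skip", EndNode(0)),
])
print(rollback(tree))   # 48.0, so launching has the higher expected value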

Benefits

Decision trees are easy to interpret once the features are explained. They show decisions, chances, outcomes, costs and other factors, often on a single page. Perhaps most importantly, they can be used in conjunction with other decision-making techniques or programs to determine the best course of action.

Creating Decision Trees

Decision trees are traditionally created by hand, with the words written and the branches and nodes drawn out on a piece of paper. However, there are now decision tree programs that can be downloaded from the Internet, including Tree Plan, a widely used plug-in for Microsoft Excel. These programs allow users to create decision trees much as they would by hand, but the computer prompts users to input decisions, chances and end nodes, and often keeps track of branch and leaf information in a linear fashion.

Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables (the Titanic survival tree below is an example). Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf.

A tree can be "learned" by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of top-down induction of decision trees (TDIDT) [1] is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data, but it is not the only strategy. In fact, some approaches have been developed recently that allow tree induction to be performed in a bottom-up fashion [2].

In data mining, decision trees can also be described as the combination of mathematical and computational techniques to aid the description, categorisation and generalisation of a given set of data. Data comes in records of the form:

(x, Y) = (x1, x2, x3, ..., xk, Y)

The dependent variable, Y, is the target variable that we are trying to understand, classify or generalise. The vector x is composed of the input variables, x1, x2, x3 etc., that are used for that task.
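The following short sketch shows how such records are typically used to learn a tree by recursive partitioning. It assumes the scikit-learn library is available; the tiny data set (age and income as input variables, a yes/no purchase as the target Y) is invented for illustration.

from sklearn.tree import DecisionTreeClassifier

# Each row is one record (x1, x2) = (age, income); Y = 1 if the customer bought, else 0.
X = [[22, 25000], [35, 60000], [47, 82000], [52, 31000], [29, 45000]]
Y = [0, 1, 1, 0, 1]

model = DecisionTreeClassifier(criterion="entropy", max_depth=3)   # recursive partitioning
model.fit(X, Y)
print(model.predict([[40, 70000]]))   # predicted target value for a new record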

A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard). The figures under the leaves show the probability of survival and the percentage of observations in the leaf.
The procedure can be used for:

Segmentation. Identify persons who are likely to be members of a particular group.

Stratification. Assign cases into one of several categories, such as high-, medium- and low-risk groups.

Prediction. Create rules and use them to predict future events, such as the likelihood that someone will default on a loan or the potential resale value of a vehicle or home.

Data reduction and variable screening. Select a useful subset of predictors from a large set of variables for use in building a formal parametric model.

Interaction identification. Identify relationships that pertain only to specific subgroups and specify these in a formal parametric model.

Category merging and discretizing continuous variables. Recode group predictor categories and continuous variables with minimal loss of information.

Example. A bank wants to categorize credit applicants according to whether or not they represent a reasonable credit risk. Based on various factors, including the known credit ratings of past customers, you can build a model to predict if future customers are likely to default on their loans. A tree-based analysis provides some attractive features: it allows you to identify homogeneous groups with high or low risk, and it makes it easy to construct rules for making predictions about individual cases, as sketched below.
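As a hedged illustration of the bank example, the sketch below fits a small tree on invented applicant data and prints its splits as readable if-then rules. It assumes scikit-learn is available; the feature names, values and labels are made up and not from the original text.

from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [late_payments, income, owns_home]; label 1 = defaulted, 0 = repaid (invented data)
X = [[0, 52000, 1], [3, 18000, 0], [1, 41000, 1], [4, 12000, 0],
     [0, 75000, 1], [2, 23000, 0], [5, 30000, 0], [0, 64000, 1]]
y = [0, 1, 0, 1, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["late_payments", "income", "owns_home"]))
# Each printed branch reads as a rule: a threshold test on a feature followed by the predicted class.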

ID3 algorithm: In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan [1] used to generate a decision tree.

Algorithm

The ID3 algorithm can be summarized as follows:
1. Take all unused attributes and compute their entropy with respect to the test samples.
2. Choose the attribute for which entropy is minimum (or, equivalently, information gain is maximum).
3. Make a node containing that attribute.

The algorithm is as follows:

ID3 (Examples, Target_Attribute, Attributes)
Create a root node for the tree.
If all examples are positive, return the single-node tree Root, with label = +.
If all examples are negative, return the single-node tree Root, with label = -.
If the number of predicting attributes is empty, then return the single-node tree Root, with label = most common value of the target attribute in the examples.
Otherwise begin:
  o A = the attribute that best classifies the examples.
  o Decision tree attribute for Root = A.
  o For each possible value, v, of A:
      Add a new tree branch below Root, corresponding to the test A = v.
      Let Examples(v) be the subset of examples that have the value v for A.
      If Examples(v) is empty, then below this new branch add a leaf node with label = most common target value in the examples.
      Else below this new branch add the subtree ID3 (Examples(v), Target_Attribute, Attributes - {A}).
End
Return Root
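The pseudocode above translates fairly directly into Python. The sketch below is a simplified rendering (it omits the empty-branch case and uses our own helper names and an invented toy data set), not a reference implementation.

from collections import Counter
from math import log2

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attr, target):
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples share one label
        return labels[0]
    if not attributes:                              # no predicting attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))   # A = best attribute
    node = {best: {}}
    for value in {e[best] for e in examples}:       # one branch per value of A
        subset = [e for e in examples if e[best] == value]
        node[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return node

data = [{"outlook": "sunny", "windy": False, "play": "no"},
        {"outlook": "sunny", "windy": True, "play": "no"},
        {"outlook": "overcast", "windy": False, "play": "yes"},
        {"outlook": "rainy", "windy": False, "play": "yes"},
        {"outlook": "rainy", "windy": True, "play": "no"}]
print(id3(data, "play", ["outlook", "windy"]))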

The ID3 metrics

To avoid overtraining, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible tree. The optimization step makes use of information entropy.

Entropy

Entropy(S) = - Σ (i = 1 to n) f_S(i) · log2 f_S(i)

where:
Entropy(S) is the information entropy of the set S;
n is the number of different values of the chosen attribute in S (entropy is computed for one chosen attribute);
f_S(i) is the frequency (proportion) of the value i in the set S;
log2 is the binary logarithm.

An entropy of 0 identifies a perfectly classified set. Entropy is used to determine which node to split next in the algorithm: the higher the entropy, the higher the potential to improve the classification here.

Gain

Gain is computed to estimate the gain produced by a split over an attribute A:

Gain(S, A) = Entropy(S) - Σ (i = 1 to m) f_S(A_i) · Entropy(S_{A_i})

where:
Gain(S, A) is the gain of the set S after a split over the attribute A;
Entropy(S) is the information entropy of the set S;
m is the number of different values of the attribute A in S;
f_S(A_i) is the frequency (proportion) of the items possessing A_i as the value for A in S;
A_i is the i-th possible value of A;
S_{A_i} is a subset of S containing all items where the value of A is A_i.
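A quick numeric check of the two formulas, using a small invented set S of 10 items (6 labelled "yes", 4 labelled "no") and a two-valued attribute A, might look like this in Python:

from math import log2

def H(freqs):                                  # Entropy = -sum f * log2(f)
    return -sum(f * log2(f) for f in freqs if f > 0)

entropy_S = H([6/10, 4/10])                    # about 0.971 bits

# Value A1 covers 5 items (4 yes, 1 no); value A2 covers 5 items (2 yes, 3 no).
remainder = (5/10) * H([4/5, 1/5]) + (5/10) * H([2/5, 3/5])
gain_A = entropy_S - remainder                 # Gain(S, A): entropy removed by the split
print(round(entropy_S, 3), round(gain_A, 3))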

Gain quantifies the entropy improvement obtained by splitting over an attribute: higher is better.

CHAID

CHAID is a type of decision tree technique, based upon adjusted significance testing (Bonferroni testing). The technique was developed in South Africa and was published in 1980 by Gordon V. Kass, who had completed a PhD thesis on the topic. CHAID can be used for prediction (in a similar fashion to regression analysis, this version of CHAID being originally known as XAID) as well as for classification, and for detection of interaction between variables. CHAID stands for CHi-squared Automatic Interaction Detection, based upon a formal extension of the US AID (Automatic Interaction Detection) and THAID (THeta Automatic Interaction Detection) procedures of the 1960s and 70s, which in turn were extensions of earlier research, including work performed in the UK in the 1950s.

In practice, CHAID is often used in the context of direct marketing to select groups of consumers and predict how their responses to some variables affect other variables, although other early applications were in the field of medical and psychiatric research. Like other decision trees, CHAID's advantages are that its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis.

CHAID detects interaction between variables in the data set. Using this technique it is possible to establish relationships between a dependent variable, for example readership of a certain newspaper, and explanatory variables such as price, size, supplements, etc. CHAID does this by identifying discrete groups of respondents and, by taking their responses to explanatory variables, seeking to predict what the impact will be on the dependent variable. CHAID is often used as an exploratory technique and is an alternative to multiple linear regression and logistic regression, especially when the data set is not well suited to regression analysis.
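The core of a CHAID-style split decision is a chi-squared independence test between a predictor and the dependent variable. The sketch below shows only that test, using an invented readership-by-price contingency table and assuming SciPy is available; real CHAID additionally merges similar categories and applies the Bonferroni adjustment mentioned above.

from scipy.stats import chi2_contingency

# Rows: reads / does not read the newspaper; columns: low, medium, high price band (invented counts)
table = [[120, 90, 40],
         [80, 110, 160]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p_value:.4f}")     # a small p-value suggests the predictor is worth splitting on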

Association Rules Introductory Overview


The goal of the techniques described in this topic is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects, as well as in the data mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and research, from the analysis of consumer preferences or human resource management to the history of language. They enable analysts and researchers to uncover hidden patterns in large data sets, such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z." The implementation of the so-called a-priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows us to rapidly process huge data sets for such associations, based on predefined "threshold" values for detection.

How association rules work. The usefulness of this technique to address unique data mining problems is best illustrated in a simple example. Suppose we are collecting data at the check-out cash registers at a large book store. Each customer transaction is logged in a database and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, we want to learn which books are likely to be purchased by a customer who we know already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest those additional titles to the customer. You may already be familiar with the results of these types of analyses if you are a customer of various on-line (Web-based) retail businesses; many times when making a purchase on-line, the vendor will suggest similar items (to the ones purchased by you) at the time of checkout, based on rules such as "customers who buy book title A are also likely to purchase book title B," and so on.

Sequence Analysis. Sequence analysis is concerned with a subsequent purchase of a product or products given a previous purchase. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliance. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data. There is a wide range of applications for sequence analysis in many areas of industry, including customer shopping patterns, phone call patterns, the fluctuation of the stock market, DNA sequences, and Web log streams.

Link Analysis. Once extracted, rules about associations or the sequences of items as they occur in a transaction database can be extremely useful for numerous applications. Obviously, in retailing or marketing, knowledge of purchase "patterns" can help with the direct marketing of special offers to the "right" or "ready" customers (i.e., those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). However, transaction databases occur in many areas of business, such as banking. In fact, the term "link analysis" is often used when these techniques - for extracting sequential or non-sequential association rules - are applied to organize complex "evidence." It is easy to see how the "transactions" or "shopping basket" metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on. Applying the technologies described here to such databases may quickly extract patterns and associations between individuals and actions and, hence, for example, reveal the patterns and structure of some clandestine illegal network.

Unique data analysis requirements. Crosstabulation tables, and in particular Multiple Response tables, can be used to analyze data of this kind. However, in cases when the number of different items (categories) in the data is very large (and not known ahead of time), and when the "factorial degree" of important association rules is not known ahead of time, these tabulation facilities may be too cumbersome to use, or simply not applicable. Consider once more the simple bookstore example discussed earlier. First, the number of book titles is practically unlimited. In other words, if we made a table where each book title represented one dimension, and the purchase of that book (yes/no) formed the classes or categories for each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and, worse, if there were any three-way association rules "hiding" in the data, we would miss them completely. The a-priori algorithm implemented in Association Rules will not only automatically detect the relationships ("cross-tabulation tables") that are important (i.e., cross-tabulation tables that are not sparse, not containing mostly zeros), but also determine the factorial degree of the tables that contain the important association rules.

To summarize, use Association Rules to find rules of the kind "if X then (likely) Y," where X and Y can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.
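To show what such "if X then (likely) Y" rules look like in code, here is a minimal sketch limited to pairs of items; the transactions and the threshold values are invented, and a full a-priori implementation would extend the same counting idea to larger itemsets.

from itertools import combinations
from collections import Counter

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6          # predefined "threshold" values for detection

item_count = Counter(i for t in transactions for i in t)
pair_count = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
n = len(transactions)

for pair, count in pair_count.items():
    if count / n < MIN_SUPPORT:                 # keep only frequently co-occurring pairs
        continue
    for x in pair:
        y = next(iter(pair - {x}))
        confidence = count / item_count[x]      # how often Y appears when X does
        if confidence >= MIN_CONFIDENCE:
            print(f"if {x} then {y} (support={count / n:.2f}, confidence={confidence:.2f})")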

Statistical Analysis
Statistical analysis refers to a collection of methods used to process large amounts of data and report overall trends. Statistical analysis is particularly useful when dealing with noisy data. Statistical analysis provides ways to objectively report on how unusual an event is based on historical data. Our server uses statistical analysis to examine the tremendous amount of data produced every day by the stock market. We usually prefer statistical analysis to more traditional forms of technical analysis because statistical analysis makes use of every print. Candlesticks, by comparison, throw away an arbitrary number of prints before the analysis starts.
Candlesticks, point and figure charts, and other traditional forms of technical analysis were designed long ago. They were specifically created for people who were analyzing the data by hand. Statistical analysis looks at more data, and typically requires a computer.


Statistical Approaches and Data Mining


A third approach, based on statistics, is more evidence-oriented. One goal is not to find patterns within the price history of a single stock, but to find patterns valid across all stocks. The price development of a given stock is analyzed and compared against a large dataset of results for stocks with similar price development. If there is a correlation to future returns for stocks exhibiting similar price patterns, you can use this correlation when trading. One advantage is that you are not optimizing within the historical prices of a single stock, but checking the stock's price patterns against a dataset that provides estimations of future returns and other measures. You are thus eliminating curve-fitting and gaining statistically valid estimations. Optimal Trader is a tool combining technical analysis with statistical approaches and data mining (the process of extracting hidden patterns from data), giving you the advantages of both approaches.
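One simple way such a comparison could be sketched (this is only an illustration on random data, not Optimal Trader's actual method) is to take the most recent window of returns, find the most similar historical windows, and average what followed them:

import numpy as np

rng = np.random.default_rng(0)
history = rng.normal(0, 0.01, 2000)             # stand-in for a library of daily returns
window = 10

current = history[-window:]                     # the price pattern we want to match
candidates = [(i, history[i:i + window]) for i in range(len(history) - 2 * window)]

nearest = sorted(candidates, key=lambda c: float(np.linalg.norm(c[1] - current)))[:25]
future = [history[i + window:i + 2 * window].sum() for i, _ in nearest]
print(f"estimated 10-day forward return: {np.mean(future):+.4f}")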

Data Mining
Data Mining, also called Knowledge Discovery, is a general term for a variety of interlocking technologies that, used together, find, isolate, and quantify patterns hidden in large and often disparate collections of data. As a general knowledge extraction process, its primary goal is the discovery of nontrivial and potentially valuable information hidden in local files, databases, and repositories scattered across distributed networks. Employing a wide spectrum of statistical analysis, machine learning, graph theory, and advanced computer science techniques, data mining uncovers often subtle patterns and time-varying relationships buried deep in the morass of data. From these relationships and shifting patterns it evolves a set of rules that predict and classify future behaviors. Data mining can also be described as an information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

While earlier generations of data mining approaches generated static reports and statistical descriptions, the newest breed of data mining systems (such as the mechanisms in Scianta Intelligence's Adaptive Intelligence Platform) generate powerful models that are often tightly integrated with client applications. These models can take on a wide variety of forms: if-then rule-based knowledge systems, adaptive feedback models that incorporate machine learning, linkage and affinity models that discover shared connections and relationships, statistical models, regression models for time-series prediction, and classification models. Typical uses of data mining include supply chain management, customer relationship analysis and profiling, fraud and anomalous behavior discovery, risk assessment, inventory optimization, and customer cross-selling and profitability.

Statistical analysis is an aspect of business intelligence (BI) that involves the collection and scrutiny of business data and the reporting of trends. Statistical analysis examines every single data sample in a population (the set of items from which samples can be drawn), rather than a cross-sectional representation of samples as less sophisticated methods do.


Statistical analysis can be broken down into five discrete steps, as follows.

1. Describe the nature of the data to be analyzed.
2. Explore the relation of the data to the underlying population.
3. Create a model to summarize understanding of how the data relates to the underlying population.
4. Prove (or disprove) the validity of the model.
5. Employ predictive analytics to anticipate future trends.
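As a hedged, toy walk-through of these five steps (assuming NumPy and scikit-learn are available, with invented data), the whole cycle can fit in a few lines:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(0, 2, 200)        # noisy, roughly linear relationship

print(x.mean(), x.std(), y.mean(), y.std())        # 1. describe the nature of the data
print(np.corrcoef(x.ravel(), y)[0, 1])             # 2. explore the relation to the population

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = LinearRegression().fit(x_train, y_train)   # 3. create a summarizing model
print(model.score(x_test, y_test))                 # 4. check (prove or disprove) the model's validity
print(model.predict([[12.0]]))                     # 5. predict an unseen case (predictive analytics)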

Besides statistical analysis, BI applications include the activities of decision support systems (DSS), query and reporting, online analytical processing (OLAP), forecasting, and data mining.
