
Big Data and Data Mining

Professor Tom Fomby
Director, Richard B. Johnson Center for Economic Studies
Department of Economics, SMU
May 23, 2013

Big Data: Many Observations on Many Variables


Data File
OBS No.      Target Var.   Var. 1   Var. 2   ...   Var. 100
1            0             63       .        ...   .
2            1             54       .        ...   .
3            0             44       .        ...   .
.            .             .        .        ...   .
1,500,000    1             32       .        ...   .

Types of Problems
Customer and Student Retention
Employee Churn
Credit Scoring (Auto or Home Loans)
Bond Ratings
What Characteristics Make for a Successful Mary Kay Representative?
Detection of Fraudulent Insurance Claims
Is a Newly Introduced Product Meeting with Consumer Acceptance or Rejection?
Who is a likely Donor to your Charity?
Early Detection of a Stolen or Compromised Credit Card

Types of Problems
What kind of genetic markers imply certain susceptibilities to specific diseases?
Netflix and recommendations of Related and Suggested Movies
Recommendations for Book Purchases: Amazon Side-Bars
Click Stream Analysis of Optimal Web Base Design

Statistical Hypothesis Testing Versus Prediction

Example of Statistical Hypothesis Testing


A Clinical Trial of 400 people: 200 randomly selected into a Control (Placebo) Group and the Other 200 into a Treatment Group
Question: Does the Drug Treatment Significantly Reduce a Person's Cholesterol Count?
Method: Conventional Statistical Methods Like a T-Test of Significant Difference in Population Means
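As a rough illustration of the method named above, a two-sample t statistic can be computed from the cholesterol counts of the two groups. The sketch below uses the pooled-variance (equal variance) form of the test; the function name and the tiny data set are hypothetical and are not taken from any actual trial.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(control, treatment):
    """Two-sample t statistic, assuming equal population variances."""
    n1, n2 = len(control), len(treatment)
    # Pooled sample variance (statistics.variance uses the n-1 denominator).
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)) / (n1 + n2 - 2)
    standard_error = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_stat = (mean(control) - mean(treatment)) / standard_error
    return t_stat, n1 + n2 - 2  # statistic and degrees of freedom

# Made-up cholesterol counts, for illustration only.
control = [210, 220, 230, 240, 250]
treatment = [190, 200, 210, 220, 230]
t_stat, dof = pooled_t_statistic(control, treatment)
```

The statistic would then be compared with a critical value from the t distribution with `dof` degrees of freedom to decide whether the difference in means is significant.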

Example of a Prediction Problem


Early Detection of a Stolen or Compromised Credit Card
Not So Interested in How or Why the Credit Card was Stolen but Instead Whether Recent Transactions are Indicative of a Stolen or Compromised Credit Card
Tool: Box Plot

Getting Gems From the Data

Crankshaft Cartoon

The Task of Constructing a Meaningful Data Warehouse

Data Rich, Information Poor


The Amount of Raw Data Stored in Corporate Databases is Exploding
Most of this information is recorded instantaneously and at minimal cost
Databases are measured in gigabytes and terabytes (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!)
Walmart uploads 20 million point-of-sale transactions to 500 parallel processing storage devices each day.
Raw data by itself, however, does not provide much information. That is where Data Mining Comes in!

What is Data Mining?


"Extracting useful information from large datasets" (Hand et al., 2001)
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." (Berry and Linoff, 1997, 2000)
"Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques" (Gartner Group, 2004)

Four Distinct Characteristics of Data Mining Projects


Partitioning given data into Training, Validation, and Test Parts
Cross Validation using the Validation and Test Parts to gauge the worthiness of competing models
Using Ensemble Methods to increase predictive accuracy. (There is no such thing as a correct model!)
Continual Monitoring of a Predictive Analytics (PA) system to guard against structural change and to maintain predictive accuracy
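The first characteristic, partitioning, can be sketched in a few lines. The example below is only an illustration: the 50/30/20 proportions, the fixed seed, and the function name `partition` are assumptions, not values prescribed by the slides.

```python
import random

def partition(rows, train=0.5, validation=0.3, seed=0):
    """Shuffle rows and split them into training, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # whatever remains is the test part

# 100 observation numbers stand in for the rows of a data file.
train_part, validation_part, test_part = partition(range(100))
```

Competing models would be fitted on the training part, compared on the validation part, and the chosen model checked once more on the test part.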

More Detailed Discussion of Specific Data Mining Applications


Text Mining (Classification of Documents and Evolution of Opinions on Blogs)
Target Marketing
Credit Scoring
Bond Ratings: Calculating Default Probabilities on Bonds (Bond rating services like Moody's, Standard & Poor's, Fitch, etc.)
Fraud Detection
Customer Retention
Franchise Locations and Performance
Customer Segmentation
Affinity Analysis (i.e., Market Basket Analysis)
Link Analysis (Webpage design)
Many Other Fields including Clinical Science, Statistical Genetics, Political Science, Real Estate Assessment, and College Admissions Practices

Text Mining

Text Mining: Converting Unstructured Data to Structured Data

Text -> Frequencies of Words and Phrases -> Numbers for Prediction
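The conversion from unstructured text to structured numbers can be sketched as a word count. In the example below the function names, the vocabulary, and the sample sentence are all hypothetical; a real project would choose its vocabulary from the documents under study.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word occurrences in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def to_feature_vector(text, vocabulary):
    """Turn text into a list of counts, one per vocabulary word."""
    counts = word_frequencies(text)
    return [counts[word] for word in vocabulary]

vocabulary = ["upon", "while", "whilst"]  # illustrative marker words
doc = "Upon reflection, the matter rests upon the people while they deliberate."
features = to_feature_vector(doc, vocabulary)
```

Each document becomes one row of counts, so a collection of documents becomes an ordinary data file that prediction and classification methods can use.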

Who Wrote the Federalist Papers?
Frederick Mosteller and David Wallace, "Inference in an Authorship Problem," JASA, June 1963

Comparing Two Documents


Doc 1

Doc 2


Target Marketing
Target Marketing is the process of choosing specific customers to advertise to and/or to offer discounts to in order to increase the sales of the company.
Target Marketing usually proceeds in two stages: (1) Determining the probability that the solicited customer will purchase products from the company once solicited and (2) Once the solicited customer decides to purchase items from the company, estimating the profit that will likely be generated by the customer's purchases.
Thus the goal is to advertise only to those potential customers that represent expected profits that exceed the cost of advertising to the customer.
We then need to use data mining techniques to determine (1) the probability of purchase and (2) conditional on purchase, the expected profit of purchase.
Expected Profit of Purchase = (Probability of Purchase) x (Expected profits from purchase, conditional on purchase)
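The two-stage rule above amounts to a small calculation per customer. In the sketch below the customer names, probabilities, conditional profits, and the advertising cost are made-up numbers; in practice both inputs would come from fitted data mining models.

```python
def expected_profit(p_purchase, profit_given_purchase):
    """Expected profit = probability of purchase x profit conditional on purchase."""
    return p_purchase * profit_given_purchase

def should_solicit(p_purchase, profit_given_purchase, ad_cost):
    """Advertise only if the expected profit exceeds the cost of advertising."""
    return expected_profit(p_purchase, profit_given_purchase) > ad_cost

# (customer, probability of purchase, expected profit given purchase)
customers = [
    ("A", 0.10, 40.0),  # expected profit 4.0
    ("B", 0.02, 60.0),  # expected profit 1.2
    ("C", 0.25, 12.0),  # expected profit 3.0
]
AD_COST = 2.0  # hypothetical cost of soliciting one customer

targets = [name for name, p, profit in customers
           if should_solicit(p, profit, AD_COST)]
```

Customer B has the largest conditional profit but is dropped, because the low purchase probability pulls the expected profit below the advertising cost.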

Credit Scoring
Credit scoring involves using data mining tools to determine the creditworthiness of loan applicants.
The task is determining the probability that a potential borrower will default on his or her obligations, given the personal characteristics of the borrower and the macroeconomic conditions of the economy at the time.
Some Examples: Citibank and Credit Card Issuers reviewing applicants for credit cards; Banks considering loaning money for mortgages
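One common way to turn borrower characteristics and macroeconomic conditions into a default probability is a logistic model. The slides do not say which model a given lender uses, so the sketch below is an assumption throughout: the two features, the coefficients, and the intercept are hypothetical and would normally be estimated from historical loan data.

```python
import math

def default_probability(features, coefficients, intercept):
    """Logistic model: map a weighted sum of features to a probability."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for (debt-to-income ratio, unemployment rate in %).
COEFFICIENTS = [3.0, 0.2]
INTERCEPT = -4.0

# One borrower with little debt in a good economy, one with heavy debt in a weak one.
low_risk = default_probability([0.10, 4.0], COEFFICIENTS, INTERCEPT)
high_risk = default_probability([0.60, 9.0], COEFFICIENTS, INTERCEPT)
```

A lender would compare the resulting probability with a cutoff of its own choosing when deciding whether to extend credit.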

Bond Ratings: Calculating Default Probabilities on Bonds


Given the financial characteristics of a bond issuer and the macroeconomic conditions at the time, what is the probability that the bond issuer will, at some time in the future, not be able to service the obligations of the bond?
Bond rating services like Moody's, Standard & Poor's, and Fitch build probability of default models and use them to give bonds their credit ratings (AAA, AA, ..., BBB, etc.). The lower the probability of default, the higher the bond rating and vice versa. In turn, these ratings give rise to differential interest rates paid by the bond issuers. (See Town and Gown PPT for example.)

Fraud Detection
Of interest to the IRS, Credit Card Companies, and Auditors
Given a history of transactions, or a record of typical income tax reports, income statements, or balance sheets, which transactions or reports appear to be outliers?
Basic Tool: Statistical Outlier Analysis. Roughly speaking: What is three or more standard deviations from the norm?
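The three-standard-deviation rule of thumb can be sketched directly. The transaction amounts below are invented, and `flag_outliers` is a hypothetical helper name; the rule is a rough screen rather than a full fraud model.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return the values lying `threshold` or more standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mu) >= threshold * sigma]

# Thirty routine card transactions and one unusually large one (made-up amounts).
amounts = [50.0] * 10 + [52.0] * 10 + [48.0] * 10 + [500.0]
suspicious = flag_outliers(amounts)
```

Note that a very large outlier inflates the standard deviation it is measured against, so with only a handful of observations this rule can fail to flag anything.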

Customer Retention
What factors determine the loyalty displayed by a customer?
When is a customer likely to jump ship?
Would loyalty programs be useful?
Basic Tool: Duration Modeling. This method determines what factors extend or limit the durations of customers with companies.
Purpose: To identify potentially fragile customers and then incentivize them so that they will remain loyal
Result: Higher profits

Facets of a Data Mining Job


1. Development of Problem Statement and Consultation with Domain Experts
2. Data Acquisition
3. Data Preparation and Cleaning
4. Data Visualization and Summarization
5. Type of Task? Supervised Learning (Prediction, Classification), or Unsupervised Learning
6. Evaluation of Models (Data Partitioning and Cross Validation)
7. Scoring of New Data
8. Continual Review of Model Usefulness

Franchise Locations and Performance


What location factors affect the eventual profitability and success of franchises?
Even within a set of franchises, should the product mix be the same for all franchises or should franchises be treated differently?
Can franchisees be put into Clusters and treated differently so as to maximize the profits of the entire franchise operation?

Customer Segmentation
Suppose you are a giant publisher of magazines of various types. How do your subscribers differ across your portfolio of magazines?
When soliciting advertising for your magazines, how do you match your potential advertisers with your magazines so that the advertisers receive the maximum benefit for their advertising expenditures?
Is there a niche market (customer segment) that none of your magazines (or those of your competitors) is currently serving? Is this niche market substantial enough to warrant introducing a new magazine?
Also, retailers often like to be able to distinguish between customers with low versus high elasticities of demand for their products so that they will know to whom to offer discounts in order to increase their revenues and profits.
Basic Tool: Cluster Analysis
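Cluster analysis covers many algorithms; k-means is one widely used choice and is shown here only as an example, since the slides do not name a specific method. The subscriber data (age, annual magazine spend) is invented, and seeding the centers with the first k points is a simplification made to keep the sketch deterministic.

```python
import math

def k_means(points, k, iterations=20):
    """Plain k-means on tuples of numbers; the first k points seed the centers."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical subscribers described by (age, annual magazine spend in dollars).
subscribers = [(22, 30), (61, 210), (25, 35), (58, 190), (24, 28), (63, 205)]
labels, centers = k_means(subscribers, k=2)
```

Here the two segments separate younger, low-spend subscribers from older, high-spend ones; with real data the variables would usually be standardized first so that no single scale dominates the distances.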

Affinity Analysis
Given that a customer purchases a given set of items, what is the probability that they will purchase another set of items? That is, what does the customer's final market basket look like, given a partially-filled one?
Purpose: Arrange the store shelves of a retail store so as to make it most convenient for customers to purchase related goods and minimize the time of search and shopping. We want the customer to be able to shop quickly but at the same time buy a lot!
On book seller web pages, once you have indicated an interest in purchasing a given book, several related books are often brought to your attention by advertisements in the margins of the page you are currently on.
Affinity analysis is helpful in generating associated sales on retail web pages. This increases the profits of the web retailer.
Major Tool: Association Rules (the Apriori Algorithm).
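The probability in question is what association rules call confidence, and it is built from support, the share of baskets containing a given item set. The sketch below computes both measures on four invented baskets; it is not the Apriori algorithm itself, which adds an efficient search over candidate item sets.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated probability of buying `consequent` given `antecedent` was bought."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)

# Made-up market baskets, one set of items per customer visit.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

bread_support = support(baskets, {"bread"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

Three of the four baskets contain bread and two of those also contain butter, so the rule "bread implies butter" has a confidence of two thirds.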

Link Analysis
Explores Associations between groups (individuals, organizations, web sites, nation-states and the like)
Uses: To improve webpage design, to facilitate criminal investigations, and to benefit medical research in epidemiology and pharmacology, among other uses

Text Mining
To Understand Textual Content
For Finding Interesting Regularities in Text
Help Classify Documents by Type and Content
Useful for Medical Science: Search Engines seeking the most current research on particular maladies seen in patients
Beneficial in Building Spam Filters
Help Examine Evolution of Opinion vis-à-vis Blogs

Other Fields Where Data Mining is Used


Clinical Science and Providing Baseline Guidance for Clinical Treatment
Political Science (Modeling Voting Patterns, Election Outcomes and Appeal and Supreme Court Decisions)
Statistical Genetics: Relating Genetic characteristics with medical outcomes
Real Estate Assessment Models: County Assessors using predictive models to gauge the current value of houses for the purpose of assessing real estate taxes
College Admissions Practices: Which students should be admitted and how much financial aid is needed to ensure that the chosen student will matriculate?

Typical Data Mining Course Outline

Data Preparation & Exploration: Sampling, Cleaning, Summaries, Visualization, Partitioning, Dimension reduction
Prediction: MLR, K-Nearest Neighbor, Regression Trees, Neural Nets
Classification: K-Nearest Neighbor, Naive Bayes, Logistic Regression, Classification Trees, Neural Nets, Discriminant Analysis
Segmentation/Clustering
Affinity Analysis/Association Rules
Model Evaluation & Selection
Deriving Insight

Figure 1.2: Data mining from a process perspective

G. Shmueli, N. R. Patel and P. C. Bruce. Data Mining for Business Intelligence (2007).

Available Software Packages


XLMINER (Frontline Systems)
SAS Enterprise Miner (SAS Product)
SPSS Modeler (IBM Product)
R (Open Source)
Data Mining Certificates are available for SAS EM and SPSS Modeler

The Shortage of Trained Personnel for Doing Data Mining


"Big data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, May 2011
140,000 to 190,000 more deep analytical talent positions over the next decade
1.5 Million more data-savvy managers to take advantage of insights offered by Data Mining

What is SMU doing about this shortage?


Department of Economics: MS in Applied Economics and Predictive Analytics, Starting Fall of 2013
Department of Statistics: MS in Statistics and Data Analytics, Started Fall of 2012
Cox School of Business: MS in Business Analytics, Starting Fall of 2013

The Super Woman of Predictive Analytics

The Skill Set of Super Woman


Analytics: SAS/SPSS/Statistics

Data Management: Oracle and SQL

Reporting: Cognos and Dashboards
