Sie sind auf Seite 1von 52

Statistical Analysis with Oracle

Marc Daniau Product group


marc.daniau@sap.com

Introduction
The ANSI SQL offers analytical capabilities useful for

advanced data analysis like linear regression and


correlation. Beyond supporting the standard SQL functions
Oracle provides additional vendor-specific functions that

enable to perform statistical analysis such as descriptive


statistics and hypothesis testing. In the following
presentation we will put into practice the main statistical

built-in functions from Oracle 10g release 2. We will also


show an example of a user-defined function.
SLIDE 2

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 3

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Summarizing a data set


The built-in SQL functions involved in

the following sample are avg, median,


stats_mode, variance, stddev, min,
max and percentile_cont.

Note there is no built-in functions for


skew and kurtosis; we had to use
custom SQL expressions.

SLIDE 4

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

The BOBJ Universe


Here is a sample Universe exposing the stats_* functions.

New functions can be added


to the list by updating the
oracle.prm file.

SLIDE 5

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Box Plot
The quartiles are computed using the Oracle SQL function
percentile_cont() within group (order by ()).

SLIDE 6

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span


The delivery span focuses on the time between when a

customer requested the product and when it was delivered.


The span is a measure of variation similar to the interquartile range but instead of looking at the middle 50% of

the observations, it looks at the middle 90%.


Span = 95th Percentile 5th Percentile
The Oracle SQL functions to compute the span are:
percentile_cont(0.95) and percentile_cont(0.05).
SLIDE 7

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

A real-life example: GE Delivery span


The goal is to squeeze the two sides of the delivery span,

days early & days late, ever closer to the center: the exact
day the customer desired. Reducing variation is what
quality is about.
Days
Late

Days
Early

SLIDE 8

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 9

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression


Studying the relationship of two time series
The built-in SQL functions used are regr_* and stats_*.
In this DeskI document, it
appears that X and Y
increase together.

SLIDE 10

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression


Checking the normality of the residuals
The SQL function row_number() over() is used. A custom

PL/SQL function inverse_phi() is employed for building


the X axis; an alternative consists of using a lookup Z table.

SLIDE 11

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User defined functions


The Oracle user can extend the list of built-in functions with

his own PL/SQL functions. Following are examples of


custom functions.
The custom function
inverse_phi().

SLIDE 12

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User defined functions


Custom functions can be exposed in the BOBJ SQL editor

by adding them in the oracle.prm file.

The custom function


inverse_phi().

SLIDE 13

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression


Detecting Outliers (unusual values)
The variables X and Y in this example are not time series.

Mortality (Y) tends to decrease as education (X) increases.


The limits are computed with the regr_* SQL functions.

Ouliers appear in
red.

SLIDE 14

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Linear Regression


Detecting Outliers (unusual values)
After changing the sigma factor from 2 to 2.5, New Orleans

is no longer beyond limits.

SLIDE 15

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression


Some simple non-linear relationships can be transformed

into linear relationships. Every time you can transform


a relationship into an equation of the form Y = a + b X ,
you can use the least squares method to fit the data.

This technique is used for fitting curves such as


exponential, logarithmic, power, hyperbola, logistic (Pearl)
and Gompertz.

SLIDE 16

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression


Fitting a S-shaped curve (Pearl)
The sample below plots the cumulative sales by month for a

given product. The X and Y variables are computed using


respectively row_number() over and sum(sum()) over().
The ln() function is used for transformation; regr_slope()

and regr_intercept() for getting the regression coefficients.

SLIDE 17

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Simple Non Linear Regression


Fitting a S-shaped curve (Gompertz)
The error values are computed within the BOBJ report.

The number of
periods to forecast
SLIDE 18

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Weighted Least Squares (local regression)


Smoothing a time series
The moving slope and intercept are calculated on the

RDBMS side. The user chooses the strength of the smoother


(moving window size).

SLIDE 19

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Multiple Linear Regression


Forecasting
The regression coefficients are calculated on the fly by the

rdbms using regr_* functions.

SLIDE 20

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 21

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Checking for auto-correlation


The Lag Plot
The lag plot helps uncover seasonality and patterns from

sequential data (e.g. time series) that may have been missed
by looking only at a line chart. The lag plot is useful to check
auto-correlation (lack of independence in the series).

It consists of a scatter diagram plotting Yt on the vertical axis


versus Yt-lag on the horizontal axis.
The Oracle SQL function lag(,) over() is illustrated next.

SLIDE 22

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Checking for auto-correlation


The Lag Plot
The sample line chart below tells us there is no shift of

average over time. The lag plot (lag 1) on the right shows a
pattern that cannot be seen in the line chart. Note that outliers
(unusual values) appear more clearly on the lag plot.

SLIDE 23

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Auto-correlation function (ACF)


While the lag plot displays the individual data points for a

given lag, the ACF plot gives a summary picture of autocorrelation over multiple lags. The ACF plot is useful in
identifying seasonal or cyclical patterns in a time series.

It shows the correlation coefficient statistic on the vertical axis


over the different lags on the horizontal axis.
The Oracle SQL functions regr_* are used in the following
example.
SLIDE 24

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

ACF Plot
The example below is the ACF plot of the monthly Australian

beer production. The thin bars represent the correlation


coefficients. The dotted flat lines are the significance limits.

The peak on lag 12


indicates seasonality.

SLIDE 25

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 26

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Comparison types
Organizations measure performance of individual entities

such as products, geographies, suppliers or employees by


making different types of comparisons :
Over time

Against pre-determined goals


Against a comparator group

SLIDE 27

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Evaluation against a group


Most of the examples presented in this section illustrate how
to evaluate the performance of an individual entity against a
comparator group. We will see various methods:
Percent rank

T-score
Percent of leader
Range score

Percent of total
Individual ratio versus group ratio
SLIDE 28

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent Rank
The WebI document below ranks countries on economic

performance. It involves the following Oracle SQL functions:


rank() over() and percent_rank() over().

SLIDE 29

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

T-score
Comparing against the group average
The Xcelsius radar chart here requires the avg() over()

and stddev_pop() over() SQL functions. The data set


consists of 7 indicators and 7 countries.
The blue line represents the group
average normalized at value 50. The
amber line corresponds to the
country T-score (here USA).

The slicer allows the user to see the


performance of the different countries
in the G7 group.
SLIDE 30

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent of Leader
Comparing against the best in the group
The sample document below rates UK versus other nations

on educational research. It requires the max() over() SQL


functions. The underlying data set covers 35 nations and
includes 7 indicators within a period from 1987 to 1998.

SLIDE 31

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Range score
Positioning the individual in the group range
The report below evaluates companies performance based

on 3 indicators. The range score normalizes the results


before computing a composite score. It involves the SQL
functions min() over() and max() over().

We apply a weighted
average to obtain a
composite score. This
is done within the BOBJ
document.
SLIDE 32

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent of Total
Even though it is limited to additive indicators such as

revenue, cost or number of customers, the percent of


group total is commonly used. The Pareto chart is an
example of graph that displays the percentage of

each entity (represented as bars) relative to the total value of


the group. It helps identify the largest contributors by
presenting them first on the left side of the graph.

SLIDE 33

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Percent of Total - Pareto chart


The Oracle built-in function ratio_to_report() over()

enable to compute the percent of total required by the


Pareto chart.

SLIDE 34

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Cumulative Percent of Total


Here the percent of total for the two indicators is obtained

with the ratio_to_report() over() SQL function. The


cumulative sum is done within the BOBJ crosstab.

The report indicates that Callahan


generates the largest revenue with
the smallest number of customers.
We also observe that roughly half of
the Sales (48.7%) is generated by a
quarter of the customers (22.4%)
and by a quarter of the sales force
(Callahan and King).

SLIDE 35

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio


We just saw how we can apply a percent of total to additive

indicators like Sales and Profit. Lets assume we want to track


the performance of Gross Profit rate calculated as Gross
Profit divided by Gross Sales. The percent of total analysis

will not work in that case since ratios are not additive. We
could use the average of the Gross Profit rates in the group
as the baseline but instead we will compare to the group

Gross Profit rate obtained as:


group total Gross Profit / group total Gross Sales.
SLIDE 36

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Individual ratio versus group ratio


We use the rollup() SQL function in order to generate on

the fly a summary row that aggregates values for Sales,


Profit and Profit rate.

The blue cursors on both sides of the


table indicate the group row which
serves as the baseline. The categories
above the baseline exhibit a better
profit rate than the group.

SLIDE 37

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Measuring process performance


Companies that implement a quality improvement

methodology measure performance of their processes by


analyzing the variation. A process is given upper and lower
specification limits (USL & LSL) and a target. Variation over

the target is measured to evaluate how capable the process


is relative to the specifications. One process performance
measurement method is presented next for illustration.

SLIDE 38

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Process capability/performance
Measuring process variations
The Oracle built-in SQL function
stats_one_way_anova() returns the

variations.
The custom PL/SQL function phi() is

required to get the ppm/defect rates.

SLIDE 39

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 40

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Frequency Histogram 1/3


The bucketing of individual values is performed with the
width_bucket() function. The count aggregation is done by

the rdbms. Empty buckets are handled in the BOBJ report.

The data set is from the Oracle


HR demo database.
Salary exibits a distribution
stretched out towards the right.

SLIDE 41

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Frequency Histogram 2/3


Running the query.

The 3 arguments of
the width_bucket()
function are prompted

The end-user can


enter a custom
value: e.g. 0.277
SLIDE 42

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Listing the main


power-transforms.

Frequency Histogram 3/3


The square root transform brings some symmetry to the

salary data. The power-transforms require the following SQL


functions: power() and ln().

SLIDE 43

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Automatic bucketing
The BOBJ document below requires no inputs from the user

with regards to the histogram buckets. We use the Sturges


rule to determine the number of bars automatically. The
histogram count aggregation is performed by the rdbms.

The normal plot


complements the
histogram in assessing
the normality of the data
set.
SLIDE 44

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

User-defined intervals
In the sample below the end-user enters the histogram

interval width. The count aggregation is done by the rdbms.


The Gaussian curve is estimated within the BOBJ document.

The blue bars represent the


differences between the
observed frequencies
(histogram bars) and the
estimated frequencies (fitted
curve). The Gaussian curve
is turned into a horizontal
line (zero baseline).

SLIDE 45

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Weibull distribution
The Hazard plot below uses the following SQL Functions:
row_number() over() and sum() over(). The regression line

is computed within the BOBJ document.

SLIDE 46

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Topics
Descriptive Statistics
Linear Regression
Correlation
Performance measurement
Distribution analysis
Hypothesis Testing

SLIDE 47

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test on a single group


Comparing the mean to a target
In this sample report we use

the Oracle SQL function


stats_t_test_one().

SLIDE 48

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test for two paired groups


Before-After comparison of mean
In this sample report we use

the Oracle SQL function


stats_t_test_paired().

SLIDE 49

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

t-test for two independent groups


Comparing means
In this sample report we use

the Oracle SQL function


stats_t_test_indep().

SLIDE 50

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

Analysis of variance
Comparing means across multiple groups
The Oracle SQL function
stats_one_way_anova()

is used here.
The following sample

involves six groups.

SLIDE 51

COPYRIGHT 2008 BUSINESS OBJECTS S.A. ALL RIGHTS RESERVED.

2008 SAP AG. All rights reserved.


No part of this publication may be reproduced or transmitted in any form or for any
purpose without the express permission of SAP AG. The information contained
herein may be changed without prior notice.
Some software products marketed by SAP AG and its distributors contain
proprietary software components of other software vendors.
Microsoft, Windows, Excel, Outlook, and PowerPoint are registered trademarks of
Microsoft Corporation.
IBM, DB2, DB2 Universal Database, System i, System i5, System p, System p5,
System x, System z, System z10, System z9, z10, z9, iSeries, pSeries, xSeries,
zSeries, eServer, z/VM, z/OS, i5/OS, S/390, OS/390, OS/400, AS/400, S/390
Parallel Enterprise Server, PowerVM, Power Architecture, POWER6+, POWER6,
POWER5+, POWER5, POWER, OpenPower, PowerPC, BatchPipes,
BladeCenter, System Storage, GPFS, HACMP, RETAIN, DB2 Connect, RACF,
Redbooks, OS/2, Parallel Sysplex, MVS/ESA, AIX, Intelligent Miner, WebSphere,
Netfinity, Tivoli and Informix are trademarks or registered trademarks of IBM
Corporation.
Linux is the registered trademark of Linus Torvalds in the U.S. and other
countries.
Adobe, the Adobe logo, Acrobat, PostScript, and Reader are either trademarks or
registered trademarks of Adobe Systems Incorporated in the United States and/or
other countries.
Oracle and Java are registered trademarks of Oracle and/or its affiliates.

UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.
Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and
MultiWin are trademarks or registered trademarks of Citrix Systems, Inc.
HTML, XML, XHTML and W3C are trademarks or registered trademarks of
W3C, World Wide Web Consortium, Massachusetts Institute of Technology.

2008 SAP AG. All rights reserved.

SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign, SAP BusinessObjects


Explorer, StreamWork, and other SAP products and services mentioned herein as
well as their respective logos are trademarks or registered trademarks of SAP AG
in Germany and other countries.
Business Objects and the Business Objects logo, BusinessObjects, Crystal
Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other Business
Objects products and services mentioned herein as well as their respective logos
are trademarks or registered trademarks of Business Objects Software Ltd.
Business Objects is an
SAP company.
Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and other
Sybase products and services mentioned herein as well as their respective logos
are trademarks or registered trademarks of Sybase, Inc. Sybase is an SAP
company.
All other product and service names mentioned are the trademarks of their
respective companies. Data contained in this document serves informational
purposes only. National product specifications may vary.
The information in this document is proprietary to SAP. No part of this document
may be reproduced, copied, or transmitted in any form or for any purpose without
the express prior written permission of SAP AG.

Das könnte Ihnen auch gefallen