# SAS Programming for Data Mining: AUC calculation using Wilcoxon Rank Sum Test

4/28/15, 7:02 PM

## SAS Programming for Data Mining

Copyright 2006-2014 / SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their
respective companies.

## AUC calculation using Wilcoxon Rank Sum Test

Accurately calculate AUC (area under the ROC curve) in SAS for data rank-ordered by a binary classifier

To calculate the AUC for a SAS data set that has already been rank-ordered by a binary classifier (such as a logistic
regression), where we have the binary outcome Y and a rank-order measurement P_0 or P_1 (the predicted probability of
class 0 or class 1, respectively), we can use PROC NPAR1WAY to obtain the Wilcoxon rank-sum statistics and derive an
accurate AUC from them.

The relationship between AUC and the Wilcoxon rank-sum statistic is: AUC = (W - W0)/(N1*N0) + 0.5, where N1 and
N0 are the numbers of observations in class 1 and class 0, W is the Wilcoxon rank sum of the class with the higher scores,
and W0 is the expected rank sum of that class under H0 (randomly ordered scores).
In one application example, PROC LOGISTIC reports c=0.911960, while this method calculates
AUC=0.9119491555.
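
The formula follows from the Mann-Whitney U statistic. A short derivation is sketched here, under the assumptions that W is the rank sum of class 1, that ranks run from 1 to N = N1 + N0 with mid-ranks for ties, and that higher scores point to class 1. The symbols U, s_i, t_j and N are introduced for illustration and do not appear in the original post.

```latex
\begin{aligned}
U &= \#\{(i,j) : s_i > t_j\} + \tfrac{1}{2}\,\#\{(i,j) : s_i = t_j\}
   && \text{(class-1 scores } s_i \text{ vs. class-0 scores } t_j\text{)} \\
W &= U + \frac{N_1 (N_1 + 1)}{2}
   && \text{(rank sum of class 1)} \\
W_0 &= \frac{N_1 (N + 1)}{2} = \frac{N_1 (N_1 + N_0 + 1)}{2}
   && \text{(expected rank sum of class 1 under } H_0\text{)} \\
W - W_0 &= U - \frac{N_1 N_0}{2} \\
\mathrm{AUC} &= \frac{U}{N_1 N_0} = \frac{W - W_0}{N_1 N_0} + 0.5
\end{aligned}
```

In words: U counts the event/non-event pairs that the score orders correctly (ties count one half), and AUC is that count divided by the number of pairs.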


    %macro AUC(dsn, Target, score);
      /* Capture the Wilcoxon scores table without printing any output */
      ods select none;
      ods output WilcoxonScores=WilcoxonScore;
      proc npar1way wilcoxon data=&dsn;
        where &Target^=.;
        class &Target;
        var &score;
      run;
      ods select all;

      data AUC;
        set WilcoxonScore end=eof;
        retain v1 v2 1;
        /* |W - W0|, taken from the first class */
        if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
        /* Running product of the class sizes: N1*N0 */
        v2=N*v2;
        if eof then do;
          d=v1/v2;
          Gini=d*2;
          AUC=d+0.5;
          put AUC= Gini=;
          keep AUC Gini;
          output;
        end;
      run;
    %mend;

    /* Simulated test data */
    data test;
      do i = 1 to 10000;
        x = ranuni(1);
        y = (x + rannor(2315)*0.2 > 0.35);
        output;
      end;
    run;

    /* Fit the model, score the data, and capture the Association table */
    ods select none;
    ods output Association=Asso;
    proc logistic data=test desc;
      model y = x;
      score data=test out=predicted;
    run;
    ods select all;

    /* Print the c-statistic reported by PROC LOGISTIC */
    data _null_;
      set Asso;
      if Label2='c' then put 'c-stat=' nValue2;
    run;

    %AUC(predicted, y, p_0);

http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html
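
The arithmetic of the rank-sum formula can be checked by hand on a tiny data set. The sketch below is an illustration, not part of the original post: the data set `toy`, its six rows, and the rank variable `r` are made up, and PROC RANK stands in for the rank sums that PROC NPAR1WAY reports.

```sas
/* Hypothetical toy data: 3 events (y=1), 3 non-events (y=0), no tied scores */
data toy;
  input y score @@;
  datalines;
1 0.9  1 0.8  0 0.7  1 0.6  0 0.4  0 0.2
;
run;

/* Rank the scores over all observations (mid-ranks if ties occur) */
proc rank data=toy out=ranked ties=mean;
  var score;
  ranks r;
run;

/* Accumulate W (rank sum of class 1) and the class sizes, then apply the formula */
data _null_;
  set ranked end=eof;
  W  + r*(y=1);
  N1 + (y=1);
  N0 + (y=0);
  if eof then do;
    W0  = N1*(N1 + N0 + 1)/2;          /* expected rank sum under H0 */
    AUC = (W - W0)/(N1*N0) + 0.5;
    put W= W0= AUC=;
  end;
run;
```

Sorted by score, the class-1 observations hold ranks 3, 5 and 6, so W = 14, W0 = 3*7/2 = 10.5 and AUC = 3.5/9 + 0.5 = 8/9. A direct count agrees: 8 of the 9 event/non-event pairs are ordered correctly.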

For the simulated test data above, NPAR1WAY gets AUC = 0.91766634744, while LOGISTIC reports
c-statistic = 0.917659.
So which one is more accurate? I would say NPAR1WAY, because we can use yet another procedure,
PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the
Gini value calculated from NPAR1WAY is 0.8353269487, the same as the Somers' D C|R reported by PROC FREQ (C|R because
the outcome y is the row variable and the predictor x is the column variable):

    proc freq data=test noprint;
      tables y*x / measures;
      output out=_measures measures;
    run;

    data _null_;
      set _measures;
      put _SMDCR_=;
    run;
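
Why Somers' D equals 2*(AUC-0.5) can be seen by counting pairs directly. This is a minimal sketch on a made-up six-row data set; `toy`, `pairs` and the column names are hypothetical. It compares every event with every non-event, so it is only practical for small data.

```sas
/* Hypothetical toy data: 3 events (y=1), 3 non-events (y=0), no tied scores */
data toy;
  input y score @@;
  datalines;
1 0.9  1 0.8  0 0.7  1 0.6  0 0.4  0 0.2
;
run;

/* Cross every event (alias a) with every non-event (alias b) and classify each pair */
proc sql;
  create table pairs as
  select sum(a.score > b.score) as concordant,
         sum(a.score < b.score) as discordant,
         sum(a.score = b.score) as tied,
         count(*)               as n_pairs
  from toy(where=(y=1)) as a, toy(where=(y=0)) as b;
quit;

data _null_;
  set pairs;
  SomersD = (concordant - discordant)/n_pairs;   /* Gini */
  AUC     = (concordant + 0.5*tied)/n_pairs;     /* ties count one half */
  put concordant= discordant= tied= SomersD= AUC=;
run;
```

Here 8 pairs are concordant and 1 is discordant out of 9, so Somers' D = 7/9 and AUC = 8/9, and indeed 2*(8/9 - 0.5) = 7/9.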

Then why not just use PROC FREQ, since the code is so simple? The answer is speed.
Check the log below for a data set with only 100,000 observations: 37.63 seconds vs. 0.15 seconds in real time.

    data one;
      call streaminit(98676876);
      do id=1 to 1e5;
        score=ranuni(0)*1000;
        if score+rannor(0)>0 then y=1;
        else y=0;
        output;
        drop id;
      end;
    run;

    NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.04 seconds
          cpu time            0.04 seconds

    proc freq data=one noprint;
      tables score*y/measures noprint;
      output out=_freq_out measures;
    run;

    NOTE: There were 100000 observations read from the data set WORK.ONE.
    NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
    NOTE: PROCEDURE FREQ used (Total process time):
          real time           37.63 seconds
          cpu time            37.56 seconds

    data _null_;
      set _freq_out;
      AUC=_smdrc_/2 + 0.5;
      put "AUC = " AUC "SOMER'S D R|C = " _smdrc_;
    run;

    AUC = 0.9995285252 SOMER'S D R|C = 0.9990570504
    NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
    NOTE: DATA statement used (Total process time):
          real time           0.00 seconds
          cpu time            0.00 seconds

    %AUC(one, y, score);

    NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.
    NOTE: There were 100000 observations read from the data set WORK.ONE.
          WHERE y not = .;
    NOTE: PROCEDURE NPAR1WAY used (Total process time):
          real time           0.10 seconds
          cpu time            0.09 seconds

    AUC=0.9995285252 Gini=0.9990570504
    NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
    NOTE: The data set WORK.AUC has 1 observations and 2 variables.
    NOTE: DATA statement used (Total process time):
          real time           0.01 seconds
          cpu time            0.00 seconds

Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY

Charlie Shipp Family said...
Looks great!

Charlie Shipp

11:25 PM, February 27, 2010

eskimokitty said...
It is awesome. Thank you so much!!
4:37 PM, June 08, 2011

raspcompote said...
Thank you for this useful post.

I'd like to know where you found this relationship between AUC and the Wilcoxon Rank Sum Test. I'm trying
to study more about it and it would really help!

Thanks

3:47 PM, March 13, 2012

http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U

7:37 PM, February 13, 2014

Jon Dickens said...
You are confusing the Gini with the accuracy ratio, but you are not alone; several people at SAS make the
same mistake.

If you are interested in discussing this issue, then contact me via LinkedIn.

Jon Dickens
