SAS Programming for Data Mining: AUC calculation using Wilcoxon Rank Sum Test

4/28/15, 7:02 PM

Friday, October 23, 2009

AUC calculation using Wilcoxon Rank Sum Test


Accurately calculating AUC (Area Under the Curve) in SAS for data rank-ordered by a binary classifier
Suppose a SAS data set has already been rank-ordered by a binary classifier (such as linear logistic regression), so that we have the binary outcome Y and a rank-order measurement P_0 or P_1 (for class 0 and class 1, respectively). To calculate the AUC for such data, we can use PROC NPAR1WAY to obtain the Wilcoxon rank-sum statistics and derive an accurate AUC from them.

The relationship between AUC and the Wilcoxon rank-sum test statistics is: AUC = (W-W0)/(N1*N0)+0.5, where N1 and N0 are the frequencies of class 1 and class 0, W is the Wilcoxon rank sum, and W0 is the expected sum of ranks under H0 (random ordering).
In one application example, PROC LOGISTIC reports c=0.911960, while this method calculates AUC=0.9119491555.
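
The identity follows from the Mann-Whitney U statistic. Below is a short derivation sketch. It assumes that W is the rank sum of class 1, that higher scores indicate class 1, and that tied scores receive midranks; the post does not fix these conventions. R_i (the pooled-sample rank of observation i), N = N1 + N0, and U are symbols introduced here for illustration.

```latex
\begin{aligned}
W &= \sum_{i:\,Y_i = 1} R_i, \qquad
W_0 = \mathbb{E}_{H_0}[W] = \frac{N_1 (N + 1)}{2}, \\
U &= W - \frac{N_1 (N_1 + 1)}{2}, \qquad
\mathrm{AUC} = \frac{U}{N_1 N_0}, \\
W - W_0 &= U + \frac{N_1 (N_1 + 1)}{2} - \frac{N_1 (N_1 + N_0 + 1)}{2}
         = U - \frac{N_1 N_0}{2}, \\
\frac{W - W_0}{N_1 N_0} + \frac{1}{2} &= \frac{U}{N_1 N_0} = \mathrm{AUC}.
\end{aligned}
```

Here U counts the (class 1, class 0) pairs in which the class-1 observation has the higher score, with tied pairs counted as one half, so U/(N1*N0) is the fraction of correctly ordered pairs.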

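%macro AUC(dsn, Target, score);
   /* Capture the Wilcoxon scores table and suppress printed output */
   ods select none;
   ods output WilcoxonScores=WilcoxonScore;
   proc npar1way wilcoxon data=&dsn;
      where &Target ^= .;
      class &Target;
      var &score;
   run;
   ods select all;

   /* WilcoxonScore holds one row per class of &Target */
   data AUC;
      set WilcoxonScore end=eof;
      retain v1 v2 1;
      /* |W - W0| is the same for both classes; abs() means AUC is reported as >= 0.5 */
      if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
      /* running product of the class sizes, N1*N0 */
      v2=N*v2;
      if eof then do;
         d=v1/v2;
         Gini=d*2;
         AUC=d+0.5;
         put AUC= Gini=;
         keep AUC Gini;
         output;
      end;
   run;
%mend;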
/* Simulated test data: y depends on x plus noise */
data test;
   do i = 1 to 10000;
      x = ranuni(1);
      y = (x + rannor(2315)*0.2 > 0.35);
      output;
   end;
run;

/* Fit the model, capture the Association table, and score the data */
ods select none;
ods output Association=Asso;
proc logistic data=test desc;
   model y = x;
   score data=test out=predicted;
run;
ods select all;

/* Print the c-statistic reported by PROC LOGISTIC */
data _null_;
   set Asso;
   if Label2='c' then put 'c-stat=' nValue2;
run;

/* AUC from the Wilcoxon rank sums, using the predicted probability of class 0 */
%AUC(predicted, y, p_0);

PROC NPAR1WAY gives AUC = 0.91766634744, while PROC LOGISTIC reports c-statistic = 0.917659.
So, which one is more accurate? I would say NPAR1WAY. The reason is that we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.83533269487, the same as the Somers' D C|R reported by PROC FREQ (y is the row variable and the predictor x is the column variable):

proc freq data=test noprint;
   tables y*x / measures;
   output out=_measures measures;
run;
data _null_;
   set _measures;
   put _SMDCR_=;
run;
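
The agreement between the two procedures is expected. A short sketch of why, where C, D and T (names introduced here for illustration) count the concordant, discordant and score-tied pairs among the N1*N0 pairs formed from one class-1 and one class-0 observation, and Somers' D is taken over exactly those pairs:

```latex
\begin{aligned}
C + D + T &= N_1 N_0, \\
\mathrm{AUC} &= \frac{C + T/2}{N_1 N_0}, \\
\text{Somers' } D &= \frac{C - D}{N_1 N_0}
  = \frac{2C + T - (C + D + T)}{N_1 N_0}
  = 2\,\mathrm{AUC} - 1 .
\end{aligned}
```

With the AUC from NPAR1WAY above, 2*0.91766634744 - 1 = 0.83533269488.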

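Then why not just use PROC FREQ, since the coding is so simple? Well, the answer is really about the SPEED!
Check the log below for a data set with only 100,000 observations: 37.63 seconds for PROC FREQ vs. about 0.1 seconds for the %AUC macro (0.10 seconds in PROC NPAR1WAY plus 0.01 seconds in the DATA step) in real time: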

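data one;
   /* Note: STREAMINIT seeds the RAND function only; ranuni(0) and rannor(0)
      take their seed from the system clock, so results differ from run to run */
   call streaminit(98676876);
   do id=1 to 1e5;
      score=ranuni(0)*1000;
      if score+rannor(0)>0 then y=1;
      else y=0;
      output;
      drop id;
   end;
run;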

NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

proc freq data=one noprint;
   tables score*y / measures noprint;
   output out=_freq_out measures;
run;

NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           37.63 seconds
      cpu time            37.56 seconds

data _null_;
   set _freq_out;
   AUC=_smdrc_/2 + 0.5;
   put "AUC = " AUC "   SOMER'S D R|C = " _smdrc_;
run;

AUC = 0.9995285252    SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

%AUC(one, y, score);

NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.

NOTE: There were 100000 observations read from the data set WORK.ONE.
      WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
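
As a quick hand check of the macro, the sketch below runs it on a tiny data set and compares it with a brute-force count over all (class 1, class 0) pairs. The data set name toy, its five illustrative rows, and the column name AUC_pairs are assumptions made for this example and are not part of the original post. By hand, 5 of the 6 pairs are correctly ordered, so both approaches should give AUC = 5/6 (about 0.8333) and Gini = 2/3.

```sas
/* Toy data (illustrative values): 2 non-events and 3 events, no tied scores */
data toy;
   input y score;
   datalines;
0 0.10
0 0.40
1 0.35
1 0.80
1 0.90
;
run;

/* Brute force: share of (event, non-event) pairs in which the event
   has the higher score; tied pairs count as one half */
proc sql;
   select mean((a.score > b.score) + 0.5*(a.score = b.score)) as AUC_pairs
   from toy(where=(y=1)) as a,
        toy(where=(y=0)) as b;
quit;

/* Same quantity from the Wilcoxon rank sums, using the macro defined in this post */
%AUC(toy, y, score);
```

The pairwise join grows with N1*N0, so it is only practical for small data; the rank-sum route needs just one sort of the scores.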

Posted by Liang Xie at 10/23/2009 02:09:00 PM



Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY

6 comments:
Charlie Shipp Family said...
Looks great .!.
Thanks for your work in sasCommunity.
Charlie Shipp
11:25 PM, February 27, 2010
eskimokitty said...
It is awesome. Thank you so much!!
4:37 PM, June 08, 2011
raspcompote said...
Thank you for this useful post.
3:47 PM, March 13, 2012

Luis Gustavo said...
I'd like to know where you found this relationship between AUC and the Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!
Thanks
10:32 AM, February 06, 2014

Liang Xie said...
The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U
7:37 PM, February 13, 2014
Jon Dickens said...
You are confusing the Gini with the accuracy ratio, but you are not alone: several people at SAS make the same mistake.
If you are interested in discussing this issue, then contact me via LinkedIn.
Jon Dickens
3:11 PM, July 20, 2014

Copyright (c) Liang Xie.

http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html
