SAS Programming For Data Mining: AUC Calculation Using Wilcoxon Rank Sum Test

SAS Programming for Data Mining

Copyright 2006-2014 / SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Friday, October 23, 2009
AUC calculation using Wilcoxon Rank Sum Test

Accurately calculate AUC (Area Under the Curve) in SAS for data rank-ordered by a binary classifier
To calculate AUC for a SAS data set that has already been rank-ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and a rank-order score P_0 or P_1 (the predicted probability of class 0 or class 1, respectively), we can use PROC NPAR1WAY to obtain the Wilcoxon rank-sum statistics and derive an accurate AUC from them.
The relationship between AUC and the Wilcoxon rank-sum statistic is: AUC = (W-W0)/(N1*N0)+0.5, where N1 and N0 are the numbers of observations in class 1 and class 0, W is the Wilcoxon rank sum of one class, and W0 is the expected rank sum of that class under H0 (random ordering). The macro below takes the absolute value |W0-W|, so the AUC it reports is always at least 0.5.
In one application example, PROC LOGISTIC reported c=0.911960, while this method gave AUC=0.9119491555.
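The identity can be traced back to the Mann-Whitney U statistic. The derivation below is a sketch under stated assumptions: W is the rank sum of class 1, scores are ranked in ascending order with midranks for ties, a higher score points to class 1, and N = N1 + N0 is a symbol introduced here for convenience.

```latex
\begin{aligned}
U_1 &= W - \frac{N_1 (N_1 + 1)}{2}, \qquad
W_0 = \frac{N_1 (N + 1)}{2} = \frac{N_1 (N_1 + N_0 + 1)}{2}, \\
W - W_0 &= U_1 - \frac{N_1 N_0}{2}, \\
\mathrm{AUC} &= \frac{U_1}{N_1 N_0} = \frac{W - W_0}{N_1 N_0} + \frac{1}{2}.
\end{aligned}
```

If the score ranks class 0 higher instead (for example P_0), W - W0 changes sign and the same expression yields 1 - AUC.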
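/* AUC from the Wilcoxon rank-sum statistic reported by PROC NPAR1WAY.
   Source: http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html */
%macro AUC(dsn, Target, score);
   /* Capture the Wilcoxon scores table without printing any output */
   ods select none;
   ods output WilcoxonScores=WilcoxonScore;
   proc npar1way wilcoxon data=&dsn;
      where &Target ^= .;
      class &Target;
      var &score;
   run;
   ods select all;

   /* WilcoxonScore has one row per class: N, SumOfScores, ExpectedSum, ... */
   data AUC;
      set WilcoxonScore end=eof;
      retain v1 v2 1;
      /* |W0 - W| is the same for both classes, so the first row suffices */
      if _n_ = 1 then v1 = abs(ExpectedSum - SumOfScores);
      v2 = N * v2;                 /* accumulates N1*N0 over the two rows */
      if eof then do;
         d    = v1 / v2;
         Gini = d * 2;
         AUC  = d + 0.5;           /* abs() above forces AUC >= 0.5 */
         put AUC= Gini=;
         output;
      end;
      keep AUC Gini;
   run;
%mend AUC;

/* Simulated test data: y depends on x plus noise */
data test;
   do i = 1 to 10000;
      x = ranuni(1);
      y = (x + rannor(2315)*0.2 > 0.35);
      output;
   end;
run;

/* Fit a logistic model, keep the Association table (it contains the
   c-statistic), and score the data to obtain P_0 and P_1 */
ods select none;
ods output Association=Asso;
proc logistic data=test desc;
   model y = x;
   score data=test out=predicted;
run;
ods select all;

/* Write the c-statistic from PROC LOGISTIC to the log */
data _null_;
   set Asso;
   if Label2 = 'c' then put 'c-stat=' nValue2;
run;

%AUC(predicted, y, p_0);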
NPAR1WAY gets AUC = 0.91766634744;
LOGISTIC reports c-statistic = 0.917659.
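As an additional cross-check, the rank-sum identity can be evaluated by hand. The sketch below is not part of the original post. It assumes the data set predicted produced by the PROC LOGISTIC step, an outcome y coded 0/1, a score P_1 with no missing values, and the illustrative names _ranked, r, w1, n1, and n0.

```sas
/* Midranks of the score over all non-missing outcomes
   (tied scores receive the mean rank) */
proc rank data=predicted(where=(y ^= .)) out=_ranked ties=mean;
   var p_1;
   ranks r;
run;

/* AUC = (W1 - N1*(N1+1)/2) / (N1*N0), where W1 is the rank sum of class 1.
   In PROC SQL a comparison such as (y = 1) evaluates to 1 or 0. */
proc sql;
   select (w1 - n1*(n1 + 1)/2) / (n1*n0) as AUC format=12.10
   from (select sum(r*(y = 1)) as w1,
                sum(y = 1)     as n1,
                sum(y = 0)     as n0
         from _ranked);
quit;
```

Because both computations rest on the same midranks, the value printed here should agree with the result of %AUC up to floating-point error. Unlike the macro, this version keeps the direction of the score, so ranking on P_0 instead of P_1 would give 1 - AUC.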
So, which one is more accurate? I would say NPAR1WAY. The reason is that we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.8353269487, the same as the Somers' D C|R (since the column variable is the predictor) reported by PROC FREQ:
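/* Somers' D C|R: rows are the outcome y, columns are the predictor x */
proc freq data=test noprint;
   tables y*x / measures;
   output out=_measures measures;
run;

/* Write Somers' D C|R (the Gini value) to the log */
data _null_;
   set _measures;
   put _SMDCR_=;
run;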
Then why not just use PROC FREQ, since the coding is so simple? The answer is speed. Check the log below for a data set with only 100,000 observations: 37.63 sec vs. 0.15 sec in real time:
data one;
   call streaminit(98676876);
   do id=1 to 1e5;
      score=ranuni(0)*1000;
      if score+rannor(0)>0 then y=1;
      else y=0;
      output;
      drop id;
   end;
run;
NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

proc freq data=one noprint;
   tables score*y/measures noprint;
   output out=_freq_out measures;
run;
NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           37.63 seconds
      cpu time            37.56 seconds

data _null_;
   set _freq_out;
   AUC=_smdrc_/2 + 0.5;
   put "AUC = " AUC "   SOMER'S D R|C = " _smdrc_;
run;
AUC = 0.9995285252   SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
      real time           0.00 seconds
      cpu time            0.00 seconds

%AUC(one, y, score);
NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.
NOTE: There were 100000 observations read from the data set WORK.ONE.
      WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds
AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
      real time           0.01 seconds
      cpu time            0.00 seconds

Posted by Liang Xie at 10/23/2009 02:09:00 PM

Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY

6 comments:

Charlie Shipp Family said...
Looks great .!.
Thanks for your work in sasCommunity.
Charlie Shipp
11:25 PM, February 27, 2010

eskimokitty said...
It is awesome. Thank you so much!!
4:37 PM, June 08, 2011

raspcompote said...
Thank you for this useful post.
3:47 PM, March 13, 2012

Luis Gustavo said...
I'd like to know where did you find this relationship between AUC and Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!
Thanks
10:32 AM, February 06, 2014

Liang Xie said...
The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U
7:37 PM, February 13, 2014

Jon Dickens said...
You are confusing the Gini with the accuracy ratio but you are not alone several people at SAS make the same mistake.
if you are interested in discussing this issue, then contact me via linkedin.
Jon Dickens
3:11 PM, July 20, 2014
Copyright (c) Liang Xie.