Beruflich Dokumente
Kultur Dokumente
INTRODUCTION TO
D ATA S C I E N C E
E P I S O D E 3 : E X P L O R A T O R Y D A TA A N A LY S I S
TODAY’S MENU
1. E X P L O R AT O R Y
D ATA A N A LY S I S
2. V I S U A L I Z AT I O N
M E E T M R D ATA S C H E I N T I S T
TOOL #2:
A MOUTH
TOOL #1:
AN EYEBALL
E X P L O R AT O R Y A N A LY S I S
• The further you get from the source, the higher the risk
that you see artefacts of the pre-processing
NON-GRAPHICAL EDA
• Raw data:
p l
a y e r _ n a m e , h e i g h t , w e i g h t , p _ i d , i d , p l a y e r _ f i f a _ a p i _ i d
"Abel Tamata",182.88,170,240556,1031,202153,240556,"
A b
e l , 1 7 7 . 8 , 1 6 5 , 4 0 9 3 8 , 1 0 4 6 , 1 7 8 8 0 , 4 0 9 3 8 , " 2 0 1 0 - 0 8 - 3 0 0 0
"Abella Perez Damia",187.96,174,37422,1051,159580,37
" A
b i o l a D a u d a " , 1 8 0 . 3 4 , 1 6 5 , 1 1 4 5 0 3 , 1 0 7 6 , 1 8 7 1 7 5 , 1 1 4 5 0 3 ,
"Abou Diaby",193.04,168,27277,1092,163423,27277,"201
"Aboubacar Tandia",193.04,185,181344,1119,138675,181
"Aboubakar Kamara",177.8,168,581141,1122,225541,5811
• Observations:
– height in metric units [cm] but weight in imperial [lb]
– name can be 1–3 words (at least)
NON-GRAPHICAL EDA
• Raw data:
, 7
0 , r i g h t , , _ 0 , 6 0 , 5 0 , 6 0 , 7 4 , , 7 4 , , 5 3 , 6 2 , 7 3 , 7 4 , 7 0 , , 6 3 , , 6
3,right,,_0,49,64,65,54,58,58,46,44,27,64,69,54,66,5
: 0
0 " , 6 3 , 7 2 , l e f t , m e d i u m , m e d i u m , 4 2 , 5 2 , 3 9 , 6 4 , 4 0 , 7 2 , 6 7 , 4
",56,62,right,high,medium,54,37,48,62,31,56,35,39,56
3 ,
r i g h t , m e d i u m , m e d i u m , 5 7 , 5 3 , 5 8 , 6 9 , 5 7 , 5 9 , 6 9 , 6 8 , 6 6 , 6 4 ,
,right,,_0,56,34,70,70,,56,,26,71,60,69,68,,70,,72,,
",61,65,right,,_0,42,60,54,52,,51,,42,35,60,74,77,,5
,left,medium,medium,42,30,61,59,37,47,41,32,67,50,67
• Observations:
– height in metric units [cm] but weight in imperial [lb]
– name can be 1–3 words (at least)
– missing data
– what are these '_0'?
NON-GRAPHICAL EDA
GOOGLE SPREADSHEET
NON-GRAPHICAL EDA
• E.g.:
– choose between linear vs non-linear methods
– decide whether clustering would be helpful
S U M M A R Y S TAT I S T I C S
Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. "Let's practice what we preach: turning tables into
graphs." The American Statistician 56.2 (2002): 121-130.
W H Y TA B L E S A R E M U C H B E T T E R T H A N
G R A P H S ( A P R I L 1 S T, 2 0 0 9 )
• "And I recommend using Excel, which has some really nice defaults as well
as options such as those 3-D colored bar charts. "
(Andrew Gelman, "Why Tables are Really Much Better than Graphs", Journal of Computational
and Graphical Statistics, Volume 20, Number 1, Pages 3–7
Gelman, Andrew, Cristian Pasarica, and Rahul Dodhia. "Let's practice what we preach: turning tables into
graphs." The American Statistician 56.2 (2002): 121-130.
CHOOSING THE RIGHT KIND OF CHART
A D D I T I O N A L R E M A R K S ( T H A N K S T O S T U D E size
N T S=! )overall rating
ON WHY RED AND GREEN ARE A BAD CHOICE
1. C O L O R B L I N D N E S S
2. G R E E N V S R E D A L S O C A R R I E S T H E C O N N O TAT I O N
GOOD VS BAD FOR MOST PEOPLE
GRAPHICAL EDA P Y T H O N M AT P L O T L I B
+ SEABORN
df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)
examples on this page from: "seaborn: statistical data visualization", Michael Waskom
GRAPHICAL EDA P Y T H O N M AT P L O T L I B
+ SEABORN
df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=3)
examples on this page from: "seaborn: statistical data visualization", Michael Waskom
GRAPHICAL EDA
LOOKS LIKE
CLUSTERS
A B O U T S TAT I S T I C A L G R A P H I C S
• data-ink
"data–ink ratio =
total ink used to print the graphic
= proportion of a graphic's ink devoted to the
non-redundant display of data information "
T H E G O O D , T H E B A D , A N D T H E U G LY
#1
• Apparently "cosmetic" things such as line weight and
decorations (borders, labels, etc) can make all the difference
D ATA - I N K R AT I O ?
examples on this page from: Introduction to R Graphics with ggplot2, Harvard University
T H E G O O D , T H E B A D , A N D T H E U G LY
#1
• Apparently "cosmetic" things such as line weight and
decorations (borders, labels, etc) can make all the difference
examples on this page from: Introduction to R Graphics with ggplot2, Harvard University
T H E G O O D , T H E B A D , A N D T H E U G LY
#2 & #3
• Even though you can produce a really cool visualization, it's not
necessarily the best way to show the data or convey the idea
examples on this page from: "The most misleading charts of 2015, fixed", qz.com
T H E G O O D , T H E B A D , A N D T H E U G LY
#6
• 16 books is much better than five books, right?
1.
Representation (length, area, ...) should be directly
proportional to the number
examples on this page from: "The most misleading charts of 2015, fixed", qz.com
T H E G O O D , T H E B A D , A N D T H E U G LY
#6
• 16 books is much better than five books, right?
examples on this page from: "The most misleading charts of 2015, fixed", qz.com
T H E G O O D , T H E B A D , A N D T H E U G LY
#7
TUFTE, 1983: