
Data Science and Big Data Analytics

Day 4

Training Schedule: Day 4


Time        Details
0900-1000   Principles of analytic graphics & making exploratory graphs
1000-1040   R Plotting Systems
1040-1100   Morning Break
1100-1300   Base Plot, Lattice Plot and ggplot2
1300-1400   Lunch
1400-1500   Laboratory exercise
1500-1540   Laboratory exercise
1540-1600   Afternoon Break
1600-1700   Laboratory exercise

Plotting Systems
Today focuses on plotting graphs using 3 different
systems in R:
{base}
{lattice}
{ggplot2}

Learning Objectives
By the end of today's training, the
following should be covered:
Principles of analytic graphics & making
exploratory graphs
Plotting systems in R
The {base} & {lattice} plotting systems in R
Graphics devices in R

Principle of Analytic Graphics


The basis of Analytic Graphics
is NOT to simply plot uni- or
multi-variate graphs or make
beautiful graphics using
fanciful tools that you have.
It is to describe and document
your data in order for
appropriate analytics to be
applied to it.

Exploratory Graphics
With exploratory graphics, the goal is to assist
one's understanding of the data that is
available.
The graphics are usually informal
in nature and very plain (not fancy).
Axes and labelling may be
optional (but are still highly
recommended).

Plotting systems in R
There are 3 common plotting systems
available in R.
{base} which is in the default R installation. We
will use this for the exercise today.
{lattice} which is loaded using the
library(lattice) command.
{ggplot2} which is loaded using the
library(ggplot2) command.
We will explore this in more detail next week.

{base} graphics system


Common main plotting functions.
hist(), barplot(), boxplot(),
plot()

They are usually annotated with separate
functions. The following functions are typically
used to add elements to a plot.
points(), lines(), text()

Some Important Base Graphics


Parameters
Many base plotting functions share a set of parameters.
Here are a few key ones:
pch: the plotting symbol (default is open circle)
lty: the line type (default is solid line), can be dashed,
dotted, etc.
lwd: the line width, specified as a positive number (a multiple of the default width of 1)
col: the plotting color, specified as a number, string, or
hex code; the colors() function gives you a vector of
colors by name
xlab: character string for the x-axis label
ylab: character string for the y-axis label
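
As a minimal sketch of how these parameters combine, the following uses the built-in cars dataset. The dataset and the specific values chosen are illustrative assumptions, not taken from the slides.

```r
library(datasets)

# Scatterplot of the built-in 'cars' data with explicit symbol, colour and labels.
plot(cars$speed, cars$dist,
     pch  = 17,            # filled triangle instead of the default open circle
     col  = "steelblue",   # colour by name; colors() lists the valid names
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")

# Overlay a fitted regression line to show the line parameters.
fit <- lm(dist ~ speed, data = cars)
abline(fit,
       lty = "dotted",     # line type
       lwd = 2,            # twice the default line width
       col = "#CC0000")    # colour by hex code
```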

Some Important Base Graphics


Parameters
The par() function is used to specify global graphics
parameters that affect all plots in an R session. These
parameters can be overridden when specified as arguments
to specific plotting functions.
las: the orientation of the axis labels on the plot
bg: the background color
mar: the margin size
oma: the outer margin size (default is 0 for all sides)
mfrow: panel layout given as c(rows, columns) (plots are filled
row-wise)
mfcol: panel layout given as c(rows, columns) (plots are filled
column-wise)
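
A short sketch of setting global parameters with par() and restoring them afterwards. The mtcars dataset and the particular values are assumptions chosen for illustration.

```r
library(datasets)

# par() returns the previous values of the settings it changes,
# so they can be restored once the plots are done.
old_par <- par(mfrow = c(1, 2),       # 1 row, 2 columns, filled row-wise
               mar   = c(4, 4, 2, 1), # bottom, left, top, right margins (in lines)
               las   = 1)             # axis tick labels always horizontal

hist(mtcars$mpg, main = "mpg", xlab = "Miles per gallon")

# 'las' passed to plot() overrides the global las = 1 for this plot only.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "mpg", las = 0)

current_layout <- par("mfrow")        # layout while the panels were active
par(old_par)                          # restore the previous settings
```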

Plots

Basic Histogram

Basic Boxplot

Line graph with regression

Scatterplot

Basic Histogram
(Figure: histogram annotated with breaks, main, xlab, ylab, xlim and ylim.)

library(datasets)
hist(warpbreaks$breaks,
     breaks = 20,
     xlab = "Breaks",
     main = "Number of Breaks in Yarn during Weaving",
     ylim = c(0, 20))

Basic Line Plot
(Figure: line plot annotated with type, lwd, col, abline(fit1) and text().)

sglong <- read.table("sg_long_density.txt", header = TRUE)
plot(sglong$Year,
     sglong$PopSize,
     type = "l",
     col = "red",
     lwd = 3,
     xlab = "Year",
     ylab = "Population Size",
     main = "Population Size around Sg Long 1971-2001")
# Fit a linear regression and overlay it as a dashed line
fit1 <- lm(PopSize ~ Year, data = sglong)
abline(fit1, lty = "dashed")
text(x = 1976, y = 750, labels = "R2=0.896\n P=2615e-15")

Basic Scatterplot
(Figure: scatterplot annotated with col, pch, cex and legend().)

library(datasets)
plot(iris$Sepal.Length,
     iris$Petal.Length,
     col = iris$Species,
     pch = 16,
     cex = 0.5,
     xlab = "Sepal Length",
     ylab = "Petal Length",
     main = "Flower Characteristics in Iris")
legend(x = 4.2, y = 7,
       legend = levels(iris$Species),
       col = c(1:3), pch = 16)

Basic Boxplot
library(datasets)
boxplot(iris$Sepal.Length ~ iris$Species,
        ylab = "Sepal Length",
        xlab = "Species",
        main = "Sepal Length by Species in Iris")

{base} graphics system: par()

Panel plotting layouts:
1 by 2
2 by 2
3 by 1

Set with par() and its mfrow parameter, e.g. par(mfrow = c(2, 2)).

Exercise

col and pch


R colour chart:
http://research.stowersinstitute.org/efg/R/Color/Chart/ColorChart.pdf

Line types

par() margins

par(mar = c(3, 4, 3, 4), oma = c(3, 4, 3, 4))

Exercise (plot disp ~ mpg)


{base} graphics system: output

Demonstration of the various common
formats
PDF: pdf()
PNG: png()
JPEG: jpeg()
Screen (on MS-Windows systems, it's called
windows())
dev.copy()
dev.copy2pdf()
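
A minimal sketch of writing a plot to a file device. The file name and the use of tempdir() are assumptions for illustration; png() and jpeg() follow the same open, plot, close pattern but take their width and height in pixels by default.

```r
library(datasets)

# Hypothetical output location; any writable path works.
out_file <- file.path(tempdir(), "mpg_hist.pdf")

pdf(file = out_file, width = 7, height = 7)  # open the file device (size in inches)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
dev.off()                                    # close the device so the file is written
```

When a plot is already on the screen device, dev.copy(png, "plot.png") followed by dev.off() copies it to a file instead.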

Graphics Device
Computer Screen

NOT
Input (such as Mouse &
Keyboard)
File System
Network Connection

Bitmap vs Vector Graphics

Bitmap
BMP
JPG / JPEG
TIFF
GIF
PNG
Better for plots with many points, such as scatter
plots and density plots.

Vector
PS
EPS
SVG
Good for resizing and scaled plots.

Exercise - Download files

Start an R script in RStudio and code the following in R.
Download the datasets from http://data.gov.my/
Rain, Wind, Humidity and Radiation
January 2008 - April 2015
http://data.gov.my/folders/MOSTI/TaburanHujanKelajuanAnginKelembapanRadiasiGlobal.xls
Air Pollutant Index (API)
August 2013 - February 2015
http://data.gov.my/folders/MOSTI/AirPollutentIndex.xlsx

Exercise - Subsetting
Use the Sandakan data.
For the Rain, Wind, Humidity and Radiation data,
we need
Rows 12 to 2688 & Columns 2 to 6.
To clean the Rain data, transform "trace" to 0
and also negative numbers to 0.

For the Air Pollutant Index, we need
Columns 3 to 6
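
The cleaning step could be sketched as follows. The small data frame stands in for the real spreadsheet, and both the column names and the exact spelling of the trace marker are assumptions.

```r
# Hypothetical stand-in for the Rain column as read from the spreadsheet:
# the values arrive as text because of the "trace" entries.
weather <- data.frame(Day  = 1:5,
                      Rain = c("0.2", "trace", "-1.1", "12.4", "0"),
                      stringsAsFactors = FALSE)

rain <- weather$Rain
rain[tolower(trimws(rain)) == "trace"] <- "0"  # trace amounts become 0
rain <- as.numeric(rain)                       # text to numbers
rain[!is.na(rain) & rain < 0] <- 0             # negative readings become 0
weather$Rain <- rain

# Row/column subsetting uses the same bracket syntax, e.g. raw[12:2688, 2:6]
# for a data frame 'raw' holding the full sheet (the name is hypothetical).
```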

Exercise - aggregate()
Note that the Air Pollutant Index file captures
data by the hour while the Rain, Wind,
Humidity and Radiation data is by day.
Use aggregate() to compute the mean
API per day and save the API dataset by day
instead of by hour.
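
A minimal sketch of the hourly-to-daily aggregation, using made-up readings. The column names Date, Hour and API are assumptions, not the names in the downloaded file.

```r
# Hypothetical hourly readings for two days.
api_hourly <- data.frame(
  Date = as.Date(rep(c("2013-08-01", "2013-08-02"), each = 3)),
  Hour = rep(c(0, 1, 2), times = 2),
  API  = c(40, 50, 60, 70, 80, 90)
)

# One row per Date, with API replaced by its daily mean.
api_daily <- aggregate(API ~ Date, data = api_hourly, FUN = mean)
```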

Exercise - paste()
Note that the date in the Rain, Wind,
Humidity and Radiation dataset is stored in
three columns. You will need to combine them to
form a date.
Use paste() with "-" as the separator so
that it will be in the format %Y-%m-%d.
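
For instance, with hypothetical Year, Month and Day columns (the real column names may differ), the combination could look like this:

```r
# Hypothetical date parts stored in three separate columns.
weather <- data.frame(Year = c(2013, 2013), Month = c(8, 12), Day = c(1, 25))

# Join the parts with "-" and convert the text to a Date.
date_text    <- paste(weather$Year, weather$Month, weather$Day, sep = "-")
weather$Date <- as.Date(date_text, format = "%Y-%m-%d")
```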

Exercise - Sub-setting by date

The two datasets have different date ranges,
but they overlap for the period
August 2013 - January 2015.

You can use either logical date comparison or
the %in% operator to subset both datasets.
After sub-setting by date, both datasets should
contain exactly 549 observations.
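
Both approaches are sketched below on a made-up daily series. The window of 1 August 2013 to 31 January 2015 is an assumption, chosen because it yields the 549 days stated above.

```r
# Hypothetical daily series that is longer than the common window.
all_days <- data.frame(
  Date = seq(as.Date("2013-01-01"), as.Date("2015-04-30"), by = "day")
)

start <- as.Date("2013-08-01")
end   <- as.Date("2015-01-31")

# Option 1: logical date comparison.
in_window  <- all_days$Date >= start & all_days$Date <= end
by_compare <- all_days[in_window, , drop = FALSE]

# Option 2: %in% against the set of wanted dates.
common_dates <- seq(start, end, by = "day")
by_in        <- all_days[all_days$Date %in% common_dates, , drop = FALSE]
```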

Exercise - Select the graphics device

You can either use png(), or
plot the graph on the default graphics device
and use dev.copy().
The size of the output plot should be set to 480 x 480.

Exercise - hist()
Use the hist()
function.
The graph shows that
most days (the dataset
should have about 549
observations, so
approximately 450 out
of 549 days) have
between 0 and 20 mm of
rainfall.

Exercise - plot()
Use the plot() function.
There is a slight issue
with the dataset: the
wind measurement
data entry seems to
use a different scale
around September
2014.

Exercise - merge()
Merge both datasets into one using
merge().
Use the date as the join key.

After the merge you may want to remove the
redundant Year, Month and Day columns from the
dataset.
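
A minimal sketch of the merge on two tiny made-up data frames. All column names and values are assumptions.

```r
# Hypothetical daily weather and daily API tables sharing a Date column.
weather <- data.frame(Year = 2013, Month = 8, Day = 1:3,
                      Date = as.Date("2013-08-01") + 0:2,
                      Rain = c(0, 5.2, 1.1))
api     <- data.frame(Date = as.Date("2013-08-01") + 1:3,
                      API  = c(55, 61, 48))

# merge() keeps only the dates present in both tables by default (inner join).
merged <- merge(weather, api, by = "Date")

# Drop the now-redundant date parts.
merged <- merged[, !(names(merged) %in% c("Year", "Month", "Day"))]
```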

Exercise - Multiple lines using lines()
Use the plot() function
and then add the
additional lines using the
lines() function.
Use the legend()
function to add
the appropriate legend.
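
The pattern of plot(), then lines(), then legend() could look like the sketch below. The series are simulated and the variable names are assumptions; the exercise itself uses the merged weather data.

```r
set.seed(42)

# Simulated daily values standing in for two columns of the real dataset.
days     <- seq(as.Date("2014-01-01"), by = "day", length.out = 30)
humidity <- runif(30, 60, 95)
wind     <- runif(30, 0, 30)

# Set ylim from both series so the second line is not clipped.
y_range <- range(c(humidity, wind))

plot(days, humidity, type = "l", col = "blue", ylim = y_range,
     xlab = "Date", ylab = "Value", main = "Humidity and Wind")
lines(days, wind, col = "darkgreen", lty = "dashed")   # add the second series
legend("topright", legend = c("Humidity", "Wind"),
       col = c("blue", "darkgreen"), lty = c("solid", "dashed"))
```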

Exercise - Multiple panels using par()
Use par() with the
mfrow parameter, e.g. par(mfrow = c(2, 2)),
to achieve the 2 x 2
panel graphics.
The rest of the graphics
are the same as in the
previous exercises (except the
top-right plot() for
rainfall data).

{LATTICE} GRAPHICS SYSTEM

Lattice Function
Lattice functions generally take a formula for their first
argument, usually of the form
xyplot(y ~ x | f * g, data)
We use the formula notation here, hence the ~.
On the left of the ~ is the y-axis variable, on the right is the
x-axis variable
f and g are conditioning variables; they are optional
the * indicates an interaction between two variables

The second argument is the data frame or list from which
the variables in the formula should be looked up.
If no other arguments are passed, there are defaults that can be
used.
If no data frame or list is passed, then the parent frame is used.
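
A small sketch of the y ~ x | f * g form with two conditioning variables. The choice of mtcars and of cyl and am as the conditioning factors is an illustrative assumption.

```r
library(lattice)
library(datasets)

# mpg against weight, one panel per combination of cylinders and transmission.
p <- xyplot(mpg ~ wt | factor(cyl) * factor(am),
            data = mtcars,
            xlab = "Weight (1000 lbs)",
            ylab = "Miles per gallon")
print(p)   # lattice plots are drawn when printed
```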

{lattice} graphics system


The lattice add-on package
(library(lattice)) is an
implementation of Trellis graphics for R.
It is a powerful and elegant high-level data
visualization system, with an emphasis on
multivariate data

The class() of a lattice plot is "trellis":

> class(xyplot(weight ~ Time | Diet, BodyWeight))
[1] "trellis"

Type of plots

Function          Default Display
histogram()       Histogram
densityplot()     Kernel Density Plot
qqmath()          Theoretical Quantile Plot
qq()              Two-sample Quantile Plot
stripplot()       Stripchart (Comparative 1-D Scatter Plots)
bwplot()          Comparative Box-and-Whisker Plots
dotplot()         Cleveland Dot Plot
barchart()        Bar Plot
xyplot()          Scatter Plot
splom()           Scatter-Plot Matrix
contourplot()     Contour Plot of Surfaces
levelplot()       False Color Level Plot of Surfaces
wireframe()       Three-dimensional Perspective Plot of Surfaces
cloud()           Three-dimensional Scatter Plot
parallelplot()    Parallel Coordinates Plot (named parallel() in older lattice versions)

Type of Lattice plots

{lattice} Histogram example

library(lattice)
library(datasets)

# mtcars displacement by factor cylinder
histogram(~ disp | factor(cyl), data = mtcars,
          main = "Displacement by Cylinders",
          xlab = "Displacement (cu in)",
          col = "blue")

{lattice} Scatterplot example
library(nlme)
library(lattice)
xyplot(weight ~ Time | Diet, BodyWeight)

{lattice} Boxplot example

library(lattice)
library(datasets)

# boxplot of Sepal.Length by Species
bwplot(Sepal.Length ~ factor(Species), data = iris,
       xlab = "Species",
       col = "red",
       pch = 16,
       main = "Sepal.Length by Species")

{lattice} Cloud plot example
library(lattice)
library(datasets)
cloud(depth ~ lat * long,
      data = quakes,
      zlim = rev(range(quakes$depth)),
      screen = list(z = 105, x = -70),
      panel.aspect = 0.75,
      xlab = "Longitude",
      ylab = "Latitude",
      zlab = "Depth",
      main = "Lattice Cloud Plot",
      col = "red")

{lattice} example
library(nlme)
library(lattice)
p <- xyplot(weight ~ Time | Diet, BodyWeight)

If you assign the Trellis plot to a variable, it will
not produce any output until you print() it.
print(p)

{lattice} graphics system


You can use trellis.par.set() to control
the output.
To get a list of the default settings, use
trellis.par.get(). It returns a long
list of values.
par.ylab.text <- trellis.par.get("par.ylab.text")
par.ylab.text$col <- "red"
trellis.par.set("par.ylab.text", par.ylab.text)
print(p)

{GGPLOT2} GRAPHICS SYSTEM

{ggplot2} graphics system


Created by Hadley
Wickham.
Based on Leland
Wilkinson's The
Grammar of Graphics.
Widely used and often
regarded as one of the best R
packages for static
visualization.
Package: ggplot2.
The main function is
ggplot().

{ggplot2} introduction
install.packages("ggplot2")
library(ggplot2)
head(mtcars)

{ggplot2} plot layers

library(ggplot2)
g <- ggplot(data = mtcars, aes(x = hp, y = mpg))
print(g)

> ggplot(data = mtcars, aes(x = hp, y = mpg))
Error: No layers in plot
(This error comes from older ggplot2 releases; recent releases draw an empty panel instead.)

What is a geometric object (geom) in the
ggplot2 system?
A plotting object such as a point, line, or other shape.

{ggplot2} Scatterplots
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()

Exercise 1: Adding colours to the plot

Modify the
content of
aes() to include
colours and
produce the
following graph.

ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()

Exercise 2: Adding legend labels to the plot

Include the
appropriate
parameters
here (hint:
vectors)

+ scale_color_discrete(labels = )

Exercise 3: Re-labelling

labs(color = "Transmission", ? ? )

{ggplot2} Scatterplots
Add different themes to
your plots

theme_bw()
theme_light()
theme_minimal()
theme_classic()

Which one of the above
produces the graph
shown here?

{ggplot2} Scatterplots

ggplot(data = mtcars, aes(x = hp, y = mpg, color = factor(am),
                          alpha = wt, size = cyl)) +
  geom_point() +
  scale_color_discrete(labels = c("Automatic", "Manual")) +
  labs(color = "Transmission", x = "Horsepower", y = "Miles Per Gallon",
       alpha = "Weight", size = "Cylinders")

{ggplot2} Bar Charts


> head(diamonds)

Instead of geom_point() use geom_bar()


For the next few exercises, use the ? (help) to assist you.

Exercise 4: {ggplot2} Bar Charts

Using the diamonds
data and geom_bar(),
count the frequency of
the diamonds by
clarity.

Exercise 5: {ggplot2} Bar Charts

The bars can be
filled by another variable.
Use the diamond cut
as the fill.

Exercise 6: {ggplot2} Bar Charts

Instead of stacking the
cut types, place them
side by side in the bar
chart.
geom_bar() accepts
position = "dodge" for this.

{ggplot2} Line Charts


Typically used for visualizing how a continuous variable (on
the y-axis) changes in relation to another continuous variable
(on the x-axis).
Often the x variable represents time, but it may also represent some
other continuous quantity.

Also appropriate when the x variable is ordered (e.g., small,
medium, large), but not when it is unordered
(e.g., cow, goose, elephant).

In the ggplot2 system, the plot layer is
geom_line()

{ggplot2} Line Charts

set.seed(1)
time <- 1:10
income <- runif(10, 1000, 5000)
dt <- data.frame(time, income)

{ggplot2} Line Charts

Using the dataset
created on the previous slide,
plot the graph using
geom_line() as the
plot layer.

ggplot(dt, aes(x = time, y = income)) + geom_line()

{ggplot2} Line Charts

Use both
geom_line() and
geom_point() on
the data created to
produce the graph.

ggplot(dt, aes(x = time, y = income)) + geom_line() + geom_point()

{ggplot2} Line Charts

Separate the data into two
categories. Let's call them
high and low.
type <- c("low", "high", "low",
          "high", "low", "high", "low",
          "high", "low", "high")

Then plot it to obtain the
graph.
ggplot(dt, aes(x = time, y = income, colour = type)) +
  geom_line() + geom_point()

{ggplot2} geom_smooth()

set.seed(1)
time<-rep(1:5, each=2)
income <- runif(10,1000,5000)
dt <- data.frame(time,income,type)
ggplot(dt, aes(x = time, y = income, colour = type)) +
geom_smooth()

GGPLOT2 EXERCISE

data("msleep")

