Beruflich Dokumente
Kultur Dokumente
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
Save the Excel file as "input.xlsx". You should save it in the current working
directory of the R workspace.
BINARY DATA:
A binary file is a file that contains information stored only in form of bits and bytes.(0’s and 1’s).
R has two functions WriteBin() and readBin() to create and read binary files.
Syntax
writeBin(object, con)
readBin(con, what, n )
what is the mode like character, integer etc. representing the bytes to be read.
Example
We consider the R inbuilt data "mtcars". First we create a csv file from it and convert it to a binary file and store it as a OS file.
Next we read this binary file created into R.
Writing the Binary File
We read the data frame "mtcars" as a csv file and then write it as a binary file to the OS.
# Read the "mtcars" data frame as a csv file and store only the columns
"cyl", "am" and "gear".
write.table(mtcars, file = "mtcars.csv",row.names = FALSE, na = "",
col.names = TRUE, sep = ",")
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat", "wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat", "rb")
# Next read the column values. n = 18 as we have 3 column names and 15 values.
read.filename <- file("/web/com/binmtcars.dat", "rb")
bindata <- readBin(read.filename, integer(), n = 18)
# Read the values from 4th byte to 8th byte which represents "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Read the values form 9th byte to 13th byte which represents "am".
amdata = bindata[9:13]
print(amdata)
# Read the values form 9th byte to 13th byte which represents "gear".
geardata = bindata[14:18]
print(geardata)
When we execute the above code, it produces the following result and chart −
[1] 7108963 1728081249 7496037 6 6 4
[7] 6 8 1 1 1 0
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
XML FILES
XML is a file format which shares both the file format and the data on the World Wide Web, intranets, and elsewhere using standard ASCII text. It stands for Extensible
Markup Language (XML). S
2
Dan
515.2
9/23/2013
Operations
3
Michelle
611
11/15/2014
IT
4
Ryan
729
5/11/2014
HR
5
Gary
843.25
3/27/2015
Finance
6
Nina
578
5/21/2013
IT
7
Simon
632.8
7/30/2013
Operations
8
Guru
722.5
6/17/2014
Finance
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
Get Different Elements of a Node
# Load the packages required to read XML files.
library("XML")
library("methods")
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
print(json_data_frame)
When we execute the above code, it produces the following result −
id, name, salary, start_date, dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 NA Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Install R Packages
The following packages are required for processing the URL’s and links to the files. If they are not available in your R
Environment, you can install them using following commands.
install.packages("xlsx")
install.packages("rjson")
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and download the CSV files using R
for the year 2015.
getHTMLLinks() to gather the URLs of the files. Then we will use the
function download.file() to save the files to the local system.
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links, "JCMB_2015")]
RMySQL Package
R has a built-in package named "RMySQL" which provides native
connectivity between with MySql database. You can install this package
in the R environment using the following command.
install.packages("RMySQL")
Connecting R to MySql
Once the package is installed we create a connection object in R to
connect to the database. It takes the username, password, database
name and host name as input.
# Create a connection Object to MySQL database.
# We will connect to the sampel database named "sakila" that comes with
MySql installation.
After executing the above code we can see the table updated in the
MySql Environment.
Inserting Data into the Tables
dbSendQuery(mysqlconnection,
"insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt,
qsec, vs, am, gear, carb)
values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02,
0, 1, 4, 4)"
)
After executing the above code we can see the row inserted into the
table in the MySql Environment.
Creating Tables in MySql
We can create tables in the MySql using the function dbWriteTable(). It
overwrites the table if it already exists and takes a data frame as input.
# Create the connection object to the #database where we want to
create the table.
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '',
dbname = 'sakila',
host = 'localhost')
After executing the above code we can see the table created in the
MySql Environment.
Dropping Tables in MySql
We can drop the tables in MySql database passing the drop table
statement into the dbSendQuery() in the same way we used it for
querying data from tables.
dbSendQuery(mysqlconnection, 'drop table if exists mtcars')
After executing the above code we can see the table is dropped in the
MySql Environment.
piepercent<- round(100*x/sum(x), 1)
R Bar Charts:
A bar chart represents data in rectangular bars with length of the bar
proportional to the value of the variable. R uses the function barplot() to
create bar charts. R can draw both vertical and Horizontal bars in the
bar chart. In bar chart each of the bars can be given different colors.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Example
A simple bar chart is created using just the input vector and the name of
each bar.
The below script will create and save the bar chart in the current R
working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
Scatter Plot:
Scatterplots show many points plotted in the Cartesian plane. Each
point represents the values of two variables. One variable is chosen in
the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the tile of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create
a basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.
input <- mtcars[,c('wt','mpg')]
print(head(input))
Scatterplot Matrices
When we have more than two variables and we want to find the
correlation between one variable versus the remaining ones we use
scatterplot matrix. We use pairs() function to create matrices of
scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
Following is the description of the parameters used −
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be
taken.
Example
Each variable is paired up with each of the remaining variable. A
scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
pairs(~wt+mpg+disp+cyl,data = mtcars,
main = "Scatterplot Matrix")
“The simple graph has brought more information to the data analyst’s mind than any other device.”
how to visualise your data using ggplot2. R has several systems for making graphs,
ggplot2 is one of the most elegant and most versatile.
ggplot2 implements the grammar of graphics,
install.packages("tidyverse")
library(tidyverse)
>mpg
Creating a ggplot
To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Data
ggplot2 comes with a selection of built-in datasets that are used in examples to
illustrate various visualisation challenges.
Summarise()
The syntax of summarise() is basic and consistent with the
other verbs included in the dplyr library.
summarise(df, variable_name=condition)
arguments:
- `df`: Dataset used to construct the summary statistics
- `variable_name=condition`: Formula to create the new variable
Output:
## mean_run
## 1 19.20114
You can add as many variables as you want. You return the average games played and the average
sacrifice hits.
mean_SH = mean(SH, na.rm = TRUE): Summarize a second variable. You set na.rm = TRUE
because the column SH contains missing observations.
Output:
## mean_games mean_SH
## 1 51.98361 2.340085
Group_by vs no group_by
The function summerise() without group_by() does not make any sense. It creates summary statistic
by group. The library dplyr applies a function automatically to the group you passed inside the verb
group_by.
Note that, group_by works perfectly with all the other verbs (i.e. mutate(), filter(), arrange(), ...).
It is convenient to use the pipeline operator when you have more than one step. You can compute the
average homerun by baseball league.
data % > %
group_by(lgID) % > %
summarise(mean_run = mean(HR))
Code Explanation
Output:
##
# A tibble: 7 x 2
## lgID mean_run
## <fctr> <dbl>
## 1 AA 0.9166667
## 2 AL 3.1270988
## 3 FL 1.3131313
## 4 NL 2.8595953
## 5 PL 2.5789474
## 6 UA 0.6216216
## 7 <NA> 0.2867133
Function in summarise()
The verb summarise() is compatible with almost all the functions in R. Here is a short list of useful
functions you can use together with summarise():
Basic function
In the previous example, you didn't store the summary statistic in a data frame.
You can proceed in two steps to generate a date frame from a summary:
## Mean
ex1 <- data % > %
group_by(yearID) % > %
summarise(mean_game_year = mean(G))
head(ex1)
Code Explanation
The summary statistic of batting dataset is stored in the data frame ex1.
Output:
## # A tibble: 6 x 2
## yearID mean_game_year
## <int> <dbl>
## 1 1871 23.42308
## 2 1872 18.37931
## 3 1873 25.61538
## 4 1874 39.05263
## 5 1875 28.39535
## 6 1876 35.90625
Sum
## Sum
data % > %
group_by(lgID) % > %
summarise(sum_homerun_league = sum(HR))
Output:
## # A tibble: 7 x 2
## lgID sum_homerun_league
## <fctr> <int>
## 1 AA 341
## 2 AL 29426
## 3 FL 130
## 4 NL 29817
## 5 PL 98
## 6 UA 46
## 7 <NA> 41
Standard deviation
# Spread
data % > %
group_by(teamID) % > %
summarise(sd_at_bat_league = sd(HR))
Output:
## # A tibble: 148 x 2
## teamID sd_at_bat_league
## <fctr> <dbl>
## 1 ALT NA
## 2 ANA 8.7816395
## 3 ARI 6.0765503
## 4 ATL 8.5363863
## 5 BAL 7.7350173
## 6 BFN 1.3645163
## 7 BFP 0.4472136
## 8 BL1 0.6992059
## 9 BL2 1.7106757
## 10 BL3 1.0000000
## # ... with 138 more rows
There are lots of inequality in the quantity of homerun done by each team.
You can access the minimum and the maximum of a vector with the function min() and max().
The code below returns the lowest and highest number of games in a season played by a player.
data % > %
group_by(playerID) % > %
summarise(min_G = min(G),
max_G = max(G))
Output:
## # A tibble: 10,395 x 3
## <fctr> <int>
## 1 aardsda01 53 73
## 3 aasedo01 24 66
## 4 abadfe01 18 18
## 5 abadijo01 11 11
## 6 abbated01 3 153
## 7 abbeybe01 11 11
## 8 abbeych01 80 132
## 9 abbotgl01 5 23
## 10 abbotji01 13 29
Count
Count observations by group is always a good idea. With R, you can aggregate the the number of
occurence with n().
For instance, the code below computes the number of years played by each player.
# count observations
data % > %
group_by(playerID) % > %
arrange(desc(number_year))
Output:
## # A tibble: 10,395 x 2
## playerID number_year
## <fctr> <int>
## 1 pennohe01 11
## 2 joosted01 10
## 3 mcguide01 10
## 4 rosepe01 10
## 5 davisha01 9
## 6 johnssi01 9
## 7 kaatji01 9
## 8 keelewi01 9
## 9 marshmi01 9
## 10 quirkja01 9
For instance, you can find the first and last year of each player.
data % > %
group_by(playerID) % > %
summarise(first_appearance = first(yearID),
last_appearance = last(yearID))
Output:
## # A tibble: 10,395 x 3
nth observation
The fonction nth() is complementary to first() and last(). You can access the nth observation within a
group with the index to return.
For instance, you can filter only the second year that a team played.
# nth
data % > %
group_by(teamID) % > %
arrange(second_game)
Output:
## # A tibble: 148 x 2
## teamID second_game
## <fctr> <int>
## 1 BS1 1871
## 2 CH1 1871
## 3 FW1 1871
## 4 NY2 1871
## 5 RC1 1871
## 6 BR1 1872
## 7 BR2 1872
## 8 CL1 1872
## 9 MID 1872
## 10 TRO 1872
The function n() returns the number of observations in a current group. A closed function to n() is
n_distinct(), which count the number of unique values.
In the next example, you add up the total of players a team recruited during the all periods.
# distinct values
data % > %
group_by(teamID) % > %
arrange(desc(number_player))
Code Explanation
Output:
## # A tibble: 148 x 2
## teamID number_player
## <fctr> <int>
## 1 CHN 751
## 2 SLN 729
## 3 PHI 699
## 4 PIT 683
## 5 CIN 679
## 6 BOS 647
## 7 CLE 646
## 8 CHA 636
## 9 DET 623
## 10 NYA 612
## # ... with 138 more rows
Multiple groups
# Multiple groups
data % > %
arrange(desc(teamID, yearID))
Code Explanation
Output:
## # A tibble: 2,829 x 3
Filter
Before you intend to do an operation, you can filter the dataset. The dataset starts in 1871, and the
analysis does not need the years prior to 1980.
# Filter
data % > %
group_by(yearID) % > %
summarise(mean_game_year = mean(G))
Code Explanation
filter(yearID > 1980): Filter the data to show only the relevant years (i.e. after 1980)
group_by(yearID): Group by year
summarise(mean_game_year = mean(G)): Summarize the data
Output:
## # A tibble: 36 x 2
## yearID mean_game_year
## <int> <dbl>
## 1 1981 40.64583
## 2 1982 56.97790
## 3 1983 60.25128
## 4 1984 62.97436
## 5 1985 57.82828
## 6 1986 58.55340
## 7 1987 48.74752
## 8 1988 52.57282
## 9 1989 58.16425
## 10 1990 52.91556
Ungroup
Last but not least, you need to remove the grouping before you want to change the level of the
computation.
data % > %
group_by(playerID) % > %
ungroup() % > %
summarise(total_average_homerun = mean(average_HR_game))
Code Explanation
Output:
## # A tibble: 1 x 1
## total_average_homerun
## <dbl>
## 1 0.06882226
Summary
When you want to return a summary by group, you can use:
ungroup(df)
The table below summarizes the function you learnt with summarise()