03 Data Input Output

Al. I. Cuza University of Iași

Faculty of Economics and Business Administration
Department of Accounting, Information Systems and
Statistics
Data Analysis & Data Science with R
Data Input and Output
(Import and Export)
By Marin Fotache
Scripts associated with this presentation
Scripts
03a_basics_of_data_input_output.R:
http://1drv.ms/1DRMOTC
03b_intermediate_data_input_output:
http://1drv.ms/1JKeKwi
PostgreSQL scripts (for creating the DB to be imported in R; see script 03a...)
00a_1_creating_tables__sales.sql:
http://1drv.ms/1JKeRbh
00a_2_populating_tables__sales.sql:
http://1drv.ms/1JKeYU0
01_creare_bd_vinzari_PostgreSQL.sql:
http://1drv.ms/1JKf522
02_populare_bd_vinzari_PostgreSQL.sql:
http://1drv.ms/1JKfem5
Scripts associated with this presentation (cont.)
Oracle scripts (for creating the DB to be imported in R; see script 03b...)
01-01a_creating_tables__sales.sql
http://1drv.ms/1LBNImg
01-01a_ro_creare_bd_vinzari.sql
http://1drv.ms/1AqQTvg
01-01b_populating_tables__sales.sql
http://1drv.ms/1LBNGLe
01-01b_ro_populare_bd_vinzari.sql
http://1drv.ms/1A5a60K
Web sites with R tutorials for data input/output
R Data Import/Export
http://cran.r-project.org/doc/manuals/r-release/R-data.html
Beginner's guide to R: Get your data into R
http://www.computerworld.com/article/2497164/business-intelligence/beginner-s-guide-to-r-get-your-data-into-r.html
Reading/Writing Data: Part 1
https://www.youtube.com/watch?v=aBzAels6jPk&index=9&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ
Web sites with R tutorials for data input/output (cont.)
Importing Data Into R from Different Sources
http://www.r-bloggers.com/importing-data-into-r-from-different-sources/
Data Import & Export in R
http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/index.cfm
Reading data from the new version of Google Spreadsheets
http://blog.revolutionanalytics.com/2014/06/reading-data-from-the-new-version-of-g
Loading data into statistical packages
Traditional solutions:
  Direct import from external data files (Excel, CSV, text files etc.) using the package's menus
  Save intermediate results from the data sources into common-format files (XML, CSV, JSON) and then import these intermediate files into the package
  Create data sources using ODBC or JDBC
Some more recent options:
  Customized ETL procedures (specific to the data source and the destination package)
  Connecting to special APIs or web/data services which provide data sets in formats that are easy to import (e.g. Google Analytics)
  Importing data from web server logs into NoSQL data stores
  Running database queries on a database server directly from the statistical package
Sources of Data in R (adapted from [Kabacoff, 2011])
[Figure: sources of data in R; labels recoverable from the diagram: Hadoop, NoSQL data stores]
Loading data sets stored within packages
See previous presentation
Many packages, such as ggplot2, include datasets; after a package is loaded, all of its datasets are available:
> library(ggplot2)
> str(diamonds)
'data.frame': 53940 obs. of 10 variables:
 $ carat  : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
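To see which datasets a package ships before loading any of them, one option is data(package = ...). The sketch below uses the datasets package and its mtcars data frame (neither appears in the slides) only because they come with every R installation; the same calls should work for ggplot2 once it is installed.

```r
# List the datasets bundled with a package, then inspect one of them.
# 'datasets' and 'mtcars' are stand-ins chosen because they ship with base R.
info <- data(package = "datasets")$results   # character matrix: Package, LibPath, Item, Title
head(info[, c("Item", "Title")], 3)

# Datasets from an attached package can be used directly by name.
str(mtcars)
```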
Loading data objects saved in the workspace of a previous session
Save the workspace associated with the current session (the workspace contains all of the data objects existing at a point in time):
> ws.name <- paste("work", Sys.Date(),
+ ".RData", sep="")
> save.image(file=ws.name)
Restore (load) a previously saved workspace (and all the data objects in it):
> load("work2014-09-12.RData")
When several workspaces have been saved, one can choose which one to load:
> load(file.choose())
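The save/restore cycle can be tried without touching the current session. The sketch below is a variation, not the slide's exact method: it uses save() on two made-up objects instead of save.image(), writes to a temporary directory, and loads into a separate environment so that existing objects are not overwritten.

```r
# Save selected objects under a dated file name, then restore them into a
# separate environment so nothing in the current session is overwritten.
scores <- c(7.5, 9, 8.25)                      # hypothetical example objects
labels <- c("a", "b", "c")

ws_name <- file.path(tempdir(), paste0("work", Sys.Date(), ".RData"))
save(scores, labels, file = ws_name)

restored <- new.env()
loaded_names <- load(ws_name, envir = restored)  # load() returns the object names
print(loaded_names)
print(get("scores", envir = restored))
```

Loading into a dedicated environment is a design choice: a plain load() silently replaces any object in the workspace that has the same name.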
Data entered from the keyboard (1)
The simplest method of data entry: function edit() launches a text editor that allows entering your data manually
If the data frame exists:
> student_gi <- edit(student_gi)
or
> fix(student_gi)
Data entered from the keyboard (2)
If the data frame does not exist, follow two steps:
1 Create an empty data frame (or matrix) with the variable names and types you want to have in the final dataset.
> mydata <- data.frame(age=numeric(0),
+ gender=character(0),
+ weight=numeric(0))
2 Invoke the text editor on this data object, enter data, and save the results back to the data object.
> mydata <- edit(mydata)
or
> fix(mydata)
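edit() and fix() need an interactive session. For scripts, the same two steps can be approximated by appending rows with rbind(); the sketch below assumes this substitution, and the values in it are invented for illustration.

```r
# Start from an empty, typed data frame, as in step 1 above.
mydata <- data.frame(age = numeric(0),
                     gender = character(0),
                     weight = numeric(0),
                     stringsAsFactors = FALSE)

# Instead of typing into edit(), append rows programmatically.
# The values below are made up purely for illustration.
new_rows <- data.frame(age = c(25, 31),
                       gender = c("f", "m"),
                       weight = c(61.5, 80.2),
                       stringsAsFactors = FALSE)
mydata <- rbind(mydata, new_rows)

str(mydata)
```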
Data entered from clipboard
One can copy small sections of data from a table (e.g. a spreadsheet, an HTML table on the web) to the clipboard using Ctrl-C (the copy command)
On Windows, command read.table handles tab-separated clipboard data with a header row, and stores the data in a data frame (x):
> x <- read.table(file = "clipboard", sep="\t",
+ header=TRUE)
On Mac, the pipe("pbpaste") function is the equivalent:
copy without header
> x <- read.table(pipe("pbpaste"), sep="\t",
+ header=FALSE)
copy with header
> y <- read.table(pipe("pbpaste"), sep="\t",
+ header=TRUE)
Import from local CSV/delimited text files
read.table() reads a file in table format and saves it as a data frame
> mydataframe <- read.table(file,
+ header=logical_value,
+ sep="delimiter", row.names="name")
file is a delimited ASCII file
header is a logical value indicating whether the first row contains variable names (TRUE or FALSE)
sep specifies the delimiter separating data values
row.names is an optional parameter specifying one or more variables to represent row identifiers.
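The effect of header, sep and row.names can be checked without a file on disk, because read.table() also accepts its input through the text argument. In the minimal sketch below, the table content and the names raw and students are invented for illustration.

```r
# A small tab-delimited table held in a string; the content is invented
# for illustration and stands in for a file on disk.
raw <- "id\tname\tscore\ns1\tAna\t9.5\ns2\tDan\t8.0\ns3\tIon\t7.25"

students <- read.table(text = raw,
                       header = TRUE,        # first row holds the variable names
                       sep = "\t",           # values are separated by tabs
                       row.names = "id",     # use column 'id' as row identifiers
                       stringsAsFactors = FALSE)

print(students)
print(rownames(students))
```

Note that the column named in row.names is removed from the data columns and becomes the row labels.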
Data input from local delimited text files
The source file for data frame births2006 is located in directory DataSets/births2006
The delimiter in the source file is Tab (\t)
The name of the file to be imported is births2006.txt
Qualifying the subdirectory differs slightly from one operating system to another
> switch(Sys.info()[['sysname']],
+ Windows = {births2006 <- read.table(
+ "births2006\\births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+ Linux = {births2006 <- read.table(
+ "births2006/births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+ Darwin = {births2006 <- read.table(
+ "births2006/births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")} )
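As an alternative to branching on the operating system, base R's file.path() builds the path from its components. The sketch below is a suggested variation rather than the approach used in the scripts; the name rel_path is illustrative, and the import is guarded because the births2006 file may not be present in the working directory.

```r
# file.path() joins path components with "/", which R accepts on Windows,
# Linux and macOS alike, so one call can replace the per-OS branches.
rel_path <- file.path("births2006", "births2006.txt")
print(rel_path)

# Guarded so the sketch runs even when the file is not present.
if (file.exists(rel_path)) {
  births2006 <- read.table(rel_path, fileEncoding = "UTF-8",
                           header = TRUE, sep = "\t")
}
```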
Data input from local delimited text files (cont.)
When we are not sure about the file name, function file.choose() can be used in its place:
> births2006.2 <- read.table(file.choose(),
+ fileEncoding = "UTF-8",
+ header = TRUE,
+ sep="\t")
Data input from local CSV files
Import one dataset from Irina Dan's Ph.D. thesis concerning a study of using e-documents in companies
Data source is a delimited file, companyinfo.csv (the import below uses ";" as the separator)
Current working directory is .../DataSets
File to be imported is located in directory .../DataSets/IrinaDan
> switch(Sys.info()[['sysname']],
+ Windows = {comp <- read.table(
+ "IrinaDan\\companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)},
+ Linux = {comp <- read.table(
+ "IrinaDan/companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)},
+ Darwin = {comp <- read.table(
+ "IrinaDan/companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)}
+ )

Data input from local CSV files (cont.)
Import a dataset from Dragos Cogean's Ph.D. thesis which compares two cloud database services, Mongo and MySQL
Data sources are tab-delimited files
The .txt (tab-delimited) file resides in directory \DataSets\DragosCogean
Notice the second version of function switch(): here switch() returns the data frame, which is assigned only once
> InsertMongoALL <- switch(Sys.info()[['sysname']],
+ Windows = { read.table(
+ "DragosCogean\\InsertMongo_ALL.txt",
+ fileEncoding = "UTF-8", header=TRUE,
+ sep="\t", stringsAsFactors=FALSE)},
+ Darwin = { read.table(
+ "DragosCogean/InsertMongo_ALL.txt",
+ fileEncoding = "UTF-8", header=TRUE,
+ sep="\t", stringsAsFactors=FALSE)} )

Data input from local CSV files (cont.)
Import the Toyota Corolla second-hand cars data set (located in the ...\DataSets\ToyotaCorolla directory)
Also notice the third version of function switch(): here switch() returns only the file path
> ToyotaCorolla <- read.table(
+ switch(Sys.info()[['sysname']],
+ Windows = {"ToyotaCorolla\\ToyotaCorolla.csv"},
+ Darwin = {"ToyotaCorolla/ToyotaCorolla.csv"}),
+ fileEncoding = "UTF-8", header = TRUE,
+ sep=","
+ )
Data input from text file available on web
Heart attack data set
Description available at:
http://courses.statistics.com/software/R/tables4R.htm
The data set (as delimited text file) available at:
http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt
> heart.att <- read.table(
+ "http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt",
+ header=TRUE)
> head(heart.att)
  Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
1       1     41041   F 122    0 4752.00  10  79
2       2     41041   F 122    0 3941.00   6  34
3       3     41091   F 122    0 3657.00   5  76
4       4     41081   F 122    0 1481.00   2  80
5       5     41091   M 122    0 1681.00   1  55
Data input from CSV file available on web
Smoking data set: 356 people polled on their smoking status (Smoke) and their socioeconomic status (SES).
The data file contains only two columns, and when read, R interprets them both as factors:
> smoker <- read.csv(
+ "http://www.cyclismo.org/tutorial/R/_static/smoker.csv")
> head(smoker)
   Smoke  SES
1 former High
2 former High
3 former High
4 former High
5 former High
6 former High
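Whether text columns arrive as factors depends on argument stringsAsFactors, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below makes the choice explicit on a tiny stand-in for smoker.csv; its rows are invented for illustration and are not the real survey data.

```r
# Tiny stand-in for smoker.csv; the rows are invented for illustration.
csv_text <- "Smoke,SES\nformer,High\ncurrent,Low\nnever,Middle\nformer,Low"

as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$Smoke)    # factor
class(as_strings$Smoke)    # character

# Factor columns are what cross-tabulations are usually built from.
table(as_factors$Smoke, as_factors$SES)
```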
Download and read
When a data set is large, instead of the direct import...
> dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
...one can proceed in two steps:
1. download the file
> download.file(
+ "http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data",
+ destfile="data.csv")
trying URL 'http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data'
Content type 'text/plain; charset=UTF-8' length 402355 bytes (392 Kb)
opened URL
==================================================
downloaded 392 Kb
2. import the downloaded file
> df.2 <- read.csv("data.csv")
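The two steps can be wrapped so that the file is fetched only once and later runs reuse the local copy. The helper fetch_once() below is hypothetical (it is not part of R or of the course scripts) and is offered as a minimal sketch; the usage lines are commented out because they need network access.

```r
# fetch_once() is a hypothetical helper, not part of any package: it
# downloads a file only if no local copy exists yet and returns its path.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
    if (status != 0) stop("download failed for ", url)
  }
  destfile
}

# Intended use (needs network access, so it is left commented out):
# path <- fetch_once("http://www.ats.ucla.edu/stat/data/hsb2.csv", "hsb2.csv")
# dat.csv <- read.csv(path)
```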
Importing data from Excel files
The "simplest" way to read an Excel (.xls/.xlsx) file is to save it in Excel as a text (tab delimited) or csv file and then to read it as in previous slides
Loading .xls/.xlsx files directly into R is possible through various packages:
RODBC
gdata
xlsReadWrite
XLConnect
xlsx
Problems (on Windows systems) when loading some packages
> install.packages("xlsx")
> library(xlsx)
Loading required package: rJava
Loading required package: xlsxjars
Package xlsx requires package rJava; on Windows systems that sometimes creates problems (e.g. 64-bit R on Windows 7)
On my computer (Windows 7 64-bit), the Java runtime was installed in directory C:\\Program Files\\Java\\jre7\\, so before loading package rJava or any other package which uses rJava, I need to run:
> options(java.home="C:\\Program Files\\Java\\jre7\\")
Reading data from Excel files
We'll prefer package xlsx:
> library(xlsx)
The ADL master students (2013-2014) file is located in directory .../DataSets/ADL
Path is specified differently for Windows and Mac OS systems:
> workbook <- switch(Sys.info()[['sysname']],
+ Windows = { "ADL\\ADL2013_Studenti.xlsx"},
+ Darwin = { "ADL/ADL2013_Studenti.xlsx"})
> adl2013_stud <- read.xlsx(workbook, 1)
> str(adl2013_stud)
Import data from local PostgreSQL databases
The PostgreSQL database server is installed locally, on my laptop (on Windows systems check if the PostgreSQL service is started)
Package needed: RPostgreSQL
> install.packages("RPostgreSQL")
> library(RPostgreSQL)
Load the PostgreSQL driver
> drv <- dbDriver("PostgreSQL")
Open a connection
On Windows systems:
> con <- dbConnect(drv, dbname="bd2014", user="bd2014",
+ password="bd2014")
On Mac OS:
> con <- dbConnect(drv, port=5433, dbname="sales2014",
+ user="sales2014", password="sales2014")

Import data from local PostgreSQL databases (cont.)
Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:
> invoice_detailed <-
+ dbGetQuery(con,
"SELECT i.invoiceNo, invoiceDate, i.customerId,
customerName, place, countyName, region,
comments, invoiceRowNumber, i_d.productId,
productName, unitOfMeasurement, category,
quantity, unitPrice, quantity * unitPrice AS amountWithoutVAT,
quantity * unitPrice * (1 + VATPercent) AS amount
FROM invoices i
INNER JOIN invoice_details i_d ON i.invoiceNo = i_d.invoiceNo
INNER JOIN products p ON i_d.productId = p.productId
INNER JOIN customers c ON i.customerid = c.customerid
INNER JOIN postcodes pc ON c.postCode = pc.postCode
INNER JOIN counties ON pc.countyCode = counties.countyCode
ORDER BY i.invoiceNo, invoiceRowNumber")

Import data from local PostgreSQL databases (cont.)
Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:
> head(invoice_detailed,3)
  invoiceno invoicedate customerid customername place
1      1111  2012-08-01       1001 Client 1 SRL  Iasi
2      1111  2012-08-01       1001 Client 1 SRL  Iasi
3      1111  2012-08-01       1001 Client 1 SRL  Iasi
  countyname  region comments invoicerownumber productid
1       Iasi Moldova     <NA>
2       Iasi Moldova     <NA>
3       Iasi Moldova     <NA>
  productname unitofmeasurement   category quantity
1   Product 1            b500ml Category A       50
2   Product 2                kg Category B       75
3   Product 5              unit Category A       50
  unitprice amountwithoutvat amount
1      1000            50000  62000
2      1050            78750  88200
3      7060           353000 437720
Saving the data frame(s)
Data frame(s) will be saved (for further use) in directory .../DataSets/sales
Path qualification is different between Windows and Mac systems:
> file.name <- switch(Sys.info()[['sysname']],
+ Windows = { "sales\\invoice_detailed.RData"},
+ Darwin = { "sales/invoice_detailed.RData"})
> save(invoice_detailed, file = file.name)
After saving, whenever needed, the data frame can be loaded into an RStudio session with function load()
Close connections/drivers
After the import, the resources must be freed
Close all PostgreSQL connections
> for (connection in dbListConnections(drv)) {
+ dbDisconnect(connection)
+ }
Free all the resources on the driver
> dbUnloadDriver(drv)
Import data from remote PostgreSQL databases
...for the moment it is impossible to access the database servers externally (from outside FEAA)
Access Oracle databases through JDBC
Package ROracle was intended to provide access to Oracle databases
Unfortunately, at the time of writing package ROracle was not available
The next example was inspired by
http://www.r-bloggers.com/connecting-r-to-an-oracle-database-with-rjdbc/
As the name suggests, the solution needs dealing with some Java "things"
Requirements: JDK/JRE previously installed
Download the ojdbc jar from www.oracle.com (in my case, ojdbc6.jar)
Set JAVA_HOME, set max. memory, and load the rJava library
> Sys.setenv(JAVA_HOME='/path/to/java_home')
On my Mac OS:
> Sys.setenv(JAVA_HOME='/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home')
> options(java.parameters="-Xmx2g")
> install.packages("rJava")
> library(rJava)

Access Oracle databases through JDBC (cont.)
Getting some information
Java version
> .jinit()
> print(.jcall("java/lang/System", "S", "getProperty",
+ "java.version"))
classPath (just for the record)
> .jclassPath()
Load RJDBC package
> install.packages("RJDBC")
> library(RJDBC)
Create the connection driver
> jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver",
+ classPath="/Users/admin/Downloads/ojdbc6.jar")

Access Oracle databases through JDBC (cont.)
Open connection
> jdbcConnection <- dbConnect(jdbcDriver,
+ "jdbc:oracle:thin:@//10.10.0.7:1521/orcl",
+ "bd2",
+ "bd2")
Launch the Oracle query and store the result into the data frame st
> st <- dbGetQuery(jdbcConnection,
+ "SELECT * FROM studenti")
Close connection
> dbDisconnect(jdbcConnection)
Import data from MongoDB
Import data from Cassandra
Import data from Hadoop
Read HTML tables from the web
Package needed: XML
> install.packages("XML")
> library(XML)
> myURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> dfHTML <- readHTMLTable(myURL, which=1,
+ header=FALSE,
+ stringsAsFactors = FALSE)
> head(dfHTML,3)
             V1      V2        V3
1 Participant 1 Giant A Patriot Q
2 Participant 2 Giant B Patriot R
Import XML files
Package needed: XML
> library(XML)
Web address of the xml file
> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"
Import
> indata <- xmlToDataFrame(url)
> head(indata, 5)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
Reading HTML pages with multiple tables
Package needed: XML
> library(XML)
The web page contains text and a number of tables
> url.1 <- 'http://en.wikipedia.org/wiki/World_population'
> tbls.1 <- readHTMLTable(url.1)
> class(tbls.1)
[1] "list"
Display how many tables are on the page
> length(tbls.1)
[1] 28
Read only the 1st table on this page (which=1)
> tbl.1.1 <- readHTMLTable(url.1, which=1,
+ header=FALSE, stringsAsFactors = FALSE)
Importing data from other statistical packages
Package needed: foreign
> install.packages("foreign")
> library(foreign)
Read a Stata data file (.dta)
> states <- read.dta("states.dta")
Read a local SPSS file
> spss1 <- read.spss("p004.sav",
+ use.value.labels = TRUE, to.data.frame = TRUE)
Import the SPSS file directly from a web address
> spss2 <- read.spss(
+ "http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav",
+ use.value.labels = TRUE, to.data.frame = TRUE)
Save/export R data objects
Save a data frame as a .csv file
> write.csv(spss2, file = "spss2.csv")
Save a data frame as a tab-delimited text file
> write.table(spss2, file = "spss2.txt",
+ sep = "\t", fileEncoding = "UTF-8")
Save a data frame as an Excel (xlsx) file (requires package xlsx)
> write.xlsx(spss2, file = "spss2.xlsx",
+ sheetName="spss2")
> write.xlsx(echipe.4,
+ file = "Centralizator BD2_2013_SIA1.xlsx",
+ sheetName="t4.echipe",
+ row.names=FALSE, append=TRUE, showNA=FALSE)
Save/export R data objects (cont.)
Save a data frame as a .dta file (requires package foreign)
> write.dta(spss2, file = "spss2.dta")
Save to binary R format (can save multiple datasets and R objects)
> save(invoice.details.ro,
+ file = "invoice.details.ro.RData")
> save(states, spss2, dat.xls,
+ file = "temp.RData")
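A quick way to check an export is to read the file back and compare it with the original. The sketch below does this for the .csv and tab-delimited formats. The data frame df and the file names are invented for illustration, and row.names = FALSE is an added choice (the calls above keep the default) so that the re-imported data has the same columns.

```r
# Small invented data frame used only to demonstrate the round trip.
df <- data.frame(id = 1:3,
                 city = c("Iasi", "Cluj", "Brasov"),
                 amount = c(10.5, 20, 7.25),
                 stringsAsFactors = FALSE)

csv_file <- file.path(tempdir(), "roundtrip.csv")
txt_file <- file.path(tempdir(), "roundtrip.txt")

# row.names = FALSE avoids writing an extra, unnamed first column.
write.csv(df, file = csv_file, row.names = FALSE)
write.table(df, file = txt_file, sep = "\t", row.names = FALSE,
            fileEncoding = "UTF-8")

from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_txt <- read.table(txt_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

all.equal(df, from_csv)
all.equal(df, from_txt)
```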
