
Al.I. Cuza University of Iași


Faculty of Economics and Business Administration
Department of Accounting, Information Systems and
Statistics

Data Analysis & Data Science with R
Data Input and Output
(Import and Export)
By Marin Fotache

Scripts associated with this presentation

Scripts

03a_basics_of_data_input_output.R:
http://1drv.ms/1DRMOTC
03b_intermediate_data_input_output:
http://1drv.ms/1JKeKwi

PostgreSQL scripts (for creating the DB to be imported in R; see script 03a...)
00a_1_creating_tables__sales.sql:
http://1drv.ms/1JKeRbh
00a_2_populating_tables__sales.sql:
http://1drv.ms/1JKeYU0
01_creare_bd_vinzari_PostgreSQL.sql:
http://1drv.ms/1JKf522

02_populare_bd_vinzari_PostgreSQL.sql:
http://1drv.ms/1JKfem5

Scripts associated with this presentation (cont.)

Oracle scripts (for creating the DB to be imported in R; see script 03b...)
01-01a_creating_tables__sales.sql
http://1drv.ms/1LBNImg
01-01a_ro_creare_bd_vinzari.sql
http://1drv.ms/1AqQTvg
01-01b_populating_tables__sales.sql
http://1drv.ms/1LBNGLe
01-01b_ro_populare_bd_vinzari.sql
http://1drv.ms/1A5a60K

Web sites with R tutorials for data input/output

R Data Import/Export
http://cran.r-project.org/doc/manuals/r-release/R-data.html
Beginner's guide to R: Get your data into R
http://www.computerworld.com/article/2497164/business-intelligence/beginner-s-guide-to-r-get-your-data-into-r.html
Reading/Writing Data: Part 1
https://www.youtube.com/watch?v=aBzAels6jPk&index=9&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ

Web sites with R tutorials for data input/output (cont.)

Importing Data Into R from Different Sources
http://www.r-bloggers.com/importing-data-into-r-from-different-sources/
Data Import & Export in R
http://science.nature.nps.gov/im/datamgmt/statistics/r/fundamentals/index.cfm
Reading data from the new version of Google Spreadsheets
http://blog.revolutionanalytics.com/2014/06/reading-data-from-the-new-version-of-g

Loading data into statistical packages

Traditional solutions:

Direct import from external data files (Excel, CSV, text files, etc.) using the packages' menus

Save intermediate results from the data sources into common-format files (XML, CSV, JSON) and then import these intermediate files into the package

Create data sources using ODBC or JDBC

Some more recent options:

Customized ETL procedures (tailored to the data source and the destination package)

Connecting to special APIs or web/data services which provide data sets in formats that are easy to import (e.g. Google Analytics)

Importing data from web server logs into NoSQL data stores

Running a database query on a database server directly from the statistical package

Sources of Data in R (adapted from [Kabacoff, 2011])

[Figure: diagram of the data sources available to R; surviving labels: Hadoop, NoSQL data stores]

Loading data sets stored within packages

See the previous presentation
Many packages, such as ggplot2, include datasets; after a package is loaded, all of its datasets are available:
> library(ggplot2)
> str(diamonds)
'data.frame': 53940 obs. of 10 variables:
 $ carat  : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Loading data objects saved in the workspace of a previous session

Save the workspace associated with the current session (the workspace contains all of the data objects existing at a point in time):
> ws.name <- paste("work", Sys.Date(),
+ ".RData", sep="")
> save.image(file=ws.name)
Restore (load) a previously saved workspace (and all the data objects in it):
> load("work2014-09-12.RData")
When several workspaces have been saved, one can choose which one to load:
> load(file.choose())
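The save/restore cycle can be tried out without touching the real workspace. The sketch below is only an illustration under a few assumptions: it uses save() on an explicit environment instead of save.image(), writes to a temporary directory, and the objects `scores` and `label` are hypothetical.

```r
# Hypothetical objects standing in for the contents of a session's workspace
demo_env <- new.env()
assign("scores", c(7, 9, 10), envir = demo_env)
assign("label", "test", envir = demo_env)

# Date-stamped file name, built as on the slide (placed in a temp directory here)
ws.name <- file.path(tempdir(), paste("work", Sys.Date(), ".RData", sep = ""))

# Save every object of demo_env (save.image() does this for the global workspace)
save(list = ls(demo_env), envir = demo_env, file = ws.name)

# Restore into a fresh environment; load() returns the names of the restored objects
restored_env <- new.env()
restored_names <- load(ws.name, envir = restored_env)
print(restored_names)
```

Because load() returns the names of the objects it restored, the call also serves as a quick check of what a saved workspace file contains.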

Data entered from the keyboard (1)

The simplest method of data entry: function edit() launches a text editor that allows entering your data manually
If the data frame exists:
> student_gi <- edit(student_gi)
or
> fix(student_gi)

Data entered from the keyboard (2)

If the data frame does not exist, follow two steps:

1. Create an empty data frame (or matrix) with the variable names and types you want to have in the final dataset.

> mydata <- data.frame(age=numeric(0),
+ gender=character(0),
+ weight=numeric(0))

2. Invoke the text editor on this data object, enter data, and save the results back to the data object.
> mydata <- edit(mydata)
or
> fix(mydata)
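Since edit() and fix() are interactive, the two steps can also be checked in a script. The following is a minimal sketch, assuming a single hypothetical row is appended with rbind() instead of being typed into the editor; `stringsAsFactors = FALSE` is set explicitly so that `gender` stays a character column in any R version.

```r
# Step 1: empty data frame with the column names and types from the slide
mydata <- data.frame(age = numeric(0),
                     gender = character(0),
                     weight = numeric(0),
                     stringsAsFactors = FALSE)
str(mydata)   # 0 observations, 3 typed columns

# Step 2 (non-interactive stand-in for edit()): append a hypothetical row
new_row <- data.frame(age = 25, gender = "f", weight = 62.5,
                      stringsAsFactors = FALSE)
mydata <- rbind(mydata, new_row)
print(mydata)
```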

Data entered from clipboard

One can copy small sections of data in a table (e.g. a spreadsheet, an HTML table on the Web) to the clipboard using Ctrl-C (the copy command)

On Windows, read.table handles tab-separated clipboard data with a header row and stores the data in a data frame (x):

> x <- read.table(file = "clipboard", sep="\t",
+ header=TRUE)

On Mac, pipe("pbpaste") is the equivalent:

copy without header

> x <- read.table(pipe("pbpaste"), sep="\t",
+ header=FALSE)

copy with header

> y <- read.table(pipe("pbpaste"), sep="\t",
+ header=TRUE)
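The clipboard calls above depend on the operating system. As a platform-independent way to see the same parsing at work, read.table() also accepts the data as a character string through its `text` argument. The sketch below assumes a small, hypothetical tab-separated snippet such as a spreadsheet copy would produce.

```r
# Hypothetical tab-separated text, as it would arrive from a spreadsheet copy
pasted <- "name\tage\tcity\nAna\t23\tIasi\nDan\t31\tCluj"

# read.table() parses a character string directly through its 'text' argument
x <- read.table(text = pasted, sep = "\t", header = TRUE,
                stringsAsFactors = FALSE)
print(x)
```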

Import from local CSV/delimited text files

read.table() reads a file in table format and saves it as a data frame

> mydataframe = read.table(file,
+ header=logical_value,
+ sep="delimiter", row.names="name")
file is a delimited ASCII file
header is a logical value indicating whether the first row contains variable names (TRUE or FALSE)
sep specifies the delimiter separating data values
row.names is an optional parameter specifying one or more variables to represent row identifiers.
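To make the role of each argument concrete, here is a small self-contained sketch. The file contents, the column names and the choice of ";" as delimiter are assumptions made for illustration only.

```r
# Write a small hypothetical semicolon-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;product;price",
             "p1;Product 1;10.5",
             "p2;Product 2;7.25"), tmp)

# header = TRUE    : the first row holds the variable names
# sep = ";"        : the delimiter between values
# row.names = "id" : the column used for row identifiers
products <- read.table(tmp, header = TRUE, sep = ";", row.names = "id",
                       stringsAsFactors = FALSE)
print(products)
```

The `id` column no longer appears among the variables; its values become the row names of the data frame.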

Data input from local delimited text files

The source file for data frame births2006 is located in directory DataSets/births2006
The delimiter in the source file is Tab (\t)
The name of the file to be imported is births2006.txt
Qualifying the subdirectory differs slightly from one operating system to another
> switch(Sys.info()[['sysname']],
+ Windows= {births2006 <- read.table(
+ "births2006\\births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+ Linux = {births2006 <- read.table(
+ "births2006/births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")},
+ Darwin = {births2006 <- read.table(
+ "births2006/births2006.txt",
+ fileEncoding = "UTF-8", header = TRUE, sep="\t")} )
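As an alternative to branching on the operating system, the path can be built with file.path(), which joins components with "/", a separator R accepts on Windows, Linux and Mac OS alike. This is only a sketch of that alternative, not the approach used in the presentation's scripts; the import itself is left commented out because the data file is not available here.

```r
# file.path() joins path components with "/", which R accepts on all three systems
births_file <- file.path("births2006", "births2006.txt")
print(births_file)

# The import then needs no switch() (commented out: requires the actual data file)
# births2006 <- read.table(births_file, fileEncoding = "UTF-8",
#                          header = TRUE, sep = "\t")
```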

Data input from local delimited text files (cont.)

When we are not sure about the file name, we can use the function file.choose() instead of the file name:

> births2006.2 <- read.table(file.choose(),
+ fileEncoding = "UTF-8",
+ header = TRUE,
+ sep="\t")

Data input from local CSV files

Import one dataset from Irina Dan's Ph.D. thesis concerning a study of the use of e-documents in companies
Data source is a semicolon-delimited file, companyinfo.csv
Current working directory is .../DataSets
File to be imported is located in directory .../DataSets/IrinaDan
> switch(Sys.info()[['sysname']],
+ Windows= {comp <- read.table(
+ "IrinaDan\\companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)},
+ Linux = {comp <- read.table(
+ "IrinaDan/companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)},
+ Darwin = {comp <- read.table(
+ "IrinaDan/companyinfo.csv",
+ header=TRUE, sep=";", stringsAsFactors=FALSE)}
+ )

Data input from local CSV files (cont.)

Import a dataset from Dragos Cogean's Ph.D. thesis, which compares two cloud database services, Mongo and MySQL
Data sources are tab delimited files
The .txt (tab-delimited) file resides in directory \DataSets\DragosCogean
Notice the second version of function switch()
> InsertMongoALL <- switch(Sys.info()[['sysname']],
+ Windows= { read.table(
+ "DragosCogean\\InsertMongo_ALL.txt",
+ fileEncoding = "UTF-8", header=TRUE,
+ sep="\t", stringsAsFactors=FALSE)},
+ Darwin = { read.table(
+ "DragosCogean/InsertMongo_ALL.txt",
+ fileEncoding = "UTF-8", header=TRUE,
+ sep="\t", stringsAsFactors=FALSE)} )

Data input from local CSV files (cont.)

Import the Toyota Corolla second-hand cars data set (located in the ...\DataSets\ToyotaCorolla directory)
Also notice the third version of function switch()
> ToyotaCorolla <- read.table(
+ switch(Sys.info()[['sysname']],
+ Windows = {"ToyotaCorolla\\ToyotaCorolla.csv"},
+ Darwin = {"ToyotaCorolla/ToyotaCorolla.csv"}),
+ fileEncoding = "UTF-8", header = TRUE,
+ sep=","
+ )
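The second and third versions of switch() list only Windows and Darwin. When no alternative matches and there is no default, switch() returns NULL, so on Linux read.table() would receive no usable file name. The sketch below shows that behaviour and one possible remedy, an unnamed last argument acting as the default; the helper name `path_for` is hypothetical.

```r
# Hypothetical helper: choose a relative path by operating system name
path_for <- function(sysname) {
  switch(sysname,
         Windows = "ToyotaCorolla\\ToyotaCorolla.csv",
         "ToyotaCorolla/ToyotaCorolla.csv")   # unnamed last argument = default
}

# Without a default, an unmatched name yields NULL
no_default <- switch("Linux", Windows = "a", Darwin = "b")
print(is.null(no_default))

# With a default, every other system gets the forward-slash path
print(path_for("Linux"))
print(path_for(Sys.info()[["sysname"]]))
```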

Data input from text file available on web

Heart attack data set

Description available at:

http://courses.statistics.com/software/R/tables4R.htm

The data set (as a delimited text file) is available at:

http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt
> heart.att = read.table(
+ "http://courses.statistics.com/Intro1/Lesson2/heartatk4R.txt",
+ header=TRUE)
> head(heart.att)
  Patient DIAGNOSIS SEX DRG DIED CHARGES LOS AGE
1       1     41041   F 122    0 4752.00  10  79
2       2     41041   F 122    0 3941.00   6  34
3       3     41091   F 122    0 3657.00   5  76
4       4     41081   F 122    0 1481.00   2  80
5       5     41091   M 122    0 1681.00   1  55

Data input from CSV file available on web

Smoking data set: 356 people polled on their smoking status (Smoke) and their socioeconomic status (SES).
The data file contains only two columns, and when it is read, R interprets both as factors:

> smoker <- read.csv("http://www.cyclismo.org/tutorial/R/_static/smoker.csv")
> head(smoker)
   Smoke  SES
1 former High
2 former High
3 former High
4 former High
5 former High
6 former High
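Whether text columns arrive as factors depends on the `stringsAsFactors` argument, whose default changed from TRUE to FALSE in R 4.0.0. The sketch below sets it explicitly in both directions so the result does not depend on the R version; the three rows are hypothetical and only mimic the two-column layout of the file.

```r
# A few hypothetical rows in the same two-column layout
csv_text <- "Smoke,SES\nformer,High\ncurrent,Low\nnever,Middle"

# Explicitly request factors (the default before R 4.0.0)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Explicitly keep plain character strings (the default since R 4.0.0)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(sapply(as_factors, class))
print(sapply(as_strings, class))
```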


Download and read

When a data set is large, instead of the direct import...

> dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

...one can proceed in two steps:

1. download the file

> download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data",
+ destfile="data.csv")
trying URL 'http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia//arrhythmia.data'
Content type 'text/plain; charset=UTF-8' length 402355 bytes (392 Kb)
opened URL
==================================================
downloaded 392 Kb
2. import the downloaded file

> df.2 <- read.csv("data.csv")
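One benefit of the two-step approach is that the download need not be repeated in later sessions. A possible way to wrap this, sketched under assumptions: the helper name `read_cached_csv` is hypothetical, and the demonstration uses a pre-existing local file with a placeholder URL so that no network access is needed.

```r
# Hypothetical helper: download 'url' to 'destfile' only if the file is absent,
# then read the local copy
read_cached_csv <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile)
  }
  read.csv(destfile, ...)
}

# Local demonstration without network access: the "downloaded" file already
# exists, so the placeholder URL is never contacted
cached <- file.path(tempdir(), "data_demo.csv")
writeLines(c("a,b", "1,2", "3,4"), cached)
df.demo <- read_cached_csv("http://example.invalid/data.csv", cached)
print(df.demo)
```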

Importing data from Excel files

The "simplest" way to read an Excel (.xls/.xlsx) file is to save it in Excel as a text (tab delimited) or csv file and then read it as in the previous slides
Loading .xls/.xlsx files directly into R is possible through various packages:

RODBC
gdata
xlsReadWrite
XLConnect
xlsx

Problems (on Windows systems) when loading some packages

> install.packages("xlsx")
> library(xlsx)
Loading required package: rJava
Loading required package: xlsxjars

xlsx requires package rJava; on Windows systems that sometimes creates problems (e.g. 64-bit R on Windows 7)
On my computer (Windows 7 64-bit), the Java runtime was downloaded into directory C:\\Program Files\\Java\\jre7\\ so, before loading package rJava or any other package which uses rJava, I need to run:
> options(java.home="C:\\Program Files\\Java\\jre7\\")

Reading data from Excel files

We'll prefer package xlsx:

> library(xlsx)
The ADL master students (2013-2014) file is located in directory .../DataSets/ADL
The path is specified differently for Windows and Mac OS systems:
> workbook <- switch(Sys.info()[['sysname']],
+ Windows = { "ADL\\ADL2013_Studenti.xlsx"},
+ Darwin = { "ADL/ADL2013_Studenti.xlsx"})
> adl2013_stud <- read.xlsx(workbook, 1)
> str(adl2013_stud)

Import data from local PostgreSQL databases

The PostgreSQL database server is installed locally, on my laptop (on Windows systems, check that the PostgreSQL service is started)
Package needed: RPostgreSQL
> install.packages("RPostgreSQL")
> library(RPostgreSQL)
Load the PostgreSQL driver

> drv <- dbDriver("PostgreSQL")

Open a connection

On Windows systems:

> con <- dbConnect(drv, dbname="bd2014", user="bd2014",
+ password="bd2014")

On Mac OS:

> con <- dbConnect(drv, port=5433, dbname="sales2014",
+ user="sales2014", password="sales2014")

Import data from local PostgreSQL databases (cont.)

Launch the PostgreSQL query; the result of the query will be saved into data frame invoice_detailed:

> invoice_detailed <- dbGetQuery(con,
+ "SELECT i.invoiceNo, invoiceDate, i.customerId,
+ customerName, place, countyName, region,
+ comments, invoiceRowNumber, i_d.productId,
+ productName, unitOfMeasurement, category,
+ quantity, unitPrice, quantity * unitPrice AS amountWithoutVAT,
+ quantity * unitPrice * (1 + VATPercent) AS amount
+ FROM invoices i
+ INNER JOIN invoice_details i_d ON i.invoiceNo = i_d.invoiceNo
+ INNER JOIN products p ON i_d.productId = p.productId
+ INNER JOIN customers c ON i.customerid = c.customerid
+ INNER JOIN postcodes pc ON c.postCode = pc.postCode
+ INNER JOIN counties ON pc.countyCode = counties.countyCode
+ ORDER BY i.invoiceNo, invoiceRowNumber")

Import data from local PostgreSQL databases (cont.)

The result of the query has been saved into data frame invoice_detailed:

> head(invoice_detailed,3)
  invoiceno invoicedate customerid customername place
1      1111  2012-08-01       1001 Client 1 SRL  Iasi
2      1111  2012-08-01       1001 Client 1 SRL  Iasi
3      1111  2012-08-01       1001 Client 1 SRL  Iasi
  countyname  region comments invoicerownumber productid
1       Iasi Moldova     <NA>              ...       ...
2       Iasi Moldova     <NA>              ...       ...
3       Iasi Moldova     <NA>              ...       ...
  productname unitofmeasurement   category quantity
1   Product 1            b500ml Category A       50
2   Product 2                kg Category B       75
3   Product 5              unit Category A       50
  unitprice amountwithoutvat amount
1      1000            50000  62000
2      1050            78750  88200
3      7060           353000 437720

Saving the data frame(s)

The data frame(s) will be saved (for further use) in directory .../DataSets/sales

Path qualification differs between Windows and Mac systems:

> file.name <- switch(Sys.info()[['sysname']],
+ Windows = { "sales\\invoice_detailed.RData"},
+ Darwin = { "sales/invoice_detailed.RData"})
> save(invoice_detailed, file = file.name)
After saving, the data frame can be loaded into an RStudio session, whenever needed, with the load function

Close connections/drivers

After the import, the resources must be freed

Close all PostgreSQL connections

> for (connection in dbListConnections(drv) )
+ {
+ dbDisconnect(connection)
+ }

Free all the resources on the driver

> dbUnloadDriver(drv)

Import data from remote PostgreSQL databases (!)

...for the moment it is impossible to access the database servers externally (from outside FEAA)

Access Oracle databases through JDBC

Package ROracle was intended to provide access to Oracle databases
Unfortunately, package ROracle is currently not available
The next example was inspired by
http://www.r-bloggers.com/connecting-r-to-an-oracle-database-with-rjdbc/
As the name suggests, the solution requires dealing with some Java "things"

Requirements: JDK/JRE previously installed
Download the ojdbc jar from www.oracle.com (in my case, ojdbc6.jar)
Set JAVA_HOME, set max. memory, and load the rJava library
Sys.setenv(JAVA_HOME='/path/to/java_home')
on my Mac OS:

> Sys.setenv(JAVA_HOME='/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home')
> options(java.parameters="-Xmx2g")
> install.packages("rJava")
> library(rJava)

Access Oracle databases through JDBC (cont.)

Getting some information

Java version

> .jinit()
> print(.jcall("java/lang/System", "S", "getProperty",
+ "java.version"))

classPath (just for the record)

> .jclassPath()
Load RJDBC package
> install.packages("RJDBC")
> library(RJDBC)
Create connection driver and open connection
> jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver",
+ classPath="/Users/admin/Downloads/ojdbc6.jar")

Access Oracle databases through JDBC (cont.)

Open connection

> jdbcConnection <- dbConnect(jdbcDriver,
+ "jdbc:oracle:thin:@//10.10.0.7:1521/orcl",
+ "bd2",
+ "bd2")
Launch the Oracle query and store the result into the data frame st
> st <- dbGetQuery(jdbcConnection,
+ "SELECT * FROM studenti")

Close connection

> dbDisconnect(jdbcConnection)

Import data from MongoDB

Import data from Cassandra

Import data from Hadoop

Read HTML tables from the web

Package needed: XML

> install.packages("XML")
> library(XML)
> myURL <- "http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool/"
> dfHTML <- readHTMLTable(myURL, which=1,
+ header=FALSE,
+ stringsAsFactors = FALSE)

> head(dfHTML,3)
             V1      V2        V3
1 Participant 1 Giant A Patriot Q
2 Participant 2 Giant B Patriot R

Import XML files

Package needed: XML

> library(XML)

Web address of the xml file

> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"

Import

> indata <- xmlToDataFrame(url)
> head(indata, 5)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4

Reading HTML pages with multiple tables

Package needed: XML
> library(XML)
The web page contains text and a number of tables
> url.1 <- 'http://en.wikipedia.org/wiki/World_population'
> tbls.1 <- readHTMLTable(url.1)
> class(tbls.1)
[1] "list"
Display how many tables are on the page
> length(tbls.1)
[1] 28
Read only the 1st table on this page (which=1)
> tbl.1.1 <- readHTMLTable(url.1, which=1,
+ header=FALSE, stringsAsFactors = FALSE)

Importing data from other statistical packages

Package needed: foreign
> install.packages("foreign")
> library(foreign)
Read a Stata data file (.dta)
> states <- read.dta("states.dta")
Read a local SPSS file
> spss1 <- read.spss("p004.sav",
+ use.value.labels = TRUE, to.data.frame = TRUE)
Import the SPSS file directly from a web address
> spss2 <- read.spss("http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav",
+ use.value.labels = TRUE, to.data.frame = TRUE)

Save/export R data objects

Save a data frame as a .csv file
> write.csv(spss2, file = "spss2.csv")
Save a data frame as a tab delimited text file
> write.table(spss2, file = "spss2.txt",
+ sep = "\t", fileEncoding = "UTF-8")
Save a data frame as an Excel (xlsx) file (requires package xlsx)
> write.xlsx(spss2, file = "spss2.xlsx",
+ sheetName="spss2")
> write.xlsx(echipe.4,
+ file = "Centralizator BD2_2013_SIA1.xlsx",
+ sheetName="t4.echipe",
+ row.names=FALSE, append=TRUE, showNA=FALSE)
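One detail worth checking when exporting is the handling of row names: by default write.csv() writes them as an extra first column, which read.csv() then brings back as a column named "X". The round trip below is a small sketch of this, using a hypothetical data frame and temporary files in place of spss2.

```r
# Hypothetical data frame standing in for spss2
df <- data.frame(id = 1:3, score = c(4.5, 6, 7.25))

# Default: row names are written as an extra, unnamed first column
f1 <- tempfile(fileext = ".csv")
write.csv(df, file = f1)
print(names(read.csv(f1)))   # the row names come back as column "X"

# row.names = FALSE writes only the data columns
f2 <- tempfile(fileext = ".csv")
write.csv(df, file = f2, row.names = FALSE)
back <- read.csv(f2)
print(names(back))
```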

Save/export R data objects (cont.)

Save a data frame as a .dta file (requires package foreign)

> write.dta(spss2, file = "spss2.dta")

Save to binary R format (can save multiple datasets and R objects)

> save(invoice.details.ro,
+ file = "invoice.details.ro.RData")
> save(states, spss2, dat.xls,
+ file = "temp.RData")
