
Data Frames

Benchmark Datasets

Data Frames

A data frame is a table (a two-dimensional, array-like structure) in which each
column contains values of one variable and each row contains one set of values
from each column.

Key characteristics of a data frame:

• The column names should be non-empty
• The row names should be unique
• Each column should contain the same number of data items

empdata <- data.frame(
  emp_id = c(1:5),
  emp_name = c("A","B","C","D","E"),
  salary = c(9000.00, 8900.50, 8000.00, 9000, 8900)
)
print(empdata)

Extract Columns / Add Column
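The two operations named above can be sketched in base R against the empdata frame created earlier; the dept values below are invented for the example:

```r
# Recreate the example data frame from the slide
empdata <- data.frame(
  emp_id = 1:5,
  emp_name = c("A", "B", "C", "D", "E"),
  salary = c(9000.00, 8900.50, 8000.00, 9000, 8900)
)

# Extract columns: $ returns a single column as a vector;
# indexing with a vector of names keeps the data-frame shape
print(empdata$salary)
print(empdata[, c("emp_id", "salary")])

# Add a column: the new vector must supply one value per existing row
empdata$dept <- c("IT", "HR", "IT", "HR", "IT")
print(empdata)
```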
Tabular Data Creation using DATA FRAMES

students <- data.frame(
  rollno = c(1, 2, 3),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  marks = c(98, 89, 97)
)
print(students)

Permanent saving: fix(dataframe) writes the edits back to the object, whereas edit(dataframe) only returns an edited copy that must be assigned to keep the changes.


# Create the first data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  dept = c("IT","Operations","IT","HR","Finance"),
  stringsAsFactors = FALSE
)

Adding Rows to a Data Frame

# Create the second data frame.
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  dept = c("IT","Operations","Finance"),
  stringsAsFactors = FALSE
)

# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
Datasets in R

• library(MASS) loads the package MASS (Modern Applied Statistics with S) into
  memory.
• The command data() lists all the datasets in the loaded packages.
• The command data(phones) loads the dataset phones into memory.
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
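Putting the three commands above together in one short session (MASS ships as a recommended package with standard R installs):

```r
library(MASS)   # load the MASS package into memory
# data()        # would list all datasets in the loaded packages
data(phones)    # load the 'phones' dataset into the workspace
str(phones)     # a list with components 'year' and 'calls'
```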
Importing Datasets
Importing Datasets from External Files
Reading Text Files

Common issue: the file must end with a blank (newline-terminated) line;
otherwise read.table() reports an "incomplete final line" warning.
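A self-contained sketch of the point above, using a temporary file rather than the slide's d:\ path; writeLines() terminates the file with the newline that read.table() expects:

```r
# Write a small whitespace-separated text file, then read it back
path <- tempfile(fileext = ".txt")
writeLines(c("id name score",
             "1  A    98",
             "2  B    89"), path)   # writeLines appends a final newline

df <- read.table(path, header = TRUE)
print(df)
```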
Performance Evaluation on Reading Data

library(data.table)
start_time <- Sys.time()
DT <- fread("d:\\g.csv")
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
DT <- read.csv("d:\\g.csv")
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
DT <- read.table("d:\\g.csv", header = TRUE, sep = ",")  # read.table needs the separator for CSV
end_time <- Sys.time()
end_time - start_time
print("csv")
system.time(DT <- read.csv("d:\\g.csv"))
print("fread")
system.time(DT <- fread("d:\\g.csv"))
print("table")
system.time(DT <- read.table("d:\\g.csv", header = TRUE, sep = ","))

library(sqldf)
options(max.print = 9000000)
s <- sqldf("select * from DT")

# options(max.print = .Machine$integer.max) # extreme version

"User CPU time" gives the CPU time spent by the current process (i.e., the
current R session) and "system CPU time" gives the CPU time spent by the kernel
(the operating system) on behalf of the current process.


Missing Value Identification and Imputation

is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Show All Records having Missing Values: complete.cases() returns TRUE for rows
with no missing values, so its negation lists the rows that contain NA.
Excluding Missing Values from Analysis

Creating New Data without Missing Values
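A base-R sketch of the three operations just listed, on a small invented data frame:

```r
df <- data.frame(id = 1:4, score = c(10, NA, 30, 40))

# Show all records having missing values
print(df[!complete.cases(df), ])

# Exclude missing values from a single computation
print(mean(df$score, na.rm = TRUE))

# Create new data without missing values
clean <- na.omit(df)
print(clean)
```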


Hmisc Package in R for Missing Value Imputation
Harrell Miscellaneous Package

library(Hmisc)
impute(BostonHousing$ptratio, mean)   # replace with mean
impute(BostonHousing$ptratio, median) # replace with median
impute(BostonHousing$ptratio, 20)     # replace with a specific number
# or impute manually
BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <-
  mean(BostonHousing$ptratio, na.rm = TRUE)
Prediction Imputation using kNN Algorithm (Package DMwR)
> knnOutput <- knnImputation(dm1)

> anyNA(knnOutput)
[1] FALSE

> actuals <- Book$second[is.na(Book1$second)]

> predicteds <- knnOutput[is.na(Book1$second), "second"]

> regr.eval(actuals, predicteds)

Mean Based Imputation (Package DMwR)

> actuals <- Book$second[is.na(Book1$second)]

> predicteds <- rep(mean(Book1$second, na.rm=T), length(actuals))

> regr.eval(actuals, predicteds)


Evaluation of kNN and Random Forest Approaches

library(DMwR)
dm1 <- read.csv("d:\\Book1.csv")     # data with missing values
dm1
ori <- read.csv("d:\\original.csv")  # complete original data
ori
actuals <- ori$id3[is.na(dm1$id3)]

knnOutput <- knnImputation(dm1)      # kNN approach
predictedknn <- knnOutput[is.na(dm1$id3), "id3"]
regr.eval(actuals, predictedknn)

library(mice) # Multivariate Imputation by Chained Equations


miceMod <- mice(dm1, method="rf")
miceOutput <- complete(miceMod) # generate the completed data.
anyNA(miceOutput)
predictedmice <- miceOutput[is.na(dm1$id3), "id3"]
predictedmice
regr.eval(actuals, predictedmice)
Evaluation of Accuracy and Error Factors

regr.eval() reports several error factors, including the mean absolute
percentage error (MAPE).
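MAPE can also be computed directly; the numbers below are illustrative, not taken from the slide's data:

```r
actuals    <- c(100, 200, 400)
predicteds <- c(90, 180, 360)

# Mean absolute percentage error: average of |error| / |actual|
mape <- mean(abs((actuals - predicteds) / actuals))
print(mape)  # 0.1, i.e. a 10% average absolute error
```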


Conversion of File Formats

Comma-separated data                 .csv
Pipe-separated data                  .psv
Tab-separated data                   .tsv
Fixed-width format data              .fwf
gzip comma-separated data            .csv.gz
CSVY (CSV + YAML metadata header)    .csvy
Fortran data
SAS                                  .sas7bdat
SAS XPORT                            .xpt
SPSS                                 .sav
SPSS Portable                        .por
Stata                                .dta
Excel                                .xls, .xlsx
R syntax                             .R
Saved R objects                      .RData, .rda
Serialized R objects                 .rds
Feather R/Python interchange format  .feather
Fast Storage                         .fst
Epiinfo                              .rec
Minitab                              .mtp
Systat                               .syd
"XBASE" database files               .dbf
Weka Attribute-Relation File Format  .arff
Data Interchange Format              .dif
JSON                                 .json
Matlab                               .mat
OpenDocument Spreadsheet             .ods
HTML Tables                          .html
Shallow XML documents                .xml
YAML                                 .yml
Clipboard                            (default is tsv)
Google Sheets                        (as comma-separated data)
Package: rio

convert("forconvert.csv", "mtcars.sav")
convert("forconvert.csv", "mtcars.json")
convert("forconvert.csv", "mtcars.xml")
convert("mtcars.json", "mtcars.csv")
convert("mtcars.json", "mtcars.xlsx")
convert("mtcars.json", "mtcars.tsv")
convert("mtcars.json", "mtcars.fwf")
convert("mtcars.json", "mtcars.yml")
Install and Activate Required Packages

install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))

library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)

Declare Twitter API Credentials

api_key <- "YOUR_API_KEY"
api_secret <- "YOUR_API_SECRET"
token <- "YOUR_ACCESS_TOKEN"
token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
Twitter Connection and Fetch Live Tweets

setup_twitter_oauth(api_key, api_secret, token, token_secret)

tweets <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                        n = 100, lang = "en", since = "2014-08-20")
tweets.df <- twListToDF(tweets)

tweets_geolocated <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                                   n = 100, lang = "en",
                                   geocode = '34.04993,-118.24084,50mi',
                                   since = "2014-08-20")
tweets_geolocated.df <- twListToDF(tweets_geolocated)


Operator                      Behavior

Obamacare ACA                 Both "Obamacare" and "ACA"; not case sensitive

"Obamacare ACA"               Exact phrase "Obamacare ACA"; not case sensitive

Obamacare OR ACA              Either "Obamacare" or "ACA" or both; not case
                              sensitive; the OR operator IS case sensitive

Obamacare -ACA                "Obamacare" but not "ACA"

#Obamacare                    Hashtag "Obamacare"

from:BarackObama              Sent from Barack Obama

to:BarackObama                Sent to Barack Obama

@BarackObama                  Referencing Barack Obama's account

Obamacare since:2014-08-25    "Obamacare" and sent since 2014-08-25 (year-month-day)

ACA until:2014-08-22          "ACA" and sent before 2014-08-22


Sentiment Analysis using Tidy Data

Tidy data Structure


• Each variable is a column
• Each observation is a row
• Each type of observational unit is a table

Tidy text format

• A table with one token per row.
• A token is a meaningful unit of text, such as a word, that we are interested
  in using for analysis; tokenization is the process of splitting text into
  tokens.
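A minimal base-R sketch of the one-token-per-row shape described above (the tidytext package's unnest_tokens() produces this directly); the texts are invented:

```r
docs <- data.frame(doc  = c(1, 2),
                   text = c("tidy data is useful",
                            "each token gets a row"),
                   stringsAsFactors = FALSE)

# Split each text on whitespace, emitting one row per token
tokens <- do.call(rbind, lapply(seq_len(nrow(docs)), function(i) {
  data.frame(doc  = docs$doc[i],
             word = tolower(unlist(strsplit(docs$text[i], "\\s+"))),
             stringsAsFactors = FALSE)
}))
print(tokens)
```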
Sentiment Dataset

The tidytext package contains several sentiment lexicons in the sentiments
dataset.

> library(tidytext)
> sentiments

The three general-purpose lexicons are


• AFINN from Finn Årup Nielsen,
• bing from Bing Liu and collaborators, and
• nrc from Saif Mohammad and Peter Turney.
http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Sentiment Analysis of Tweets from Social Media
library(syuzhet)

t <- "this politician is very effective"
tokens <- get_tokens(t, pattern = "\\W")
bing_vector <- get_sentiment(tokens, method = "bing")
bing_vector

t <- "i like the movie Bahubali. It is presenting history"
tokens <- get_tokens(t, pattern = "\\W")
bing_vector <- get_sentiment(tokens, method = "bing")
bing_vector

barplot(bing_vector, xlab = "Word / Token", ylab = "Score", col = "red")
plot(bing_vector, xlab = "Word / Token", ylab = "Score", col = "red")


Sentiment Analysis on AFINN

t <- "this politician is effective"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector

t <- "this politician is good"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector

t <- "this politician is bad"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector
GAURAV KUMAR

Magma Research and Consultancy Services


Ambala Cantt., Haryana, India

Mobile Number: +91-9034001978

E-mail : kumargaurav.in@gmail.com

http://www.gauravkumarindia.com
