
Data Frames

Benchmark Datasets

Data Frames

A data frame is a table (a two-dimensional, array-like structure) in which each
column contains values of one variable and each row contains one set of values
from each column.

Key characteristics of a data frame:

• The column names should be non-empty
• The row names should be unique
• Each column should contain the same number of data items

empdata <- data.frame(
  emp_id = c(1:5),
  emp_name = c("A","B","C","D","E"),
  salary = c(9000.00, 8900.50, 8000.00, 9000, 8900)
)
print(empdata)

Extract Columns / Add Column
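The two operations named above can be sketched in base R against the empdata frame created earlier; the dept values below are invented for the example:

```r
# Recreate the example data frame from the slide
empdata <- data.frame(
  emp_id = 1:5,
  emp_name = c("A", "B", "C", "D", "E"),
  salary = c(9000.00, 8900.50, 8000.00, 9000, 8900)
)

# Extract columns: $ returns a single column as a vector;
# indexing with a vector of names keeps the data-frame shape
print(empdata$salary)
print(empdata[, c("emp_id", "salary")])

# Add a column: the new vector must supply one value per existing row
empdata$dept <- c("IT", "HR", "IT", "HR", "IT")
print(empdata)
```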
Tabular Data Creation using DATA FRAMES

students <- data.frame(
  rollno = c(1, 2, 3),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  marks = c(98, 89, 97)
)
print(students)

Permanent saving: fix(dataframe) writes the edits back to the object, whereas edit(dataframe) only returns an edited copy that must be assigned to keep the changes.


# Create the first data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  dept = c("IT","Operations","IT","HR","Finance"),
  stringsAsFactors = FALSE
)

Adding Rows to a Data Frame

# Create the second data frame.
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  dept = c("IT","Operations","Finance"),
  stringsAsFactors = FALSE
)

# Bind the two data frames.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
Datasets in R

• library(MASS) loads the package MASS (Modern Applied Statistics with S) into
  memory.
• The command data() lists all the datasets in the loaded packages.
• The command data(phones) loads the dataset phones into memory.
https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
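Putting the three commands above together in one short session (MASS ships as a recommended package with standard R installs):

```r
library(MASS)   # load the MASS package into memory
# data()        # would list all datasets in the loaded packages
data(phones)    # load the 'phones' dataset into the workspace
str(phones)     # a list with components 'year' and 'calls'
```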
Importing Datasets
Importing Datasets from External Files
Reading Text Files

Common issue: the file must end with a blank (newline-terminated) line;
otherwise read.table() reports an "incomplete final line" warning.
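A self-contained sketch of the point above, using a temporary file rather than the slide's d:\ path; writeLines() terminates the file with the newline that read.table() expects:

```r
# Write a small whitespace-separated text file, then read it back
path <- tempfile(fileext = ".txt")
writeLines(c("id name score",
             "1  A    98",
             "2  B    89"), path)   # writeLines appends a final newline

df <- read.table(path, header = TRUE)
print(df)
```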
Performance Evaluation on Reading Data

library(data.table)
start_time <- Sys.time()
DT <- fread("d:\\g.csv")
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
DT <- read.csv("d:\\g.csv")
end_time <- Sys.time()
end_time - start_time

start_time <- Sys.time()
DT <- read.table("d:\\g.csv", header = TRUE, sep = ",")  # read.table needs the separator for CSV
end_time <- Sys.time()
end_time - start_time
print("csv")
system.time(DT <- read.csv("d:\\g.csv"))
print("fread")
system.time(DT <- fread("d:\\g.csv"))
print("table")
system.time(DT <- read.table("d:\\g.csv", header = TRUE, sep = ","))

library(sqldf)
options(max.print = 9000000)
s <- sqldf("select * from DT")

# options(max.print = .Machine$integer.max) # extreme version

"User CPU time" gives the CPU time spent by the current process (i.e., the
current R session) and "system CPU time" gives the CPU time spent by the kernel
(the operating system) on behalf of the current process.


Missing Value Identification and Imputation

is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)

Show All Records having Missing Values: complete.cases() returns TRUE for rows
with no missing values, so its negation lists the rows that contain NA.
Excluding Missing Values from Analysis

Creating New Data without Missing Values
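A base-R sketch of the three operations just listed, on a small invented data frame:

```r
df <- data.frame(id = 1:4, score = c(10, NA, 30, 40))

# Show all records having missing values
print(df[!complete.cases(df), ])

# Exclude missing values from a single computation
print(mean(df$score, na.rm = TRUE))

# Create new data without missing values
clean <- na.omit(df)
print(clean)
```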


Hmisc Package in R for Missing Value Imputation
Harrell Miscellaneous Package

library(Hmisc)
impute(BostonHousing$ptratio, mean)   # replace with mean
impute(BostonHousing$ptratio, median) # replace with median
impute(BostonHousing$ptratio, 20)     # replace with a specific number
# or impute manually
BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <-
  mean(BostonHousing$ptratio, na.rm = TRUE)
Prediction Imputation using kNN Algorithm (Package DMwR)
> knnOutput <- knnImputation(dm1)

> anyNA(knnOutput)
[1] FALSE

> actuals <- Book$second[is.na(Book1$second)]

> predicteds <- knnOutput[is.na(Book1$second), "second"]

> regr.eval(actuals, predicteds)

Mean Based Imputation (Package DMwR)

> actuals <- Book$second[is.na(Book1$second)]

> predicteds <- rep(mean(Book1$second, na.rm=T), length(actuals))

> regr.eval(actuals, predicteds)


Evaluation of kNN and Random Forest Approaches

library(DMwR)
dm1 <- read.csv("d:\\Book1.csv")     # data with missing values
dm1
ori <- read.csv("d:\\original.csv")  # complete original data
ori
actuals <- ori$id3[is.na(dm1$id3)]

knnOutput <- knnImputation(dm1)      # kNN approach
predictedknn <- knnOutput[is.na(dm1$id3), "id3"]
regr.eval(actuals, predictedknn)

library(mice) # Multivariate Imputation by Chained Equations


miceMod <- mice(dm1, method="rf")
miceOutput <- complete(miceMod) # generate the completed data.
anyNA(miceOutput)
predictedmice <- miceOutput[is.na(dm1$id3), "id3"]
predictedmice
regr.eval(actuals, predictedmice)
Evaluation of Accuracy and Error Factors

regr.eval() reports several error factors, including the mean absolute
percentage error (MAPE).
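MAPE can also be computed directly; the numbers below are illustrative, not taken from the slide's data:

```r
actuals    <- c(100, 200, 400)
predicteds <- c(90, 180, 360)

# Mean absolute percentage error: average of |error| / |actual|
mape <- mean(abs((actuals - predicteds) / actuals))
print(mape)  # 0.1, i.e. a 10% average absolute error
```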


Conversion of File Formats

Comma-separated data                 .csv
Pipe-separated data                  .psv
Tab-separated data                   .tsv
Fixed-width format data              .fwf
gzip comma-separated data            .csv.gz
CSVY (CSV + YAML metadata header)    .csvy
Fortran data
SAS                                  .sas7bdat
SAS XPORT                            .xpt
SPSS                                 .sav
SPSS Portable                        .por
Stata                                .dta
Excel                                .xls, .xlsx
R syntax                             .R
Saved R objects                      .RData, .rda
Serialized R objects                 .rds
Feather R/Python interchange format  .feather
Fast Storage                         .fst
Epiinfo                              .rec
Minitab                              .mtp
Systat                               .syd
"XBASE" database files               .dbf
Weka Attribute-Relation File Format  .arff
Data Interchange Format              .dif
JSON                                 .json
Matlab                               .mat
OpenDocument Spreadsheet             .ods
HTML Tables                          .html
Shallow XML documents                .xml
YAML                                 .yml
Clipboard                            (default is tsv)
Google Sheets                        (as comma-separated data)
Package: rio

convert("forconvert.csv", "mtcars.sav")
convert("forconvert.csv", "mtcars.json")
convert("forconvert.csv", "mtcars.xml")
convert("mtcars.json", "mtcars.csv")
convert("mtcars.json", "mtcars.xlsx")
convert("mtcars.json", "mtcars.tsv")
convert("mtcars.json", "mtcars.fwf")
convert("mtcars.json", "mtcars.yml")
Install and Activate Required Packages

install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))

library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)

Declare Twitter API Credentials

api_key <- "YOUR_API_KEY"
api_secret <- "YOUR_API_SECRET"
token <- "YOUR_ACCESS_TOKEN"
token_secret <- "YOUR_ACCESS_TOKEN_SECRET"
Twitter Connection and Fetch Live Tweets

setup_twitter_oauth(api_key, api_secret, token, token_secret)

tweets <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                        n = 100, lang = "en", since = "2014-08-20")
tweets.df <- twListToDF(tweets)

tweets_geolocated <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA",
                                   n = 100, lang = "en",
                                   geocode = '34.04993,-118.24084,50mi',
                                   since = "2014-08-20")
tweets_geolocated.df <- twListToDF(tweets_geolocated)


Operator                      Behavior

Obamacare ACA                 Both "Obamacare" and "ACA"; not case sensitive

"Obamacare ACA"               Exact phrase "Obamacare ACA"; not case sensitive

Obamacare OR ACA              Either "Obamacare" or "ACA" or both; not case
                              sensitive; the OR operator IS case sensitive

Obamacare -ACA                "Obamacare" but not "ACA"

#Obamacare                    Hashtag "Obamacare"

from:BarackObama              Sent from Barack Obama

to:BarackObama                Sent to Barack Obama

@BarackObama                  Referencing Barack Obama's account

Obamacare since:2014-08-25    "Obamacare" and sent since 2014-08-25 (year-month-day)

ACA until:2014-08-22          "ACA" and sent before 2014-08-22


Sentiment Analysis using Tidy Data

Tidy data Structure


• Each variable is a column
• Each observation is a row
• Each type of observational unit is a table

Tidy text format

• A table with one token per row.
• A token is a meaningful unit of text, such as a word, that we are interested
  in using for analysis; tokenization is the process of splitting text into
  tokens.
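A minimal base-R sketch of the one-token-per-row shape described above (the tidytext package's unnest_tokens() produces this directly); the texts are invented:

```r
docs <- data.frame(doc  = c(1, 2),
                   text = c("tidy data is useful",
                            "each token gets a row"),
                   stringsAsFactors = FALSE)

# Split each text on whitespace, emitting one row per token
tokens <- do.call(rbind, lapply(seq_len(nrow(docs)), function(i) {
  data.frame(doc  = docs$doc[i],
             word = tolower(unlist(strsplit(docs$text[i], "\\s+"))),
             stringsAsFactors = FALSE)
}))
print(tokens)
```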
Sentiment Dataset

The tidytext package contains several sentiment lexicons in the sentiments
dataset.

> library(tidytext)
> sentiments

The three general-purpose lexicons are


• AFINN from Finn Årup Nielsen,
• bing from Bing Liu and collaborators, and
• nrc from Saif Mohammad and Peter Turney.
http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Sentiment Analysis of Tweets from Social Media
library(syuzhet)

t <- "this politician is very effective"
tokens <- get_tokens(t, pattern = "\\W")
bing_vector <- get_sentiment(tokens, method = "bing")
bing_vector

t <- "i like the movie Bahubali. It is presenting history"
tokens <- get_tokens(t, pattern = "\\W")
bing_vector <- get_sentiment(tokens, method = "bing")
bing_vector

barplot(bing_vector, xlab = "Word / Token", ylab = "Score", col = "red")
plot(bing_vector, xlab = "Word / Token", ylab = "Score", col = "red")


Sentiment Analysis on AFINN

t <- "this politician is effective"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector

t <- "this politician is good"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector

t <- "this politician is bad"
tokens <- get_tokens(t, pattern = "\\W")
afinn_vector <- get_sentiment(tokens, method = "afinn")
afinn_vector
GAURAV KUMAR

Magma Research and Consultancy Services


Ambala Cantt., Haryana, India

Mobile Number: +91-9034001978

E-mail : kumargaurav.in@gmail.com

http://www.gauravkumarindia.com
