Data Structures in R
Vector
An ordered collection of data of the same type
> a = c(1,2,3)
> a*2
[1] 2 4 6
Matrix
A rectangular table of data of the same type
List
An ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john"
> doe$age
[1] 28
Data frame
It is a table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
  Policy_no premium State
      12345     100    IL
      25486     400    NY
      63254     350    FL
              | Linear | Rectangular
All Same Type | Vector | Matrix
Mixed         | List   | Data frame
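The slide gives no runnable example for the two rectangular structures, so here is a minimal sketch. The object names `m` and `policies` are illustrative; the values are taken from the example table above.

```r
# Matrix: rectangular, every cell has the same type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)

# Data frame: rectangular, each column may have its own type
policies <- data.frame(
  Policy_no = c(12345, 25486, 63254),
  premium   = c(100, 400, 350),
  State     = c("IL", "NY", "FL"),
  stringsAsFactors = FALSE
)
sapply(policies, class)        # one type per column
doubled <- policies$premium * 2  # a column behaves like a vector
doubled
```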
Lapply, Sapply & Apply
•When the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array.
•May be easier and faster than "for" loops
Lapply
•lapply(li, function)
•To each element of the list li, the function function is applied.
•The result is a list whose elements are the individual function results.
> li = list("klaus","martin")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"
[[2]]
[1] "MARTIN"
Sapply
sapply( li, fct )
Like lapply, but tries to simplify the result, by converting it into a vector or array of appropriate size
> li = list("klaus","martin","georg")
> sapply(li, toupper)
[1] "KLAUS" "MARTIN" "GEORG"
> fct = function(x) { return(c(x, x*x)) }
> sapply(1:5, fct)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    4    9   16   25
Apply
apply( arr, margin, fct )
Apply the function fct along some dimensions of the array arr, according to margin, and return a vector or array of the appropriate size.
> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5
> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
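To make the example reproducible, the sketch below rebuilds the matrix x shown on the slide and applies sum along each margin. The names `row_totals` and `col_totals` are illustrative.

```r
# Rebuild the 4 x 3 matrix from the slide, filling row by row
x <- matrix(c(5, 7, 0,
              7, 9, 8,
              4, 6, 7,
              6, 3, 5), nrow = 4, byrow = TRUE)

row_totals <- apply(x, 1, sum)  # margin 1: one result per row
col_totals <- apply(x, 2, sum)  # margin 2: one result per column
row_totals
col_totals
```

For plain sums, base R's rowSums(x) and colSums(x) return the same totals.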
Regular expressions
• grep(pattern, string, value = TRUE)
• grepl(pattern, string)
• sub(pattern, replacement, string) - replace first match
• gsub(pattern, replacement, string) - replace all matches
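A short sketch of the four functions on made-up strings (the vector `labels` is purely illustrative):

```r
labels <- c("FL-retail", "NY-commercial", "FL-commercial")

matched  <- grep("^FL", labels, value = TRUE)  # the matching elements
is_match <- grepl("^FL", labels)               # one TRUE/FALSE per element

first_only <- sub("-", "_", "a-b-c")   # replaces the first match only
all_of_them <- gsub("-", "_", "a-b-c") # replaces every match

matched
is_match
first_only
all_of_them
```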
Importing & Exporting data
Understanding data table
DT[i , j , by ]
• Order of execution: i is evaluated first (row filter), then by (grouping), then j (computed within each group)
DT[ i , j , by ]
DT[ 1 , 3 , 2 ]
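The execution order can be seen in a small sketch. The table and its values are made up for illustration; only the column names segment and premium follow the slides.

```r
library(data.table)

dt <- data.table(
  segment = c("retail", "retail", "commercial", "commercial"),
  premium = c(100, 0, 400, 350)
)

# 1. i  : keep rows with premium > 0
# 2. by : split the remaining rows by segment
# 3. j  : compute the total within each group
res <- dt[premium > 0, .(total = sum(premium)), by = segment]
res
```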
Filtering rows & selecting columns data table
Operation: Subsetting rows by numbers
Syntax:
# Select first to tenth row
dt[1:10 , ]
Operation: Using column names to select rows based on a condition
Syntax:
# Select rows with subln_grp as "PL"
dt[ subln_grp == "PL" , ]
# Select rows with state as Florida and premium greater than zero
dt[ state == "FL" & premium > 0 , ]
Operation: Subsetting rows and selecting columns together
Syntax:
# Select only rows where segment is "retail" and relevant columns
dt[ segment == "retail" , .(policy_no , premium , state) ]
Note - .() is an alias to list(). If a single column is selected with .(), the returned value is a data.table. If .() is not used, the result is a vector.
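The following sketch runs these operations on a small made-up table. The column names follow the slides and the values are invented for illustration.

```r
library(data.table)

dt <- data.table(
  policy_no = c(12345, 25486, 63254, 77777),
  premium   = c(100, 400, 350, 0),
  state     = c("IL", "NY", "FL", "FL"),
  segment   = c("retail", "commercial", "retail", "retail")
)

# Rows: Florida policies with a positive premium
fl_paid <- dt[state == "FL" & premium > 0, ]

# Columns: without .() a single column comes back as a vector,
# with .() it comes back as a one-column data.table
as_vector <- dt[, premium]
as_table  <- dt[, .(premium)]

# Rows and columns together
retail <- dt[segment == "retail", .(policy_no, premium, state)]
retail
```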
Summarizing data table
• Count and aggregation
Operation: Count
Syntax:
# Counting number of policies with PL as their subln_grp
dt[ subln_grp == "PL" , .N ]
Operation: Aggregation including filtering
Syntax:
# Taking average premium and count of policies for which deductible is greater than $100
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ]
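A minimal sketch with invented values. One premium is deliberately missing to show that .N counts the row while mean(..., na.rm = TRUE) skips it.

```r
library(data.table)

dt <- data.table(
  subln_grp  = c("PL", "PL", "CL", "PL"),
  deductible = c(50, 250, 500, 1000),
  premium    = c(100, 400, NA, 350)
)

# Count of PL policies
n_pl <- dt[subln_grp == "PL", .N]

# Count and average premium where deductible > 100
summ <- dt[deductible > 100,
           .(count = .N, average = mean(premium, na.rm = TRUE))]
n_pl
summ
```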
Summarizing data table
• Group by
Operation: Simple group by
Syntax:
# Taking average premium and count of policies by segment
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) , by = segment ]
# Taking average premium and count of policies for which deductible is greater than $100 by segment
dt[ deductible > 100 , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
    by = segment ]
OR
# Grouping by the condition instead of filtering on it; this also returns the groups where ded is FALSE
dt[ , .(count = .N , average = mean(premium , na.rm = TRUE)) ,
    by = .(segment , ded = deductible > 100 ) ]
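The two group-by variants are not interchangeable, which a small sketch makes visible (values invented for illustration): filtering in i drops the low-deductible rows, while grouping by the condition keeps them as separate groups.

```r
library(data.table)

dt <- data.table(
  segment    = c("retail", "retail", "commercial", "commercial"),
  deductible = c(50, 250, 500, 1000),
  premium    = c(100, 400, 300, 350)
)

# Filter first, then group
filtered <- dt[deductible > 100,
               .(count = .N, average = mean(premium)),
               by = segment]

# Group by segment and by the logical flag
flagged <- dt[, .(count = .N, average = mean(premium)),
              by = .(segment, ded = deductible > 100)]

filtered
flagged
```

Keeping only the rows of `flagged` where ded is TRUE gives the same counts and averages as `filtered`.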
Operation: Updating variables
Syntax:
# Doubling the premium amount
dt[ , premium := premium * 2 ]
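A brief sketch of what := does, on made-up data. The column `premium_k` is a hypothetical addition for illustration and does not appear in the slides.

```r
library(data.table)

dt <- data.table(policy_no = c(1, 2), premium = c(100, 400))

# := changes dt in place; no assignment back to dt is needed
dt[, premium := premium * 2]

# := can also create a column (premium_k is a hypothetical name)
dt[, premium_k := premium / 1000]
dt
```

Because the update happens by reference, any other variable pointing at the same table sees the change too; copy(dt) makes an independent copy first when that is not wanted.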
Parameter (melt) | Description
id.vars | ID columns with IDs for multiple entries
measure.vars | Columns containing values to fill into cells
variable.name and value.name | Names of new columns for variables and values derived from old headers
Parameter (dcast) | Description
id ~ y | Formula with an LHS (ID columns containing IDs for multiple entries) and an RHS (columns with values to spread in column headers)
Syntax:
wide_data <- as.data.table(fread("reshape_data.csv"))
long_data <- melt(wide_data, id.vars = c("policy_no" , "state" , "subln_grp"),
                  measure.vars = c("written_premium", "endorsement_premium"),
                  variable.name = "premium_type" , value.name = "premium")
wide_data <- dcast(long_data , policy_no + state + subln_grp ~ premium_type , value.var = "premium")
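Since reshape_data.csv is not available here, the sketch below builds a two-row stand-in with the same column names (the values are invented) and runs the same melt and dcast round trip.

```r
library(data.table)

wide_data <- data.table(
  policy_no           = c(1, 2),
  state               = c("IL", "NY"),
  subln_grp           = c("PL", "CL"),
  written_premium     = c(100, 400),
  endorsement_premium = c(10, 40)
)

# Wide to long: one row per policy and premium type
long_data <- melt(wide_data,
                  id.vars = c("policy_no", "state", "subln_grp"),
                  measure.vars = c("written_premium", "endorsement_premium"),
                  variable.name = "premium_type", value.name = "premium")

# Long back to wide: premium types become column headers again
wide_again <- dcast(long_data,
                    policy_no + state + subln_grp ~ premium_type,
                    value.var = "premium")
long_data
wide_again
```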
Set() family in data table
Operation: Create or update column names by reference
Syntax:
# Change the name of premium field to wrt_prem
setnames(dt, "premium", "wrt_prem")
Operation: Reorder columns by reference
Syntax:
# Change the column order of the data table
setcolorder(dt, c("deductible", "coverage", "segment", "state", "subln_grp",
                  "policy_no", "premium"))
Binning variables in data table
Other operations in data table
Joins in data table
• Right join
Syntax:
right_join <- dt1[dt2, on = "policy_no" ] # right join (dt1 is left, dt2 is right)
right_join <- merge(dt1, dt2 , by = "policy_no" , all.y = TRUE)
dt1 captures written premium and state; dt2 captures coverage and location code information
[Figure: dt1, dt2 and the resulting right_join]
[Figure: dt1, dt2 and the resulting left_join]
[Figure: dt1, dt2 and the resulting inner_join]
inner_join has all rows where dt1's key column values match dt2's key column values
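The three join types can be compared on two small made-up tables. Only the right join syntax appears in the slides; the left and inner forms below are a sketch using merge() arguments and the nomatch argument of data.table, not necessarily what the original deck showed.

```r
library(data.table)

dt1 <- data.table(policy_no = c(1, 2, 3),
                  premium   = c(100, 400, 350),
                  state     = c("IL", "NY", "FL"))
dt2 <- data.table(policy_no = c(2, 3, 4),
                  coverage  = c("A", "B", "C"))

# Right join: every row of dt2, matched dt1 columns where available
right_join <- merge(dt1, dt2, by = "policy_no", all.y = TRUE)

# Left join: every row of dt1
left_join <- merge(dt1, dt2, by = "policy_no", all.x = TRUE)

# Inner join: only policy numbers present in both tables
inner_join <- merge(dt1, dt2, by = "policy_no")

# Bracket form of the inner join; nomatch = NULL drops unmatched rows
inner_join2 <- dt1[dt2, on = "policy_no", nomatch = NULL]
```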
Joins in data table
• Left anti join
Syntax:
left_anti_join <- dt1[ !dt2, on = "policy_no" ]
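A minimal sketch of the anti join on invented tables: it keeps the rows of dt1 whose policy_no has no match in dt2.

```r
library(data.table)

dt1 <- data.table(policy_no = c(1, 2, 3), premium = c(100, 400, 350))
dt2 <- data.table(policy_no = c(2, 3, 4), coverage = c("A", "B", "C"))

# Rows of dt1 with no matching policy_no in dt2
left_anti_join <- dt1[!dt2, on = "policy_no"]
left_anti_join
```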
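Syntax:
# Converting character to date
dt1[ , `:=` (policy_start_date = as.Date(policy_start_date , "%m/%d/%y") ,
             policy_end_date   = as.Date(policy_end_date , "%m/%d/%y"))]
dt2[ , endorsement_eff_dt := as.Date(endorsement_eff_dt , "%m/%d/%y")]
# Rolling backward
# With no on = argument this form relies on keys set beforehand with setkey();
# roll applies to the last join column and -365 limits the backward roll to 365 days
rolling_join <- dt2[dt1 , roll = -365 ]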
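The effect of roll = -365 is easier to see on a tiny example. The sketch below uses an explicit on = clause instead of keys, and the join columns (policy_no plus the two date columns) and the column `endorsement_premium` are assumptions made for illustration; the slides do not show which columns were keyed.

```r
library(data.table)

dt1 <- data.table(
  policy_no         = c(1L, 2L),
  policy_start_date = as.Date(c("2020-01-01", "2020-01-01"))
)
dt2 <- data.table(
  policy_no           = c(1L, 2L),
  endorsement_eff_dt  = as.Date(c("2020-03-01", "2021-06-01")),
  endorsement_premium = c(50, 75)   # hypothetical column
)

# For each dt1 row, look for an endorsement on the start date; if none,
# roll the next later endorsement backward, but only within 365 days
rolled <- dt2[dt1,
              on = .(policy_no, endorsement_eff_dt = policy_start_date),
              roll = -365]
rolled
```

Policy 1 picks up its endorsement from 60 days after the start date, while policy 2's endorsement lies more than 365 days out and is left as NA.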
ggplot2 - Grammar of Graphics plot
Data: in ggplot2, data must be stored as an R data frame
Coordinate system: describes 2-D space that data is projected onto - for example, Cartesian
coordinates, polar coordinates, map projections
Geoms: describe type of geometric objects that represent data - for example, points, lines, polygons
Aesthetics: describe visual characteristics that represent data - for example, position, size, color,
shape, transparency, fill
Scales: for each aesthetic, describe how visual characteristic is converted to display values - for
example, log scales, color scales, size scales, shape scales
Stats: describe statistical transformations that typically summarize data - for example, counts,
means, medians, regression lines
Facets: describe how data is split into subsets and displayed as multiple small graphs
Creating a plot object
ggplot() creates a plot object that can be assigned to a variable
The call can specify the data frame and the aesthetic mappings (visual characteristics that represent data)
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium ))
p
x‐axis position indicates subln_grp
y‐axis position indicates premium
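Since ggplot_data.csv is not available here, this sketch uses a three-row stand-in with the same column names (values invented). It shows that the object returned by ggplot() carries only data and mappings; something is drawn only once a layer is added.

```r
library(ggplot2)

plot_data <- data.frame(
  subln_grp = c("PL", "PL", "CL"),
  premium   = c(100, 350, 400),
  state     = c("IL", "FL", "NY")
)

# Plot object with data and aesthetic mappings, but no layers yet;
# printing p shows empty axes
p <- ggplot(data = plot_data, aes(x = subln_grp, y = premium))

# Adding a geom gives ggplot2 something to draw
p_points <- p + geom_point(size = 2)

# layer_data() returns the data computed for a layer
pts <- layer_data(p_points)
nrow(pts)
```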
Adding a layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_point(size = 2)
Layer
Purpose:
Display the data – allows viewer to see patterns, overall structure, local structure, outliers
Display statistical summaries of the data – allows viewer to see counts, means, medians, IQRs, model
predictions
Data and aesthetics (mappings) may be inherited from ggplot() object or added, changed, or dropped
within individual layers
Each geom_xxx() has a default stat (statistical transformation) associated with it, but the default
statistical transformation can be changed using stat parameter
Adding a geom layer
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = subln_grp , y = premium , color = state))
p + geom_blank()
p + geom_point()
p + geom_jitter()
p + geom_count()
Displaying Statistical Summary
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state))
p + geom_bar()
Already transformed data
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
transfrmd_data <- plot_data[ , count := .N , by = state]
transfrmd_data <- unique(transfrmd_data[,.(count , state)])
p <- ggplot(data = transfrmd_data , aes(x = state , y = count))
p + geom_col()
# or
p + geom_bar(stat = "identity")
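# geom_freqpoly() bins a continuous x, so it needs its own plot object
# rather than the pre-counted p above (x = premium is an assumption here)
ggplot(data = plot_data , aes(x = premium)) + geom_freqpoly(aes(color = state))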
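The difference between letting geom_bar() count and plotting counts that already exist can be checked with layer_data(). The data below is invented, and base R's table() stands in for the data.table aggregation used on the slide.

```r
library(ggplot2)

plot_data <- data.frame(state = c("FL", "FL", "NY", "IL", "FL"))

# geom_bar() uses the count stat by default: ggplot2 counts rows per state
counted <- layer_data(ggplot(plot_data, aes(x = state)) + geom_bar())

# Pre-counted data: one row per state with its count
pre <- as.data.frame(table(state = plot_data$state), responseName = "count")

# geom_col() plots the y values as given (same as stat = "identity")
as_given <- layer_data(ggplot(pre, aes(x = state, y = count)) + geom_col())

counted$count
as_given$y
```

Both routes produce the same bar heights; the choice depends on whether the aggregation should happen inside ggplot2 or before it.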
Displaying Statistical Summaries
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , y = premium))
p + geom_boxplot()
Position
Syntax:
plot_data <- as.data.table(fread("ggplot_data.csv"))
p <- ggplot(data = plot_data , aes(x = state , fill = deductible > 100))
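p + geom_bar()                    # default position is "stack"
p + geom_bar(position="stack")    # groups stacked on top of each other
p + geom_bar(position="dodge")    # groups placed side by side
p + geom_bar(position="fill")     # stacked and scaled to a total height of 1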
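A sketch of how the position setting changes the computed bar geometry, using four invented rows and layer_data() to inspect the result instead of drawing it.

```r
library(ggplot2)

plot_data <- data.frame(
  state      = c("FL", "FL", "NY", "NY"),
  deductible = c(50, 500, 500, 1000)
)
p <- ggplot(data = plot_data, aes(x = state, fill = deductible > 100))

stacked <- layer_data(p + geom_bar(position = "stack"))
dodged  <- layer_data(p + geom_bar(position = "dodge"))
filled  <- layer_data(p + geom_bar(position = "fill"))

max(stacked$ymax)  # tallest stack equals the largest row count per state
max(filled$ymax)   # every stack is rescaled to the same total height
```

With "fill" the bars show proportions within each state, so differences in group size between states are no longer visible.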