Levels of Scale Measurement
1. Business researchers use many scales or number systems
2. Traditionally, the level of scale measurement is seen as important because it
determines the mathematical comparisons that are allowable
3. The 4 levels/ types of scale measurement are:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
4. Each type offers the researcher progressively more power in analyzing & testing the
validity of a scale
Nominal Scale
1. Represents the most elementary level of measurement
2. Assigns a value to an object for identification/ classification purposes only
3. Nominal scaling is arbitrary
4. Any set of numbers, letters, or other identifiers is equally valid
5. Note: the numbers do not represent different quantities or the value of the object
UNDERSTANDING FOOD
NUTRITIONAL EDUCATION WITH DATA
Dr. Shaheen M.Sc. (CS), Ph. D. (CSE)
Institute of Public Enterprise, Shamirpet Campus, Hyderabad – 500 101
Email: shahmsc@ipeindia.org Mobile: + (91)98666 66620
Unit II: Business Analytics using R
Introduction to R Programming; Installing R and R Studio; Data Structures in R: Vectors, Dataframes,
Lists, Matrices and Array Operations; Summarizing Data: Numerical and Graphical Summaries; Data
Visualization in R: Histogram, Bar Chart, Scatter Plot, Box Plot, Corrgram, Corrplot, ggplot2; Data
Manipulation in R: Built-in Functions, Apply Functions, Date and Time Functions, Dplyr Functions, Pipe
Operator; Data Transformation: Filtering, Dropping, Merging, Sorting, Reshaping of Data, Detecting
Missing Values in the Data, Imputation; Data Import and Export Techniques in R; Statements:
Conditional Statements and Control Statements.
Statistical Applications: Parametric One and Two Sample Tests: Z-test, t-test, Chi-square test, and
ANOVA; Non-parametric Test: Mann Whitney U test, Wilcoxon test, Kruskal Wallis test; Correlation
Analysis; Simple and Multiple Linear Regression.
Unit III: Business Analytics using Python
Introduction to Python Programming; Installing Python, PyCharm and Anaconda; Data Structures in
Python: Variables, Files, Lists, Dictionaries, Tuples, and Sets; Functions: In-built, User-defined and
Lambda Functions; Statements: Conditional Statements and Control Statements; Exception
Handling.
Data Import and Export Techniques in Python; Summarizing Data: Numerical and Graphical
Summaries; Data Visualization in Python: Matplotlib and Seaborn libraries; Data Analysis: Numpy,
Pandas and Sklearn Libraries; Model building for Simple and Multiple Linear Regression.
2. BUSINESS ANALYTICS USING R STUDIO
CONTENTS
1. Installing R & R Studio
2. Data Types & Data Structures
1. Vectors
2. Dataframes
3. Lists
4. Matrices
5. Arrays
6. Factors
3. Data Input
4. Operators in R
5. Organizing the data
6. Summarizing Data – Numerical summary
1. INSTALLING R & R-STUDIO
• R is an industrial strength open source statistical package
• R & R Studio Download - Takes 2 minutes to install
• R Installation - https://cran.r-project.org/bin/windows/base/old/3.5.2/
• R Studio Installation
• Provides an integrated environment for working in R
• https://www.rstudio.com/products/rstudio/download/
R
• Is a Case sensitive statistical
programming language
• Is an interpreter
• Is a Free open source
platform
• Was developed by R. Gentleman and R. Ihaka at the University of Auckland during the 1990s
• Is becoming lingua franca
for data science - mastering
core skills will be easier in R
• Has High-level data analytics
& statistical functions
• Is an effective data
visualization tool
• Data scientists use R at Google & Facebook, two of the best companies to work for in the modern economy
• Tool of choice for data scientists - At Microsoft - Apply machine learning to data from Bing,
Azure, Office, & data from other (Sales, Marketing & Finance) departments
• Widely used in a variety of companies - Bank of America, Ford, TechCrunch, Uber, & Trulia
R provides many functions to examine features of vectors and other objects
• > (Command prompt) - Commands are entered one at a time
• # - Beginning of comment
• q() or Escape key - Terminates the current R session
• ?command or >help(command) – Seeks Inbuilt help for R
• View(dataset) - Invoke a spreadsheet - style data viewer on a matrix
• Commands are separated either by a semi-colon (;) or by a newline. Elementary commands can be
grouped into one compound expression by braces ({ }). If a command is not complete, R will prompt + on
subsequent lines
• <- or = or -> (assignment operators)- Results of calculations can be stored in objects
• Keyboard Vertical arrow keys - Used to scroll forward & backward through a command history
• Basic functions are available by default. Other functions are contained in packages as (built-in & user-
created functions) that can be attached as needed. They are kept in memory during an interactive
session
o Packages - Collection of R functions, data & compiled code in a well-defined format
o R comes with a standard set of packages (including base, datasets, utils, grDevices, graphics, stats, and
methods), providing a wide range of functions & datasets that are available by default
o There are more than 5,500 user-contributed modules called packages that can be downloaded from
http://cran.r-project.org/web/packages
o library - Directory where packages are stored on your computer
o .libPaths() – Shows where your library is located
o search() - Lists the packages that are loaded and ready for use
o install.packages() – To install a package for the first time
o Like any software, packages are often updated by their authors. To update any package that is already
installed, use the command update.packages()
o installed.packages() – Lists the packages you have, along with their version numbers, dependencies, &
other information
o library(package) - Loads package for current session
o apropos("command") or find("command") - Find the library containing "command"
o library(help=package) - Lists datasets and functions in package
o To use an installed package in R session, you need to load the package using the library() command
o class() - what kind of object is it?
o is.integer() & is.numeric() – Is the data type of the object integer or numeric?
o typeof() - what is the object’s data type (low-level)?
o length() - how long is it? What about two dimensional objects?
o attributes() - does it have any metadata?
o x[i] - access element i of vector x
o names(list) - show names of the elements in list
o list$element - access element of list
o Inf infinity; Example: - 1/0
o NaN - Not a number ; Example: - 0/0
o NA - Missing value, when no value has been assigned
o NULL - Null object ; Example: - An empty list
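The special values listed above each have a matching predicate function; a minimal sketch:

```r
# Special values in R and the predicate functions that detect them
x <- 1/0        # Inf (infinity)
y <- 0/0        # NaN (not a number)
z <- NA         # missing value
w <- NULL       # null object (an empty object)
is.infinite(x)  # TRUE
is.nan(y)       # TRUE
is.na(z)        # TRUE
is.null(w)      # TRUE
```

Note that is.na(NaN) is also TRUE (NaN counts as missing), but is.nan(NA) is FALSE.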
o help.start() - Online access to R manuals
o help(command) or ?command - Help on command
o help.search("command") or ??command - Search the help system
o demo() - Run R demo; Example: - demo(graphics)
o example(function) - Shows example of function
o str(a) - Displays the internal structure of an object
o summary(a) - Displays a summary of object
o dir() - Show files in the current directory
o setwd(path) - Set windows directory to path
o ls() or objects() - Shows objects in the search path
o rm(x,y,...) - Remove object x,y,... from workspace
o rm(list=objects()) - Remove everything created so far
o file - Denotes file name like "path/rscript.r" or "ascii.txt"
o save.image(file) - Save current workspace in file
o load(file) - Load previously saved workspace file
o data(x) - Load specified data set (data frame) x
o source(file) - Execute the R-script file
o sink(file) - Send output to file instead of screen, until sink()
o scan(file) - Read file into a vector or list
o read.table(file,header=TRUE) - Read file into data frame; first line defines column names; space is separator;
see help for options on row naming, NA treatment etc.
o read.csv(file,header=TRUE) - Read comma-delimited file
o read.delim(file,header=TRUE) - Read tab-delimited file
o read.fwf(file,widths) - Read fixed width formatted file; integer vector widths defines column widths
o save(file,x,y,...) - Save x,y,... to file in portable format
o print(x,y,...) - Print objects x,y,... in default format
o format(x,y,...) - Format x,y,...for pretty printing; Example: - print(format(x))
o cat(x,y,...,file="", sep=" ") - Concatenate and print x,y,... to file (default console), using separator sep
o write.table(x,file="",row.names=T,col.names=T, sep=" ") - Print x as data frame, using separator sep; eol is
end-of-line separator, na is missing values string
o write.table(x,"clipboard",sep="\t",col.names=NA) - Write a table to the clipboard for Excel in Windows
o read.delim("clipboard") - Read a table copied from Excel
o foreign - read data stored by Minitab, SAS, SPSS, Stata, e.g. read.dta("statafile"), read.ssd("sasfile"),
read.spss("spssfile")
o x=numeric(n) - Create numeric n-vector with all elements 0
o c(x,y,...) - Stack (concatenate) values or vectors x,y,... to form a long vector
o a:b - Generate the sequence of integers a,...,b; : has priority over arithmetic; e.g. 1:4 + 1 is 2 3 4 5
o seq(from,to,by=) - Generate a sequence with increment by; length= specifies the length
o seq(along=x) - Generate 1, 2, ..., length(x)
o rep(x,n) - Replicate x n times, e.g. rep(c(1,2,3),2) is 1 2 3 1 2 3; each= repeats each element of x that many
times, e.g. rep(c(1,2,3),each=2) is 1 1 2 2 3 3
o x=complex(n) - create complex n-vector with all elements 0
o list(...) - Create list of the named or unnamed arguments; e.g. list(a=c(1,2),b="hi",c=3i)
o factor(x,levels=) - Encode vector x as factor (define groups)
o gl(n,k,length=n*k,labels=1:n) - Regular pattern of factor levels; n is the number of levels, and k is the
number of replications
o expand.grid(x,y,...) - Create data frame with all combinations of the values of x,y,..., e.g.
expand.grid(c(1,2,3),c(10,20))
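The vector-creation helpers above can be tried directly at the prompt; a minimal sketch using only the functions just described:

```r
s  <- seq(2, 10, by = 2)            # 2 4 6 8 10
r1 <- rep(c(1, 2, 3), 2)            # 1 2 3 1 2 3 (whole vector repeated)
r2 <- rep(c(1, 2, 3), each = 2)     # 1 1 2 2 3 3 (each element repeated)
g  <- gl(2, 3, labels = c("A", "B"))          # factor: A A A B B B (2 levels, 3 replications)
grid <- expand.grid(c(1, 2, 3), c(10, 20))    # data frame with all 6 combinations
```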
o Dealing with missing values - One of the most important problems in statistics is incomplete data sets. To
deal with missing values, R uses the reserved keyword NA, which stands for Not Available
is.na() - Tests whether a value is NA. Returns TRUE if the value is NA
Example: - x = NA; is.na(x) TRUE
is.nan() - Tests whether a value is NaN (Not a Number). NA is missing but not NaN, so is.nan() returns FALSE for NA
Example: - x = NA; is.nan(x) FALSE
> is.na(mtcars$mpg)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [17]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> is.nan(mtcars$cyl)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [17]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 DATA TYPES & DATA STRUCTURES
• In any programming language, variables store information
• Based on the data type of a variable, operating system (OS) allocates memory & decides what can be
stored in the reserved memory
Data Type - Example - class() output - Type test / coercion
Numeric - X = 1.5 - print(class(X)) numeric - is.numeric(X) TRUE; X = as.numeric(X) coerces X to numeric
Integer - X = 15L - print(class(X)) integer - is.integer(X) TRUE; X = as.integer(X) coerces X to integer
Complex - Z = 1+2i - print(class(Z)) complex - is.complex(Z) TRUE
Logical - a = 4; b = 6; C = a > b gives FALSE - print(class(C)) logical
Character - X = "R Studio" - print(class(X)) character - is.character(X) TRUE; Y = 1; Y = as.character(Y); print(Y) "1"
Raw - name = "IPE"; r = charToRaw(name) - class(r) raw
all(x) - Tests whether all the resulting elements are TRUE; all(smarks>amarks) FALSE
diff(x) - Gives the difference between the current & next value in the vector; diff(final) 33 -41 -36 26 40 -30 11 -44 11
sum(x) - Calculates the sum of all values in vector x; sum(final) 469
cumprod(x) - Gives the cumulative product of all values in x; cumprod(final) 5.600000e+01 4.984000e+03 2.392320e+05 2.870784e+06 1.090898e+08 8.509004e+09 4.084322e+11 2.409750e+13 3.614625e+14 9.398024e+15
cummin(x) - Gives the minimum of all values in x from the start of the vector until the position of that value; cummin(final) 56 56 48 12 12 12 12 12 12 12
cummax(x) - Gives the maximum of all values in x from the start of the vector until the position of that value; cummax(final) 56 89 89 89 89 89 89 89 89 89
vector = c(10,20,30,40,50,60,50,30,20,50); vector[vector==50] = NA
vector 10 20 30 40 NA 60 NA 30 20 NA
sum(is.na(vector)) 3 # Counts the NA's
rowSums(matrix_name) - Returns the sum of each row; rowSums(cosmeticsexpenditure) 220 260 300
colMeans(matrix_name) - Returns the mean of each column; colMeans(cosmeticsexpenditure) YMen YWomen SCM SCW 20 50 80 110
colSums(matrix_name) - Returns the sum of each column; colSums(cosmeticsexpenditure) YMen YWomen SCM SCW 60 150 240 330
apply(x,MARGIN,FUN) - Works equally well for a matrix as it does for a data frame object; apply(cosmeticsexpenditure,2,mean) YMen YWomen SCM SCW 20 50 80 110; apply(cosmeticsexpenditure,1,mean) 55 65 75; apply(cosmeticsexpenditure,1,mean) 55 75 NA (a row containing NA yields an NA mean)
2.5. Arrays – array()
• While matrices are confined to two dimensions, arrays can be of any number of
dimensions
• Arrays have 2 very important features
• They contain only a single type of value
• They have dimensions
• Syntax: array(vector, dimensions, dimnames)
where vector – data for the array; dimensions – numeric vector giving maximal index
for each dimension; & dimnames – optional list of dimension labels
• Dimensions of an array determine the type of the array. An array with 2 dimensions is
a matrix. A data structure with more than 2 dimensions is an array
dim1 = c("A1","A2")
dim2 = c("B1","B2","B3")
dim3 = c("C1","C2","C3","C4")
z = array(1:24, c(2,3,4), dimnames=list(dim1,dim2,dim3))
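Continuing the example above, elements of z can be accessed by position or by dimension label; values fill the first dimension fastest:

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))
dim(z)               # 2 3 4
z[1, 2, 3]           # 15: element at position 1 + (2-1)*2 + (3-1)*2*3
z["A2", "B1", "C1"]  # 2, the same element as z[2, 1, 1]
```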
2.6. Factors – factor()
• Factors store vector along with the distinct values of the elements as labels
• Labels are always character irrespective of whether it is numeric or character or Boolean etc.
in the input vector
• Factors are created using the factor() function. The nlevels() function gives the count of levels
iphone_colors = c('green','green','yellow','red','red','red','green')
factor_iphone = factor(iphone_colors) # Create a factor object
print(factor_iphone) # Print the factor
green green yellow red red red green Levels: green red yellow
print(nlevels(factor_iphone)) 3
3. Data Input
• Data comes from a variety of sources & in a variety of formats
• Reading data into a statistical system for analysis & exporting the results to some other
system for report writing can be frustrating tasks that can take far more time than the
statistical analysis itself
• R provides a wide range of tools for importing data. This section describes the import &
export facilities available either in R itself or via packages
• Definitive guide for importing data in R is the R Data Import/ Export manual available
at http://mng.bz/urwn
• Statistical systems like R are not particularly well suited to manipulations of large-scale
data
• Easiest form of data to import into R is a simple text file, and this will often be
acceptable for problems of small or medium scale. The primary function to import from
a text file is scan, & this underlies most of the more convenient functions
• Statistical consultants are often presented with data in a proprietary binary format (an Excel
spreadsheet or SPSS file) by a client. This section discusses the facilities available to
access such files directly from R
• For much larger databases it is common to handle the data using a database
management system (DBMS). For many such DBMSs the extraction operation can be
done directly from an R package
1. Importing data from a delimited text file
read.table() - Data can be imported from a delimited text file. It reads a file in a table
format & saves it as a data frame. Each row of the table appears as one line in the file
Syntax: - mydf1 = read.table(file, header=logical_value, sep="delimiter",
row.names="name")
where file - a delimited ASCII file
header - Logical value indicating whether first row contains variable names (TRUE/FALSE)
sep - specifies delimiter separating data values (“,” , “\t”, “ “)
row.names - Optional parameter; Specify 1 or more variables to represent row identifiers
Example: - grades = read.table("C:/Users/npm/Desktop/IPE - Class of 2019 - Reference
Material/Business Analytics for Managers - 2020/studentmarks.csv", header=TRUE,
sep=",", row.names="StudentID")
grades
Note: - read.table() function has many additional options for fine-tuning the data
import. See help(read.table) for details
2. Importing data from Excel
read.xlsx() – Imports Excel worksheets directly using xlsx package. xlsx package can
be used to read, write, & format Excel 97/ 2000/ XP/ 2003/ 2007 files. It imports a
worksheet into a dataframe
Syntax: - mydf2 = read.xlsx(file, n) or mydf3 = read_excel(file) (read_excel() comes from the readxl package)
where file – Path to an Excel workbook
n – Index of the worksheet to be imported
Example: - HI = read_excel("C:/Users/npm/Desktop/IPE - Class of 2019 - Reference
Material/Business Analytics for Managers - 2020/Health_Insurance.xlsx")
HI
Note: - xlsx package can do more than import worksheets. It can create and
manipulate Excel XLSX files as well. Programmers who need to develop an interface
between R and Excel should check out this relatively new package
4. OPERATORS IN R
• An operator performs specific mathematical/ logical manipulations
• R is rich in built-in operators. Operators in R include:
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
Order of the operations
1. Exponentiation
2. Multiplication & Division in the order in which the operators are presented
3. Addition & Subtraction in the order in which the operators are presented
4. Mod operator (%%) & integer division operator (%/%) have the same priority as the
normal division operator (/) in calculations
5. Basic order of operations in R: Parentheses (), Exponents (^), Multiplication, Division,
Addition & Subtraction (PEMDAS)
6. Operations put between parentheses are carried out first
Arithmetic Operators
Operator - Description - Example
x + y - y added to x - 2 + 3 = 5
x - y - y subtracted from x - 8 - 2 = 6
x * y - x multiplied by y - 2 * 3 = 6
x / y - x divided by y - 10 / 5 = 2
x ^ y - x raised to the power y - 2 ^ 3 = 8
x %% y - Remainder of x divided by y (x mod y) - 7 %% 3 = 1
x %/% y - x divided by y but rounded down (integer divide) - 7 %/% 3 = 2
trunc(x) - Integer part of x - trunc(2.5) = 2; trunc(-2.5) = -2
ceiling(x) - Round up to nearest integer - ceiling(2.5) = 3
Relational Operators
x > y - Returns TRUE if x is larger than y
x < y - Returns TRUE if x is smaller than y
x >= y - Returns TRUE if x is larger than or exactly equal to y
x <= y - Returns TRUE if x is smaller than or exactly equal to y
%*% - Matrix multiplication, e.g. to multiply a matrix with its transpose:
M = matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE); NewMatrix = M %*% t(M)
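A short sketch of the operators above, including the matrix product from the last row:

```r
2 + 3         # 5
7 %% 3        # 1  (remainder)
7 %/% 3       # 2  (integer division)
2 ^ 3         # 8
trunc(-2.5)   # -2 (integer part)
ceiling(2.5)  # 3  (round up)
M <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
P <- M %*% t(M)   # product of M with its transpose, a 2 x 2 matrix
```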
5. ORGANIZING THE DATA
(FREQUENCY AND CONTINGENCY TABLES)
• Tables, diagrams & graphs provide
1. easy way to assimilate summaries of data
2. part way in describing data
• In this section, we’ll look at frequency & contingency tables for categorical variables
1. Measures of Central Tendency - The tendency of the data values to cluster around a defined
typical value
2. Measures of Dispersion/ Variation - Amount of data values that are dispersed about a typical
value
3. Measures of Skewness - Measure of the lack of symmetry in a distribution
4. Measures of Kurtosis - Measure of the degree of peakedness in the distribution
SUMMARY - DATA DESCRIPTORS
UNIVARIATE ANALYSIS
Central Tendency
Arithmetic Mean: X-bar = (sum of Xi for i = 1 to n) / n
Geometric Mean: XG = (X1 × X2 × ... × Xn)^(1/n)
Median - Mid point of ranked values; Mode - Most frequently observed value
Arithmetic Mean/ Average - Sum of all the observed values of the variable divided by the
number of cases
o Affected by extreme values (outliers)
o Example: the values 1, 2, 3, 4, 5 have mean (1+2+3+4+5)/5 = 15/5 = 3; replacing 5 with 10
gives (1+2+3+4+10)/5 = 20/5 = 4, so a single outlier pulls the mean from 3 to 4
Geometric Mean (GM) - Positive root of the product of observations; used to measure the
rate of change of a variable over time: XG = (X1 × X2 × ... × Xn)^(1/n)
Example: Suppose you receive a 5 percent increase in salary this year & a 15 percent
increase next year. The average annual percent increase is 9.886, not 10.0, because
GM = ((1.05)(1.15))^(1/2) = 1.09886
The GM rate of return measures the status of an investment over time, where Ri is the
rate of return in time period i: RG = [(1+R1)(1+R2)...(1+Rn)]^(1/n) - 1
Harmonic Mean - Number of observations divided by the sum of the reciprocals of the
observations: H = n / (sum of 1/xi for i = 1 to n)
Appropriate for situations when the average of rates is desired
Useful for ratios such as speed (= distance/time) etc.
Median = 3 for both data sets shown above; unlike the mean, the median is not pulled by the outlier
Mode – Value in the data set that occurs with greatest frequency
Example: in one data set (values spread over 0 to 14) the Mode = 9; in another (0 1 2 3 4 5 6,
each value occurring once) there is No Mode
CONTD...
Percentile - Provides information about how the data are spread over
the interval from the smallest value to the largest value
• pth percentile is a value such that at least p percent of the items take on
this value or less and at least (100 - p) percent of the items take on this
value or more
• Arrange the data in ascending order
• Compute the index i = (p/100)n, the position of the pth percentile
• If i is not an integer, round up; the pth percentile is the value in that position
• If i is an integer, the pth percentile is the average of the values in positions i and i+1
Ranked data (n = 70):
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
80th percentile: i = (p/100)n = (80/100)70 = 56; since i is an integer, average the 56th & 57th
values: 80th percentile = (535 + 549)/2 = 542
Check: 56/70 = .8, so at least 80% of the items take on a value of 542 or less; 14/70 = .2, so at
least 20% of the items take on a value of 542 or more
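The percentile rule above can be written as a small helper function (hypothetical name pth_percentile); note that R's built-in quantile() uses a different interpolation method by default, so its answer can differ slightly from this textbook rule:

```r
pth_percentile <- function(x, p) {
  x <- sort(x)                 # arrange the data in ascending order
  i <- (p / 100) * length(x)   # index i = (p/100) * n
  if (i == floor(i)) {
    (x[i] + x[i + 1]) / 2      # i is an integer: average positions i and i+1
  } else {
    x[ceiling(i)]              # i is fractional: round up
  }
}
pth_percentile(1:10, 80)   # i = 8, so average of 8th & 9th values: (8 + 9)/2 = 8.5
pth_percentile(1:10, 25)   # i = 2.5, round up to 3, so the 3rd value: 3
```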
CONTD…
Quartiles (Q1, Q2, Q3) - Split the ranked data into 4 segments, each containing an equal
number (25%) of the values
The first quartile, Q1, is the value for which 25% of the observations
are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
Using the same ranked data set of 70 values:
i = (p/100)n = (75/100)70 = 52.5; since i is not an integer, round up to 53
Third quartile = 75th percentile = 53rd value = 525
CONTD…
Statistic Formula Excel Formula Pro’s Con’s
Sum all values Familiar and uses all
divided by the Influenced by
Mean =AVERAGE(Data) the sample
number of values extreme values
information
Variation
Interquartile Range - Difference between the third quartile & the first quartile. Unlike the
simple range, which is very sensitive to the smallest and largest data values, the IQR ignores
the extremes
Using the same ranked data set of 70 values:
1st Quartile (Q1) = 445; 3rd Quartile (Q3) = 525
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
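The quartile positions from the slides, Q1 at (n+1)/4 and Q3 at 3(n+1)/4, can be applied in R; a sketch on a small hypothetical ordered data set (n = 11, chosen so all positions are whole numbers):

```r
x  <- c(425, 430, 445, 450, 460, 475, 480, 490, 500, 510, 525)  # hypothetical ranked data
n  <- length(x)
q1 <- x[(n + 1) / 4]        # position 3  -> 445
q2 <- x[(n + 1) / 2]        # position 6  -> 475 (the median)
q3 <- x[3 * (n + 1) / 4]    # position 9  -> 500
iqr <- q3 - q1              # 500 - 445 = 55
```

R's built-in quantile() and IQR() interpolate differently by default, so they can give slightly different values than these textbook positions.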
CONTD...
Variance - Average of the squared differences between each data value and the mean
• Difference between the value of each observation (xi) & the mean (x-bar for a sample, mu for a
population)
• Population variance: sigma^2 = (sum of (xi - mu)^2) / N
• Sample variance: s^2 = (sum of (xi - x-bar)^2) / (n - 1)
Example: The hourly wages for a sample of part-time employees at Home Depot are $12, $20,
$16, $18, and $19. What is the sample variance?
x-bar = (12+20+16+18+19)/5 = 17; sum of squared deviations = 25 + 9 + 1 + 1 + 4 = 40;
s^2 = 40/(5 - 1) = 10
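The Home Depot wage example can be checked directly in R; var() uses the sample (n - 1) denominator:

```r
wages <- c(12, 20, 16, 18, 19)
m  <- mean(wages)                # 85/5 = 17
ss <- sum((wages - m)^2)         # 25 + 9 + 1 + 1 + 4 = 40
s2 <- ss / (length(wages) - 1)   # sample variance = 40/4 = 10
var(wages)                       # 10, the same result
sd(wages)                        # sqrt(10), the sample standard deviation
```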
CONTD...
Coefficient of Variation: CV = (S / X-bar) × 100%
• Used to compare/ measure relative variation between two or more sets of data in percentage (%)
• The coefficient of variation indicates how large the standard deviation is in relation
to the mean
Both stocks have the same standard deviation, but stock B is less
variable relative to its price
CONTD...
Same center, different variation
Standard Deviation: s = sqrt( (sum of (xi - x-bar)^2 for i = 1 to n) / (n - 1) )
Excel: STDEV(B4:B16) = 22.20967
Interpretation
For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0. The distribution is
skewed to the right (+0.61) and somewhat flatter than a normal distribution (-0.37). The majority of the
automobiles considered in the Motor Trend Car Road Tests (mtcars) dataset give less mileage
than the mean, while very few give more mpg.
Caselet 2 – Cars93 Dataset (In-built)
The Cars93 data frame has data of 93 Cars on Sale in the USA in 1993 arranged in 93 rows and 27 columns.
The descriptions of the variables in the data set are as follows:
1. Manufacturer - Manufacturer
2. Model - Model
3. Type – A factor with levels “Small", "Sporty", "Compact", "Midsize", "Large“ & "Van"
4. Min.Price – Minimum Price ($1000): Price for a basic version
5. Price – Mid-range Price ($1000): Average of Min.Price & Max.Price
6. Max.Price – Maximum Price ($1000): Price for a premium version
7. MPG.city – City MPG (miles per US gallon by EPA rating)
8. MPG.highway – Highway MPG
9. AirBags – Air Bags standard. Factor: none, driver only, or driver & passenger
10. DriveTrain – Drive train type: rear wheel, front wheel or 4WD; (factor)
11. Cylinders – No. of cylinders (missing for Mazda RX-7, which has a rotary engine)
12. EngineSize - Engine size (litres)
13. Horsepower - Horsepower (maximum)
14. RPM - RPM (revs per minute at maximum horsepower)
15. Rev.per.mile - Engine revolutions per mile (in highest gear)
16. Man.trans.avail - Is a manual transmission version available? (yes or no, Factor)
17. Fuel.tank.capacity - Fuel tank capacity (US gallons)
18. Passengers - Passenger capacity (persons)
19. Length - Length (inches)
20. Wheelbase - Wheelbase (inches)
21. Width - Width (inches)
22. Turn.circle - U-turn space (feet)
23. Rear.seat.room - Rear seat room (inches) (missing for 2-seater vehicles)
24. Luggage.room - Luggage capacity (cubic feet) (missing for vans)
25. Weight - Weight (pounds)
26. Origin - Of non-USA or USA company origins? (factor)
27. Make - Combination of Manufacturer and Model (character)
Assignment
1. Load the data set Cars93 with data(Cars93, package="MASS") and set randomly any
5 observations in the variables Horsepower and Weight to NA (missing values)
2. Calculate the arithmetic mean & the median of the variables Horsepower and
Weight
3. Calculate the standard deviation and the interquartile range of the variable Price
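A sketch of one way to work the assignment (assumes the MASS package, which ships with standard R distributions, is installed; set.seed() is added here so the "random" NAs are reproducible):

```r
data(Cars93, package = "MASS")
set.seed(1)                               # reproducible random choice of rows
idx <- sample(nrow(Cars93), 5)            # pick 5 random row positions
Cars93$Horsepower[idx] <- NA
Cars93$Weight[idx]     <- NA
mean(Cars93$Horsepower, na.rm = TRUE)     # na.rm = TRUE drops the NAs first
median(Cars93$Horsepower, na.rm = TRUE)
mean(Cars93$Weight, na.rm = TRUE)
median(Cars93$Weight, na.rm = TRUE)
sd(Cars93$Price)                          # standard deviation of Price
IQR(Cars93$Price)                         # interquartile range of Price
```

Without na.rm = TRUE, mean() and median() would return NA once the missing values are present.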
UNDERSTANDING FOOD
NUTRITIONAL EDUCATION WITH DATA
1. Let's learn about the structure of the dataset – str(USDA)
• There are 7058 observations collected on 16 variables
• Variables include: ID – Unique Identification no. starting with 1001; Description – Text description of
each food item studied; Calories – Amount of calories in 100 grams of food, measured in kilocalories;
Protein, TotalFat, Carbohydrate, SaturatedFat, Sugar – measured in grams; Sodium, Cholesterol,
Calcium, Iron, Potassium, VitaminC – measured in milligrams; VitaminE & VitaminD – measured in
international units
Source: MITOpenCourseware - An Introduction to Analytics - Understanding Food Nutritional
Education with Data (recitation), Video 4: Creating Plots in R
CASE STUDY
AN ANALYTICAL DETECTIVE
• Motor Vehicle Theft crimes - City of Chicago, Illinois, United States
Source: MITOpenCourseware - An Introduction to Analytics - Assignment 1
Crime is an international concern, but it is documented & handled in very different ways
in different countries. In the United States, violent crimes & property crimes are recorded
by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, &
some cities release data regarding crime rates. There are two main types of crimes:
violent crimes, and property crimes. In this problem, we'll focus on one specific type of
property crime, called "motor vehicle theft" (sometimes referred to as grand theft auto).
This is the act of stealing, or attempting to steal, a car.
Chicago is the third most populous city in the United States, with a population of over 2.7
million people. In this problem, we'll use some basic data analysis in R to understand the
motor vehicle thefts in Chicago.
Variables Description: ID: a unique identifier for each observation; Date: the date the crime occurred;
LocationDescription: the location where the crime occurred; Arrest: whether or not an arrest was made for
the crime (TRUE if an arrest was made, & FALSE if an arrest was not made); Domestic: whether or not the
crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was
domestic, and FALSE if it was not domestic); Beat: the area, or "beat" in which the crime occurred. This is the
smallest regional division defined by the Chicago police department; District: the police district in which the
crime occurred. Each district is composed of many beats and is defined by the Chicago Police
Department; CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago
has been divided into what are called "community areas", of which there are now 77. The community areas
were devised in an attempt to create socially homogeneous regions; Year: the year in which the crime
occurred; Latitude: the latitude of the location at which the crime occurred; Longitude: the longitude of the
location at which the crime occurred.
1. Read the dataset into R. It may take a few minutes to read in the data, as the dataset
is pretty large.
2. How many rows of data (observations) are in this dataset? How many variables are
there in the dataset? (Hint: Use str() & summary() functions to answer the following
questions.)
3. What is the maximum value of the variable "ID"? (Hint: Use max() function)
4. What is the minimum value of the variable "Beat"?
5. How many observations have value TRUE in the Arrest variable (this is the number of
crimes for which an arrest was made)?
6. How many observations have a LocationDescription value of ALLEY?
7. The Chicago Police Department wishes to increase the number of arrests made for motor
vehicle thefts in 5 locations. On which locations should the police focus (excluding others)?
8. Create a subset of your data, only taking observations for which the theft happened in
one of these five locations, and call this new data set "Top5". How many observations
are in Top5? (Hint: Alternately, you could create five different subsets, and then merge
them together into one data frame using rbind.)
9. On which day of the week do the most motor vehicle thefts at gas stations happen?
10.One of the locations has a much higher arrest rate than the other locations. Which is
it?
1. Read the dataset into R. It may take a few minutes to read in the data, as the
dataset is pretty large.
Import Dataset button
2. How many rows of data (observations) are in this dataset? How many variables are
there in the dataset? (Hint: Use str() & summary() functions to answer the following
questions.)
str(cc)
dim(cc)
3. What is the maximum value of the variable "ID"? (Hint: Use max() function)
max(cc$ID)
4. What is the minimum value of the variable "Beat"?
min(cc$Beat)
summary(cc)
5. How many observations have value TRUE in the Arrest variable (this is the number of
crimes for which an arrest was made)?
summary(cc)
6. How many observations have a LocationDescription value of ALLEY?
table(cc$LocationDescription)
sum(cc$LocationDescription == "ALLEY")
summary(cc)
7. The Chicago Police Department wishes to increase the number of arrests made for motor vehicle thefts in 5
locations. On which locations should the police focus (excluding others)?
sort(table(cc$LocationDescription))
Locations with the largest number of motor vehicle thefts are listed last. These are Street, Parking Lot/
Garage (Non-Resid.), Residential Yard (Front/Back), Alley, & Vehicle Non-Commercial
8. Create a subset of your data, taking only observations for which the theft happened in one of these five
locations, and call this new data set "Top5". How many observations are in Top5? Alternately, you could create
five different subsets, and then merge them together into one data frame using rbind.
Top5 = subset(cc, LocationDescription=="STREET" | LocationDescription=="PARKING
LOT/GARAGE(NON.RESID.)" | LocationDescription=="RESIDENTIAL YARD (FRONT/BACK)"|
LocationDescription=="ALLEY" | LocationDescription=="VEHICLE NON-COMMERCIAL")
str(Top5)
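Question 9 above (the day of the week with the most gas-station thefts) has no worked answer in the slides. A minimal sketch follows; the Date format "%m/%d/%y %H:%M" is an assumption about the Chicago export, and a tiny stand-in data frame is used in place of the real cc:

```r
# Hypothetical stand-in for the Chicago crime data frame 'cc';
# the Date format "%m/%d/%y %H:%M" is an assumption about the export.
cc <- data.frame(
  Date = c("12/31/12 23:15", "01/01/13 00:10", "01/02/13 12:00"),
  LocationDescription = c("GAS STATION", "GAS STATION", "STREET"),
  stringsAsFactors = FALSE
)
# Parse the timestamp, extract the day of the week, and tabulate
# gas-station thefts by weekday
cc$Weekday <- weekdays(as.Date(strptime(cc$Date, "%m/%d/%y %H:%M")))
tab <- table(cc$Weekday[cc$LocationDescription == "GAS STATION"])
tab
```

With the full dataset, names(which.max(tab)) gives the weekday with the most gas-station thefts.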
Example: Binomial Distribution, n = 3, p = .5

x     P(x)
0     0.125
1     0.375
2     0.375
3     0.125
Total 1.000

[Figure: bar chart of P(x) against x = 0, 1, 2, 3 for the binomial distribution with n = 3, p = .5]
[Figure: "Minutes to Complete Task" – a continuous distribution over 1 to 6 minutes; the shaded area represents the probability that the task takes between 2 & 3 minutes]
Theoretical Distributions
• Mathematical models – used to deduce mathematically what
the distribution of expected values should be, on the basis of
previous experiences or theoretical considerations
• There are 2 types of theoretical distributions:
• Discrete Theoretical distributions
• Continuous Theoretical distributions
[Excel screenshot: the binomial function's inputs are # successes, # trials,
P(success), and a cumulative flag (i.e., return P(X≤x)?);
the shown result is P(X=2) = .3020]
MSA Electronics is experimenting with the manufacture of a new
transistor. Every hour a sample of 5 transistors is taken. The
probability of one transistor being defective is 0.15.
What is the probability of finding 3, 4, or 5 defective? Also find the
expected value (or mean) & variance of a binomial distribution.
Given
n = 5, p = 0.15, & r = 3, 4, or 5
P(3 or more defective) = P(3) + P(4) + P(5)
= 0.0244 + 0.0022 + 0.0001
= 0.0267
Expected value = np = 5(0.15) = 0.75
Variance = np(1 − p) = 5(0.15)(0.85) = 0.6375
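The hand computation above can be checked with R's dbinom():

```r
# P(3, 4, or 5 defective) for n = 5 trials, p = 0.15
p_def <- sum(dbinom(3:5, size = 5, prob = 0.15))  # ~ 0.0266
# Mean and variance of the binomial distribution
mu  <- 5 * 0.15          # np = 0.75
v   <- 5 * 0.15 * 0.85   # np(1 - p) = 0.6375
```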
On average, 20% of the emergency room patients at Greenwood
General Hospital lack health insurance. In a random sample of 4
patients, what is the probability that at least 2 will be uninsured?
1. What is the mean and standard deviation of this binomial distribution?
2. What is the probability that the sample of 4 patients will contain at least 2 uninsured
patients?
3. What is the probability that fewer than 2 patients have insurance?
4. What is the probability that no more than 2 patients have insurance?
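These questions can be answered with pbinom(); a short sketch for questions 1 and 2 (questions 3 and 4 concern patients *with* insurance, i.e. a binomial with p = 0.8):

```r
n <- 4; p <- 0.2
# 1. Mean and standard deviation of the binomial distribution
mu    <- n * p                  # 0.8
sigma <- sqrt(n * p * (1 - p))  # 0.8
# 2. P(at least 2 uninsured) = 1 - P(X <= 1)
p_at_least2 <- 1 - pbinom(1, size = n, prob = p)  # ~ 0.1808
```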
Parameters: λ = mean arrivals per unit of time; x = desired number of occurrences
p(x) = probability of exactly x arrivals/occurrences = (λ^x e^(−λ)) / x!
Mean = λ
St. Dev. = √λ
A statistics instructor has observed that the number of typographical
errors in new editions of textbooks varies considerably from book to book.
1. After some analysis he concludes that the number of errors is Poisson
distributed with a mean of 1.5 per 100 pages. The instructor randomly
selects 100 pages of a new book. What is the probability that there are
no typos?
P(X=0) = e^(−1.5) ≈ 0.223 – there is about a 22% chance of finding zero errors.
2. Suppose the instructor has just received a copy of a new statistics book. He
notices that there are 400 pages, so at 1.5 errors per 100 pages the mean is
6 errors for the book.
a) What is the probability that there are no typos?
P(X=0) = e^(−6) ≈ 0.0025 – there is a very small chance there are no typos.
b) What is the probability that there are five or fewer typos?
P(X≤5) = P(0) + P(1) + … + P(5); for µ = 6, P(X ≤ 5) = .446 – there is
about a 45% chance there are 5 or fewer typos.
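The instructor's answers can be reproduced with dpois() and ppois():

```r
p0_100 <- dpois(0, lambda = 1.5)  # P(no typos in 100 pages) ~ 0.223
p0_400 <- dpois(0, lambda = 6)    # P(no typos in the 400-page book) ~ 0.0025
p_le5  <- ppois(5, lambda = 6)    # P(X <= 5) ~ 0.446
```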
MS EXCEL
We are investigating the safety of dangerous intersection.
Past police records indicate the mean of 5 accidents per
month at the intersection. The number of accidents is
distributed according to a Poisson distribution, & the
Highway Safety Division wants us to Calculate the
probability in any month of exactly 0,1,2,3, or 4 accidents.
Highway Safety Division will take action to improve the
intersection if probability of more than 3 accidents per
month exceeds 0.65. Should they act?
Calculate the probability of having 0, 1, 2, or 3 accidents and then subtract
the sum from 1.0 to get the probability of more than 3 accidents
P(3 or fewer) = P(0) + P(1) + P(2) + P(3)
= 0.00674 + 0.03369 + 0.08422 + 0.14037 = 0.26502
Probability of more than 3 accidents: P(more than 3) = 1 − 0.26502
= 0.73498
As 0.73498 exceeds 0.65, steps should be taken to improve the
intersection
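The same answer falls out of ppois() in one line:

```r
p_le3   <- ppois(3, lambda = 5)  # P(3 or fewer accidents) ~ 0.265
p_more3 <- 1 - p_le3             # P(more than 3) ~ 0.735
p_more3 > 0.65                   # TRUE: act to improve the intersection
```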
Baggage is rarely lost by Northwest Airlines. Suppose a
random sample of 1,000 flights shows a total of 300 bags
were lost. Thus, the arithmetic mean number of lost bags per
flight is 0.3 (300/1,000). If the number of lost bags per flight
follows a Poisson distribution with µ = 0.3,
find the probability of not losing any bags.
On Thursday morning between 9 A.M. and 10 A.M. customers
arrive and enter the queue at the Oxnard University Credit
Union at a mean rate of 1.7 customers per minute.
• Find the mean and standard deviation
• What is the probability that there are exactly 4 Customers in the
bank
PDF = P(x) = (λ^x e^(−λ)) / x! = (1.7^x e^(−1.7)) / x!
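Both parts of the credit-union question can be computed directly:

```r
lambda <- 1.7
mu    <- lambda        # mean = 1.7 customers per minute
sigma <- sqrt(lambda)  # standard deviation ~ 1.304
p4 <- dpois(4, lambda) # P(exactly 4 customers) ~ 0.0636
```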
Applications
• It is used in quality control statistics to count the no. of
defective samples.
• It is used in insurance problems to count the no. of
casualties.
• No. of faulty blades in a packet of 100.
• No. of printing mistakes on each page of a book.
• No. of suicides reported in a particular city.
Note: - Poisson is a good approximation of the binomial
when n ≥ 20 and p ≤ 0.05
Case Study - IV
1. P(X=0) = 0.27253
2. P(X=1) = 0.35429
3. P(X=2) = 0.23029
4. P(X=3) = 0.09979
5. P(X<3) = 0.27253 + 0.35429 + 0.23029 = 0.857 = 85.7%
NORMAL DISTRIBUTION
o Most popular & useful continuous probability distribution
o Defined by 2 parameters: standard deviation & mean
o Characteristics of Normal Probability Distribution: -
• Curve has a single peak; thus, it is uni-modal
• Mean of a normally distributed population lies at the center of its
normal curve
• For a normal curve, the mean, median & mode are the same value
• 2 tails of the normal probability distribution extend indefinitely & never
touch the horizontal axis
68% of the population have IQs between 85 and 115 points (±1σ); only 16%
of people have IQs > 115 points. 95.4% of people have IQs between 70
and 130 (±2σ). 99.7% of the population have IQs in the range from 55 to 145 points (±3σ).
• Standard normal distribution
Normal distribution whose mean
is zero & standard deviation is
one
• Increasing the mean shifts the
curve to the right
• Increasing the standard
deviation “flattens” the curve
• We can use the following
function to convert any normal
random variable to a standard
normal random variable
Z = (X − µ) / σ
where
X = value of the random variable we want to measure
µ = mean of the distribution
σ = standard deviation of the distribution
Z = number of standard deviations from X to the mean, µ
σ = 20 days
• From the normal distribution table, for Z = 1.25
the area is 0.3944
• There is an 89% (0.5 + 0.3944 = 0.8944) chance
that the contract will not incur a penalty
• The probability of completing the triplexes in
more than 125 days is 1 − 0.8944 = 0.1056
• There is a 10% (0.5 − 0.3944 = 0.1056) chance
that the contract will earn a bonus of $5,000 on completion
Layton Tire and Rubber Company wishes to set
a minimum mileage guarantee on its new
MX100 tire. Tests reveal the mean mileage is
67,900 with a standard deviation of 2,050 miles
and that the distribution of miles follows the
normal probability distribution. It wants to set
the minimum guaranteed mileage so that no
more than 4 percent of the tires will have to be
replaced.
What minimum guaranteed mileage should
Layton announce?
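qnorm() answers the Layton question directly: find the mileage below which only 4 percent of tires fall.

```r
# 4th percentile of N(mean = 67900, sd = 2050)
guarantee <- qnorm(0.04, mean = 67900, sd = 2050)  # ~ 64311 miles
```

So Layton should announce a minimum guarantee of roughly 64,300 miles.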
Case Study
P(T ≤ t) = 1 − e^(−λt)
P(T ≤ 0.5) = 1 − e^(−4(0.5)) = 1 − e^(−2) ≈ 0.86
[Figure: standard normal density f(z) plotted for z from −5 to 5, alongside a corresponding normal density f(x)]
95% Confidence Interval: x̄ ± 1.96 σ/√n, with n = 40
NORMAL DISTRIBUTION (ND)
• Four (4) functions generate the values associated with
the normal distribution – pnorm, qnorm, dnorm, & rnorm
• To get a full list of them & their options, use
the help command: help(Normal)
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
Arguments:
x, q – vector of quantiles
p – vector of probabilities
n – number of observations; if length(n) > 1, the length is taken to be the number required
mean – vector of means
sd – vector of standard deviations
log, log.p – logical; if TRUE, probabilities p are given as log(p)
lower.tail – logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
• If we don't specify mean & standard deviation, the standard
normal distribution is assumed (mean = 0 & standard deviation = 1)
dnorm() – Given a set of values, returns height of probability
distribution at each point - dnorm(x, mean = 0, sd = 1, log = FALSE)
Returns height of normal curve - mean = zero (0) & standard deviation
= one (1)
dnorm(0)
User can change the mean & standard deviation in the argument
dnorm(0,mean=4)
dnorm(0,mean=4, sd=10)
Next, plot the graph of a normal distribution
curve(dnorm,-3,3)
x = seq(-3,3,by=0.1)
y = dnorm(x,mean=2.5,sd=0.8)
plot(x,y)
#For a normal curve with µ = 500 & σ = 100
require(ggplot2)
ggplot(data.frame(x = c(200, 800)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 500, sd = 100))
pnorm() – Given a number/ a list, computes the probability that a
normally distributed random number will be less than (<) that
number - pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p =
FALSE)
Returns the area under a given value
pnorm(0,mean=2,sd=3)
For a normal curve with µ = 500 & σ = 100, what is the probability that
P(X<400)
pnorm(400,mean=500,sd=100)
Output: - 0.16
Alternatively, to get the area over a certain value, use the option
lower.tail=FALSE
What is the probability that P(X>650)
1 - pnorm(650,mean=500,sd=100) #OR
pnorm(650,mean=500,sd=100,lower.tail=FALSE)
Output: - 0.067
What is the probability that P(|X-500| > 150)
pnorm(350, 500, 100) + (1 - pnorm(650, 500, 100))
To plot the graph of pnorm, use curve function: curve(pnorm(x),-3,3)
x = seq(-20,20,by=.1)
y = pnorm(x,mean=2,sd=3)
plot(x,y)
pnorm(0,lower.tail=FALSE)
pnorm(0,mean=2,sd=1,lower.tail=FALSE)
x = seq(-20,20,by=.1)
y = pnorm(x,mean=2,sd=3, lower.tail = FALSE)
plot(x,y)
qnorm() – Inverse of pnorm(). Returns the Z-score of a given
probability
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.5)
qnorm(0.5,mean=1,sd=2)
x = seq(0,1,by=.05)
y = qnorm(x,mean=3,sd=2)
plot(x,y)
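rnorm() is the fourth function listed above but is not demonstrated; a short sketch of simulating draws from a normal distribution:

```r
set.seed(42)  # for reproducibility
x <- rnorm(1000, mean = 500, sd = 100)
mean(x)  # close to 500
sd(x)    # close to 100
```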
Cleaning Data
1. Selecting Variables – limits the dataset to specified variables (columns)
2. Selecting Observations – limits the dataset to observations (rows) meeting specific
criteria. Multiple criteria can be combined with the & (AND) and | (OR) symbols
3. Creating/ recoding Variables - Creates new variables/ transforms existing ones
4. Summarizing data - Reduces multiple values to a single value
5. Using Pipes - Operator passes the result on the left to the first parameter of the function
on the right
6. Reshaping Data
7. Missing Data
Selecting Variables – select() function limits dataset to specified
variables (columns) – starwars Dataset
o Keep only the variables name, height & gender
s1 = select(starwars, name, height, gender)
o Keep the variables name, & all variables between mass & species inclusive
s2 = select(starwars, name, mass:species)
o Keep all variables except birth_year & gender - s3 = select(starwars, -
birth_year, -gender)
Selecting Observations – filter() function allows to limit
observations (rows) meeting a specific criteria. Multiple criteria
can be combined with the & (AND) and | (OR) symbols – starwars
Dataset
o Select only females - s4 = filter(starwars, gender == "female")
o Select females from Alderaan - s5 = filter(starwars, gender == "female" &
homeworld == "Alderaan")
o Select individuals from Alderaan, Coruscant, or Endor
s6 = filter(starwars, homeworld=="Alderaan" | homeworld=="Coruscant" |
homeworld=="Endor")
s7 = filter(starwars, homeworld %in% c("Alderaan", "Coruscant", "Endor"))
Creating/ Recoding Variables – mutate() function creates new
variables or transforms existing ones
o Convert height in cms to inches & mass in kgs to pounds
s8 = mutate(starwars, height = height * 0.394, mass = mass * 2.205)
o If height > 180 then heightcat = "tall", otherwise heightcat = "short"
s9 = mutate(starwars, heightcat = ifelse(height > 180, "tall", "short"))
o Convert eye color that is not black, blue or brown to "other"
s10 = mutate(starwars, eye_color = ifelse(eye_color %in%
c("black","blue","brown"), eye_color, "other"))
o Set heights > 200 or heights < 75 to missing
s11 = mutate(starwars, height = ifelse(height < 75 | height > 200, NA, height))
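Items 4 (summarizing) and 5 (pipes) of the cleaning-data list above are not demonstrated. A minimal sketch of the pipe idea using base R's native |> operator (R ≥ 4.1; dplyr's %>% behaves analogously), with a toy vector rather than the starwars data:

```r
# The pipe passes the left-hand result as the first argument
# of the right-hand function: x |> f() is the same as f(x)
res <- c(1, 4, 9, 16) |> sqrt() |> sum()  # 1 + 2 + 3 + 4 = 10
```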
Sources: -
• http://r-statistics.co/Missing-Value-Treatment-With-R.html
• https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-
imputing-missing-values/
• https://www.rdocumentation.org/packages/VIM/versions/4.8.0/topics/aggr
• https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-
analytics-edge-spring-2017/lecture-and-recitation-notes/
Dr. Shaheen M.Sc. (CS), Ph. D. (CSE)
Institute of Public Enterprise, Shamirpet Campus, Hyderabad – 500 101
Email: shahmsc@ipeindia.org Mobile: + (91)98666 66620
INTRODUCTION TO PYTHON
Measures of Central Tendency
• Mean: X̄ = (Σᵢ Xᵢ) / n
• Geometric mean: X_G = (X₁ · X₂ · … · Xₙ)^(1/n)
• Median: midpoint of ranked values
• Mode: most frequently observed value
CONSIDERATIONS FOR CHOOSING A MEASURE OF CENTRAL TENDENCY
• Interval/ratio variables – Mode, Median, & Mean
• Nominal variables – Mode
• Ordinal variables – Mode & Median
5.2 Measures of Dispersion/ Variation
• Measures of location only describe the center of the data. It doesn’t tell us
anything about the spread of the data
• Variation/ Dispersion - “spread” of data points about the center of the
distribution in a sample, that is, the extent to which the observations are
scattered
• Enables to study & compare the spread in two or more distributions
Example: - in choosing supplier A or supplier B we can measure the variability in delivery time for each
Variation
Standard deviation (s):
s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
Excel: =STDEV(Data)
• Uses same units as the raw data ($, £, ¥, etc.)
• Non-intuitive meaning
data_partial=data.iloc[:,[13,14,15]]
data_partial
USDA_partial= data_partial.describe(include='all')
USDA_partial
6. Select columns (Description & Sodium) for rows
(1,2,3,4, & 5) using loc.
data.loc[[1,2,3,4,5],['Description','Sodium']]
Source: https://www.nyclu.org/en/stop-and-frisk-data
http://michael.hahsler.net/research/arules_RUG_2015/demo/
year - year of stop yyyy; pct - precinct of stop 1 through 123; ser_num -UF-250 serial number nnn; datestop - date of stop mmddyyyy;
timestop - time of stop hhmm; city - location of stop city (1 – Manhattan, 2 – Brooklyn, 3 – Bronx, 4 – Queens, 5 - Staten Island); sex - suspect's
sex (0 – female, 1 – male); race - suspect's race (1 – black, 2 - black Hispanic, 3 - white Hispanic, 4 – white, 5 - Asian/Pacific Islander, 6 - Am.
Indian/Native); dob - suspect's date of birth mmddyyyy; age - suspect's age nnn; height - suspect's height in inches nnn; weight - suspect's
weight in pounds nnn; haircolor - suspect's haircolor (1 – black, 2 – brown, 3 – blonde, 4 – red, 6 – white, 7 – bald, 8 – sandy, 9 - salt and
pepper, 10 – dyed, 11 – frosted); eyecolor – suspect’s eyecolor (1 – black, 2 – brown, 3 – blue, 4 – green, 5 – hazel, 6 – gray, 7 - maroon,
pink, violet, 8 - two different); build - suspect's build (1 – heavy, 2 – muscular, 3 – medium, 4 – thin); othfeatr - suspect's other features
[string];frisked - was suspect frisked?; searched - was suspect searched?; contrabn - was contraband found on suspect?; pistol - was a
pistol found on suspect?; riflshot - was a rifle found on suspect?; asltweap - was an assault weapon found on suspect?; knifcuti - was a knife
or cutting instrument found on suspect?; machgun - was a machine gun found on suspect?; othrweap - was another type of weapon
found on suspect; arstmade - was an arrest made?; arstoffn - offense suspect arrested for [string]; sumissue - was a summons issued?;
sumoffen -offense suspect was summonsed for [string]; crimsusp - crime suspected [string]; detailcm - crime code description see
attached; perobs - period of observation in minutes nnn; perstop - period of stop in minutes nnn; pf_hands - physical force used by officer –
hands; pf_wall - physical force used by officer - suspect against wall; pf_grnd - physical force used by officer - suspect on ground; pf_drwep
- physical force used by officer - weapon drawn; pf_ptwep - physical force used by officer - weapon pointed; pf_baton - physical force
used by officer – baton; pf_hcuff - physical force used by officer – handcuffs; pf_pepsp - physical force used by officer - pepper spray;
pf_other - physical force used by officer – other; cs_objcs - reason for stop - carrying suspicious object; cs_descr - reason for stop - fits a
relevant description; cs_casng - reason for stop - casing a victim or location; cs_lkout - reason for stop - suspect acting as a lookout;
cs_cloth - reason for stop - wearing clothes commonly used in a crime; cs_drgtr - reason for stop - actions indicative of a drug transaction;
cs_furtv - reason for stop - furtive movements; cs_vcrim - reason for stop - actions of engaging in a violent crime; cs_bulge - reason for stop -
suspicious bulge; cs_other - reason for stop – other; rf_vcrim - reason for frisk - violent crime suspected; rf_othsw - reason for frisk - other
suspicion of weapons; rf_attir - reason for frisk - inappropriate attire for season; rf_vcact - reason for frisk- actions of engaging in a violent
crime; rf_rfcmp - reason for frisk - refuse to comply w officer's directions; rf_verbl - reason for frisk - verbal threats by suspect; rf_knowl -
reason for frisk - knowledge of suspect's prior crim behav; rf_furt - reason for frisk - furtive movements; rf_bulg - reason for frisk - suspicious
bulge; sb_hdobj - basis of search - hard object; sb_outln - basis of search - outline of weapon; sb_admis - basis of search - admission by
suspect; sb_other - basis of search – other; ac_proxm - additional circumstances - proximity to scene of offense; ac_evasv - additional
circumstances - evasive response to questioning; ac_assoc - additional circumstances - associating with known criminals; ac_cgdir -
additional circumstances - change direction at sight of officer; ac_incid - additional circumstances - area has high crime incidence;
ac_time - additional circumstances - time of day fits crime incidence; ac_stsnd - additional circumstances - sights or sounds of criminal
activity; ac_rept - additional circumstances - report by victim/witness/officer; ac_inves - additional circumstances - ongoing investigation;
ac_other - additional circumstances – other; forceuse - reason for force (1 - defense of other, 2 - defense of self, 3 - overcome resistance, 4
– other, 5 - suspected flight, 6 - suspected weapon); inout - was stop inside or outside (0 – outside, 1 – inside); trhsloc - was location housing
or transit authority (0 – neither, 1 - housing authority, 2 - transit authority); premname - location of stop premise name [string]; addrnum -
location of stop address number [string]; stname - location of stop street name [string]; stinter - location of stop intersection [string]; crossst -
location of stop cross street [string]; addrpct - location of stop address precinct [string]; sector - location of stop sector 1 through 21; beat -
location of stop beat nnn; post - location of stop post nnn; xcoord - location of stop x coord nnnnnnn (1983 state plane, feet, long island);
ycoord - location of stop y coord nnnnnnn (1983 state plane, feet, long island); typeofid - stopped person's identification type (1 - photo
id, 2 - verbal id; 3 - refused to provide id); othpers - were other persons stopped, questioned or frisked ?; explnstp - did officer explain
reason for stop?; repcmd - reporting officer's command 1 through 999; revcmd - reviewing officer's command 1 through 999; offunif - was
officer in uniform?; offverb - verbal statement provided by officer (if not in uniform); officrid - id card provided by officer (if not in uniform);
offshld - shield provided by officer (if not in uniform); radio - radio run; recstat - record status (0 - original value A, 1 - original value 1); linecm
count >1 additional details. All variables with no values listed have the following values: (0 – no, 1 - yes)
INTRODUCTION TO GRAPHICS
• One of the greatest powers of R is its graphical capabilities
• The Graphics Window
• When pictures are created in R, they are presented in the active graphical window. If
no such window is open when a graphical function is executed, R will open one.
• Some features of the graphics window:
• You can print directly from the graphics window, or choose to copy the graph to the clipboard and
paste it into a word processor. There, you can also resize the graph to fit your needs.
• A graph can also be saved in many other formats, including pdf, bitmap, metafile, jpeg, or
postscript.
• Each time a new plot is produced in the graphics window, the old one is lost. In MS Windows, you
can save a “history” of your graphs by activating the Recording feature under the History menu
(seen when the graphics window is active). You can access old graphs by using the “Page Up” and
“Page Down” keys. Alternatively, you can simply open a new active graphics window (by using the
function x11() in Windows/Unix and quartz() on a Mac).
TWO BASIC GRAPHING FUNCTIONS
• There are many functions in R that produce graphs, and they range
from the very basic to the very advanced and intricate.
• The plot() Function
• The most common function used to graph anything in R is the plot()
function.
• This is a generic function that can be used for scatterplots, time-series
plots, function graphs, etc.
• If a single vector object is given to plot(), the values are plotted on the
y-axis against the row numbers or index.
• If two vector objects (of the same length) are given, a bivariate
scatterplot is produced.
• For example, consider again the dataset trees in R. To visualize the
relationship between Height and Volume, we can draw a scatterplot:
plot(Height, Volume) # object trees is in the search path
• The first variable is plotted along the horizontal axis and the second
variable is plotted along the vertical axis. By default, the variable
names are listed along each axis.
• The plot() function can allow for some pretty snazzy window dressing by
changing the function arguments from the default values. These
include adding titles/subtitles, changing the plotting character/color
(over 600 colors are available!), etc.
• See ?par for an overwhelming list of these options.
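A short sketch of such window dressing using the built-in trees data (the color and plotting-character choices here are illustrative, not from the slides):

```r
# Scatterplot with a title, axis labels, a filled plotting
# character, and a custom color
plot(trees$Height, trees$Volume,
     main = "Tree Volume vs. Height",
     xlab = "Height (ft)", ylab = "Volume (cu ft)",
     pch = 19, col = "royalblue2")
```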
• The curve() Function
• To graph a continuous function over a specified range of
values, the curve() function can be used (although interestingly
curve() actually calls the plot() function).
• The syntax of this function is:
curve(expr, from, to, add = FALSE, ...)
Where, expr: an expression written as a function of 'x'
from, to: the range over which the function will be plotted.
add: logical; if 'TRUE' add to already existing plot.
• Note that it is necessary that the expr argument is always
written as a function of 'x'.
• If the argument add is set to TRUE, the function graph will be
overlaid on the current graph in the graphics window
• For example, the curve() function can be used to plot the sine
function from 0 to 2π:
> curve(sin(x), from = 0, to = 2*pi)
GRAPH EMBELLISHMENTS
• In addition to standard graphics functions, there are a host of other functions that
can be used to add features to a drawn graph in the graphics window.
• These include:
Function Operation
abline() adds a straight line with specified intercept and slope
arrows() adds an arrow at a specified coordinate
lines() adds lines between coordinates
points() adds points at specified coordinates
rug() adds a “rug” representation to one axis of the plot
segments() similar to lines() above
text() adds text (possibly inside the plotting region)
title() adds main titles, subtitles, etc. with other options
CHANGING GRAPHICS PARAMETERS
• There is still more fine tuning available for altering the graphics settings. To make changes to
how plots appear in the graphics window itself, or to have every graphic created in the
graphics window follow a specified form, the default graphical parameters can be changed
using the par() function.
• There are over 70 graphics parameters that can be adjusted, so only a few will be mentioned
here. Some very useful ones are given below:
> par(mfrow = c(2, 2)) # gives a 2 x 2 layout of plots
> par(lend = 1) # gives "butt" line end caps for line plots2
> par(bg = "cornsilk") # plots drawn with this colored background
> par(xlog = TRUE) # always plot x axis on a logarithmic scale
• Any or all parameters can be changed in a par() command, and they remain in effect until
they are changed again (or if the program is exited). You can save a copy of the original
parameter settings in par(), and then after making changes recall the original parameter
settings.
• To do this, type
> oldpar <- par(no.readonly = TRUE)
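The save-and-restore pattern just described looks like this:

```r
oldpar <- par(no.readonly = TRUE)      # snapshot the current settings
par(mfrow = c(2, 2), bg = "cornsilk")  # change some parameters
# ... draw plots ...
par(oldpar)                            # restore the original settings
```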
UNI-VARIATE DATA
• There is a distinction between types of data in statistics and R.
• In particular, initially, data can be of three basic types:
categorical, discrete numeric and continuous numeric.
• Categorical data is data that records categories.
• Example: -
• in a doctor's chart which records data on a patient. the gender or the
history of illnesses might be treated as categories.
• the age of a person and their weight are numeric quantities.
• Categorical data
• We often view categorical data with tables but we may also
look at the data graphically with bar graphs or pie charts.
• Using tables - Its simplest usage looks like table(x) where x is a
categorical variable. The table command simply adds up the
frequency of each unique value of the data.
• Example: - Smoking survey - A survey asks people if they smoke or not.
The data is Yes, No, No, Yes, Yes. We can enter this into R with the c()
command, and summarize with the table command as follows
> x=c("Yes","No","No","Yes","Yes")
> table(x)
x
 No Yes
  2   3
• Factors
• Categorical data is often used to classify data into various levels or factors.
• For example, the smoking data could be part of a broader survey on
student health issues.
• R has a special class for working with factors which is occasionally
important to know as R will automatically adapt itself when it knows it has
a factor.
• To make a factor is easy with the command
factor or as.factor.
• If you want to add colours to your box plot, you can use the option
col and specify a vector with the colour numbers or the colour
names.
boxplot(trees, las = 2,
        col = c("red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2"))
• Now, for the finishing touches, we can put some labels to plot.
• The common way to put labels on the axes of a plot is by using the
arguments xlab and ylab.
• boxplot(trees, ylab = "Value", xlab = "Attributes", las = 2,
          col = c("red", "sienna", "palevioletred1", "royalblue2",
                  "red", "sienna", "palevioletred1", "royalblue2",
                  "red", "sienna", "palevioletred1", "royalblue2"))
• Produce box-and-whisker plot(s) of the given (grouped) values.
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch =
FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL,
log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
horizontal = FALSE, add = FALSE, at = NULL)
where
• x: the data from which the boxplots are to be produced.
• range: determines how far the plot whiskers extend out from the box. If range is
positive, the whiskers extend to the most extreme data point which is no more
than range times the interquartile range from the box. A value of zero causes
the whiskers to extend to the data extremes.
• width: a vector giving the relative widths of the boxes making up the plot.
• varwidth: if TRUE, the boxes are drawn with widths proportional to the
square-roots of the number of observations in the groups.
• notch: if TRUE, a notch is drawn in each side of the boxes. If the notches
of two plots do not overlap this is 'strong evidence' that the two medians differ.
• outline: if not true, the outliers are not drawn.
• names: group labels which will be printed under each boxplot.
• boxwex: a scale factor to be applied to all boxes. When there are only a few
groups, the appearance of the plot can be improved by making the boxes
narrower.
• staplewex: staple line width expansion, proportional to box width.
• outwex: outlier line width expansion, proportional to box width.
• plot: if TRUE (the default) then a boxplot is produced. If not, the summaries on
which the boxplots are based are returned.
• border: an optional vector of colors for the outlines of the boxplots. The values in
border are recycled if the length of border is less than the number of plots.
• col: if non-null, assumed to contain colors to be used to colour the bodies
of the box plots. By default they are in the background colour.
• log: character indicating if x or y or both coordinates should be plotted in log
scale.
• pars: a list of (potentially many) more graphical parameters, e.g., boxwex or
outpch; these are passed to bxp (if plot is true); for details, see there.
• horizontal: logical indicating if the boxplots should be horizontal; default FALSE
means vertical boxes.
• add: logical; if true, add boxplot to current plot.
• at: numeric vector giving the locations where the boxplots should be drawn,
particularly when add = TRUE; defaults to 1:n where n is the number of boxes.
• Examples
plot(D$metmin, D$wg, main = 'Met Minutes vs. Weight Gain',
     xlab = 'Mets (min)', ylab = 'Weight Gain (lbs)')
plot(D$metmin, D$wg, main = 'Met Minutes vs. Weight Gain',
     xlab = 'Mets (min)', ylab = 'Weight Gain (lbs)', pch = 2)
PROBABILITY, DISTRIBUTIONS, AND
SIMULATION
• Distribution Functions in R
• R allows for the calculation of probabilities, the evaluation of probability
density/mass functions, percentiles, and the generation of pseudo-random variables
following a number of common distributions.
• The following table gives examples of various function names in R along with
additional arguments.
• Prefix each R name given above with ‘d’ for the density or mass
function, ‘p’ for the CDF, ‘q’ for the percentile function (also called the
quantile), and ‘r’ for the generation of pseudorandom variables.
• The syntax has the following form – we use the wildcard rname to
denote a distribution above:
> drname(x, ...) # the pdf/pmf at x (possibly a vector)
> prname(q, ...) # the CDF at q (possibly a vector)
> qrname(p, ...) # the pth (possibly a vector) percentile/quantile
rrname(n, ...) # simulate n observations from this distribution
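The naming scheme can be illustrated with the binomial distribution (R name binom):

```r
dbinom(2, size = 10, prob = 0.5)    # pmf: P(X = 2)  ~ 0.0439
pbinom(2, size = 10, prob = 0.5)    # CDF: P(X <= 2) ~ 0.0547
qbinom(0.5, size = 10, prob = 0.5)  # median: 5
rbinom(3, size = 10, prob = 0.5)    # three simulated observations
```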
• Conditioning
• Trellis plots are based on the idea of conditioning on the
values taken on by one or more of the variables in a data
set.
• In the case of a categorical variable, this means carrying out
the same plot for the data subsets corresponding to each of
the levels of that variable.
• In the case of a numeric variable, it means carrying out the
same plots for data subsets corresponding to intervals of that
variable.
EXAMPLE # 1
EARTHQUAKE LOCATIONS
• R contains a data set
called quakes which gives
the location and
magnitude of
earthquakes under the
Tonga Trench, to the
North of New Zealand.
• The spatial distribution of
earthquakes in the area is
of major interest, because
this enables us to “see”
the structure of the
earthquake faults. Here is
a plot from the Geology
department at Berkeley,
which tries to present the
spatial structure.
[Figure: Tonga Trench Earthquakes, coloured by depth — Yellow: 0 − 70 km, Orange: 71 − 300 km, Red: 300 − 800 km]
Problems with this Presentation
CONTD…
• There is a good deal of over-plotting and this makes it
hard to see all of the structure present in the data.
• The map makes it clear that we are looking down from
above on the scene, but deeper quakes appear to be
plotted on top of shallower ones.
• The division of depths into three intervals and presentation
using colour is relatively crude.
A Trellis Plot
• We can overcome many of the problems of the previous
plot by using a trellis display.
• We create the display by producing a sequence of
graphs, each of which presents a different range of
depths.
• In this case we will have a slight overlap of the intervals
being plotted.
• Explanation
CONTD…
• The plot is read left-to-right and bottom-to-top.
• Depth increases progressively through
the plot.
• There are eight different depth intervals,
each containing approximately the
same number of earthquakes.
• Consecutive depth intervals overlap by
a small amount.
• The range of depths covered by each
interval is indicated in the bar above
each plot.
• Interpretation
• The shallower earthquakes are
concentrated on two inclined fault
planes.
• The most easterly of these fault planes is
the one which bisects New Zealand.
• The Westerly fault plane has mainly
shallow earthquakes, while the Easterly
fault plane has both shallow and deep
earthquakes.
• The deep earthquakes show a distinct pattern of their own.
EXAMPLE # 2
BARLEY YIELDS
• The data in the example shows the yields obtained
from field trials of barley seed. The data comes from
the 1930s, before there was any direct genetic
modification. The trials were conducted in 1931 and
1932, using 10 different strains of barley and 6
different growing sites. There are 2 × 10 × 6 = 120
observations. It was suspected for a long time that
there was something odd about this data set.
CONTD…
The Trellis Plot
• The plot we will look at
shows the barley yields
for each of the 10 strains
at the 6 sites and for
each year.
• The results for each site
are plotted on a
separate graph, i.e. we
are conditioning
on the site.
• The yields from the two
years are superimposed
on each of the plots.
THE TRELLIS TECHNOLOGY
• There are a variety of displays which can be produced by Trellis, including:
• Bar Charts
• Dot Charts
• Box and Whisker Plots
• Histograms
• Density Traces
• QQ Plots
• Scatter Plots
• A common framework is used to produce all these plots.
• Every Trellis display consists of a series of rectangular panels, laid out in a regular row-
by-column array.
• The indexing of the array is left-to-right, bottom-to-top.
• The x axes of all the panels are identical. This is also true for the y axes.
• Each panel of the display corresponds to conditioning, either on the levels of a
categorical variable or on intervals of the values of a numeric one.
CONTD…
Shingles
• The conditioning carried out in the earthquake plot is described by a shingle.
• A shingle consists of a number of overlapping intervals (like the shingles on a
roof of a house).
• Assuming that the earthquake depths are contained in the variable depth,
the shingle is created as follows.
> depth = quakes$depth
> Depth = equal.count(depth, number=8, overlap=.1)
• The shingle assigned to Depth has 8 intervals with adjacent intervals having
10% of their values in common.
• A shingle contains the numerical values it was created from and can be
treated like a copy of that variable. For example:
> range(Depth)
[1] 40 680
> range(depth)
[1] 40 680
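The intervals themselves can also be inspected. A small sketch, assuming lattice is loaded (nlevels and levels work on shingles much as they do on factors):

```r
library(lattice)                  # equal.count is in the lattice package
depth <- quakes$depth
Depth <- equal.count(depth, number = 8, overlap = .1)
nlevels(Depth)                    # 8 overlapping intervals
levels(Depth)                     # the endpoints of each interval
range(Depth) == range(depth)      # the shingle keeps the original values
```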
CONTD…
• A shingle also has the interval
information attached to it. This
can be displayed by printing
or plotting the shingle.
> plot(Depth)
Producing the Plot
• The display of the
earthquakes is produced by
the function xyplot, which is
the Trellis variant of a scatter
plot function.
• The plot was produced as
follows:
> Depth = equal.count(quakes$depth, number = 8, overlap = .1)
> xyplot(lat ~ long | Depth, data = quakes,
    xlab = "Longitude", ylab = "Latitude")
lat ~ long | Depth - is an instruction
to plot lat on the y axis against long
on the x axis with conditioning
intervals as described in Depth.
• The second argument to xyplot specifies the data frame in
which the variables in the plot formula are found (here, data = quakes).
CONTD…
Unconditional Plots
• The xyplot function can be used to produce an unconditional plot by
omitting the conditioning specification from the plot formula.
> xyplot(lat ~ long, data = quakes, xlab = "Longitude", ylab = "Latitude")
EXAMPLE # 2
BARLEY YIELD PLOT
• The barley yield plot is produced
by the function dotplot, which
can be used to plot numeric values
against a categorical variable.
• In this case, the numeric variable is
the barley yield and the
categorical variable is the seed
strain.
• We also condition on the value of
another variable, the growing site.
• A First Attempt - creating a dot chart
> dotplot(variety ~ yield | site, data
= barley)
• A Second Attempt - conditioning on
both site and year
> dotplot(variety ~ yield | site * year,
data = barley)
CONTD…
• A Third Attempt - What we need
is to superimpose the two years
for each site on a single panel
> dotplot(variety ~ yield | site, data
= barley, panel = panel.superpose,
group = year, pch = c(1, 3))
Layout Control
• The layout argument should be a
vector of three values: number of
rows, number of columns and
number of pages desired for the
display.
• For example, we can rearrange the
earthquake plot as follows:
> xyplot(lat ~ long | Depth, data = quakes,
layout = c(4, 2, 1),
xlab = "Longitude",
ylab = "Latitude")
CONTD…
Aspect Ratio Control
• The panels in the previous plot are
rather too tall relative to their
widths.
• By default, plots are sized so that
they can occupy the full surface
of the output window.
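The panel shape can be controlled directly with the aspect argument to the Trellis functions. A sketch using the earlier earthquake plot (aspect = 1 forces square panels; aspect = "xy" instead asks for 45-degree banking):

```r
library(lattice)
# Recreate the depth shingle used earlier
Depth <- equal.count(quakes$depth, number = 8, overlap = .1)
p <- xyplot(lat ~ long | Depth, data = quakes,
            layout = c(4, 2, 1), aspect = 1,
            xlab = "Longitude", ylab = "Latitude")
print(p)   # lattice plots are drawn when the trellis object is printed
```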
Dotchart 1
• We start by displaying death rate against
population group, conditional on age.
> dotplot(group ~ rate | age, xlab = "Death
Rate (per 1000)", layout = c(1, 5, 1))
Dotchart 2
• The first display is hard to read because
the variation within each age group is
“noisy.”
• Here we order the population categories
differently.
• Alternatively we can interchange the
roles of the cross-classifying variables.
> dotplot(age ~ rate | group, xlab = "Death Rate (per 1000)")
CONTD…
Dotchart 3
• The second display is better than the
first, but we can improve it with a
different ordering of the panels.
• We’ll arrange the panels in a 2 × 2
array.
• This will allow us to make direct
male/female and urban/rural
comparisons.
> dotplot(age ~ rate | group, xlab =
"Death Rate (per 1000)", layout =
c(2, 2, 1))
Alternative Displays
• The previous displays presented the
data in “dotchart” displays.
• There are other alternatives, barcharts
for example.
> barchart(age ~ rate | group, xlab
= "Death Rate (per 1000)", layout =
c(2, 2, 1))
EXAMPLE # 4
HEIGHTS OF SINGERS
• In this example we'll examine
the heights of the members of
a large choral society.
• The values are in a data set called
singer which is in the lattice
package. They can be
loaded with the data
command once the lattice
library is loaded. The variables
are named height (inches) and
voice.part.
> bwplot(voice.part ~ height,
data=singer, xlab="Height (inches)")
> qqmath(~ height | voice.part,
aspect = 1, data = singer)
EXAMPLE # 5
MEASUREMENT OF EXHAUST FROM BURNING ETHANOL