
BUSINESS ANALYTICS FOR MANAGERS

R STUDIO (VER.3.5.2) & PYTHON (VER.3.7.3)


SYLLABUS
Course Objective(s)
This course will cover the basic concepts of data preparation and analysis with
emphasis laid on business needs. The course is intended for first year
management students coming from a background of engineering, commerce,
arts, computer sciences, statistics, mathematics, economics and management.
This course seeks to present the student with a wide range of data analytic
techniques and is structured around the broad contours of the different types of
data analytics namely: descriptive, inferential, predictive, and prescriptive
analytics.

Unit I : Introduction to Business Analytics and Data Preparation


Types of Digital Data: Structured Data, Unstructured Data, and Semi-Structured Data;
Overview of Business Analytics; Functional Applications of Business Analytics in
Management.
Unit II: Business Analytics using R
Introduction to R Programming; Installing R and R Studio; Data Structures in R: Vectors, Dataframes,
Lists, Matrices and Array Operations; Summarizing Data: Numerical and Graphical Summaries; Data
Visualization in R: Histogram, Bar Chart, Scatter Plot, Box Plot, Corrgram, Corrplot, ggplot2; Data
Manipulation in R: Built-in Functions, Apply Functions, Date and Time Functions, Dplyr Functions, Pipe
Operator; Data Transformation: Filtering, Dropping, Merging, Sorting, Reshaping of Data, Detecting
Missing Values in the Data, Imputation; Data Import and Export Techniques in R; Statements:
Conditional Statements and Control Statements.
Statistical Applications: Parametric One and Two Sample Tests: Z-test, t-test, Chi-square test, and
ANOVA; Non-parametric Test: Mann Whitney U test, Wilcoxon test, Kruskal Wallis test; Correlation
Analysis; Simple and Multiple Linear Regression.
Unit III: Business Analytics using Python
Introduction to Python Programming; Installing Python, Pycharm and Anaconda; Data Structures in
Python: Variables, Files, Lists, Dictionaries, Tuples, and Sets; Functions: In-built, User-defined and
Lambda Functions; Statements: Conditional Statements and Control Statements; Exception
Handling.
Data Import and Export Techniques in Python; Summarizing Data: Numerical and Graphical
Summaries; Data Visualization in Python: Matplotlib and Seaborn libraries; Data Analysis: Numpy,
Pandas and Sklearn Libraries; Model building for Simple and Multiple Linear Regression.
Unit I
Types of Digital Data: Structured Data, Unstructured Data, and Semi-
Structured Data; Overview of Business Analytics; Functional
Applications of Business Analytics in Management.
CONTENTS
1. Introduction
2. Digital Data Formats
3. Types of Analytic Techniques
4. Tools for Data/ Business Analytics
5. Case Study – United States Department of Agriculture
(USDA)
• "DATA IS THE SWORD OF THE 21ST CENTURY; THOSE WHO
WIELD IT WELL, THE SAMURAI."
Jonathan Rosenberg, Advisor to CEO, Larry Page & Former
SVP, Google

• In an interview with CNBC, Eric Schmidt, Executive
Chairman of Google's parent company Alphabet,
and Jonathan Rosenberg, Advisor to CEO Larry
Page, said: "A basic understanding of data analytics is
incredibly important for the next generation, as it will
become increasingly important in workplaces. A basic
knowledge of how statistics works, and of how people
draw conclusions over big data, will help
businesses."
• "How Google Works", authored by Schmidt &
Rosenberg, highlighted hiring professionals with the
right skills & a penchant for bold, creative thinking as
the strategy that drove Google's innovation.
• According to the Bureau of Labor Statistics (BLS), the
number of roles for individuals with this skill set is
expected to grow by 30 per cent over the next 7
years.
Tech Trends - Past & Present
• Over the last 10 years, CLOUD, ANALYTICS & TECHNOLOGIES empowering digital
experiences have steadily disrupted IT operations, business models & markets. Their
impacts cannot be overstated, & their storylines continue to evolve
• Recently, 3 new technologies – BLOCKCHAIN, COGNITIVE, & DIGITAL REALITY (AR, VR,
IoT etc.) – have each risen to become a distinct macro force
• 3 foundational forces that make organizations harness innovation while maintaining
operational integrity include:
• Modernizing legacy core systems
• Transforming the business of technology
• Evolving cyber risk strategies

• Over time these technologies evolved & expanded across industries


• Today they are considered foundational components not only of enterprise IT but of
corporate strategy too
• For purposeful, transformational change, these rapidly advancing technologies must be
brought into controlled collision with one another
Source: - https://www.wipo.int/edocs/pubdocs/en/wipo_pub_1055.pdf
According to the "Technology Trends Dossier 2019" from CyberMedia Research (CMR)
• Presents a comprehensive view of the transformative technologies that will shape the
way enterprises operate in the next 2-3 years, specifically in 2019
• The ABCD-i (AI, Blockchain, Cloud, Data Analytics and IoT) will dominate the world of
technology in 2019
• Will gain more maturity in the next 2-3 years with several enterprises rolling out their
deployments
• while these technologies overshadowed traditional IT, they did not gain widespread
deployment because many of them are still evolving
• Digital is ready to change everything with a promise to reinvent your industry as
you know it
• ABCD - the lingua franca of digital business Analytics, biometrics, cloud & digital
are the building blocks of a digital business - VS Parthasarathy, Group CIO &
Group CFO, Mahindra & Mahindra
• Search engine giant Google - foraying into payments, operating systems, social
networking, communications, & automobile segment
• China’s e-commerce juggernaut Alibaba - is making inroads into finance,
payments, cloud & online insurance industry
• The app-based taxi aggregator Uber - is reimagining urban transportation with on-
demand air transport with flying cars, besides sniffing out long-term business
opportunities in food delivery
• Electric car major Tesla - is muscling its way into battery storage & solar panels.
• Software behemoth Microsoft - is limbering up for the cloud and mobile race
• Best way to predict the future is to create it. These forward-thinking organizations are
disrupting themselves to avoid getting disrupted
R AND R-STUDIO INSTALLATION
• R is an industrial strength open source statistical package
• R & R Studio Download - Takes 2 minutes to install
• R Installation
• https://cran.r-project.org/bin/windows/base/old/3.5.2/
• R Studio Installation
• Provides an integrated environment for working in R
• https://www.rstudio.com/products/rstudio/download/
CONTD...
INSTALL ANACONDA 5.0.0 (64-BIT)
WITH PYTHON 3.6
Anaconda Installation on windows Machine
1. Open - https://repo.continuum.io/archive/
2. Select - for 64 bit -
https://repo.continuum.io/archive/Anaconda3-5.2.0-Windows-x86_64.exe
3. During installation, make sure to select the first
check-box to add anaconda to PATH
environment variable
4. Open Anaconda Navigator from command
prompt
5. Execute “Jupyter Notebook 5.5.0”
6. Select New – select Python 3
• Data growth has seen exponential acceleration since the
advent of the computer & internet
2. DIGITAL DATA FORMATS
• Digital data can be classified into 3 forms:
• Unstructured - Does not conform to a data model
• Semi-structured - Does not conform to a data model but has some structure
• Structured - Data conforms to some specification
• According to Merrill Lynch, 80-90% of business data is either unstructured or semi-structured
• Gartner also estimates that unstructured data constitutes 80% of the whole enterprise data
[Chart: approximately 80% unstructured data, 10% semi-structured data, 10% structured data]
Structured Data Representation
• DB & Excel like format
• Each column represents one
aspect of data and each row is
one record
• The points represent 2-dimensional vectors
• Each point talks about a customer
with age ‘x’ and income ‘y’
• Globally, India is amongst the Top 5
with
• 1.2 billion People
• Over 890 million Mobile subscribers
• 213 million Internet subscribers
• 115 million Facebook users
• 24 million LinkedIn users
• over 200,000 factories with an
estimated public & private sector
employment of 29 million
• Digital bits captured/ created each
year in India are expected to grow from
127 exabytes to 2.9 zettabytes
between 2012 & 2020
• With the flood of data available to
businesses, companies are turning to
analytics to extract meaning from the
huge volumes of data and make
smarter decisions for better business
outcomes
• Goal of Data Analytics - to
get actionable insights
CASE STUDY - I: GOODLIFE HEALTHCARE GROUP
GoodLife HealthCare Group is one of India's leading healthcare groups. The group began its
operations in the year 2000 in a small town off the south-east coast of India, with just one tiny
hospital building with 25 beds. Today, the group owns 20 multi-speciality healthcare centers
across all the major cities of India. The group has witnessed some major successes and
attributes them to its focus on assembly-line operations and standardization. GoodLife
HealthCare offers the following facilities: emergency care 24 x 7, support groups, and support
and help through call centers. The group believes in making a "Dent in Global Healthcare".
A few of its major milestones are listed below in chronological order:
o Year 2000 – the birth of the GoodLife HealthCare Group, functioning initially from a tiny
hospital building with 25 beds
o Year 2002 – built a low-cost hospital with 200 beds in India
o Year 2004 – gained a foothold in other cities of India
o Year 2005 – the total number of healthcare centers owned by the group touched the 20 mark
o The next 5 years saw the group's dominance in the form of it setting up a GoodLife
HealthCare Research Institute to conduct research in molecular biology and genetic disorders
o Year 2010 – the group received the award for the "Best HealthCare Organization of the Decade"
Research Questions
• What data is present in the system?
• How is it stored?
• How important is the information?
• How can this information enhance healthcare services?
Structured Data
GoodLife nurses make electronic records for every patient who visits the hospital. These
records are stored in a relational database. Nurse Nandu records the body temperature &
blood pressure of a patient, Mr. Prem, and enters them in the hospital database. Dr. Dev,
who is treating Prem, searches the database to know his body temperature. Dr. Dev is able
to locate the desired information easily because the hospital data is structured and is stored
in a relational database.
Semi-structured Data
Dr. Vishnu of the "GoodLife HealthCare" organization usually gets a blood test done for
migraine patients visiting him. It is his observation that patients with migraine have a high
platelet count. He makes a note of the diagnosis in the conclusion section of the report.
Dr. Mamatha searches the database when she is unable to find the cause of migraine in one
of her patients, but with no luck! Dr. Vishnu's blood test reports on patients were not
successfully updated into the medical system database as they were in the semi-structured
format.
Unstructured Data
Dr. Sami, Dr. Raj & Dr. Rahul work at the medical facility of GoodLife. Over the past few
days, Dr. Sami & Dr. Raj have been exchanging long emails about a particular case of a
gastro-intestinal problem. Dr. Raj, with a particular combination of drugs, has successfully
cured the disorder in his patients. He has written an email about this combination of drugs
to Dr. Sami & Dr. Rahul. Dr. Rahul has a patient with quite a similar case of gastro-intestinal
disorder. He quickly searches the organization's database for the process, but with no luck,
as the email conversation has not been successfully updated into the medical system
database because it fell in the unstructured format.
GoodLife HealthCare Patient Index Card (structured): Patient ID, Doctor, Nurse Name,
Patient Name, Patient Age, Body Temperature, Blood Pressure
GoodLife HealthCare - Blood Test Report (semi-structured): Date, Patient Age, Patient Name,
WBC Count, Hemoglobin Content, RBC Count, Platelet Count, Conclusion <notes>
MEASUREMENT & SCALING CONCEPTS
Scale Characteristics
1. Scales are constructed based on 3 characteristics – Order, Distance & Origin
1. Order – Denotes relative size, position or importance associated with the characteristics of an object.
Represents relative position & doesn’t say anything about the absolute value of the characteristics of an
object
Example: - characteristics of an object may assume values: good, better & best. These 3 values are based
on the description of the characteristics of an object & have an order among them
2. Distance – Indicates that there are clearly known & measurable differences between the scale descriptors
Example: - No. of customers visiting the shop in a specific time period. For different times, nos. measured
have a clearly defined difference between them. Clearly a scale that has a distance also has order
3. Origin – Scale has a unique or fixed beginning point, also called a zero point
Example: - No. of customers visiting in a specific interval. It is a scale with origin as zero point for the case
when no customer is visiting in the interval

2. Each type offers the researcher progressively more power in analyzing & testing the
validity of a scale
Levels of Scale Measurement
1. Business researchers use many scales or number systems
2. Traditionally, the level of scale measurement is seen as important because it
determines the mathematical comparisons that are allowable
3. The 4 levels/ types of scale measurement are:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
4. Each type offers the researcher progressively more power in analyzing & testing the
validity of a scale
Nominal Scale
1. Represents most elementary level of measurement
2. Assigns value to an object for identification/ classification purpose only
3. Nominal scaling is arbitrary
4. Set of numbers, letters, or any other identification is equally valid
5. Note: - Nos. are not representing different quantities or the value of the object

1. Example 1: - Participants are blindfolded & asked to


taste one of the 3 root beers, drinks are labeled A (not
cane sugar), B (corn syrup), or C (fruit extract). The
researcher can assign the letter C to any of the 3
options without damaging scale validity
2. Example 2: - The model in the figure depicts no. 5 on a
horse. This is merely a label to allow bettors & racing
enthusiasts to identify the horse. It doesn't mean that it
is the 5th fastest horse or the 5th biggest or anything
else meaningful
Ordinal Scale
1. It is a ranking scale
2. Research participants are often asked to rank their preference. Ordinal scale lists the
options from most to least preferred, or vice versa
3. Researchers know how each item, person, or stock is judged relative to others, but
they don’t know by how much

Example: - When business professors take some time off & go


to the race track, even they know that a horse finishing in the
“show” position has finished after the “win” & “place” horses.
The order of finish can be accurately represented by an
ordinal scale using an order number rule: Assign positions: 1
to “win”; 2 to “place” position; 3 to “show” position
The winning horse defeated the place horse by a nose, but
the place horse defeated the show horse by 20 seconds.
Ordinal scale doesn’t tell how far apart the horses were, but
is good enough to let someone know the result of a wager
Interval Scale
1. Possesses both nominal & ordinal
properties, but they also capture
information about differences in
quantities of a concept
2. Interval scales are very useful
because they capture relative
quantities in the form of distances
between observations
Example: - A horse race in which the win horse is one second ahead of the place horse,
which is 20 seconds ahead of the show horse. Not only are the horses identified by the
order of finish, but the difference between each horse's performance is known
Ratio Scale
 Melissa's college record shows 36 credit hours earned, while Kevin's record shows 72
credit hours earned
 Kevin has twice as many credit hours earned as Melissa
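To connect these scale types to the tools used later in Unit II, here is a minimal R sketch (the values are illustrative, not from the slides) showing how nominal and ordinal variables are usually held as factors, while interval/ratio variables stay numeric:
# Nominal: labels only - the numbers identify, they do not rank
horse_no = factor(c(5, 7, 2))
# Ordinal: ranked categories - order matters, distances between levels do not
finish = factor(c("show", "win", "place"), levels = c("show", "place", "win"), ordered = TRUE)
finish[2] > finish[1]                   # TRUE - "win" outranks "show"
# Ratio: numeric with a true zero point - ratios are meaningful
credits = c(Melissa = 36, Kevin = 72)
credits["Kevin"] / credits["Melissa"]   # 2 - twice as many credit hours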
3. TYPES OF ANALYTIC
TECHNIQUES
1. Descriptive: What is happening?
• Comprehensive, accurate & live data
• uses data aggregation & mining
techniques to provide insight into the past
2. Diagnostic: Why is it happening?
• Discovers root-cause of the problem
• Ability to isolate all confounding
information
3. Predictive: What’s likely to happen?
• Historical patterns are used to predict
specific outcomes
• Uses statistical models/ forecasting
techniques to predict
4. Prescriptive: What do I need to do?
• Recommends actions & strategies
• Applies advanced analytical techniques
(optimization & simulation algorithms) to
advise on possible outcomes & make
specific recommendations
Descriptive Analytics: What is happening?
• Descriptive analytics can be classified into three areas that answer certain kinds of questions:
1. Standard reporting and dashboards: What happened? How does it compare to our plan? What is happening now?
2. Ad-hoc reporting: How many? How often? Where?
3. Analysis/query/drill-down: What exactly is the problem? Why is it happening?
• Categorizes, characterizes, consolidates and classifies data
• Includes dashboards, reports (e.g., budget, sales, revenue and costs) and various types of queries
• Tools include report generation, distribution capability and data visualization facilities
Diagnostic Analytics: Why is it happening?
o Empowers an analyst to drill down and isolate the root-cause of a problem
o Takes a deeper look at data to attempt to understand the causes of events and behaviors
o Examines data or content to answer the question "Why did it happen?"
o Is characterized by techniques such as drill-down, data discovery, data mining and correlations
o Example: - Users can find the right candidate to fill a position, select high-potential employees for
succession, and quickly compare succession metrics & performance reviews across select employees to
reveal meaningful insights about talent pools. Well-designed business intelligence (BI) dashboards
featuring filters & drill-down capabilities allow for a snapshot of employees across multiple categories
such as location, division, performance and tenure
Predictive Analytics: What is likely to happen in future based on previous trends & patterns?
• Uses historical patterns and statistical models/ forecasting techniques to predict specific outcomes
• Example: - A credit score helps financial institutions decide the probability of a customer paying credit bills on time
Prescriptive Analytics: What do I need to do?
o Analyzes data of what has happened, why it has happened & what might happen to help the user
determine the best course of action
o Example: - Maps help the client in choosing the best route considering the distance of each route, time taken,
current traffic constraints etc.
Case Study - II
Center for Disease Control (CDC)
Mr.Jones works at Center for Disease Control (CDC) and his job is to analyze the data
gathered from around the country to improve their response time during flu season. CDC wants
to examine the data of past events (geographical spread of flu last winter – 2012) gathered to
better prepare the state for next winter (2013)
December 2000 happened to be a bad year for the flu epidemic. A new strain of the virus
has wreaked havoc. A drug company produced a vaccine that was effective in combating
the virus. But, the problem was that the company could not produce them fast enough to
meet the demand.
Government had to prioritize its shipments. Government had to wait a considerable
amount of time to gather the data from around the country, analyze it, and take action. The
process was slow and inefficient. The contributing factors included, not having fast enough
computer systems capable of gathering and storing the data (velocity), not having computer
systems that can accommodate the volume of the data pouring in from all of the medical
centers in the country (volume), and not having computer systems that can process images,
i.e, x-rays (variety).
Because of the havoc created by the flu epidemic in 2000 and the government's inability to
rise to the occasion, there was a huge loss of lives.
US Government did not want the scenario in 2000 to repeat. So they decided to adopt
Big Data Technology in handling this flu epidemic.
The report generated by a BI tool (Descriptive Analytics) shows the State of New York as
having the most outbreaks. An interactive visualization tool presented a map depicting the
concentration of flu & vaccine distribution in different states of the United States last winter.
Visually a direct correlation is detected between the intensity of flu outbreak with the
late shipment of vaccines. It is noticed that the shipments of vaccine for the state of
New York were delayed last year. This gives a clue to further investigate the case to
determine if the correlation is causal using Diagnostic Analytics (discovery). Ms.Linda, a
data scientist applies Predictive Analytics to create a model and apply it to data
(demand, vaccine production rate, quantity etc.) in order to identify causal
relationships, correlations, weigh the effectiveness of decisions taken so far and to
prepare in tackling potential problems foreseen in the coming months. Prescriptive
Analytics integrates our tried-and-true predictive models into our repeatable processes
to yield desired outcomes.
Big Data technology solved the velocity-volume-variety problem. The Center for Disease
Control may receive the data from hospitals & doctors in real-time and Data Analytics
Software that sits on the top of Big Data computer system could generate actionable
items that can give the Government the agility it needs in times of crises.
4. TOOLS FOR DATA/ BUSINESS ANALYTICS

1. Relatively simple statistical tools - spread sheets of MS-Excel


2. Statistical software packages - KXEN, Statistica etc
3. Sophisticated business intelligence suites - SAS, Oracle, SAP,
IBM among the big players etc.
4. Open source tools - R and Weka are also gaining popularity
5. Companies are also developing in-house tools designed for
specific purposes
Most Popular Analytic Tools
1. MS Excel: Excellent reporting and dash boarding tool, can handle tables with up to 1 million rows making it a
powerful yet versatile tool
2. SAS: Largest independent vendor in the Indian business intelligence market, despite its monopolistic pricing, has
wide ranging capabilities from data management to advanced analytics
3. SPSS Modeler (Clementine): Data mining software tool by IBM, has an intuitive GUI and its point-and-click
modeling capabilities are very comprehensive
4. R: Is an open source Programming language for statistical computing and graphics
5. Statistica: provides data analysis, data management, data mining, and data visualization procedures. The GUI is not
user-friendly & it takes more time to learn some tools. It is a competitively priced product that is value for money
6. MATLAB: Allows matrix manipulations, plotting of functions & data, implementation of algorithms, and creation
of user interfaces, with add-on toolboxes that extend MATLAB to specific areas of functionality. MATLAB is not
free software; however, there are clones like Octave and Scilab which are free and have similar functionality
7. Weka: Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software.
Weka, along with R, is amongst the most popular open source software
8. Salford Systems: Provides a host of predictive analytics and data mining tools for businesses. The software is easy
to use and learn
9. KXEN: Is one of the few companies that are driving automated analytics. Their products are easy to use, fast
and can work with large amounts of data
10. Angoss: Like Salford Systems, Angoss has developed its products around classification and regression decision
tree algorithms. The tools are easy to learn and use, and the results are easy to understand and explain. The GUI is
user-friendly and has a lot of features
5. CASE STUDY
THE IMPORTANCE OF FOOD & NUTRITION
Obesity Trends Among US Adults - USDA
Source: MIT OpenCourseWare - An Introduction to Analytics, "Working with Data: An Introduction to R" (Video 1: Why R)
More than 35% of US adults are obese. Obesity-related conditions are some of the
leading causes of preventable death (heart disease, stroke, type II diabetes).
Worldwide, obesity has nearly doubled since 1980. 65% of the world’s population lives in
countries where overweight and obesity kills more people than underweight.
Good nutrition is essential for a person’s overall health and well-being, and is now more
important than ever. Hundreds of nutrition and weight-loss applications. 15% of adults
with cell phones use health applications on their devices. These apps are powered by
the USDA Food Database.
The United States Department of Agriculture distributes a database of nutritional
information for over 7,000 different food items. Used as the foundation for most food and
nutrient databases in the US. Includes information about all nutrients. Calories, carbs,
protein, fat, sodium, . . .

UNDERSTANDING FOOD
NUTRITIONAL EDUCATION WITH DATA
Dr. Shaheen M.Sc. (CS), Ph. D. (CSE)
Institute of Public Enterprise, Shamirpet Campus, Hyderabad – 500 101
Email: shahmsc@ipeindia.org Mobile: + (91)98666 66620
Unit II: Business Analytics using R
Introduction to R Programming; Installing R and R Studio; Data Structures in R: Vectors, Dataframes,
Lists, Matrices and Array Operations; Summarizing Data: Numerical and Graphical Summaries; Data
Visualization in R: Histogram, Bar Chart, Scatter Plot, Box Plot, Corrgram, Corrplot, ggplot2; Data
Manipulation in R: Built-in Functions, Apply Functions, Date and Time Functions, Dplyr Functions, Pipe
Operator; Data Transformation: Filtering, Dropping, Merging, Sorting, Reshaping of Data, Detecting
Missing Values in the Data, Imputation; Data Import and Export Techniques in R; Statements:
Conditional Statements and Control Statements.
Statistical Applications: Parametric One and Two Sample Tests: Z-test, t-test, Chi-square test, and
ANOVA; Non-parametric Test: Mann Whitney U test, Wilcoxon test, Kruskal Wallis test; Correlation
Analysis; Simple and Multiple Linear Regression.
Unit III: Business Analytics using Python
Introduction to Python Programming; Installing Python, Pycharm and Anaconda; Data Structures in
Python: Variables, Files, Lists, Dictionaries, Tuples, and Sets; Functions: In-built, User-defined and
Lambda Functions; Statements: Conditional Statements and Control Statements; Exception
Handling.
Data Import and Export Techniques in Python; Summarizing Data: Numerical and Graphical
Summaries; Data Visualization in Python: Matplotlib and Seaborn libraries; Data Analysis: Numpy,
Pandas and Sklearn Libraries; Model building for Simple and Multiple Linear Regression.
2. BUSINESS ANALYTICS USING R STUDIO
CONTENTS
1. Installing R & R Studio
2. Data Types & Data Structures
1. Vectors
2. Dataframes
3. Lists
4. Matrices
5. Arrays
6. Factors

3. Data Input
4. Operators in R
5. Organizing the data
6. Summarizing Data – Numerical summary
1. INSTALLING R & R-STUDIO
• R is an industrial strength open source statistical package
• R & R Studio Download - Takes 2 minutes to install
• R Installation - https://cran.r-project.org/bin/windows/base/old/3.5.2/
• R Studio Installation
• Provides an integrated environment for working in R
• https://www.rstudio.com/products/rstudio/download/
R
• Is a Case sensitive statistical
programming language
• Is an interpreter
• Is a Free open source
platform
• Was developed by R.
Gentleman and R. Ihaka at
the University of Auckland, during the 1990s
• Is becoming lingua franca
for data science - mastering
core skills will be easier in R
• Has High-level data analytics
& statistical functions
• Is an effective data
visualization tool
• Data scientists use R at Google & Facebook - 2 of the best companies to work for in the modern economy
• Tool of choice for data scientists - At Microsoft - Apply machine learning to data from Bing,
Azure, Office, & data from other (Sales, Marketing & Finance) departments
• Widely used in a variety of companies - Bank of America, Ford, TechCrunch, Uber, & Trulia
R provides many functions to examine features of vectors and other objects
• > (Command prompt) - Commands are entered one at a time
• # - Beginning of comment
• q() or Escape key - Terminates the current R session
• ?command or >help(command) – Seeks Inbuilt help for R
• View(dataset) - Invoke a spreadsheet - style data viewer on a matrix
• Commands are separated either by a semi-colon (;) or by a newline. Elementary commands can be
grouped into one compound expression by braces ({ }). If a command is not complete, R will prompt + on
subsequent lines
• <- or = or -> (assignment operators)- Results of calculations can be stored in objects
• Keyboard Vertical arrow keys - Used to scroll forward & backward through a command history
• Basic functions are available by default. Other functions are contained in packages as (built-in & user-
created functions) that can be attached as needed. They are kept in memory during an interactive
session
o Packages - Collection of R functions, data & compiled code in a well-defined format
o R comes with a standard set of packages (including base, datasets, utils, grDevices, graphics, stats, and
methods), providing a wide range of functions & datasets that are available by default
o There are more than 5,500 user-contributed modules called packages that can be downloaded from
http://cran.r-project.org/web/packages
o library - Directory where packages are stored on your computer
o .libPaths() – Shows where your library is located
o search() - Lists the packages that are loaded and ready for use
o install.packages() – To install a package for the first time
o Like any software, packages are often updated by their authors. To update any package that is already
installed, use the command update.packages()
o installed.packages() – Lists the packages you have, along with their version numbers, dependencies, &
other information
o library(package) - Loads package for current session
o apropos("command") or find("command") - Find library containing "command“
o library(help=package) - Lists datasets and functions in package
o To use an installed package in R session, you need to load the package using the library() command
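As a quick illustration of this workflow, the snippet below installs and loads one add-on package (ggplot2 is used purely as an example name) and then checks the search path:
install.packages("ggplot2")   # download & install once (needs an internet connection)
library(ggplot2)              # load the package for the current session
search()                      # "package:ggplot2" should now appear in the list
update.packages()             # optionally refresh all installed packages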
o class() - what kind of object is it?
o is.integer() & is.numeric() – Is the data type of the object integer or numeric?
o typeof() - what is the object’s data type (low-level)?
o length() - how long is it? What about two dimensional objects?
o attributes() - does it have any metadata?
o x[i] - access element i of vector x
o names(list) - show names of the elements in list
o list$element - access element of list
o Inf infinity; Example: - 1/0
o NaN - Not a number ; Example: - 0/0
o NA - Missing value, when no value has been assigned
o NULL - Null object ; Example: - An empty list
o help.start() - Online access to R manuals
o help(command) or ?command - Help on command
o help.search("command") or ??command - Search the help system
o demo() - Run R demo; Example: - demo(graphics)
o example(function) - Shows example of function
o str(a) - Displays the internal structure of an object
o summary(a) - Displays a summary of object
o dir() - Show files in the current directory
o setwd(path) - Set windows directory to path
o ls() or objects() - Shows objects in the search path
o rm(x,y,...) - Remove object x,y,... from workspace
o rm(list=objects()) - Remove everything created so far
o file - Denotes file name like "path/rscript.r" or "ascii.txt"
o save.image(file) - Save current workspace in file
o load(file) - Load previously saved workspace file
o data(x) - Load specified data set (data frame) x
o source(file) - Execute the R-script file
o sink(file) - Send output to file instead of screen, until sink()
o scan(file) - Read file into a vector or list
o read.table(file,header=TRUE) - Read file into data frame; first line defines column names; space is separator;
see help for options on row naming, NA treatment etc.
o read.csv(file,header=TRUE) - Read comma-delimited file
o read.delim(file,header=TRUE) - Read tab-delimited file
o read.fwf(file,widths) - Read fixed width formatted file; integer vector widths defines column widths
o save(file,x,y,...) - Save x,y,... to file in portable format
o print(x,y,...) - Print objects x,y,... in default format
o format(x,y,...) - Format x,y,...for pretty printing; Example: - print(format(x))
o cat(x,y,...,file="", sep=" ") - Concatenate and print x,y,... to file (default console), using separator sep
o write.table(x,file="",row.names=T,col.names=T, sep=" ") - Print x as data frame, using separator sep; eol is
end-of-line separator, na is missing values string
o write.table(x,"clipboard",sep="nt",col.names=NA) - Write a table to the clipboard for Excel in Windows
o read.delim("clipboard") - Read a table copied from Excel
o foreign - read data stored by Minitab, SAS, SPSS, Stata, e.g. read.dta("statafile"), read.ssd("sasfile"),
read.spss("spssfile")
o x=numeric(n) - Create numeric n-vector with all elements 0
o c(x,y,...) - Stack (concatenate) values or vectors x,y,... to form a long vector
o a:b - Generate sequence of integers a,...,b; : has priority; e.g. 1:4 + 1 is 2 3 4 5
o seq(from,to,by=) - Generate a sequence with increment by; length= specifies length
o seq(along=x) - Generate 1, 2, ..., length(x)
o rep(x,n) - Replicate x n times, e.g. rep(c(1,2,3),2) is 1 2 3 1 2 3; each= repeats each element of x that many
times, e.g. rep(c(1,2,3),each=2) is 1 1 2 2 3 3
o x=complex(n) - create complex n-vector with all elements 0
o list(...) - Create list of the named or unnamed arguments; e.g. list(a=c(1,2),b="hi",c=3i)
o factor(x,levels=) - Encode vector x as factor (define groups)
o gl(n,k,length=n*k,labels=1:n) - Regular pattern of factor levels; n is the number of levels, and k is the
number of replications
o expand.grid(x,y,...) - Create data frame with all combinations of the values of x,y,..., e.g.
expand.grid(c(1,2,3),c(10,20))
o Dealing with missing values - One of the most important problems in statistics is incomplete data sets. To
deal with missing values, R uses the reserved keyword NA, which stands for Not Available
is.na() - Tests whether a value is NA. Returns TRUE if the value is NA
Example: - x = NA is.na(x) TRUE
is.nan() - Tests whether a value is NaN (Not a Number). Returns TRUE only if the value is NaN
Example: - x = NA is.nan(x) FALSE

> is.na(mtcars$mpg)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [17]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> is.nan(mtcars$cyl)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [17]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
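The same idea scales to whole data frames. A small sketch using the built-in airquality dataset, which (unlike mtcars) does contain missing values:
colSums(is.na(airquality))             # count of NAs per column (Ozone & Solar.R have some)
mean(airquality$Ozone)                 # NA, because missing values are present
mean(airquality$Ozone, na.rm = TRUE)   # drops the NAs before averaging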
2 DATA TYPES & DATA STRUCTURES
• In any programming language, variables store information

• Based on the data type of a variable, operating system (OS) allocates memory & decides what can be
stored in the reserved memory

• Frequently used R-objects


1. Vectors
2. Data Frames
3. Lists
4. Matrices
5. Arrays
6. Factors
• Simplest 6 data types of atomic vectors
1. Numeric
2. Integer
3. Complex
4. Character (String)
5. Logical (True/ False)
6. Raw
• In any programming language, variables are required to store data. Variables
assigned with a type need to assign that type of data only
• R has 6 basic/ fundamental data types, also known as atomic vector types that are
used extensively in R programs namely – Numeric, Integer, Character, Complex,
Logical & Raw
• Numeric – Numeric variables are used to represent numbers in R. Default data type. Represents
decimal data
• Integer – Special type of numeric in R, used to represent whole numbers. To specify a number as an
integer the programmer needs to add L next to it. The user will not see the difference between the
numeric 4 & the integer 4L from the output point of view, however the function class() reveals the
difference
• Character – R allows the programmers to use the string of characters
• Complex – Handles complex numbers, which has 2 parts – a real value & an imaginary value
• Logical – One of the frequently used data type usually used for comparing 2 values. Values a logical
data type takes is TRUE or FALSE
• Raw – R provides flexibility to programmers to convert any given data type to a special data type
called raw. It is intended to hold any data as a sequence of bytes, where it is possible to extract sub-
sequences of bytes & replace them as elements of a vector. In R, raw vectors, store fixed length
sequences of bytes.
Note: - All integer variables are numeric, but not all numeric variables are integers
Basic Data Types – Example, Function & Output
• Numeric: X = 1.5; X → 1.5; class(X) → "numeric"; is.numeric(X) → TRUE; X = as.numeric(X)
• Integer: X = 15L; X → 15; print(class(X)) → "integer"; is.integer(X) → TRUE; X = as.integer(X)
• Complex: Z = 1+2i; Z → 1+2i; print(class(Z)) → "complex"; is.complex(Z) → TRUE
• Logical: a = 4; b = 6; C = a > b; C → FALSE; print(class(C)) → "logical"
• Character: X = "R Studio"; X → "R Studio"; print(class(X)) → "character"; is.character(X) → TRUE;
Y = 1; Y = as.character(Y); print(Y) → "1"
• Raw: name = "IPE"; r = charToRaw(name); class(r) → "raw"; s = rawToChar(r); class(s) → "character"


2.1. Vectors – c()
• One-dimensional arrays that can hold numeric/ character/ logical data
• Constants / one-element vectors – Also called as Scalars hold constants
Example: - x = 2.414, h = TRUE, cn = 2 + 5i, name = ‘IPE’, v = charToRaw(‘IPE’)
• Types of Vectors
1. Numeric Vectors – contains all kinds of numbers
2. Integer Vectors – contains integer values only
3. Logical Vectors – contains logical values (TRUE and/ or FALSE)
4. Character Vectors – contains text

Atomic vectors – Description & Example
• Logical: produces the TRUE/FALSE result; class(h) → "logical"
• Numeric: handles positive & negative decimals including 0; Marks=c(23.2,32.4,12.6,04.2,49.1);
Marks → 23.2 32.4 12.6 4.2 49.1
• Integer: handles positive & negative integers including 0; Marks=c(23L,32L,12L,04L,49L);
Marks → 23 32 12 4 49
• Complex: handles complex data items; class(cn) → "complex"
• Character: handles string/ character data items; name="IPE"; class(name) → "character"
• Raw: returns a raw vector of bytes; v = charToRaw('IPE'); v → 49 50 45; class(v) → "raw"
Function Description Example
c() Creates a vector marks=c(10,20,30,40,50) marks 10 20 30 40 50
smarks=c(56,89,48,12,38) amarks=c(78L,48L,59L,15L,26L)
c() Combines vectors final = c(smarks,amarks)
final 56 89 48 12 38 78 48 59 15 26
class(x) Gives data type of the R object x class(final) "numeric"
nchar(x) Finds the length of the object nchar(name) 3
length(x) Returns the length of vector length(final) 10
seq(x) Creates simple sequence of integers seq(from=1, to=20, by = 2) 1 3 5 7 9 11 13 15 17 19
rep(x) Replicates vector said number of times rep(1:4, each = 2, len = 10) 1 1 2 2 3 3 4 4 1 1
is.numeric(x) Tests whether x is an numeric vector is.numeric(final) TRUE
is.integer(x) Tests whether x is an integer vector is.integer(final) FALSE
is.character(x) Tests whether x is a character vector is.character(name) TRUE
o We can easily edit the vector by using indices in single/ multiple locations
Example:- If smarks=c(56,89,48,12,38); smarks[2]=100 Output: - smarks - 56 100 48 12 38
smarks[c(2,4,5)]=100 Output: - smarks - 56 100 48 100 100
o Vector operations are complicated when operating on 2 vectors of unequal length. Shorter vector
elements are repeated, in order, until they have been matched up with every element of the longer
vector. If longer vector is not a multiple of shorter one, a warning is given
Example:- If x=1:8 y=10:13; x+y Output: - 11 13 15 17 15 17 19 21
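The warning case can be seen with two vectors whose lengths are not multiples of each other (an illustrative example, not from the slides):
x = 1:8
y = 1:3
x + y    # 2 4 6 5 7 9 8 10, with a warning that the longer object
         # length is not a multiple of the shorter object length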
Function Description Example

all() Tests whether all the resulting elements are TRUE all(smarks>amarks) FALSE

any() Tests whether any element is TRUE any(smarks>amarks) TRUE

Gives difference between current & next value in the diff(final) 33 -41 -36 26 40 -30 11 -44 11
diff(x)
vector
sum(x) Calculates the sum of all values in vector x sum(final) 469

prod(x) Calculates the product of all values in vector x prod(final) 9.398024e+15

min(x) Gives the minimum of all values in x min(final) 12

max(x) Gives the maximum of all values in x max(final) 89


cumsum(final)
cumsum(x) Gives the cumulative sum of all values in x 56 145 193 205 243 321 369 428 443 469

cumprod(final)
5.600000e+01 4.984000e+03 2.392320e+05
cumprod(x) Gives the cumulative product of all values in x 2.870784e+06 1.090898e+08 8.509004e+09 [7]
4.084322e+11 2.409750e+13 3.614625e+14
9.398024e+15

Gives the minimum for all values in x from the start of cummin(final) 56 56 48 12 12 12 12 12 12 12
cummin(x)
the vector until the position of that value

Gives the maximum for all values in x from the start of cummax(final) 56 89 89 89 89 89 89 89 89 89
cummax(x)
the vector until the position of that value
vector = c(10,20,30,40,50,60,70,80,80,80) vector[vector==80] = NA
vector 10 20 30 40 50 60 70 NA NA NA
sum(is.na(vector)) 3 # Checks for NA’s

Command Description Example


max(vector,na.rm=FALSE) NA
max(x,na.rm=FALSE) Shows the maximum value
max(vector,na.rm=TRUE) 70
min(vector,na.rm=FALSE) NA
min(x,na.rm=FALSE) Shows the minimum value of the vector
min(vector,na.rm=TRUE) 10
Gives the length of the vector including length(vector) 10
length(x) NA values.
length(na.omit(x)) na.rm instruction doesn’t work with
length(). na.omit() strips out NA items length(na.omit(vector)) 7
mean(vector,na.rm=FALSE) NA
mean(x,na.rm=FALSE) Shows the arithmetic mean
mean(vector,na.rm=TRUE) 40
median(vector,na.rm=FALSE) NA
median(x,na.rm=FALSE) Shows the median
median(vector,na.rm=TRUE) 40
sd(vector,na.rm=FALSE) NA
sd(x,na.rm=FALSE) Shows the standard deviation
sd(vector,na.rm=TRUE) 21.60247
var(vector,na.rm=FALSE) NA
var(x,na.rm=FALSE) Shows the variance
var(vector,na.rm=TRUE) 466.6667
mad(vector,na.rm=FALSE) NA
mad(x,na.rm=FALSE) Shows the median absolute deviation
mad(vector,na.rm=TRUE) 29.652
Summary Commands with Multiple Results: - Produces several values
vector=c(10,20,30,40,50,60,70,80,90,100)
• log(vector)
2.302585 2.995732 3.401197 3.688879 3.912023 4.094345 4.248495 4.382027 4.499810 4.605170
• summary(vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 32.5 55.0 55.0 77.5 100.0
• quantile(vector)
0% 25% 50% 75% 100%
10.0 32.5 55.0 77.5 100.0
• fivenum(vector)
10 30 55 80 100
2.2. Data Frames – data.frame()
• Data frame is a list of vectors of equal length, where each column can contain different
modes of data
• Data frames are tabular data objects created using the data.frame() function
x=1:5
jdate=c("30-01-2017","28-01-2017","16-01-2017","02-02-2017","05-02-2017")
age=c(25,34,28,52,74)
diabetes=c('Type 1','Type 2','Type 2','Type 1','Type 2')
status=c('Poor','Improved','Excellent','Poor','Improved')
Diabetologist = data.frame(x,jdate,age,diabetes,status)
diabetesstatus=data.frame(PID=x,Join_Date=jdate,Age=age,Type=diabetes,Status=status)
View(diabetesstatus) #Views the contents of the dataset created
names(diabetesstatus) #Views the names assigned to the elements of the list
Function Description Example
nrow(x) Displays no. of rows nrow(diabetesstatus) 5
ncol(x) Displays no. of columns ncol(diabetesstatus) 5

dim(x) Displays no. of rows & columns dim(diabetesstatus) 5 5

rownames(x) Assign row names rownames(diabetesstatus) = c('One','Two','Three','Four','Five')

rownames(x) = NULL Set back to the generic index rownames(diabetesstatus) = NULL


Prints only first 6 rows of the data
head(x) head(diabetesstatus)
frame
tail(x) Prints last 6 rows of the data frame tail(diabetesstatus)
To access single variable in a
dataframe$colname diabetesstatus$Status
data frame $ argument is used
Summarizing a more complicated object
statsvector=c(10,20,30,40,50,60,70,80,90,100) accountsvector=c(50,60,70,80,90,10,20,30,40,50)
frenchvector=c(30,50,100,98,70,20,40,50,70,10)
studentroaster=data.frame(statsvector,accountsvector,frenchvector)
Command Description Example
max(frame) Largest value of data frame is returned max(studentroaster) 100
min(frame) Smallest value of data frame is returned min(studentroaster) 10
sum(frame) Sum of the entire data frame sum(studentroaster) 1588
Tukey summary values for the entire data fivenum(studentroaster$statsvector)
fivenum(frame)
frame is returned 10 30 55 80 100
length(frame) No. of columns of data frame is returned length(studentroaster) 3
summary(frame) Gives summary for each column summary(studentroaster)
rowMeans(frame) Returns the mean of each row rowMeans(studentroaster)
rowSums(frame) Returns the sum of each row rowSums(studentroaster)
colMeans(frame) Returns the mean of each column colMeans(studentroaster)
colSums(frame) Returns the sum of each column colSums(studentroaster)
apply(studentroaster,1,mean,na.rm=TRUE)
Enables to apply function to rows/
apply(studentroaster,2,mean,na.rm=TRUE)
apply(x,MARGIN,FUN) columns of matrix /data frame. Margin
apply(studentroaster,1,median,na.rm=TRUE)
(1/ 2) is 1 for rows & 2 is for columns
apply(studentroaster,2,median,na.rm=TRUE)
sapply(studentroaster, mean, na.rm=TRUE)
sapply(x, FUN, na.rm=TRUE)
sapply(studentroaster, sd, na.rm=TRUE)
2.3. Lists – list()
• Most complex of R data types - contains many different types of elements inside it like
vectors, functions and even another list inside it
x=1:5
jdate=c("30-01-2017","28-01-2017","16-01-2017","02-02-2017","05-02-2017")
age=c(25,34,28,52,74)
diabetes=c('Type 1','Type 2','Type 2','Type 1','Type 2')
status=c('Poor','Improved','Excellent','Poor','Improved')
patientroster=list(x,jdate,age,diabetes,status) # Creates a list
# Name the objects in the list
patientroster=list(PatientID=x,JoiningDate=jdate,AgeofthePatient=age,Disease=diabetes,CurrentStatus=status)
names(patientroster) #Views the names assigned to the elements of the list
Command Description Example
mean(patientroster$AgeofthePatient) Returns the mean of the ages in the list 42.6
max(patientroster$AgeofthePatient) Returns the largest value 74
min(patientroster$AgeofthePatient) Returns the smallest value 25
summary(patientroster) Returns the summary of the list
length(patientroster) Returns the number of elements in the list 5
lapply(patientroster,mean,na.rm=TRUE) List apply specifically works on list objects
sapply(patientroster,mean,na.rm=TRUE) Resulting output is in fact a simplified vector/ matrix
2.4. Matrices – matrix()
• Two-dimensional rectangular data set - created using a vector input to the matrix function - indices for
dimensions are separated by a comma – specifies index for row before comma & the index for column after
comma
• Every element must be of the same type, most commonly numeric
• Similar to vectors - element-by-element addition, multiplication, subtraction, division & equality
• Syntax: matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE)
where data: the data vector nrow: desired number of rows
ncol: desired number of columns byrow: logical. If `FALSE' matrix is filled by columns
Function Description
c = matrix(1:30, ncol=5, byrow=TRUE) Creating Matrix - Arranging elements row-wise
c = matrix(1:30, ncol=5, byrow=FALSE) Creating Matrix - Arranging elements column-wise
c[1:2,2:3] Extract subset of the matrix - values of rows (1 & 2) & columns (2 & 3)
c[1:2,] Extract complete rows (1 & 2)
c[,3:5] Extract complete columns (3, 4 & 5)
c[,-3:-5] Drops values in a vector by using a negative value for the index
c[-c(1,3),] Drops the first & third rows of the matrix
c[3, ,drop = FALSE] Returns the extracted third row as a matrix (drop = FALSE prevents collapsing to a vector)
c[3,2] = 4 Replacing a value in the matrix
c[1:2,4:5] = c(50,100,50,100) Replace a subset of values within the matrix by another matrix
c[4,] = c(1,2,3,4,5) Change 4th row values with specified values by not specifying other dimension
rowSums(c) & colSums(c) Returns with the sums of each row & column
Data of a single vector happens to be split into rows & columns
cosmeticsexpenditure=matrix(c(10,20,30,40,50,60,70,80,90,100,110,120),ncol=4)
colnames(cosmeticsexpenditure)=c("YMen","YWomen","SCM","SCW")

Command Description Example

mean(x[,2]) Returns mean of the second column mean(cosmeticsexpenditure[,2]) 50

mean(x[2,]) Returns mean of the second row mean(cosmeticsexpenditure[2,]) 65

rowMeans(matrix_name) Returns the mean of each row rowMeans(cosmeticsexpenditure) 55 65 75

rowSums(matrix_name) Returns the sum of each row rowSums(cosmeticsexpenditure) 220 260 300

colMeans(cosmeticsexpenditure)
colMeans(matrix_name) Returns the mean of each column
YMen YWomen SCM SCW 20 50 80 110
colSums(cosmeticsexpenditure)
colSums(matrix_name) Returns the sum of each column
YMen YWomen SCM SCW 60 150 240 330

apply(cosmeticsexpenditure,2,mean)
Works equally well for a matrix as it YMen YWomen SCM SCW 20 50 80 110
apply(x,MARGIN,FUN)
does for a data frame object apply(cosmeticsexpenditure,1,mean) 55 65 75
2.5. Arrays – array()
• While matrices are confined to two dimensions, arrays can be of any number of
dimensions
• Arrays have 2 very important features
• They contain only a single type of value
• They have dimensions
• Syntax: array(vector, dimensions, dimnames)
where vector – data for the array; dimensions – numeric vector giving maximal index
for each dimension; & dimnames – optional list of dimension labels
• Dimensions of an array determine the type of the array. An array with 2 dimensions is
a matrix. A data structure with more than 2 dimensions is an array
dim1 = c("A1","A2")
dim2 = c("B1","B2","B3")
dim3 = c("C1","C2","C3","C4")
z = array(1:24, c(2,3,4), dimnames=list(dim1,dim2,dim3))
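Elements of the array can then be pulled out with three indices, or with the dimension names, for example:
z[1, 2, 3]            # element in row A1, column B2, layer C3 -> 15
z["A1", "B2", "C3"]   # the same element, addressed by dimension names
z[ , , "C2"]          # the whole 2 x 3 matrix forming the second layer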
2.6. Factors – factor()
• Factors store vector along with the distinct values of the elements as labels
• Labels are always character irrespective of whether it is numeric or character or Boolean etc.
in the input vector
• Factors are created using the factor() function. The nlevels() function gives the count of levels
iphone_colors = c('green','green','yellow','red','red','red','green')
factor_iphone = factor(iphone_colors) # Create a factor object
print(factor_iphone) # Print the factor
green green yellow red red red green Levels: green red yellow
print(nlevels(factor_iphone)) 3
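Since a factor stores both the data and its distinct levels, the level counts can be tabulated directly:
table(factor_iphone)     # green 3, red 3, yellow 1
levels(factor_iphone)    # "green" "red" "yellow"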
3. Data Input
• Data comes from a variety of sources & in a variety of formats
• Reading data into a statistical system for analysis & exporting the results to some other
system for report writing can be frustrating tasks that can take far more time than the
statistical analysis itself
• R provides a wide range of tools for importing data. This section describes the import &
export facilities available either in R itself or via packages
• Definitive guide for importing data in R is the R Data Import/ Export manual available
at http://mng.bz/urwn
• Statistical systems like R are not particularly well suited to manipulations of large-scale
data
• Easiest form of data to import into R is a simple text file, and this will often be
acceptable for problems of small or medium scale. The primary function to import from
a text file is scan, & this underlies most of the more convenient functions
• Statistical consultants are presented data in some proprietary binary format (Excel
spreadsheet/ SPSS file) by a client. This section discusses what facilities are available to
access such files directly from R
• For much larger databases it is common to handle the data using a database
management system (DBMS). For many such DBMSs the extraction operation can be
done directly from an R package
1. Importing data from a delimited text file
read.table() - Data can be imported from a delimited text file. It reads a file in a table
format & saves it as a data frame. Each row of the table appears as one line in the file
Syntax: - mydf1 = read.table(file, header=logical_value, sep="delimiter",
row.names="name")
where file - a delimited ASCII file
header - Logical value indicating whether first row contains variable names (TRUE/FALSE)
sep - specifies delimiter separating data values (“,” , “\t”, “ “)
row.names - Optional parameter; Specify 1 or more variables to represent row identifiers
Example: - grades = read.table("C:/Users/npm/Desktop/IPE - Class of 2019 - Reference
Material/Business Analytics for Managers - 2020/studentmarks.csv", header=TRUE,
sep=",", row.names="StudentID")
grades
Note: - read.table() function has many additional options for fine-tuning the data
import. See help(read.table) for details
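Because the path above points to a specific machine, here is a self-contained sketch of the same idea that writes a small CSV to a temporary file and reads it back (the column names are made up for illustration):
tmp = tempfile(fileext = ".csv")   # temporary file, works on any machine
write.csv(data.frame(StudentID = 1:3, Marks = c(55, 72, 68)), tmp, row.names = FALSE)
grades = read.table(tmp, header = TRUE, sep = ",", row.names = "StudentID")
grades                             # 3 rows, indexed by StudentID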
2. Importing data from Excel
read.xlsx() – Imports Excel worksheets directly using xlsx package. xlsx package can
be used to read, write, & format Excel 97/ 2000/ XP/ 2003/ 2007 files. It imports a
worksheet into a dataframe
Syntax: - mydf2 = read.xlsx(file, n) or mydf3 = read_excel(file) # read_excel() comes from the readxl package
where file – Path to an Excel workbook
n – Index of the worksheet to be imported
Example: - HI = read_excel("C:/Users/npm/Desktop/IPE - Class of 2019 - Reference
Material/Business Analytics for Managers - 2020/Health_Insurance.xlsx")
HI
Note: - xlsx package can do more than import worksheets. It can create and
manipulate Excel XLSX files as well. Programmers who need to develop an interface
between R and Excel should check out this relatively new package
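A hedged sketch of both routes, assuming the workbook Health_Insurance.xlsx from the example above sits in the working directory (the xlsx package needs Java; readxl does not):
library(xlsx)                                     # xlsx package route
HI1 = read.xlsx("Health_Insurance.xlsx", sheetIndex = 1)
library(readxl)                                   # readxl package route
HI2 = read_excel("Health_Insurance.xlsx")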
4. OPERATORS IN R
• An operator performs specific mathematical/ logical manipulations
• R is rich in built-in operators. Operators in R include:
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators
Order of operations (see the worked example below)
1. Exponentiation
2. Multiplication & Division in the order in which the operators are presented
3. Addition & Subtraction in the order in which the operators are presented
4. The mod operator (%%) & integer division operator (%/%) have the same priority as the
normal division operator (/) in calculations
5. Basic order of operations in R: Parentheses (), Exponents (^), Multiplication, Division,
Addition & Subtraction (PEMDAS)
6. Operations put between parentheses are carried out first
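A short worked example of these precedence rules (values chosen only for illustration):
(2 + 3) * 4 ^ 2 / 8   # parentheses first, then ^, then * and / left to right: 5 * 16 / 8 = 10
7 %% 3 + 1            # %% sits in the same tier as /, so it is evaluated before +: 1 + 1 = 2
2 ^ 3 ^ 2             # ^ is right-associative in R: 2 ^ (3 ^ 2) = 512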
Arithmetic Operators
Operator Description Example
x + y y added to x 2 + 3 = 5
x – y y subtracted from x 8 - 2 = 6
x * y x multiplied by y 2 * 3 = 6
x / y x divided by y 10 / 5 = 2
x^y x raised to the power y 2 ^3 = 8
x %% y Remainder of x divided by y (x mod y) 7 %% 3 = 1
x%/%y x divided by y but rounded down (integer divide) 7 %/% 3 = 2
trunc(x) Integer part of x trunc(2.5) = 2
trunc(-2.5) = -2
ceiling(x) Round up to nearest integer ceiling(2.5) = 3

round(x, n) Rounds the elements of x to n decimals

Assignment Operators Operator Description Example
x = y Left assignment x=c(1,2,3,4,5) x 1 2 3 4 5
x <- y Left assignment x<-c(1,2,3,4,5) x 1 2 3 4 5
x <<- y Left assignment (also assigns in enclosing environments) x<<-c(1,2,3,4,5) x 1 2 3 4 5
y -> x Right assignment c(6,7,8,9,10)->p p 6 7 8 9 10
y ->> x Right assignment (also assigns in enclosing environments) c(6,7,8,9,10)->>p p 6 7 8 9 10
If x = c(1.5,2.5,3.5,4.5,5.5)
y = c(1,2,3,4,5)
Relational Operators Operator Description Output
x == y Returns TRUE if x exactly equals y FALSE FALSE FALSE FALSE FALSE

x>y Returns TRUE if x is larger than y TRUE TRUE TRUE TRUE TRUE

x<y Returns TRUE if x is smaller than y FALSE FALSE FALSE FALSE FALSE

x >= y Returns TRUE if x is > or exactly equal to y TRUE TRUE TRUE TRUE TRUE

x <= y Returns TRUE if x is < or exactly equal to y FALSE FALSE FALSE FALSE FALSE

x != y Returns TRUE if x differs from y TRUE TRUE TRUE TRUE TRUE

Logical Operators Operator Description Output


x&y Returns the result of x and y TRUE TRUE TRUE TRUE TRUE
x|y Returns the result of x or y TRUE TRUE TRUE TRUE TRUE
!x Returns not of x FALSE FALSE FALSE FALSE FALSE
xor(x, y) Returns the exclusive OR of x and y (TRUE when exactly one of them is TRUE) FALSE FALSE FALSE FALSE FALSE
Takes first element of both the vectors &
x && y TRUE
gives the TRUE only if both are TRUE
Takes first element of both the vectors &
x || y TRUE
gives the TRUE if one of them is TRUE
Miscellaneous Operators
Operator Description Example

Creates the series of numbers in a = 1:5


:
sequence for a vector print(a * a) 1 4 9 16 25
a = 1:5 b = 5:10 t = c(5,10,15,20,25)
Used to identify if an element belongs
%in% print(a%in%t) FALSE FALSE FALSE FALSE TRUE
to a vector
print(b%in%t) TRUE FALSE FALSE FALSE FALSE TRUE

Performs matrix multiplication (here, M = matrix( c(1,2,3,4), nrow = 2,ncol = 2,byrow = TRUE) x=t(M)
%*%
multiplying a matrix by its transpose) NewMatrix=M %*% t(M)
5. ORGANIZING THE DATA
(FREQUENCY AND CONTINGENCY TABLES)
• Tables, diagrams & graphs provide
1. easy way to assimilate summaries of data
2. part way in describing data
• In this section, we’ll look at frequency & contingency tables for categorical variables

Arthritis Treatment Data


• Data from Koch & Edwards (1988) from a double-blind
clinical trial investigating a new treatment for rheumatoid
arthritis is recorded as a data frame with 84 observations
and 5 variables
ID - patient ID
Treatment - factor indicating treatment (Placebo, Treated)
Sex - factor indicating sex (Female, Male)
Age - age of patient
Improved - treatment outcome (None, Some, Marked))
Generating Frequency Table
o R provides several methods for creating frequency and contingency tables. The most important
functions are listed in table below
o Let’s use each of these functions to explore categorical variables
o We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway
contingency tables
One-way Frequency Table
1. table() - Generates simple frequency counts
table(Arthritis$Improved)
tab1 = with(Arthritis,table(Improved)) #Alternative method
2. prop.table() - Turn frequencies into proportions
prop.table(tab1)
3. prop.table()*100 - Turn frequencies into percentages
prop.table(tab1)*100
Interpretation: - We observe that 50 percent of study participants had no improvement, 17 percent had seen some
improvement while 33 percent experienced marked improvement
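A self-contained version of these one-way summaries, assuming the Arthritis data frame is taken from the vcd package (its standard source); install the package first if needed:
# install.packages("vcd")          # one-time, if not already installed
library(vcd)                        # provides the Arthritis data frame
tab1 <- table(Arthritis$Improved)   # simple frequency counts
tab1
prop.table(tab1)                    # proportions
round(prop.table(tab1) * 100, 1)    # percentages: None 50, Some 16.7, Marked 33.3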
Two-way Frequency Table
1. For two-way tables, the format for the table() function is tw_table = table(A, B)
where A - row variable, & B - column variable.
tab2 = table(Arthritis$Treatment, Arthritis$Improved)
tab2
2. Alternatively, the xtabs() function allows you to create a contingency table using formula style input. The
format is c_table = xtabs(~ A + B, data=mydata)
where mydata - matrix or data frame
In general, variables to be cross-classified appear on the right of the formula (that is, to the right of the ~)
separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of
frequencies
tab3 = xtabs(~Treatment+Improved, data = Arthritis)
tab3
3. margin.table() & prop.table() - Generate marginal
frequencies & proportions
For row sums & row proportions,
margin.table(tab3, 1) # Generate marginal frequencies
prop.table(tab3, 1) # Generate marginal proportions
index = 1, refers to the first variable in the table()
statement
Interpretation: - We observe that 51 percent of treated
individuals had marked improvement, compared to 16
percent of those receiving placebo
For column sums & column proportions,
margin.table(tab3, 2) # Generate marginal frequencies
prop.table(tab3, 2) # Generate marginal proportions
index = 2, refers to the second variable in the table()
statement
Note: - table() function ignores missing values (NAs) by
default. To include NA as a valid category in the
frequency counts, include the table option
useNA="ifany"
4. Crosstable
This function has options to report percentages (row, column, cell); specify decimal places; produce chi-
square, Fisher, and McNemar tests of independence; report expected & residual values (Pearson,
standardized, adjusted standardized); include missing values as valid; annotate with row & column titles
Check help(CrossTable) for details
library(gmodels)
tab4 = xtabs(~Treatment+Improved, data = Arthritis)
CrossTable(tab4)
CrossTable(Arthritis$Treatment,Arthritis$Improved) #Alternative Method
CrossTable(Arthritis$Treatment,Arthritis$Improved, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE,chisq =
TRUE)
Null hypothesis: there is no relation between the variables Treatment & Improved
Interpretation: - With p = 0.001 <= 0.05, we reject the null hypothesis and conclude that the variables Treatment & Improved are related. Therefore, improvement in the patient can be attributed to the treatment received
Multi-dimensional Frequency Tables
1. Both table() & xtabs() can be used to generate multidimensional tables based on 3 or more categorical
variables
2. The margin.table(), prop.table(), & addmargins() functions extend naturally to more than two dimensions
3. Additionally, the ftable() function can be used to print multidimensional tables in a compact & attractive
manner
tab5 = xtabs(~ Treatment + Improved + Sex, Arthritis) # Stratified Table
tab5
ftable(tab5) # Flat Table
margin.table(tab5, 1)
margin.table(tab5, 2)
margin.table(tab5, 3)
margin.table(tab5,c(1,3))
ftable(prop.table(tab5, c(1, 2)))
6. SUMMARIZING DATA
(NUMERICAL AND GRAPHICAL SUMMARY)
• To interpret the significance of the data, a concise numerical description is preferred
• 3 key groups of statistical measures that enable us to describe a data set are
1. Measures of central tendency (or location)
2. Measures of dispersion (or spread)
3. Measures of shape (skewness and kurtosis)
1. Measures of Central Tendency - Describe the typical value about which the data values cluster
2. Measures of Dispersion/ Variation - Describe the extent to which the data values are spread about a typical value
3. Measures of Skewness - Measure of the lack of symmetry in a distribution
4. Measures of Kurtosis - Measure of the degree of peakedness in the distribution
SUMMARY - DATA DESCRIPTORS
UNIVARIATE ANALYSIS
Describing Data Numerically
• Central Tendency: Mean, Median, Mode, Percentiles, Quartiles
• Variation: Range, Coefficient of Variation, Variance, Standard Deviation, Mean Deviation
• Shape: Skewness, Kurtosis
Characteristic       Interpretation
Central Tendency     Where are the data values concentrated? What do we mean by average?
Dispersion           How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape                Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Kurtosis             How high and sharp is the peak relative to the rest of the data?
6.1. MEASURES OF CENTRAL TENDENCY
• Uni-variate Analysis – Summary statistics of single variables - measures of central tendency
  o Average or typical observed value of a variable in a data set
  o Center of the frequency distribution of the data
• Commonly used measures of central tendency appropriate for 3 different levels of measurement (nominal, ordinal, and interval) – arithmetic mean, mode and median
• Geometric & harmonic means are appropriate only for ratio variables
Central Tendency measures:
• Arithmetic Mean: X̄ = (ΣXi) / n
• Median: mid point of the ranked values
• Mode: most frequently observed value
• Geometric Mean: XG = (X1 × X2 × … × Xn)^(1/n)
Arithmetic Mean/ Average - Sum of all the observed values of the variable divided by the number of cases
o Affected by extreme values (outliers)
  Example: the values 1, 2, 3, 4, 5 have mean (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3; replacing 5 with the outlier 10 gives (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Geometric Mean (GM) - Positive root of the product of observations. Used to measure the rate of change of a variable over time
  XG = (X1 × X2 × … × Xn)^(1/n)
The GM rate of return measures the status of an investment over time, where Ri is the rate of return in time period i:
  RG = [(1 + R1) × (1 + R2) × … × (1 + Rn)]^(1/n) − 1
Example: Suppose you receive a 5 percent increase in salary this year & a 15 percent increase next year. The average annual percent increase is 9.886 percent, not 10.0. Why is this so?
  GM = ((1.05)(1.15))^(1/2) = 1.09886
Harmonic Mean - Number of observations divided by the sum of the reciprocals of the observations
  H = n / Σ(1/xi)
• Appropriate for situations when the average of rates is desired
• Useful for ratios such as speed (= distance/time) etc.
Example: Suppose a car travels 100 miles with 10 stops, each stop after an interval of 10 miles. The speeds at which the car travels are 30, 35, 40, 40, 45, 40, 50, 55, 55 and 30 miles per hour respectively. What is the average speed with which the car travelled the total distance of 100 miles?
  Σ(1/X) = 1/30 + 1/35 + 1/40 + 1/40 + 1/45 + 1/40 + 1/50 + 1/55 + 1/55 + 1/30 = 0.2488
  H.M. = n / Σ(1/X) = 10 / 0.2488 = 40.2 mph
Hence it is clear that the harmonic mean gives the correct result for this kind of average.
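A quick R check of the harmonic-mean example (plain base R; the speeds are those listed above):
speeds <- c(30, 35, 40, 40, 45, 40, 50, 55, 55, 30)   # mph over ten 10-mile stretches
sum(1 / speeds)                     # 0.2488
length(speeds) / sum(1 / speeds)    # harmonic mean = 40.19 mph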
CONTD...
Median - Midpoint of the values after they have been ordered from the smallest to the largest
• If n is odd, the median is the middle observation in the data array
• If n is even, the median is the arithmetic average of the middle 2 numbers
• Not affected by outliers: for 1, 2, 3, 4, 5 the median is 3, and for 1, 2, 3, 4, 10 the median is still 3
Mode – Value in the data set that occurs with the greatest frequency
• (Number-line examples from the slide: one data set with Mode = 9; another with no repeated value, hence No Mode)
CONTD...
Percentile - Provides information about how the data are spread over the interval from the smallest value to the largest value
• The pth percentile is a value such that at least p percent of the items take on this value or less and at least (100 − p) percent of the items take on this value or more
• Arrange the data in ascending order
• Compute the index i = (p/100)n, the position of the pth percentile
• If i is not an integer, round up; the pth percentile is the value in the ith position
• If i is an integer, the pth percentile is the average of the values in positions i and i + 1
Example (70 ordered data values):
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
80th percentile: i = (p/100)n = (80/100)70 = 56, an integer, so the 80th percentile is the average of the 56th & 57th values = (535 + 549)/2 = 542
Check: at least 80% of the items (56/70 = .8) take on a value of 542 or less, and at least 20% of the items (14/70 = .2) take on a value of 542 or more
CONTD…
Quartile - Splits the ranked data into 4 segments with an equal number of values per segment (25% | Q1 | 25% | Q2 | 25% | Q3 | 25%)
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
Example (same 70 ordered data values as above): Third quartile = 75th percentile; i = (p/100)n = (75/100)70 = 52.5, rounded up to 53, so Q3 = 53rd value = 525
CONTD…
Statistic            Formula                                        Excel Formula    Pro's                                                     Con's
Mean                 Sum of all values divided by the number of     =AVERAGE(Data)   Familiar and uses all the sample information              Influenced by extreme values
                     values
Median               Middle value in sorted array                   =MEDIAN(Data)    Robust when extreme data values exist                     Ignores extremes & can be affected by gaps in data values
Mode                 Most frequently occurring data value           =MODE(Data)      Useful for attribute data or discrete data with a small   May not be unique, and is not helpful for continuous data
                                                                                     range
Geometric mean (G)   Positive root of the product of observations   =GEOMEAN(Data)   Useful for growth rates and mitigates high extremes       Less familiar & requires positive data
EXCEL WORKOUT – EXAMPLE
Marks obtained by the PGDM students in the statistics examination are
24 27 36 48 52 52 53 53 59 60 85 90 95
Use Excel to calculate measures of central tendency, dispersion, and shape for the 'raw' data presented in the table above.
CONTD...
(Sample data in cells B4:B16)
Statistic          Formula           Excel Formula               Output
Mean               (x1 + … + xn)/n   AVERAGE(B4:B16)             56.46154
Median                               MEDIAN(B4:B16)              53
Mode                                 MODE(B4:B16)                52
First Quartile     Q1 = (n+1)/4      QUARTILE(B4:B16,1)          48
Second Quartile    Q2 = (n+1)/2      QUARTILE(B4:B16,2)          53
Third Quartile     Q3 = 3(n+1)/4     QUARTILE(B4:B16,3)          60
25th Percentile    i = (p/100)n      PERCENTILE(B4:B16,0.25)     48
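The same central-tendency summaries can be reproduced in R; a minimal sketch using the marks above (R's default quantile method gives the same quartiles as Excel here):
marks <- c(24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95)
mean(marks)                             # 56.46154
median(marks)                           # 53
quantile(marks, c(0.25, 0.50, 0.75))    # 48 53 60
tab <- table(marks)
names(tab)[tab == max(tab)]             # modal value(s): "52" "53" (both occur twice; Excel's MODE reports 52)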
CONSIDERATIONS FOR CHOOSING A MEASURE OF CENTRAL TENDENCY
• Nominal variables - Mode
• Ordinal variables - Mode & Median
• Interval/ ratio variables - Mode, Median, & Mean
DISPERSION (OR SPREAD OR
VARIATION)
• Measures of location only describe the center of the data. It
doesn’t tell us anything about the spread of the data
• Variation/ Dispersion - “spread” of data points about the center
of the distribution in a sample, that is, the extent to which the
observations are scattered
• Enables to study & compare the spread in two or more
distributions
• Example: in choosing supplier A or supplier B we can measure the variability in delivery time for each
Measures of Variation: Range, Mean Deviation, Variance, Standard Deviation, Coefficient of Variation
Range = Xlargest – Xsmallest
CONTD...
Range - Difference between the largest and smallest data values
• Very sensitive to the smallest and largest data values
• For the 70 ordered data values above: Range = 615 − 425 = 190
Interquartile Range - Difference between the third quartile & the first quartile: Interquartile range = Q3 – Q1
• Not sensitive to the smallest and largest data values, since it ignores the top and bottom 25% of observations
• For the same data: 3rd Quartile (Q3) = 525 and 1st Quartile (Q1) = 445, so Interquartile Range = Q3 − Q1 = 525 − 445 = 80
CONTD...
Variance - Average of the squared differences between each data value and the mean
• Based on the difference between the value of each observation (xi) and the mean (x̄ for a sample, μ for a population)
• Population variance: σ² = Σ(xi − μ)² / N
• Sample variance: s² = Σ(xi − x̄)² / (n − 1)
Example: The hourly wages for a sample of part-time employees at Home Depot are $12, $20, $16, $18, and $19. What is the sample variance?
Standard Deviation - Positive square root of the variance
• Units of the standard deviation are the same as the units of the mean and the data values
Example: Sample data: 10 12 14 15 17 18 18 24 (x̄ = 16)
  S = sqrt[ ((10 − 16)² + (12 − 16)² + (14 − 16)² + … + (24 − 16)²) / (n − 1) ]
    = sqrt(130 / 7) = 4.3095
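Both worked examples can be checked in base R; var() and sd() use the sample (n − 1) denominator:
wages <- c(12, 20, 16, 18, 19)            # Home Depot part-time hourly wages
mean(wages)                                # 17
var(wages)                                 # sample variance = 10
x <- c(10, 12, 14, 15, 17, 18, 18, 24)
var(x)                                     # 130/7 = 18.571
sd(x)                                      # 4.3095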
CONTD...
Coefficient of Variation - Used to compare/ measure relative variation between two or more sets of data, expressed as a percentage (%)
  CV = (S / X̄) × 100%
• The coefficient of variation indicates how large the standard deviation is in relation to the mean
Stock A: Average price last year = $50; Standard deviation = $5
  CV_A = (S / X̄) × 100% = ($5 / $50) × 100% = 10%
Stock B: Average price last year = $100; Standard deviation = $5
  CV_B = (S / X̄) × 100% = ($5 / $100) × 100% = 5%
Both stocks have the same standard deviation, but stock B is less variable relative to its price
CONTD...
(Figure: two distributions with the same center but different variation)
Statistic                       Formula                          Excel                  Pro                                                                Con
Range                           xmax – xmin                      =MAX(Data)-MIN(Data)   Easy to calculate                                                  Sensitive to extreme data values
Variance (s²)                   Σ(xi − x̄)² / (n − 1)             =VAR(Data)             Plays a key role in mathematical statistics                        Non-intuitive meaning
Standard deviation (s)          sqrt[Σ(xi − x̄)² / (n − 1)]       =STDEV(Data)           Uses same units as the raw data ($, £, ¥, etc.)                    Non-intuitive meaning
Coefficient of variation (CV)   100 × (s / x̄)                    None                   Measures relative variation in percent so can compare data sets   Requires non-negative data
Mean absolute deviation (MAD)   Σ|xi − x̄| / n                    =AVEDEV(Data)          Easy to understand                                                 Lacks "nice" theoretical properties
EXCEL WORKOUT CONTD...
(Same marks data as above, in cells B4:B16)
Statistic                  Formula                        Excel Formula               Output
Range                      xmax – xmin                    MAX(B4:B16)-MIN(B4:B16)     71
Variance                   Σ(xi − x̄)² / (n − 1)           VAR(B4:B16)                 493.2692
Standard Deviation         sqrt[Σ(xi − x̄)² / (n − 1)]     STDEV(B4:B16)               22.20967
Quartile Range             QR = Q3 – Q1                   F9-F7                       12
Semi-Quartile Range        SIQR = (Q3 – Q1)/2             (F9-F7)/2                   6
Mean Absolute Deviation    Σ|xi − x̄| / n                  AVEDEV(B4:B16)              16.4142
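The corresponding R checks for the marks data (IQR() gives the quartile range; base R has no AVEDEV, so the mean absolute deviation is written out):
marks <- c(24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95)
diff(range(marks))                  # range = 71
var(marks)                          # 493.2692
sd(marks)                           # 22.20967
IQR(marks)                          # 12  (Q3 - Q1; default quantile type matches Excel here)
mean(abs(marks - mean(marks)))      # mean absolute deviation = 16.41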
SHAPE
• A fundamental task in many statistical analyses is to characterize the
location and variability of a data set
• A further characterization of the data includes the idea of skewness
and kurtosis
SKEWNESS
• Skewness - Measure of the degree of asymmetry of a distribution
• Excel uses Fisher's measure of skewness: skew = SKEW(B4:B16) = 0.4410
• This value can be compared with a lower and upper critical value to see if 0.4410 suggests a significantly skewed distribution: Cri = ±2 × sqrt(6/N)
• In this example, the lower/upper limit is ±1.36 and 0.4410 lies between −1.36 and +1.36. The value of 0.4410 suggests the data values are not significantly skewed
KURTOSIS
• Kurtosis - Measure of whether the data are peaked or flat relative to a normal distribution
• Excel uses Fisher's measure of kurtosis: kurtosis = KURT(B4:B16) = −0.4253
• This value can be compared with a lower and upper critical value to see if −0.4253 suggests a significantly 'peaked' distribution: Cri = ±2 × sqrt(24/N)
• Example: the lower/upper limit is ±2.72 and −0.4253 lies between −2.72 and +2.72. The value of −0.4253 suggests the data values do not have a significant kurtosis problem
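Base R has no skewness or kurtosis function; the psych package (used later in this unit) provides skew() and kurtosi(). Note these use moment-based formulas, so the values differ slightly from Excel's Fisher-adjusted SKEW()/KURT():
# install.packages("psych")   # one-time, if needed
library(psych)
marks <- c(24, 27, 36, 48, 52, 52, 53, 53, 59, 60, 85, 90, 95)
skew(marks)      # positive: distribution is skewed to the right
kurtosi(marks)   # negative: flatter than a normal distribution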
SUMMARY – UNIVARIATE
ANALYSIS
Descriptive Statistics
• Descriptive statistics - Summarizes data by providing insights into the information contained in
the data
• Choice of summary statistics - depends on the type of variable (interval/ratio, ordinal, and
nominal data) being examined
• Describing/ examining data - Measures of location, variation, and shape
• Measuring Location - Performed through Measures of central tendency - Refers to value at the
center of the distribution - Common statistics include mean, median besides few more
• Measuring Variation - Performed through Measures of dispersion - Measures of how far the data
points lie from one another - Common statistics include standard deviation & coefficient of
variation. For data that aren’t normally-distributed, percentiles/ the interquartile range might
be used
• Skewness - Describes the symmetry of a distribution
• Kurtosis – Describes the peakedness of a distribution.
• Best tools to evaluate the shape of data - Histograms & related plots. Values close to 0, suggest
an approximately normal distribution. We can describe data shape as normally-distributed, log-
normal, uniform, skewed, bi-modal, & others
Several user-contributed packages offer functions for descriptive statistics, including Hmisc, pastecs, psych, & doBy. These packages aren't included in the base distribution, hence they need to be installed (and loaded with library()) before use
1. Hmisc: - Contains functions useful for data analysis, high level graphics,
utility operations, functions for computing sample size & importing
datasets etc.
describe(mtcars)
• Prints a concise statistical summary (no. of variables & observations,
the no. of missing & unique values, the mean, quantiles, & 5 highest
& lowest values)
2. doBy: - Provides functions for descriptive statistics by group. It contains
function called summaryBy()
summaryBy(mpg+hp+wt~am,data=mtcars,FUN=mean)
summaryBy(cyl+gear~mpg,data=mtcars,FUN=summary)
– Categorizes observations on am data item & calculates mean of
mpg, hp & wt data items
– Variables on the left side of the ~ are the numeric variables to be
analysed, & variables on the right side are categorical/ grouping
variables
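A minimal sketch of installing and loading these packages before calling the functions described above and below:
install.packages(c("Hmisc", "doBy", "pastecs", "psych"))    # one-time
library(Hmisc)
library(doBy)
describe(mtcars)                                             # Hmisc: concise summary of every variable
summaryBy(mpg + hp + wt ~ am, data = mtcars, FUN = mean)     # doBy: group means by transmission type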
3. pastecs: - Includes stat.desc() that provides wide range of descriptive
statistics -
stat.desc(mtcars$mpg,basic=TRUE,desc=TRUE,norm=TRUE,p=0.95)
basic=TRUE (default) – displays no. of values, null values, missing
values, minimum, maximum, range (max-min) & sum of all non-missing
values
desc=TRUE (default) – displays median, mean, standard error of the
mean, 95% confidence interval for the mean, variance, standard
deviation, & coefficient of variation
norm=TRUE (not the default) – normal distribution statistics are
displayed, including skewness & kurtosis
p-value option is used to calculate the confidence interval for the
mean
o stat.desc() displays values using scientific notation by default
stat.desc(mtcars$mpg, basic = TRUE, desc = TRUE, p = 0.95)
o With the options() function the values remain the same, but their display becomes simpler (non-scientific notation, fewer digits)
options(scipen = 100)   # convert to non-scientific notation
options(digits = 2)
stat.desc(mtcars$wt, basic = TRUE, desc = TRUE, p = 0.95)
• A positive skewness indicates that the size of the right-
handed tail is larger than the left-handed tail
• A negative skewness indicates that the left-hand tail will
typically be longer than the right-hand tail
• The rule of thumb for Skewness
• If the skewness is between -0.5 and 0.5, the data are fairly
symmetrical
• If the skewness is between -1 and – 0.5 or between 0.5 and 1, the
data are moderately skewed
• If the skewness is less than -1 or greater than 1, the data are highly
skewed
• Kurtosis decreases as the tails become lighter & It
increases as the tails become heavier
• If the kurtosis is close to 0, then a normal distribution is often assumed. These
are called mesokurtic distributions
• If the kurtosis is less than zero, then the distribution is light tails and is called a
platykurtic distribution
• If the kurtosis is greater than zero, then the distribution has heavier tails and is
called a leptokurtic distribution
4. Summarize: – One more version of summarize for producing
stratified summary statistics
• Summarize the mean of the variable mpg for the unique cyl (4,6, &
8) values
summarize(X = mtcars$mpg, by = mtcars$cyl, FUN = mean)
• Summarize the mean of the variable mpg for the unique cyl (4,6, &
8) & gear (3,4, & 5) values
summarize(X = mtcars$mpg, by = llist(mtcars$cyl,mtcars$gear), FUN =
mean)
• Summarizes (minimum, maximum, mean, median, 1st quantile, 3rd
quantile) the variable mpg for the unique cyl (4,6, & 8) & gear (3,4,
& 5) values
summarize(X=mtcars$mpg,by=llist(mtcars$cyl,mtcars$gear),summary)
5. Psych: - Also offers functions for descriptive statistics. It contains
describe() function that provides the number of non-missing
observations, mean, standard deviation, median, trimmed
mean, median absolute deviation, minimum, maximum, range,
skew, kurtosis, & standard error of the mean
myvars = c("mpg", "hp", "wt")
describe(mtcars[myvars])
• Displays number of non-missing observations, mean, standard
deviation, median, trimmed mean, median absolute deviation,
minimum, maximum, range, skew, kurtosis, & standard error of the
mean )
• Similar to describe(). Stratifies data (mpg,hp,wt) by am (am=0 &
am=1)
describeBy(mtcars[myvars],list(am=mtcars$am))
describeBy(iris,list(iris$Species)) #iris dataset
Note: - Packages Hmisc and psych both provide a describe() function. How does R know which one to use? Simply, the package loaded last masks the function of the one loaded earlier. If you want the Hmisc version, type Hmisc::describe(mtcars)
6. aggregate() - Descriptive Statistics by Groups – Focus is usually
on descriptive statistics of each group, rather than the total
sample. Obtains descriptive statistics by group. Will not work if
the dataset has NOT been loaded using attach()
#List the mileages based on no. of cylinders (4,6,8)
aggregate(mtcars$mpg,by=list(Cylinder=mtcars$cyl),mean)
#List the mileage, horsepower, weight based on no. of cylinders (4,6,8)
myvars = c('mpg','hp','wt')
aggregate(mtcars[myvars],by=list(Cylinder=mtcars$cyl),mean)
#List all variables of the dataset based on no. of cylinders (4,6,8)
aggregate(mtcars,by=list(Cylinder=mtcars$cyl),mean)
#List the mileage based on no. of cylinders (4,6,8), no. of gears (3,4,5) & type of engine
(am=0,am=1)
aggregate(mtcars$mpg,by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mtcars$cyl),me
an)
#List the mpg,hp & wt based on no. of cylinders (4,6,8), no. of gears (3,4,5) & type of engine
(am=0,am=1)
aggregate(mtcars[myvars],by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mtcars$cyl),
mean)
#List all variables of the dataset based on no. of cylinders (4,6,8), no. of gears (3,4,5) & type of engine (am=0,am=1)
aggregate(mtcars,by=list(AM=mtcars$am,Gear=mtcars$gear,Cylinder=mtcars$cyl),mean)
#Shows the cylinder-wise (cyl) aggregated mean of mpg
aggregate(formula=mpg~cyl, data = mtcars, FUN=mean)
#Shows the cylinder-wise (cyl) standard deviation of mpg
aggregate(formula=mpg~cyl, data=mtcars, FUN=sd)
#Shows the gear-wise standard deviation of wt
aggregate(wt~gear,data= mtcars, FUN = sd)
Caselet 1 - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel
consumption and 10 aspects of automobile design and performance for 32 automobiles
(1973–74 models). The description of the 11 numeric variables with the 32 observations in
the data frame are as follows:
1.[,1] mpg – Miles/ (US) gallon
2.[,2] cyl – Number of Cylinders
3.[,3] disp – Displacement (cu.in.)
4.[,4] hp – Gross Horsepower
5.[,5] drat – Rear Axle Ratio
6.[,6] wt – Weight (1000 lbs)
7.[,7] qsec – ¼ Mile Time
8.[,8] vs – V/S Engine Shape
9.[,9] am –Transmission (0=automatic, 1=manual)
10.[,10] gear – Number of Forward Gears
11.[,11] carb – Number of Carburetors
Prepare a managerial report.
1. stat.desc(mtcars$mpg,basic=TRUE,desc=TRUE,norm=TRUE,p=0.95)
2. mt <- mtcars[c("mpg", "hp", "wt", "am")]
summary(mt)
stat.desc(mt,basic = TRUE,desc = TRUE,norm = TRUE,p=0.95)
3. vars <- c("mpg", "hp", "wt")
head(mtcars[vars])
summary(mtcars[vars]) #Provides minimum, maximum, quartiles, mean & median for
numerical variables
stat.desc(mtcars[vars])
stat.desc(mtcars[vars],norm=TRUE) #Surprisingly, the base installation doesn’t provide
functions for skew and kurtosis, but you can add your own
Interpretation
For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0. The distribution is skewed to the right (+0.61) and somewhat flatter than a normal distribution (–0.37). The majority of the automobiles considered in the Motor Trend Car Road Tests (mtcars) dataset have lower mileage than the mean, while a few give much higher mpg.
Caselet 2 – Cars93 Dataset (In-built)
The Cars93 data frame has data of 93 Cars on Sale in the USA in 1993 arranged in 93 rows and 27 columns.
The description of the variables are in the data set are as follows:
1. Manufacturer - Manufacturer
2. Model - Model
3. Type – A factor with levels “Small", "Sporty", "Compact", "Midsize", "Large“ & "Van"
4. Min.Price – Minimum Price ($1000): Price for a basic version
5. Price – Mid-range Price ($1000): Average of Min.Price & Max.Price
6. Max.Price – Maximum Price ($1000): Price for a premium version
7. MPG.city – City MPG (miles per US gallon by EPA rating)
8. MPG.highway – Highway MPG
9. AirBags – Air Bags standard. Factor: none, driver only, or driver & passenger
10. DriveTrain – Drive train type: rear wheel, front wheel or 4WD; (factor)
11. Cylinders – No. of cylinders (missing for Mazda RX-7, which has a rotary engine)
12. EngineSize - Engine size (litres)
13. Horsepower - Horsepower (maximum)
14. RPM - RPM (revs per minute at maximum horsepower)
15. Rev.per.mile - Engine revolutions per mile (in highest gear)
16. Man.trans.avail - Is a manual transmission version available? (yes or no, Factor)
17. Fuel.tank.capacity - Fuel tank capacity (US gallons)
18. Passengers - Passenger capacity (persons)
19. Length - Length (inches)
20. Wheelbase - Wheelbase (inches)
21. Width - Width (inches)
22. Turn.circle - U-turn space (feet)
23. Rear.seat.room - Rear seat room (inches) (missing for 2-seater vehicles)
24. Luggage.room - Luggage capacity (cubic feet) (missing for vans)
25. Weight - Weight (pounds)
26. Origin - Of non-USA or USA company origins? (factor)
27. Make - Combination of Manufacturer and Model (character)
Assignment
1. Load the data set Cars93 with data(Cars93,package=“MASS”) and set randomly any
5 observations in the variables Horsepower and Weight to NA (missing values)
2. Calculate the arithmetic mean & the median of the variables Horsepower and
Weight
3. Calculate the standard deviation and the interquartile range of the variable Price
1. Load the data and set missing values to NA
data(Cars93, package="MASS")
Cars93[sample(1:nrow(Cars93), 5), c("Horsepower","Weight")] <- NA
2. Calculate mean & median
c(mean(Cars93$Horsepower, na.rm=TRUE), median(Cars93$Horsepower, na.rm=TRUE))
c(mean(Cars93$Weight, na.rm=TRUE), median(Cars93$Weight, na.rm=TRUE))
3. Calculate standard deviation & inter-quartile range
c(sd(Cars93$Price),IQR(Cars93$Price)) # no missing values
CASE STUDY
THE IMPORTANCE OF FOOD & NUTRITION
• Obesity Trends Among US Adults - USDA
Source: file:///C:/Users/npm/Desktop/IPE%20-%20Class%20of%202019%20-%20Reference%20Material/R%202019/R%20-%20MITOpenCourseware%20-
%2010.07.2019/contents/an-introduction-to-analytics/working-with-data-an-introduction-to-r/video-1-why-r/index.htm
More than 35% of US adults are obese. Obesity-related conditions are some of
the leading causes of preventable death (heart disease, stroke, type II
diabetes). Worldwide, obesity has nearly doubled since 1980. 65% of the world’s
population lives in countries where overweight and obesity kills more people
than underweight.
Good nutrition is essential for a person’s overall health and well-being, and is
now more important than ever. Hundreds of nutrition and weight-loss
applications. 15% of adults with cell phones use health applications on their
devices. These apps are powered by the USDA Food Database.
The United States Department of Agriculture distributes a database of nutritional
information for over 7,000 different food items. Used as the foundation for most
food and nutrient databases in the US. Includes information about all nutrients.
Calories, carbs, protein, fat, sodium, . . .
UNDERSTANDING FOOD NUTRITIONAL EDUCATION WITH DATA
1. Lets learn about the structure of the dataset – str(USDA)
• There are 7058 observations collected on 16 variables
• Variables include: ID – Unique Identification no. starting with 1001; Description – Text description of
each food item studied; Calories – Amount of calories in 100 grams of food, measured in kilo calories;
Protein, TotalFat, Carbohydrate, SaturatedFat, Sugar – measured in grams; Sodium, Cholesterol,
Calcium, Iron, Potassium, VitaminC – measured in milli grams; VitaminE & VitaminD – measured in
standard national units
2. Obtain high level statistical information – summary(USDA)
• Displays min, max, mean, median, first & third quartile values
• Example: - Maximum amount of Cholesterol is 3100 mg while the mean is only 41.5 mg
• Information about NA’s is also available. 1910 NAs are recorded for the variable Sugar
• Startling fact – Maximum Sodium levels are recorded as 38758 mg, which is huge against
the daily recommended levels of 2300 mg
3. Investigate the food which has high Sodium level recorded as 38758 mg –
which.max(USDA$Sodium) – Gives index of the highest Sodium level
• 265th food in the dataset has the highest Sodium level
• To know which food is at the 265th index – USDA$Description[265] – SALT, TABLE
• Create a new dataframe HighSodium which consists of foods containing more than 10000
mg of Sodium – HighSodium = subset(USDA, Sodium>10000)
• How many foods contain more than 10000 mg of Sodium – nrow(HighSodium) – Output is 10
foods
• Display the names of the foods in the HighSodium dataframe – HighSodium$Description
4. Our understanding is that CAVIAR has high levels of Sodium, but it doesn't appear in the list. Let's find how much Sodium it has per 100 gm – match("CAVIAR", USDA$Description)
• CAVIAR's index number is 4154
• Find Sodium levels of CAVIAR – USDA$Sodium[4154]
• To find the level of Sodium in CAVIAR in one step – USDA$Sodium[match("CAVIAR", USDA$Description)]
• Output: - 1500
• Check whether the Sodium level of 1500 for CAVIAR is high or low – Best way to draw
comparison of CAVIAR with other foods is to check Mean & Standard deviation across the
dataset – summary(USDA$Sodium), sd(USDA$Sodium, na.rm=TRUE)
• If we sum up the mean and standard deviation (322.1 + 1045.4 = 1367.5) we notice that 1500 is higher than this value
• CAVIAR is therefore pretty rich in Sodium in comparison to other foods in the dataset
5. Add new variable (HighSodium) to USDA - takes a value 1 if food has higher Sodium than
average & 0 if food has lower Sodium than average
• Check if the first food has higher level of Sodium than average – USDA$Sodium[1] >
mean(USDA$Sodium, na.rm=TRUE)
• Check the same for all the foods & store the output to the vector HighSodium - HighSodium=
USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE)
• Display the structure of the vector HighSodium which stores logical values - str(HighSodium)
• Change the logical values of HighSodium vector to 0's and 1's – HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
• Examining its structure (values are represented in 0's & 1's now) – str(HighSodium)
• Add vector HighSodium to the dataframe USDA - USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
• Examining structure of USDA – str(USDA)
6. Add new variables (HighProtein, HighFat, & HighCarbs) to the USDA dataframe which takes a
value 1 if the food has higher levels than average and 0 if the food has lower levels than
average
• USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
• USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
• USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
• str(USDA)
7. To figure out how many foods have higher sodium level than average
• Count the foods that have sodium level values as 1 - table(USDA$HighSodium)
Most of the food in the dataset have lower sodium (4884) while 2090 foods have higher sodium
levels
• High Sodium and High Fat - table(USDA$HighSodium, USDA$HighFat)
Rows belong to first input HighSodium & columns refer to second input HighFat. 3529 foods have
low sodium & low fat; 1355 foods have low sodium & high fat, 1378 foods have high sodium &
low fat; 712 foods have high sodium & high fat
8. Compute average amount of iron sorted by high & low protein– tapply(USDA$Iron, USDA$HighProtein,
mean, na.rm=TRUE)
• Foods with low protein content have 2.55mg of iron whereas foods with high protein content have 3.19mg of iron
• Compute maximum level of Vitamin C in foods with high & low carbs - tapply(USDA$VitaminC,
USDA$HighCarbs, max, na.rm=TRUE)
• Maximum Vitamin C (2400 mg) is actually present in a food that is high in carbs
• Is it true that foods that are high in carbs have maximum Vitamin C levels? - tapply(USDA$VitaminC,
USDA$HighCarbs, summary, na.rm=TRUE)
• On an average 6.36 mg of Vitamin c in foods with low carb levels & on an average16.31 mg of Vitamin c in foods
with high carb levels. So, it does seem like a general trend that foods with high carb levels on average are richer in
Vitamin C than foods with low carb levels
9. Let us create a scatter plot with Protein on the x-axis & TotalFat on the y-axis – plot(USDA$Protein, USDA$TotalFat)
• It looks like foods higher in Protein are lower in Fats & vice versa
• We can add more aesthetics to the graph by adding arguments to the plot function: plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab="Total Fat", main="Protein vs. Fat", col="red")
10. Another way to visualize data is to plot a histogram – hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C Levels")
• Even though the maximum Vitamin C level is about 2400 mg, most of the foods (more than 6000 of them) have less than 200 mg of Vitamin C, and the histogram plots all of them in one cell
• Additional information - hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C Levels", xlim = c(0, 100), breaks = 2000)

file:///C:/Users/npm/Desktop/IPE%20-%20Class%20of%202019%20-
%20Reference%20Material/R%202019/R%20-%20MITOpenCourseware%20-
%2010.07.2019/contents/an-introduction-to-analytics/understanding-food-nutritional-education-
with-data-recitation/video-4-creating-plots-in-r/index.htm
CASE STUDY
AN ANALYTICAL DETECTIVE
• Motor Vehicle Theft crimes - City of Chicago, Illinois, United States

Source: file:///C:/Users/npm/Desktop/IPE%20-%20Class%20of%202019%20-%20Reference%20Material/R%202019/R%20-%20MITOpenCourseware%20-
%2010.07.2019/contents/an-introduction-to-analytics/assignment-1/index.htm
Crime is an international concern, but it is documented & handled in very different ways
in different countries. In the United States, violent crimes & property crimes are recorded
by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, &
some cities release data regarding crime rates. There are two main types of crimes:
violent crimes, and property crimes. In this problem, we'll focus on one specific type of
property crime, called "motor vehicle theft" (sometimes referred to as grand theft auto).
This is the act of stealing, or attempting to steal, a car.
Chicago is the third most populous city in the United States, with a population of over 2.7
million people. In this problem, we'll use some basic data analysis in R to understand the
motor vehicle thefts in Chicago.
Variables Description: ID: a unique identifier for each observation; Date: the date the crime occurred;
LocationDescription: the location where the crime occurred; Arrest: whether or not an arrest was made for
the crime (TRUE if an arrest was made, & FALSE if an arrest was not made); Domestic: whether or not the
crime was a domestic crime, meaning that it was committed against a family member (TRUE if it was
domestic, and FALSE if it was not domestic); Beat: the area, or "beat" in which the crime occurred. This is the
smallest regional division defined by the Chicago police department; District: the police district in which the
crime occured. Each district is composed of many beats, and are defined by the Chicago Police
Department; CommunityArea: the community area in which the crime occurred. Since the 1920s, Chicago
has been divided into what are called "community areas", of which there are now 77. The community areas
were devised in an attempt to create socially homogeneous regions; Year: the year in which the crime
occurred; Latitude: the latitude of the location at which the crime occurred; Longitude: the longitude of the
location at which the crime occurred.
1. Read the dataset into R. It may take a few minutes to read in the data, as the dataset
is pretty large.
2. How many rows of data (observations) are in this dataset? How many variables are
there in the dataset? (Hint: Use str() & summary() functions to answer the following
questions.)
3. What is the maximum value of the variable "ID"? (Hint: Use max() function)
4. What is the minimum value of the variable "Beat"?
5. How many observations have value TRUE in the Arrest variable (this is the number of
crimes for which an arrest was made)?
6. How many observations have a LocationDescription value of ALLEY?
7. Chicago Police Department wish to increase the number of arrests made for motor
vehicle thefts in 5 locations. Which locations should police focus (excluding others)?
8. How many observations are in Top5? Create a subset of your data, only taking
observations for which the theft happened in one of these five locations, and call this
new data set "Top5". (Hint: Alternately, you could create five different subsets, and
then merge them together into one data frame using rbind.)
9. On which day of the week do the most motor vehicle thefts at gas stations happen?
10.One of the locations has a much higher arrest rate than the other locations. Which is
it?
1. Read the dataset into R. It may take a few minutes to read in the data, as the
dataset is pretty large.
Import Dataset button
2. How many rows of data (observations) are in this dataset? How many variables are
there in the dataset? (Hint: Use str() & summary() functions to answer the following
questions.)
str(cc)
dim(cc)
3. What is the maximum value of the variable "ID"? (Hint: Use max() function)
max(cc$ID)
4. What is the minimum value of the variable "Beat"?
min(cc$Beat)
summary(cc)
5. How many observations have value TRUE in the Arrest variable (this is the number of
crimes for which an arrest was made)?
summary(cc)
6. How many observations have a LocationDescription value of ALLEY?
table(cc$LocationDescription)
match("ALLEY",cc$LocationDescription)
summary(cc)
7. Chicago Police Department wish to increase the number of arrests made for motor vehicle thefts in 5
locations. Which locations should police focus (excluding others)?
sort(table(cc$LocationDescription))
Locations with the largest number of motor vehicle thefts are listed last. These are Street, Parking Lot/Garage (Non. Resid.), Residential Yard (Front/Back), Alley, & Vehicle Non-commercial
8. How many observations are in Top5? Create a subset of your data, only taking observations for which the theft
happened in one of these five locations, and call this new data set "Top5". Alternately, you could create five
different subsets, and then merge them together into one data frame using rbind.
Top5 = subset(cc, LocationDescription=="STREET" | LocationDescription=="PARKING
LOT/GARAGE(NON.RESID.)" | LocationDescription=="RESIDENTIAL YARD (FRONT/BACK)"|
LocationDescription=="ALLEY" | LocationDescription=="VEHICLE NON-COMMERCIAL")
str(Top5)
TopLocations = c("STREET", "PARKING LOT/GARAGE(NON.RESID.)", "RESIDENTIAL YARD (FRONT/BACK)", "ALLEY", "VEHICLE NON-COMMERCIAL")
Top5 = subset(cc, LocationDescription %in% TopLocations)
REVIEW QUESTIONS
1. What are the different data types in R?
2. What are the different data structures in R?
3. How do I access data within the various data structures?
PROBABILITY
DISTRIBUTIONS
Introduction; Random variable – Discrete and Continuous
Variable; Types of Probability Distributions - Binomial,
Poisson, Exponential and Normal Distributions; Applications
Example: - On tossing a coin three times, what is the probability distribution for
no. of heads?
The possible results are: 0, 1, 2, and 3
Random variable - a quantity
resulting from an experiment
that, by chance, can assume
different values
Frequency Distribution - Listing of the observed frequencies of all the outcomes of an experiment that actually occurred when the experiment was done
Probability Distribution - Listing of the probabilities of all the possible outcomes that could result if the experiment were done
EXPERIMENT, RANDOM VARIABLE, AND RANGE OF THE RANDOM VARIABLE
Experiment                          Random Variable                                            Range of Random Variable
Stock 50 Christmas trees            Number of Christmas trees sold (X)                         0, 1, 2, …, 50
Inspect 600 items                   Number of acceptable items (Y)                             0, 1, 2, …, 600
Send out 5,000 sales letters        Number of people responding to the letters (Z)             0, 1, 2, …, 5,000
Build an apartment building         Percent of building completed after 4 months (R)           0 ≤ R ≤ 100
Test the lifetime of a lightbulb    Length of time the bulb lasts, up to 80,000 minutes (S)    0 ≤ S ≤ 80,000
Discrete random variable Continuous random variable
 counts occurrences  measures (e.g.: height, weight, speed,
 has a countable number of possible value, duration, length)
values  has an uncountable infinite number of
 has discrete jumps between successive possible values
values  moves continuously from value to value
 Integers are Discrete  Real Numbers are Continuous
 probability is height  probability is area
Example (discrete): Binomial distribution, n = 3, p = .5
x      P(x)
0      0.125
1      0.375
2      0.375
3      0.125
Total  1.000
Example (continuous): Minutes to complete a task - the shaded area under the density curve represents the probability that the task takes between 2 & 3 minutes
Theoretical Distributions
• Mathematical models – used to deduce mathematically what
the distribution of expected values should be based on the
basis of previous experiences or theoretical considerations
• There are 2 types of theoretical distributions:
• Discrete Theoretical distributions
• Continuous Theoretical distributions
Discrete Distributions              Continuous Distributions
1. Binomial Distribution            1. Normal Distribution
2. Poisson Distribution             2. Exponential Distribution
3. Geometric Distribution           3. Student's t-Distribution
4. Hyper-Geometric Distribution     4. Chi-square Distribution
                                    5. F-Distribution
BINOMIAL DISTRIBUTION
• Also called the "Bernoulli Distribution", since it is built from repeated Bernoulli trials
• The binomial distribution is used to find the probability of a specific number of successes out of n trials
Pre-requisites/ Pre-conditions
1. Discrete data
2. Dichotomy exists (only 2 outcomes)
3. Probability of success is the same for all trials
4. Trials are statistically independent
5. Number of trials is a positive integer
Probability of r successes in n trials:
  P(r) = [n! / (r!(n − r)!)] × p^r × q^(n−r)
Parameters: n = number of trials; p = probability of success; q = 1 − p
Excel function: =BINOMDIST(k, n, p, 0)
Mean/ Expected value: μ = np
Std. Dev.: σ = sqrt(npq)
Bernoulli Experiment examples:
Experiment                    Possible Outcomes                                        Probability of "Success"
Flip a coin                   1 = heads, 0 = tails                                     p = .50
Inspect a jet turbine blade   1 = crack found, 0 = no crack found                      p = .001
Purchase a tank of gas        1 = pay by credit card, 0 = do not pay by credit card    p = .78
Pat is taking statistics course. As Pat is weak in statistics, his exam
strategy is to rely on luck for the ROTe exam. ROTe consists of 10
multiple-choice questions. Each question has five possible
answers, only one of which is correct. Pat plans to guess the
answer to each question. What is the probability that Pat gets
1. No answers correct?
2. Two answers correct?
3. Pat fails the quiz.
Given n=10, and P(success) = 1/5 = .20
1. What is the probability that Pat gets no answers correct?
Pat has about an 11% chance of getting no answers correct using the
guessing strategy
2. What is the probability that Pat gets two answers correct?
Pat has about 30% chance of getting exactly 2 answers correct using the
guessing strategy
3. Find the probability that Pat fails the quiz
If grade on the quiz is less than 50% (i.e. 5 questions out of 10), that’s
considered failed quiz.
P(X ≤ 4) = P(0) + P(1) + P(2) + P(3) + P(4)
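These three answers can be verified in R with the built-in binomial functions (dbinom for exact counts, pbinom for cumulative probabilities):
dbinom(0, size = 10, prob = 0.2)   # P(X = 0)  = 0.107  -> about 11%
dbinom(2, size = 10, prob = 0.2)   # P(X = 2)  = 0.302  -> about 30%
pbinom(4, size = 10, prob = 0.2)   # P(X <= 4) = 0.967  -> probability Pat fails the quiz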
MS EXCEL
=BINOMDIST(2, 10, 0.2, FALSE) - the arguments are # successes, # trials, P(success), and cumulative (i.e. P(X≤x)?)
P(X=2) = .3020
MSA Electronics is experimenting with the manufacture of a new
transistor. Every hour a sample of 5 transistors is taken. The
probability of one transistor being defective is 0.15.
What is the probability of finding 3, 4, or 5 defective? Also find the
expected value (or mean) & variance of a binomial distribution.
Given n = 5, p = 0.15, & r = 3, 4, or 5
P(3 or more defects) = P(3) + P(4) + P(5)
                     = 0.0244 + 0.0022 + 0.0001
                     = 0.0267
Expected value = np = 5(0.15) = 0.75
Variance = np(1 − p) = 5(0.15)(0.85) = 0.6375
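The same result in R, summing the exact binomial probabilities for 3, 4 and 5 defectives:
sum(dbinom(3:5, size = 5, prob = 0.15))   # P(3 or more defects) = 0.0266
5 * 0.15                                  # mean = 0.75
5 * 0.15 * 0.85                           # variance = 0.6375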
On average, 20% of the emergency room patients at Greenwood
General Hospital lack health insurance. In a random sample of 4
patients, what is the probability that at least 2 will be uninsured?
1. What is the mean and standard deviation of this binomial distribution?
2. What is the probability that the sample of 4 patients will contain at least 2 uninsured
patients?
3. What is the probability that fewer than 2 patients are uninsured?
4. What is the probability that no more than 2 patients are uninsured?
Given: P(uninsured) = p = 0.20; n = 4 patients; r = 0, 1, 2, 3, 4 patients
Mean: μ = np = 4(0.20) = 0.8 patients
Standard deviation: σ = sqrt(np(1 − p)) = sqrt(4 × 0.20 × 0.80) = 0.8 patients
PDF formula calculations                                               PDF      Excel formula
P(0) = 4!/(0!(4−0)!) × (.2)^0 × (1−.2)^(4−0) = 1 × .2^0 × .8^4        .4096    =BINOMDIST(0,4,.2,0)
P(1) = 4!/(1!(4−1)!) × (.2)^1 × (1−.2)^(4−1) = 4 × .2^1 × .8^3        .4096    =BINOMDIST(1,4,.2,0)
P(2) = 4!/(2!(4−2)!) × (.2)^2 × (1−.2)^(4−2) = 6 × .2^2 × .8^2        .1536    =BINOMDIST(2,4,.2,0)
P(3) = 4!/(3!(4−3)!) × (.2)^3 × (1−.2)^(4−3) = 4 × .2^3 × .8^1        .0256    =BINOMDIST(3,4,.2,0)
P(4) = 4!/(4!(4−4)!) × (.2)^4 × (1−.2)^(4−4) = 1 × .2^4 × .8^0        .0016    =BINOMDIST(4,4,.2,0)
What is the probability that the sample of 4 patients will contain at
least 2 uninsured patients?
P(X  2) = P(2) + P(3) + P(4)
= .1536+.0256+.0016
= .1808
What is the probability that fewer than 2 patients are uninsured?
P(X < 2) = P(0) + P(1)
         = .4096 + .4096
         = .8192
What is the probability that no more than 2 patients are uninsured?
P(X ≤ 2) = P(0) + P(1) + P(2)
         = .4096 + .4096 + .1536
         = .9728
Applications
• Applicable only when samples are chosen from
• an infinite population
• with replacement so that the success probability remains
the same during all the trials
• Example: - Acceptance or rejection of the lot for defective
products based on the sample meeting or not meeting the
quality standards
Case Study
HOW MANY LOGIC ANALYZERS TO SCHEDULE FOR MANUFACTURING
You pay close attention to quality in your production facilities, but
the logic analyzers you make are so complex that there are still
some failures. In fact, based on past experience, about 95% of the
finished products are in good working order. Today you will have
to ship 16 of these machines.
How many should you schedule for production to be reasonably
certain that 16 working logic analyzers will be shipped? What if
you schedule n=16 units for production, What is the probability
that at-least 16 working analyzers will be shipped?
Given: n = 16; p = 0.95; r = 16, so P(X = 16) = 0.95^16 = 0.440
Thus, if you schedule the same number (16) that you need to ship, you will be taking a big chance! There is only a 44% chance that you will meet the order, & a 56% chance that you will fail to ship the entire order in working condition
If instead you schedule n = 20 units, P(X ≥ 16) = P(X=16) + P(X=17) + P(X=18) + P(X=19) + P(X=20) = 0.013 + 0.060 + 0.189 + 0.377 + 0.358 = 0.997
So, if you schedule 20 for production, you have better than a 99% chance of shipping 16 good machines. It looks very likely, but you would still be taking a small (under 1%) chance of failure.
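A quick R check of both scheduling scenarios (upper-tail binomial probabilities):
dbinom(16, size = 16, prob = 0.95)           # schedule 16: P(all 16 work) = 0.440
sum(dbinom(16:20, size = 20, prob = 0.95))   # schedule 20: P(at least 16 work) = 0.997
1 - pbinom(15, size = 20, prob = 0.95)       # same upper-tail probability, via the CDF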
POISSON DISTRIBUTION
Poisson distribution describes the number of occurrences within
a randomly chosen unit of time or space
Characteristics of Poisson Distribution
1. Discrete data
2. Dichotomy exists
3. Events are statistically independent
4. Applicable in those cases where n is very large (n ≥ 20) and probability
of success p is very small (p ≤ 0.05) but the mean np = λ is finite
PDF: P(x) = (λ^x × e^(−λ)) / x!
  where P(x) = probability of exactly x arrivals/occurrences
        λ = mean (average) number of arrivals per unit of time
        e = 2.718, the base of natural logarithms
        x = number of occurrences (desired event)
Mean: λ
St. Dev.: sqrt(λ)
Example: A statistics instructor has observed that the number of typographical errors in new editions of textbooks varies considerably from book to book.
1. After some analysis he concludes that the number of errors is Poisson distributed with a mean of 1.5 per 100 pages. The instructor randomly selects 100 pages of a new book. What is the probability that there are no typos?
   P(X = 0) = e^(−1.5) = 0.2231, so there is about a 22% chance of finding zero errors
2. Suppose the instructor has just received a copy of a new statistics book. He notices that there are 400 pages, so the expected number of errors for the whole book is μ = 1.5 × 4 = 6.
   a) What is the probability that there are no typos?
      P(X = 0) = e^(−6) = 0.0025, so there is a very small chance there are no typos
   b) What is the probability that there are five or fewer typos?
      P(X ≤ 5) = P(0) + P(1) + … + P(5); for μ = 6, P(X ≤ 5) = 0.446, so there is about a 45% chance there are 5 or fewer typos
MS EXCEL: =POISSON(x, mean, cumulative) gives these Poisson probabilities
We are investigating the safety of dangerous intersection.
Past police records indicate the mean of 5 accidents per
month at the intersection. The number of accidents is
distributed according to a Poisson distribution, & the
Highway Safety Division wants us to Calculate the
probability in any month of exactly 0,1,2,3, or 4 accidents.
Highway Safety Division will take action to improve the
intersection if probability of more than 3 accidents per
month exceeds 0.65. Should they act?
Calculate the probability of having 0,1,2 or 3 accidents and then subtract
the sum from 1.0 to get the probability for more than 3 accidents
P(3 or fewer) = P(0) + P(1) + P(2) + P(3) = 0.00674 + 0.03369 + 0.08422 + 0.14037
              = 0.26503
Probability of more than 3 accidents: P(more than 3) = 1 - 0.26503
              = 0.73497
As 0.73497 exceeds 0.65, steps should be taken to improve the intersection
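The same calculation with ppois() in R:
ppois(3, lambda = 5)       # P(3 or fewer accidents)  = 0.265
1 - ppois(3, lambda = 5)   # P(more than 3 accidents) = 0.735 > 0.65, so the Division should act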
Baggage is rarely lost by Northwest Airlines. Suppose a
random sample of 1,000 flights shows a total of 300 bags
were lost. Thus, the arithmetic mean number of lost bags per
flight is 0.3 (300/1,000). If the number of lost bags per flight
follows a Poisson distribution with μ = 0.3, find the probability of not losing any bags.
On Thursday morning between 9 A.M. and 10 A.M. customers arrive and enter the queue at the Oxnard University Credit Union at a mean rate of 1.7 customers per minute.
• Find the mean and standard deviation
• What is the probability that there are exactly 4 customers in the bank?
PDF: P(x) = (λ^x × e^(−λ)) / x! = ((1.7)^x × e^(−1.7)) / x!
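Both questions can be answered with dpois(); the mean and standard deviation of a Poisson variable are λ and sqrt(λ):
dpois(0, lambda = 0.3)   # Northwest: P(no lost bags on a flight) = 0.741
dpois(4, lambda = 1.7)   # Oxnard: P(exactly 4 arrivals in a minute) = 0.064
sqrt(1.7)                # standard deviation of arrivals per minute = 1.30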
Applications
• It is used in quality control statistics to count the no. of defective samples.
• It is used in insurance problems to count the no. of
casualties.
• No. of faulty blades in a packet of 100.
• No. of printing mistakes at each page of the book.
• No. of suicides reported in a particular city.
Note: - Poisson is a good approximation of the binomial
when n ≥ 20 and p ≤ 0.05
Case Study - IV
HOW MANY WARRANTY RETURNS
Your firm's quality is high: you expect only 1.3 of your products to be returned, on average, each day for warranty repairs.
1.What are the chances that no products will be returned
tomorrow?
2.That one will be returned?
3.How about two?
4.How about three?
5.What is the probability that two items or fewer will be returned?
1. P(X=0) = 0.27253
2. P(X=1) = 0.35429
3. P(X=2) = 0.23029
4. P(X=3) = 0.09979
5. P(X ≤ 2) = 0.27253 + 0.35429 + 0.23029 = 0.857 = 85.7%
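Verified in R:
dpois(0:3, lambda = 1.3)   # 0.2725 0.3543 0.2303 0.0998
ppois(2, lambda = 1.3)     # P(two or fewer returns) = 0.857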
NORMAL DISTRIBUTION
o Most popular & useful continuous probability distribution
o Defined by 2 parameters: standard deviation & mean
o Characteristics of Normal Probability Distribution: -
• Curve has a single peak; thus, it is uni-modal
• Mean of a normally distributed population lies at the center of its
normal curve
• For a normal curve, the mean, median & mode are the same value
• 2 tails of the normal probability distribution extend indefinitely & never
touch the horizontal axis
Empirical rule (illustrated with IQ scores, mean 100 and standard deviation 15):
• 68% of the area lies within ±1σ of the mean: IQs between 85 and 115; only 16% of people have IQs above 115
• 95.4% lies within ±2σ: IQs between 70 and 130 (about 2.3% in each tail)
• 99.7% lies within ±3σ: IQs between 55 and 145 (about 0.15% in each tail)
• Standard normal distribution
Normal distribution whose mean
is zero & standard deviation is
one
• Increasing the mean shifts the
curve to the right
• Increasing the standard
deviation “flattens” the curve
• We can use the following
function to convert any normal
random variable to a standard
normal random variable

Standardizing shifts the mean of X to zero and rescales the spread, changing the shape of the curve
Step 1 - Convert the normal distribution into the standard normal distribution
  Z = (X − µ) / σ
where
  X = value of the random variable we want to measure
  µ = mean of the distribution
  σ = standard deviation of the distribution
  Z = number of standard deviations from X to the mean, µ
Step 2 - Use Z to find the required probability from the Normal Distribution Table
The weekly incomes of shift foremen in the glass industry follow the
normal probability distribution with a mean of $1,000 and a
standard deviation of $100. What is the probability of selecting a
shift foreman in the glass industry whose income is
1. Between $1,100 and $900
2. Between $1,150 and $1,250
3. Between $1,000 and $1,100
4. Between $790 and $1,000
5. Less than $790
6. Between $840 and $1,200
(Each part is answered by standardizing the endpoints with Z = (X − µ)/σ and reading the corresponding areas from the normal table)
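In R, pnorm() gives these areas directly, without a Z table (values rounded):
mu <- 1000; s <- 100
pnorm(1100, mu, s) - pnorm(900,  mu, s)   # 1. 0.683
pnorm(1250, mu, s) - pnorm(1150, mu, s)   # 2. 0.061
pnorm(1100, mu, s) - pnorm(1000, mu, s)   # 3. 0.341
pnorm(1000, mu, s) - pnorm(790,  mu, s)   # 4. 0.482
pnorm(790,  mu, s)                        # 5. 0.018
pnorm(1200, mu, s) - pnorm(840,  mu, s)   # 6. 0.923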
Haynes is a construction company into building apartments. Total
construction time follows a normal distribution. For triplexes, µ =
100 days and  = 20 days. Contract calls for completion in 125
days.
1. Late completion will incur a severe penalty fee. What is the probability of
completing in 125 days & more than 125 days?
2. Completion in 75 days or less will earn a bonus of $5,000. What is the
probability of getting the bonus?
1. Z = (X − µ)/σ = (125 − 100)/20 = 1.25
   From the normal distribution table, for Z = 1.25 the area between the mean and X is 0.3944
   P(X ≤ 125 days) = 0.5 + 0.3944 = 0.8944, so there is about an 89% chance that the contract will not incur the penalty
   The probability of completing the triplexes in more than 125 days is 1 − 0.8944 = 0.1056
2. Z = (X − µ)/σ = (75 − 100)/20 = −1.25
   For Z = −1.25 the area between X and the mean is again 0.3944
   P(X < 75 days) = 0.5 − 0.3944 = 0.1056, so there is about a 10% chance that the contract will earn the $5,000 bonus
Layton Tire and Rubber Company wishes to set
a minimum mileage guarantee on its new
MX100 tire. Tests reveal the mean mileage is
67,900 with a standard deviation of 2,050 miles
and that the distribution of miles follows the
normal probability distribution. It wants to set
the minimum guaranteed mileage so that no
more than 4 percent of the tires will have to be
replaced.
What minimum guaranteed mileage should
Layton announce?
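One way this could be answered in R is with qnorm(), which returns the value below which a given proportion of the distribution falls (a sketch, not from the original slides):
qnorm(0.04, mean = 67900, sd = 2050)   # ≈ 64,311 miles: no more than 4% of tires fall below this mileage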
Case Study – THE OPTION VALUE OF AN OIL LEASE
There’s an oil leasing opportunity that looks too good to be true, & it probably is too good to be true: an estimated 1,500,000 barrels of oil sitting underground that can be leased for three years for just $1,300,000. It looks like a golden opportunity: pay just over a million, bring the oil to the surface, sell it at the current spot price of $76.45 per barrel, & retire.
However, upon closer investigation, you come across facts that explain why nobody else has snapped up this “opportunity”. Evidently, it is difficult to remove the oil from the ground due to the geology & the remote location. A careful analysis shows that estimated costs of extracting the oil are a whopping $120,000,000. You conclude that by developing this oil field, you would actually lose money. Oh well.
During the next week, although you are busy investigating other capital investment opportunities, your thoughts keep returning to this particular project. In particular, the fact that the lease is so cheap & that it lasts for 3 years inspires you to do a what-if scenario analysis, recognizing that there is no obligation to extract the oil and that it could be extracted fairly quickly if the price of oil rose enough. But if the price of oil didn’t rise enough, you would let the term of the lease expire in 3 years, leaving the oil still in the ground. You would let the future price of oil determine whether or not to exercise the option to extract the oil.
But such a proposition is risky! How much risk? What are the potential rewards? You have identified the following basic probability structure for the source of uncertainty in this situation:
Future price of oil    Probability
60                     0.10
70                     0.15
80                     0.20
90                     0.30
100                    0.15
110                    0.10

1. How much money would you make if there were no costs of extraction? Would this be enough to retire?
2. Would you indeed lose money if you leased & extracted immediately, considering the costs of extraction? How much money?
3. Continue the scenario analysis by computing the future net payoff implied by each of the future prices of oil. To do this, multiply the price of oil by the number of barrels; then subtract the cost of extraction. If this is negative, you simply won't develop the field, so change negative values to 0 (at this point, do not subtract the lease cost, because we are assuming that it has already been paid).
4. Find the average future net payoff, less the cost of the lease. How much, on average, would you gain/lose by leasing this oil field (you may ignore the time value of money)? A sketch of this calculation appears after the case.
5. How risky is this proposition?
Although the project would lose money if developed
immediately, the uncertainty of the future combined with your
ability to extract or not depending upon future conditions could
make the lease worthwhile. This case involves a careful analysis of
cash flows in five different future scenarios together with a risk and
return analysis of the resulting random variable.
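A minimal R sketch (not part of the original case) of the scenario calculation in questions 3 and 4, using the figures given above and ignoring the time value of money:
price <- c(60, 70, 80, 90, 100, 110)               # possible future prices per barrel
prob  <- c(0.10, 0.15, 0.20, 0.30, 0.15, 0.10)      # their probabilities
payoff <- pmax(price * 1500000 - 120000000, 0)      # extract only if revenue exceeds the $120,000,000 extraction cost
expected_net <- sum(prob * payoff) - 1300000         # average net payoff less the lease cost
expected_net                                         # ≈ 12,200,000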
EXPONENTIAL DISTRIBUTION
o Exponential distribution is a skewed continuous probability distribution. Its rise is vertical at 0, on the left, & it descends gradually, with a long tail on the right
o If events happen independently & randomly with a constant rate over time, the time between successive events follows an exponential distribution: the number of events in any fixed time period is Poisson, & the waiting time between events is exponential
o It is widely used in the analysis of queuing problems
o Examples
• Amount of time taken by a repairman to repair machines
• Time taken by a bank teller in servicing customers
• Length of time between successive breakdowns of manufacturing equipment
o Parameter: λ = mean number of arrivals per unit of time or space
o P(T > t) = e^(−λt);  P(T ≤ t) = 1 − e^(−λt)
o Mean = 1/λ;  Standard Deviation = 1/λ
Customer Arrivals
Suppose customers arrive independently to a car service centre at a constant mean rate of 40 per hour. Find the probability that at least one customer arrives in the next 5 minutes.
Since 40 customers arrive each hour on average, the mean time between arrivals is 1/λ = 1/40 hr = 0.025 hr = 1.5 minutes (i.e., λ = 1/1.5 per minute).
P(X ≤ 5) = 1 − e^(−5/1.5) = 0.964
So, the chances are high (96.4%) that at least one customer will arrive in the next 5 minutes.
If jobs arrive every 15 seconds on average (λ = 4 per minute), what is the probability of waiting less than or equal to 30 seconds, i.e. 0.5 min?
P(T ≤ t) = 1 − e^(−λt)
P(T ≤ 0.5) = 1 − e^(−4 × 0.5) = 1 − e^(−2) = 0.86
Calls arrive at an average rate of 12 per hour (λ = 0.2 per minute). Find the probability that a call will occur in the next 5 minutes given that you have already waited 10 minutes for a call, i.e. find P(T ≤ 15 | T > 10).
By the memoryless property, P(T ≤ 15 | T > 10) = P(T ≤ 5)
P(T ≤ 5) = 1 − e^(−5 × 0.2) = 1 − e^(−1) = 1 − 0.37 = 0.63
At Bluefountain, one of the largest restaurants in the city of Ahmedabad, it takes 10 minutes on average to receive the order after placing it. If the service time is exponentially distributed, find the probability that the customer waiting time is
1. More than 10 minutes
2. 10 minutes or less
3. 3 minutes or less
Given: average service time = 1/λ = 10 minutes, so λ = 1/10 = 0.1 per minute
1. P[T > 10] = e^(−λt) = e^(−0.1 × 10) = e^(−1) = 0.368
2. P[T ≤ 10] = 1 − P[T > 10] = 1 − 0.368 = 0.632
3. P[T ≤ 3] = 1 − e^(−0.1 × 3) = 1 − e^(−0.3) = 0.259
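The same three probabilities can be checked in R with pexp(); a minimal sketch (not in the original slides), using the rate λ = 0.1 per minute:
pexp(10, rate = 0.1, lower.tail = FALSE)   # 1. P(T > 10)  ≈ 0.368
pexp(10, rate = 0.1)                       # 2. P(T ≤ 10) ≈ 0.632
pexp(3, rate = 0.1)                        # 3. P(T ≤ 3)  ≈ 0.259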
PROBABILITY DISTRIBUTIONS
IN R
Basics of Probability Distributions - Normal Distribution, Binomial
Distribution, Poisson Distribution, Other Distributions. Inferential
Statistics - T-Test, F-Test, Z-Test, ANOVA, Chi-Square Test
• Probability distribution (PD) & statistical analysis are closely related
• Analysts make predictions based on a certain population, which is mostly
under a probability distribution
• If we find that the data selected for prediction doesn’t follow the exact
assumed PD, the expected results can be refuted
• In R, probability functions take the form: [dpqr] distribution_abbreviation()
• There is a root name, prefixed by one of the letters [p - Cumulative
distribution function (c. d. f.) – Probability, q - Inverse c. d. f. – Quantile, d -
Probability density function – Density, r - Random distribution – Random]
Distribution – Abbreviation
Beta – beta
Binomial – binom
Cauchy – cauchy
Chi-squared (non-central) – chisq
Exponential – exp
F – f
Gamma – gamma
Geometric – geom
Hypergeometric – hyper
Logistic – logis
Multinomial – multinom
Negative Binomial – nbinom
Normal – norm
Poisson – pois
Wilcoxon Signed Rank – signrank
T – t
Uniform – unif
Weibull – weibull
When sampling from the same population, using a fixed confidence level, the larger the sample size n, the narrower the confidence interval. When sampling from the same population, using a fixed sample size, the higher the confidence level, the wider the confidence interval.
[Figures: sampling distribution of the mean & standard normal curves, comparing an 80% confidence interval, x̄ ± 1.28 σ/√n, with a 95% confidence interval, x̄ ± 1.96 σ/√n, for sample sizes n = 20 and n = 40]
NORMAL DISTRIBUTION (ND)
• Four (4) functions generate the values associated with the normal distribution – pnorm, qnorm, dnorm, & rnorm
• To get a full list of them & their options, use the help command
help(Normal)
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
• If we don’t specify the mean & standard deviation, the standard normal distribution is assumed (mean = 0 & standard deviation = 1)
Arguments:
x, q – vector of quantiles
p – vector of probabilities
n – number of observations; if length(n) > 1, the length is taken to be the number required
mean – vector of means
sd – vector of standard deviations
log, log.p – logical; if TRUE, probabilities p are given as log(p)
lower.tail – logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
dnorm() – Given a set of values, returns height of probability
distribution at each point - dnorm(x, mean = 0, sd = 1, log = FALSE)
 Returns height of normal curve - mean = zero (0) & standard deviation
= one (1)
dnorm(0)
 User can change the mean & standard deviation in the argument
dnorm(0,mean=4)
dnorm(0,mean=4, sd=10)
 Next, plot the graph of a normal distribution
curve(dnorm,-3,3)
x = seq(-3,3,by=0.1)
y = dnorm(x,mean=2.5,sd=0.8)
plot(x,y)
#For a normal curve with µ = 500 & σ = 100
require(ggplot2)   ## Loading required package: ggplot2
ggplot(data.frame(x = c(200, 800)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 500, sd = 100)) +
  geom_segment(aes(x = 200, y = 0, xend = 800, yend = 0))  # baseline segment; the original slide is truncated here, so these segment coordinates are assumed
pnorm() – Given a number/ a list, computes the probability that a
normally distributed random number will be less than (<) that
number - pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p =
FALSE)
 Returns the area under a given value
pnorm(0,mean=2,sd=3)
For a normal curve with µ = 500 & σ = 100, what is the probability that
P(X<400)
pnorm(400,mean=500,sd=100)
Output: - 0.16
 Alternatively, to get the area over a certain value, use the option
lower.tail=FALSE
What is the probability that P(X>650)
1 - pnorm(650,mean=500,sd=100) #OR
pnorm(650,mean=500,sd=100,lower.tail=FALSE)
Output: - 0.067
What is the probability that P(|X-500| > 150)
pnorm(350, 500, 100) + (1 - pnorm(650, 500, 100))
 To plot the graph of pnorm, use curve function: curve(pnorm(x),-3,3)

x = seq(-20,20,by=.1)
y = pnorm(x,mean=2,sd=3)
plot(x,y)

pnorm(0,lower.tail=FALSE)
pnorm(0,mean=2,sd=1,lower.tail=FALSE)
x = seq(-20,20,by=.1)
y = pnorm(x,mean=2,sd=3, lower.tail = FALSE)
plot(x,y)
qnorm() – Inverse of pnorm(). Returns the Z-score of a given
probability
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.5)
qnorm(0.5,mean=1,sd=2)
x = seq(0,1,by=.05)
y = qnorm(x,mean=3,sd=2)
plot(x,y)

#For a normal curve with µ = 500 & σ = 100


#Find the 0.05 and 0.95 quantiles
qnorm(c(0.05, 0.95), 500, 100)
Output: - 336 664
rnorm() – Generate random numbers from a normal distribution.
Number of random numbers you want has to be passed as an
argument. Mean & standard deviation are optional arguments
rnorm(n, mean = 0, sd = 1)
rnorm(4)
set.seed(50)
x = rnorm(200,mean=-2,sd=4)
hist(x)
qqnorm(x)
qqline(x)
Shapiro-Wilk Test – widely used test for normality of data
shapiro.test(x)
 The p-value indicates whether the sample is consistent with a normal distribution
 If p > 0.05, conclude that the sample comes from a normal distribution
 On the other hand, if p <= 0.05, conclude that the sample does not come from a normal distribution
#Normal Distribution
set.seed(50)
x = rnorm(100, mean = 0, sd = 1)
hist(x)
#Uniform Distribution
set.seed(50)
y = runif(100, 0, 6)
hist(y)
shapiro.test(x)
shapiro.test(y)
BINOMIAL DISTRIBUTION (BD)
• Four (4) functions generate the values associated with the binomial distribution – pbinom, qbinom, dbinom, & rbinom
• To get a full list of them & their options, use the help command
help(Binomial)
• BD requires 2 extra parameters – the number of trials & the probability of success for a single trial
• Commands follow the same kind of naming convention, & the functions are
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
Arguments:
x, q – vector of quantiles
p – vector of probabilities
n – number of observations; if length(n) > 1, the length is taken to be the number required
size – number of trials (zero or more)
prob – probability of success on each trial
log, log.p – logical; if TRUE, probabilities p are given as log(p)
lower.tail – logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
rbinom(n, size, prob)
 dbinom() – Given a set of values, returns the height of the probability distribution at each point
 For the binomial distribution there are no default parameters: the number of trials (size) & the probability of success (prob) must be supplied
#dbinom(x, size, prob, log = FALSE)
help(Binomial)
x = seq(0,50,by=1)
y = dbinom(x,50,0.2)
plot(x,y)
x = seq(0,50,by=1)
y = dbinom(x,50,0.6)
plot(x,y)
x = seq(0,100,by=1)
y = dbinom(x,100,0.6)
 pbinom() – Computes the cumulative probability distribution
function
#pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
pbinom(24,50,0.5)
# If the coin is tossed n = 50 times and p = 0.3
#What is the probability of x=14, P(X = 14)
dbinom(14, 50, 0.3)
#P(X = x) for x = 5,...,10
dbinom(5:10, 50, 0.3)
#P(X ≤18) - 2 ways
pbinom(18, 50, 0.3) #OR
sum(dbinom(0:18, 50, 0.3))
#P(14≤ X ≤18)
pbinom(18, 50, 0.3) - pbinom(13, 50, 0.3) #OR
sum(dbinom(14:18, 50, 0.3))
#P(X ≥20)
1 - pbinom(19, 50, 0.3) #OR
sum(dbinom(20:50, 50, 0.3))
 qbinom() – inverse cumulative probability distribution function
 Given a probability (or a vector of probabilities), returns the smallest value x such that P(X ≤ x) is at least that probability, for the given size & prob
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
#Find the 0.1 and 0.9 quantiles
qbinom(c(0.1, 0.9), 50, 0.3)
#Find the 0.4 quantile
qbinom(0.4, 50, 0.3)
#P(X <= 0.4 quantile)
pbinom(qbinom(0.4, 50, 0.3), 50, 0.3)
#P(X >= 0.4 quantile)
1 - pbinom(qbinom(0.4, 50, 0.3) - 1, 50, 0.3)
 rbinom() – Generate random numbers whose distribution is Binomial
rbinom(n, size, prob)
rbinom(5,100,.2)
 binom.test() – Originates from the stats package. Performs an
exact test of a simple null hypothesis about the probability of
success in a Bernoulli experiment
 To seek help for the function: help(binom.test)
?binom.test
 Syntax - binom.test(x, n, p = 0.5, alternative = c("two.sided",
"less", "greater"), conf.level = 0.95)
Arguments: x – desired number of successes; n - number of trials;
p - hypothesized probability of success; alternative - indicates
the alternative hypothesis and must be one of "two.sided",
"greater" or "less". You can specify just the initial letter; conf.level -
confidence level for the returned confidence interval.
Caselet - There is a game where a gambler can win by rolling the
number 6 on a dice. As part of the rules, gamblers can bring their own
dice. If a gambler tried to cheat in a game, he would use a loaded dice
to increase his chance of winning. If we observe that the gambler won
92 out of 315 games, determine whether the dice was fair by
conducting an exact binomial test.
Null Hypothesis (H0) – The dice is un-biased
Alternative Hypothesis (H1) – The dice is biased
binom.test(x=92, n=315, p=1/6)
Test - Probability of rolling a 6 from an unbiased dice
Result of the test – Since p-value = 3.45e-08 (i.e., p <= 0.05), reject the Null Hypothesis
Interpretation - At the 5% (0.05) significance level, the null hypothesis (the dice is un-biased) is rejected: the evidence suggests the dice is biased
T DISTRIBUTION
• Most common activity in research is the comparison of 2 groups
• Student’s t-test statistic – Outcome variable is continuous & assumed to follow a normal distribution
• Independent t-test: - It can be used to determine whether there
is a difference between 2 independent datasets
• Dependent t-test: - When observations in 2 groups are related
• One sample t-test: - Is the mean of the population different from
the null hypothesis?
• If p<=0.05, reject null hypothesis
• Two samples t-test: - Is there any difference in the mean of the
populations under consideration?
• Example: - Which of the 2 teaching methods is most cost-
effective?
UScrime Dataset (MASS package)
Criminologists are interested in the effect of punishment regimes
on crime rates. This has been studied using aggregate data on 47
states of the USA for 1960 given in this data frame. The variables
seem to have been re-scaled to convenient numbers. This data
frame contains the following columns:
M percentage of males aged 14–24.
So indicator variable for a Southern state (1-southern state, 0-non-southern
state).
Ed mean years of schooling.
Po1 police expenditure in 1960.
Po2 police expenditure in 1959.
LF labour force participation rate.
M.F number of males per 1000 females.
Pop state population.
NW number of non-whites per 1000 people.
U1 unemployment rate of urban males 14–24.
U2 unemployment rate of urban males 35–39.
GDP gross domestic product per head.
Ineq income inequality.
Prob probability of imprisonment.
Time average time served in state prisons.
Y rate of crimes in a particular category per head of population.
Outcome variables of interest will be Prob, U1, & U2. The categorical variable So will serve as the grouping variable.
Independent t-test
• Are you more likely to be imprisoned if you commit a crime in the
south?
• Comparison of interest is southern vs. non-southern states, & the
dependent variable is the probability of incarceration
• Two (2) group independent t-test - Used to test the hypothesis that two
population means are equal
• Assumptions
• 2 groups are independent
• Data are sampled from normal populations
• The format is either
t.test(y ~ x, data) where y is numeric & x is a dichotomous variable, or
t.test(y1, y2) where y1 and y2 are numeric vectors (the outcome variable for
each group)
• Default test assumes unequal variance & applies the Welch degrees-of-freedom modification. Add the var.equal=TRUE option to specify equal variances & a pooled variance estimate
• By default, two-tailed alternative is assumed. Add the option
alternative="less" or alternative="greater" to specify a directional test
• Compare Southern (group 1) and non-Southern (group 0) states
on the probability of imprisonment using a two-tailed test
without the assumption of equal variances
• Null Hypothesis: - Southern & Non-southern states have equal
probabilities of imprisonment
• Alternative Hypothesis: - Southern & Non-southern states do not
have equal probabilities of imprisonment
library(MASS)
t.test(Prob ~ So, data=UScrime)
• Interpretation: - Since (p < 0.05 i.e., .0006506 < 0.05), reject null
hypothesis. Therefore, Southern states & non-Southern states do
not have equal probabilities of imprisonment
• Note: - Because the outcome variable (Prob) is a proportion,
you might try to transform it to normality before carrying out the
t-test
Dependent t-test
• Is unemployment rate for younger males (14–24) greater than
for older males (35–39)?
• In this case, the two groups aren’t independent
• We wouldn’t expect the unemployment rate for younger &
older males in Alabama to be unrelated
• Assumption: - Difference between groups is normally distributed
• The format is
t.test(y1, y2, paired=TRUE) where y1 & y2 are the numeric vectors for the
two dependent groups
• Null Hypothesis: - The mean unemployment rate for older &
younger males is the same
• Alternative Hypothesis: - The mean unemployment rate for older
& younger males is not the same
• Interpretation: - The mean difference (61.5) is large enough to
warrant rejection of the hypothesis that the mean
unemployment rate for older and younger males is the same.
Younger males have a higher rate
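A sketch of how this paired comparison might be run in R (UScrime from the MASS package, as above):
library(MASS)
sapply(UScrime[c("U1", "U2")], mean)          # group means for younger (U1) & older (U2) males
with(UScrime, t.test(U1, U2, paired = TRUE))  # paired (dependent) t-test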
• Four (4) functions generate the values associated with the t distribution – pt, qt, dt, & rt
• Commands assume that the values are normalized to mean zero and standard deviation one
• You have to specify the number of degrees of freedom
• To get a full list of them & their options, use the help command
help(TDist)
dt(x, df, ncp, log = FALSE)
pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
rt(n, df, ncp)
Arguments:
x, q – vector of quantiles
p – vector of probabilities
n – number of observations; if length(n) > 1, the length is taken to be the number required
df – degrees of freedom (> 0, maybe non-integer); df = Inf is allowed
ncp – non-centrality parameter delta; currently, except for rt(), only for abs(ncp) <= 37.62. If omitted, use the central t distribution
log, log.p – logical; if TRUE, probabilities p are given as log(p)
 dt() – Given a set of values, returns height of probability
distribution at each point
#dt(x, df, ncp, log = FALSE)
x = seq(-20,20,by=.5)
y = dt(x,df=10)
plot(x,y)
x = seq(-20,20,by=.5)
y = dt(x,df=50)
plot(x,y)

 pt() – Computes the cumulative probability distribution function


#pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
pt(-3,df=10)
pt(3,df=10)
1-pt(3,df=10)
x = c(-3,-4,-2,-1)
pt((mean(x)-2)/sd(x),df=20)
pt((mean(x)-2)/sd(x),df=40)
 qt () – inverse cumulative probability distribution function
#qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
qt(0.05,df=10)
qt(0.95,df=10)
qt(0.05,df=20)
qt(0.95,df=20)
v = c(0.005,.025,.05)
qt(v,df=253)
qt(v,df=25)
 rt() – random numbers can be generated according to the t distribution
#rt(n, df, ncp)
rt(3,df=10)
rt(3,df=20)
rt(3,df=20)
CHI SQUARE DISTRIBUTION
• R provides several methods of testing the independence of the
categorical variables. Three (3) widely used tests include:
• Chi-square test of independence
• Fisher exact test
• Cochran-Mantel–Haenszel test
• Chi-squared test – Examines whether the distributions of
categorical variables of 2 groups differ (or) tests there is a
relationship between 2 categorical variables
• chisq.test() function can be applied to a 2 way table in order to
produce a chi-square test of independence of the row &
column variables
• Assumptions: - Input samples should satisfy 2 assumptions
1. 2 input variables should be categorical
2. Variable should include 2 or more independent groups
• Null Hypothesis: - Variables A & B are independent
• Alternative Hypothesis: - Variables A & B are not independent
mtcars (Motor Trend Car Road Tests) Dataset
The data is extracted from the 1974 Motor Trend US magazine, comprising fuel consumption & 10 related aspects of automobile design & performance for 32 automobiles (1973–74 models). The description of the 11 numeric variables with the 32 observations in the data frame is as follows:
[,1] mpg – Miles/ (US) gallon
[,2] cyl – Number of Cylinders
[,3] disp – Displacement (cu.in.)
[,4] hp – Gross Horsepower
[,5] drat – Rear Axle Ratio
[,6] wt – Weight (1000 lbs)
[,7] qsec – ¼ Mile Time
[,8] vs – V/S Engine Shape
[,9] am –Transmission (0=automatic, 1=manual)
[,10] gear – Number of Forward Gears
[,11] carb – Number of Carburetors

Prepare a managerial report.


Determine whether the gears in automatic & manual transmission
cars are the same
Null Hypothesis: - There is no difference in number of gears in automatic
& manual transmission cars
Alternative Hypothesis: - There is a difference between the number of
gears in automatic & manual transmission cars
1. Build a contingency table with the inputs of the transmission type &
number of forward gears
ctable = table(mtcars$am,mtcars$gear)
ctable
2. Plot the contingency table (mosaic plot)
mosaicplot(ctable,main="No. of forward gears within automatic &
manual cars", color=2:3, las=1,shade = TRUE)
Interpretation: - The number of forward gears is less in automatic
transmission cars than in manual transmission cars
3. Test whether the number of gears in automatic & manual transmission
cars is the same – Perform chi-squared test
chisq.test(ctable)
Interpretation: - Since p=2.831e-05 i.e., p<=0.05, reject null hypothesis.
The number of forward gears is different in automatic & manual
transmission cars
Arthritis Dataset - vcd Package
The data is extracted from Koch & Edwards (1988) from a double-blind clinical trial investigating a new
treatment for rheumatoid arthritis. The description of the 5 variables with 84 observations in the data
frame is as follows:
ID patient ID.

Treatment factor indicating treatment (Placebo, Treated).

Sex factor indicating sex (Female, Male).

Age age of patient.

Improved ordered factor indicating treatment outcome (None, Some, Marked).


Is there any relationship between treatment received & level of
improvement
Null Hypothesis 1: - There is no relationship between treatment received
and level of improvement
Null Hypothesis 2: - There is no relationship between patient sex and
improvement
1. Build contingency tables with the inputs of treatment received & level of improvement, and of improvement & patient sex
ctable1 = xtabs(~Treatment+Improved, data=Arthritis)
ctable1
chisq.test(ctable1)
ctable2 = xtabs(~Improved+Sex, data=Arthritis)
ctable2
chisq.test(ctable2)
Interpretation: - For the Treatment × Improved table, p < 0.05, so reject Null Hypothesis 1: treatment received & level of improvement appear to be related. For the Improved × Sex table, p > 0.05, so there is no evidence of a relationship between patient sex & improvement
CHI SQUARE DISTRIBUTION
• Four (4) functions generate the values associated with the chi-square distribution – pchisq, qchisq, dchisq, & rchisq
• It is assumed that the value is normalized, so no mean is specified
• Need to specify the number of degrees of freedom
• To get a full list of them & their options, use the help command
help(Chisquare)
dchisq(x, df, ncp = 0, log = FALSE)
pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
rchisq(n, df, ncp = 0)
Arguments:
x, q – vector of quantiles
p – vector of probabilities
n – number of observations; if length(n) > 1, the length is taken to be the number required
df – degrees of freedom (non-negative, but can be non-integer)
ncp – non-centrality parameter (non-negative)
log, log.p – logical; if TRUE, probabilities p are given as log(p)
lower.tail – logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
 dchisq() – Given a set of values, returns height of probability
distribution at each point
#dchisq(x, df, ncp = 0, log = FALSE)
x = seq(-20,20,by=.5)
y = dchisq(x,df=10)
plot(x,y)
x = seq(-20,20,by=.5)
y = dchisq(x,df=12)
plot(x,y)
 pchisq() – Computes the cumulative probability distribution
function
# pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
pchisq(2,df=10)
pchisq(3,df=10)
1-pchisq(3,df=10)
pchisq(3,df=20)
x = c(2,4,5,6)
 qchisq () – inverse cumulative probability distribution function
# qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
qchisq(0.05,df=10)
qchisq(0.95,df=10)
qchisq(0.05,df=20)
qchisq(0.95,df=20)
v = c(0.005,.025,.05)
qchisq(v,df=253)
qchisq(v,df=25)
 rchisq() – random numbers can be generated according to the chi-square distribution
#rchisq(n, df, ncp = 0)
rchisq(3,df=10)
rchisq(3,df=20)
rchisq(3,df=20)
EXPLORATORY DATA ANALYSIS

• After loading & cleaning the data in R, we proceed to model


building
• In Exploratory Data Analysis (EDA), the role of the researcher is
to explore the data in as many ways as possible until a
“plausible story” of the data emerges
• What is EDA?
An approach to analyzing data sets to summarize their main
characteristics, often with visual methods
• Data exploration can include
• Spotting outliers
• Checking distributions & other assumptions
• Examining relationships
• Comparing mean differences etc.
Caselet - mtcars (Motor Trend Car Road Tests) Dataset
The data was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 other related aspects of automobile design & performance for 32 automobiles (1973–74 models). The description of the 11 numeric variables with the 32 observations in the data frame is as follows:
[,1] mpg – Miles/ (US) gallon
[,2] cyl – Number of Cylinders
[,3] disp – Displacement (cu.in.)
[,4] hp – Gross Horsepower
[,5] drat – Rear Axle Ratio
[,6] wt – Weight (1000 lbs)
[,7] qsec – ¼ Mile Time
[,8] vs – V/S Engine Shape
[,9] am –Transmission (0=automatic, 1=manual)
[,10] gear – Number of Forward Gears
[,11] carb – Number of Carburetors

Prepare a managerial report.


Data Preparation
1. Import Data - To visualize data, import data from an external source & convert it into a useful format
   R can import data from almost any source (text files, excel spreadsheets, statistical packages, & database management systems)
2. Cleaning Data - most time-consuming part of any data analysis
   While there are many approaches, the dplyr & tidyr packages are the quickest & easiest
starwars Dataset – tidyverse Package
A tibble with 87 rows & 13 variables
name - Name of the character
height - Height (cm)
mass - Weight (kg)
hair_color,skin_color,eye_color - Hair, skin, and eye colors
birth_year - Year born (BBY = Before Battle of Yavin)
gender - male, female, hermaphrodite, or none.
homeworld - Name of homeworld
species - Name of species
films - List of films the character appeared in
vehicles - List of vehicles the character has piloted
starships - List of starships the character has piloted

Cleaning Data
1. Selecting Variables – Allows to limit dataset to specified variables (columns)
2. Selecting Observations - Limits observations (rows - meeting a specific criteria. Multiple
criteria can be combined with the & (AND) and | (OR) symbols)
3. Creating/ recoding Variables - Creates new variables/ transforms existing ones
4. Summarizing data - Reduces multiple values to a single value
5. Using Pipes - Operator passes the result on the left to the first parameter of the function
on the right
6. Reshaping Data
7. Missing Data
Selecting Variables – select() function limits dataset to specified
variables (columns) – starwars Dataset
o Keep only the variables name, height & gender
s1 = select(starwars, name, height, gender)
o Keep the variables name, & all variables between mass & species inclusive
s2 = select(starwars, name, mass:species)
o Keep all variables except birth_year & gender - s3 = select(starwars, -birth_year, -gender)
Selecting Observations – filter() function allows to limit
observations (rows) meeting a specific criteria. Multiple criteria
can be combined with the & (AND) and | (OR) symbols – starwars
Dataset
o Select only females - s4 = filter(starwars, gender == "female")
o Select females from Alderaan - s5 = filter(starwars, gender == "female" & homeworld == "Alderaan")
o Select individuals from Alderaan, Coruscant, or Endor
s6 = filter(starwars, homeworld == "Alderaan" | homeworld == "Coruscant" | homeworld == "Endor")
s7 = filter(starwars, homeworld %in% c("Alderaan", "Coruscant", "Endor"))
Creating/ Recoding Variables – mutate() function creates new variables or transforms existing ones
o Convert height in cms to inches & mass in kgs to pounds
s8 = mutate(starwars, height = height * 0.394, mass = mass * 2.205)
o If height > 180 then heightcat = "tall", otherwise heightcat = "short"
s9 = mutate(starwars, heightcat = ifelse(height > 180, "tall", "short"))
o Convert eye color that is not black, blue or brown to "other"
s10 = mutate(starwars, eye_color = ifelse(eye_color %in% c("black", "blue", "brown"), eye_color, "other"))
o Set heights > 200 or heights < 75 to missing
s11 = mutate(starwars, height = ifelse(height < 75 | height > 200, NA, height))
Summarizing Data – summarize() function reduces multiple values to a single value (such as a mean). Often used in conjunction with the group_by() function, to calculate statistics by group. The na.rm=TRUE option is used to drop missing values – starwars Dataset
o Calculate mean height & mass
s12=summarize(starwars,mean_ht=mean(height,na.rm=TRUE),mean_mass=mean(mass,
na.rm=TRUE))
o Calculate mean height & weight by gender
s13 = group_by(starwars, gender)
s14 = summarize(s13, mean_ht=mean(height, na.rm=TRUE), mean_mass=mean(mass,
na.rm=TRUE))
Using Pipes – Packages like dplyr & tidyr allow to write the code in
a compact format using the pipe %>% operator – starwars
Dataset
o Calculate the mean height for women by species
s15 = starwars %>% filter(gender == "female") %>% group_by(species) %>% summarize(mean_ht = mean(height, na.rm = TRUE))
Note: - %>% operator passes the result on the left to the first parameter of the
function on the right
Reshaping Data – Some graphs require the data to be in wide format, while
some graphs require the data to be in long format
 2 sets of methods can be used for reshaping the data using R Studio
 gather() & spread() - tidyr package
 melt() & dcast() - reshape2 package
 Reshaping data with tidyr package - package built for the purpose of
simplifying the process of creating tidy data
 gather() - makes “wide” data longer
 spread() - makes “long” data wider
 separate() - splits a single column into multiple columns
 unite() - combines multiple columns into a single column
 gather() - tidyr package - Reshaping wide format to long format -
 Sometimes data is unstacked with common attribute spread out across columns
 To reformat data, such common attributes are gathered together as a single
variable
 gather() function takes multiple columns & collapses them into key-value pairs,
duplicating all other columns as needed
o You can convert a wide dataset to a long dataset using
library(tidyr)
long_data = gather(wide_data, key = "variable", value = "value", sex:income)
o Conversely, you can convert a long dataset to a wide dataset using
library(tidyr)
wide_data = spread(long_data, variable, value)
msleep Dataset – ggplot2 Package
This is an updated and expanded version of the mammals sleep
dataset. Updated sleep times and weights were taken from V. M.
Savage and G. B. West. A quantitative, theoretical framework for
understanding mammalian sleep. Proceedings of the National
Academy of Sciences, 104 (3):1051-1056, 2007. A data frame
with 83 rows and 11 variables.
name - common name
genus
vore - carnivore, omnivore or herbivore?
order
conservation - the conservation status of the animal
sleep_total - total amount of sleep, in hours
sleep_rem - rem sleep, in hours
sleep_cycle - length of sleep cycle, in hours
awake - amount of time spent awake, in hours
brainwt - brain weight in kilograms
bodywt - body weight in kilograms
Missing Data – Missing values are considered to be the first
obstacle in predictive modeling
o Choice of method to impute missing values, largely
influences the model’s predictive ability
o In most statistical analysis methods, listwise deletion is the
default method used to impute missing values
o But it is not as good, since it leads to information loss
Robust packages for missing value imputations: - R Users are
endowed with some incredible R packages for missing
values imputation. These packages arrive with some inbuilt
functions & a simple syntax to impute missing data at once
5 R packages popularly known for missing value imputation
1. MICE
2. Amelia
3. missForest
4. Hmisc
5. mi
Multivariate Imputation via Chained Equations (MICE)
Package – Imputes data on a variable by variable basis by
specifying an imputation model per variable
o Suppose we have X1, X2….Xk variables. If X1 has missing values,
then it will be regressed on variables X2 to Xk. The missing values
in X1 will be replaced by predictive values obtained. Similarly, if
X2 has missing values, then X1, X3 to Xk variables will be used in
prediction model as independent variables. Later, missing
values will be replaced with predicted values
o By default, linear regression is used to predict continuous missing
values. Logistic regression is used for categorical missing values
o Precisely, the methods used by this package are:
o PMM (Predictive Mean Matching) – For numeric variables
o logreg(Logistic Regression) – For Binary Variables( with 2 levels)
o polyreg(Bayesian polytomous regression) – For Factor Variables (>= 2
levels)
o Proportional odds model (ordered, >= 2 levels)
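As an illustration of how the mice package is typically called, a minimal sketch (assuming a data frame dat that contains missing values; dat is a placeholder, not a dataset used elsewhere in these notes):
library(mice)
imp = mice(dat, m = 5, method = "pmm", seed = 123)  # five imputed datasets via predictive mean matching
completed = complete(imp, 1)                        # extract the first completed dataset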
Missing Data – Real data are likely to contain missing values. There
are three basic approaches to dealing with missing data: feature
selection, listwise deletion, & imputation – msleep Dataset. msleep
dataset describes the sleep habits of mammals and contains
missing values on several variables – msleep dataset
1. Feature Selection: - In feature selection, you delete variables
(columns) that contain too many missing values
What is the proportion of missing data for each variable?
s16 = colSums(is.na(msleep))/nrow(msleep)
round(s16, 2)
Sixty-one percent of the sleep_cycle values are missing. You may decide to
drop it
2. Listwise Deletion – Involves deleting observations (rows) that contain missing values in any of the variables of interest
Create a dataset containing genus, vore & conservation. Delete any rows
containing missing data
s17 = select(msleep, genus, vore, conservation)
s18 = na.omit(s17)
BostonHousing Dataset – mlbench Package
The dataset (Boston Housing Price) was taken from the StatLib library which is maintained at Carnegie Mellon University. Housing data for 506 census tracts of Boston from the 1970 census is collected in the BostonHousing dataset. The dataframe BostonHousing contains the original data by Harrison and Rubinfeld (1979); BostonHousing2 is a corrected version with additional spatial information. The original data has 506 observations on 14 variables.
crim - per capita crime rate by town
zn - proportion of residential land zoned for lots over 25,000 sq.ft
indus - proportion of non-retail business acres per town
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox - nitric oxides concentration (parts per 10 million)
rm - average number of rooms per dwelling
age - proportion of owner-occupied units built prior to 1940
dis - weighted distances to five Boston employment centres
rad - index of accessibility to radial highways
tax - full-value property-tax rate per USD 10,000
ptratio - pupil-teacher ratio by town
b - 1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat - percentage of lower status of the population
medv - median value of owner-occupied homes in USD 1000's
3. Imputation: - Imputation involves replacing missing values with
“reasonable” guesses about what the values would have been
if they had not been missing
Imputation with mean/ median/ mode - Replacing the missing
values with the mean / median / mode is a crude way of
treating missing values
library(Hmisc)
impute(BostonHousing$ptratio, mean) #Replace with mean
impute(BostonHousing$ptratio, median) #Replace with median
impute(BostonHousing$ptratio, 20) #Replace with specific value
BostonHousing$ptratio[is.na(BostonHousing$ptratio)]=mean(BostonHousing$ptratio,
na.rm=TRUE) #Impute Manually
4. Prediction: - kNN Imputation
• DMwR::knnImputation uses a k-Nearest Neighbours approach to impute missing values. For every observation to be imputed, it identifies the ‘k’ closest observations based on the Euclidean distance & computes the weighted average (weighted based on distance) of these ‘k’ observations
• Advantage: - Can impute missing values in all variables with one call to the function
• Takes the whole data frame as the argument
(Source: https://rkabacoff.github.io/datavis/DataPrep.html#DataPrep)
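A minimal sketch of the call (assuming a data frame dat that contains missing numeric values; dat is a placeholder):
library(DMwR)
imputed = knnImputation(dat)   # k = 10 nearest neighbours by default
anyNA(imputed)                 # FALSE once all missing values have been imputed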
Prediction - rpart: - Missing values can also be predicted using rpart (decision trees); see the sources below
Sources:
• http://r-statistics.co/Missing-Value-Treatment-With-R.html
• https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
• https://www.rdocumentation.org/packages/VIM/versions/4.8.0/topics/aggr
• https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/lecture-and-recitation-notes/
INTRODUCTION TO PYTHON

3. BUSINESS ANALYTICS USING PYTHON


• Among top 4/ 5 most widely used PL’s in the world
• Developed by Guido van Rossum (late 1980s) - administered by Python Software Foundation –
derived from ABC language
• Python - General-purpose, open source scripting language - makes it easy to utilize & direct
other software components – program meant to use in small/ medium projects
• Python programs are executed by an interpreter with focus on shrinking development time
• Current user base - Google, YouTube, Industrial Light & Magic, ESRI, the BitTorrent file sharing
system, NASA’s Jet Propulsion Lab, the game Eve Online, and the National Weather Service
• Application domains - system administration, website development, cell phone scripting, &
education to hardware testing, investment analysis, computer games, & spacecraft control
• Built-in high level data types: strings, lists, dictionaries, etc.
• Usual control structures: if, if-else, if-elif-else, while, & a powerful collection iterator (for)
• Multiple levels of organizational structure: functions, classes, modules, & packages. These assist
in organizing code. Example: - Python standard library
• Important features of Python – Supports all 3 programming models – Procedural, Object
Oriented (OOPs) & Functional
• Who uses Python today?
• Google – In web search system
• YouTube – Video sharing service
• BitTorrent – Peer to peer file sharing system
• Intel, HP, Seagate, IBM, Qualcomm – Hardware testing
• Pixar, Industrial Light & Magic – Movie animation
• JP Morgan, Chase, UBS – Financial market forecasting
• NASA, FermiLab – Scientific programming
• iRobot – Commercial robot vacuum cleaners
• NSA – Cryptographic & intelligence analysis
• IronPort – Email servers
• Getting Python Source Code
• Python official website – www.python.org
• Documentation website – www.python.org/doc
Optimized for software quality, developer
productivity, program portability, & component
integration
1. Readability: - Clear, simple, & concise instructions
are used. Programs written in Python are easier to
maintain, debug, or enhance
2. Higher Productivity: - Code in Python is shorter,
simpler, & less verbose than other high-level PL’s.
Well-designed built-in features & standard library.
Access to third party modules & source libraries
makes Python more efficient
3. Less Learning Time: - Python is relatively easy to
learn because of its simple syntax and shorter
codes
4. Runs Across Different Platforms - Python works on
Windows, Linux/UNIX, Mac, other operating
systems. It also runs on microcontrollers used in
appliances, toys, remote controls, embedded
devices, and other similar devices
• Jupyter Notebook - is an open-source
web application that allows you to
create & share documents that
contain live code, equations,
visualizations & narrative text
• Uses include: Data cleaning &
transformation, numerical simulation,
statistical modeling, data visualization,
machine learning, & much more

• Anaconda Navigator - is a desktop


graphical user interface (GUI)
included in Anaconda® distribution
that allows you to launch applications
& easily manage conda packages,
environments & channels without
using command-line commands
• Navigator can search for packages on
Anaconda Cloud or in a local
Anaconda Repository. It is available for
Windows, macOS, & Linux
NUMPY & PANDAS
• NumPy – Python package – Important library for data manipulation, wrangling, & analysis
• Pandas - Python package – Allow to work with both cross-sectional data & time series data
• Initial work for pandas was done by Wes McKinney in 2008 while he was a developer at AQR
Capital Management
• Data Retrieval – Pandas provide numerous ways to retrieve & read in data. We can convert
data from csv files, databases, flat files & so on into dataframes
• CSV Files to Dataframe – CSV (Comma Separated Files) are perhaps one of the most widely
used ways of creating a dataframe. We can easily read in a CSV, or any delimited file (TSV),
using pandas & convert into a dataframe
• Data Access – Most important part after reading in our data is accessing the data using data
structures access mechanisms
• Head & Tail
• head(n=10) – gives the first 10 rows of the dataframe; gives first 5 rows by default
• tail() – gives us the last few rows of the dataframe
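A minimal pandas sketch of these steps (the file name sales_data.csv is illustrative only):
import pandas as pd
df = pd.read_csv("sales_data.csv")   # read a delimited file into a dataframe
print(df.head(10))                   # first 10 rows
print(df.tail())                     # last 5 rows by default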
5. DATA ANALYSIS – DESCRIPTIVE ANALYTICS
Univariate Analysis – Summary statistics of single numbers. 4 key characteristics of
numerical data include
DESCRIBING DATA NUMERICALLY – DESCRIPTIVE ANALYSIS
Central Tendency: Mean, Median, Mode, Percentiles, Quartiles
Variation: Range, Coefficient of Variation, Variance, Standard Deviation, Mean Deviation
Shape: Skewness
Kurtosis

Characteristic – Interpretation
Central Tendency – Where are the data values concentrated?
Dispersion – How much variation is there in the data? How spread out are the data values? Are there unusual values?
Shape – Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
Kurtosis – What is the height and sharpness of the peak relative to the rest of the data?

• Easy way to assimilate summaries of data - Tables, diagrams & graphs


• To interpret significance of data - a concise numerical description is preferred
• Key statistical measures that describe data are:
1.Measures of central tendency (or location)
2.Measures of dispersion (or spread)
3.Measures of shape (skewness and kurtosis)
5.1 Measures of Central Tendency/ Location
o Average or typical observed value of a variable in a data set
o Center of the frequency distribution of the data
o Commonly used measures of central tendency appropriate for three different levels of
measurement (nominal, ordinal, & interval) – arithmetic mean, mode & median
o Geometric & harmonic means, are appropriate only for ratio variables
Central Tendency
Arithmetic Mean: X̄ = (ΣXi) / n
Median: mid point of ranked values
Mode: most frequently observed value
Geometric Mean: XG = (X1 × X2 × … × Xn)^(1/n)
CONSIDERATIONS FOR CHOOSING A MEASURE OF CENTRAL TENDENCY
• Interval/ ratio variables – Mode, Median, & Mean
• Nominal variables – Mode
• Ordinal variables – Mode & Median
5.2 Measures of Dispersion/ Variation
• Measures of location only describe the center of the data. It doesn’t tell us
anything about the spread of the data
• Variation/ Dispersion - “spread” of data points about the center of the
distribution in a sample, that is, the extent to which the observations are
scattered
• Enables to study & compare the spread in two or more distributions
 Example: - in choosing supplier A or supplier B we can measure the variability in delivery time for each
Variation: Range, Mean Deviation, Variance, Standard Deviation, Coefficient of Variation
(Distributions can have the same center but different variation)
Statistic – Formula – Excel – Pro – Con
Range: xmax − xmin; Excel: =MAX(Data)-MIN(Data); Pro: Easy to calculate; Con: Sensitive to extreme data values
Variance (s²): Σ(xi − x̄)² / (n − 1); Excel: =VAR(Data); Pro: Plays a key role in mathematical statistics; Con: Non-intuitive meaning
Standard deviation (s): √[Σ(xi − x̄)² / (n − 1)]; Excel: =STDEV(Data); Pro: Uses same units as the raw data ($, £, ¥, etc.); Con: Non-intuitive meaning
Coefficient of variation (CV): 100 × s / x̄; Excel: None; Pro: Measures relative variation in percent so can compare data sets; Con: Requires non-negative data
Mean absolute deviation (MAD): Σ|xi − x̄| / n; Excel: =AVEDEV(Data); Pro: Easy to understand; Con: Lacks “nice” theoretical properties
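These dispersion measures can also be computed in pandas; a sketch with illustrative data (the Series x is not from the case studies):
import pandas as pd
x = pd.Series([12, 15, 9, 20, 14, 18, 11])     # illustrative data
data_range = x.max() - x.min()                  # range
variance = x.var()                              # sample variance (n - 1 denominator)
std_dev = x.std()                               # sample standard deviation
cv = 100 * std_dev / x.mean()                   # coefficient of variation (%)
mad = (x - x.mean()).abs().mean()               # mean absolute deviation
print(data_range, variance, std_dev, cv, mad)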
5.3 Measures of Skewness (critical value: Cri = ± 2 × √(6/N))
• Measure of the degree of asymmetry of a distribution
• Excel uses Fisher’s measure of skewness. From Excel, skew
= 0.4410. This value can be compared with a lower and upper
value to see if 0.4410 suggests a significantly skewed
distribution
• Example: - Lower/upper limit is ± 1.36 and 0.4410 lies
between -1.36 and +1.36. The value of 0.4410 suggests the
data values are not significantly skewed
Excel Function Method
Fisher’s skew = SKEW(B4:B16) = 0.4410
5.4 Measures of Kurtosis (critical value: Cri = ± 2 × √(24/N))
• Measure of whether the data are peaked or flat relative to a normal distribution
• Excel uses Fisher’s measure of kurtosis
• From Excel, kurtosis = -0.4253. This value can be compared with a lower and upper value to see if -0.4253 suggests a significantly ‘peaked’ distribution
• Example: - Lower/upper limit is ± 2.72 and -0.4253 lies between -2.72 and +2.72. The value of -0.4253 suggests the data values do not have a significant kurtosis problem
Excel Function Method: Fisher’s kurtosis = KURT(B4:B16) = -0.4253
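For the Python unit, the same measures can be obtained in pandas; a short sketch with illustrative data (not the slide's B4:B16 values):
import pandas as pd
x = pd.Series([12, 15, 9, 20, 14, 18, 11])   # illustrative data
print(x.skew())   # sample skewness, comparable to Excel's SKEW()
print(x.kurt())   # sample excess kurtosis, comparable to Excel's KURT()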
DESCRIPTIVE ANALYSIS - DESCRIBE()
Descriptive Analytics - Give insights about the shape of each attribute
• Often you can create more summaries than you have time to review.
• describe() function - pandas library – Gives the mean, std & IQR values -
function excludes the character columns & gives summary about numeric
columns
• include argument - Pass columns that need to be considered for summarizing.
Takes the list of values; By default, it is number
• object − Summarizes String columns
• number − Summarizes Numeric columns
• all − Summarizes all columns together (Should not pass it as a list value). Lists 8
statistical properties of each attribute – Count, Mean, Standard Deviation, Minimum
Value, 25th Percentile, 50th Percentile (Median), 75th Percentile, Maximum Value
THE IMPORTANCE OF FOOD & NUTRITION
OBESITY TRENDS AMONG US ADULTS - USDA
1. Read the dataset into Python.
2. Display the structure of the dataset. How many rows of data (observations)
are in this dataset? How many variables are there in the dataset?
3. Perform exploratory analysis and answer the following questions. What is the
maximum value of the dimension “VitaminC"? What is the minimum value of
the dimension “VitaminE"? How much is the mean of the dimension
“VitaminD"?
4. Get the maximum value of the column 'Sodium‘. Get the entire row which has
the maximum value for Sodium.
5. Perform exploratory analysis on select columns (columns dealing with vitamins
– VitaminC, VitaminE, VitaminD).
6. Select columns (Description & Sodium) for rows (1,2,3,4, & 5) using loc.
7. Investigate the food which has high Sodium level recorded as 38758 mg.
1. Read the dataset into Python
import pandas as pd
data = pd.read_excel("D:/Mumbai - NMIMS - 24.01.2020/Case_Study_USDA.xls")
2. Display the structure of the dataset. How many rows of data (observations) are in this
dataset? How many variables are there in the dataset?
print("Display the dimensions of the dataset:",data.shape)
print("Display the column names \n",data.columns)
print("Initial Information about the dataset")
data.info()
Observations
• There are 7058 observations collected on 16 variables
• Variables include: ID – Unique Identification no. starting with 1001; Description – Text
description of each food item studied; Calories – Amount of calories in 100 grams of food,
measured in kilo calories; Protein, TotalFat, Carbohydrate, SaturatedFat, Sugar – measured in
grams; Sodium, Cholesterol, Calcium, Iron, Potassium, VitaminC – measured in milli grams;
VitaminE & VitaminD – measured in standard national units
3. Perform exploratory analysis and answer the following questions. What is the maximum
value of the dimension “VitaminC"? What is the minimum value of the dimension
“VitaminE"? How much is the mean of the dimension “VitaminD"?
#Exploratory Analysis - describe()
description = data.describe(include='all')
print(description)
Observations
• Displays Count, Unique, Top, Frequency, Mean, Standard Deviation, Minimum Value, 25th
Percentile, 50th Percentile (Median), 75th Percentile, & Maximum Value
• Maximum amount of Cholesterol is 3100 mg while the mean is only 41.55 mg
• Startling fact – Maximum Sodium levels are recorded as 38758 mg, which is huge against
the daily recommended levels of 2300 mg
4. Get the maximum value of the column 'Sodium'. Get the entire row which has the maximum value for Sodium.
print(data['Sodium'].max())
data.loc[data['Sodium'].idxmax()]
Interpretation: Maximum Sodium levels are recorded as 38758 mg, against the daily recommended levels of 2300 mg
5. Perform exploratory analysis on select columns
(columns dealing with vitamins – VitaminC,
VitaminE, VitaminD).

data_partial=data.iloc[:,[13,14,15]]
data_partial
USDA_partial= data_partial.describe(include='all')
USDA_partial
6. Select columns (Description & Sodium) for rows (1,2,3,4, & 5) using loc.
data.loc[[1,2,3,4,5],['Description','Sodium']]
7. Investigate the food which has high Sodium level recorded as 38758 mg.
data.loc[data['Sodium']==38758]
Interpretation: The 265th food in the dataset has the highest Sodium level
CASE STUDY – STOP AND FRISK
An analysis by the New York Civil Liberties Union (NYCLU) revealed that
innocent New Yorkers have been subjected to police stops and street
interrogations more than 5 million times since 2002, and that black and Latino
communities continue to be the overwhelming target of these tactics. Nearly
nine out of 10 stopped-and-frisked New Yorkers have been completely
innocent, according to the NYPD's own reports:
• In 2002, New Yorkers were stopped by the police 97,296 times. 80,176 were totally
innocent (82 percent).
• In 2003, New Yorkers were stopped by the police 160,851 times. 140,442 were totally
innocent (87 percent). 77,704 were black (54 percent). 44,581 were Latino (31 percent).
17,623 were white (12 percent). 83,499 were aged 14-24 (55 percent).
• In 2004, New Yorkers were stopped by the police 313,523 times. 278,933 were totally
innocent (89 percent). 155,033 were black (55 percent). 89,937 were Latino (32
percent). 28,913 were white (10 percent). 152,196 were aged 14-24 (52 percent).
• In 2005, New Yorkers were stopped by the police 398,191 times. 352,348 were totally
innocent (89 percent). 196,570 were black (54 percent). 115,088 were Latino (32
percent).
• 40,713 were white (11 percent). 189,854 were aged 14-24 (51 percent).
• In 2006, New Yorkers were stopped by the police 506,491 times. 457,163 were totally
innocent (90 percent). 267,468 were black (53 percent). 147,862 were Latino
(29percent).
• 53,500 were white (11 percent). 247,691 were aged 14-24 (50 percent).
• In 2007, New Yorkers were stopped by the police 472,096 times. 410,936 were totally innocent (87
percent). 243,766 were black (54 percent). 141,868 were Latino (31 percent).
52,887 were white (12 percent). 223,783 were aged 14-24 (48 percent).
• In 2008, New Yorkers were stopped by the police 540,302 times. 474,387 were totally innocent (88
percent). 275,588 were black (53 percent). 168,475 were Latino (32 percent). 57,650 were white (11
percent). 263,408 were aged 14-24 (49 percent).
• In 2009, New Yorkers were stopped by the police 581,168 times. 510,742 were totally innocent (88
percent). 310,611 were black (55 percent). 180,055 were Latino (32 percent). 53,601 were white (10
percent). 289,602 were aged 14-24 (50 percent).
• In 2010, New Yorkers were stopped by the police 601,285 times. 518,849 were totally innocent (86
percent). 315,083 were black (54 percent). 189,326 were Latino (33 percent). 54,810 were white (9
percent). 295,902 were aged 14-24 (49 percent).
• In 2011, New Yorkers were stopped by the police 685,724 times. 605,328 were totally innocent (88
percent). 350,743 were black (53 percent). 223,740 were Latino (34 percent). 61,805 were white (9
percent). 341,581 were aged 14-24 (51 percent).
• In 2012, New Yorkers were stopped by the police 532,911 times. 473,644 were totally innocent (89
percent). 284,229 were black (55 percent). 165,140 were Latino (32 percent). 50,366 were white (10
percent).
• In 2013, New Yorkers were stopped by the police 191,558 times. 169,252 were totally innocent (88
percent). 104,958 were black (56 percent). 55,191 were Latino (29 percent). 20,877 were white (11
percent).
• In 2014, New Yorkers were stopped by the police 45,787 times. 37,744 were totally innocent (82
percent). 24,319 were black (53 percent). 12,489 were Latino (27 percent). 5,467 were white (12
percent).
• In 2015, New Yorkers were stopped by the police 22,939 times. 18,353 were totally innocent (80
percent). 12,223 were black (54 percent). 6,598 were Latino (29 percent). 2,567 were white (11
percent).
• In the first three quarters of 2016 (January – September), New Yorkers were stopped by the police
10,171 times. 7,758 were totally innocent (76 percent). 5,401 were black (54 percent). 2,944 were
Latino (29 percent). 1,042 were white (10 percent).
About the Data
Every time a police officer stops a person in NYC, the officer is supposed
to fill out a form to record the details of the stop. Officers fill out the forms
by hand, and then the forms are entered manually into a database.
There are 2 ways the NYPD reports this stop-and-frisk data: a paper
report released quarterly and an electronic database released annually.
The paper reports – which the NYCLU releases every three months –
include data on stops, arrests, and summonses. The data are broken
down by precinct of the stop and race and gender of the person
stopped. The paper reports provide a basic snapshot on stop-and-frisk
activity by precinct. The electronic database includes nearly all of the
data recorded by the police officer after a stop. The data include the
age of person stopped, if a person was frisked, if there was a weapon or
firearm recovered, if physical force was used, and the exact location of
the stop within the precinct. Having the electronic database allows
researchers to look in greater detail at what happens during a stop.
Below are CSV files containing data from the 2012 electronic database.
This file contains 101 variables and 45,787 observations, each of which
represents a stop conducted by an NYPD officer.
Source: https://www.nyclu.org/en/stop-and-frisk-data
http://michael.hahsler.net/research/arules_RUG_2015/demo/
year - year of stop yyyy; pct - precinct of stop 1 through 123; ser_num -UF-250 serial number nnn; datestop - date of stop mmddyyyy;
timestop - time of stop hhmm; city - location of stop city (1 – Manhattan, 2 – Brooklyn, 3 – Bronx, 4 – Queens, 5 - Staten Island); sex - suspect's
sex (0 – female, 1 – male); race - suspect's race (1 – black, 2 - black Hispanic, 3 - white Hispanic, 4 – white, 5 - Asian/Pacific Islander, 6 - Am.
Indian/Native); dob - suspect's date of birth mmddyyyy; age - suspect's age nnn; height - suspect's height in inches nnn; weight - suspect's
weight in pounds nnn; haircolor - suspect's haircolor (1 – black, 2 – brown, 3 – blonde, 4 – red, 6 – white, 7 – bald, 8 – sandy, 9 - salt and
pepper, 10 – dyed, 11 – frosted); eyecolor – suspect’s eyecolor (1 – black, 2 – brown, 3 – blue, 4 – green, 5 – hazel, 6 – gray, 7 - maroon,
pink, violet, 8 - two different); build - suspect's build (1 – heavy, 2 – muscular, 3 – medium, 4 – thin); othfeatr - suspect's other features
[string];frisked - was suspect frisked?; searched - was suspect searched?; contrabn - was contraband found on suspect?; pistol - was a
pistol found on suspect?; riflshot - was a rifle found on suspect?; asltweap - was an assault weapon found on suspect?; knifcuti - was a knife
or cutting instrument found on suspect?; machgun - was a machine gun found on suspect?; othrweap - was another type of weapon
found on suspect; arstmade - was an arrest made?; arstoffn - offense suspect arrested for [string]; sumissue - was a summons issued?;
sumoffen -offense suspect was summonsed for [string]; crimsusp - crime suspected [string]; detailcm - crime code description see
attached; perobs - period of observation in minutes nnn; perstop - period of stop in minutes nnn; pf_hands - physical force used by officer –
hands; pf_wall - physical force used by officer - suspect against wall; pf_grnd - physical force used by officer - suspect on ground; pf_drwep
- physical force used by officer - weapon drawn; pf_ptwep - physical force used by officer - weapon pointed; pf_baton - physical force
used by officer – baton; pf_hcuff - physical force used by officer – handcuffs; pf_pepsp - physical force used by officer - pepper spray;
pf_other - physical force used by officer – other; cs_objcs - reason for stop - carrying suspicious object; cs_descr - reason for stop - fits a
relevant description; cs_casng - reason for stop - casing a victim or location; cs_lkout - reason for stop - suspect acting as a lookout;
cs_cloth - reason for stop - wearing clothes commonly used in a crime; cs_drgtr - reason for stop - actions indicative of a drug transaction;
cs_furtv - reason for stop - furtive movements; cs_vcrim - reason for stop - actions of engaging in a violent crime; cs_bulge - reason for stop -
suspicious bulge; cs_other - reason for stop – other; rf_vcrim - reason for frisk - violent crime suspected; rf_othsw - reason for frisk - other
suspicion of weapons; rf_attir - reason for frisk - inappropriate attire for season; rf_vcact - reason for frisk- actions of engaging in a violent
crime; rf_rfcmp - reason for frisk - refuse to comply w officer's directions; rf_verbl - reason for frisk - verbal threats by suspect; rf_knowl -
reason for frisk - knowledge of suspect's prior crim behav; rf_furt - reason for frisk - furtive movements; rf_bulg - reason for frisk - suspicious
bulge; sb_hdobj - basis of search - hard object; sb_outln - basis of search - outline of weapon; sb_admis - basis of search - admission by
suspect; sb_other - basis of search – other; ac_proxm - additional circumstances - proximity to scene of offense; ac_evasv - additional
circumstances - evasive response to questioning; ac_assoc - additional circumstances - associating with known criminals; ac_cgdir -
additional circumstances - change direction at sight of officer; ac_incid - additional circumstances - area has high crime incidence;
ac_time - additional circumstances - time of day fits crime incidence; ac_stsnd - additional circumstances - sights or sounds of criminal
activity; ac_rept - additional circumstances - report by victim/witness/officer; ac_inves - additional circumstances - ongoing investigation;
ac_other - additional circumstances – other; forceuse - reason for force (1 - defense of other, 2 - defense of self, 3 - overcome resistance, 4
– other, 5 - suspected flight, 6 - suspected weapon); inout - was stop inside or outside (0 – outside, 1 – inside); trhsloc - was location housing
or transit authority (0 – neither, 1 - housing authority, 2 - transit authority); premname - location of stop premise name [string]; addrnum -
location of stop address number [string]; stname - location of stop street name [string]; stinter - location of stop intersection [string]; crossst -
location of stop cross street [string]; addrpct - location of stop address precinct [string]; sector - location of stop sector 1 through 21; beat -
location of stop beat nnn; post - location of stop post nnn; xcoord - location of stop x coord nnnnnnn (1983 state plane, feet, long island);
ycoord - location of stop y coord nnnnnnn (1983 state plane, feet, long island); typeofid - stopped person's identification type (1 - photo
id, 2 - verbal id; 3 - refused to provide id); othpers - were other persons stopped, questioned or frisked ?; explnstp - did officer explain
reason for stop?; repcmd - reporting officer's command 1 through 999; revcmd - reviewing officer's command 1 through 999; offunif - was
officer in uniform?; offverb - verbal statement provided by officer (if not in uniform); officrid - id card provided by officer (if not in uniform);
offshld - shield provided by officer (if not in uniform); radio - radio run; recstat - record status (0 - original value A, 1 - original value 1); linecm
count >1 additional details. All variables with no values listed have the following values: (0 – no, 1 - yes)
INTRODUCTION TO GRAPHICS
• One of the greatest powers of R is its graphical capabilities
• The Graphics Window
• When pictures are created in R, they are presented in the active graphical window. If
no such window is open when a graphical function is executed, R will open one.
• Some features of the graphics window:
• You can print directly from the graphics window, or choose to copy the graph to the clipboard and
paste it into a word processor. There, you can also resize the graph to fit your needs.
• A graph can also be saved in many other formats, including pdf, bitmap, metafile, jpeg, or
postscript.
• Each time a new plot is produced in the graphics window, the old one is lost. In MS Windows, you
can save a “history” of your graphs by activating the Recording feature under the History menu
(seen when the graphics window is active). You can access old graphs by using the “Page Up” and
“Page Down” keys. Alternatively, you can simply open a new active graphics window (by using the
function x11() in Windows/Unix and quartz() on a Mac).
TWO BASIC GRAPHING FUNCTIONS
• There are many functions in R that produce graphs, and they range from the
very basic to the very advanced and intricate.
• The plot() Function
• The most common function used to graph anything in R is the plot()
function.
• This is a generic function that can be used for scatterplots, time-series
plots, function graphs, etc.
• If a single vector object is given to plot(), the values are plotted on the
y-axis against the row numbers or index.
• If two vector objects (of the same length) are given, a bivariate
scatterplot is produced.
• For example, consider again the dataset trees in R. To visualize the
relationship between Height and Volume, we can draw a scatterplot:
 plot(Height, Volume) # object trees is in the search path
• The first variable is plotted along the horizontal axis and the second
variable is plotted along the vertical axis. By default, the variable
names are listed along each axis.
• plot() function can allow for some pretty snazzy window dressing by
changing the function arguments from the default values. These
include adding titles/subtitles, changing the plotting character/color
(over 600 colors are available!), etc.
• See ?par for an overwhelming list of these options.
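• For instance, a sketch of a customized scatter plot of the trees data (the argument
values here are purely illustrative):
> plot(trees$Height, trees$Volume, main = "Black Cherry Trees",
       xlab = "Height (ft)", ylab = "Volume (cu ft)",
       pch = 19, col = "blue")   # filled blue circles instead of the default open ones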
CONTD...
• The curve() Function
• To graph a continuous function over a specified range of
values, the curve() function can be used (although interestingly
curve() actually calls the plot() function).
• The syntax of this function is:
curve(expr, from, to, add = FALSE, ...)
Where, expr: an expression written as a function of 'x'
from, to: the range over which the function will be plotted.
add: logical; if 'TRUE' add to already existing plot.
• Note that it is necessary that the expr argument is always
written as a function of 'x'.
• If the argument add is set to TRUE, the function graph will be
overlaid on the current graph in the graphics window
• For example, the curve() function can be used to plot the sine
function from 0 to 2π:
> curve(sin(x), from = 0, to = 2*pi)
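• To overlay a second curve on the same axes, set add = TRUE (a minimal sketch):
> curve(cos(x), add = TRUE, col = "red")   # cosine drawn over the existing sine plot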
GRAPH EMBELLISHMENTS

• In addition to standard graphics functions, there are a host of other functions that
can be used to add features to a drawn graph in the graphics window.
• These include:

Function Operation
abline() adds a straight line with specified intercept and slope
arrows() adds an arrow at a specified coordinate
lines() adds lines between coordinates
points() adds points at specified coordinates
rug() adds a “rug” representation to one axis of the plot
segments() similar to lines() above
text() adds text (possibly inside the plotting region)
title() adds main titles, subtitles, etc. with other options
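• A short sketch combining a few of these embellishments on the trees scatter plot
(the coordinates and labels are illustrative):
> plot(trees$Height, trees$Volume)
> abline(lm(Volume ~ Height, data = trees), col = "red")   # add a fitted straight line
> title(main = "Height vs. Volume", sub = "Black cherry trees")
> text(67, 60, "fitted line")   # text placed inside the plotting region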
CHANGING GRAPHICS PARAMETERS

• There is still more fine tuning available for altering the graphics settings. To make changes to
how plots appear in the graphics window itself, or to have every graphic created in the
graphics window follow a specified form, the default graphical parameters can be changed
using the par() function.
• There are over 70 graphics parameters that can be adjusted, so only a few will be mentioned
here. Some very useful ones are given below:
> par(mfrow = c(2, 2)) # gives a 2 x 2 layout of plots
> par(lend = 1) # gives "butt" line end caps for line plots
> par(bg = "cornsilk") # plots drawn with this colored background
> par(xlog = TRUE) # always plot x axis on a logarithmic scale
• Any or all parameters can be changed in a par() command, and they remain in effect until
they are changed again (or if the program is exited). You can save a copy of the original
parameter settings in par(), and then after making changes recall the original parameter
settings.
• To do this, type
> oldpar <- par(no.readonly = TRUE)
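• A sketch of the full save-change-restore cycle:
> oldpar <- par(no.readonly = TRUE)   # save the current settings
> par(mfrow = c(1, 2))                # switch to a 1 x 2 layout
> hist(trees$Height); hist(trees$Volume)
> par(oldpar)                         # restore the original settings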
UNI-VARIATE DATA
• There is a distinction between types of data in statistics and R.
• In particular, initially, data can be of three basic types:
categorical, discrete numeric and continuous numeric.
• Categorical data is data that records categories.
• Example: in a doctor's chart which records data on a patient, the gender or the
history of illnesses might be treated as categories.
• The age of a person and their weight are numeric quantities.
• Categorical data
• We often view categorical data with tables but we may also
look at the data graphically with bar graphs or pie charts.
• Using tables - Its simplest usage looks like table(x) where x is a
categorical variable. The table command simply adds up the
frequency of each unique value of the data.
• Example: - Smoking survey - A survey asks people if they smoke or not.
The data is Yes, No, No, Yes, Yes. We can enter this into R with the c()
command, and summarize with the table command as follows
> x=c("Yes","No","No","Yes","Yes")
> table(x)
x
 No Yes
  2   3
CONTD...
• Factors
• Categorical data is often used to classify data into various levels or factors.
• For example, the smoking data could be part of a broader survey on
student health issues.
• R has a special class for working with factors which is occasionally
important to know as R will automatically adapt itself when it knows it has
a factor.
• Making a factor is easy with the command factor() or as.factor().
• Notice the difference in how R treats factors with this example:
> x=c("Yes","No","No","Yes","Yes")
> x # print out values in x
[1] "Yes" "No" "No" "Yes" "Yes"
> factor(x) # print out values in factor(x)
[1] Yes No No Yes Yes
Levels: No Yes # notice levels are printed.
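• One practical consequence (a sketch): once the data are a factor, levels with no
observations are still tracked when tabulating, which is useful for survey categories
that nobody chose.
> x.f = factor(x, levels = c("No", "Yes", "Maybe"))
> table(x.f)
x.f
   No   Yes Maybe
    2     3     0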
BASIC GRAPHICS - BAR CHARTS
• A bar chart draws a bar with a height
proportional to the count in the table.
• The height could be given by the frequency, or the proportion.
• The graph will look the same, but the scales may be different.
• Suppose, a group of 25 people are surveyed as to their beer-drinking
preference. The categories were: Domestic can, Domestic bottle,
Microbrew and Imported. The raw data is
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
• Let's make a bar-plot of both frequencies and proportions. First, we use
the scan function to read in the data then we plot it
> beer = scan()
1: 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
26:
Read 25 items
> barplot(beer) # this isn't correct
> barplot(table(beer)) # Yes, call with summarized data
> barplot(table(beer)/length(beer)) # divide by n for proportion
> barplot(as.matrix(trees), main="Tree Structure", xlab="Attribute", ylab="Value")
  # barplot() needs a vector or matrix, so the trees data frame is converted first
> hh <- t(VADeaths)[, 5:1]     # from the standard barplot() help example
> mybarcol <- "gray20"
> barplot(hh, beside = TRUE,
          col = c("lightblue", "mistyrose", "lightcyan", "lavender"),
          legend = colnames(VADeaths), ylim = c(0, 100),
          main = "Death Rates in Virginia", font.main = 4,
          sub = "Faked upper 2*sigma error bars", col.sub = mybarcol, cex.names = 1.5)
BASIC GRAPHICS - PIE CHARTS
• The same data can be studied with pie charts using the pie
function.
• The usage is similar to barplot(), but with some added features.
> beer.counts = table(beer) # store the table result
> pie(beer.counts) # first pie -- kind of dull
> names(beer.counts) = c("domestic\n can","Domestic\n bottle",
"Microbrew","Import")
# give names
> pie(beer.counts) # prints out names
> pie(beer.counts,col=c("purple","green2","cyan","white")) #now with colors
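• A common refinement (a sketch) is to paste the percentages into the slice labels:
> pct = round(100 * beer.counts / sum(beer.counts))
> pie(beer.counts, labels = paste(names(beer.counts), " (", pct, "%)", sep = ""),
      col = c("purple","green2","cyan","white"))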
BASIC GRAPHICS - HISTOGRAM
• Histogram is typically used to display continuous-type data.
hist(D$wg, main="Weight Gain",
     xlab="Weight Gain", ylab="Frequency", col="blue")
Where "main" gives the plot an overall heading, and "xlab" and "ylab"
label the X and Y axes, respectively.
• ?colors will give you help on the colors.
• hist(islands)
• utils::str(hist(islands, col = "gray", labels = TRUE))
• hist(sqrt(islands), breaks = 12, col = "lightblue", border = "pink")
##-- For non-equidistant breaks, counts should NOT be graphed unscaled:
• r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140), col = "blue1")
• text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = "blue3")
• sapply(r[2:3], sum)
• require(utils) # for str
• str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks
• str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))
• hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE, main = "WRONG histogram")
• require(stats)
• set.seed(14)
• x <- rchisq(100, df = 4)
Comparing data with a model distribution should be done with qqplot()!
• qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)
## if you really insist on using hist() ...
hist(x, freq = FALSE, ylim = c(0, 0.2))
curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)
BASIC GRAPHICS – COLORS
BASIC GRAPHICS - BOXPLOTS
• To summarize the distribution of numeric variables, we can display the information
using the boxplot() function.
boxplot(trees)

• I want to draw the variable names vertically, instead of horizontally.
This can be easily done with the argument las. So now the call to
the function boxplot() becomes:
boxplot(trees, las = 2)

• If you want to add colours to your box plot, you can use the option
col and specify a vector with the colour numbers or the colour
names.
boxplot(trees, las = 2,
        col = c("red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2"))
• Now, for the finishing touches, we can put some labels on the plot.

• The common way to put labels on the axes of a plot is by using the
arguments xlab and ylab.
• boxplot(trees, ylab = "Value", xlab = "Attributes", las = 2,
        col = c("red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2",
                "red", "sienna", "palevioletred1", "royalblue2"))
• Produce box-and-whisker plot(s) of the given (grouped) values.
boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch =
FALSE, outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL,
log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
horizontal = FALSE, add = FALSE, at = NULL)
where
• x – the data from which the boxplots are to be produced.
• range – determines how far the plot whiskers extend out from the box. If range is
positive, the whiskers extend to the most extreme data point which is no more
than range times the interquartile range from the box. A value of zero causes
the whiskers to extend to the data extremes.
• width – a vector giving the relative widths of the boxes making up the plot.
• varwidth – if varwidth is TRUE, the boxes are drawn with widths proportional to the
square-roots of the number of observations in the groups.
• notch – if notch is TRUE, a notch is drawn in each side of the boxes. If the notches
of two plots do not overlap this is ‘strong evidence’ that the two medians differ.
• outline – if outline is not true, the outliers are not drawn.
• names – group labels which will be printed under each boxplot.
• boxwex – a scale factor to be applied to all boxes. When there are only a few
groups, the appearance of the plot can be improved by making the boxes
narrower.
• staplewex – staple line width expansion, proportional to box width.
• outwex – outlier line width expansion, proportional to box width.
• plot – if TRUE (the default) then a boxplot is produced. If not, the summaries which
the boxplots are based on are returned.
• border – an optional vector of colors for the outlines of the boxplots. The values in
border are recycled if the length of border is less than the number of plots.
• col – if col is non-null it is assumed to contain colors to be used to colour the bodies
of the box plots. By default they are in the background colour.
• log – character indicating if x or y or both coordinates should be plotted in log
scale.
• pars – a list of (potentially many) more graphical parameters, e.g., boxwex or
outpch; these are passed to bxp (if plot is true); for details, see there.
• horizontal – logical indicating if the boxplots should be horizontal; default FALSE
means vertical boxes.
• add – logical, if true add boxplot to current plot.
• at – numeric vector giving the locations where the boxplots should be drawn,
particularly when add = TRUE; defaults to 1:n where n is the number of boxes.
BOX-PLOTS - GROUPINGS
• What if we want several box plots side by side to be able to compare them.
• First Subset the Data into separate variables.
wg.m <- D[D$Gender=="M",]
wg.f <- D[D$Gender=="F",]
• Then Create the box plot.
boxplot(wg.m$wg,wg.f$wg)
boxplot(wg.m$wg,wg.f$wg, main='Weight Gain (lbs)', ylab='Weight Gain', names =
c('Male','Female'))
CONTD…
• Do it by shift
wg.7a <- D[D$Shift=="7am",]
wg.8a <- D[D$Shift=="8am",]
wg.9a <- D[D$Shift=="9am",]
wg.10a <- D[D$Shift=="10am",]
wg.11a <- D[D$Shift=="11am",]
wg.12p <- D[D$Shift=="12pm",]
boxplot(wg.7a$wg, wg.8a$wg, wg.9a$wg, wg.10a$wg, wg.11a$wg, wg.12p$wg,
        main='Weight Gain', ylab='Weight Gain (lbs)', xlab='Shift',
        names=c('7am','8am','9am','10am','11am','12pm'))
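• Assuming D is the same weight-gain data frame used above, the formula interface
gives the same grouped box plot without manual subsetting (a sketch):
> boxplot(wg ~ Shift, data = D, main='Weight Gain',
          ylab='Weight Gain (lbs)', xlab='Shift')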
BASIC GRAPHICS -
SCATTER PLOTS
• Suppose we have two
variables and we wish to see
the relationship between them.
A scatter plot works very well.

• Examples
plot(D$metmin, D$wg, main='Met Minutes vs. Weight Gain',
     xlab='Mets (min)', ylab='Weight Gain (lbs)')
plot(D$metmin, D$wg, main='Met Minutes vs. Weight Gain',
     xlab='Mets (min)', ylab='Weight Gain (lbs)', pch=2)
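• A smooth trend line can be laid over the scatter plot (a sketch, assuming the same
data frame D as above):
> plot(D$metmin, D$wg, main='Met Minutes vs. Weight Gain',
       xlab='Mets (min)', ylab='Weight Gain (lbs)')
> lines(lowess(D$metmin, D$wg), col='red')   # lowess smooth through the points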
PROBABILITY, DISTRIBUTIONS, AND
SIMULATION
• Distribution Functions in R
• R allows for the calculation of probabilities, the evaluation of probability
density/mass functions, percentiles, and the generation of pseudo-random variables
following a number of common distributions.
• The following table gives examples of various function names in R along with
additional arguments.
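• Commonly used distribution names and their main arguments in R include:
Normal – norm (mean, sd); Binomial – binom (size, prob); Poisson – pois (lambda);
Exponential – exp (rate); Chi-square – chisq (df); Student's t – t (df); Uniform – unif (min, max).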
CONTD...
• Prefix each R name given above with ‘d’ for the density or mass
function, ‘p’ for the CDF, ‘q’ for the percentile function (also called the
quantile), and ‘r’ for the generation of pseudorandom variables.
• The syntax has the following form – we use the wildcard rname to
denote a distribution above:
> drname(x, ...) # the pdf/pmf at x (possibly a vector)
> prname(q, ...) # the CDF at q (possibly a vector)
> qrname(p, ...) # the pth (possibly a vector) percentile/quantile
> rrname(n, ...) # simulate n observations from this distribution
• The following are examples with these R functions:
> x <- rnorm(100) # simulate 100 standard normal RVs, put in x
> w <- rexp(1000,rate=.1) # simulate 1000 from Exp(mean = 10)
> dbinom(3,size=10,prob=.25) # P(X=3) for X ~ Bin(n=10, p=.25)
> dpois(0:2, lambda=4) # P(X=0), P(X=1), P(X=2) for X ~ Poisson
> pbinom(3,size=10,prob=.25) # P(X <= 3) in the above distribution
> pnorm(12,mean=10,sd=2) # P(X <= 12) for X~N(mu = 10, sigma = 2)
> qnorm(.75,mean=10,sd=2) # 3rd quartile of N(mu = 10, sigma = 2)
> qchisq(.10,df=8) # 10th percentile of chi-square(8)
> qt(.95,df=20) # 95th percentile of t(20)
SCATTERPLOT MATRICES
• Determines if there is a linear correlation between multiple variables.
• R comes with some various pre-saved datasets for practice. First, load or open these
datasets.
data(trees)
data(ChickWeight)
• To see the actual data contained by these datasets, just write the title of the
dataset.
trees
ChickWeight
• The trees dataset seems to contain three columns of measurements: Girth, Height
and Volume.
• The ChickWeight dataset seems to involve little chicklets getting fed different diets
and being weighed at various time points.
• To find out more information about the datasets and to confirm our observations,
put a question mark before the title of the dataset.
• ?trees
• ?ChickWeight
CONTD…
• Ready for the scatterplot? pairs(trees)
• The variables are written in a diagonal line from top left to bottom right. Then each
variable is plotted against each other. The boxes on the upper right hand side of
the whole scatterplot are mirror images of the plots on the lower left hand.
• For example, the middle square in the first column is an individual scatterplot of Girth and
Height, with Girth as the X-axis and Height as the Y-axis. This same plot is replicated in the middle
of the top row.
• In this scatterplot, it is probably safe to say that there is a correlation between Girth and Volume
because the plot looks like a line. There is probably less of a correlation between Height and
Girth in addition to Height and Volume.
• More statistical analyses would be needed to confirm or deny this.
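• A quick numerical check (a sketch) supports this reading:
> round(cor(trees), 2)   # Girth and Volume correlate strongly (about 0.97); the Height correlations are weaker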
CONTD…
• Now for ChickWeight pairs(ChickWeight)
• This scatterplot matrix is unfortunately not as clean as the last plot because it contains discrete
data points for Time, Chick and Diet. However, much can still be extracted from this scatterplot
matrix (think about BS exercises you might have done for English or Art) about experimental
design and possible outcomes.
• Scatterplots related to Time are evenly distributed into columns or rows, suggesting that data
was actually collected in a regimented fashion. (As in, data was collected at the times it should
have been for all the Chick samples).
• There were about 50 chicks. The first 20 were on diet 1 and then the next three groups of 10
were given diet 2, 3 or 4.
• Looking at Row 4, Column 1, there is a possibility that chicks on diet 3 gained more weight than
chicks on diets 1, 2 or 4.
• Looking at Row 2, Column 1, it seems that chicks weighed about the same amount at the
beginning of the experiment but variation increased as time passed on. In general, there is an
increase in weight.
CONCLUSION
• Scatterplot matrices are good for determining rough linear correlations of
metadata that contain continuous variables.
• Scatterplot matrices are not so good for looking at discrete variables.
TRELLIS PLOTS
• Trellis Graphics is a family of techniques for viewing
complex, multi-variable data sets.
• In order to produce Trellis plots you must load the
“Lattice” library and start a “trellis aware” device.
> library(lattice)
> trellis.device()

• Conditioning
• Trellis plots are based on the idea of conditioning on the
values taken on by one or more of the variables in a data
set.
• In the case of a categorical variable, this means carrying out
the same plot for the data subsets corresponding to each of
the levels of that variable.
• In the case of a numeric variable, it means carrying out the
same plots for data subsets corresponding to intervals of that
variable.
EXAMPLE # 1
EARTHQUAKE LOCATIONS
• R contains a data set
called quakes which gives
the location and
magnitude of
earthquakes under the
Tonga Trench, to the
North of New Zealand.
• The spatial distribution of
earthquakes in the area is
of major interest, because
this enables us to “see”
the structure of the
earthquake faults. Here is
a plot from the Geology
department at Berkeley,
which tries to present the
spatial structure.
• Figure legend: Tonga Trench Earthquakes – Yellow: 0−70 km; Orange: 71−300 km; Red: 300−800 km
CONTD…
Problems with this Presentation
• There is a good deal of over-plotting and this makes it
hard to see all of the structure present in the data.
• The map makes it clear that we are looking down from
above on the scene, but deeper quakes appear to be
plotted on top of shallower ones.
• The division of depths into three intervals and presentation
using colour is relatively crude.

A Trellis Plot
• We can overcome many of the problems of the previous
plot by using a trellis display.
• We create the display by producing a sequence of
graphs, each of which presents a different range of
depths.
• In this case we will have a slight overlap of the intervals
being plotted.
CONTD…
• Explanation
• The plot is read left-to-right and bottom-to-top.
• Depth increases progressively through
the plot.
• There are eight different depth intervals,
each containing approximately the
same number of earthquakes.
• Consecutive depth intervals overlap by
a small amount.
• The range of depths covered by each
interval is indicated in the bar above
each plot.
• Interpretation
• The shallower earthquakes are
concentrated on two inclined fault
planes.
• The most easterly of these fault planes is
the one which bisects New Zealand.
• The Westerly fault plane has mainly
shallow earthquakes, while the Easterly
fault plane has both shallow and deep
earthquakes.
• The deep earthquakes show distinct
EXAMPLE # 2
BARLEY YIELDS
• The data in the example shows the yields obtained from field trials of
barley seed. The data comes from the 1930s where there was no direct
genetic modification. The trials were conducted in 1931 and 1932, using
10 different strains of barley and 6 different growing sites. There are
2 × 10 × 6 = 120 observations. It was suspected for a long time that
there was something odd about this data set.
The Trellis Plot CONTD…
• The plot we will look at
shows the barley yields
for each of the 10 strains
at the 6 sites and for
each year.
• The results for each site
are plotted on a
separate graph i.e. we
are working conditional
on the site.
• The yields from the two
years are superimposed
on each of the plots.
THE TRELLIS TECHNOLOGY
• There are a variety of displays which can be produced by Trellis, including:
• Bar Charts
• Dot Charts
• Box and Whisker Plots
• Histograms
• Density Traces
• QQ Plots
• Scatter Plots
• A common framework is used to produce all these plots.
• Every Trellis display consists of a series of rectangular panels, laid out in a regular row-
by-column array.
• The indexing of the array is left-to-right, bottom-to-top.
• The x axes of all the panels are identical. This is also true for the y axes.
• Each panel of the display corresponds to conditioning, either on the levels of a
categorical variable or on an interval of a numeric variable.
CONTD…
Shingles
• The conditioning carried out in the earthquake plot is described by a shingle.
• A shingle consists of a number of overlapping intervals (like the shingles on a
roof of a house).
• Assuming that the earthquake depths are contained in the variable depth,
the shingle is created as follows.
> depth = quakes$depth
> Depth = equal.count(depth, number=8, overlap=.1)
• The shingle assigned to Depth has 8 intervals with adjacent intervals having
10% of their values in common.
• A shingle contains the numerical values it was created from and can be
treated like a copy of that variable. For example:
> range(Depth)
[1] 40 680
> range(depth)
[1] 40 680
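• The intervals themselves can be inspected (a sketch):
> levels(Depth)   # prints the eight overlapping depth intervals of the shingle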
CONTD…
• A shingle also has the
information attached to it. This
can be displayed by printing
or plotting the shingle.
• > plot(Depth) Producing the
Plot
• The display of the
earthquakes is produced by
the function xyplot, which is
the Trellis variant of a scatter
plot function.
• The plot was produced as
follows:
> Depth = equal.count(quakes$depth, number = 8, overlap = .1)
> xyplot(lat ~ long | Depth, data = quakes,
         xlab = "Longtitude", ylab = "Lattitude")
• lat ~ long | Depth - is an instruction
to plot lat on the y axis against long
on the x axis with conditioning
intervals as described in Depth.
• The second argument, data = quakes, tells xyplot where to find the
variables named in the formula.
CONTD…
Unconditional Plots
• The xyplot function can be used to produce an unconditional plot by
omitting the conditioning specification from the plot formula.
> xyplot(lat ~ long, data = quakes, xlab = "Longtitude", ylab = "Lattitude")
EXAMPLE # 2
BARLEY YIELD PLOT
• The barley yield plot is produced
by the function dotplot(), which
can be used to plot numeric values
against a categorical variable.
• In this case, the numeric variable is
the barley yield and the
categorical variable is the seed
strain.
• We also condition on the value of
another variable, the growing site.
• A First Attempt - creating a dot chart
> dotplot(variety ~ yield | site, data
= barley)
• A Second Attempt - conditioning on
both site and year
> dotplot(variety ~ yield | site * year,
data = barley)
CONTD…
• A Third Attempt - What we need is to superimpose the two years
for each site on a single panel
> dotplot(variety ~ yield | site, data = barley,
          panel = panel.superpose, group = year, pch = c(1, 3))
• A Fourth Attempt - add a legend indicating the year
> dotplot(variety ~ yield | site, data = barley,
          panel = panel.superpose, group = year, pch = c(1, 3),
          key = list(space = "right", transparent = TRUE,
                     points = list(pch = c(1, 3), col = 1:2),
                     text = list(c("1932", "1931"))))
CONTD…
Choice of Colour Scheme
• The default colour scheme used by Trellis uses light colours on a
medium-gray background.
• It is a good idea to use an alternative colour scheme which uses dark
colours on a white background.
> trellis.par.set(theme = col.whitebg())
> xyplot(lat ~ long | Depth, data = quakes)
Titles and Axis Annotation CONTD…
• As with all graphics it is possible to
add a title and axis annotation.
> xyplot(lat ~ long | Depth, data = quakes,
main = "Tonga Trench Earthquakes",
xlab = "Longtitude",
ylab = "Lattitude")

Layout Control
• The layout argument should be a
vector of three values: number of
rows, number of columns and
number of pages desired for the
display.
• For example, we can rearrange the
earthquake plot as follows:
> xyplot(lat ~ long | Depth, data = quakes,
layout = c(4, 2, 1),
xlab = "Longtitude",
ylab = "Lattitude")
Aspect Ratio Control CONTD…
• The panels in the previous plot are
rather too tall relative to their
widths.
• By default, plots are sized so that
they can occupy the full surface
of the output window.

• This can be changed by specifying the aspect ratio for the plots.
> xyplot(lat ~ long | Depth, data =
quakes,
aspect = 1,
layout = c(4, 2, 1),
xlab = "Longtitude",
ylab = "Lattitude")
EXAMPLE # 3
DEATH RATES BY GENDER AND LOCATION
• In this example we’ll look at the Virginia death rate data. The data values
are death rates per 1000 of population cross-classified by age and
population group. We are interested in how death rate changes with age
and how the death rates in the different population groups compare.
DATA
MANIPULATION
• The data values are stored by R as a
matrix.
• We first have to turn the death rates into
a vector and create the cross-classifying
factors.
> rate = as.vector(VADeaths)
> age = row(VADeaths, as.factor = TRUE)
> group = col(VADeaths, as.factor = TRUE)

Dotchart 1
• We start by displaying deaths against
age, conditional on population group.
dotplot(group ~ rate | age, xlab = "Death
Rate (per 1000)", layout = c(1, 5, 1))

Dotchart 2
• The first display is hard to read because
the variation within each age group is
“noisy.”
• Here we order the population categories
differently.
• Alternatively we can interchange the
roles of the cross-classifying variables.
> dotplot(age ~ rate | group, xlab = "Death Rate (per 1000)")
CONTD…
Dotchart 3
• The second display is better than the
first, but can improve it with a
different ordering of the panels.
• We’ll arrange the panels in a 2 × 2
array.
• This will allow us to make direct
male/female and urban/rural
comparisons.
> dotplot(age ~ rate | group, xlab =
"Death Rate (per 1000)", layout =
c(2, 2, 1))

Alternative Displays
• The previous displays presented the
data in “dotchart” displays.
• There are other alternatives, barcharts
for example.
> barchart(age ~ rate | group, xlab
= "Death Rate (per 1000)", layout =
c(2, 2, 1))
EXAMPLE # 4
HEIGHTS OF SINGERS
• In this example we’ll examine the heights of the members of a large
choral society.
• The values are in a data set called singer which is in the Lattice data
library. They can be loaded with the data command once the Lattice
library is loaded. The variables are named height (inches) and voice.part.
> bwplot(voice.part ~ height,
data=singer, xlab="Height (inches)")
> qqmath(~ height | voice.part,
aspect = 1, data = singer)
EXAMPLE # 5
MEASUREMENT OF EXHAUST FROM BURNING ETHANOL

• The ethanol data frame records 88 measurements (rows)
for three variables (columns) NOx, C, and E from an
experiment in which ethanol was burned in a single
cylinder automobile test engine. NOx gives the
concentration of nitric oxide (NO) and nitrogen dioxide
(NO2) in engine exhaust, normalised by the work done by
the engine. C gives the compression ratio of the engine. E
gives the equivalence ratio at which the engine was run –
a measure of the richness of the air/ethanol mix.
CONTD…
• Exploring the Relationship
• We can get a basic idea of the form
of the relationship between the
variables using a simple conditioning
plot.
> EE = equal.count(ethanol$E, number = 9,
overlap = 1/4)
> xyplot(NOx ~ C | EE, data = ethanol)

• A More Complex Plot


• We can enhance the previous plot
by adding a smooth line through the
points in each panel. This is done
using the lowess smoother.
> xyplot(NOx ~ C | EE, data = ethanol,
xlab = "Compression Ratio",
ylab = "NOx (micrograms/J)",
panel = function(x, y) {
panel.grid(h = -1, v = 2)
panel.xyplot(x, y)
llines(lowess(x, y))
})
