
STATS 105 Lab #1: Intro to Lab Activities
Introduction to R and Exploratory Data Analysis
Dr. Akram Almohalwas

1. Introduction to the lab section:

The goal of the lab section is to teach you how to "do statistics" with statistical software. We want you to get your hands dirty with some real data and apply the concepts you learn in class to conduct statistical analysis on it. For this course, we will use the open-source (and FREE!) statistical software package called R. To deliver results and keep track of our work more easily, we will use a program called RStudio. For this course, you will not need to be an expert programmer in R, nor will you need to write your own code from scratch. You will, however, need to understand how to use some commands. The goal is for you to learn statistical concepts, not necessarily statistical software.

2. So what is R?
2.1. Background: R is statistical computing and analysis software that allows the user to import, explore and analyze data. It has a simple interface that is driven either by command-line code or by writing a program. In this class we will use the command line, although it may be helpful to use programs to store your code. Let me know if you want to learn how to do this.

2.2. Getting RStudio at home:
(1) Go to the RStudio website
(2) Click the "Download Rstudio" button in the middle of the screen
(3) Click "Download Rstudio Desktop"
(4) Choose your platform (probably either Mac OS X 10.5+ or Windows XP/Vista/7)
(5) Download and open...

2.3. The interface: When you open RStudio, you should see 3 separate windows:
(1) LEFT: The Console Window. This is where we enter our commands and see output. All commands are entered after the ">" prompt.
(2) UPPER RIGHT: Workspace and History. Workspace displays all of your imported datasets and generated variables. History keeps a list of all of the commands you have written.
(3) LOWER RIGHT: Files, Plots, Packages and Help. Files allows you to easily locate and import data files, or open programs. Plots displays all plots generated. Packages shows the list of packages you have installed (more on this later...). Help is...well, the help menu...

NOTE: When you open a "program" or double-click a data file from the "Files" tab, a NEW window will open on the UPPER LEFT that shows either your program, which you can edit, or a table of the data...

3. Exploratory Data Analysis

The result of the command read.table is a "data frame" (it looks like a table). In our example we give the name data to our data frame. The columns of a data frame are variables. This file contains data on percentage of body fat determined by underwater weighing and various body circumference measurements for 251 men. Here is the variable description:

Variable   Description
x1         Density determined from underwater weighing
x2         Percent body fat from Siri's (1956) equation
x3         Age (years)
x4         Weight (lbs)
x5         Height (inches)
x6         Neck circumference (cm)
x7         Chest circumference (cm)
x8         Abdomen 2 circumference (cm)
x9         Hip circumference (cm)
x10        Thigh circumference (cm)
x11        Knee circumference (cm)
x12        Ankle circumference (cm)
x13        Biceps (extended) circumference (cm)
x14        Forearm circumference (cm)
x15        Wrist circumference (cm)

You can use the Workspace tab, then its import option, to open the data file that you are planning to work on.

If the data file is saved on your computer (e.g. on your desktop), you first need to change the working directory and then read the data. Data can be read either from a saved text file or from a web URL. To read data from a website, use the following command: (example)
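As a minimal sketch of these steps, the example below writes a tiny made-up file to a temporary folder, changes the working directory to that folder, and reads the file with read.table. The file name bodyfat_demo.txt and its three rows are hypothetical stand-ins, not the lab data; the real file name and URL are whatever your instructor provides.

```r
# Hypothetical example: create a tiny whitespace-separated file with a header row.
demo_dir  <- tempdir()                                  # stand-in for your Desktop folder
demo_file <- file.path(demo_dir, "bodyfat_demo.txt")    # made-up file name
writeLines(c("x1 x2 x3",
             "1.07 12.3 23",
             "1.09 6.1 22",
             "1.04 25.3 22"), demo_file)

old_wd <- setwd(demo_dir)                               # change the working directory
data <- read.table("bodyfat_demo.txt", header = TRUE)   # header = TRUE: first row holds the names
setwd(old_wd)                                           # restore the previous working directory

dim(data)     # number of rows and columns
names(data)   # variable names taken from the header row
```

read.table also accepts a web address in place of the file name, so the same call should work for a file posted online, given a valid URL.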

Make sure to name your data "data" and make sure you click on header, because this data has a header. (R is case sensitive.) To view the imported data as a table, type:
> View(data)

Note: the expression <- is an assignment operator.
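A short sketch of how assignment works at the console; the names x and y are arbitrary and used only for illustration.

```r
x <- 5        # store the value 5 under the name x
y <- x * 2    # use x in a calculation and store the result as y
y             # typing a name on its own prints its value
```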


Once we have read the data, we can display them by simply typing the name at the command line:
> data
Or, if we want to display only the first 6 rows of the data, we type:
> head(data)

Some useful commands in RStudio:
A. Inspecting the data: Always check to make sure that your data imported correctly.
B. Displaying the dimensions of the data set:
> dim(data)
C. Displaying the first 3 rows and 5 columns of the data set:
> data[1:3, 1:5]
D. Displaying the variable names in the data set:
> names(data)
E. Attaching/detaching data frames in R:
> attach(data)
F. Displaying the first 6 rows of the data set:
> head(data)
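The handout mentions attaching and detaching but shows only attach. Here is a minimal sketch of the full cycle on a made-up data frame called demo (hypothetical values, not the lab data). The with() call at the end is an alternative that some people prefer because it leaves nothing attached.

```r
# Made-up data frame for illustration only.
demo <- data.frame(x2 = c(12.3, 6.1, 25.3),
                   x4 = c(154, 173, 184))

attach(demo)                      # columns can now be used by their bare names
m_attached <- mean(x4)            # same as mean(demo$x4)
detach(demo)                      # undo the attach when you are done

m_with <- with(demo, mean(x4))    # same result without attaching anything
```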

Extracting one variable from the data frame (e.g. the second variable):
> data[,2]
Another way to extract a variable:
> data$x2
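To see that both forms return the same values, here is a small sketch on a made-up data frame (demo and its numbers are hypothetical):

```r
# Made-up data frame for illustration only (not the lab data).
demo <- data.frame(x1 = c(1.07, 1.09, 1.04),
                   x2 = c(12.3, 6.1, 25.3),
                   x3 = c(23, 22, 22))

by_position <- demo[, 2]          # second column, selected by position
by_name     <- demo$x2            # same column, selected by name
identical(by_position, by_name)   # checks that the two results match
```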

Similarly, if we want to access a particular row in our data (e.g. the first row):
> data[1,]
To list all the data, simply type: (Do not include this output in your lab report)
> data
To compute the mean of all the variables in the data set:
> colMeans(data)
To compute the mean of just one variable:
> mean(data$x2)
To compute the mean of variables 2 and 3:
> colMeans(data[,c(2,3)])
To compute the variance of one variable:
> var(data$x2)
To compute summary statistics for all the variables:
> summary(data)
To compute summary statistics for the second variable:
> summary(data[,2])
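In current versions of R, mean() expects a single numeric vector, so per-column means of a whole data frame are usually obtained with colMeans() or sapply(). A minimal sketch on a made-up data frame (demo and its values are hypothetical):

```r
# Made-up data frame for illustration only.
demo <- data.frame(x2 = c(12.3, 6.1, 25.3),
                   x3 = c(23, 22, 22),
                   x4 = c(154, 173, 184))

all_means  <- colMeans(demo)              # one mean per column
some_means <- colMeans(demo[, c(2, 3)])   # means of the 2nd and 3rd columns only
all_vars   <- sapply(demo, var)           # apply var() to each column in turn

all_means
all_vars
```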

To construct a stem-and-leaf plot, boxplot, and histogram:
> stem(data$x2)
> boxplot(data$x2)
> hist(data$x2)
> hist(data$x2, main="Percent of Body Fat", xlab="Body Fat")

To plot variable x2 against variable x10:
> plot(data$x2,data$x10)
And you can give names to the axes and to your plot:
> plot(data$x2,data$x10, main="Scatterplot of percent body fat against thigh circumference", xlab="Percent body fat", ylab="Thigh circumference")

Displaying quantitative data (attach the data first if you have not already done so):
> attach(data)
> mean(x4)
> sd(x4)
> median(x4)
> IQR(x4)
> summary(x4)
> hist(x4, main="Histogram of Weight (in lbs)", breaks=10)
(10 is the suggested number of classes in your histogram; change it and see what happens.)
> stem(x4)
> boxplot(x4, main="Boxplot of Weight (lbs)")

Defining a vector: Let's enter 20 data points and call the vector "a":
> a <- c(2,3,4,6,8,9,10,4,7,9,8,6,7,7,8,9,5,7,9,10)

We can get the following statistical summaries and plots for the vector "a":
> mean(a)
> hist(a)
> summary(a)
> boxplot(a)
> stem(a)
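The sketch below computes a few of these summaries for "a" and stores them, so you can check the numbers against a hand calculation. It also calls hist() with plot = FALSE, which returns the class boundaries and counts without drawing anything; this is one way to see how R interpreted a suggested number of classes. The variable names a_mean, a_median and h are illustrative only.

```r
a <- c(2,3,4,6,8,9,10,4,7,9,8,6,7,7,8,9,5,7,9,10)

a_mean   <- mean(a)     # sum of the values divided by 20
a_median <- median(a)   # middle of the sorted values
a_sd     <- sd(a)       # sample standard deviation

# Compute the histogram classes without plotting.
h <- hist(a, breaks = 10, plot = FALSE)
h$breaks                # class boundaries R actually chose
h$counts                # number of observations in each class
```

Because breaks = 10 is only a suggestion, the number of classes in h$counts may differ from 10.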

The last part is supposed to help you solve questions Q23, Q43 and Q68 on your homework #1 instead of doing them by hand. (I've typed the data for you.)
> Q23 <- c(2130, 1102, 1890, 1820, 1330, 1115, 1310, 1540, 1502, 1258, 1315, 1085, 798, 1020, 865, 1421, 1109, 1481, 1567, 1883, 1203, 1270, 1015, 845, 1674, 1016, 1605, 706, 2215, 785, 885, 1223, 375, 2265, 1910, 1018, 1452, 2100, 1594, 2023, 1315, 1269, 1260, 1888, 1782, 1522, 1792, 1000, 1940, 1120, 910, 1730, 1102, 1578, 758, 1416, 1560, 1055, 1764, 1608, 1535, 1781, 1750, 1501, 1238, 990, 1468, 1512, 1750, 1642)

> Q68<- c(17.0, 16.7, 17.1, 17.5, 17.6, 16.6, 17.4, 17.4, 18.1, 17.5, 16.3, 17.2, 17.4, 17.5, 16.5, 16.1, 17.4, 17.5, 17.4, 17.8, 17.1, 17.4, 17.4, 17.4, 17.3, 16.9, 17.0, 17.6, 17.1, 17.3, 16.8, 17.3, 17.4, 17.6, 17.1, 17.4, 17.2, 17.3, 17.7, 17.4, 17.1, 17.4, 17.0, 17.4, 16.9, 17.0, 16.8, 17.8, 17.8, 17.3)

To construct a normality plot using R, please use the following code:
> qqnorm(Q68)
> qqline(Q68)
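qqnorm() can also return the plotted coordinates instead of drawing them, which may help when you want to look at the numbers behind the plot. The sketch below uses only the first ten Q68 values to stay short; the names z and qq and the correlation check are illustrative additions, not part of the assignment.

```r
# First ten values of Q68, used here only to keep the example short.
z <- c(17.0, 16.7, 17.1, 17.5, 17.6, 16.6, 17.4, 17.4, 18.1, 17.5)

qq <- qqnorm(z, plot.it = FALSE)   # compute the coordinates without drawing
qq$x                               # theoretical normal quantiles, one per observation
qq$y                               # the observed values
r <- cor(qq$x, qq$y)               # close to 1 when the points lie near a straight line
r
```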

All lab data files will be posted on CCLE; you just need to download the files and follow the instructions for each lab activity. The lab write-up must have concise, yet coherent answers for each question and include all relevant plots/tables. The lab write-up must be typed in a word processing program of your choice (lab computers have Pages, NeoOffice, or Word). No late lab work is accepted, and lab work can NOT be emailed.