Sie sind auf Seite 1von 17

District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

How to Transition from Excel to R


Introduction

In today's increasingly data-driven world, business people are


constantly talking about how they want more powerful and flexible
analytical tools, but are usually intimidated by the programming
knowledge these tools require and the learning curve they must
overcome just to be able to reproduce what they already know how
to do in the programs they've become accustomed to using. For
most business people, the go-to tool for doing anything analytical
is Microsoft Excel.

If you're an Excel user and you're scared of diving into R, you're in


luck. I'm here to slay those fears! With this post, I'll provide you
with the resources and examples you need to get up to speed doing
some of the basic things you're used to doing in Excel in R. I'm
going to spare you the countless hours I spent researching how to
do this stuff when I first started so that you feel comfortable
enough to continue using R and learning about its more
sophisticated capabilities.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 1 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Excited? Let's jump in!

Quick note before we do: There are usually multiple ways to do


everything in R. I'm going to show you the way I do it, but if you
know of a better/shorter/faster/easier way to do something, please
leave it in the comments!

The Basics
Let's start with the basics. You'll want to make sure you have
downloaded and installed R. I'm also using RStudio as my IDE, so
you should install that as well. You'll be glad you did; it's awesome.

You'll also want to install and load the ggplot2 library, which not
only contains the data set we want to use but will also come in
handy when we get to creating charts and graphs later. We will also
install and load the dplyr library to help with manipulating the
data.

install.packages("ggplot2")
install.packages("dplyr")

library(ggplot2)
library(dplyr)

We are going to use the diamonds data set that comes with
ggplot2. The data set contains prices and other attributes of over
50,000 diamonds.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 2 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

OK, so let's take an initial look at the data. You can type diamonds
into the R console and it will print out the data set in the console
screen, but I advise against doing this. If you're an Excel user,
you're used to viewing data in a tabular format. You can do that in
one line of code.

diamonds<- data.frame(diamonds)

This should have created a data frame object in RStudio's upper


right hand pane (the one called "Environment") and it should say
"53940 obs. of 10 variables" right next to it. This means the table
has 53,940 rows and 10 columns. Click on diamonds in that pane,
and RStudio will show you the table.

It should look like this:

The first 7 columns are pretty well labeled, so we won't mess with
those, but the last 3 aren't labeled very well. So let's rename
columns 8, 9, and 10. We'll call them length, width, and depth

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 3 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

respectively.

names(diamonds)[8]<-"length"
names(diamonds)[9]<-"width"
names(diamonds)[10]<-"depth"

You'll notice that now we have two columns named depth. Let's
rename the first one (column 5) to "depthperc."

names(diamonds)[5]<-"depthperc"

Now the data frame should look like this:

Calculated Columns

One of the most common things people do in Excel is perform


calculations. For example, if we wanted to multiply length, width,
and depth together; we would type =H2*I2*J2 in cell K2 in Excel
and then copy that formula all the way down to the last row of the
data set. Then you would probably name the column "cubic" or

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 4 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

something like that.

In R, you can do perform all 3 of these actions by multiplying the


columns of your data frame.

diamonds <- mutate(diamonds,


cubic=length*width*depth)

Here we are using the mutate function on the diamonds data set
to multiply length, width, and depth. It's assigning the outcome of
that to a new column called cubic.

Another way of doing this, which may be more approachable to


someone coming from Excel, is as follows.

diamonds$cubic <- diamonds$length *


diamonds$width * diamonds$depth

The data frame should now look like this:

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 5 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Challenge: Try creating another column called "total" where you


add the columns together instead of multiplying them.

Summaries

The next most common task that Excel is used for is summarizing
data. These tasks range from simply calculating column totals to
the more intermediate pivot tables. I'll show you how to do both in
R.

First, let's say that we want to summarize our data set and calculate
the overall averages for all the numeric fields (carat, depthperc,
table, price, length, width, depth, and cubic). This would be the
equivalent of going to the bottom of each column in Excel and
typing =AVERAGE(A2:A53940) and then copying that formula
over to the bottom of all the other columns you wanted to average.

colMeans(diamonds[,c(1,5:11)])

Here, we are telling R to calculate the column-wise means for


column 1 and then columns 5 through 11. This will print the
averages for those column numbers in the R console.

carat depthperc table price


length width
0.7979397 61.7494049 57.4571839
3932.7997219 5.7311572 5.7345260
depth cubic
3.5387338 129.8494033

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 6 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Let's say you wanted to add carat to the non-numeric fields and
then calculate the averages for each combination of the new group
of non-numeric fields. This would take a bit of work in Excel
(maybe even some pivot-tabling), but is pretty easy in R.

First, let's round the carat values to the nearest 0.25 carat so that
our numbers are not all over the place.

diamonds$carat2 <- round(diamonds$carat/.25)*.25

Now, let's create our summary.

Summary <- aggregate(cbind(depthperc, table,


price, length, width, depth,
cubic)~cut+color+clarity+carat2, data=diamonds,
mean)

You're basically telling R to aggregate the diamonds data frame,


take the mean of all the numeric fields, and group by the non-
numeric fields. The result is a Summary data frame that looks like
this:

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 7 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Next, we'll replicate Excel's useful pivot table functionality. Now


that we have the averages for each of our numerical fields, let's
choose one and see how it tends to change based on some of the
non-numerical attributes. Say we wanted to analyze the difference
in average prices of diamonds of different color & clarity
combinations. In Excel, you might create a pivot table with color as
a Row Label, clarity as a Column Label, and average price in the
Values section.

In R, you can use the reshape2 package to do the exact same


thing. First, install and load the package.

install.packages("reshape2")
library(reshape2)

Then, we'll use the dcast function to get our data into the same
pivot table format.

pivot_table <-
dcast(diamonds[,c('color','clarity','price')],

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 8 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

color~clarity, mean)

Here, we're taking the color, clarity, and price columns from the
diamonds data frame, casting (pivoting) them out by color (rows)
and clarity (columns), and calculating the average price for each
combination.

Challenge: Try casting/pivot-tabling using a different


combination of non-numerical fields and the averages of one of the
other numerical fields.

VLookups
Another very common thing people do in Excel are vlookups. The
scenario arises where you have two related data sets and you want
to pull some values from data set B over to their appropriate place
in data set A. So you type something like
=VLOOKUP(A2,K2:K50,2,0) and Excel looks up the value in A2
in column K and returns the value in the column next to the
matching value.

In R, we can do this using the merge function. So let's say we


wanted to calculate how far above or below a diamond's price was

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 9 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

compared to the average for their cut,color, clarity, and carat. In


this case, our data set A will be the diamonds data frame and data
set B will be the Summary data frame.

First, let's change the name of the price column in the Summary
data frame to avgprice. This way, we won't have two price fields
when we bring it over.

names(Summary)[7]<-"avgprice"

Next, let's merge the data sets and bring over the average price.

diamonds <- merge(diamonds, Summary[,c(1:4,7)],


by.x=c("cut","color","clarity","carat"),
by.y=c("cut","color","clarity","carat"))

We merged the diamonds data frame with just the columns that we
needed from the Summary data frame and the result was that it
added the avgprice field to our diamonds data frame.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 10 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Challenge: Merge the averages of some of the other numerical


fields in the Summary table over to the diamonds data frame.

Bonus Challenge: Calculate what percentage over/under priced


each diamond is compared to the average. Hint: Remember how to
do column calculations?

Conditional Statements
Excel users also periodically use conditional (IF) statements for
filling in values according to whether certain conditions are met. R
is also very good for doing this.

Let's say we wanted to categorize diamonds into size categories


such as small, medium, and large based on their carat weight.

diamonds$size[diamonds$carat < 0.5] <- "Small"


diamonds$size[diamonds$carat >=0.5 &
diamonds$carat < 1] <- "Medium"
diamonds$size[diamonds$carat >= 1] <- "Large"

Here we've set anything less than 0.5 carat to small, anything
between 0.5 and 1 carat to medium, and anything 1 carat and above
to large.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 11 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

Charts and Graphs


The last group of Excel common tasks we'll cover here are the
creation of charts and graphs. Excel has a very "drag-and-drop"
method of graph creation, whereas R has a very "type out what you
want" method. This may be a little daunting at first, but once you
get the hang of it, you'll start to find it easier to customize charts
and graphs in R than having to fish around in Excel for the right
menu option to make the change you want.

Based on my experience in a business environment, I'm going to


cover the 3 most common graphs I've seen people create in Excel -
the bar/column chart, the line chart, and the scatterplot.

Bar Charts

Taking a look at our diamonds data set, let's say we want to create
a chart that shows how many diamonds of each size
(small/medium/large) are in our data. Here's how you would do
that in R.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 12 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

barplot(table(diamonds$size), main="Diamond Size


Distribution", xlab="Size Category", ylab="Number
of Diamonds", col="blue")

This produces a bar chart that looks like this:

Line Charts

The second type of chart we're going to create is a line chart. These
are usually used when you have data that changes over some
period of time and you want to see the magnitude and velocity of
those changes. Since our diamonds data set doesn't have any time
series data in it, we'll do something a little different. We will create
https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 13 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

a line for each color and see how the number of diamonds of that
color change across clarity categories.

Here's how to do that with the ggplot function.

ggplot(diamonds, aes(clarity)) +
geom_freqpoly(aes(group = color, colour = color))
+
labs(x="Clarity", y="Number of Diamonds",
title="Clarity by Color")

And here's what it looks like.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 14 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

It looks like most diamonds fall into the middle clarity categories.
Also, pretty interesting that there are more G color diamonds in
the higher clarity categories than any other color.

Scatterplots

Now let's do a fairly simple scatter plot so you can get a sense of
how to do one in R. For this, we are going to use the ggplot
command that comes with the ggplot2 package.

ggplot(diamonds, aes(carat, price,


color=clarity)) + geom_point() +
labs(x="Carat Weight", y="Price", title="Price
by Carat Weight")

Alternatively, you can produce the same thing with the qplot
function as well.

qplot(carat, price, data=diamonds, color=clarity,


xlab="Carat Weight", ylab="Price", main="Price by
Carat Weight")

The resulting plot shows the relationship between the carat weight
and the price of the diamonds in our data set, and we've also set
the points to be different colors according to the clarity of the
diamond. The graph below shows us that the larger the diamond
and the better the clarity, the more expensive it tends to be.

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 15 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

To create other types of charts and graphs, the ggplot2 index site is
a wonderful resource that has code and visuals for different types
of graphs.

Conclusion
Well, there you have it - a guide to get almost any Excel user
started in R. Two things I want to mention before I leave you to
explore some more on your own.

First, I've found that if you want to get good at R (or anything
really), the trick is to find a reason to use it every day. It doesn't
matter if it's something small, just open up R Studio and go
https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 16 of 17
District Data Labs - How to Transition from Excel to R 09/08/2014 14:03

through some of the exercises you've learned or try to stretch your


knowledge and do one new thing in R each day. You'll find that
after a month or so you'll be much more comfortable with it and
hopefully you'll keep getting better at it and never look back!

Second, the Internet is your friend. Part of R's appeal is the


community of people that use it, write about it, and ask/answer
questions about it online. There are a vast number of websites just
a click away that can guide you in the right direction when you're
stuck on something (this blog included). All it takes is a little
searching.

Finally, if you found this post useful, go to the blog's home page
and click the Subscribe button. We'll be cranking out a lot more
great, educational data science content in the near future and I'd
hate for you to miss any of it.

May the learning be with you!

Tony Ojeda - Founder of District Data Labs

August 08, 2014

https://districtdatalabs.silvrback.com/intro-to-r-for-microsoft-excel-users Page 17 of 17

Das könnte Ihnen auch gefallen