
Lecture 12: Debugging and Databases

STAT GR5206 Statistical Computing & Introduction to Data Science

Cynthia Rush
Columbia University

December 9, 2016

Cynthia Rush Lecture 12: Debugging and Databases December 9, 2016 1 / 107
Course Notes

Final is next Friday, December 16, 1:10pm - 4:00pm in this room.


Homework is due on Monday.

Last Time

Split/Apply/Combine: A model for working with data.


plyr Package: Similar to the apply() family, but more consistent.
PCA and K-Means Clustering

Topics for Today

Databases: What are databases? Intro to SQL and interfacing R
with SQL.
Debugging: Simple techniques for correcting buggy code.
Clustering: Clustering with the K-means algorithm.

Section I

Databases: SQL and Querying

Databases

A record is a collection of fields (records are like rows and fields are like columns).


A table is a collection of records which all have the same fields with
different values. These are like dataframes in R.
A database is a collection of tables.

Databases vs. Dataframes

R's dataframes are actually tables

R Jargon                      Database Jargon
column                        field
row                           record
dataframe                     table
types of the columns          table schema
bunch of related dataframes   database

Databases

So, Why Do We Need Database Software?


Size
R keeps its dataframes in memory
Industrial databases can be much bigger
Work with selected subsets
Speed
Clever people have worked very hard on getting just what you want fast
Concurrency
Many users accessing the same database simultaneously
Lots of potential for trouble (two users want to change the same record
at once)

Databases

So, Why Do We Need Database Software?


Databases live on a server, which manages them
Users interact with the server through a client program
Lets multiple users access the same database simultaneously
SQL (structured query language) is the standard for database
software
Mostly about queries, which are like doing row/column selections on
a dataframe in R

SQL

Connecting R to SQL
SQL is its own language, independent of R (similar to regular
expressions). But we're going to learn how to run SQL queries
through R.
First, install the packages DBI, RSQLite.
Also, we need a database file: download the file baseball.db and
save it in your working directory.
> library(DBI)
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv, dbname="baseball.db")
The object con is now a persistent connection to the database
baseball.db.
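
If baseball.db is not at hand, the same DBI calls can be tried on a throwaway in-memory database. The sketch below assumes the DBI and RSQLite packages are installed; the connection name con.toy, the table name Toy, and its contents are made up for illustration.

```r
library(DBI)
library(RSQLite)

# ":memory:" asks SQLite for a temporary database that lives only in RAM
con.toy <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# A small, made-up data frame (hypothetical; not taken from baseball.db)
toy <- data.frame(playerID = c("a01", "a01", "b02"),
                  yearID   = c(2001, 2002, 2001),
                  HR       = c(10, 20, 5))

dbWriteTable(con.toy, "Toy", toy)            # copy the data frame into a table
toy.tables <- dbListTables(con.toy)          # names of the tables in the database
toy.fields <- dbListFields(con.toy, "Toy")   # names of the fields in Toy
toy.back   <- dbReadTable(con.toy, "Toy")    # read the table back as a data frame

dbDisconnect(con.toy)                        # close the connection when finished
```

Closing the connection with dbDisconnect() is a good habit for file-backed databases as well.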

SQL

Listing What's Available


> dbListTables(con) # List tables in our database
[1] "AllstarFull" "Appearances"
[3] "AwardsManagers" "AwardsPlayers"
[5] "AwardsShareManagers" "AwardsSharePlayers"
[7] "Batting" "BattingPost"
[9] "Fielding" "FieldingOF"
[11] "FieldingPost" "HallOfFame"
[13] "Managers" "ManagersHalf"
[15] "Master" "Pitching"
[17] "PitchingPost" "Salaries"
[19] "Schools" "SchoolsPlayers"
[21] "SeriesPost" "Teams"
[23] "TeamsFranchises" "TeamsHalf"
[25] "sqlite_sequence" "xref_stats"
SQL
Listing What's Available
> dbListFields(con, "Batting") # Fields in Batting table
[1] "playerID" "yearID" "stint" "teamID"
[5] "lgID" "G" "G_batting" "AB"
[9] "R" "H" "2B" "3B"
[13] "HR" "RBI" "SB" "CS"
[17] "BB" "SO" "IBB" "HBP"
[21] "SH" "SF" "GIDP" "G_old"
> dbListFields(con, "Pitching") # Fields in Pitching table
[1] "playerID" "yearID" "stint" "teamID" "lgID"
[6] "W" "L" "G" "GS" "CG"
[11] "SHO" "SV" "IPouts" "H" "ER"
[16] "HR" "BB" "SO" "BAOpp" "ERA"
[21] "IBB" "WP" "HBP" "BK" "BFP"
[26] "GF" "R"
SQL

Importing a Table as a Data Frame


> batting <- dbReadTable(con, "Batting")
> class(batting)
[1] "data.frame"
> dim(batting)
[1] 93955 24

Now we can perform R operations on batting, since it's a data frame.

In lecture today, we'll use this route primarily to check our work in
SQL; in general, we want to do as much in SQL as possible, since it's
more efficient and likely simpler.

Check Yourself

Tasks
Using dbReadTable(), grab the table named Salaries and save it as a
data frame called salaries. Using the salaries data frame and ddply(),
compute the payroll (total of salaries) for each team in the year 2010. Find
the 3 teams with the highest payrolls, and the team with the lowest payroll.

Check Yourself

Solutions
> library(plyr)
> salaries <- dbReadTable(con, "Salaries")
> my.sum.func <- function(team.yr.df) {
+   return(sum(team.yr.df$salary))
+ }
> payroll <- ddply(salaries, .(yearID, teamID), my.sum.func)

Check Yourself

Solutions
> payroll <- payroll[payroll$yearID == 2010, ]
> payroll <- payroll[order(payroll$V1, decreasing = TRUE), ]
> payroll[1:3, ]
yearID teamID V1
733 2010 NYA 206333389
719 2010 BOS 162447333
721 2010 CHN 146609000
> payroll[nrow(payroll), ]
yearID teamID V1
737 2010 PIT 34943000

SQL

SELECT
Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:

SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional
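
The two-number form of LIMIT is easy to misread, so here is a minimal sketch on a made-up table, assuming the DBI and RSQLite packages are installed. The table Toy and its columns are hypothetical. The first query uses no optional clauses; the second sorts the rows and then uses LIMIT 2, 3 to skip the first 2 sorted rows and return the next 3.

```r
library(DBI)
library(RSQLite)

con.toy <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con.toy, "Toy",
             data.frame(id = 1:6, score = c(50, 10, 40, 60, 20, 30)))

# No optional clauses: every row of the chosen columns
all.rows <- dbGetQuery(con.toy, "SELECT id, score FROM Toy")

# ORDER BY sorts first; LIMIT offset, count then skips 2 rows and keeps 3
middle <- dbGetQuery(con.toy, paste("SELECT id, score",
                                    "FROM Toy",
                                    "ORDER BY score DESC",
                                    "LIMIT 2, 3"))
dbDisconnect(con.toy)
```

Here middle holds the rows with scores 40, 30, and 20, that is, the third through fifth highest.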

Examples

Pick out five columns from the table Batting, and look at the first 10 rows:

> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "LIMIT 10"))

playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aaronha01 1954 468 131 13
7 aaronha01 1955 602 189 27
8 aaronha01 1956 609 200 26
9 aaronha01 1957 615 198 44
10 aaronha01 1958 601 196 30

Examples

This is our first successful SQL query (congrats!)


We can replicate the command on the imported data frame:

> batting[1:10, c("playerID", "yearID", "AB", "H", "HR")]


playerID yearID AB H HR
1 aardsda01 2004 0 0 0
2 aardsda01 2006 2 0 0
3 aardsda01 2007 0 0 0
4 aardsda01 2008 1 0 0
5 aardsda01 2009 0 0 0
6 aaronha01 1954 468 131 13
7 aaronha01 1955 602 189 27
8 aaronha01 1956 609 200 26
9 aaronha01 1957 615 198 44
10 aaronha01 1958 601 196 30

Examples

To reiterate: the previous call was simply to check our work. We
wouldn't actually want to do this on a large database, since it'd be much
less efficient to first read the table into an R data frame and then call R
commands.

SQL

ORDER BY
We can use the ORDER BY option in SELECT to specify an ordering for the
rows
Default is ascending order; add DESC for descending

> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "ORDER BY HR DESC",
+ "LIMIT 5"))

playerID yearID AB H HR
1 bondsba01 2001 476 156 73
2 mcgwima01 1998 509 152 70
3 sosasa01 1998 643 198 66
4 mcgwima01 1999 521 145 65
5 sosasa01 2001 577 189 64

Check Yourself

Tasks
Run the following queries and determine what they're doing. Write R code to do
the same thing on the batting data frame.
> dbGetQuery(con, paste("SELECT playerID, yearID, AB, H, HR",
+ "FROM Batting",
+ "WHERE yearID >= 1990
+ AND yearID <= 2000",
+ "ORDER BY HR DESC",
+ "LIMIT 5"))
> dbGetQuery(con, paste("SELECT playerID, yearID, MAX(HR)",
+ "FROM Batting"))

Check Yourself

Solutions
> bat.ord <- batting[order(batting$HR, decreasing = TRUE), ]
> subset <- bat.ord$yearID >= 1990 & bat.ord$yearID <= 2000
> columns <- c("playerID", "yearID", "AB", "H", "HR")
> head(bat.ord[subset, columns], 5)

playerID yearID AB H HR
54613 mcgwima01 1998 509 152 70
78578 sosasa01 1998 643 198 66
54614 mcgwima01 1999 521 145 65
78579 sosasa01 1999 625 180 63
31877 griffke02 1997 608 185 56

> batting[which.max(batting$HR), c("playerID","yearID","HR")]

playerID yearID HR
7514 bondsba01 2001 73

Section II

Databases: SQL Computations

Databases vs. Dataframes

R's dataframes are actually tables

R Jargon                           Database Jargon
column                             field
row                                record
dataframe                          table
types of the columns               table schema
collection of related dataframes   database
conditional indexing               SELECT, FROM, WHERE, HAVING
d*ply()                            GROUP BY
order()                            ORDER BY
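
The last three correspondences can be tried on a small data frame. The sketch below uses base R's aggregate() as a stand-in for d*ply() (an assumption made so the example needs no extra packages), and the data are made up.

```r
# Made-up data frame standing in for a database table
toy <- data.frame(team = c("A", "A", "B", "B", "C"),
                  year = c(2000, 2001, 2000, 2001, 2001),
                  HR   = c(10, 30, 5, 15, 40),
                  stringsAsFactors = FALSE)

# WHERE year >= 2001            <->  conditional indexing on the rows
recent <- toy[toy$year >= 2001, ]

# GROUP BY team, with AVG(HR)   <->  split/apply/combine
avg.hr <- aggregate(HR ~ team, data = toy, FUN = mean)

# ORDER BY HR DESC              <->  order()
avg.hr <- avg.hr[order(avg.hr$HR, decreasing = TRUE), ]
```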

SQL

SELECT
Main tool in the SQL language: SELECT, which allows you to perform
queries on a particular table in a database. It has the form:

SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset, count;

WHERE, GROUP BY, HAVING, ORDER BY, LIMIT are all optional.
Importantly, in the first line of SELECT we can directly specify
computations that we want performed.

Examples

To calculate the average number of homeruns, and average number of hits:


> dbGetQuery(con, paste("SELECT AVG(HR), AVG(H)",
+ "FROM Batting"))
AVG(HR) AVG(H)
1 2.970549 40.67684

We can replicate this simple command on an imported data frame:


> mean(batting$HR, na.rm = TRUE)
[1] 2.970549
> mean(batting$H, na.rm = TRUE)
[1] 40.67684

GROUP BY

We can use the GROUP BY option in SELECT to define aggregation groups


> dbGetQuery(con, paste("SELECT playerID, AVG(HR)",
+ "FROM Batting",
+ "GROUP BY playerID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
playerID AVG(HR)
1 pujolal01 40.80000
2 howarry01 36.14286
3 rodrial01 36.05882
4 bondsba01 34.63636
5 mcgwima01 34.29412

Note: the order of commands here matters; try switching the order of
GROUP BY and ORDER BY above, and you'll get an error.
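
To see the ordering rule without touching baseball.db, here is a minimal sketch on a made-up table, assuming the DBI and RSQLite packages are installed. The table Toy is hypothetical, and tryCatch() is used only to capture the error message instead of halting.

```r
library(DBI)
library(RSQLite)

con.toy <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con.toy, "Toy",
             data.frame(playerID = c("a", "a", "b"), HR = c(10, 30, 5),
                        stringsAsFactors = FALSE))

# Correct order: GROUP BY comes before ORDER BY
ok <- dbGetQuery(con.toy, paste("SELECT playerID, AVG(HR)",
                                "FROM Toy",
                                "GROUP BY playerID",
                                "ORDER BY AVG(HR) DESC"))

# Swapped order: the query is rejected; keep the message as a string
swapped <- tryCatch(
  dbGetQuery(con.toy, paste("SELECT playerID, AVG(HR)",
                            "FROM Toy",
                            "ORDER BY AVG(HR) DESC",
                            "GROUP BY playerID")),
  error = function(e) conditionMessage(e))

dbDisconnect(con.toy)
```

After running this, ok is a data frame with one row per player, while swapped is a character string holding the error message.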
WHERE

We can use the WHERE option in SELECT to specify a subset of the rows to
use (pre-aggregation/pre-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))
yearID AVG(HR)
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437
4 2004 4.490115
5 2001 4.412288

Check Yourself

Tasks
Run the following query and determine what it's doing. Write R code
to do the same thing on the batting data frame. Hint: use daply().
> dbGetQuery(con, paste("SELECT teamID, AVG(HR)",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY teamID",
+ "ORDER BY AVG(HR) DESC",
+ "LIMIT 5"))

Check Yourself

Solutions
> bat.sub <- batting[batting$yearID >= 1990, ]
> my.mean.func <- function(team.df) {
+   return(mean(team.df$HR, na.rm = TRUE))
+ }
> avg.hrs <- daply(bat.sub, .(teamID), my.mean.func)
> avg.hrs <- sort(avg.hrs, decreasing = TRUE)
> head(avg.hrs, 5)
CHA NYA TOR CAL TEX
6.164251 5.986486 5.760937 5.625731 5.563961

AS

We can use AS in the first line of SELECT to rename computed columns


> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+ "FROM Batting",
+ "GROUP BY yearID",
+ "ORDER BY avgHR DESC",
+ "LIMIT 5"))
yearID avgHR
1 1987 5.300832
2 1996 5.073620
3 1986 4.730769
4 1999 4.692699
5 1977 4.601010

HAVING

We can use the HAVING option in SELECT to specify a subset of the rows
to display (post-aggregation/post-calculation)
> dbGetQuery(con, paste("SELECT yearID, AVG(HR) as avgHR",
+ "FROM Batting",
+ "WHERE yearID >= 1990",
+ "GROUP BY yearID",
+ "HAVING avgHR >= 4.5",
+ "ORDER BY avgHR DESC"))
yearID avgHR
1 1996 5.073620
2 1999 4.692699
3 2000 4.525437

Check Yourself

Tasks
Recompute the payroll for each team in 2010, but now with
dbGetQuery() and an appropriate SQL query. In particular, the output of
dbGetQuery() should be a data frame with two columns, the first giving
the team names, and the second the payrolls, just like your output from
ddply() before. (Hint: your SQL query here will have to use GROUP BY.)

Check Yourself

Solutions
> dbGetQuery(con, paste("SELECT teamID, SUM(salary) as SUMsal",
+ "FROM Salaries",
+ "WHERE yearID = 2010",
+ "GROUP BY teamID",
+ "ORDER BY SUMsal DESC",
+ "LIMIT 3"))

teamID SUMsal
1 NYA 206333389
2 BOS 162447333
3 CHN 146609000

Section III

Databases: Join

Databases vs. Dataframes

R's dataframes are actually tables

R Jargon                           Database Jargon
column                             field
row                                record
dataframe                          table
types of the columns               table schema
collection of related dataframes   database
conditional indexing               SELECT, FROM, WHERE, HAVING
d*ply()                            GROUP BY
order()                            ORDER BY
merge()                            INNER JOIN or just JOIN

JOIN

Sometimes we need to combine information from many tables.

patient_last   patient_first   physician_id   complaint
Morgan         Dexter          37010          insomnia
Soprano        Anthony         79676          malaise
Swearengen     Albert          NA             healthy
Garrett        Alma            90091          nerves
Holmes         Sherlock        43675          addiction

physician_last   physician_first   physicianID   plan
Meridian         Emmett            37010         UPMC
Melfi            Jennifer          79676         BCBS
Cochran          Amos              90091         UPMC
Watson           John              43675         VA

JOIN

Suppose we want to know which doctors are treating patients for
insomnia.
Complaints are in one table and physicians in another.
In R, we use merge() to link the tables by physicianID.
Here physicianID or physician_id is acting as the key or the
identifier.

JOIN

In all we've seen so far with SELECT, the FROM line has just specified one
table. But sometimes we need to combine information from many tables.
Use the JOIN option for this.
There are 4 options for JOIN:
1. INNER JOIN or just JOIN: retain just the rows of each table that match
the condition.
2. LEFT OUTER JOIN or just LEFT JOIN: retain all rows in the first
table, and just the rows in the second table that match the condition.
3. RIGHT OUTER JOIN or just RIGHT JOIN: retain just the rows in the
first table that match the condition, and all rows in the second table.
4. FULL OUTER JOIN or just FULL JOIN: retain all rows in both tables.
Fields that cannot be filled in are assigned NA values.
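
The join types can be tried in R with merge() on the patient and physician tables from the earlier slide. This is a minimal sketch with the first-name columns left out for brevity; the all.x, all.y, and all arguments of merge() play the roles of LEFT, RIGHT, and FULL joins.

```r
patients <- data.frame(
  patient_last = c("Morgan", "Soprano", "Swearengen", "Garrett", "Holmes"),
  physician_id = c(37010, 79676, NA, 90091, 43675),
  complaint    = c("insomnia", "malaise", "healthy", "nerves", "addiction"),
  stringsAsFactors = FALSE)
physicians <- data.frame(
  physician_last = c("Meridian", "Melfi", "Cochran", "Watson"),
  physicianID    = c(37010, 79676, 90091, 43675),
  plan           = c("UPMC", "BCBS", "UPMC", "VA"),
  stringsAsFactors = FALSE)

# INNER JOIN: only rows whose keys match in both tables
inner <- merge(patients, physicians,
               by.x = "physician_id", by.y = "physicianID")

# LEFT JOIN: every patient, with NA physician fields where nothing matches
left <- merge(patients, physicians,
              by.x = "physician_id", by.y = "physicianID", all.x = TRUE)

# RIGHT JOIN would use all.y = TRUE; FULL JOIN uses all = TRUE
full <- merge(patients, physicians,
              by.x = "physician_id", by.y = "physicianID", all = TRUE)

# Which doctor is treating the patient with insomnia?
insomnia.doc <- inner$physician_last[inner$complaint == "insomnia"]
```

The inner join drops Swearengen, who has no physician, while the left join keeps him with NA in the physician columns. One difference to keep in mind: merge() treats NA keys as matching each other by default, whereas SQL NULLs never match.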

[Figures: diagrams of INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN]
Examples
Suppose we want to find the average salaries of the players with the top 10
highest homerun averages. We need to combine the two tables.
> dbGetQuery(con, paste("SELECT *",
+ "FROM Salaries",
+ "ORDER BY playerID",
+ "LIMIT 8"))
yearID teamID lgID playerID salary
1 2004 SFN NL aardsda01 300000
2 2007 CHA AL aardsda01 387500
3 2008 BOS AL aardsda01 403250
4 2009 SEA AL aardsda01 419000
5 2010 SEA AL aardsda01 2750000
6 1986 BAL AL aasedo01 600000
7 1987 BAL AL aasedo01 625000
8 1988 BAL AL aasedo01 675000
Examples

> query <- paste("SELECT yearID, teamID, lgID, playerID, HR",
+ "FROM Batting",
+ "ORDER BY playerID",
+ "LIMIT 7")
> dbGetQuery(con, query)
yearID teamID lgID playerID HR
1 2004 SFN NL aardsda01 0
2 2006 CHN NL aardsda01 0
3 2007 CHA AL aardsda01 0
4 2008 BOS AL aardsda01 0
5 2009 SEA AL aardsda01 0
6 2010 SEA AL aardsda01 0
7 1954 ML1 NL aaronha01 13

Examples
We can use a JOIN on the pair: yearID, playerID.
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "ORDER BY playerID",
+ "LIMIT 7"))

yearID playerID salary HR
1 2004 aardsda01 300000 0
2 2007 aardsda01 387500 0
3 2008 aardsda01 403250 0
4 2009 aardsda01 419000 0
5 2010 aardsda01 2750000 0
6 1986 aasedo01 600000 NA
7 1987 aasedo01 625000 NA

Note that here we're missing one of David Aardsma's records from the Batting
table (i.e., the JOIN discarded 1 record).

Examples

We can replicate this using merge() on imported data frames:


> merged <- merge(x = batting, y = salaries,
+ by.x = c("yearID","playerID"),
+ by.y = c("yearID","playerID"))
> merged[order(merged$playerID)[1:8],
+ c("yearID", "playerID", "salary", "HR")]
yearID playerID salary HR
16708 2004 aardsda01 300000 0
19378 2007 aardsda01 387500 0
20277 2008 aardsda01 403250 0
21164 2009 aardsda01 419000 0
21990 2010 aardsda01 2750000 0
585 1986 aasedo01 600000 NA
1360 1987 aasedo01 625000 NA
2033 1988 aasedo01 675000 NA
Examples

For demonstration purposes, we can use a LEFT JOIN on the pair: yearID,
playerID:
> dbGetQuery(con, paste("SELECT yearID, playerID, salary, HR",
+ "FROM Batting LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "ORDER BY playerID",
+ "LIMIT 7"))

yearID playerID salary HR
1 2004 aardsda01 300000 0
2 2006 aardsda01 NA 0
3 2007 aardsda01 387500 0
4 2008 aardsda01 403250 0
5 2009 aardsda01 419000 0
6 2010 aardsda01 2750000 0
7 1954 aaronha01 NA 13

Examples

Now we can see that we have all 6 of David Aardsma's original
records from the Batting table (i.e., the LEFT JOIN used them all,
and just filled in an NA value when it was missing his salary).
As of this lecture (December 2016), RIGHT JOIN and FULL JOIN are not
implemented in the RSQLite package; newer SQLite releases (3.39.0 and
later) do support them.

Examples
Now, as to our original question (average salaries of the players with the top 10
highest homerun averages):
> dbGetQuery(con, paste("SELECT playerID, AVG(HR), AVG(salary)",
+ "FROM Batting JOIN Salaries
+ USING(yearID, playerID)",
+ "GROUP BY playerID",
+ "ORDER BY Avg(HR) DESC",
+ "LIMIT 10"))

playerID AVG(HR) AVG(salary)
1 howarry01 45.80000 9051000.0
2 pujolal01 40.80000 8953204.1
3 fieldpr01 38.00000 3882900.0
4 rodrial01 36.05882 15553897.2
5 reynoma01 34.66667 550777.7
6 bondsba01 34.63636 8556605.5
7 mcgwima01 34.29412 4814020.8
8 gonzaca01 34.00000 406000.0
9 dunnad01 33.50000 6969500.0
Check Yourself

Tasks
Using the Fielding table, list the 10 worst (highest) numbers of errors
(E) committed by a player in one season, only considering years 1990
and later. In addition to the number of errors, list the year and player
ID for each record.
By appropriately merging the Fielding and Salaries tables, list the
salaries for each record that you extracted in the last question.

Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT yearID, playerID, E",
+ "FROM Fielding",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))

yearID playerID E
1 1992 offerjo01 42
2 1993 offerjo01 37
3 1996 valenjo03 37
4 2000 valenjo03 36
5 1998 carusmi01 35
6 1995 offerjo01 35
7 2008 reynoma01 34
8 2010 desmoia01 34
9 1993 cordewi01 33
10 2000 glaustr01 33

Check Yourself
Solutions
> dbGetQuery(con, paste("SELECT yearID, playerID, E, salary",
+ "FROM Fielding LEFT JOIN Salaries
+ USING(yearID, playerID)",
+ "WHERE yearID >= 1990",
+ "ORDER BY E DESC",
+ "LIMIT 10"))

yearID playerID E salary
1 1992 offerjo01 42 135000
2 1993 offerjo01 37 300000
3 1996 valenjo03 37 300000
4 2000 valenjo03 36 1320000
5 1998 carusmi01 35 170000
6 1995 offerjo01 35 1600000
7 2008 reynoma01 34 396500
8 2010 desmoia01 34 400000
9 1993 cordewi01 33 126500
10 2000 glaustr01 33 275000
Section IV

Debugging

Debugging

Bug is the original name for glitches and unexpected defects in code:
the term dates back to at least Edison in 1876.
Debugging is the process of locating, understanding, and removing
bugs from your code.
Why should we care to learn about this?
The truth: you're going to have to debug, because you're not perfect
(none of us are!) and so you can't write perfect code.
Debugging is frustrating and time-consuming, but essential.
Writing code that makes it easier to debug later is worth it, even if it
takes a bit more time (lots of our design ideas support this).
Simple things you can do to help: use lots of comments, use
meaningful variable names!

Debugging

How?
Debugging is (largely) a process of differential diagnosis. Stages of
debugging:
1. Reproduce the error: can you make the bug reappear?
2. Characterize the error: what can you see that is going wrong?
3. Localize the error: where in the code does the mistake originate?
4. Modify the code: did you eliminate the error? Did you add new ones?

Debugging

Reproduce the bug


Step 0: make it happen again
Can we produce it repeatedly when re-running the same code, with
the same input values?
And if we run the same code in a clean copy of R, does the same
thing happen?

Characterize the bug


Step 1: figure out if it's a pervasive/big problem
How much can we change the inputs and get the same error?
Or is it a different error?
And how big is the error?

Debugging

Localize the bug


Step 2: find out exactly where things are going wrong
This is most often the hardest part!
Today, we'll learn how to understand errors, using print().
There are many more sophisticated debugging tools like traceback()
or the R tool browser() which lets you interactively debug.
Unfortunately, we don't have time for these.

Localizing the Bug

Sometimes error messages are easier to decode, sometimes they're harder;
this can make locating the bug easier or harder.
> my.plotter = function(x, y, my.list=NULL) {
+   if (!is.null(my.list))
+     plot(my.list, main="A plot from my.list!")
+   else
+     plot(x, y, main="A plot from x, y!")
+ }
> my.plotter(x=1:8, y=1:8)
> my.plotter(my.list=list(x=-10:10, y=(-10:10)^3))

Localizing the Bug

> my.plotter() # Easy to understand error message
Error in plot(x, y, main = "A plot from x, y!") :
  argument "x" is missing, with no default
> my.plotter(my.list=list(x=-10:10, Y=(-10:10)^3))
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' is a list, but does not have components 'x' and 'y'

Who called xy.coords()? (Not us, at least not explicitly!) And why is it
saying x is a list? (We never set it to be so!)
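
Both questions come down to how plot() handles its first argument: plot(my.list) passes my.list along as x, and the default plot method hands it to xy.coords() to extract the coordinates. A minimal sketch that calls xy.coords() directly (it lives in the grDevices package, which ships with R) reproduces the message without drawing anything. The names good.list and bad.list are made up for illustration.

```r
good.list <- list(x = -10:10, y = (-10:10)^3)
bad.list  <- list(x = -10:10, Y = (-10:10)^3)   # capital Y, as in the failing call

# With components named x and y, the coordinates are extracted
coords <- grDevices::xy.coords(good.list)

# With a component named Y, the same error is raised;
# tryCatch() keeps its message as a string instead of halting
msg <- tryCatch(grDevices::xy.coords(bad.list),
                error = function(e) conditionMessage(e))
```

So the "x" in the message is the first argument of xy.coords(), which received our list, and the real culprit is the misspelled component name Y.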

Localizing the Bug
Let's modify the function by calling print() at various points, to print
out the state of variables, to help localize the error.
> my.plotter = function(x, y, my.list=NULL) {
+   if (!is.null(my.list)) {
+     print("Here is my.list:")
+     print(my.list)
+     print("Now about to plot my.list")
+     plot(my.list, main="A plot from my.list!")
+   }
+   else {
+     print("Here is x:"); print(x)
+     print("Here is y:"); print(y)
+     print("Now about to plot x, y")
+     plot(x, y, main="A plot from x, y!")
+   }
+ }
Localizing the Bug

> my.plotter(my.list=list(x=-10:10, Y=(-10:10)^3))
[1] "Here is my.list:"
$x
[1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

$Y
[1] -1000 -729 -512 -343 -216 -125 -64 -27 -8
[20] 729 1000

[1] "Now about to plot my.list"

Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' is a list, but does not have components 'x' and 'y'

Localizing the Bug

> my.plotter(x="hi", y="there")
[1] "Here is x:"
[1] "hi"
[1] "Here is y:"
[1] "there"
[1] "Now about to plot x, y"

Error in plot.window(...) : need finite 'xlim' values In addi
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by
2: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf

Check Yourself
Tasks
Below is a random.walk() function like the one you wrote in homework.
Unfortunately, this one has some bugs: find them and fix them! If you forget the
random walk algorithm, it's written on the next slide.
> random.walk = function(x.start = 5, seed = NULL) {
+   if (!is.null(seed)) set.seed(seed)
+   x.vals <- x.start
+   while (TRUE) {
+     r <- runif(1, -2, 1)
+     if (tail(x.vals + r, 1) <= 0) break
+     else x.vals <- c(x.vals, x.vals + r)
+   }
+   return(x.vals = x.vals, num.steps = length(x.vals))
+ }
>
> # random.walk(x.start=5, seed=3)$num.steps # Should print 8
> # random.walk(x.start=10, seed=7)$num.steps # Should print 14

Check Yourself

The Random Walk Procedure


1. Start with an initial value for x.
2. Draw a random number r uniformly between -2 and 1.
3. Replace x with x + r
4. Stop if x <= 0
5. Else repeat.

Check Yourself

Solutions
> random.walk = function(x.start=5, seed=NULL) {
+   if (!is.null(seed)) set.seed(seed)
+   x.vals <- x.start
+   while (TRUE) {
+     r <- runif(1, -2, 1)
+     if (tail(x.vals + r, 1) <= 0) break
+     else x.vals <- c(x.vals, x.vals[length(x.vals)] + r)
+     print(x.vals)
+   }
+   ret.val <- list(x.vals = x.vals, num.steps = length(x.vals))
+   print(ret.val)
+   return(ret.val)
+ }
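
The buggy version had two problems: c(x.vals, x.vals + r) appended a shifted copy of the whole vector instead of one new value, and return() was given two arguments, which R does not allow. Once the bugs are found, the print() calls can come out again. A print-free sketch of the corrected function follows; the name random.walk.clean is hypothetical, chosen here so the version above is not overwritten.

```r
random.walk.clean <- function(x.start = 5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x.vals <- x.start
  while (TRUE) {
    r <- runif(1, -2, 1)
    x.new <- x.vals[length(x.vals)] + r   # step from the most recent value only
    if (x.new <= 0) break                 # stop without recording the final value
    x.vals <- c(x.vals, x.new)
  }
  # return() takes a single object, so bundle the results in a list
  return(list(x.vals = x.vals, num.steps = length(x.vals)))
}

walk <- random.walk.clean(x.start = 5, seed = 3)
```

Because the seed is set inside the function, calling it again with the same arguments gives the same walk, which is what makes the bug reproducible in the first place.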

Check Yourself
Solutions
> random.walk(x.start = 5, seed = 3)$num.steps

[1] 5.000000 3.504125
[1] 5.000000 3.504125 3.926674
[1] 5.000000 3.504125 3.926674 3.081501
[1] 5.000000 3.504125 3.926674 3.081501 2.064704
[1] 5.000000 3.504125 3.926674 3.081501 2.064704 1.871006
[1] 5.000000 3.504125 3.926674 3.081501 2.064704 1.871006
[7] 1.684188
[1] 5.0000000 3.5041246 3.9266738 3.0815008 2.0647038
[6] 1.8710058 1.6841880 0.0580883
$x.vals
[1] 5.0000000 3.5041246 3.9266738 3.0815008 2.0647038
[6] 1.8710058 1.6841880 0.0580883

$num.steps
[1] 8
Check Yourself
Solutions
> random.walk(x.start = 10, seed = 7)$num.steps

[1] 10.00000 10.96673
[1] 10.00000 10.96673 10.15996
[1] 10.000000 10.966728 10.159964 8.507058
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958 4.257524
[1] 10.000000 10.966728 10.159964 8.507058 6.716304
[6] 5.447552 5.823583 4.843770 5.759958 4.257524

Section V

K-means Clustering

Clustering

Clustering refers to a broad set of techniques for finding subgroups,
or clusters, in a dataset.
The idea is to partition the observations into distinct groups so that
observations within each group are similar, while observations in
different groups are different from each other.

Similarities to PCA
Both clustering and PCA seek to simplify the data via a small number of
summaries:
PCA looks to find a low-dimensional representation of the
observations that explains most of the variance.
Clustering looks to find homogeneous subgroups among the
observations.

K-Means Clustering

K-Means Clustering
Simple approach for partitioning a data set into K distinct,
non-overlapping clusters. Note K is pre-specified.

How Do We Do It?
First we specify the number of desired clusters K.
The K-means algorithm then assigns each observation to exactly one
of the K clusters.
The algorithm boils down to a simple and intuitive mathematical
problem.

Example of K-means
A simulated dataset with 150 observations in two-dimensional space. We
see the results of the K-means algorithm using different values of K.

[Figure: three panels showing the resulting clusters for K = 2, K = 3, and K = 4.]

Some figures taken from An Introduction to Statistical Learning (Springer, 2013)
with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

K-means Clustering

Notation
Let $C_1, C_2, \ldots, C_K$ denote sets containing the indices of the observations
in each cluster. These sets satisfy the following properties:
1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. (Each observation belongs to at
least one of the K clusters.)
2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. (The clusters are non-overlapping.)
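
As a small illustration of this notation, the index sets can be built in R from a vector of cluster labels with split(). The label vector below is made up for the example (the names labels and C are hypothetical, not from the lecture), and the two checks correspond to properties 1 and 2.

```r
# Hypothetical cluster labels for n = 8 observations and K = 3 clusters
labels <- c(1, 3, 2, 1, 2, 2, 3, 1)
n <- length(labels)

# C[[k]] holds the indices of the observations assigned to cluster k
C <- split(seq_len(n), labels)

# Property 1: the union of the C_k is {1, ..., n}
covers.all <- setequal(unlist(C), seq_len(n))

# Property 2: the C_k are pairwise disjoint (no index appears twice)
no.overlap <- !any(duplicated(unlist(C)))

covers.all
no.overlap
```

Both checks hold for any label vector, since each observation carries exactly one label.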

K-means Clustering

Main Idea
The idea behind K-means clustering is that a good clustering is one for
which the within-cluster variation is as small as possible.

Within-Cluster Variation
For cluster $C_k$, denote the within-cluster variation by $W(C_k)$.
$W(C_k)$ measures the amount by which the observations within a
cluster differ from each other.
The algorithm is then an optimization problem:

$$\min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} W(C_k) \right\}.$$

K-means Clustering
Optimization Task
To solve the optimization problem, we need to define $W(C_k)$.
The most common choice is to use squared Euclidean distance:

$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2,$$

where $|C_k|$ denotes the number of observations in the kth cluster.
In words, this is the sum of the pairwise squared Euclidean distances
between observations in the kth cluster, divided by the total number
of observations in the kth cluster.
Consequently, we want to solve the optimization problem

$$\min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}.$$
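
To make the definition concrete, here is a minimal sketch that computes W(C_k) directly from pairwise distances and sums it over clusters. The function names within.var and objective are hypothetical, and using the three iris species as the "clustering" is only an illustration, not part of the lecture.

```r
# Within-cluster variation W(C_k): sum of squared Euclidean distances over
# all ordered pairs (i, i') in the cluster, divided by the cluster size.
within.var <- function(X) {
  X <- as.matrix(X)
  D2 <- as.matrix(stats::dist(X))^2   # full matrix of squared distances
  sum(D2) / nrow(X)
}

# Objective: sum of W(C_k) over the clusters defined by 'labels'
objective <- function(X, labels) {
  sum(sapply(split(as.data.frame(X), labels), within.var))
}

# Illustration only: treat the three iris species as a clustering
X <- iris[, 3:4]
objective(X, iris$Species)
```

stats::dist() is called with its namespace so the sketch still works in a session where a user-defined function named dist is present.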

K-means Clustering

How do we Solve this Problem?

Want to come up with an algorithm that partitions the data such that
the following is minimized:

$$\min_{C_1, \ldots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right\}$$

This is a hard problem: there are around $K^n$ ways to partition n
observations into K clusters!
The K-means clustering algorithm is very simple and can be shown to
provide a pretty good solution.
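
To get a feel for the size of the search space, the sketch below enumerates every possible label vector for a tiny made-up dataset (n = 8 points, K = 2) and picks the one with the smallest objective. The data and the names x, W, objective, and all.labels are assumptions for illustration only; with realistic n this enumeration is not feasible.

```r
# Tiny one-dimensional dataset with two visible groups (made up)
x <- c(1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1)
n <- length(x)
K <- 2

# W for one cluster: pairwise squared differences divided by cluster size
W <- function(v) sum(outer(v, v, "-")^2) / length(v)
objective <- function(labels) sum(tapply(x, labels, W))

# Every possible label vector: one row per assignment, K^n rows in total
all.labels <- as.matrix(expand.grid(rep(list(1:K), n)))
nrow(all.labels)

# Evaluate the objective for each assignment and keep the best one
values <- apply(all.labels, 1, objective)
best <- all.labels[which.min(values), ]
best
```

Already for n = 30 and K = 3 the table would have 3^30 rows, which is why a heuristic is needed.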

K-means Clustering

K-Means Clustering Algorithm

1. Randomly assign a number, from 1 to K, to each observation. This
serves as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, compute the cluster centroid. The kth
   centroid is the vector of the p feature means (covariate means) for the
   observations in the kth cluster.
   b. Assign each observation to the cluster whose centroid is closest (where
   closest is defined using Euclidean distance).

This algorithm finds a local minimum, not necessarily the global minimum.
We must decide beforehand how many clusters K exist in the data.

K-means Clustering

[Figure: six panels showing the progress of the algorithm: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results.]

K-means Clustering

Why does the Algorithm Work?

Notice the following identity

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2,$$

where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean of feature j in cluster $C_k$.
In step 2a, the cluster means minimize the sum of squared deviations
for the current assignment. Reallocating the observations (step 2b) can
only improve the above, so the value of the objective function in the
optimization problem never increases.
As the algorithm runs, the clustering obtained will continually improve
until the result no longer changes.
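
Both claims can be checked numerically. The sketch below first compares the two sides of the identity on one group of iris observations, then runs a bare-bones version of the algorithm and records the objective after every reassignment. The names lhs, rhs, objective, and history are hypothetical, the seed is arbitrary, and this is an illustration rather than the lecture's implementation.

```r
set.seed(1)                      # arbitrary seed, only for repeatability
X <- as.matrix(iris[, 3:4])
K <- 3

# Left-hand side of the identity for one cluster (pairwise form)
lhs <- function(Xk) sum(as.matrix(stats::dist(Xk))^2) / nrow(Xk)
# Right-hand side: twice the squared deviations from the cluster mean
rhs <- function(Xk) 2 * sum(sweep(Xk, 2, colMeans(Xk))^2)

Xk <- X[iris$Species == "setosa", ]
all.equal(lhs(Xk), rhs(Xk))      # the two sides agree

# Objective for a full clustering, computed with the right-hand side
objective <- function(labels) {
  sum(sapply(split(seq_len(nrow(X)), labels),
             function(idx) rhs(X[idx, , drop = FALSE])))
}

# Run the algorithm and record the objective after each reassignment
labels <- sample(1:K, nrow(X), replace = TRUE)
history <- objective(labels)
repeat {
  centers <- apply(X, 2, tapply, labels, mean)               # step 2a
  d2 <- sapply(seq_len(nrow(centers)),
               function(k) rowSums(sweep(X, 2, centers[k, ])^2))
  new.labels <- as.integer(rownames(centers))[apply(d2, 1, which.min)]  # step 2b
  if (all(new.labels == labels)) break
  labels <- new.labels
  history <- c(history, objective(labels))
}
history                          # one value per iteration
```

Printing diff(history) shows the change in the objective from one iteration to the next.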

K-means Clustering

The Initial Cluster Matters

Recall, the algorithm finds a local rather than a global optimum.
Therefore the results will depend on the initial (random) cluster
assignment of each observation in step 1.
It is important to run the algorithm multiple times from different
random initial clusterings. Then select the best solution as
determined by the objective function.

K-means Clustering
K-means performed six times with different starting assignments can give
different values of the objective.

[Figure: six panels, one per run, with the objective value above each plot: 320.9, 235.8, 235.8, 235.8, 235.8, 310.9.]

FIGURE 10.7. K-means clustering performed six times on the data from
Figure 10.5 with K = 3, each time with a different random assignment of the
observations in Step 1 of the K-means algorithm. Above each plot is the value of
the objective.

Example of K-means

Recall the iris dataset:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Example of K-means
> library(ggplot2)
> ggplot(data = iris) +
+   geom_point(aes(Petal.Length, Petal.Width,
+                  color = Species))

[Figure: scatterplot of Petal.Width against Petal.Length, with points colored by Species (setosa, versicolor, virginica).]

Example of K-means

Run the K-means algorithm with K = 2.

> km.out <- kmeans(iris[, 3:4], centers = 2, nstart = 20)
> km.out$centers
  Petal.Length Petal.Width
1     4.925253   1.6818182
2     1.492157   0.2627451
> km.out$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[82] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
[109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[136] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Example of K-means

Run the K-means algorithm with K = 2.

> table(km.out$cluster, iris$Species)

    setosa versicolor virginica
  1      0         49        50
  2     50          1         0
> iris$cluster <- as.factor(km.out$cluster)
> ggplot(data = iris) +
+   geom_point(mapping = aes(Petal.Length, Petal.Width,
+                            color = cluster))

Example of K-means

Run the K-means algorithm with K = 3.

> km.out <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
> table(km.out$cluster, iris$Species)

    setosa versicolor virginica
  1     50          0         0
  2      0          2        46
  3      0         48         4

Example of K-means

Initial Cluster Matters

To run the kmeans() function in R with multiple initial cluster
assignments, we use the nstart argument.
If a value of nstart greater than one is used, then K-means
clustering will be performed using multiple random assignments in
Step 1 of the K-means algorithm.
km.out$tot.withinss is the total within-cluster sum of squares.
This is what we seek to minimize.

Example of K-means

> set.seed(3)
> km.out <- kmeans(iris[, 3:4], centers = 3, nstart = 1)
> km.out$tot.withinss
[1] 31.41289
> km.out <- kmeans(iris[, 3:4], centers = 3, nstart = 20)
> km.out$tot.withinss
[1] 31.37136
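
What nstart does can be mimicked by hand: run kmeans() several times from a single start each, then keep the run with the smallest objective. The sketch below does this with 20 runs; the names runs, tot, and best are hypothetical and the seed is arbitrary. Note that when centers is a number, R's kmeans() picks random rows of the data as the initial centers, which plays the role of Step 1.

```r
set.seed(3)   # arbitrary seed so the runs can be reproduced
X <- iris[, 3:4]

# 20 separate runs, each from a single random start
runs <- lapply(1:20, function(i) kmeans(X, centers = 3, nstart = 1))

# Objective value reached by each run
tot <- sapply(runs, function(fit) fit$tot.withinss)
table(round(tot, 5))        # how often each local optimum was reached

# Keep the run with the smallest objective, as nstart = 20 would
best <- runs[[which.min(tot)]]
best$tot.withinss
```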

Example of K-means

Let's actually code the algorithm ourselves!

> data <- iris[, 3:4]
> # First we randomly assign clusters
> clusters <- sample(1:3, nrow(data), replace = TRUE)
> # Next we calculate the centers
> centers <- apply(data, 2, tapply, clusters, mean)
> centers
  Petal.Length Petal.Width
1     3.517647    1.078431
2     4.078261    1.315217
3     3.711321    1.215094

Example of K-means

Finally we need to calculate the new cluster assignments

> # Squared Euclidean distance between two points.
> # Note: this definition masks stats::dist() in the current session.
> dist <- function(p1, p2) {
+   return(sum((p1 - p2)^2))
+ }
> point.assign <- function(point, centers) {
+   # Input: one point and a matrix of centers (K = 3 is assumed here)
+   # Output: which cluster center is closest
+   return(which.min(c(dist(point, centers[1, ]),
+                      dist(point, centers[2, ]),
+                      dist(point, centers[3, ]))))
+ }
> new.clusters <- function(points, centers) {
+   # Input: points and centers
+   # Output: new cluster assignment
+   return(apply(points, 1, point.assign, centers))
+ }
Example of K-means

> new.clus <- new.clusters(points = data, centers = centers)

Step 1 is complete, now we iterate!

> while(any(new.clus != clusters)) {
+   clusters <- new.clus
+   centers <- apply(data, 2, tapply, clusters, mean)
+   new.clus <- new.clusters(points = data, centers = centers)
+ }
> table(new.clus, iris$Species)

new.clus setosa versicolor virginica
       1     50          0         0
       2      0          2        46
       3      0         48         4

Check Yourself

Tasks
Use the previous code to write a function K.means which takes as input
two arguments data and K and returns the final clustering the algorithm
finds. Note that in the previous work we assumed K = 3, so generalize
your function in terms of K.

Check Yourself

Solution
> K.means <- function(data, K) {
+   clusters <- sample(1:K, nrow(data), replace = TRUE)
+   centers <- apply(data, 2, tapply, clusters, mean)
+   new.clus <- new.clusters(points = data,
+                            centers = centers)
+   while(any(new.clus != clusters)) {
+     clusters <- new.clus
+     centers <- apply(data, 2, tapply, clusters, mean)
+     new.clus <- new.clusters(points = data,
+                              centers = centers)
+   }
+   return(list(clusters = new.clus, centers = centers))
+ }
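
One caveat: K.means() calls new.clusters(), which relies on point.assign() as defined earlier, and that function compares each point against exactly three centers. For K other than 3 the solution would therefore not behave as intended. A self-contained sketch that handles any K might look like the following; the names K.means.general and sq.dists are hypothetical, the seed is arbitrary, and this is one possible variant rather than the lecture's solution.

```r
# Squared Euclidean distance from one point to every row of 'centers'
sq.dists <- function(point, centers) {
  rowSums(sweep(centers, 2, point)^2)
}

K.means.general <- function(data, K) {
  X <- as.matrix(data)
  clusters <- sample(1:K, nrow(X), replace = TRUE)
  repeat {
    # Step 2a: one row of feature means per label; a label with no
    # observations gets a row of NA and is never selected in step 2b
    centers <- apply(X, 2, tapply, factor(clusters, levels = 1:K), mean)
    centers <- matrix(centers, nrow = K,
                      dimnames = list(1:K, colnames(X)))
    # Step 2b: index of the nearest center, whatever the value of K
    new.clus <- apply(X, 1, function(p) which.min(sq.dists(p, centers)))
    if (all(new.clus == clusters)) break
    clusters <- new.clus
  }
  list(clusters = unname(clusters), centers = centers)
}

set.seed(1)   # arbitrary seed
fit <- K.means.general(iris[, 3:4], K = 4)
table(fit$clusters, iris$Species)
```

Measuring distances with a row-wise operation on the whole centers matrix is what removes the dependence on a fixed number of clusters.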

Check Yourself

Solution
> K.means(data, K = 3)
$clusters
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[28] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
[55] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1
[82] 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 1 3
[109] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3
[136] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3

$centers
  Petal.Length Petal.Width
1     4.269231    1.342308
2     1.462000    0.246000
3     5.595833    2.037500

Check Yourself

Solution
> km.out$centers
  Petal.Length Petal.Width
1     1.462000    0.246000
2     5.595833    2.037500
3     4.269231    1.342308
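
The two sets of centers contain the same rows in a different order, because the numbering of clusters is arbitrary. A quick way to confirm this is to sort both matrices by the first coordinate before comparing. The sketch below types in the centers printed above; align.centers, mine, and builtin are hypothetical names, and sorting by one coordinate is an assumption that works here because the Petal.Length values are all distinct.

```r
cols <- c("Petal.Length", "Petal.Width")

# Centers returned by K.means(data, K = 3), copied from the output above
mine <- matrix(c(4.269231, 1.462000, 5.595833,
                 1.342308, 0.246000, 2.037500),
               ncol = 2, dimnames = list(1:3, cols))

# Centers from km.out$centers, copied from the output above
builtin <- matrix(c(1.462000, 5.595833, 4.269231,
                    0.246000, 2.037500, 1.342308),
                  ncol = 2, dimnames = list(1:3, cols))

# Reorder rows by the first coordinate and drop the arbitrary labels
align.centers <- function(centers) {
  out <- centers[order(centers[, 1]), , drop = FALSE]
  rownames(out) <- NULL
  out
}

all.equal(align.centers(mine), align.centers(builtin))   # TRUE
```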

Section VI

Exam Review

Format

All written: will last from 1:10pm - 3:30pm.
For no reason should you be on your cell phone/laptop/smart watch/whatever.

Topics

Topics covered before the midterm will be covered implicitly.
Focus on topics covered from week 8 through last week.
Anything from lecture, lab, homework is fair game.
Look to homework assignments (and check yourself questions) to get
a feel for what I think are important topics for you to understand
from each lecture.

Topics

Randomness in R/Generating Random Numbers


Simulations: dfoo(), rfoo(), qfoo(), pfoo().
Simulations: sample()
Inverse Transform Method
Rejection Sampling

Topics

Distributions as Models
MLE or MOM estimation.
Testing fit (visually and otherwise).
Permutation tests.

Topics

Optimization
Constrained vs. Unconstrained
Gradient Descent: Generally, how/why does it work? How is it
different from Newton's Method and what are the
strengths/weaknesses of each?
(I will not, for example, ask you to code up your own gradient descent
algorithm, though I may ask you to properly use something already
coded.)
Built-in R optimization functions (like we used in homework).

Topics

Transforming Data
Functions like sort() and order()
Obviously apply(), tapply(), sapply(), lapply() but also the
form of the output when we use different functions like range() vs.
mean().
Built-in R functions split(), aggregate(), and merge().
Functions in the plyr package to replace the apply() family.

Topics

Unsupervised Learning
The difference between supervised and unsupervised learning.
How to find principal components and interpret them.
Familiarity with and understanding of clustering generally.
