Griffin Barich

QAC 241

Spotify’s Network of Related Artists

For my final project in this class I looked at the network of artists based

on the related artists from Spotify. Spotify is a music streaming company that

holds almost all commercial music available in the market. It works as a

service to consumers by making available millions of songs and playlists for

any occasion. It also helps smaller artists reach more listeners. I

have used Spotify for many years and loved the freedom it has given me to

listen to all types of music from Bach to Britney. When listening to music, I

find that Spotify does a really good job of finding related music so I can always

find new and interesting music. This sparked my interest in how Spotify finds

these artists. Do they use genres? Do they use user data? Do they use data

from the artist’s songs? Using a network of related artists, I took the first step

toward finding out.

The first step in creating this network was finding the data. Luckily,

Spotify has an easy-to-use API. It works by submitting GET

requests with a token to certain endpoints in order to get data about artists,

their albums, songs, users, and some relevant metadata. All the data is

returned in JSON format and is relatively easy to parse. I had some trouble

understanding how to optimally request information, so after some research I


found that many people had created packages in R that contain functions

specifically for Spotify querying. The package that was best for my uses was

Rspotify by Tiago Dantas. The package allows the user to authenticate their

token with one function and run many kinds of searches through others. While this

package initially made life easier, the functions inside are very rigid. The

functions handled most of the data I needed well, but broke down at

any irregularity in the data. The Rspotify package was an essential but

imperfect component of my project.
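
The request-and-parse pattern described above can be sketched roughly as follows. This is only an illustration, not the code used in the project: `fetch_related`, `artists_to_df`, and the sample data are hypothetical names and made-up values, the endpoint path is the one that appears in the appendix (it may not be available to every application), and the bearer-token header is an assumption about how the token is passed. The network call is defined but never executed, so the example runs offline.

```r
# Hypothetical helper: request one artist's related artists.
# Needs the httr package and a valid access token; it is NOT called below.
fetch_related <- function(artist_id, access_token) {
  url <- paste0("https://api.spotify.com/v1/artists/", artist_id, "/related-artists")
  resp <- httr::GET(url, httr::add_headers(Authorization = paste("Bearer", access_token)))
  httr::stop_for_status(resp)          # turn HTTP errors into R errors
  httr::content(resp, as = "parsed")   # nested list built from the JSON body
}

# Flatten a parsed response into one row per artist.
artists_to_df <- function(parsed) {
  artists <- parsed$artists
  data.frame(
    name = vapply(artists, function(a) a$name, character(1)),
    id = vapply(artists, function(a) a$id, character(1)),
    popularity = vapply(artists, function(a) as.numeric(a$popularity), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up stand-in for a parsed response, so the example runs without a token.
sample_parsed <- list(artists = list(
  list(name = "Artist A", id = "idA", popularity = 70L),
  list(name = "Artist B", id = "idB", popularity = 55L)
))
related_df <- artists_to_df(sample_parsed)
print(related_df)
```

Using `vapply` with a declared return type makes the function fail loudly when a field is missing or malformed, which is the kind of irregularity that caused trouble with the packaged functions.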

I wanted to obtain the network of related artists for a single artist, but I

also wanted to make it user-friendly and customizable, so I wrapped the entire

process in a function. The function uses the getRelated2() function to get the

related artists of the given artist, iterates over them to collect the

name, id, and popularity of each, and then repeats the process for the requested number of

layers. This left me with a data frame of all the artists and their data. From

here I was able to make a network that connected artists and related artists. At

this point, I iterated over the artists and added genre as a node attribute,

stored as a character vector. Then, I built a bipartite artist-genre network and projected it

so that artists were connected when they shared a genre. I then exported both graphs

into Cytoscape and created a custom style that would fit all of the graphs.
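
To make the two steps concrete, here is a small offline sketch of the layered expansion and of the genre projection, using made-up artists and genres instead of API calls. It is not the function from the appendix: `expand_layers`, `toy_related`, and `toy_genres` are hypothetical names, the sketch tracks a frontier of newly found artists rather than re-querying every artist on each layer, and the projection is computed from an incidence matrix in base R rather than with igraph.

```r
# Toy stand-in for the API: each artist mapped to its related artists (made-up data).
toy_related <- list(
  A = c("B", "C"),
  B = c("A", "D"),
  C = c("D"),
  D = character(0)
)

# Breadth-first expansion: each layer looks up the related artists of the
# artists that were newly discovered in the previous layer.
expand_layers <- function(start, layers, lookup) {
  edges <- data.frame(from = character(0), to = character(0), stringsAsFactors = FALSE)
  seen <- start
  frontier <- start
  for (layer in seq_len(layers)) {
    next_frontier <- character(0)
    for (artist in frontier) {
      related <- lookup[[artist]]
      if (length(related) == 0) next  # dead end: nothing to add
      edges <- rbind(edges, data.frame(from = artist, to = related, stringsAsFactors = FALSE))
      next_frontier <- c(next_frontier, setdiff(related, seen))
      seen <- union(seen, related)
    }
    frontier <- unique(next_frontier)
  }
  list(nodes = seen, edges = edges)
}

result <- expand_layers("A", layers = 2, lookup = toy_related)
print(result$edges)

# Genre projection: two artists are linked when they share at least one genre.
toy_genres <- list(A = "pop", B = c("pop", "folk"), C = "folk", D = "jazz")
all_genres <- sort(unique(unlist(toy_genres)))
# Rows are artists, columns are genres, 1 marks membership.
incidence <- t(vapply(toy_genres, function(g) as.numeric(all_genres %in% g),
                      numeric(length(all_genres))))
colnames(incidence) <- all_genres
shared <- incidence %*% t(incidence)  # entry [i, j] counts shared genres
diag(shared) <- 0                     # ignore an artist's overlap with itself
print(shared)
```

In the `shared` matrix, any nonzero entry is an edge in the projected artist network, so artists that all carry the same genre tags end up fully connected to one another.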

Using a combination of data analysis and visualization skills, I was able

to draw some interesting conclusions. The first and most obvious conclusion is

that genre is very closely tied to Spotify’s related artists. In Cytoscape, the Force

diffusion layout couldn’t separate the nodes. The connections were so strong

that the nodes sat on top of each other. This showed that genre and related

artists are very heavily correlated. The real answer for how related artists are

found is on the Spotify for Artists FAQ page: “These artists are determined

automatically by combining music discussions and trends happening around

the internet with Spotify user listening data” (Link). The other interesting

finding came from comparing the shapes of different graphs from different

genres. The three that I am specifically choosing to look at are Ed

Sheeran (left), Phish (right), and Wolfgang Amadeus Mozart (below).


From these graphs, we can compare how the related artists split from the central

artist. Ed Sheeran is the most popular artist on Spotify, and so has a lot of

closely related artists. Also, there are a lot fewer dead ends and so the network is

much larger. This can be easily compared to the graph of Phish, a less popular,

more niche artist. They have a small cluster of similar artists, tightly held

together by their similar appeal. Only when you get to the outside do the more

mainstream bands like Crosby, Stills, Nash and Young appear. The graph of Mozart’s

related artists really surprised me. While there is a cluster in the middle, the

range extends much further than I would have expected. My guess for why the

network is so large and dense is that many classical artists have multiple

names associated with their work. Sometimes only the composer is credited,

sometimes only the conductor, sometimes the performer, or a combination.

This leads to many nodes in a similar location in the network. Different types of

artists lead to different-looking graphs.

While I am happy with the results of my project, there are definitely areas

where I could do more development. The largest area of improvement would be


in the Rspotify package. I would stop using it and write the functions myself.

This would allow me to catch errors in the code instead of letting them stop a

run that has already taken hours. Another feature of the API that I didn’t

use was the ability to query multiple ids at a time. Rspotify only supports

one id per request. Requesting many ids at a time would allow me to make the code faster

and more efficient. Finally, I would like to analyze the networks in more depth

and maybe see how other aspects of the artist (e.g. track popularity, key, tone of

songs) play into their relations. I enjoyed this project because it allowed me to

dive into the inner workings of one of my favorite websites.
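
Two of the improvements mentioned above, catching errors and grouping ids into batches, could look roughly like the sketch below. All names here (`safe_fetch`, `batch_ids`, `flaky_fetch`) are hypothetical, the failing lookup is simulated, and the batch size of 50 is an assumption that should be checked against the API documentation before use.

```r
# Wrap a lookup so that one failing artist yields NULL instead of stopping the run.
# `fetch` is any function of a single id (for example, a wrapper around getRelated2).
safe_fetch <- function(id, fetch) {
  tryCatch(
    fetch(id),
    error = function(e) {
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Split ids into comma-separated batches for an endpoint that accepts several ids.
# The default batch size is an assumption, not a documented guarantee.
batch_ids <- function(ids, batch_size = 50) {
  groups <- split(ids, ceiling(seq_along(ids) / batch_size))
  unname(vapply(groups, paste, character(1), collapse = ","))
}

# Offline demonstration with a made-up lookup that fails for one id.
flaky_fetch <- function(id) {
  if (id == "bad") stop("unexpected response")
  toupper(id)
}
results <- lapply(c("a", "bad", "c"), safe_fetch, fetch = flaky_fetch)
batches <- batch_ids(c("id1", "id2", "id3", "id4", "id5"), batch_size = 2)
print(batches)
```

With this pattern a long crawl can log the artists it skipped and keep going, and the `NULL` entries can be filtered out or retried afterwards.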


Appendix:
#Final Work
setwd(dir = "Final Project")
library(httr)

# Read the credentials from environment variables instead of hard-coding them
clientID <- Sys.getenv("SPOTIFY_CLIENT_ID")
secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")

library(devtools)
install_github("tiagomendesdantas/Rspotify")
library(Rspotify)
keys <- spotifyOAuth("QAC Final",clientID, secret)
library(igraph)

getRelated2 <- function(id, token) {
  # Request the related artists for one artist id
  req <- httr::GET(paste0("https://api.spotify.com/v1/artists/", id, "/related-artists"),
                   httr::config(token = token))
  json1 <- httr::content(req)

  # Keep the basic fields and the follower count of each related artist
  M <- lapply(json1$artists, "[", c("name", "id", "popularity", "type"))
  N <- lapply(json1$artists, "[[", "followers")
  N <- lapply(N, "[", "total")

  relatedArtists <- plyr::ldply(M, data.frame)
  relatedArtists$followers <- plyr::ldply(N, data.frame)$total

  return(relatedArtists)
}

# artistsearch= searchArtist("Rihanna", token= keys)


# artistfull=getArtistinfo(artistsearch$id[1], token = keys)
# artist=artistfull[,1:3]
# relatedartists= getRelated2(artist$id, token = keys)
#
# df= data.frame(artist, stringsAsFactors = F)
# nrow(relatedartists)
# for (j in 1:nrow(relatedartists)) {
# temp_df = data.frame(name = relatedartists$name[j],
# id = relatedartists$id[j],
# popularity = relatedartists$popularity[j],
# stringsAsFactors=F)
# df = rbind(df, temp_df)
#}
#
# links_df = data.frame(from=rep(df$id[1], 20),
# to = df$id[2:21], stringsAsFactors=F)
# g = graph_from_data_frame(d=links_df, directed=T, vertices=df[, c("id", "name", "popularity")])
# plot(g, vertex.size= (V(g)$popularity/3))
# n=nrow(df)
# for (k in 2:n) {
#
# ## take the id and construct a new URL for Spotify request
# artistfull=getArtistinfo(df$id[k], token = keys)
# artist=artistfull[,1:3]
# tmp_a = as.character(artist$name)
# tmp_a = enc2utf8(tmp_a)
# relatedartists= getRelated2(artist$id, token = keys)
# #if (nrow(relatedartists) == 0) { next }
#
# ## reuse the code of for loop from before
# for (j in 1:nrow(relatedartists)) {
# temp_df = data.frame(name = relatedartists$name[j],
# id = relatedartists$id[j],
# popularity = relatedartists$popularity[j],
# stringsAsFactors=F)
# df = rbind(df, temp_df)
#
#
# links_df = rbind(links_df,
# data.frame(from=df$id[k],
# to=relatedartists$id[j], stringsAsFactors=F))
# }
#}
#
# k= duplicated(df$id)
# df2=df[!k,]
#
# g = graph_from_data_frame(d=links_df, directed=T, vertices=df2[, c("id", "name", "popularity")])
# plot(g, vertex.label=V(g)$name, vertex.size=(V(g)$popularity/10), edge.arrow.size=0.5)

numberlayersearch <- function(artistid, layers, token = keys) {
  # Layer 1: the central artist plus its related artists
  relatedartists <- getRelated2(artistid, token = token)
  artistfull <- getArtistinfo(artistid, token = token)
  df <- data.frame(artistfull[, 1:3], stringsAsFactors = FALSE)
  links_df <- data.frame(from = character(0), to = character(0),
                         stringsAsFactors = FALSE)
  if (nrow(relatedartists) > 0) {
    df <- rbind(df, relatedartists[, c("name", "id", "popularity")])
    # One edge per related artist (no longer assumes exactly 20 results)
    links_df <- data.frame(from = as.character(df$id[1]),
                           to = as.character(relatedartists$id),
                           stringsAsFactors = FALSE)
  }

  i <- 1
  while (i < layers) {
    n <- nrow(df)
    for (k in seq_len(n)[-1]) {
      ## request the related artists of every artist found so far
      relatedartists <- getRelated2(df$id[k], token = token)
      if (nrow(relatedartists) == 0) next  # dead end: no related artists

      df <- rbind(df, relatedartists[, c("name", "id", "popularity")])
      links_df <- rbind(links_df,
                        data.frame(from = as.character(df$id[k]),
                                   to = as.character(relatedartists$id),
                                   stringsAsFactors = FALSE))
    }
    # Drop artists that were reached more than once
    df <- df[!duplicated(df$id), ]
    i <- i + 1
  }

  df <- df[!duplicated(df$id), ]
  g <- graph_from_data_frame(d = links_df, directed = TRUE,
                             vertices = df[, c("id", "name", "popularity")])
  return(g)
}

searchArtist("Backstreet Boys", token = keys)


g2=numberlayersearch("5rSXSAkZ67PYJSvpUpkOr7", layers=3)
searchArtist("Miles Davis", token = keys)
g3=numberlayersearch("0kbYTNQb4Pb1rPbbaF0pT4", layers=2)
plot(g2, vertex.label=V(g2)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(g2)$popularity/10))
c=cluster_edge_betweenness(g2, modularity = F)
length((V(g2)))
V(g2)$genres=vector(mode= "character", length = 1)
for (x in 1:length(V(g2))){
tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
x=enc2utf8(as.character(V(g2)$name[x]))), token = keys)
if (length(tmp)==0) {next}
cat(x,labels = "#")
V(g2)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres), split =
",", fixed = T)
}

r = sapply(V(g2)$genres, FUN=length)
genres_vec = unlist(V(g2)$genres)
genres_vec = paste("#", genres_vec, sep="")

xx = rep(V(g2)$name, times=r)
g2_genre_net = data.frame(from=xx,
to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(g2_genre_net, directed=F)
V(genre_net)$type = grepl("^#", V(genre_net)$name)
bands = bipartite.projection(genre_net, which="false")

searchArtist("Ed Sheeran", token= keys)


gf = numberlayersearch("6eUKZXaKkcviH0Ku9w2n3V", layers = 3)

length((V(gf)))
V(gf)$genres=vector(mode= "character", length = 1)
y=2  # start at the first vertex, as in the Bach section below
for (x in (y-1):length(V(gf))){
tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
x=enc2utf8(as.character(V(gf)$name[x]))), token = keys)
if (length(tmp)==0) {next}
cat(x)
y=x
V(gf)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres), split = ",",
fixed = T)
}

r = sapply(V(gf)$genres, FUN=length)
genres_vec = unlist(V(gf)$genres)
genres_vec = paste("#", genres_vec, sep="")

xx = rep(V(gf)$name, times=r)
gf_genre_net = data.frame(from=xx,
to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(gf_genre_net, directed=F)
V(genre_net)$type = grepl("^#", V(genre_net)$name)
bands2 = bipartite.projection(genre_net, which="false")
V(bands2)$name[1:20]
plot(bands2, vertex.label=V(bands2)$name, vertex.label.cex=.5, vertex.size=2)

searchArtist("Bach", token = keys)


ghuge= numberlayersearch("5aIqB5nVVvmFsvSdExz408", layers= 3)

length((V(ghuge)))
V(ghuge)$genres=vector(mode= "character", length = 1)
y=2
for (x in (y-1):length(V(ghuge))){
tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
x=enc2utf8(as.character(V(ghuge)$name[x]))), token = keys)
if (length(tmp)==0) {next}
cat(x)
y=x
V(ghuge)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres), split
= ",", fixed = T)
}
r = sapply(V(ghuge)$genres, FUN=length)
genres_vec = unlist(V(ghuge)$genres)
genres_vec = paste("#", genres_vec, sep="")

xx = rep(V(ghuge)$name, times=r)
ghuge_genre_net = data.frame(from=xx,
to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(ghuge_genre_net, directed=F)


V(genre_net)$type = grepl("^#", V(genre_net)$name)
bands3 = bipartite.projection(genre_net, which="false")

V(bands3)$name[1:20]
plot(bands3, vertex.label=V(bands3)$name, vertex.label.cex=.5, vertex.size=2)

searchArtist("Phish", token= keys)


gex = numberlayersearch("5wbIWUzTPuTxTyG6ouQKqz", layers = 3)
plot(gex, vertex.label=V(gex)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(gex)$popularity/10))
write.graph(graph=gex, file="~/Documents/Documents - Griffin’s MacBook Pro/Sophmore Year/Network Analysis/Final Project/Inclassexample3.gml", format="gml")
dev.off()

searchArtist("Mozart", token= keys)


gex = numberlayersearch("4NJhFmfw43RLBLjQvxDuRS", layers = 3)
plot(gex, vertex.label=V(gex)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(gex)$popularity/10))
write.graph(graph=gex, file="~/Documents/Documents - Griffin’s MacBook Pro/Sophmore Year/Network Analysis/Final Project/Inclassexample3.gml", format="gml")
dev.off()
