Final Report QAC 241

Griffin Barich
QAC 241
Spotify’s Network of Related Artists
For my final project in this class, I looked at the network of artists formed
by Spotify’s related-artist lists. Spotify is a music streaming company whose
catalog covers almost all commercially available music. It serves consumers by
making millions of songs and playlists available for any occasion, and it also
helps smaller artists reach more listeners. I have used Spotify for many years
and love the freedom it has given me to listen to all types of music, from Bach
to Britney. When listening, I find that Spotify does a really good job of
surfacing related music, so I can always find something new and interesting.
This sparked my interest in how Spotify finds these artists. Do they use
genres? Do they use user data? Do they use data from the artist’s songs? By
building a network of related artists, I took a first step toward finding out.
The first step in creating this network was finding the data. Luckily,
Spotify has an easy-to-use API. It works by sending GET requests, authenticated
with a token, to specific endpoints in order to get data about artists, their
albums, songs, users, and some relevant metadata. All the data is returned in
JSON format and is relatively easy to parse. I had some trouble understanding
how to request information efficiently, so after some research I found that
many people had created R packages containing functions specifically for
querying Spotify. The package that best fit my needs was Rspotify by Tiago
Dantas. It lets the user authenticate their token with one function and run
many kinds of searches through others. While this package initially made life
easier, its functions are very rigid. They handled the majority of the data I
needed well, but broke down at any slight irregularity in the dataset. The
Rspotify package was an essential but imperfect component of my project.
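The parsing step described above, and the kind of irregularity that broke the
packaged functions, can be sketched in base R. This is only an illustration
under assumptions: the reply below is a made-up list shaped like a parsed JSON
response, and field_or_na() and flatten_artists() are hypothetical helpers, not
part of Rspotify or of my project code.

```r
# Mock of a parsed JSON reply: a list of artist records
# (structure assumed for illustration, values made up)
mock_reply <- list(artists = list(
  list(name = "Artist A", id = "id_a", popularity = 71,
       followers = list(total = 1200)),
  # This record has no popularity and an empty follower count
  list(name = "Artist B", id = "id_b", followers = list(total = NULL))
))

# Return x[[field]], or NA when the field is absent or NULL
field_or_na <- function(x, field) {
  value <- x[[field]]
  if (is.null(value)) NA else value
}

# Turn the nested list into one data frame row per artist
flatten_artists <- function(reply) {
  rows <- lapply(reply$artists, function(a) {
    data.frame(name = field_or_na(a, "name"),
               id = field_or_na(a, "id"),
               popularity = field_or_na(a, "popularity"),
               followers = field_or_na(a$followers, "total"),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

artists_df <- flatten_artists(mock_reply)
print(artists_df)
```

Filling absent fields with NA keeps every record the same width, so one
incomplete artist does not stop the rest from being combined.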
I wanted to build the network of related artists for a single artist, but I
also wanted the process to be user-friendly and customizable, so I wrapped it
in a single function. The function uses getRelated2() to get the related
artists of the given artist, iterates over them to collect the name, id, and
popularity of each, and then repeats the process on the newly found artists for
the requested number of layers. This left me with a data frame of all the
artists and their data, from which I made a network connecting each artist to
its related artists. At this point, I iterated over the artists and added genre
as a node characteristic, stored as a character vector. Then I built a two-mode
network linking artists to their genres and took its bipartite projection, so
that two artists are connected when they share a genre. I then exported both
graphs into Cytoscape and created a custom style that would fit all of the
graphs.
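To make the projection step concrete, here is a minimal base R sketch of a
bipartite projection written as a matrix product. The artists and genres are
made up for illustration. In the appendix the projection is done on a graph
object with igraph’s bipartite.projection(); the matrix form below is an
equivalent way to see which pairs of artists end up linked, not the code I ran.

```r
# Toy artist-to-genre table (made-up artists and genres, for illustration only)
memberships <- data.frame(
  artist = c("A", "A", "B", "B", "C", "D"),
  genre  = c("pop", "dance pop", "pop", "boy band", "boy band", "jazz"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows are artists, columns are genres, 1 = artist has genre
incidence <- unclass(table(memberships$artist, memberships$genre))
incidence <- (incidence > 0) * 1

# One-mode projection onto artists: entry [i, j] counts genres shared by i and j
shared <- incidence %*% t(incidence)
diag(shared) <- 0  # ignore each artist's overlap with itself

# Edge list of artist pairs that share at least one genre
# (upper triangle only, so each pair is listed once)
pairs <- which(shared > 0 & upper.tri(shared), arr.ind = TRUE)
edges <- data.frame(from = rownames(shared)[pairs[, 1]],
                    to = colnames(shared)[pairs[, 2]],
                    weight = shared[pairs],
                    stringsAsFactors = FALSE)
print(edges)
```

In this toy table, A and B are linked through "pop", B and C through
"boy band", and D stays isolated because it shares no genre with anyone.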
Using a combination of data analysis and visualization skills, I was able
to draw some interesting conclusions. The first and most obvious is that genre
is very closely tied to related artists. In Cytoscape, the force-directed
layout could not separate the nodes: the connections were so strong that the
nodes sat on top of each other. This showed that genre and related artists are
very heavily correlated. The real answer for how related artists are chosen is
on the Spotify for Artists FAQ page: “These artists are determined
automatically by combining music discussions and trends happening around the
internet with Spotify user listening data” (Link). The other interesting
finding came from comparing the shapes of the graphs for artists from different
genres. The three I chose to look at are Ed Sheeran (left), Phish (right), and
Wolfgang Amadeus Mozart (below).

From these graphs, we can compare how the related artists branch out from the
central artist. Ed Sheeran is the most popular artist on Spotify, and so has a
lot of closely related artists. There are also far fewer dead ends, so the
network is much larger. Compare this with the graph of Phish, a less popular,
more niche artist. Phish has a small cluster of similar artists, tightly held
together by their shared appeal. Only at the outer edge do more mainstream
bands like Crosby, Stills, Nash and Young appear. The graph of Mozart’s related
artists really surprised me. While there is a cluster in the middle, the range
extends much further than I would have expected. My guess for why the network
is so large and dense is that many classical artists have multiple names
associated with their work. Sometimes only the composer is credited, sometimes
only the conductor, sometimes the performer, or a combination. This leads to
many nodes in a similar location in the network. Different types of artists
lead to different-looking graphs.
While I am happy with the results of my project, there are definitely areas
where I could do more development. The largest improvement would be to stop
using the Rspotify package and write the request functions myself. This would
allow me to catch errors in the code instead of having them end a run that has
already taken hours. Another feature of the API that I didn’t use was the
ability to query multiple ids at a time; Rspotify only supports one id per
request. Requesting many ids at once would make the code faster and more
efficient. Finally, I would like to analyze the networks in more depth and
maybe see how other aspects of an artist (e.g. track popularity, key, tone of
songs) play into their relations. I enjoyed this project because it allowed me
to dive into the inner workings of one of my favorite websites.
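The two improvements above, catching errors and requesting ids in batches, can
be sketched in base R. This is a rough sketch under assumptions, not finished
code: chunk_ids(), safe_fetch(), and fake_fetch() are hypothetical helpers, the
batch size of 50 is an assumed per-request limit, and fake_fetch() stands in
for a real request function so that the example runs without a token.

```r
# Split a vector of ids into batches of at most `size`
# (the default of 50 is an assumed per-request limit)
chunk_ids <- function(ids, size = 50) {
  split(ids, ceiling(seq_along(ids) / size))
}

# Call `fetch` on one batch; on error, report it and return NULL
# instead of stopping the whole run
safe_fetch <- function(batch, fetch) {
  tryCatch(fetch(batch), error = function(e) {
    message("batch skipped: ", conditionMessage(e))
    NULL
  })
}

# Stand-in for a real request function (hypothetical): fails on a marked id
fake_fetch <- function(batch) {
  if ("bad_id" %in% batch) stop("malformed record")
  data.frame(id = batch, stringsAsFactors = FALSE)
}

ids <- c(paste0("id", 1:4), "bad_id", paste0("id", 5:7))
results <- lapply(chunk_ids(ids, size = 3), safe_fetch, fetch = fake_fetch)

# Keep the batches that succeeded and stack them into one data frame
artists <- do.call(rbind, Filter(Negate(is.null), results))
print(artists$id)
```

Here the batch containing "bad_id" is skipped with a message, and the other
batches are still collected, so one bad record no longer costs hours of work.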

Appendix:
#Final Work
setwd(dir = "Final Project")
library(httr)
# Credentials are read from the environment so that they are not stored in the script
clientID <- Sys.getenv("SPOTIFY_CLIENT_ID")
secret <- Sys.getenv("SPOTIFY_CLIENT_SECRET")
library(devtools)
install_github("tiagomendesdantas/Rspotify")
library(Rspotify)
keys <- spotifyOAuth("QAC Final",clientID, secret)
library(igraph)
# Return a data frame of the artists Spotify lists as related to artist `id`
getRelated2 <- function(id, token){
  req <- httr::GET(paste0("https://api.spotify.com/v1/artists/", id, "/related-artists"),
                   httr::config(token = token))
  json1 <- httr::content(req)
  # Keep the scalar fields of each related artist
  M <- lapply(json1$artists, "[",
              c("name", "id", "popularity", "type"))
  # Follower counts are nested one level deeper, under followers$total
  N <- lapply(json1$artists, "[[", "followers")
  N <- lapply(N, "[", "total")
  relatedArtists <- plyr::ldply(M, data.frame)
  relatedArtists$followers <- plyr::ldply(N, data.frame)$total
  return(relatedArtists)
}
# artistsearch= searchArtist("Rihanna", token= keys)

# artistfull=getArtistinfo(artistsearch$id[1], token = keys)
# artist=artistfull[,1:3]
# relatedartists= getRelated2(artist$id, token = keys)
#
# df= data.frame(artist, stringsAsFactors = F)
# nrow(relatedartists)
# for (j in 1:nrow(relatedartists)) {
# temp_df = data.frame(name = relatedartists$name[j],
# id = relatedartists$id[j],
# popularity = relatedartists$popularity[j],
# stringsAsFactors=F)
# df = rbind(df, temp_df)
#}
#
# links_df = data.frame(from=rep(df$id[1], 20),
# to = df$id[2:21], stringsAsFactors=F)
# g = graph_from_data_frame(d=links_df, directed=T, vertices=df[, c("id", "name", "popularity")])
# plot(g, vertex.size= (V(g)$popularity/3))
# n=nrow(df)
# for (k in 2:n) {
#
# ## take the id and construct a new URL for Spotify request
# artistfull=getArtistinfo(df$id[k], token = keys)
# artist=artistfull[,1:3]
# tmp_a = as.character(artist$name)
# tmp_a = enc2utf8(tmp_a)
# relatedartists= getRelated2(artist$id, token = keys)
# #if (nrow(relatedartists) == 0) { next }
#
# ## reuse the code of for loop from before
# for (j in 1:nrow(relatedartists)) {
# temp_df = data.frame(name = relatedartists$name[j],
# id = relatedartists$id[j],
# popularity = relatedartists$popularity[j],
# stringsAsFactors=F)
# df = rbind(df, temp_df)
#
#
# links_df = rbind(links_df,
# data.frame(from=df$id[k],
# to=relatedartists$id[j], stringsAsFactors=F))
# }
#}
#
# k= duplicated(df$id)
# df2=df[!k,]
#
# g = graph_from_data_frame(d=links_df, directed=T, vertices=df2[, c("id", "name", "popularity")])
# plot(g, vertex.label=V(g)$name, vertex.size=(V(g)$popularity/10), edge.arrow.size=0.5)
# Build the related-artists network around one artist, expanding outward `layers` times
numberlayersearch <- function(artistid, layers, token = keys){
  relatedartists = getRelated2(artistid, token = token)
  artistfull = getArtistinfo(artistid, token = token)
  artist = artistfull[, 1:3]
  df = data.frame(artist, stringsAsFactors = F)
  for (j in seq_len(nrow(relatedartists))) {
    temp_df = data.frame(name = relatedartists$name[j],
                         id = relatedartists$id[j],
                         popularity = relatedartists$popularity[j],
                         stringsAsFactors = F)
    df = rbind(df, temp_df)
  }
  # One edge from the seed artist to each of its related artists
  links_df = data.frame(from = rep(df$id[1], nrow(df) - 1),
                        to = df$id[-1], stringsAsFactors = F)
  i = 1
  while (i < layers){
    n = nrow(df)
    for (k in seq_len(n)[-1]) {
      ## take the id and construct a new URL for Spotify request
      relatedartists = getRelated2(as.character(df$id[k]), token = token)
      ## reuse the code of for loop from before
      for (j in seq_len(nrow(relatedartists))) {
        temp_df = data.frame(name = relatedartists$name[j],
                             id = relatedartists$id[j],
                             popularity = relatedartists$popularity[j],
                             stringsAsFactors = F)
        df = rbind(df, temp_df)
        links_df = rbind(links_df,
                         data.frame(from = df$id[k],
                                    to = relatedartists$id[j], stringsAsFactors = F))
      }
    }
    # Drop artists that were reached more than once
    df = df[!duplicated(df$id), ]
    i = i + 1
  }
  # Each layer revisits earlier artists, so remove repeated edges before building the graph
  links_df = unique(links_df)
  g = graph_from_data_frame(d = links_df, directed = T, vertices = df[, c("id", "name", "popularity")])
  return(g)
}
searchArtist("Backstreet Boys", token = keys)

g2=numberlayersearch("5rSXSAkZ67PYJSvpUpkOr7", layers=3)
searchArtist("Miles Davis", token = keys)
g3=numberlayersearch("0kbYTNQb4Pb1rPbbaF0pT4", layers=2)
plot(g2, vertex.label=V(g2)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(g2)$popularity/10))
c=cluster_edge_betweenness(g2, modularity = F)
length((V(g2)))
V(g2)$genres=vector(mode= "character", length = 1)
# Look up each artist by name and store its genres as a character vector
for (x in seq_along(V(g2))){
  tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
                        x=enc2utf8(as.character(V(g2)$name[x]))), token = keys)
  if (length(tmp$id)==0) {next}  # no search result for this name
  cat(x,labels = "#")
  V(g2)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres),
                           split = ",", fixed = T)
}
r = sapply(V(g2)$genres, FUN=length)
genres_vec = unlist(V(g2)$genres)
genres_vec = paste("#", genres_vec, sep="")
xx = rep(V(g2)$name, times=r)
g2_genre_net = data.frame(from=xx,
to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(g2_genre_net, directed=F)
V(genre_net)$type = grepl("^#", V(genre_net)$name)
bands = bipartite.projection(genre_net, which="false")
searchArtist("Ed Sheeran", token= keys)

gf = numberlayersearch("6eUKZXaKkcviH0Ku9w2n3V", layers = 3)
length((V(gf)))
V(gf)$genres=vector(mode= "character", length = 1)
# y tracks the last vertex processed; the loop starts from y-1
y=2
for (x in (y-1):length(V(gf))){
  tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
                        x=enc2utf8(as.character(V(gf)$name[x]))), token = keys)
  if (length(tmp$id)==0) {next}  # no search result for this name
  cat(x)
  y=x
  V(gf)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres),
                           split = ",", fixed = T)
}
r = sapply(V(gf)$genres, FUN=length)
genres_vec = unlist(V(gf)$genres)
genres_vec = paste("#", genres_vec, sep="")
xx = rep(V(gf)$name, times=r)
gf_genre_net = data.frame(from=xx,
                          to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(gf_genre_net, directed=F)
V(genre_net)$type = grepl("^#", V(genre_net)$name)
bands2 = bipartite.projection(genre_net, which="false")
V(bands2)$name[1:20]
plot(bands2, vertex.label=V(bands2)$name, vertex.label.cex=.5, vertex.size=2)
searchArtist("Bach", token = keys)

ghuge= numberlayersearch("5aIqB5nVVvmFsvSdExz408", layers= 3)
length((V(ghuge)))
V(ghuge)$genres=vector(mode= "character", length = 1)
# y tracks the last vertex processed; the loop starts from y-1
y=2
for (x in (y-1):length(V(ghuge))){
  tmp=searchArtist(gsub(pattern="[^a-zA-Z0-9\\s]", replacement= " ",
                        x=enc2utf8(as.character(V(ghuge)$name[x]))), token = keys)
  if (length(tmp$id)==0) {next}  # no search result for this name
  cat(x)
  y=x
  V(ghuge)$genres[x]=strsplit(as.character(getArtistinfo(tmp$id[1], token = keys)$genres),
                              split = ",", fixed = T)
}
r = sapply(V(ghuge)$genres, FUN=length)
genres_vec = unlist(V(ghuge)$genres)
genres_vec = paste("#", genres_vec, sep="")
xx = rep(V(ghuge)$name, times=r)
ghuge_genre_net = data.frame(from=xx,
                             to = genres_vec, stringsAsFactors = F)
genre_net = graph_from_data_frame(ghuge_genre_net, directed=F)
V(genre_net)$type = grepl("^#", V(genre_net)$name)

bands3 = bipartite.projection(genre_net, which="false")
V(bands3)$name[1:20]
plot(bands3, vertex.label=V(bands3)$name, vertex.label.cex=.5, vertex.size=2)
searchArtist("Phish", token= keys)

gex = numberlayersearch("5wbIWUzTPuTxTyG6ouQKqz", layers = 3)
plot(gex, vertex.label=V(gex)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(gex)$popularity/10))
write.graph(graph=gex, file="~/Documents/Documents - Griffin’s MacBook Pro/Sophmore Year/Network Analysis/Final Project/Inclassexample3.gml", format="gml")
dev.off()
searchArtist("Mozart", token= keys)

gex = numberlayersearch("4NJhFmfw43RLBLjQvxDuRS", layers = 3)
plot(gex, vertex.label=V(gex)$name,edge.arrow.size=0.5, vertex.label.cex=.5,
vertex.size=(V(gex)$popularity/10))
# This uses the same file name as the Phish export above, so it overwrites that file
write.graph(graph=gex, file="~/Documents/Documents - Griffin’s MacBook Pro/Sophmore Year/Network Analysis/Final Project/Inclassexample3.gml", format="gml")
dev.off()
