0. Distribution of work and initialisation

| Work | David Ari Ostenfeldt, s194237 | Kristian Rhindal Møllman, s194246 | Kristoffer Marboe, s194249 |
|---|---|---|---|
| Data | 40% | 30% | 30% |
| Networks | 30% | 40% | 30% |
| Text | 30% | 30% | 40% |
| Website | 33% | 33% | 33% |
| Explainer notebook | 33% | 33% | 33% |

Everyone contributed equally to this project.

1. Motivation

What is your dataset?

The dataset we will be analysing is a collection of songs, each with four attributes: the artists who worked on them, the lyrics, the release date, and the song title.

The network will be created with each artist as a node and a link between two artists if they have collaborated on a song. In our case, collaboration simply means that they are credited in any way on the same song, be it with 'featuring' or 'with' separating their names.

The lyrics will be used to analyse the language used in the songs we listen to. It will be split up into genres, decades and individual artists.

Why did you choose this dataset?

Musicians often collaborate with other musicians, creating new songs for our enjoyment. We thought the links between artists would be interesting to dig into and would make for an interesting network. Furthermore, investigating the different artists' language through their song lyrics to find patterns and attributes would maybe provide some insight into the various genres, artists or evolution of music.

What was your goal for the end user's experience?

We wanted to provide some insight into how artists collaborate, which genres and artists collaborate more and how the language between genres and artists differs. We wanted to create an experience for the user via the webpage, in which they could explore the parts that interest them specifically. Maybe they have a favourite genre or artist that they want to understand better. Furthermore, by providing the data set for the user, we also let them play around with it on their own, to investigate other genres or, e.g. look at how a specific artist has developed through the years.

Scraping the data

The first part of any project is collecting the data. We needed a list of songs to collect from Genius, and for this purpose, we chose Billboard's 'The Hot 100' list. The list goes back to 1960 and is updated every week. In theory, this has the potential to grant us 5,200 songs a year (100 songs a week × 52 weeks) over 62 years, which means 322,400 possible songs - though, in practice, many songs reappear on the chart.

To collect the list of songs, we used the billboard.py module, which provides a Python interface to Billboard's charts.

Note: The code in this section is not meant to be run; it is simply to show how we collected the data

First, we create some helper functions that we will make use of when searching for songs.

The find_artist function takes a name and returns an artist.
find_song takes an artist and a song title and returns a song.
artist_to_list returns a list of artists.
process_artist_names uses regex to find all the separate artists in the given name segment.
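The splitting logic behind process_artist_names can be sketched with a single regex. This is a hypothetical re-implementation for illustration, not the project's exact code:

```python
import re

def process_artist_names(name):
    """Split a credited artist string into individual artist names.

    Illustrative sketch: splits at common collaboration markers
    (',', '&', 'and', 'featuring', 'feat.', 'ft.', 'with') and
    lowercases the results. The project's actual regex may differ.
    """
    parts = re.split(
        r"\s*(?:,|&|\band\b|\bfeaturing\b|\bfeat\.?|\bft\.?|\bwith\b)\s*",
        name, flags=re.IGNORECASE)
    return [p.strip().lower() for p in parts if p.strip()]
```

For example, this turns "Lil Durk Featuring Gunna" into ['lil durk', 'gunna'] and "Earth, Wind & Fire" into ['earth', 'wind', 'fire'].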

When searching for songs using the Genius API, we used the LyricsGenius library. To collect Genius's genre tags and release date, we had to modify its source code. The change was made in the last return statement of the Genius.lyrics() function, with the following code:

# Remove the PYONG and Embed counters from the end of the scraped lyrics
lyrics = lyrics[:-5]
while lyrics[-1].isdigit():
    lyrics = lyrics[:-1]

# Grab the song's genre tags from the page
all_tags = html.find("div", class_=re.compile("^SongTags__Container"))
tags = all_tags.get_text('_')

# Grab the release date from the song-info column, if present
all_creds = html.find("div", class_=re.compile("^SongInfo__Columns-nekw6x-2 lgBflw"))
creds = all_creds.get_text('_')
if 'Release Date_' in creds:
    release_date = creds.split('Release Date_')[1].split('_')[0]
else:
    release_date = 'Unknown'

# Pack lyrics, tags and release date into one string with sentinel markers
return lyrics.strip("\n") + '<<<<<<<<<<' + tags.lower() + '>>>>>>>>>>' + release_date

We used a sequential searching strategy: we first search for the song title and full artist name, and if that does not yield any results, we split the artist name at 'featuring', 'feat.', 'ft.' or 'with' and search for the song title together with the first partition of the artist name. If this still doesn't result in a valid song, we remove parentheses from the artist names and replace 'and' with '&', after which we again search for the song title and artist name. If this fails as well, we try splitting the modified artist names at '&' and ',' and search again. If none of these steps results in a valid song, we simply search for the song title and hope for the best.
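The fallback order above can be sketched as a pure function that generates the successive (artist, title) queries. The function name and structure are illustrative assumptions; the actual Genius search calls are omitted:

```python
import re

def search_queries(artist, title):
    """Return (artist_query, title) pairs in the fallback order
    described above. Illustrative sketch only."""
    queries = [artist]                                           # 1. full artist name
    first = re.split(r"\s+(?:featuring|feat\.?|ft\.?|with)\s+",  # 2. first partition
                     artist, flags=re.IGNORECASE)[0]
    queries.append(first)
    simplified = re.sub(r"\([^)]*\)", "", first)                 # 3. drop parentheses,
    simplified = re.sub(r"\band\b", "&", simplified,             #    'and' -> '&'
                        flags=re.IGNORECASE).strip()
    queries.append(simplified)
    queries.extend(p.strip() for p in re.split(r"[&,]", simplified) if p.strip())  # 4.
    queries.append("")                                           # 5. title only
    return [(q, title) for q in queries]
```

Each pair would be tried against the Genius search in turn until a valid song is returned.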

Immediately after loading a song, we make sure it is actually a song. As Genius also hosts texts that are not song lyrics, we filter out entries with specific genres/tags. We used the following list of bad genre patterns to avoid those: ['track\\s?list', 'album art(work)?', 'liner notes', 'booklet', 'credits', 'interview', 'skit', 'instrumental', 'setlist', 'non-music', 'literature'].
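A minimal sketch of this filter, assuming each song carries a list of lowercased tags (the helper name is our own):

```python
import re

# Patterns of tags indicating a Genius page that isn't real song lyrics.
BAD_GENRES = ['track\\s?list', 'album art(work)?', 'liner notes', 'booklet',
              'credits', 'interview', 'skit', 'instrumental', 'setlist',
              'non-music', 'literature']
BAD_GENRE_RE = re.compile('|'.join(BAD_GENRES))

def is_real_song(tags):
    """Return True only if none of the (lowercased) tags matches a bad genre."""
    return not any(BAD_GENRE_RE.search(tag) for tag in tags)
```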

Before all the raw data was gathered, the last step was to separate all artists for each song. This was done using regex to find and split artists at ',', 'and', 'featuring' and so on. As a result, the artists Megan Thee Stallion & Dua Lipa for the song Sweetest Pie were changed to [Megan Thee Stallion, Dua Lipa], and the artists Lil Durk Featuring Gunna for the song What Happened To Virgil were changed to [Lil Durk, Gunna]. However, a negative side effect of this processing is that group names like Earth, Wind & Fire were changed to [Earth, Wind, Fire]. This was a necessary part of the preprocessing, and these kinds of artists were regrouped later in the data cleaning.

Manual lookup of songs

When collecting data for each song through the modified LyricsGenius API, we would retrieve five attributes: date of release, artists who collaborated on the song, lyrics, genres and the song title. The data looks as follows:

| released | artists | lyrics | genres | title |
|---|---|---|---|---|
| 1957 | [marty robbins] | El Paso Lyrics\nOut in the West Texas town of ... | [country] | El Paso |
| 1960-01-04 | [frankie avalon] | Why Lyrics I'll never let you go\nWhy? Because ... | [pop] | Why |
| 1959 | [johnny preston] | Running Bear LyricsOn the bank of the river\nS... | [pop] | Running Bear |
| 1960-01-04 | [freddy cannon] | Way Down Yonder in New Orleans LyricsWell, way ... | [pop] | Way Down Yonder in New Orleans |
| 1960-01-04 | [guy mitchell] | Heartaches by the Number Lyrics\nHeartaches by... | [country, cover] | Heartaches by the Number |

2. Basic stats

Data Cleaning

At this point, we had all the raw data, but it was apparent that a lot of cleaning still had to be done despite our efforts during the data gathering.

Unwanted characters and non-English songs

First, unwanted Unicode characters like \u200b, \u200c and \u200e, which had slipped in when the data was loaded, were removed from the artists, genres and lyrics. Next, duplicates were removed, and songs not in English were filtered out using language detection with the Python module langdetect.

As can be seen in the table above, each song's lyrics begin with the song title followed by 'Lyrics'. This was also removed, as it isn't part of the actual lyrics but rather an artefact of gathering the song info through the Genius API.
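The two cleaning steps just described (stripping zero-width characters and the '&lt;Title&gt; Lyrics' prefix) can be sketched as follows; the helper name is our own, and the sketch assumes the artefact always starts with the song title:

```python
import re

# Translation table that deletes the zero-width characters.
ZERO_WIDTH = dict.fromkeys(map(ord, '\u200b\u200c\u200e'))

def clean_lyrics(title, lyrics):
    """Strip zero-width characters and the leading '<Title> Lyrics'
    artefact from a raw lyric string."""
    lyrics = lyrics.translate(ZERO_WIDTH)
    return re.sub(r'^' + re.escape(title) + r'\s*Lyrics\s*', '', lyrics).strip()
```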

Create a list of all unique genres

Check if a song is non-English or has no lyrics

Counting the number of songs:

Removing long songs

Afterwards, we decided to remove all songs whose lyrics were longer than 10,000 characters. Despite all the aforementioned cleaning, entire book chapters by the French novelist Marcel Proust, for example, were still present in the dataset because they were labelled with the genre rap. The cut-off at 10,000 was chosen because all longer songs we investigated had clearly been loaded incorrectly. For comparison, the 6-minute-long song Rap God by Eminem, where he flexes his ability to rap fast, contains 7,984 characters.

While doing a finer combing of the data, we also produced a blocklist of artists we deemed unwanted in the data set. This list includes Glee Cast, as they were present in over 200 songs even though their songs are covers of other popular songs. The full list is: ['highest to lowest', 'marcel proust', 'watsky', 'glee cast', 'harttsick', 'eric the red', 'fabvl', 'c-mob', 'hampered'].

Regrouping artists

As mentioned earlier, after gathering the data, we had to separate all artists to work with them properly, though in some cases this splits one artist into multiple, as was the case with Earth, Wind & Fire. To mitigate this problem, we first calculated how many times each artist appeared in the data set and, afterwards, for each artist, how many times they appeared with each collaborating artist. Knowing these values, we could then, for each artist, check which other artists appear on all of their songs. Artists found using this method were then joined with an underscore, such that ['earth', 'wind', 'fire'] became ['earth_fire_wind'].
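The regrouping idea can be sketched as follows, under the simplifying assumption that the input is just a list of per-song artist lists; this is a sketch of the idea, not the project's actual code:

```python
from collections import Counter
from itertools import combinations

def regroup_artists(songs):
    """Merge artists that only ever appear together, e.g.
    ['earth', 'wind', 'fire'] -> ['earth_fire_wind']."""
    appearances = Counter(a for artists in songs for a in artists)
    together = Counter()
    for artists in songs:
        for a, b in combinations(sorted(set(artists)), 2):
            together[(a, b)] += 1

    def always_together(a, b):
        # True if a and b appear on exactly the same set of songs.
        key = tuple(sorted((a, b)))
        return together[key] == appearances[a] == appearances[b]

    regrouped = []
    for artists in songs:
        groups = []
        for a in artists:
            for g in groups:
                if all(always_together(a, b) for b in g):
                    g.append(a)
                    break
            else:
                groups.append([a])
        regrouped.append(['_'.join(sorted(g)) for g in groups])
    return regrouped
```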

Preliminary look at the data

After all data processing and cleaning, the final data set comprises 25,419 songs and 7,855 unique artists. In the table below, the three data sets used throughout the project can be seen and downloaded.

| Data Set | Songs | Size (MB) |
|---|---|---|
| Billboard List | 29,128 | 1.6 |
| Pre-cleaned | 29,128 | 92.5 |
| Cleaned | 25,419 | 44.2 |

From this figure, we can see that Drake has the most songs on the Billboard 'Hot-100' list. There's good diversity in the type of artists with the most songs on the list, but they mainly fall into the rap, R&B or pop genres.

Creating a list of all unique genres and plotting the number of songs in each genre

Most songs fall into the pop genre, with rock, r&b and rap taking 2nd to 4th place. This is not surprising, as all these genres have been hugely popular since 1960. Rap, however, only rose to chart prominence in the 1990s but has been a staple in the music industry since then.

And doing the same for decades:

A quick look at the distribution of songs through the decades shows us that many old songs made it to the list, with the 1960s having more songs than any other decade on the 'Hot-100' list. The 2010s saw a steep increase in the number of songs on the list compared to previous decades. Perhaps there was a shift in what kind of music we listened to.

Characteristics of the data

The data has now been gathered and thoroughly cleaned, but before we apply our network science and text analysis techniques to it, we will first look at the ten characteristics of Big Data:

Big

As mentioned previously, the data set comprises 25,419 songs and 7,855 unique artists; in addition, the lyric corpus has a total size of 8,476,446 tokens, of which 74,915 are unique. A data set of this size, with this type of information, would be tough to come by other than by scraping the internet.

Always-on

Billboard updates their 'The Hot 100' chart each week, which means the list has been updated since we first collected the data. Because the chart updates weekly, the data set can be refreshed 52 times a year, making the data longitudinal; but since it only updates weekly, and not continuously like, e.g., Twitter, it is not entirely always-on.

Non-reactive

Reactivity describes whether subjects know researchers are observing them because that might change the subjects' behaviour. All musical artists are most likely aware that they are present on the chart and might follow their ranking closely, but the question is how much they change their behaviour and musical style to get a higher ranking on the chart. One could speculate that some artists might change their use of words and language to appeal to a broader audience to perform better on the chart, while others follow their musical heart. Though, with this being said, we do not believe that the fact that researchers might also be looking at the chart with the intent to do network science and text analysis will change the behaviour of the artists.

Incomplete

Completeness expresses whether the data set manages to capture the entire unfolding of a specific event or, e.g., the entire network of a specific group. In this project, we are attempting to analyse the network and text of the most popular artists and songs through modern times. With this in mind, we believe that using Billboard's 'The Hot 100' chart gives a good indication of the most popular artists and songs, though one could argue that the chart might be skewed towards music popular in the United States.

Inaccessible

The data used in this project is very much accessible. As was accounted for earlier on this page, everything has been downloaded freely off the internet via different APIs.

Nonrepresentative

Representativity denotes whether the data can generalise to, e.g., social phenomena more generally - out-of-sample generalisation. To this end, being a musician is quite a unique occupation when it comes to a social network of collaboration, in comparison to, e.g., a profession like acting. One could presume the typical actor is more connected than the typical musician, since many actors are associated with a movie or tv-show, while often not many musicians are working on a song. At least, not many musicians are shown as the artists on a given song, even though many people might have worked on it during songwriting and musical production. Additionally, since our data set only contains songs in English from a popular music chart in the West, the data might not be suited for generalisation of the network, or text, for musicians from other parts of the planet. With this being said, the data set is probably still perfectly applicable for within-sample comparisons.

Drifting

There is some systemic drifting in the data set, as the way songs were picked for the 'Hot-100' list has changed since its inception back in 1958. Originally, songs were selected purely based on how well they sold. Still, as the music industry evolved and radio, tv, and streaming started becoming more prevalent, all these factors are now considered when songs are picked for the list.

Algorithmically confounded

As the songs are only picked from the Billboard 'Hot-100' list, there is some algorithmic confounding going on. What is meant by this is that we don't know precisely how Billboard's algorithm for selecting the songs for their charts works.

Dirty

The data set could be dirty as some songs could still be loaded wrongly, or we might have missed something via the cleaning. Furthermore, the data is not a complete overview of the connections between artists or their language, as we only chose songs that appeared on the 'Hot-100' list.

Sensitive

The data is not sensitive: it contains no information that isn't already public, and it consists only of elementary facts such as release year, song title and song artists.

3. Tools, theory and analysis

Network

This section of the notebook will go through the network analysis of the data. We have used networkx to build the networks and netwulf to visualise them. In the following sections, we will be investigating the full network of all musicians and a subset of them based on selected genres. The networks will be studied by calculating different statistics, such as the number of nodes, number of links, density, clusterings, etc. In addition, we will look at community detection to see how well the different genres manage to partition the networks into communities compared to the Louvain algorithm for community detection.

Network visualisation config.

Creating the full network

Calculate all genres associated with each artist, as well as how many songs they have made in each genre.

Creating a list of 20 genres from which each artist can get their main genre label. In addition, a colour list to colour each node based on their main genre.

Calculate number of songs each artist has in the data set as well as how many times they have collaborated with other artists.

Add nodes

Add each artist as a node with three attributes

genre: most common genre for that artist within the fixed list 'genre_list'

size: number of times the artist has appeared on Billboard's the hot 100 (used to give each node the correct size)

all_genres: all genres associated with that artist

group: the colour of the genre associated with the artist

If an artist has multiple most common genres, meaning that they have, e.g., made five pop songs and five rock songs, that artist's genre attribute will be picked at random amongst the most common genres. An exception is rap and trap; because trap is a subgenre of rap (but still a significant and well-defined genre), we deem it more appropriate to label artists as trap if they have an equal number of rap and trap songs.
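The tie-breaking rule can be sketched as a small helper; `main_genre` is a hypothetical name, and the sketch assumes a Counter of genre → song count per artist:

```python
import random
from collections import Counter

def main_genre(genre_counts, rng=random):
    """Pick an artist's main genre from a Counter of genre -> song count.
    Ties are broken at random, except that 'trap' wins a tie with 'rap'."""
    top = max(genre_counts.values())
    candidates = [g for g, c in genre_counts.items() if c == top]
    if set(candidates) == {'rap', 'trap'}:
        return 'trap'
    return rng.choice(candidates)
```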

Add edges

Add edges between two artists if they have collaborated on a song and weigh the edge by the number of times they have collaborated.
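A sketch of this edge-building step with networkx, assuming the songs are given as lists of artist names (the helper name is our own):

```python
from itertools import combinations

import networkx as nx

def add_collaboration_edges(G, songs):
    """Add an edge per collaborating artist pair, weighted by the
    number of songs they share. `songs` is a list of artist lists."""
    for artists in songs:
        for a, b in combinations(sorted(set(artists)), 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1
            else:
                G.add_edge(a, b, weight=1)
    return G

G = add_collaboration_edges(
    nx.Graph(),
    [['drake', 'future'], ['drake', 'future'], ['future', 'lil durk']])
```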

Helper functions

Deciding which genre networks to analyse

It was previously decided that each artist could get their main label based on genre_list. Though, analysing and visualising 20 different networks can get a bit cumbersome, so we will be picking out a subset of these. To do this, we will first find out how many artists have each genre as their primary genre and how many times each genre has occurred in total.

The genres we have decided to pick out are based on the number of times these genres occur and genres we deem interesting. Based on the results seen above, the following 11 genres' networks will be analysed:

pop, rap, rock, R&B, country, soul, ballad, hip-hop, trap, singer-songwriter and funk.

Analysis

The full network has now been created, and we are ready to do visualisations and analyses. In the following sections, we will be working with the full network and the sub-networks described above. We will be investigating each network both in full and in a version where singleton nodes with fewer than five songs are removed.

The reasoning for only removing singleton nodes with fewer than five songs is that we want to make the networks as clear as possible while still retaining the singleton artists who are influential for the genre at hand.
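A sketch of this pruning step, assuming each node stores its song count in a 'size' attribute as described in the node-attribute section (the helper name is our own):

```python
import networkx as nx

def remove_small_singletons(G, min_songs=5):
    """Drop singleton nodes (degree 0) whose 'size' attribute,
    i.e. number of songs, is below min_songs."""
    H = G.copy()
    H.remove_nodes_from([n for n, deg in G.degree()
                         if deg == 0 and G.nodes[n].get('size', 0) < min_songs])
    return H

G = nx.Graph()
G.add_node('one-hit wonder', size=1)    # singleton, few songs: removed
G.add_node('prolific loner', size=12)   # singleton, many songs: kept
G.add_edge('collab a', 'collab b')      # connected nodes: kept
H = remove_small_singletons(G)
```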

NB: Networks are not meant to be looked at here in the notebook but rather in the network section on the website.

With singletons

From these basic statistics we see that the number of nodes in the networks is 7854 and the number of links is 6799.

The density of an undirected graph is given by:

\begin{align} d=\frac{2m}{n(n-1)}, \end{align}

where $m$ is the number of edges, and $n$ is the number of nodes. The interpretation of the measure is that the density is 0 for a graph without edges and 1 for a completely connected graph and is, therefore, a measure of how dense a graph is with respect to edge connectivity. In this case, the network has a density of 0.00022. This can be a little hard to interpret, which is why we've also calculated the average clustering coefficient, which is given by:

\begin{align} \overline{C}=\frac{1}{N} \sum_{i=1}^N \frac{2L_i}{k_i(k_i-1)}, \end{align}

where $L_i$ is the number of links between the $k_i$ neighbours of node $i$. The interpretation of this measure is the probability that two neighbours of a randomly selected node link to each other. For this network, we have an average clustering coefficient of 0.16.

Lastly, we see that the average degree of the nodes in the graph is 1.73, which means that a node on average is connected to 1.73 other nodes. We also see that the minimum, median and mode of degrees are 0, whereas the maximum degree is 108.
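The statistics above can be reproduced with networkx; a minimal sketch, shown here on networkx's built-in karate club graph rather than our artist network:

```python
import networkx as nx

def basic_stats(G):
    """Nodes, links, density (2m / n(n-1)), average clustering and
    average degree (2m / n), matching the statistics reported above."""
    n, m = G.number_of_nodes(), G.number_of_edges()
    return {
        'nodes': n,
        'links': m,
        'density': 2 * m / (n * (n - 1)),
        'avg_clustering': nx.average_clustering(G),
        'avg_degree': 2 * m / n,
    }

stats = basic_stats(nx.karate_club_graph())
```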

Analysis of degrees

We will now analyse the degrees of the network a bit more thoroughly by looking at the distribution of degrees on a log-log scale. The reasoning for this is that a common feature of real-world networks is hubs - meaning that a few nodes in a network are highly connected to other nodes. Scale-free networks are networks with large hubs, and a power-law degree distribution characterises such networks.
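A sketch of how the degree distribution can be prepared for such a log-log plot (the pairs can then be passed to, e.g., matplotlib's plt.loglog); the helper name is our own:

```python
from collections import Counter

import networkx as nx

def degree_distribution(G):
    """Return sorted (degree, count) pairs, ready for a log-log plot."""
    counts = Counter(d for _, d in G.degree())
    return sorted(counts.items())

# Toy example: a star graph has one hub of degree 5 and five leaves of degree 1.
dist = degree_distribution(nx.star_graph(5))
```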

Looking at the figure above, we see that the network's degree distribution does follow a power law, which gives a good indication that we are dealing with a real-world network rather than a random one.

Community detection

In this section, we will explore the communities of the network. To do this, we are looking at the partition obtained when grouping the artists by their genre. This will be compared to the partition obtained using the Louvain algorithm. To indicate whether the two partitions are good at dividing the network into modules, both of them will then be compared to random networks based on the real network. When making this comparison, we can see if the modularity of the two partitions is significantly different from 0.

First off, we will be getting the partitions based on the genres.

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.
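A sketch of how both modularities can be computed with networkx (which provides `louvain_communities` from version 2.8), assuming nodes carry a 'genre' attribute; the toy graph here is our own example, not the artist network:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

def genre_partition(G, attr='genre'):
    """Group nodes into communities by a node attribute."""
    groups = {}
    for n, data in G.nodes(data=True):
        groups.setdefault(data[attr], set()).add(n)
    return list(groups.values())

# Toy example: two triangles joined by a bridge, genres matching the triangles.
G = nx.Graph([('a', 'b'), ('b', 'c'), ('a', 'c'),
              ('d', 'e'), ('e', 'f'), ('d', 'f'), ('c', 'd')])
nx.set_node_attributes(
    G, {**{n: 'rap' for n in 'abc'}, **{n: 'pop' for n in 'def'}}, 'genre')

genre_mod = modularity(G, genre_partition(G))
louvain_mod = modularity(G, louvain_communities(G, seed=0))
```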

We initially see that the Louvain algorithm's modularity is more than twice as large as when using the genres.

Building random networks for comparison

Next up, we will generate 1000 random networks using the double edge swap algorithm. This makes each node in the new random network have the same degree as it had in the original network, but with different connections. For each of these random networks, we will partition them using the genres and calculate their modularities. We perform 1.2 × (number of edges) swaps to ensure we get an entirely randomised version of the network.
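A sketch of this degree-preserving rewiring with networkx's `double_edge_swap` (the helper name and the karate club example graph are our own):

```python
import networkx as nx

def random_rewiring(G, factor=1.2, seed=0):
    """Degree-preserving randomisation of G via double edge swaps,
    using factor x (number of edges) swaps as described above."""
    R = G.copy()
    m = R.number_of_edges()
    nx.double_edge_swap(R, nswap=int(factor * m), max_tries=100 * m, seed=seed)
    return R

G = nx.karate_club_graph()
R = random_rewiring(G)
```

Because only endpoints are swapped, every node keeps its original degree while the wiring is randomised.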

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected: as the networks are random, we shouldn't find any good partition using the genres.

To understand the genre partition and the Louvain algorithm partition, we will plot the distribution of the configuration model's modularity alongside the genre partition's modularity and the Louvain algorithm partition's modularity.

The figure above shows that both partitioning methods lead to a modularity significantly different from 0 and thereby larger than any of those from the random networks. We can thus deem that the network is not random through the modularity measure. Though, as touched upon previously, the modularity of the networks partitioned using the Louvain algorithm is more than twice the size of the genre partition. To understand how this partition looks, we will be visualising the graph with the Louvain partitioning.

Noticeable here is that the Louvain algorithm also groups many of the rap, pop, rock and country artists together into four separate groups, though in general, also a lot more groups are seen. Let's see just how many groups:

We see that the Louvain algorithm partitions the network into 4994 groups, which is an enormous number compared to the 7854 nodes in the graph. An explanation for this is that many singleton nodes are probably given their own group, which yields a high modularity but doesn't make much sense compared to partitioning by genre.

Betweenness centrality

As mentioned previously, we have decided to weigh the nodes in the network by the number of songs that the artist has in the data set. The advantage of this is that the most popular artists will be the easiest ones to see; this is especially the case for older artists who haven't collaborated as much - such as Elvis Presley or The Beatles. Artists like these would be virtually invisible if we weighted the nodes by the strength of their connections. Weighing nodes by the strength of their connections, however, tells a great deal about which nodes are the biggest collaborators and, thereby, some of the most central nodes in the graph.

We will therefore, in this section, deal with betweenness centrality, which measures how central each node in a graph is. The measure is based on shortest paths: the betweenness centrality of a node is the sum, over all pairs of other nodes, of the fraction of shortest paths between the pair that pass through the node. The formula for betweenness centrality is given by:

\begin{align} BC(n)=\sum_{s\neq v \neq t} \frac{\sigma_{s,t}(n)}{\sigma_{s,t}}, \end{align}

where $\sigma_{s,t}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{s,t}(n)$ is the number of those paths that pass through $n$.

Combining this with weighing the artists by the number of songs they have in the data set will give us a great overview of the most popular artists and the most central, collaboratory, and connective artists.
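As a small self-contained illustration of betweenness centrality (on a toy graph of our own, not the artist network), consider a path graph, where the middle node lies on the most shortest paths:

```python
import networkx as nx

# In the path graph 0-1-2-3-4, node 2 lies on every shortest path
# between the two halves, so it has the highest betweenness centrality.
G = nx.path_graph(5)
bc = nx.betweenness_centrality(G)  # normalised by default
most_central = max(bc, key=bc.get)
```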

Having calculated the betweenness centrality for each node, we see many rappers present in the top-20. This is not too surprising given the number of rap artists, their tendency to collaborate and the graph we were looking at earlier. Though we also see names like Quincy Jones, James Ingram and Stevie Wonder - it is interesting to see those artists playing a central part in the network.

Without singletons

The next part of the analysis for the full network is the version where we will be removing singleton nodes with less than five songs. The following section will go through the same steps as the complete network, so not everything will be described with the same level of detail.

Properties

Calculate basic statistics for the network

We have now gone down from 7854 to 4154 nodes compared to the full network while keeping the same number of edges. As expected, all the other network properties have gone up, meaning that with a larger density, avg. clustering and average degrees, we should now see a more densely connected network.

Analysis of degrees

Looking at the figure above, we again see that the degree distribution follows a power law.

Community detection

We will again build network communities using both the genres and the Louvain algorithm, both of which will be compared to random networks.

First off, we will be getting the partitions based on the genres.

We here see a modularity that is exactly the same as before. The formula for the modularity is given by (cf. eq. 9.12 of the Network Science book):

\begin{align} M= \sum_{c=1}^{n_c}\left[ \frac{L_c}{L}-\left(\frac{k_c}{2L} \right)^2 \right], \end{align}

where $n_c$ is the number of communities, $L_c$ is the number of links in community $c$, $L$ is the total number of links in the network, and $k_c$ is the total degree of community $c$. Therefore, this means that the modularity doesn't depend at all on the number of nodes, and since these are the only things removed from the full network, the modularity doesn't change.

We will now be calculating the network's modularity based on the partitioning obtained using the Louvain algorithm.

We initially see that the modularity obtained using the Louvain algorithm is almost the same as for the full network (0.7440). The small difference is due to the Louvain algorithm being non-deterministic and not guaranteed to find the optimal partition. So, as for the full graph, the modularity of the Louvain partition is more than twice that of the genre partition.

Building random networks for comparison

Next up, we will generate 1000 random networks using the double edge swap algorithm. For each of these random networks, we will partition them using the genres and calculate their modularities.

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected: as the networks are random, we shouldn't find any good partition using the genres.

To understand the genre partition and the Louvain algorithm partition, we will plot the distribution of the configuration model's modularity alongside the genre partition's modularity and the Louvain algorithm partition's modularity.

The figure above shows that both partitioning methods lead to a modularity significantly different from 0 and thereby also larger than any of those from the random networks. Though, as touched upon previously, the modularity of the networks partitioned using the Louvain algorithm is more than twice the size of the genre partition. To understand how this partition looks, we will be visualising the graph with the Louvain partitioning.

As with the previous Louvain graph, the algorithm groups the main clumps of nodes together quite well. However, noticeable is that the rappers are divided into two groups (light green and black).

Let's see how many groups we have in this partitioning:

We here see that the Louvain algorithm partitions the network into 1293 groups, which is far fewer than the 4994 of the last Louvain network. This means that the number of communities is reduced by 4994 - 1293 = 3701. Having lost 7854 - 4154 = 3700 nodes when removing singletons, this confirms that the Louvain algorithm gives all singleton nodes their own community.

Having now examined the full network for all genres for the musical artists, we will be moving on to analysing some of the most popular genres that we think are interesting.

Pop network

We're here looking at the network of artists who have at least one song with the tag pop in the data set. The size of the nodes is determined by the number of songs they have with the tag pop.

With singletons

Properties

Calculate basic statistics for the network

In comparison to the full network, the pop network has approximately 3000 fewer nodes and 2900 fewer links, but the density, average clustering and average degree haven't changed all that much.

Community detection

In this section, we will explore the communities of the pop network. We will go through the same steps as previously. First off, we will be getting the partitions based on the genres.

We here see a modularity which is lower than what it was for the full network.

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

The Louvain partition's modularity is seen to be quite a lot larger than the genre modularity.

Building random network for comparison

Next up, we will be generating 1000 random networks using the double edge swap algorithm. For each of these random networks, we will partition them using the genres and calculate their modularities.

We see that the mean and standard deviation of the modularity are approximately 0, which is to be expected: as the networks are random, we shouldn't find any good partition using the genres.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity alongside the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both partitioning methods lead to a modularity significantly different from 0 and thereby also larger than any of those from the random networks. Though, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is much larger than the genre partition. To understand how this partition looks, we will be visualising the graph with the Louvain partitioning.

Noticeable here is that the Louvain algorithm manages to divide the pop artists into communities that make decent sense. E.g. some of the rappers and R&B artists are grouped as red nodes, whereas female artists like Taylor Swift are seen in very light green and other artists like Beyoncé and Rihanna in light green. Very interesting.

Let's see the communities we have in total:

We here see that the Louvain algorithm partitions the network into 3328 groups, which is quite a lot compared to the 4802 nodes in the graph. Again, the large number of singleton nodes is likely the explanation.

Without singletons

This brings us to the next analysis for the pop network: the version where we remove singleton nodes with fewer than five songs. The following section will go through the same steps as previously.

Properties

Calculate basic statistics for the network

Compared to the pop network with singletons, we have gone down from 4802 to 2218 nodes while keeping the same number of edges. As expected, all the other network properties have gone up, meaning that with a larger density, average clustering and average degree, we should now see a more densely connected network.

Community detection

We will again detect communities of the network using both the genres and the Louvain algorithm, both of which will be compared to random networks.

First off, we will be getting the partitions based on the genres.

We will now be calculating the modularity of the network based on the partitioning obtained using the Louvain algorithm.

We initially see that the modularity obtained using the Louvain algorithm is almost the same as for the full network (0.7053); the small difference is due to the Louvain algorithm being a non-deterministic heuristic rather than an exact optimiser. So, as for the full graph, the modularity of the Louvain partition is more than twice that of the genre partition.

Building random networks for comparison

Next up, we will generate 1000 random networks using the double edge swap algorithm. We will partition each of these random networks using the genres and calculate their modularities.
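The core of the double edge swap is a degree-preserving rewiring step: pick two edges (u, v) and (x, y) and replace them with (u, x) and (v, y). Below is a minimal pure-Python sketch of one such step on a toy edge list; the notebook relies on networkx's `double_edge_swap` for the real networks.

```python
import random

random.seed(0)
edges = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]  # toy edge list

def one_swap(edge_list):
    """Pick two edges (u, v) and (x, y) and rewire them to (u, x) and
    (v, y), provided this creates no self-loop or duplicate edge."""
    es = set(edge_list)
    for _ in range(100):  # retry until a valid swap is found
        (u, v), (x, y) = random.sample(sorted(es), 2)
        if len({u, v, x, y}) < 4:
            continue  # would create a self-loop
        if {(u, x), (x, u), (v, y), (y, v)} & es:
            continue  # would create a duplicate edge
        es -= {(u, v), (x, y)}
        es |= {(u, x), (v, y)}
        break
    return sorted(es)

swapped = one_swap(edges)
print(swapped)  # same degree sequence, different wiring
```

Because every node keeps its degree, repeating this step many times yields a random network that is directly comparable to the original, which is exactly what makes the modularity comparison fair.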

We see that the mean and standard deviation of the modularity are effectively 0, which is to be expected: the networks are random, so the genre partition should not capture any meaningful community structure.

To get an overview of the genre partition and the Louvain algorithm partition, we will now plot the distribution of the configuration model's modularity alongside the genre partition's modularity and the Louvain algorithm partition's modularity.

Looking at the figure above, we see that both partitioning methods lead to a modularity significantly different from 0 and larger than any of those from the random networks. However, as touched upon previously, the modularity of the network partitioned using the Louvain algorithm is much larger than that of the genre partition. To understand what this partition looks like, we will visualise the graph with the Louvain partitioning.

As with the previous Louvain graph, the algorithm manages to group the main clumps of nodes together quite well.

Let's see how many groups we have in this partitioning:

We here see that the Louvain algorithm partitions the network into 740 groups, far fewer than the 3328 of the last Louvain partition. The number of communities is reduced by 3328 - 740 = 2588, and since we lost 4802 - 2218 = 2584 nodes when removing singletons, we again see that the Louvain algorithm gives practically every singleton node its own community.

Retrieving statistics and visualisations for the remaining genres

For the remaining genres (rap, rock, R&B, country, soul, ballad, hip-hop, trap, singer-songwriter and funk), we will gather statistics and make visualisations of the networks with and without singletons, using both the genre community partition and the Louvain community partition, as this information will be used on the website. However, these results will not be shown here in the notebook, as they would simply take up too much space.

The following function takes a genre and a graph, then computes and saves statistics and a network visualisation for both the genre partition and the Louvain partition, for the graph with and without singletons.

Text Analysis

This part of the notebook will contain different analyses of the song lyrics. The main methods are TF-IDF scores (used to create word clouds), sentiment analysis, dispersion plots, and lastly LSA, which will be performed to calculate similarities between artists. Most of these methods will be applied in multiple scenarios. In general, the songs will be analysed with respect to the decade in which they were released and also according to the genre to which they belong.

Preprocessing lyrics

Before conducting any analysis, the lyrics are preprocessed to prepare the data. All lyrics are tokenized and lemmatized using nltk, and all tokens containing a non-alphabetic character are removed. All characters are made lowercase, and for every song, each word is only counted once. This is done since it is typical for songs to contain a lot of repetition (as it makes the lyrics easier to remember).
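A simplified sketch of that pipeline, with a plain `split` standing in for nltk's tokenizer and the lemmatization step omitted; the lyric snippet is made up:

```python
# Toy lyric line (not from the dataset) run through a simplified
# version of the preprocessing: lowercase, alphabetic tokens only,
# each word counted once per song.
lyrics = "Oh baby, baby / Oh baby, baby / How was I supposed to know?"

tokens = [w.lower() for w in lyrics.replace("/", " ").split()]
tokens = [w.strip(",?.!") for w in tokens]   # crude punctuation stripping
tokens = [w for w in tokens if w.isalpha()]  # drop non-alphabetic tokens
unique_tokens = sorted(set(tokens))          # one count per word per song
print(unique_tokens)
```

Deduplicating per song is what keeps a chorus repeated eight times from dominating the term frequencies.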

Fraction of genres pr. decade

Since the data stems from the Billboard hot 100 chart, it is possible to show how dominant some genres have been over time. The figure below shows how much of the music on the chart was labelled as the given genre in each decade. Note that most songs have plenty of genre tags, so the ratios do not sum to 1 (also, only the most popular genres are shown).

This graph and the table above illustrate a clear trend. Pop has been dominating for a long time, but since 2010 rap has overtaken the throne. Nowadays, even a "subgenre" of rap, namely trap, has become more popular than pop. Another interesting fact is that rock has almost completely vanished from the charts in the last decade, whereas folk has remained consistent throughout time. This graph also illustrates when rap started gaining traction in the US around the eighties.

TF-IDF & Wordcloud

The TF-IDF (term frequency, inverse document frequency) score measures how characteristic a term is of a document. In this study, terms are words in the song lyrics, and documents can be decades, genres or artists, depending on the scenario we are interested in analysing. The TF is simply how many times a given term occurs in the document, and the IDF is a measure of how unique the term is, given by:

\begin{equation} \text{idf}(t, D) = \log\left(\frac{N}{|\{d\in D : t\in d\}|}\right) \end{equation}

where $t$ is a term and $D$ is the set of documents, denoted as the corpus. The TF-IDF is the product of TF and IDF, meaning that terms are most important if they frequently occur in the given document while appearing in few other documents.
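As a worked toy example of the formula (the corpus below is invented, not the song data): TF is the raw count of the term in a document, and IDF is computed exactly as above.

```python
import math

# Three tiny "documents" (genres) as word lists; purely illustrative.
docs = {
    "pop":  ["love", "love", "dance", "night"],
    "rap":  ["money", "love", "street"],
    "rock": ["guitar", "night", "street"],
}
N = len(docs)  # number of documents in the corpus

def tfidf(term, doc):
    tf = docs[doc].count(term)                       # raw term frequency
    df = sum(1 for d in docs.values() if term in d)  # document frequency
    return tf * math.log(N / df)

print(round(tfidf("love", "pop"), 3))   # frequent, but shared with rap
print(round(tfidf("dance", "pop"), 3))  # occurs once, but unique to pop
```

Note how "dance" can outscore a more frequent word once the IDF penalty for appearing in other documents kicks in; that is the effect that surfaces genre-specific vocabulary in the word clouds.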

Genres

The data contains 582 genres. Many of these are sub-genres of the main genres we all know and love. Notably, many songs are tagged with several different genres. This is handled by assigning the song to all genres it is tagged with. This creates some overlap between the genres, but this is only an issue for subgenres. Using all genres is thus not desirable, since it is not relevant how pop relates to dance-pop or alternative-pop, but it is relevant how pop relates to rap and rock. Therefore, the genres which constitute the corpus were hand-selected from those which appear the most from 1960 to 2022.

NOTE: The next section's output has been limited so as not to clutter the notebook too much. If you want to see the full output, you can view it under the Genres part of the Text Analysis section on the webpage.

As is evident by the output above, the TF-IDF scores succeed in highlighting a lot of the characteristics of the different genres.

Wordclouds are useful for illustrating the important terms, since the importance corresponds to the font size of the term. This makes for a nice visual representation which grants a much clearer overview of the similarities and differences between documents (in this case, genres).

As a small note, the word clouds are displayed with masks of well-known musicians from the given genre. The original images are transparently overlaid to aid the image's clarity. These images are used on the website, and to avoid any ugly background, a helper function for removing backgrounds is implemented.

The masks have been chosen somewhat arbitrarily, but hopefully some artists are recognisable. Looking at the word cloud for country, an extremely clear tendency is evident: all terms with a significant TF-IDF score describe everyday activities relevant to farmers in the US and the like. The UK word cloud contains a lot of British slang such as mum, paigons, blud and ting, and the rap word cloud is all about the harsh language the genre is known for today.

Decade

The same procedure is then done while instead dividing the songs according to the decade in which they were released.

NOTE: The next section's output has been limited so as not to clutter the notebook too much. If you want to see the full output, you can view it under the Decades part of the Text Analysis section on the webpage.

In the '60s, '70s and '80s, most words are completely normal words which everyone might use in their everyday life. Some are perhaps more expressive than ordinary speech, but they are still real words. Also, some quite romantic words like tenderly are used. In the '60s, the word watusi was used a lot, as it is the name of a popular dance at the time. In the '70s, doggone is used a lot; in more recent times, it has been completely replaced with the term damn. In the seventies, the term nigger also has a high TF-IDF score, which is surprising, but the reason is that five different songs mention the word in the '70s and it is never mentioned in any other decade. In most of these songs, it is used to provoke.

The '90s almost seem like a transitioning time from the old school to the new school of mainstream music. That is when rap entered the music scene for good. In the '00s, mostly slang words fill the word cloud. These slang words are mainly attributed to the rap/hip-hop artists. Some examples are shawty and swag. Also, some of the most influential artists and producers appear, such as Ludacris and Darkchild.

Lastly, in the '10s and '20s, the word clouds are filled with ad-libs such as skrrt, brrt, ayy and baow, and modern slang/shorthands like opp meaning opponent, and hunnid meaning hundred.

Artists

Since there are 7855 artists in the dataset, the artists who will be considered in the corpus will be those who have managed to appear on the hot 100 chart at least ten times. This is done to achieve documents that actually can have different term frequencies for each term and also to show how well-known artists differ from each other in their use of words. Identically to the genres, some songs are shared by multiple artists (thank god for that; otherwise, there would be no network). This is handled in the same way, meaning that if two artists collaborate on a song, they both are assigned all the words in the song. This seems fair since putting one's name on a song automatically means you are associated with the whole song.

This is still quite a lot of musicians, so some of the most well-known artists have been selected for investigation. In total, there are 41 selected artists for whom a picture is available, making the word clouds nicer to look at! These artists are:

NOTE: The output of the next section has been limited in order not to clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

These word clouds tell much the same story as those of the genres and the decades. Musicians from the sixties and seventies (although also regarded as pop artists) use vastly different language than the musicians who thrive in today's mainstream music scene. One example is Frank Sinatra, who uses many long and very expressive words such as inconceivable or reminding. Another word which shows signs of the time when Frank Sinatra published his music is the word musical, which certainly was a thing which was more popular back in the day.

The mainstream rappers such as Juice Wrld use many swearwords and ad-libs. Juice Wrld died of an overdose at a very young age, and it is no secret that he was an addict. This makes sense, since his word cloud is overrun with drug references.

Another good comparison is that Elvis uses the word darling a lot, whereas popular pop and rap artists nowadays use the words bitch and hoe A LOT more. It is also clear that the audience has changed significantly through the years.

Dispersion plot

Dispersion plots are interesting as they can give an indication of when certain words were used in music throughout time. As the data table is sorted according to the release date, it is simple to create a dispersion plot of all the songs. A small modification to the nltk dispersion_plot function had to be implemented to allow for the xticks to be the decades. The function for plotting dispersion plots with custom xticks is shown below with the appertaining dispersion plot of certain handpicked words, which illustrate a shift in the language of the mainstream music scene.
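The data behind such a plot is simple: for each target word, the indices of the (date-sorted) songs that contain it. A toy sketch with made-up songs, not the actual chart data:

```python
# Songs sorted by release date, each reduced to its set of words.
songs = [
    ("1965", {"darling", "love"}),
    ("1978", {"boogie", "funky"}),
    ("1995", {"bitch", "love"}),
    ("2018", {"skrrt"}),
]

targets = ["darling", "bitch", "boogie"]

# For each target word, the positions (song indices) where it occurs;
# these positions are what the dispersion plot draws as tick marks.
positions = {
    w: [i for i, (_, words) in enumerate(songs) if w in words]
    for w in targets
}
print(positions)
```

Mapping the song indices back to release dates is what the custom xticks modification accomplishes in the notebook.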

One can spend an endless amount of time on interesting terms that define specific periods. Thus, the dispersion plot above is far from an exhaustive account of the trends which came and went throughout the last six decades. However, it tells an interesting story and illustrates the beginning and end of eras.

For example, it seems almost as if the sweet word darling was phased out during the nineties and replaced with the more degrading word bitch. boogie and funky also illustrate the rise and fall of funk music. It almost seems from the plot that it died out a bit in the late eighties and then came back in the nineties.

As rap hit the mainstream in the early nineties, the word nigga became a fixed part of the rap songs made by black rappers. The words swag and shorty followed around 2000-2010 but have become less used in recent years.

The word watusi is included as it is the name of a specific dance which was popular in the sixties. That is also easy to see in the dispersion plot; it is rarely used after 1970.

Sentiment analysis

Next, the sentiment of the genres, decades and artists is investigated. Here the labMT Hedonometer data from class is used as a lookup table for the sentiment of terms. The sentiment score ranges from 0 to 10, where 0 is extremely negative and 10 is extremely positive. The words are stored in a dictionary with their corresponding sentiment scores to allow for fast lookups. Lastly, the sentiment of a document is computed as a weighted average of the sentiment of all words in the given document which have a sentiment score in the Hedonometer data frame. All other words are removed so that they do not count towards the average sentiment score; otherwise, they would count as 0, i.e. as the most negative word one could imagine. Another option is to set those words to have sentiment 5 (the middle of the scale), but that may create a bias, since the actual average of the sentiment scores in the Hedonometer data is not 5.
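A minimal sketch of that computation, with a tiny invented stand-in for the labMT table (the real scores differ) and a plain average over the scored words; note how words without a score are skipped rather than counted as 0:

```python
# Hypothetical mini lookup table; the real labMT data has ~10,000 words.
labmt = {"love": 8.42, "happy": 8.30, "alone": 3.24, "night": 5.84}

def avg_sentiment(words):
    """Average sentiment over the words that have a labMT score."""
    scored = [labmt[w] for w in words if w in labmt]  # skip unknown words
    return sum(scored) / len(scored) if scored else None

doc = ["love", "skrrt", "alone", "night"]  # "skrrt" has no labMT score
print(round(avg_sentiment(doc), 2))
```

Returning `None` for documents with no scored words makes the missing-data case explicit instead of silently biasing the average.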

Genre

Once again, the focus is on the genres previously defined as being the most popular through time.

The results of the sentiment analysis are not very surprising. Most genres have about equal sentiment, but rap and trap have the lowest sentiment scores, albeit still above the average sentiment of all the words in the Hedonometer data. The happiest genres are jazz, soul, funk and country, closely followed by pop.

Decade

The same procedure is carried out now, focusing on the decades. However, the sentiment for each month is also calculated, along with a rolling one-year average to illustrate the finer nuances of the trend in sentiment. The rolling average is a moving mean over a window of the plotted points; it essentially smooths the curve to highlight the general trend of the data.
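The rolling mean can be sketched in a few lines (the notebook uses pandas' `Series.rolling` on the real monthly sentiment; the values below are made up, with a window of 3 standing in for the 12 months of a year):

```python
def rolling_mean(values, window):
    """Moving mean over a fixed-size window; output is shorter than the
    input by window - 1, as the first full window starts at index window - 1."""
    out = []
    for i in range(window - 1, len(values)):
        chunk = values[i - window + 1 : i + 1]
        out.append(sum(chunk) / window)
    return out

monthly = [5.0, 6.0, 5.5, 6.5, 5.0, 6.0]  # invented monthly sentiments
print(rolling_mean(monthly, 3))
```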

The plot displays what has already been established: lyrics seem to have become less happy through time, especially in recent years. Of course, this can also be linked to the rise of angrier genres such as rap and its offspring trap. An example was seen in the dispersion plot, where darling was used until the nineties, when bitch replaced it.

Artist

NOTE: The output of the next section has been limited in order not to clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

The distribution in light blue is over all 7855 artists. The green distribution is only over the 735 top artists. The plots show the tendency of old pop artists such as The Beatles and Frank Sinatra to have happier lyrics. In contrast, rappers fall within the left part of the distribution with the lowest average sentiment. In the middle, we see a lot of popular pop artists from the last two decades.

LSA

Latent semantic analysis is a method for processing text in which the relationship between documents and terms is analysed. In particular, it will be used to compute similarity scores between artists. The aim is to uncover which artists are most alike and which are least alike. Perhaps it will even indicate artists who have used the same ghost-writers. Since songs with collaborations are assigned to all collaborating artists, collaborators will be a lot more likely to be similar. That does not, however, mean that the results will be uninteresting. Also, as mentioned before, one should think twice about putting their name on a song with lyrics that do not fit their agenda. Cosine similarity is used since all artists are mapped into a D-dimensional space, where D corresponds to the total number of words in the vocabulary. In this case, D = 50,697, which is a lot!
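Cosine similarity itself reduces to a dot product over normalised vectors. A toy sketch with a four-word vocabulary (in the notebook the vectors live in the full 50,697-dimensional space, and LSA reduces that dimensionality first):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical word counts over a shared four-word vocabulary.
artist_a = [3, 1, 0, 2]
artist_b = [2, 1, 0, 3]
artist_c = [0, 0, 5, 0]

print(round(cosine(artist_a, artist_b), 3))  # similar word usage
print(round(cosine(artist_a, artist_c), 3))  # no shared words -> 0.0
```

Because cosine similarity ignores vector length, a prolific artist and a one-hit artist can still score as similar if they favour the same words.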

To illustrate what can be done with this technique, the five artists most and least similar to Justin Bieber are shown above. The most similar artists are pop artists. Chris Brown and Drake belong to r&b and rap, respectively. However, it can indeed be argued that they are quite "pop-y". It should also be noted that Taylor Swift and Justin Bieber have not collaborated on a song, so the bias is not completely ruining the similarity scores. Looking at the least similar artists, it is a mix of different genres. K.A.A.N. is a rapper, and Kali Uchis is a modern r&b artist.

NOTE: The output of the next section has been limited in order not to clutter the notebook too much. If you want to see the full output, you can view it under the Artists part of the Text Analysis section on the webpage.

4. Discussion

Overall, we are quite satisfied with the results of the project. We have been able to find interesting attributes for collaborations of artists via our network analysis, and our text analysis shows how the language of the songs we listen to has changed throughout the years, but also from artist to artist and genre to genre.

The custom styling for the website that we created had a huge role in displaying the networks and text analysis parts without overwhelming the reader with a mile-long page. If time had permitted it, we would have liked to delve even deeper into the website, adding small features and making the layout even better. One such feature would be to search for artists by name and have their label appear in the networks.

Using the network theory from the course, we have created thorough analyses of the different networks for each genre. Furthermore, we expanded on the course material by calculating the betweenness centrality of the networks to see which artists were more collaborative or central than others. In addition to this, we also investigated LSA to find similar and dissimilar artists. One element of the network analysis which became apparent later on in the project was that the edges in the networks are not genre specific. This means that in the ballad network, Kanye West and Lil Wayne will have a connection since they have collaborated, but it was not necessarily on a song with the ballad tag. There is no doubt this would reduce the number of edges in the networks of genres which are not rap, pop and r&b, and that would also be interesting to investigate. Still, since the decision was made to scale nodes by the number of songs with the given genre, the significant nodes for a given genre still stand out the most.

Unfortunately, an early look into the lexical diversity of the lyrics did not show much. Thus, it was not prioritised as highly as the other text analysis aspects. Given more time, it would be interesting to look into this thoroughly. The tendency throughout all comparisons of lexical diversity was that artists, decades and genres were heavily dependent on the varying length of the appertaining documents.

Another interesting point is that the genres themselves might have changed through the decades. This is not something we have looked into, but it could be done with our data; at least the pop genre has a significant number of songs throughout all decades. For example, pop music has changed a lot through the years, starting as a mix of rock and R&B, becoming more disco-oriented, and today being heavily influenced by rap and electronic music. Our analysis only looked at mainstream music through each decade, which leaves out some information about how the individual genres evolved. The change in rap word clouds from the genre's origins in the 1980s to 2020 is certainly also an interesting topic to look into.