NLP for Seasonal Music Classification
There’s nothing I love more than tanning at the beach listening to my summer playlist — other than going for a fall hike listening to my autumn playlist, baking Christmas cookies listening to my winter playlist, or splashing in the rain listening to my spring playlist. Music sets the mood for every season, and for the past 7 years I have been meticulously crafting the perfect seasonal playlists to fit the months around me.
Some of these songs are easy to classify — of course “Midnights in October” by Dom Fera would go on my fall playlist, and “Sunbleached Girl” by Shag Rock in summer. But others come down purely to vibes, to feelings, to an offhand line or beat.
I’m not the only one thinking about this. Every year starting in April or May we see dozens of tweets, TikToks, Reddit posts asking about the song of the summer — a viral, upbeat, joyful tune to dance and sing along to on warm summer nights. There are albums that are widely classified as quintessentially fall albums, dusted off the first weeks of September as a soundtrack for rainy evenings and falling leaves and pumpkin spice (my favorite? Good Morning Rain (1970) by Bonnie Dobson).
Sometimes I don’t want to go out and find songs to put on my playlist for the season. I want to automatically classify my entire library of 4,000 songs into seasonal categories so I can throw it on shuffle. More than that, I want to understand why we tend to subconsciously classify songs as belonging to summer, fall, winter, or spring.
Classifying Music
The process of classifying music — into genres, moods, clusters of similar songs — has a history longer and far more complex than I will get into here. Spotify uses a huge amount of data to classify music into micro-genres, as Nick Seaver explains in his (fantastic!) 2022 book, Computing Taste:
“Artists that were often listened to by the same people, or that were often described online with the same words, would find themselves grouped together. […] The location of genres was determined […] [by calculating] the typical acoustic features of music within each cluster, arraying them in space along two auditory dimensions.” [Seaver 130]
These genres can be as broad as “folk” or as niche as “vegan straight edge”, and Spotify uses a combination of content-based and collaborative filtering to assign songs to them.
What they don’t take into account for classification (or at least didn’t until Spotify partnered with Musixmatch in late 2021 to offer lyrics as part of playback) is song lyrics. Song lyrics provide us with a huge amount of (con)textual data about a song and allow us to analyze its mood, intent, and themes. So can we use natural language processing on them to automatically assign a season to a song? And can this help us understand why we tend to associate songs with seasons even when there’s no explicit lyrical explanation?
Feature Collection
My first step was to collect a broad and varied array of songs with seasons already assigned to them. No such dataset exists, so I searched Spotify for playlists like “spring mornings”, “fall vibes”, and “summer dance hits”, selecting those with the most likes across the widest range of genres I could find. I intentionally avoided any Spotify “Made for You” playlists, as I wanted to minimize bias from my own listening habits in selecting music.
I ended up with 32 summer playlists, 28 fall playlists, 27 winter playlists, and 25 spring playlists (it’s very difficult to find springtime playlists, and they seem to be mostly lo-fi). I used the Spotify API via the Spotipy Python package to pull every song in these playlists with its artist, id, and release date, tagging each with the season of the playlist I pulled it from.
import pandas as pd
import spotipy as spt
from spotipy.oauth2 import SpotifyClientCredentials

# spotify client credentials
client_credentials_manager = SpotifyClientCredentials(spotify_client_id, spotify_client_secret)
sp = spt.Spotify(client_credentials_manager=client_credentials_manager)

def get_playlist_songs(playlist):
    track_ids = []
    track_names = []
    artist_ids = []
    artist_names = []
    release_date = []
    popularity = []
    data = sp.playlist_items(playlist)
    # the API can only return 100 songs at once, so page through the playlist
    for page in range(data['total'] // 100 + 1):
        data = sp.playlist_items(playlist, limit=100, offset=100 * page)
        # for each song on each page, pull its metadata
        for song_num, item in enumerate(data['items']):
            # skip missing tracks and podcast episodes
            if item['track'] is None or item['track']['type'] == 'episode':
                print(f"Track at index {song_num} is None or an episode. Skipping...")
                continue
            track = item['track']
            track_ids.append(track['id'])
            track_names.append(track['name'])
            artist_ids.append(track['artists'][0]['id'])
            artist_names.append(track['artists'][0]['name'])
            release_date.append(track['album']['release_date'])
            popularity.append(track['popularity'])
    # compile dataframe
    data_track = {'name': track_names,
                  'id': track_ids,
                  'artist': artist_names,
                  'artist_id': artist_ids,
                  'popularity': popularity,
                  'release_date': release_date}
    return pd.DataFrame(data_track)
# pull songs for every playlist and tag them with that playlist's season;
# playlists is a list of (playlist_id, season) tuples
playlist_df = pd.DataFrame()
for playlist_id, season in playlists:
    playlist = get_playlist_songs(playlist_id)
    playlist['season'] = season
    playlist_df = pd.concat([playlist_df, playlist], ignore_index=True)
We end up with 20,000 songs split roughly evenly across all 4 seasons. These songs, however, are not unique. A song might be repeated across several playlists of the same season, or even several different seasons. Different people with different histories, perspectives, and opinions will assign the same song to multiple season playlists, leaving us with multiple target variables for the same entry. (This is bad).
To get around this, we assign each song the season it appears under most often. “Pink + White” gets chosen as a summer song, “we fell in love in october” as a fall song. After this cleaning and re-assigning, we’re left with 14,307 unique songs: 4770 summer, 2723 fall, 2199 winter, and 4615 spring.
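Concretely, that majority vote is a short pandas operation. A minimal sketch, assuming the playlist_df built above (mode() returns seasons in sorted order, so ties fall to whichever season sorts first):

# collapse duplicates to one row per song, keeping only the fields
# needed for the lyric lookup; the most common season wins
unique_songs_df = (
    playlist_df
    .groupby(['name', 'artist'])['season']
    .agg(lambda s: s.mode()[0])
    .reset_index()
)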
Now we have song identifiers and target variables, and it’s time to pull in song lyrics.
Lyric Collection
Bringing in lyrics is slightly more complicated than bringing in songs. We’re using the Genius API with the LyricsGenius Python package to pull lyrics for each of our 14,000 songs.
import time
import pandas as pd
import lyricsgenius
from requests.exceptions import Timeout

genius = lyricsgenius.Genius(
    genius_access_token, verbose=True, skip_non_songs=True, timeout=15
)

# section markers and boilerplate to strip from the raw lyrics
keywords = ['chorus', 'instrumental', 'bridge', 'verse', 'embed', 'lyrics', 'outro', 'intro']

def genius_search(song, artist):
    lyrics = ''
    retries = 0
    while retries < 3:
        try:
            track = genius.search_song(song, artist)
            # ensure the track was found and contains lyrics
            if track is None:
                retries += 1
                continue
            # keep only lines that aren't structural markers
            for line in track.lyrics.lower().split('\n'):
                if not any(keyword in line for keyword in keywords):
                    lyrics += line + ' '
            return lyrics
        except (Timeout, TimeoutError):
            retries += 1
    # if it fails after 3 retries, return an empty string
    return ''

# pull lyrics for a batch of songs (x:y slices out the batch)
lyrics_df = pd.DataFrame()
for song, artist, season in unique_songs_df[x:y].itertuples(index=False):
    lyrics = genius_search(song, artist)
    # brief pause between requests to be gentle on the API
    time.sleep(0.5)
    temp_df = pd.DataFrame([[song, artist, season, lyrics]],
                           columns=['song', 'artist', 'season', 'lyrics'])
    lyrics_df = pd.concat([lyrics_df, temp_df], ignore_index=True)
Unfortunately, these lyrics are extremely messy. There are null values, entries that are just lists of other songs, entries in multiple languages, and entries that appear to be the entirety of Frank McCourt’s 1996 memoir Angela’s Ashes (???), which shows up multiple times for multiple different songs (???).
A human can instinctively discern true lyrics from the random lists and novels. However, I don’t want to manually check each of my 12,600 songs (I removed 1,400 nulls) to determine which are which. I don’t even want to label a few hundred of them by hand and build a supervised model to discern the remaining thousands. Instead, we are going to let the machine do the learning for us through clustering.
K-Means Clustering
This was the simplest model I have ever built. I created 5 features for each entry of the lyrics column: number of words, number of dashes, number of numeric values, number of periods, and number of words per numeric value. These were based purely on a quick scan of a few of the ill-categorized lyrics: the memoirs had a huge number of words; the lists had huge numbers of numeric values, dashes, and periods; the true lyrics mostly fell in the 200–400 word range.
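Here’s roughly what that feature engineering looks like, as a sketch assuming the lyrics live in lyrics_df['lyrics']; the column names are my own shorthand:

# count-based features for separating true lyrics from lists and novels
feats = pd.DataFrame()
feats['n_words'] = lyrics_df['lyrics'].str.split().str.len()
feats['n_dashes'] = lyrics_df['lyrics'].str.count('-')
feats['n_numeric'] = lyrics_df['lyrics'].str.count(r'\d')
feats['n_periods'] = lyrics_df['lyrics'].str.count(r'\.')
# words per numeric value (+1 to avoid dividing by zero)
feats['words_per_numeric'] = feats['n_words'] / (feats['n_numeric'] + 1)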
I scaled my features and fed them into my clustering model to see the results. It immediately did an amazing job of discerning lyrics from non-lyrics! 2 clusters worked well to explain this difference, but I settled on 4 because it helped remove some more non-lyrics from the lyrics cluster. My goal here was to remove false positives — I’d rather have fewer true lyrics going into my future classification model than keep all the true lyrics plus several non-lyrics. I wanted a smaller, purer dataset.
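In scikit-learn terms, the clustering step might look like this (a sketch continuing from the features above, with k=4 as described):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# scale so no single count dominates the euclidean distances
scaled = StandardScaler().fit_transform(feats)

# 4 clusters: one of true lyrics, the others catching lists, novels, and junk
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
lyrics_df['cluster'] = kmeans.fit_predict(scaled)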
I used Principal Component Analysis to reduce dimensionality so I could visualize my clusters in 2-D space to confirm everything looked right and no clusters were heavily overlapping. Everything looked great! Our 0-cluster of true lyrics was distinct from our other clusters and captured the majority of our data.
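That visualization is just a two-component PCA over the same scaled features; the plotting details here are my own:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the scaled features down to 2 dimensions for plotting
coords = PCA(n_components=2).fit_transform(scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=lyrics_df['cluster'], s=5, cmap='viridis')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Lyric clusters in 2-D PCA space')
plt.show()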
After removing the non-0 clusters from analysis I was left with 12,074 songs. Finally, I used langdetect in Python to determine the language of the lyrics and remove any non-English entries from the dataset to prepare for natural language processing with BERT. There is actually a multilingual version of BERT called mBERT that could handle and interpret the multiple languages in my dataset — however, these non-English entries are so sparse that there’s not enough data to properly train on.
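The filtering step looks roughly like this (a sketch; langdetect raises on empty or ambiguous strings, so I treat those as non-English):

from langdetect import detect, LangDetectException

# keep only the true-lyrics cluster
lyrics_df = lyrics_df[lyrics_df['cluster'] == 0]

def is_english(text):
    # detect() raises on empty or undecidable input; drop those rows too
    try:
        return detect(text) == 'en'
    except LangDetectException:
        return False

lyrics_df = lyrics_df[lyrics_df['lyrics'].apply(is_english)]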
Finally, we are left with 11,303 songs: true lyrics, in English, no nulls, categorized by season. We’re ready to start modeling!
Natural Language Processing
There are a variety of tools we can use in Python for NLP text classification. For this use case, we will be using the pre-trained BERT model from Hugging Face. I set up a Google Colab notebook running on a Tesla T4 GPU, and my first run took only 15 minutes.
The actual process of running and tuning BERT was extremely easy. I followed a simple process:
- Split lyrics (feature) from season (label).
- Numerically encoded labels.
- Train-test split my data 80–20.
- Resampled my training data classes so they were approximately even in size (steps 2–4 are sketched just after this list).
- Tokenized my data & fine-tuned my parameters using the bert-base-uncased pre-trained transformer model from Hugging Face.
- Evaluated my model.
- Profited (if only).
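Steps two through four aren’t shown in the training code below, so here’s a rough sketch of one way to do them, using scikit-learn’s resample to upsample the smaller classes (the label mapping is my own choice):

from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# numerically encode the season labels
label_mapping = {'summer': 0, 'fall': 1, 'winter': 2, 'spring': 3}
lyrics_df['label'] = lyrics_df['season'].map(label_mapping)

# 80-20 train-test split, stratified to mirror the class balance
X_train, X_test, y_train, y_test = train_test_split(
    lyrics_df['lyrics'].tolist(), lyrics_df['label'].tolist(),
    test_size=0.2, stratify=lyrics_df['label'], random_state=42
)

# upsample each training class to the size of the largest one
train_df = pd.DataFrame({'text': X_train, 'label': y_train})
largest = train_df['label'].value_counts().max()
train_df = pd.concat(
    [resample(g, replace=True, n_samples=largest, random_state=42)
     for _, g in train_df.groupby('label')],
    ignore_index=True,
)
X_train, y_train = train_df['text'].tolist(), train_df['label'].tolist()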
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# load the tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(label_mapping))

# tokenize the data, truncating long lyrics to 256 tokens
def tokenize_function(texts):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=256)

train_dataset = Dataset.from_dict({'text': X_train, 'label': y_train}).map(lambda e: tokenize_function(e['text']), batched=True)
test_dataset = Dataset.from_dict({'text': X_test, 'label': y_test}).map(lambda e: tokenize_function(e['text']), batched=True)

def compute_metrics(pred):
    # get true labels and predictions from the model
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy
    accuracy = accuracy_score(labels, preds)
    # calculate precision, recall, f1
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    load_best_model_at_end=True,
    weight_decay=0.01,
    metric_for_best_model="accuracy",
    evaluation_strategy="steps",
    logging_dir='./logs',
    logging_steps=25,
)

# initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # evaluate on the held-out test set
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# train the model
trainer.train()
# evaluate the model
trainer.evaluate()
After some parameter tuning over multiple model runs, I was left with my final evaluation metrics:
An accuracy of 0.47 and F1 of 0.46 is not bad at all for a 4-class classification system with an inherently subjective and messy dataset. These results don’t show us everything — for a more granular view of each of our classes, I wanted to look at the confusion matrix of predicted vs. actual seasons of our test data.
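Pulling that matrix from the trained Trainer takes just a few lines; here’s a sketch, with the plotting choice (sklearn’s ConfusionMatrixDisplay) being my own:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# predict on the held-out test set and compare against the true seasons
preds = trainer.predict(test_dataset)
y_pred = preds.predictions.argmax(-1)

cm = confusion_matrix(preds.label_ids, y_pred)
ConfusionMatrixDisplay(cm, display_labels=list(label_mapping)).plot()
plt.show()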
A strong diagonal shows overall good predictive power. But what I think is even more encouraging is where the model made mistakes. The largest error is between our spring and summer classes: predicting a song is summer when it is actually spring, and vice versa. This is remarkable! I think anyone would agree that the season most musically related to summer is spring, a relationship the model has captured in its predictions. The results are even stronger in my eyes after seeing this matrix. It’s hard to classify songs between spring and summer! And the model agrees.
I’m impressed by the model and looking forward to next steps around interpretability, explainability, and visualization. In my mind, we have proven the ability to classify songs into seasons based on lyrics far better than random assignment. Finally —
Known Issues
This approach (like all mathematical approximations of real life) has many issues! Some are solvable if I put in some more time, and some aren’t. The ones I’ve thought of so far are as follows:
- Many songs are over-indexed because different versions are included several times. For example, the original release of a Taylor Swift song and its (Taylor’s Version) re-recording. At best, this causes the song to be weighted twice as heavily; at worst, if the versions are categorized as different seasons, it adds noise and contradiction to the data.
- People have different opinions. If the first time you ever listened to “Only the Good Die Young” by Billy Joel was in fall, you might (incorrectly) classify it as a fall song (when it is obviously a summer song). There is no way to create a perfect universal classification system in this way because everyone perceives music differently!
- Some songs are included on Spotify playlists as (Remastered) or (slowed + reverb) versions, but Genius does not understand these labels and is unable to pull the original song lyrics, resulting in blank entries where there shouldn’t be any.
- A major increase in performance here would come from pulling in Spotify’s audio-feature vectors, as I’ve done in other projects. It would not surprise me if “danceability” correlated strongly with summer songs and “acousticness” with fall songs. This could provide an entirely new dimension that lyrics alone can’t capture. Unfortunately, Spotify’s API ToS make it explicitly clear that its data cannot be used for machine learning projects, so we have to skip this.
Please let me know your thoughts! And if you work in Spotify legal, please give me permission to use your API for ML projects. I have so many ideas. Please.