About this site

This is Mixtape50, a final project for CS50's Introduction to Computer Science.

Description

Mixtape50 takes a user-given message as input, and returns a playlist of songs whose titles - when put together - form that same message. For example, converting this:

Dear Prudence: I'm so tired. I don't want to spoil the party. When I get home, I'll follow the sun. I'll be back, Rocky Raccoon.

...to this:

dear prudence

im so tired

i dont want to spoil the party

when i get home

ill follow the sun

ill be back

rocky raccoon

Kind of like a mixtape - an artfully put together cassette of songs that might suggest a mood or message - only considerably less artful, and a lot more direct.

Admittedly, this example is a little contrived. But it makes clear some of the subproblems we need to solve.

Let's break them down.

Title cleaning

Let's say we want to include the word happy in the song. If we search Spotify, we end up with 39 distinct titles. Here is a curated subset of that, to illustrate:

#	Track
1	Happy
2	Happy - From "Despicable Me 2"
3	We Were Happy (Taylor's Version) (From The Vault)
4	Happy Birthday to You (Instrumental) [Short Edit 1]
5	GongXi GongXi (Happy Lunar New Year!)
6	Happy w u (feat. Jason Dhakal)

1 is an exact match. 2 is a close match: we mentally filter out the metadata portion when we read, and all that's left is the message string.

The rest do not match, but are good examples of some common metadata conventions in Spotify track titles.

The artist name, version information, and title translations; within parentheses (3, 5) or square brackets (4).
A featured artist, after feat. and within parentheses (6).

We want to format the titles so that 1 and 2 will match our message happy, but not the rest. How? Mixtape50 cleans the track titles in two steps - cleaning metadata, and removing punctuation.

In the first step, regular expressions are used to clean metadata by checking for some common patterns:

Anything between ( ) or [ ] (and any number of spaces before that) are removed.
Any instances of ␣-␣ and everything after are removed.
Any instances of ␣ft␣ ␣ft.␣ ␣ft:␣ ␣feat.␣ ␣feat:␣ ␣featuring␣ and everything after are removed, but not feat or feature (both are legitimate words).

These are bundled into clean_title, a function. A second function, remove_punctuation takes the cleaned title as input, and returns a string of words:

composed of only lowercase alpha-numeric characters,
separated by single ␣s.

Putting this together, after searching Spotify, we put it through clean_title and then remove_punctuation. We also put the message string through remove_punctuation. We then return only the results whose cleaned versions exactly equal the cleaned message string.

Happy

Happy - From "Despicable Me 2"

We Were Happy (Taylor's Version) (From The Vault)

Happy Birthday to You (Instrumental) [Short Edit 1]

GongXi GongXi (Happy Lunar New Year!)

Happy w u (feat. Jason Dhakal)

▼ clean_title

Happy

We Were Happy

Happy Birthday to You

GongXi GongXi

Happy w u

▼ remove_punctuation

happy

we were happy

happy birthday to you

gongxi gongxi

happy w u

String segmentation

This is the meat of the problem: having cleaned the user's message (via remove_punctuation), how can we split it into a list of valid songs (found on Spotify)? More generally, given:

a string, and
a set of valid substrings (a 'dictionary'),

how can we partition the string into a list of valid substrings?

Theory

It turns out that this is a well-known family of problems known as string segmentation or word break problems. The goal is either to determine if a valid partition exists, or to return one (or all) valid partitions.

One key feature of this problem is that segmentations compose. If we find a valid word in the message, then splitting the remaining sentence substring (which we might call the "submessage") is also a string segmentation problem.

The problem's structure - that the given problem can be solved by first solving smaller subproblems of the same kind - lends it well to a recursive approach. Consider the short code snippet below:

def wordBreak(message, dictionary, output = []):
    # Base case
    if message == "":
        print(output)
        return
    # Recursive case
    for i in range(len(message)):
        word = message[:len(message) - i]
        sub_message = message[len(message) - i:]
        if word in dictionary:
            wordBreak(message = sub_message,
                      dictionary = dictionary,
                      output = output + [word])
    # Failure case
    return None

wordBreak is a function that takes a 'dictionary' of valid words and a message string as input, and prints all possible segmentations to the console. Like other recursive algorithms, it handles a base case and recursive case. (You can tinker with this code snippet to run the program here.)

In the base case (the empty message string ""), we print the output to the console, which defaults to the empty list.

In the recursive case,

We segment the message into the word and the remaining sub_message.
We keep decreasing the word length until we find a word in the 'dictionary'.
If the word is in the 'dictionary', we add the word to the output, and call wordBreak on the remaining substring.
If the message is segmentable, the last sub_message will be the empty string "", triggering the base case and printing the segmented output to the console.

Because there is no return statement after calling wordBreak, the loop continues after wordBreak has returned - even for successful calls, which print a valid segmentation to the console. This means that all valid segmentations are eventually found.

Here's an example (adapted from these lecture notes from Duke) to demonstrate:

>>> wordBreak("bothearthandsaturnspin", dictionary)
['both', 'earth', 'and', 'saturn', 'spin']
['both', 'earth', 'and', 'sat', 'urns', 'pin']
['both', 'earth', 'and', 'sat', 'urn', 'spin']
['bot', 'heart', 'hands', 'at', 'urns', 'pin']
['bot', 'heart', 'hands', 'at', 'urn', 'spin']
['bot', 'heart', 'hands', 'a', 'turns', 'pin']
['bot', 'heart', 'hand', 'saturn', 'spin']
['bot', 'heart', 'hand', 'sat', 'urns', 'pin']
['bot', 'heart', 'hand', 'sat', 'urn', 'spin']
['bot', 'he', 'art', 'hands', 'at', 'urns', 'pin']
['bot', 'he', 'art', 'hands', 'at', 'urn', 'spin']
['bot', 'he', 'art', 'hands', 'a', 'turns', 'pin']
['bot', 'he', 'art', 'hand', 'saturn', 'spin']
['bot', 'he', 'art', 'hand', 'sat', 'urns', 'pin']
['bot', 'he', 'art', 'hand', 'sat', 'urn', 'spin']

Let's consider the time complexity of this algorithm. For a message of \(n\) characters, there are \(n-1\) places to segment the string. Since each space can either be segmented or not, there are \(2^{n-1}\) possible segmentations.

In the worst-case scenario, all possible segmentations are valid, so this algorithm runs in \(O(2^{n-1})\) (exponential time). However, in the best-case scenario, no sub-string beginning with the first character is a valid word, so only \(n\) searches are performed before wordBreak exits. So, the algorithm's asymptotic lower-bound is \(\Theta(n)\) (linear time).

Implementation

This maps nicely to our problem at hand. We just need to substitute:

single characters with cleaned "words" (happy, dont, etc.),
valid words with cleaned Spotify track titles, and
the look-up 'dictionary' with a search on Spotify.

In the implementation (the function search_message in helpers.py), there are some deviations from the algorithm described above, mostly to reduce runtime.

Returning the first result

First, like in the algorithm above, search_message starts with the longest possible first track title, decrementing by one word each time. When a valid first track title is found, it calls itself on the remaining submessage. In this way, search_message is "greedy" - it goes with the longest track title starting with the first word (if there is one). This generally results in longer track titles, which is more interesting than having, say, a playlist of mostly one-word track titles.

However, search_message returns the first valid segmentation and stops, rather than continuing the for loop. What this means is that if a valid segmentation exists, both algorithms are guaranteed to find it. However, while wordBreak generates all valid segmentations (and can choose, say, the shortest one), search_message will not.

Limiting track title length

Additionally, another speed-up can be achieved by setting a limit on the number of words to search at a time. Some analysis of a large Spotify track database revealed that most track titles are short - not more than 10 words long. (Most longer track titles are usually either edgy tunes by indie rock bands or long orchestral pieces, both unlikely to be found in a typical message.)

Therefore, starting the search for the first word with the entire message seems inefficient, especially for long message strings. In the implementation, a parameter max_search_length is set by default to 10 words. search_message will start the search with just first 10 words (or the message length, whichever is shorter), decrementing to nine words, eight words, and so on.

Database lookups

While wordBreak looked for words in the 'dictionary' (a set), searching Spotify was implemented via an API (in search_spotipy - more on this later.) However, one practical challenge was that many crucial song track titles could not be located. Running search_message on the 100 most common words in English, only about 20 or so had even a single matching track, causing most messages to trip up on words like the or has.

The first fix was to additionally query a Kaggle database of 1.2 million+ Spotify tracks if the naive Spotify search did not return results (implemented in search_db). For some reason, this significantly improved coverage, so that about 80 of the 100 most common English words could be matched with Spotify tracks.

Memoization

The next change was to 'cache' successful and unsuccessful queries. Some substrings, although not segmentable, took a long time to verify. Consider the hypothetical unsegementable substring handsaturnspinhaha in the example above.

handsaturnspin can be split six different ways:

hands at urns pin hands at urn spin hands a turn spin hand saturn spin hand sat urn spin hand sat urns pin

Each time this full substring is searched, each possibility will be generated, only to find that haha is not in the 'dictionary'. In the example above, handsaturnspin is actually searched twice: once for the segmentation with "he", "art", and again for the segmentation with "heart". This seems wasteful: if handsaturnspinhaha is unsegmentable, there should be no need to search it on both "he art" and "heart".

In our code, if the submessage is found to be unsegmentable, it is added to failed_queries, a set of unsegmentable substrings. The next time it is selected, search_message finds that it has already been searched and skips the computation entirely. This technique of 'caching' results is known as 'memoization' in functional programming, a computing paradigm that minimises repeated calculations, decreasing the runtime.

Similarly, successful segmentations (as returned by search_message) are stored in a python dictionary, query_lookup, for future reference.

Putting it all together

For each submessage:

If the submessage is found in failed_queries, we continue to the next iteration of the loop.
Otherwise, we loop through three search functions in order: search_lookup, search_spotipy, and search_db. These check the cache of successful previous queries, Spotify (via its API), and our Kaggle database respectively.

Creating the playlist

This part was comparatively easy. Mixtape50 uses the Spotify API via the spotipy python library to create a playlist, and add tracks to it. The Spotify API relies on the OAuth 2.0 authorisation protocol, but the implementation details are thankfully abstracted away by the spotipy library.

Coda: server-sent events

Unlike the toy example shown earlier, the results take some time to return - usually under a minute - because queries to Spotify's API are rate-limited. To improve the user experience, once the user submits a message, a new tab is opened, which displays the progress of the search in real-time for the user to inspect.

This presents another challenge: since the search takes place in the server, how can updates be sent live from the server to the client (the browser)? In the typical HTTP paradigm, the server only sends a response after the client sends a request. What we want to do is allow the server to send a response unilaterally, without a prior request from the client.

After considering a few alternatives like WebSocket, I decided to use server-sent events (SSE). Essentially, this allows the client to "subscribe" to a stream of events from the server, until the connection is closed. The flow goes something like this:

The client (the browser) sends a request to the server to "subscribe" to updates. (This is handled via EventSource, a vanilla JavaScript interface to server-sent events.)
Once the EventSource connection is opened, the server can send updates to the client via by setting the Content-Type header to text/event-stream.
When the search has completed, a final message is sent to the client to close the connection.

Most of this is implemented in a JavaScript file, creating.js on the client side, while the server's transmissions are handled via a MessageAnnouncer object in announcer.py (adapted from this article by Max Halford).

Final thoughts

If you've made it this far, please accept my sincere thanks - it means a lot to me! If you have any comments or suggestions for improvement, please feel free to contact me, or make a merge request.

Limitations and extensions

There are quite a few features I'd have liked to implement but for my lack of time, will, and technical ability. I may refactor my code and attempt to do so in future. If it is interesting or of value to you, you are most welcome to contribute!

Improved track title cleaning. There are still a few edge cases, like "I'm Not Okay (I Promise)" (by My Chemical Romance) - this would match i'm not okay, but not i'm not okay, i promise!. Additionally, to reduce the chances of a message tripping up, I'd like to allow, say, the word you to match both "you" and "u", or want to to match "want to", "wanna", or even "wna", in both directions.
Updating the cache into a SQL database directly. Common queries could be cached into a SQL database instead of a python dictionary, allowing for caching to persist between sessions.
Customizing the playlist. Usually, a successful Spotify or database search returns not one, but quite a few songs. Once the screen has loaded, I'd like for the user to be able to choose which of the matching songs to include in the playlist. For example, the user could pick out a specific cover of the matching song, which is in line with the thoughtful, deliberate composition process that defined mixtapes in the past. (Actually, if you open the browser console, you'll see the song data is actually sent over to the client - I just couldn't figure out the JavaScript to make the selection happen.)
Making the app thread- and proces-safe. One serious bug is that the app currently cannot handle simultaneous requests - one user's input would cause server-sent events to be broadcast to not only them, but also other users on the /creating route. I explored using Flask's request-context g object to create a separate "channel" for each user, but it proved beyond me.

Lessons learnt

This project felt very apt as a capstone for the CS50x course. It essentially tied together concepts from the entire course, including

Algorithms and data structures: in implementing the string segmentation algorithm,
SQL: in constructing the database lookups,
HTML, CSS, and JavaScript: in designing the website,
Flask and HTTP protocols: in putting together the app, and learning the EventSource protocol and more about APIs in general.

This was Mixtape50!