Play (with) music is an exploration of the intersection of music data analysis and interactive visualization. Designed to enhance user engagement with Spotify playlists, it uses technologies such as D3.js and NLTK to create a dynamic interface where each song is represented by an animated circle. The project incorporates six visualization filters derived from Spotify API data: decade, genre, lyric similarity, popularity, energy, and danceability. Lyrics for each song are sourced through the LyricsGenius API and undergo preprocessing, clustering, and visualization using techniques such as TF-IDF vectorization, K-Means clustering, and Principal Component Analysis. The technical implementation covers security and authentication concerns including SSL/TLS compatibility, OAuth2 authorization flow management, and robust handling of network issues through Python's `requests.Session` object. Supported by a toolkit that includes the Spotify API, scikit-learn, matplotlib, jQuery, Flask, and Python, the project aims to provide a comprehensive framework for exploring and interpreting music data with precision and depth.
The main backend script, `main.py`, uses the Spotify API and Flask to query Spotify playlists and process track information. The script starts by importing the necessary libraries: `requests` and `urllib.parse` for handling HTTP requests and URL parsing, `urllib.error` for HTTP error handling, `requests.adapters.HTTPAdapter` and `urllib3.util.retry.Retry` for configuring request retries, `time` for delays, `json` for JSON operations, `ssl` for SSL configuration, `datetime` for date and time operations, and Flask components for web server functionality.
A function `sslwrap(func)` is defined using the `wraps` decorator from `functools`, ensuring SSL/TLS protocol version compatibility by wrapping the SSL socket. The Flask application is initialized with a secret key for session management. Essential Spotify API credentials and endpoints are defined, and a `requests.Session` object, `my_session`, is configured with retry capabilities for handling transient network issues.
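A retry-enabled session of the kind described above might look like the following sketch. The specific retry counts, backoff factor, and status codes are illustrative assumptions, not values taken from `main.py`:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Hypothetical retry policy: up to 3 attempts with exponential backoff,
# retrying on common transient HTTP status codes (rate limits, server errors).
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
)

# Mount the retrying adapter on both schemes so every request through
# my_session transparently retries transient failures.
my_session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_strategy)
my_session.mount("https://", adapter)
my_session.mount("http://", adapter)
```

Configuring retries at the session level keeps the route handlers free of ad-hoc retry loops: any `my_session.get(...)` call inherits the policy automatically.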
The Flask route `/` serves the index page, rendering `index.html`. The `/login` route initiates the OAuth2 authorization process with Spotify, redirecting users to Spotify's authorization page with the necessary parameters. The `/callback` route handles the OAuth2 callback, exchanging the authorization code for an access token and storing the token information in the session. The `/refresh-token` route handles token refresh, sending a POST request to the Spotify API to obtain a new access token using the refresh token and updating the session with the new token information.
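The redirect performed by the `/login` route amounts to building Spotify's authorization URL from a handful of query parameters. A minimal sketch, assuming placeholder credentials and scope (the real values live in `main.py`):

```python
from urllib.parse import urlencode

# Hypothetical credentials and endpoint; substitute the app's real values.
CLIENT_ID = "your-client-id"
REDIRECT_URI = "http://localhost:5000/callback"
AUTH_URL = "https://accounts.spotify.com/authorize"

def build_authorize_url(scope="playlist-read-private"):
    """Construct the Spotify authorization URL that /login redirects to."""
    params = {
        "client_id": CLIENT_ID,
        "response_type": "code",   # authorization-code flow
        "redirect_uri": REDIRECT_URI,
        "scope": scope,
    }
    return f"{AUTH_URL}?{urlencode(params)}"
```

After the user grants access, Spotify redirects back to `REDIRECT_URI` with a `code` parameter, which `/callback` then exchanges for an access token.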
The `/playlists` route fetches the user's playlists using the stored access token, checking for token expiration and refreshing it if necessary. It sends a GET request to the Spotify API to retrieve playlists, manages potential request exceptions, and iterates through each playlist, calling a helper function that retrieves detailed information about each track: song name, artist name, album name, album cover image, and year of release. A separate API request per song gathers audio features such as genre, popularity index, danceability index, and energy index. All of this data is stored in a single top-level dictionary containing five smaller dictionaries, one per playlist. The script writes this data to a JSON file, `tracks.json`, and triggers lyric processing for each track by calling the `preprocessing_lyrics` and `analyzing_song_lyrics` functions from the Python scripts `lyrics.py` and `lyricsanalysis.py`, respectively, which handle lyric processing and analysis for each playlist (more info in part 2). This setup ensures secure handling of user authentication, retrieval of playlist and track data from Spotify, and processing of the data for further analysis.
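The per-track data described above could be shaped roughly as follows. The field names here are illustrative, not taken verbatim from `main.py`:

```python
import json

# Hypothetical shape of one track entry combining metadata and audio features.
track = {
    "name": "Song Title",
    "artist": "Artist Name",
    "album": "Album Name",
    "image": "https://example.com/cover.jpg",  # album cover URL
    "year": 1999,
    "genre": "pop",
    "popularity": 74,      # Spotify popularity index, 0-100
    "danceability": 0.61,  # audio feature, 0.0-1.0
    "energy": 0.83,        # audio feature, 0.0-1.0
}

# One key per playlist; the real data holds five such sub-dictionaries.
playlists = {"playlist_1": [track]}

with open("tracks.json", "w") as f:
    json.dump(playlists, f, indent=2)
```

Serializing everything into one `tracks.json` file lets the frontend load all playlist data in a single fetch.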
The lyric processing module analyzes and visualizes lyrics from songs in a user's Spotify playlist, employing a range of Natural Language Processing (NLP) and Machine Learning (ML) techniques. The primary goal was to plot each lyric line on a 2D plane, revealing patterns and clusters within the text data.
First, each song from every playlist is preprocessed by obtaining its lyrics from the Genius API. Several libraries are imported: `lyricsgenius` for interacting with the Genius API, `re` for regular expression operations, `json` for reading JSON files, `csv` for writing to CSV files, `os` for operating system interactions, and `time` for introducing delays. The main function starts by reading the per-playlist JSON files generated by `main.py` from the `./static/` directory, each listing every track name, and loading their contents into a constant. The Genius API client is then initialized with an access token and a 40-second timeout, and an attempt is made to fetch lyrics for a given song title, with a 2-second delay between requests to avoid rate limiting. If lyrics are found and are shorter than 2,000 characters, they are returned; otherwise, `None` is returned. The function also includes error handling to catch and print exceptions raised during the API request. Each song's lyrics are cleaned by removing unnecessary text, common filler words, and phrases such as "[Chorus]" and "[Verse]", using regular expressions to strip non-alphabetic characters, extra whitespace, digits, and stray Unicode characters. The final cleaned lyrics are constructed by concatenating the non-empty processed lines. A relative path for the output CSV file associated with each playlist is constructed and filled with the preprocessed lyrics.
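The cleaning step can be sketched as a small regex pipeline. The section-marker list and helper name are illustrative assumptions, not the exact implementation in `lyrics.py`:

```python
import re

# Hypothetical pattern for section markers like "[Chorus]" or "[Verse 2]".
FILLER = re.compile(r"\[(?:chorus|verse|bridge|intro|outro)[^\]]*\]", re.IGNORECASE)

def clean_lyrics(raw):
    """Strip section markers, digits, non-alphabetic characters, and extra whitespace."""
    lines = []
    for line in raw.splitlines():
        line = FILLER.sub("", line)                 # drop "[Chorus]" / "[Verse 1]" tags
        line = re.sub(r"[^a-zA-Z\s']", " ", line)   # keep letters and apostrophes only
        line = re.sub(r"\s+", " ", line).strip()    # collapse runs of whitespace
        if line:                                    # keep non-empty processed lines
            lines.append(line)
    return "\n".join(lines)
```

For example, `clean_lyrics("[Chorus]\nHello world 123!")` reduces to just `"Hello world"`.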
To analyze and visualize each lyric line from the songs in each playlist, a combination of NLP tools for text preprocessing and ML algorithms for clustering and topic modeling was used. Song lyrics were sourced from the LyricsGenius API and split into individual lines. The dataset, consisting of individual lyric lines along with song titles and artists, was then cleaned and preprocessed: tokenized into words, converted to lowercase, filtered to remove stopwords and non-alphanumeric tokens, and lemmatized to obtain base forms. Text processing used several NLTK modules, including WordNet, the stopwords corpus, and the Punkt tokenizer.
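A simplified stand-in for that pipeline is sketched below. The real code uses NLTK's Punkt tokenizer, stopwords corpus, and WordNet lemmatizer; here a tiny hardcoded stopword list and whitespace tokenization keep the example self-contained, and lemmatization is omitted:

```python
# Small illustrative stopword set; the real code uses NLTK's full corpus.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "i", "you"}

def preprocess_line(line):
    """Tokenize, lowercase, drop non-alphanumeric tokens, and remove stopwords."""
    tokens = line.lower().split()                       # whitespace tokenization
    tokens = [t for t in tokens if t.isalnum()]         # filter non-alphanumeric tokens
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return tokens  # lemmatization (WordNet) omitted in this sketch
```

So `preprocess_line("I walked in the rain")` yields `["walked", "rain"]`, the kind of reduced token list the TF-IDF step then consumes.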
The text was transformed into Term Frequency-Inverse Document Frequency (TF-IDF) feature vectors using the TfidfVectorizer module from the Scikit-learn library, which numerically represented the importance of each word in the context of the entire corpus. With these feature vectors, K-Means clustering was applied to group the lyric lines into 15 distinct clusters based on their textual similarities. The clustering process involved fitting the K-Means algorithm to the TF-IDF vectors and assigning each lyric line to one of the clusters. Furthermore, Latent Dirichlet Allocation (LDA) was employed for topic modeling, extracting the top three words that characterize each topic. These top words were then used to create identifiers for each cluster. To assess the quality of the clustering results, cosine similarity scores were computed between each lyric line and the centroid of its assigned cluster, providing a measure of how closely each line matched the central theme of its cluster. Additionally, confidence levels were calculated based on the Euclidean distance to the nearest cluster center, with higher confidence corresponding to shorter distances.
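The TF-IDF, K-Means, and cosine-similarity steps can be sketched on a toy corpus. The four lines and two clusters here are illustrative stand-ins for the real lyric lines and the 15 clusters used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the lyric lines.
lines = [
    "love you forever", "love me tonight",
    "dancing all night", "dance the night away",
]

# Vectorize: each line becomes a sparse TF-IDF row.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lines)

# Cluster the lines (2 clusters here; 15 in the real pipeline).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Cosine similarity between each line and its assigned cluster centroid.
sims = [
    cosine_similarity(X[i], kmeans.cluster_centers_[labels[i]].reshape(1, -1))[0, 0]
    for i in range(len(lines))
]
```

Each entry of `sims` measures how closely a line matches its cluster's central theme, which is the per-line quality score the write-up describes.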
To visualize the analysis, dimensionality reduction on the TF-IDF vectors was performed using Principal Component Analysis (PCA), reducing the high-dimensional data to two principal components. This was used to plot the lyric lines in a 2D space, where each point represents a lyric line and is color-coded by its cluster assignment. The processed data (the computed 2D coordinates, cluster assignments, cosine similarity scores, and confidence levels, along with the corresponding lyric line, song title, and artist) was saved to a local CSV file.
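The PCA step amounts to projecting the TF-IDF matrix onto its first two principal components. A minimal sketch, reusing the same toy corpus as above (note scikit-learn's PCA expects a dense matrix, so the sparse TF-IDF output is densified first):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy lyric lines standing in for the full dataset.
lines = [
    "love you forever", "love me tonight",
    "dancing all night", "dance the night away",
]
X = TfidfVectorizer().fit_transform(lines)

# Reduce to two principal components: one (x, y) pair per lyric line.
coords = PCA(n_components=2).fit_transform(X.toarray())
```

The resulting `coords` array has shape `(n_lines, 2)` and supplies the x and y columns written to the CSV.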
By meticulously preprocessing the lyrics, transforming the data into meaningful numerical representations, and applying clustering and topic modeling techniques, the pipeline allows users to identify distinct themes and topics across different songs. The visualization of these themes through dimensionality reduction and 2D plotting not only highlights the diversity and commonalities within the playlists but also provides an intuitive way to explore the lyrical content.
The application begins on the landing page, where a backdrop of animated song nodes frames a button that redirects users to log in to their Spotify account and grant permission to access their playlists and account data.
The main page is divided into three main components. The first is an interactive interface where users control the mechanisms of the project/web page: five dials representing each playlist, and two meters exhibiting the different data visualization options. There is also a record player displaying the album cover on a vinyl record, with song title, album, and artist information shown at the bottom that updates as the user hovers over the song nodes.
The second and third components are two SVG elements representing the songs in each playlist as circular nodes created via D3.js, with each circle displaying the song's album cover. The main SVG occupies the right half of the page, animating song data on a two-dimensional scale. For example, song genres are displayed in a grid-like form (see video). Popularity, energy, and danceability values are represented by the size of the circles: the higher the value, the larger the node. The decade attribute is visualized by arranging the nodes vertically by decade. Lastly, a 'recombine' button in the user interface repositions all the nodes to the center of the SVG.
A smaller SVG is placed directly underneath the interface frame. It visualizes the numerical indices of popularity, energy, danceability, and decade by positioning the song nodes along a one-dimensional horizontal axis according to their data values. The default attribute visualized in both SVGs upon loading the page is popularity, so when switching between playlists, the SVGs reset to this default configuration.
The functionality of the main SVG is driven by the `create_bubbles()` function defined in the `bubble_chart.js` script. This function is called from `main.js`, which passes the song data of the currently active playlist as the parameter. In essence, the function dynamically generates a bubble chart visualization where each circle node represents a song, utilizing D3.js for data binding, scales, forces, and interactivity to provide an engaging exploration of the selected playlist's song dataset.
Firstly, it selects the main SVG element and clears any existing content to reset it for each playlist. Then, circle elements representing songs are appended to the SVG, each with its own radius and fill pattern based on the track's popularity and image. Drag behavior is implemented for interactivity, allowing circles to be dragged smoothly. A force simulation is set up with various forces such as charge, centering, and collision detection. Tooltip functionality is added for mouseover events, surfacing each song's details in the user interface.
Interactive buttons for genre, decade, combine, energy, popularity, and danceability on the main page are defined for user interaction, triggering changes in circle positions and sizes based on the selected criterion. To position circles by genre and decade, the function sets up scales and forces that are triggered upon clicking the genre and decade buttons, respectively. It creates new object literals to store the distinct genres and decades present in the dataset. The built-in JavaScript `map()` function is used to create new arrays for each song, taking the dimensions of the SVG and generating numerical positions based on its genre and decade. Upon clicking the buttons, D3 force simulations evenly distribute the circle nodes based on the data and scale them appropriately to the size of the SVG. The genre configuration creates a uniform grid layout based on the number of distinct genres represented in the dataset; the decade configuration distributes the nodes vertically, clustering them by the decade the song was released.
For the energy, popularity, and danceability animations, each song circle's radius changes with respect to the value associated with the song. Helper functions calculate the minimum and maximum radius range based on the width of the SVG and the number of tracks, and set up `radiusScale()` and `trackdetailsScale()` scales using `d3.scaleSqrt()` to map data values to circle radii. The 'combine' button uses both X and Y forces to bring the circles to the center of the SVG, and D3's `forceCollide` method ensures that circles do not overlap.
The `generatePlot()` function, which establishes the features of the smaller SVG, similarly operates on song data and uses D3.js to create a visualization, but with a different approach than `create_bubbles()`. Overall, `generatePlot()` dynamically generates a horizontal plot of the songs, allowing users to explore the dataset by different attributes, with interactive features powered by D3.js.
The initialization of the smaller SVG begins the same way as the main SVG: by selecting the SVG element and clearing any existing content. Drag behavior functions are defined to enable manipulation of elements within the visualization. Circle elements representing songs are appended to the SVG using data binding. Each circle is filled with a pattern based on the track image and positioned along the x-axis at the center of the SVG. Circles are positioned using a force simulation (`alignX()`) to create a scatter plot effect. The axis is labeled with ticks at increments of 10, spanning the range from the minimum to the maximum value in the dataset.
As the initial configuration, the x-axis scale is set up based on the popularity index. The function also establishes event listeners for the decade, energy, and danceability buttons, allowing each to change the x-axis scale and reposition circles based on the song's release year, energy score, and danceability score, respectively. Each button click triggers a transition to update the visualization accordingly. Tooltip functionality is added for mouseover events on circles to display song details.
The CSV file generated by `lyricsanalysis.py` contains the processed lyrics data: x and y coordinates, cluster assignments, sentences, song titles, artists, cosine similarity scores, and confidence levels. This information was used to visualize the lyric lines in an interactive plot.
The main scatterplot SVG was created to plot the data, and scales for axes were defined based on the minimum and maximum x and y values of data. Scatterplot circles representing lyric points were drawn using D3, plotted according to their x and y coordinates, and color-coded based on the cluster type the lyric represents. Drag behavior was implemented to make the circles draggable, enhancing interactivity. A force simulation was created to apply forces on the lyric circles for collision detection and positioning, ensuring they don't overlap.
A tooltip was added to display details of each lyric point on the plot upon hovering, showing the lyric sentence, cluster, song title, artist, cosine similarity score, and confidence level. The tooltip contains a doughnut chart made with Chart.js representing the cluster distribution, with each unique cluster color-coded via a custom-generated color scheme. Horizontal meters depict the x, y, and confidence-level values, a circular progress bar shows the cosine similarity score, and an area chart created with Chart.js graphs all x and y lyric data points for the chosen song.
Overall, the code dynamically generates a scatterplot visualization where each circle represents a lyric line from songs. It leverages various D3.js features for data binding, DOM manipulation, tooltip creation, drag behavior, force simulation, zooming, and chart creation, providing an interactive interface for exploring and analyzing the song lyrics data in detail.