UC Berkeley MIDS Program
Shuo Wang, Ivan Escalona, Daisy Khamphakdy, Iris Lew, Amanda Teschko
December 2022
This project builds machine learning models that let a music app automatically recognize a song’s genre when the song is added to its database, rather than relying on manual genre classification.
Kaggle link: https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify
Dimension: 42,305 rows x 22 columns
Features: 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms', 'time_signature', 'genre', 'song_name', 'Unnamed: 0', 'title'
Data dictionary:
https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features
https://docs.google.com/document/d/1LWl88F8wGY1WkkOSzzVzwSYij1yeMR8hFXllRpkMyrI/edit
- Underground Rap was the most popular genre in the data
- Imbalance in the records per genre
- Some numeric features are on different scales
- Some fields have high levels of missing data
- Some rows are duplicated
- Some tracks are mapped to more than one genre
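Several of the data issues listed above (duplicated rows, tracks mapped to more than one genre, imbalance in records per genre) can be checked with a few lines of code. The following is a minimal sketch in plain Python, not the team's actual notebook. It assumes each record is a dict carrying the dataset's `id` and `genre` columns, and it simplifies "duplicate" to mean the same (`id`, `genre`) pair. The sample rows and the genre names other than Underground Rap are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; only the column names `id` and `genre`
# and the label "Underground Rap" come from the source.
rows = [
    {"id": "a1", "genre": "Underground Rap"},
    {"id": "a1", "genre": "Underground Rap"},  # duplicated row
    {"id": "b2", "genre": "Underground Rap"},
    {"id": "b2", "genre": "Genre A"},          # same track, second genre
    {"id": "c3", "genre": "Genre B"},
]


def drop_exact_duplicates(records):
    """Keep the first occurrence of each (id, genre) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["id"], record["genre"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def multi_genre_ids(records):
    """Return {track id: set of genres} for tracks with more than one genre."""
    genres_by_id = defaultdict(set)
    for record in records:
        genres_by_id[record["id"]].add(record["genre"])
    return {tid: g for tid, g in genres_by_id.items() if len(g) > 1}


deduped = drop_exact_duplicates(rows)
conflicts = multi_genre_ids(deduped)
genre_counts = Counter(record["genre"] for record in deduped)

print(len(rows), "rows before,", len(deduped), "rows after de-duplication")
print("tracks with more than one genre:", sorted(conflicts))
print("records per genre:", genre_counts.most_common())
```

How to resolve a track that carries several genres (keep one label, drop the track, or treat the task as multi-label) is a modeling decision the sketch leaves open.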
- Baseline model: always predict the most popular genre in the raw data (Underground Rap)
- Random Forest
- XGBoost
- Neural Networks
- K-Means
- K-Nearest Neighbors
- Logistic Regression
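The baseline in the list above can be sketched in a few lines: find the most frequent label in the training data and predict it for every song. This is an illustrative plain-Python version, not the team's code. The helper names `majority_baseline` and `accuracy` are hypothetical, and the label counts are invented, so the resulting accuracy says nothing about the real dataset.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]

    def predict(_features=None):
        # The features are ignored on purpose: that is what makes it a baseline.
        return most_common_label

    return predict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Invented label distribution for illustration only.
labels = ["Underground Rap"] * 5 + ["Genre A"] * 3 + ["Genre B"] * 2

predict = majority_baseline(labels)
predictions = [predict() for _ in labels]
baseline_acc = accuracy(labels, predictions)

print("baseline always predicts:", predict())
print("baseline accuracy:", baseline_acc)
```

The baseline's accuracy equals the share of the majority class, which gives a floor that any trained model should clear. scikit-learn offers the same behavior through `DummyClassifier(strategy="most_frequent")`.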
- Feature scaling and data balancing are crucial for some models
- The benefits of different feature engineering techniques will vary from model to model
- Establishing a baseline gave us a better appreciation for our model, even though its accuracy wasn’t objectively high
- More investigation into balancing techniques could be helpful
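To make the scaling and balancing points above concrete, here is a minimal plain-Python sketch of two common techniques: z-score standardization of a numeric feature and random oversampling of minority classes. The source does not say which techniques the team used, so both are assumptions chosen for illustration. The function names `standardize` and `random_oversample` and the sample values are hypothetical.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Invented values: tempo in BPM sits on a very different scale from 0-1 features.
tempo = [90.0, 120.0, 150.0]
scaled_tempo = standardize(tempo)

samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["Underground Rap"] * 4 + ["Genre A"]
balanced_samples, balanced_labels = random_oversample(samples, labels)

print("scaled tempo:", [round(v, 3) for v in scaled_tempo])
print("class counts after oversampling:", Counter(balanced_labels))
```

Both steps should be fit on the training split only and then applied to validation and test data, otherwise information leaks from the held-out sets. Distance-based models such as K-Nearest Neighbors and K-Means are sensitive to feature scale, while tree-based models such as Random Forest and XGBoost are largely unaffected by it.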
Google Colaboratory