Data Processing and Annotation - Creating Song Datasets from Social Tags

3.3 Creating Song Datasets from Social Tags

3.3.3 Data Processing and Annotation

To produce a large final set of labeled songs (first requisite), we imported all tracks of the Million Song Dataset presented in [62]. It is one of the biggest song collections, created to test the scalability of algorithms to commercial sizes. We also mixed in the records of the Playlist dataset, a smaller collection (75,262 tracks) of more recent songs [63]. At this point, a total of 1,018,596 tracks was reached. Data processing continued with the removal of duplicate tracks. Afterwards, we crawled all tags of each track using the Last.fm API. Songs with no tags were removed, and a statistical analysis of the tags was performed. The most frequent tag was rock, appearing 139,295 times, followed by pop with 79,083 occurrences. We also analyzed tag type frequencies. Genre tags were the most common with 36% of the total, followed by opinion (16.2%) and mood (14.4%) tags.
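The cleaning and counting steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how duplicates were detected or whether a tag is counted once per track, so the case-insensitive (artist, title) key, the once-per-track counting, and the helper names `drop_duplicates` and `tag_statistics` are all hypothetical. The crawl itself is left out, and toy data stands in for the Last.fm responses.

```python
from collections import Counter


def drop_duplicates(tracks):
    """Keep the first occurrence of each (artist, title) pair.

    Assumption: two records are duplicates when artist and title match
    after trimming outer spaces and ignoring case.
    """
    seen, unique = set(), []
    for artist, title in tracks:
        key = (artist.strip().lower(), title.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((artist, title))
    return unique


def tag_statistics(tags_by_track):
    """Drop untagged tracks, then count in how many tracks each tag occurs."""
    tagged = {track: tags for track, tags in tags_by_track.items() if tags}
    counts = Counter()
    for tags in tagged.values():
        # Assumption: a tag counts once per track, case-insensitively.
        counts.update({tag.lower() for tag in tags})
    return tagged, counts


# Toy input standing in for the imported collections and the crawled tags.
tracks = drop_duplicates([
    ("Artist A", "Song 1"),
    ("artist a", "song 1 "),   # duplicate of the first record
    ("Artist B", "Song 2"),
])
tags_by_track = {
    ("Artist A", "Song 1"): ["rock", "mellow"],
    ("Artist B", "Song 2"): [],            # no tags: removed
    ("Artist C", "Song 3"): ["Rock", "pop"],
}
tagged, counts = tag_statistics(tags_by_track)
print(len(tracks), counts.most_common(1))
```

Counting over a set per track keeps one heavily tagged song from inflating a tag's frequency; counting raw occurrences instead would only require replacing the set with the original list.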

Among mood tags, mellow was the most frequent with 26,890 occurrences, followed by funk (16,324) and fun (14,777). The word cloud of mood tags is shown in Figure 3.6. There was an obvious bias towards positive emotion tags. This is probably because people are more inclined to give feedback when they listen to positive songs. Popularity bias may be another reason. After concluding the analysis of tag statistics, we removed every tag that was not about mood, as well as tags that were ambiguous (e.g., we could not know whether the tag love means that the user loves the song or thinks it is about love). At the end of this phase, 288,708 tracks remained. Further details about the data processing steps and tag statistics can be found in [64].

Fig. 3.6 Word frequency cloud of mood tags

Next, we identified and counted the emotion tags of each cluster appearing in the remaining tracks. Four counters (one per emotion cluster) were obtained for every track. To reach a polarized collection of songs (third requisite), we used a tight annotation scheme. A track is assigned to quadrant Qx if it fulfills one of the following conditions:

• has 4 or more tags of Qx and no tags of any other quadrant
• has 6 up to 8 tags of Qx and at most 1 tag of any other quadrant
• has 9 up to 13 tags of Qx and at most 2 tags of any other quadrant
• has 14 or more tags of Qx and at most 3 tags of any other quadrant
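The conditions listed above can be expressed as a small lookup over the four per-track counters. The sketch below is not the original implementation: the function name `assign_quadrant` and the `QUADRANT_BANDS` table are hypothetical, and it assumes that "tags of any other quadrant" refers to the combined number of tags from the three other quadrants.

```python
from typing import Dict, Optional

# (minimum tags of Qx, maximum tags from all other quadrants combined),
# listed from the most to the least permissive band.
QUADRANT_BANDS = [(14, 3), (9, 2), (6, 1), (4, 0)]


def assign_quadrant(counts: Dict[str, int]) -> Optional[str]:
    """Return the quadrant a track is assigned to, or None if it is discarded."""
    total = sum(counts.values())
    for quadrant, own in counts.items():
        foreign = total - own  # assumption: foreign tags are pooled
        for min_own, max_foreign in QUADRANT_BANDS:
            if own >= min_own:
                if foreign <= max_foreign:
                    return quadrant
                # The most permissive band reachable with `own` tags failed,
                # so every stricter band fails as well.
                break
    return None


print(assign_quadrant({"Q1": 9, "Q2": 1, "Q3": 1, "Q4": 0}))  # third band applies
print(assign_quadrant({"Q1": 5, "Q2": 1, "Q3": 0, "Q4": 0}))  # no band applies
```

Because at most 3 foreign tags are ever tolerated and at least 4 own tags are always required, no track can qualify for two quadrants at once, so the order in which quadrants are tested does not change the result.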

Songs with fewer than four tags or those not fulfilling any of the above conditions were discarded. This scheme guarantees that even in the worst-case scenario (song tag distribution), any song assigned to quadrant Qx has more than 75% of all its received tags in that quadrant. What remained was a collection of 1,986 happy (Q1), 574 angry (Q2), 783 sad (Q3) and 1,732 relaxed (Q4) songs, for a total of 5,075 (2,000 after balancing).
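The 75% bound can be checked by hand, assuming that the limit on tags "of any other quadrant" applies to the combined number of tags from the three other quadrants (a reading that the text does not state explicitly). Within each band of the scheme, the share of Qx tags is smallest when the song has the fewest allowed Qx tags and the most allowed foreign tags:

```latex
\[
\frac{4}{4+0} = 100\%, \qquad
\frac{6}{6+1} \approx 85.7\%, \qquad
\frac{9}{9+2} \approx 81.8\%, \qquad
\frac{14}{14+3} \approx 82.4\%
\]
```

Under this reading the lowest share is about 81.8%, which is above 75%. If the limit instead applied to each other quadrant separately, a song with 6 tags of Qx and 1 tag in each of the three other quadrants would reach only 6/9 ≈ 66.7%, so the pooled reading appears to be the one consistent with the stated bound.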

Table 3.3 Confusion matrix between A771 and ML4Q datasets (values in %)

A771 \ ML4Q   Happy   Angry   Sad     Relaxed
Happy         97.43   0.85    0       1.7
Angry         0.85    98.29   0.85    0
Sad           0       0.85    97.43   1.7
Relaxed       1.7     0       1.7     96.58

Datasets with a Positive vs. Negative representation are clearly oversimplified and do not reveal much about song emotionality. However, such datasets can still be used for various experimental purposes. We merged Q1 with Q4 (happy with relaxed) into the positive category and Q2 with Q3 (angry with sad) into the negative category. The corresponding tags of each cluster were recombined as well. As binary discrimination is easier, an even tighter annotation scheme was enforced. A track is considered to belong to Qx (positive or negative) only if:

• it has 5 or more tags of Qx and no tags of the other category
• it has 8 up to 11 tags of Qx and at most 1 tag of the other category
• it has 12 up to 16 tags of Qx and at most 2 tags of the other category
• it has 16 or more tags of Qx and at most 3 tags of the other category
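A possible way to derive the binary labels from the four quadrant counters is sketched below. It is an illustration only: the names `POLARITY`, `BINARY_BANDS`, `merge_counts` and `assign_polarity` are hypothetical, and recombining the tags is assumed to amount to adding the counters of the merged quadrants. The thresholds are copied literally from the list above, where 16 tags appears in both of the last two conditions; since a track only needs to fulfill one condition, the sketch lets a track with exactly 16 tags use the more permissive limit.

```python
from typing import Dict, Optional

# Quadrant-to-category mapping: happy/relaxed are positive, angry/sad negative.
POLARITY = {"Q1": "positive", "Q4": "positive", "Q2": "negative", "Q3": "negative"}

# (minimum tags of the category, maximum tags of the other category),
# listed from the most to the least permissive band.
BINARY_BANDS = [(16, 3), (12, 2), (8, 1), (5, 0)]


def merge_counts(quadrant_counts: Dict[str, int]) -> Dict[str, int]:
    """Add the quadrant counters of each category (assumed recombination)."""
    merged = {"positive": 0, "negative": 0}
    for quadrant, n in quadrant_counts.items():
        merged[POLARITY[quadrant]] += n
    return merged


def assign_polarity(quadrant_counts: Dict[str, int]) -> Optional[str]:
    """Return 'positive', 'negative', or None if the track is discarded."""
    merged = merge_counts(quadrant_counts)
    total = sum(merged.values())
    for category, own in merged.items():
        other = total - own
        for min_own, max_other in BINARY_BANDS:
            if own >= min_own:
                if other <= max_other:
                    return category
                break  # stricter bands cannot succeed either
    return None


print(assign_polarity({"Q1": 4, "Q2": 1, "Q3": 0, "Q4": 4}))  # 8 positive, 1 negative
```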

This scheme guarantees that even in the worst-case scenario (song tag distribution), any song labeled as positive or negative has more than 85% of all its received tags in that category. We obtained a collection of 2,589 negative and 5,940 positive songs, for a total of 8,529 (5,000 after balancing). The resulting datasets are clearly imbalanced towards positive songs, same as the emotion tags they were derived from.

To get an idea of the quality of the first labeling scheme, we compared our labels of the first dataset (ML4Q) with those of another dataset considered as ground truth. The most appropriate for our purpose was the dataset (here A771) described in [54]. It consists of 771 songs labeled according to the planar model of Russell, as in our case. The authors used AllMusic tags for the process and involved three persons to validate the annotation quality. The problem, however, is the size of this dataset: of the 771 songs it contains, only 117 were part of our initial collection of 5,075 labeled tracks.
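The comparison on the overlapping songs can be sketched as follows. This is a minimal illustration with made-up track identifiers and labels, not the actual ML4Q or A771 data; the function name `compare_labels` is hypothetical, and the text does not state how the songs of the two datasets were matched, so a shared identifier per track is assumed.

```python
from collections import Counter


def compare_labels(ours, reference):
    """Compare two label dictionaries on the tracks they have in common.

    Returns the shared track ids, the counts of (reference label, our label)
    pairs, and the overall agreement in percent.
    """
    shared = sorted(set(ours) & set(reference))
    pairs = Counter((reference[t], ours[t]) for t in shared)
    agree = sum(n for (ref, our), n in pairs.items() if ref == our)
    overall = 100.0 * agree / len(shared) if shared else 0.0
    return shared, pairs, overall


# Hypothetical toy labels keyed by an assumed common track identifier.
ml4q = {"t1": "happy", "t2": "sad", "t3": "angry", "t4": "relaxed", "t5": "happy"}
a771 = {"t1": "happy", "t2": "sad", "t3": "sad", "t4": "relaxed", "t9": "angry"}

shared, pairs, overall = compare_labels(ml4q, a771)
print(len(shared), overall)
```

The pair counts form an unnormalized confusion matrix; turning them into percentages requires choosing a denominator (per reference class or over all shared songs), a choice that is not specified here.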

In Table 3.3 we show the confusion matrix between labels of our dataset and those of A771 for each category. As we can see, the overall agreement between the two datasets is 97.28%. Despite the fact that this result is based on a small portion of the records, it seems to be high enough to confirm the validity of our method. Both

In document Text-based Sentiment Analysis and Music Emotion Recognition (Page 50-53)