Categorization of Mouse Ultrasonic Vocalizations Using Machine Learning Techniques

(1)

Article

Categorization of Mouse Ultrasonic Vocalizations

Using Machine Learning Techniques

Spyros Kouzoupis1,* , Andreas Neocleous2 and Irene Athanassakis3

1 _{Department of Music Technology and Acoustics, Hellenic Mediterranean University, 71500 Crete, Greece}

2 _{Department of Computer Science, University of Cyprus, 1678 Nicosia, Cyprus; [email protected]}

3 _{Department of Biology, University of Crete, 70013 Heraklion, Greece; [email protected]}

* Correspondence: [email protected]

Received: 13 July 2019; Accepted: 23 October 2019; Published: 4 November 2019

Abstract:A study of the ultrasonic vocalizations of several adult maleBALB/cmice in the presence of a female, is undertaken in this study. A total of 179 distinct ultrasonic syllables referred to as “phonemes” are isolated, and in the resulting dataset, k-means and agglomerative clustering algorithms are implemented to group the ultrasonic vocalizations into clusters based on features extracted from their pitch contours. In order to find the optimal number of clusters, the elbow method was used, and nine distinct categories were obtained. Results when thek-means method was applied are presented through a matching matrix, while clustering results when the agglomerative technique was applied are presented as a dendrogram. The results of both methods are in line with the manual annotations made by the authors, as well as with the ones presented in the literature. The two methods of unsupervised analysis applied on 14 element feature vectors provide evidence that vocalizations can be grouped into nine clusters, which translates into the claim that there is a distinct repertoire of “syllables” or “phonemes”.

Keywords:ultrasonic vocalizations; miceBALB/cbiosignals;k-means clustering; bioacoustics

1. Introduction

It has been known for over three decades that mice produce ultrasonic courtship vocalizations [1]. Specific pitch patterns are characterized by short, repeating monochromatic vocalizations with silent gaps in between [2]. The production of ultrasonic vocalizations in mammals has a communicative role [3–7]. A number of studies have shown that mice produce ultrasonic vocalizations in at least two situations: Pups produce “isolation calls” when cold or when removed from the nest [8,9], while males emit ultrasonic vocalizations in the presence of females or when they detect female urinary pheromones [3]. It has also been observed that females in oestrus are attracted to vocalizing males, thus facilitating the process of reproduction [10]. Several studies use the term “courtship vocalizations” [7,11–13]. It is known that these vocalizations are mainly monochromatic and appear in sequences of small bursts with approximately 30 ms of silent gaps in between [4]. These small bursts are typically called “phonemes”. A spectrogram of such a sequence is presented in Figure1. The vertical red lines indicate the points where phonemes were manually segmented.

As described by Holy and Guo [3], murine vocalizations fall into two major groups: a group consisting of “modulated” pitch contours and a group containing “pitch jumps”. The frequency trace of the vocalization in the “pitch jump” group contains one or two discontinuities. They actually report three types of phonemes as subcategories. Similar phonemes appear in our dataset and will be presented in Section3.

The concept of using a representation of the pitch tracks as feature vectors and the use of machine learning techniques can also be found in [7,11,14–16]. In [14], Musolf et al. used the Avisoft

(2)

software to extract 25 features including, among others, frequency values indicating the start, center, and end frequency of the pitch track, frequency at peak energy, as well as the vocalization duration. They performed dimensionality reduction using principal component analysis and applied supervised techniques such as support vector machines. The phonemes of male mice were classified into seven categories (five modulated and two pitch jumps). In [11], the authors suggest four categories for the repertoire of the wild house mice. Two belong in the modulated and two in the pitch jump group.

Since the phonemes of male vocalizations are repeating, Holy and Guo introduced the term “male mice songs” [3]. The term “song” has been widely used in animal vocalization research, including on birds, frogs, whales, and bats [17–20]. The terms “calls”, “phonemes”, “motifs”, and “phrases” relate to the structure of a song in a similar way to the songs in music [21].

Figure 1.Spectrogram of an ultrasonic vocalization sequence consisting of several syllables.

In this study, we apply two unsupervised machine learning methods for clustering the ultrasonic vocalizations of young maleBALB/cmice in the presence of a female in oestrus (BALB/cis an albino, laboratory-bred strain of the house mouse from which a number of common substrains are derived; having initially been bred in New York in 1920, over 200 generations ofBALB/cmice are now distributed globally, and they are among the most widely used inbred strains used in animal experimentation). More precisely, we chose to apply thek-means method mainly because not only it is easy to understand and implement, but it is also one of the most widely used algorithms for clustering. Since we usek-means, which is a distance-based method, we normalized the data (range between 0 and 1). Furthermore, we chose to apply an additional method (agglomerative clustering) for the verification of our results. The agglomerative clustering results become more interesting when they are visualized as a dendrogram. The added value of such a method is that one can explore subclusters, which can give insights and a better understanding of the data. We show that the ultrasonic vocalizations can be categorized into nine distinct categories based on their acoustical features. This fact reinforces the claim that there is a distinct repertoire of “syllables” or “phonemes” and comprises the main contribution of this work.

The rest of the paper is organized as follows: In Section2, the methodology pertaining to data acquisition, signal analysis, and dataset formation along with the machine learning techniques used for clustering, is presented. In Section3, remarks on our results are made, followed by a discussion and concluding remarks in Section4.

2. Materials and Methods

2.1. Experimental Conditions and Recordings

(3)

All the experiments were conducted with the mice in a container of dimensions of 30×15×15 cm. The ultrasonic condenser microphone (Model CM16/CMPA, Avisoft Bioacoustics, Berlin, Germany) was placed on top of the container at a distance of approximately 30 cm (Figure2).

Figure 2. Left: Equipment used for the recording of the ultrasonic vocalizations. Right: Container hosting a female and a male mouse.

Signal acquisition was done through Ultra Sound Gate (Avisoft Bioacoustics, Berlin, Germany), which performs analog to digital conversion, signal conditioning, and provides polarizing voltage to the ultrasonic condenser microphone. It comes with built-in anti-aliasing filters, and its A/D converters are of a Delta–Sigma architecture. It can amplify the signal up to 40 dB, while its sampling frequency can reach 750 kHz at 8 bit resolution. We used a 250 kHz sampling rate at a 16 bit resolution, and the recording was done in the SASLab software environment (Avisoft Bioacoustics, Berlin, Germany).

2.2. Data and Pitch Contour Extraction

The total duration of all the ultrasonic recordings undertaken sum up to about 2.5 h. Several recording sessions were undertaken, and more than 15 different mice were used. The recordings were manually segmented, resulting in a dataset of 179 phonemes.

For each phoneme extracted from a series of audio recordings, contours were formed from the detected vectors of the monochromatic frequencies. The pitch contour of the ultrasonic monochromatic vocalizations is formed from overlapping frames, where in each frame the most prominent frequency is detected through the FFT. (Here, we use the term “pitch” loosely since this term is meant to be used for the perceived sensation of frequency). The amplitude of the prominent frequency, compared to the remaining spectrum, is significantly higher, as shown in Figure3, and thus, it is relatively easy to identify a unique pitch candidate for each frame. After the estimation of pitch candidates, a post-processing procedure is applied in order to correct some errors, appearing mainly in the noisiest parts of the signal. All pitch contour shapes were manually categorized into one of the representative categories depicted in Figure4(see also Section3).

2.3. Feature Extraction

The approach taken here was to define a set of 14 features that pertain to: (a) the frequencies themselves and (b) the temporal positions of some designated time-points across the phonemes. Thus, the features are: (1) the beginning frequency, (2) the end frequency, (3) the mean frequency, (4) the standard deviation of the frequency, (5–6) the middle frequency (value and position), (7–8) the first quartile of the frequency (value and position), (9–10) the second quartile of the frequency (value and position), (11–12) the third quartile of the frequency (value and position), and (13–14) the fourth quartile of the frequency (value and position).

(4)

The decision on the number and nature of parameters was made with robustness and low dimensionality in mind. The selected parameters effectively attribute the contour shape of each vocalization. Descriptors based on the signal’s acoustic characteristics were not used (i.e., those derived from the Fourier transform, like the spectral centroid or the spectral spread).

0 20 40 60 80 100 120 140

Frequency (kHz) -90

-85 -80 -75 -70 -65 -60 -55

Amplitude (dB)

Figure 3.A single audio frame spectrum. Note the prominence of the main vocalization frequency, which essentially makes this type of high-frequency mouse vocalization monochromatic.

2.4. k-Means

Thek-means algorithm is repeatedly applying two operations until convergence is reached. It initially computes the distances between all data points and all the prototypes representing every cluster and assigns the data points to the cluster belonging to the closest prototype. Then, it repositions the prototypes to the mean of each cluster. The convergence criterion can be set as the condition where all of the prototypes are not significantly different from the mean values of their clusters. Another criterion can simply be a specified number of iterations. In this paper, we used the first case as a convergence criterion.

We used the elbow method for finding the optimal number of clusters. First, we compute the sum of squared errors (SSE) fork=1, 2, . . . , 15, as shown in Equation (1), and then we look at the values of the SSE across the number of clusters. Typically, the SSE drops significantly as more clusters are added. After the optimal number of clusters for a particular problem is reached, the SSE remains relatively constant.

Wk= k

∑

r=1

1

nrDr (1)

In Equation (1),kis the number of clusters,nris the number of points in clusterr, andDr is the

sum of distances between all points in a cluster, given by:

Dr = nr−1

∑

i=1 nr

∑

j=i

||di−dj||₂ (2)

(5)

11 44 51.17 61.79 Frequency (KHz) Category "A" 3 37 54.71 66.18 Category "B" 11 57 66.15 93.37 Category "C" 4 122 49.47 70.82 Category "D" 8 121 46.79 69.89 Frequency (KHz) Category "E" 4 25 40.49 60.61 Category "F" 6 45 50.78 64.29 Category "G" 8 110 50.81 82.66 Time (ms) Category "H" 2 149 45.37 87.35 Frequency (KHz) Time (ms) Category "I" 6 55 49.38 98.91 Time (ms) Category "J" 6 72 54.21 83.56 Time (ms) Category "K"

Figure 4.A typical phoneme pitch contour for every category.

3 6 9 12 15

2 4 6 8 10 12 14

Within−cluster sum of squares

Number of clusters Optimal number of clusters

Figure 5.Within-cluster sum of squared errors (SSE) across 15 runs of thek-means algorithm with

k=1, 2, . . . , 15. The optimal number of clusters in this example is 9 since after this model, the SSE does not improve significantly.

(6)

2.5. Agglomerative Clustering

We chose to apply agglomerative clustering [22] to our data in order to compare the results with the widely usedk-means clustering algorithm. In addition, data can be viewed as a dendrogram, which can give a better understanding of the subclusters and the relation of the instances between pairs.

Agglomerative clustering groups the instances in a dataset with a hierarchical order. Initially, each instance is seen as a separate cluster. Then, a similarity measure is computed between all possible pairs. A linkage criterion determines pairs of instances that are grouped together within a single cluster, creating the first layer of subclusters. In a similar manner, pairs of data are linked in subclusters until all instances become one cluster. A threshold can be used in any branch of the tree to get the clustering results.

The similarity measure can be any distance, for instance, the Euclidean or Mahalanobis distance. Examples of linkage criteria include the maximum or minimum distances between pairs or other more sophisticated approaches taking into account the probabilities between distributions. A dendrogram derived from our dataset is shown in Figure6. We elaborate further on the outcome of the agglomerative clustering in Sections3and4.

3. Results

By visual inspection of the pitch contours, we identified 11 generic contour shapes, which were labeled according to their shape (see Figure 4). In Figure 7, we present some statistics of the phonemes in each category. The ground truth here is used only to measure the performance of the unsupervised methods. Therefore, the recommended number of clusters is derived by the “elbow method”. The ground truth is not involved in the training procedure.

Number of instances 0

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

Linkage value

Threshold = 0.39

Cl. 1 Cl. 2 Cl. 3 Cl. 4 Cl. 5 Cl. 6 Cl. 7 Cl. 8 & 9

Figure 6.Dendrogram resulting from the agglomerative clustering algorithm. Categories are indicated with different colors (in the electronic version of the paper).

(7)

40 50 60 70 80 90 100 K Jump2 (n = 5)

J Jump1 (n = 6) I Jump M left (n = 7)

H jump asymmetric lamda right (n = 6) G Linear negative (n = 16) F Linear positive (n = 17) E M left (n = 21) D M type (n = 22)

C Asymmetric lamda right (n = 26) B Asymmetric lamda left (n = 29) A Lamda type (n = 24)

kHz (a) Frequency

50 100 150

Duration

ms (b)

Figure 7. Statistical properties of the frequency (a) and the duration (b) of the phonemes of the 11 categories. The number of instances in each category is shown in parentheses after the labels on the left side of the figure.

In Figure8, mean normalized values of the 14 features for all 179 phonemes are plotted. The dots represent these values in every category and are joined with lines for better visualization. The category label (see Figure4) is annotated below the horizontal axis, while the vertical lines simply separate the different categories. The number of phonemes in each category is shown at the top of the figure between the separation lines. The average value for each category is pictured as a thick horizontal line. In Section4, we discuss further the consistency between the feature set, as shown in Figure8, and the results from thek-means algorithm.

A B C D E F G H I J K

Number of phonemes

0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95

Mean normalized values over 14 features

24 29 26 22 21 17 16 6 7 6 5

(8)

The clustering results from thek-means algorithm fork=9 are shown with a matching matrix in Table1. The rows indicate the 11 ground truth categories and the columns the k-means output. More than 80% of the A - lambda category has been assigned to cluster 9, together with the 45% of the B - asymmetric lambda left. The majority of the instances in categories C - asymmetric lambda right and G - linear negative are assigned to cluster 3 (95% and 86%, respectively), showing some similarity in their feature vectors. Similarly, 77% of the D - M type, 81% of the E - M left, and 80% of the K - jump 2 categories are assigned to cluster 1. The two categories H - jump asymmetric lambda right and I - jump M left (67% for H and 100% for I), are placed in cluster 2. The category F - Linear positive is spread in clusters 4–7.

In Figure6 we present a dendrogram derived from the agglomerative clustering algorithm. In order to view the several clusters of this method, it is necessary to apply a threshold to a branch of the tree, for the sub-branches to form the respective clusters. With the threshold set at 0.4 we get 9 clusters, shown with different colors in Figure6(in the electronic version of the paper). The first five clusters are stemming from the same branch in the two higher levels of the tree and they contain mainly phonemes from the modulated group. The remaining four clusters are coming out from different branches of the tree and they contain phonemes mainly from the pitch jump category.

Table 1.Matching matrix (k-means).

Category/Cluster 1 2 3 4 5 6 7 8 9

A - Lambda 0 0 0 0 0 4 0 0 20

B - Asymmetric lambda left 0 0 0 1 0 2 13 0 13 C - Asymmetric lambda right 0 0 20 0 0 5 0 0 1

D - M type 17 0 3 0 0 0 0 0 2

E - M left 17 0 4 0 0 0 0 0 0

F - Linear positive 0 0 0 7 3 2 5 0 0

G - Linear negative 2 0 12 0 0 2 0 0 0

H - Jump asymmetric lambda right 0 4 1 0 0 1 0 0 0

I - Jump M left 0 7 0 0 0 0 0 0 0

J - Jump 1 0 0 0 0 0 0 0 5 1

K - Jump 2 4 1 0 0 0 0 0 0 0

4. Discussion

In this work, we used unsupervised machine learning techniques to create clusters of phonemes that share similar characteristics. It is shown in Figure7that the mean frequency of the J category is statistically higher than the average frequency among all categories. In the matching matrix (Table1), by comparing thek-means outputs to the ground truth, we can see that 83% of the phonemes of the J category fall in the 8th cluster. From the matching matrix, we also observe that categories D, E, and K are seen by thek-means mainly as one class. The mean durations of the phonemes for categories D and E are similar, although they are statistically higher than the overall mean duration (Figure7).

Looking at Figure8, we observe very similar values for the normalized means across the 14 feature values for categories D, E, and K. Similarly, from the matching matrix we see that categories C and G fall into cluster 3. The normalized mean feature values of categories A, C, and G are also statistically similar. The highest mean feature value appears in category F, and the pitch jump categories appear to have relatively lower mean feature values in relation to the total mean value across all categories.

The use of the dendrogram as a visualization and clustering technique adds some value to the understanding of the data and the hierarchy of the similarity between phonemes. It seems that if the threshold is placed at 0.39, then indeed, the agglomerative clustering returns results that are consistent with those of thek-means algorithm, as well as with those reported in the literature [3].

(9)

of recordings. Nevertheless, we claim that since our results are consistent with those reported in the literature, a richer dataset would have yielded similar results.

Note also that despite the fact that data were recorded both in the audio and the ultrasonic spectrum, in this study we focused primarily on the ultrasonic range. The ultrasonic phonemes, in most cases, were mainly continuous whistles with a mean duration of 500 ms. The sounds with discontinuities (denoted as jumps) are attributed to the nonlinearities of the vocal fold system, pertaining to bifurcations. Thus, in our study, the precise representation of the frequency trace (contour) was of primary concern. This is the main reason we developed the mentioned contour extraction and processing method instead of using other freely available software, such as MUPET [23]. MUPET is a remarkable project, but its contour extraction process inspired by the workings of the human auditory system and symmetric and evenly distributed Gammatone filterbank (composed of 64 band-pass filters) that spans the whole mouse ultrasonic vocalization (USV) frequency range is not fully justified in our opinion (especially the fact that the upper and lower bounds of the frequency range is modeled by a smaller number of wider filters that are symmetric).

Concluding Remarks

Recordings of the repetitive ultrasonic monochromatic vocalizations of male mice in the presence of a female were made, and from these, phrases were manually isolated and pitch contours extracted. Two methods of unsupervised analysis were performed using 14 element feature vectors, providing evidence that vocalizations can be grouped into 9 clusters, which translates into the claim that there is a distinct repertoire of “syllables” or “phonemes”.

Machine learning techniques such ask-means and agglomerative clustering proved suitable for clustering the different pitch patterns based on their reduced feature samples. This work supports the claims made by the scientific community concerning the existence of a distinctive set of mice ultrasonic vocalizations.

Author Contributions: Conceptualization, S.K. and I.A.; Formal analysis, S.K. and A.N.; Investigation, S.K., A.N., I.A.; Methodology, S.K., A.N., I.A.; Resources, S.K., A.N., I.A.; Software, S.K., A.N.; Writing original draft, S.K., A.N.

Funding:This research received no external funding.

Acknowledgments:All applicable international, national, and/or institutional guidelines for the care and use of animals were followed. This article does not contain any studies with human participants performed by any of the authors.

Conflicts of Interest:All authors declare that they have no conflict of interest.

References

1. Dizinno, G.; Whitney, G.; Nyby, J. Ultrasonic vocalizations by male mice (mus musculus) to female sex pheromone: Experiential determinants.Behav. Biol.1978,22, 104–113. [CrossRef]

2. Gourbal, B.E.; Barthelemy, M.; Petit, G.; Gabrion, C. Spectrographic analysis of the ultrasonic vocalisations of adult male and female balb/c mice.Naturwissenschaften2004,91, 381–385. [CrossRef] [PubMed] 3. Holy, T.E.; Guo, Z. Ultrasonic songs of male mice.PLoS Biol.2005,3, e386. [CrossRef] [PubMed]

4. Hofer, M.A.; Shair, H. Ultrasonic vocalization during social interaction and isolation in 2-week-old rats.

Dev. Psychobiol.1978,11, 495–504. [CrossRef] [PubMed]

5. Portfors, C.V.; Perkel, D.J. The role of ultrasonic vocalizations in mouse communication.Curr. Opin. Neurobiol.

2014,28, 115–120. [CrossRef] [PubMed]

6. Kobayasi, K.I.; Ishino, S. Riquimaroux, H. Phonotactic responses to vocalization in adult Mongolian gerbils (Meriones unguiculatus).J. Ethol.2014,32, 7–13. [CrossRef]

7. Hoffmann, F.; Musolf, K.; Penn, D.J. Ultrasonic courtship vocalizations in wild house mice: Spectrographic analyses.J. Ethol.2012,30, 173–180. [CrossRef]

8. Ehret, G. Infant rodent ultrasounds—A gate to the understanding of sound communication.Behav. Genet.

(10)

9. Hahn, M.E.; Schanz, N. The effects of cold, rotation, and genotype on the production of ultrasonic calls in infant mice.Behav. Genet.2002,32, 267–273. [CrossRef] [PubMed]

10. Hammerschmidt, K.; Radyushkin, K.; Ehrenreich, H.; Fischer, J. Female mice respond to male ultrasonic songs with approach behaviour.Biol. Lett.2009,5, 589–592. [CrossRef] [PubMed]

11. Hoffmann, F.; Musolf, K.; Penn, D.J. Spectrographic analyses reveal signals of individuality and kinship in the ultrasonic courtship vocalizations of wild house mice.Physiol. Behav. 2012,105, 766–771. [CrossRef] [PubMed]

12. D’amato F.R. Courtship ultrasonic vocalizations and social status in mice.Anim. Behav.1991,41, 875–885. [CrossRef]

13. Hanson, J.L.; Hurley, L.M. Female presence and estrous state influence mouse ultrasonic courtship vocalizations.PLoS ONE2012,7, e40782. [CrossRef] [PubMed]

14. Musolf K.; Meindl, S.; Larsen, A.L.; Kalcounis-Rueppell, M.C.; Penn, D.J. Ultrasonic vocalizations of male mice differ among species and females show assortative preferences for male calls.PLoS ONE2015,

10, e0134123. [CrossRef] [PubMed]

15. Liu, R.C.; Miller, K.D.; Merzenich, M.M.; Schreiner, C.E. Acoustic variability and distinguishability among mouse ultrasound vocalizations.J. Acoust. Soc. Am.2003,114, 3412–3422. [CrossRef] [PubMed]

16. Kalcounis-Rueppell, M.C.; Petric, R.; Briggs, J.R.; Carney, C.; Marshall, M.M.; Willse, J.T.; Rueppell, O.; Ribble, D.O.; Crossland, J.P. Differences in ultrasonic vocalizations between wild and laboratory California mice (peromyscus californicus).PLoS ONE2010,5, e9705. [CrossRef] [PubMed]

17. Marler, P.R.; Slabbekoorn, H.Nature’s Music: The Science of Birdsong; Academic Press: Cambridge, MA, USA, 2004.

18. Kelley, D.B.; Tobias, M.L.; Horng, S.; Ryan, M. Producing and perceiving frog songs; Dissecting the neural bases for vocal behaviors in xenopus laevis. InAnuran Communication; Ryan, M.J., Ed.; Smithsonian Institution Press: Washington, DC, USA, 2001; pp. 156–166.

19. Au, W.W.; Pack, A.A.; Lammers, M.O.; Herman, L.M.; Deakos, M.H.; Andrews, K. Acoustic properties of humpback whale songs.J. Acoust. Soc. Am.2006,120, 1103–1110. [CrossRef] [PubMed]

20. Behr, O.; von Helversen, O. Bat serenades complex courtship songs of the sac-winged bat (saccopteryx bilineata).Behav. Ecol. Sociobiol.2004,56, 106–115. [CrossRef]

21. Busnel, R.G.Acoustic Behavior of Animals; Elsevier: Amsterdam, The Netherlands, 1964.

22. Gowda, K.C.; Krishna, G. Agglomerative clustering using the concept of mutual nearest neighbourhood.

Patt. Recogn.1978,10, 105–112. [CrossRef]

23. Van Segbroeck, M.; Knoll, A.T.; Levitt, P.; Narayanan, S. MUPET-Mouse Ultrasonic Profile ExTraction: A Signal Processing Tool for Rapid and Unsupervised Analysis of Ultrasonic Vocalizations.Neuron2017,

94, 465–485. [CrossRef] [PubMed] c