Animal Sound Recognition Based on Double Feature of Spectrogram in Real Environment


Ying Li
College of Mathematics and Computer Science, Fuzhou University
Fuzhou, China
fj_liying@fzu.edu.cn

Zhibin Wu
College of Mathematics and Computer Science, Fuzhou University
Fuzhou, China
n130320070@fzu.edu.cn

Abstract—In this paper, we propose an animal sound recognition method for various noise environments with different Signal-to-Noise Ratios (SNRs). In the real world, the ability to automatically recognize a wide range of animal sounds allows the habits and distributions of animals to be analyzed, which makes it possible to monitor and protect them effectively. However, because of the variety of environments and noises, existing methods struggle to maintain the recognition accuracy of animal sounds under low-SNR conditions. To address this problem, this paper proposes a double feature, consisting of a projection feature and a local binary pattern variance (LBPV) feature, combined with random forests for animal sound recognition. In feature extraction, a projection operation on the spectrogram generates the projection feature; meanwhile, the LBPV feature is generated by accumulating the corresponding variances of all pixels for every uniform local binary pattern (ULBP) in the spectrogram. The experimental results show that the proposed method can recognize a wide range of animal sounds and still maintains a recognition rate over 80% even at 10 dB SNR.

Index Terms—Animal sound recognition, local binary pattern variance, projection feature, random forests.

I. INTRODUCTION

The ecological environment is closely related to our lives, and animal sounds within it carry a large amount of rich information. Through animal sound recognition, we can understand and analyze the living habits and distributions of animals in order to monitor and protect them effectively.

Animal sound recognition is generally based on the spectrogram, time-based audio features, Mel Frequency Cepstrum Coefficients (MFCC), sound database indexing, or wavelet packet decomposition, with classification performed by classifiers such as the Support Vector Machine (SVM). Typical methods include animal sound recognition based on spectrogram correlation [1], right whale sound detection using an 'edge' detector operating on a smoothed spectrogram [2], animal sound recognition based on time-based audio features [3], and bird sound classification combining MFCC with SVM [4]. In addition, building on the classic method of text-based database query, Bardeli [5] proposes index-based animal sound retrieval, and Cugler et al. [6] propose an architecture for retrieval of animal sound recordings based on context variables. Recently, Exadaktylos et al. [7] confirmed the status of animals by sound recognition for livestock production optimization, and Potamitis et al. [8] presented a method for detecting specific bird sounds in long real-field recordings. In our recent work [9], we proposed a bird detection method in which bird sound signals are detected and selected via adaptive energy detection from bird sounds with background noise; a Mel-scaled Wavelet packet decomposition Sub-band Cepstral Coefficient (MWSCC) and MFCC are then extracted from those signals for classification with an SVM.

The existence of various noises in the real environment brings a series of challenges for recognizing animal sounds. To improve the recognition accuracy of animal sounds in noisy environments with low SNR, an animal sound recognition method based on a double feature of the spectrogram is proposed in this paper. We extract a projection feature and a local binary pattern variance (LBPV) feature from the spectrogram to generate the double feature. The projection feature [10], [11], the first layer of the double feature, is a global feature obtained by eigenvalue decomposition and projection of the entire spectrogram matrix. The second layer is the LBPV feature [12], which captures local features of the image, effectively combining the local binary pattern (LBP) feature [13], [14] with a contrast feature. The two features are complementary: they not only improve recognition performance effectively but are also robust to noise. Finally, we adopt random forests as the classifier, a combination classifier with good performance as well as fast speed [15].

After a series of designs, experiments, and analyses, we propose a framework for animal sound recognition based on the double feature of the spectrogram. As shown in Fig. 1, the spectrogram of the sound signal is computed first, then the double feature is extracted from it, and finally random forests (RF) are applied for classification.
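As an illustration of the first stage of this framework, a magnitude spectrogram can be computed with a short-time Fourier transform. The sketch below uses numpy only; the Hann window, 1024-sample frame length, and 512-sample hop are illustrative assumptions, not parameters specified in the paper.

```python
import numpy as np

def spectrogram(x, frame_len=1024, hop=512):
    """Magnitude STFT |S(t, f)|: rows are frames t, columns frequency bins f."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 s of a 1 kHz tone at the paper's 44.1 kHz sampling rate
x = np.sin(2 * np.pi * 1000 * np.arange(44100) / 44100)
S = spectrogram(x)
print(S.shape)  # (frames, frequency bins)
```

The spectral peak of the tone lands near bin 1000 * 1024 / 44100 ≈ 23, which is an easy sanity check on the frame/bin layout.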

II. DOUBLE FEATURE OF SPECTROGRAM

Feature extraction is the core of our animal sound recognition method: the effectiveness of the features directly affects the classification results. Therefore, we propose a double feature based on the time-frequency characteristics of sound signals, namely the projection feature and the LBPV feature.

This work is supported by the National Natural Science Foundation of China (No. 61075022).

Fig. 1. Animal sound recognition framework.


A. Projection feature

Different animal sounds have different frequency ranges, so their spectrograms differ. A sound signal can be transformed into its time-frequency spectrum S(t, f) using the Short-Time Fourier Transform (STFT), where t is the frame index and f is the frequency index. S(t, f) can be translated into a two-dimensional gray-scale image, namely the spectrogram. The t-th frame can be viewed as a vector S_t = [S(t, 0), ..., S(t, N-1)]^T, which contains N frequency bins. S_t is further converted to the log-scale normalized vector:

$$\hat{S}_t = 10\log_{10}(S_t) \quad (1)$$

$$\bar{S}_t = \hat{S}_t / \|\hat{S}_t\| \quad (2)$$

where $\bar{S}_t$ denotes the log-scale normalized t-th frame. These vectors are not suitable for classification because of their high dimensionality, so it is necessary to reduce their dimensions.
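The per-frame normalization of (1) and (2) can be sketched directly; the small eps added inside the logarithm is our own guard against zero-magnitude bins, not part of the paper's formulation.

```python
import numpy as np

def normalize_frame(s_t, eps=1e-10):
    """Log-scale a spectral frame (eq. 1) and L2-normalize it (eq. 2)."""
    s_hat = 10.0 * np.log10(s_t + eps)   # eps guards against log(0)
    return s_hat / np.linalg.norm(s_hat)

frame = np.array([1.0, 10.0, 100.0])
v = normalize_frame(frame)
print(np.linalg.norm(v))  # unit length by construction
```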

Eigenvalue decomposition is a simple and effective method of dimensionality reduction, and we use it here. Assuming that S(t, f) has M frames, the normalized frame vectors can be written as a matrix X ∈ ℝ^{M×N}, X = [\bar{S}_1, ..., \bar{S}_t, ..., \bar{S}_M]^T. The target of eigenvalue decomposition must be a square matrix, so the covariance matrix C ∈ ℝ^{N×N} of the matrix X is formed as C = X^T X.

The process of dimensionality reduction using eigenvalue decomposition can be written as

$$C = U \Lambda U^T \quad (3)$$

$$C\,(u_1, u_2, \ldots, u_N) = (u_1, u_2, \ldots, u_N)\begin{pmatrix}\lambda_1 & & 0\\ & \ddots & \\ 0 & & \lambda_N\end{pmatrix} \quad (4)$$

$$C = \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \cdots + \lambda_N u_N u_N^T \quad (5)$$

$$C \approx \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \cdots + \lambda_K u_K u_K^T, \quad K \ll N \quad (6)$$

where U ∈ ℝ^{N×N} is a matrix consisting of all eigenvectors u_1, ..., u_N of matrix C, and Λ is a diagonal matrix containing all eigenvalues λ_1, ..., λ_N, which represent the weights of the corresponding eigenvectors, with λ_1 ≥ λ_2 ≥ ... ≥ λ_N. In this paper, the value of eigenvalue λ_n reflects the importance of the corresponding eigenvector u_n for the animal sound: the higher the value, the more important the eigenvector. The matrix C can be approximately reconstructed from the first K columns of U and Λ, where K ≪ N, so eigenvalue decomposition can be used for dimensionality reduction. The contribution ratio η_K of the first K eigenvectors is calculated as

$$\eta_K = \sum_{i=1}^{K}\lambda_i \Big/ \sum_{j=1}^{N}\lambda_j \quad (7)$$

where η_K shows the significance of the first K eigenvectors in representing the sound. Fig. 2 uses the sound of a white crane as a sample: when K is small (below about 10), the contribution ratio of the first K eigenvectors increases rapidly; as K continues to increase, the ratio grows more gently and gradually approaches 100%.
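The contribution ratio of (7) can be computed for all K at once from the eigenvalues of C; the random matrix below merely stands in for a real spectrogram.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))             # M=100 frames, N=20 frequency bins
C = X.T @ X                                # covariance-style matrix, N x N
eigvals = np.linalg.eigvalsh(C)[::-1]      # eigenvalues in descending order
eta = np.cumsum(eigvals) / eigvals.sum()   # eta_K for K = 1..N (eq. 7)
print(eta[-1])  # using all eigenvectors explains everything: eta_N = 1
```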

Since matrix U contains the major information of the sound, we select the first K eigenvectors to form the basis matrix U_K ∈ ℝ^{N×K}. The projection feature is computed by projecting the spectrogram matrix X onto U_K:

$$X_K = X U_K \quad (8)$$

where X_K ∈ ℝ^{M×K} is the projection-feature matrix. The dimension of each frame thus decreases from N to K, with K ≪ N. The projection feature will be used as one component for animal sound recognition in various environments.
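Putting (3)-(8) together, a minimal sketch of the projection feature (again with a random stand-in for the spectrogram matrix X):

```python
import numpy as np

def projection_feature(X, K):
    """Project spectrogram frames onto the top-K eigenvectors of C = X^T X (eq. 8)."""
    C = X.T @ X
    eigvals, U = np.linalg.eigh(C)   # eigh returns ascending eigenvalue order
    U_K = U[:, ::-1][:, :K]          # top-K eigenvectors, largest eigenvalues first
    return X @ U_K                   # M x K projection-feature matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))       # M=100 frames, N=20 bins
X_K = projection_feature(X, K=6)     # K=6, the value chosen in Sec. III
print(X_K.shape)  # (100, 6)
```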

B. LBPV feature

LBPV feature is formed by accumulating the corresponding variances of all pixels for every ULBP value. The ULBP value characterizes the spatial structure of image texture, and the variance describes the contrast information of image texture. LBPV feature combines the two features.

The texture T in a local neighborhood of a gray-scale image is defined as the joint distribution of the gray levels of

P equally spaced pixels on a circle of radius R [13], [14]:

$$T \approx t\big(s(g_0 - g_c),\, s(g_1 - g_c),\, \ldots,\, s(g_{P-1} - g_c)\big) \quad (9)$$

where g_c is the gray value of the center pixel of the local neighborhood, g_i (i = 0, 1, ..., P-1) are the gray values of the P neighboring pixels, and s is a sign function:

$$s(x)=\begin{cases}1, & x \ge 0\\ 0, & x < 0.\end{cases} \quad (10)$$

LBP is a gray-scale texture operator, and the LBP value denotes the spatial structure of the image. The LBP operator forms the LBP value by sorting T in a certain direction and computing the resulting binary sequence, which can be written as the LBP_{P,R} value:

$$LBP_{P,R} = \sum_{i=0}^{P-1} s(g_i - g_c)\,2^i. \quad (11)$$

As shown in the solid line area of Fig. 3(a), a 3×3 image with gray values is taken as an example. The calculation of the LBP value of the middle point c with gray value 80 is shown in Fig. 3(b), where (141≥80)→1, (109≥80)→1, (89≥80)→1, (68<80)→0, (48<80)→0, (52<80)→0, (60<80)→0, (89≥80)→1, so LBP_{P,R} = (11100001)_2 = (225)_{10}. To calculate LBP values of edge pixels by (11), the dashed part of the image is extended using the process shown in Fig. 3(a).
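The worked example of Fig. 3(b) can be reproduced in a few lines. Note that (11) weights bit i by 2^i, while the figure reads the thresholded bits most-significant-first; the sketch below follows the figure's reading order so that it reproduces the value 225.

```python
def lbp_value(neighbors, center):
    """LBP of a pixel: threshold the P circular neighbors at the center value
    and read the resulting bits as a binary number (bit order as in Fig. 3(b))."""
    bits = [1 if g >= center else 0 for g in neighbors]
    value = 0
    for b in bits:               # most-significant bit first
        value = (value << 1) | b
    return value

# Worked example from Fig. 3(b): center gray value 80
print(lbp_value([141, 109, 89, 68, 48, 52, 60, 89], 80))  # 225
```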

The LBP operator produces 2^P different binary patterns, namely LBP values, where the P equally spaced pixels lie on a circle of radius R. Ojala et al. [14] propose the uniform pattern based on the fact that the vast majority of binary patterns contain at most 2 bitwise 0/1 changes. A uniform pattern has at most 2 bitwise 0/1 changes in its circular binary presentation. The U value is defined as the number of bitwise 0/1 changes in the pattern and is used to determine whether the pattern is uniform:

$$U\big(LBP_{P,R}(m,n)\big) = \big|s\big(g_{P-1}(m,n)-g_c(m,n)\big)-s\big(g_0(m,n)-g_c(m,n)\big)\big| + \sum_{i=1}^{P-1}\big|s\big(g_i(m,n)-g_c(m,n)\big)-s\big(g_{i-1}(m,n)-g_c(m,n)\big)\big|. \quad (12)$$

A pattern with U ≤ 2 is a uniform pattern, and its value is called the ULBP value, which can be written as the LBP_{P,R}^{u2} value:

$$LBP_{P,R}^{u2}(m,n)=\begin{cases}\displaystyle\sum_{i=0}^{P-1}s\big(g_i(m,n)-g_c(m,n)\big)\,2^i, & U\big(LBP_{P,R}(m,n)\big)\le 2\\ P(P-1)+3, & \text{otherwise}\end{cases} \quad (13)$$

where the superscript ''u2'' means that the uniform patterns have U values of at most 2.

The uniform pattern decreases the number of patterns from 2^P to P(P-1)+2, and we collect the patterns whose U values exceed 2 into one extra category, namely the (P(P-1)+3)-th class. Taking Fig. 3(a) as an example, where P = 8 and R = 1, there are 58 uniform patterns plus one ''other'' class, i.e. 59 ULBP values in total, which are obtained according to (13). The mapping between ULBP values and the serial numbers 1-59 is shown in Table I, where ULBP(k) is the ULBP value corresponding to serial number k.

For an M×N gray-scale image, each pixel (m, n) gets a ULBP value. These ULBP values again form an image, called the ULBP image. By counting the frequency of each value in the ULBP image, we obtain a vector that represents a texture feature of the gray-scale image. Fig. 3(c) shows the ULBP image formed by converting the values in the solid line area of Fig. 3(a) into ULBP values; it can also be regarded as a matrix of ULBP values, namely the ULBP value matrix u. Fig. 3(e) shows the histogram of the ULBP image, which also represents the texture feature vector of Fig. 3(a).
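Table I itself can be regenerated by enumerating the patterns whose circular transition count (12) is at most 2; for P = 8 this yields the 58 uniform values in ascending order, with serial number 59 reserved for all remaining patterns.

```python
def u_value(pattern, P=8):
    """Number of bitwise 0/1 transitions in the circular binary pattern (eq. 12)."""
    bits = [(pattern >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P))

# Uniform patterns (U <= 2) for P = 8, sorted as in the ULBP(k) row of Table I
ulbp = sorted(p for p in range(256) if u_value(p) <= 2)
print(len(ulbp))   # P(P-1) + 2 = 58 uniform patterns; bin 59 collects the rest
print(ulbp[:6])    # [0, 1, 2, 3, 4, 6] -- 5 is skipped, as in Table I
```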

For some image tiles, even if their ULBP images contain the same ULBP values, their textures may differ. Therefore, the variance is used to describe the contrast information of the texture [12]: a large variance represents a large change of texture within an image region. The LBPV feature is thus formed by using the variances of pixel gray values as weights for the ULBP values. The kth element LBPV(k) of the LBPV feature can be expressed as

$$LBPV(k)=\sum_{m=1}^{M}\sum_{n=1}^{N} w(m,n,k) \quad (14)$$

$$w(m,n,k)=\begin{cases}VAR(m,n), & LBP_{P,R}^{u2}(m,n)=ULBP(k)\\ 0, & \text{otherwise}\end{cases} \quad (15)$$

where k is an integer and k ∈ [1, P(P-1)+3]. w(m, n, k) denotes the weight of the ULBP value corresponding to the kth element of the LBPV feature, for pixel (m, n) in the spectrogram. When the ULBP value of pixel (m, n) equals ULBP(k) in Table I, w(m, n, k) is the variance of the gray values of the P equally spaced pixels on the circle of radius R around (m, n). LBPV(k) accumulates these weights over every pixel in the spectrogram. The elements LBPV(1), LBPV(2), ..., 

TABLE I
THE MAPPING BETWEEN ULBP VALUES AND THE SERIAL NUMBER

k        1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
ULBP(k)  0    1    2    3    4    6    7    8    12   14   15   16   24   28   30
k        16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
ULBP(k)  31   32   48   56   60   62   63   64   76   92   120  124  126  127  128
k        31   32   33   34   35   36   37   38   39   40   41   42   43   44   45
ULBP(k)  129  131  135  143  159  191  192  193  195  199  207  223  224  225  227
k        46   47   48   49   50   51   52   53   54   55   56   57   58   59
ULBP(k)  231  239  240  241  243  247  248  249  251  252  253  254  255  other

Fig. 3. The calculation process of the LBPV feature. (a) Gray-scale image. (b) LBP value of the central point c. (c) ULBP value matrix u. (d) Variance matrix v. (e) Histogram of the ULBP image. (f) LBPV histogram.


LBPV(P(P-1)+3) are obtained according to (14), and the LBPV feature vector, of dimension P(P-1)+3, is finally formed.

Fig. 3(d) is the variance matrix v of the solid line area in Fig. 3(a). As shown in Fig. 3(f), the LBPV histogram, namely the LBPV feature, is formed by computing LBPV(k) from the ULBP values of Fig. 3(c), the corresponding serial numbers k in Table I, and the variances of Fig. 3(d). The process is as follows:

u(0,0) = u(0,1) = 193 = ULBP(38) → v(0,0) + v(0,1) = 577 + 653 → LBPV(38) = 1230,
u(0,2) = u(1,2) = 241 = ULBP(49) → v(0,2) + v(1,2) = 218 + 446 → LBPV(49) = 664,
u(1,0) = u(1,1) = 225 = ULBP(44) → v(1,0) + v(1,1) = 1111 + 880 → LBPV(44) = 1991,
u(2,0) = u(2,1) = 231 = ULBP(46) → v(2,0) + v(2,1) = 216 + 197 → LBPV(46) = 413,
u(2,2) = 255 = ULBP(58) → v(2,2) = 132 → LBPV(58) = 132.

Therefore, the corresponding values are put into LBPV = {0, …, LBPV(38), 0, …, LBPV(44), 0, LBPV(46), 0, 0, LBPV(49), 0, …, LBPV(58), 0}, which gives LBPV = {0, …, 1230, 0, …, 1991, 0, 413, 0, 0, 664, 0, …, 132, 0}, whose histogram is shown in Fig. 3(f).
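The whole worked example can be checked in code: accumulating the variances of Fig. 3(d) into the bins selected by the ULBP values of Fig. 3(c) reproduces the LBPV entries above. The helper names below are our own.

```python
import numpy as np

def u_value(p, P=8):
    """Circular 0/1 transition count of a P-bit pattern (eq. 12)."""
    bits = [(p >> i) & 1 for i in range(P)]
    return sum(bits[i] != bits[(i + 1) % P] for i in range(P))

# Sorted uniform patterns reproduce the ULBP(k) column of Table I (k = 1..58)
ulbp_values = sorted(p for p in range(256) if u_value(p) <= 2)

def lbpv_histogram(u, v):
    """Accumulate each pixel's variance into the bin of its ULBP value (eqs. 14-15)."""
    hist = np.zeros(len(ulbp_values) + 1)          # last bin: non-uniform ("other")
    index = {val: k for k, val in enumerate(ulbp_values)}
    for uv, var in zip(u.ravel(), v.ravel()):
        hist[index.get(uv, len(ulbp_values))] += var
    return hist

# ULBP value matrix and variance matrix of the worked example (Fig. 3(c), (d))
u = np.array([[193, 193, 241], [225, 225, 241], [231, 231, 255]])
v = np.array([[577, 653, 218], [1111, 880, 446], [216, 197, 132]])
hist = lbpv_histogram(u, v)
print(hist[37], hist[43], hist[48])  # LBPV(38)=1230, LBPV(44)=1991, LBPV(49)=664
```

Indices 37, 43, 48 are the 0-based positions of serial numbers k = 38, 44, 49 in Table I.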

III. EXPERIMENT DESIGN

A. Experimental data

Forty animal sounds are used in our experiments, comprising bird, mammal, and insect sounds, all of which come from Freesound [16]. Each sound is mono, in 'wav' format, and truncated to short segments of about 2 s. The sampling rate is 44.1 kHz and the quantization precision is 16 bits uniformly. The three environmental noises used in the experiments are wind, traffic, and rain noise, recorded from the real world with a SONY ICD-UX512F recorder at a 44.1 kHz sampling rate.

B. Experiment design

Two groups of experiments are designed to test the performance of the projection feature, the LBPV feature, and the double feature combined with random forests. The first group is used to decide the parameter K of the projection feature and the best scale (P, R) of the LBPV feature. The second group shows that the double feature better represents the animal sounds. Using the best values of the parameter K and the scale (P, R), we extract three features: the projection feature, the LBPV feature, and the double feature. Recognition-accuracy experiments are then carried out in the noiseless condition and in different noise environments with different SNRs, in comparison with the classic MFCC feature.

In every experiment, there are 30 samples of each class: 10 samples of each kind of sound are randomly selected for training, and the remaining 20 samples are used for testing.

C. Experimental results and analysis

The first group of experiments is conducted without environmental noise. Firstly, we test the relation between the accuracy rate and the parameter K of the projection feature, as described in (7) and Fig. 2. As shown in Fig. 4, the recognition accuracy of the projection feature increases with K; when K ≥ 6, it tends to flatten. As a tradeoff between computational cost and performance, we set K = 6.

Then we test the recognition accuracies of the LBPV feature at different scales and multi-scales; the results are shown in Table III. Using different (P, R) and combinations of multiple (P, R), we can extract LBPV features at different scales and multi-scales. Following previous research [14], we choose 7 groups of (P, R): (8,1), (16,2), (24,3), (8,1)+(16,2), (8,1)+(24,3), (16,2)+(24,3), and (8,1)+(16,2)+(24,3). We observe from Table III that the LBPV feature performs well at all scales, with all recognition rates over 96%. Taking both performance and computational cost into account, (P, R) is set to (16, 2).

The second group of experiments compares different features in the noiseless condition and in different noise environments with different SNRs; the results are shown in Tables III and IV and Fig. 5.

Fig. 4. The relation between recognition accuracy and the parameter K.

Fig. 5. Recognition rates of four features in three environments with different SNRs. (a) Rain noise. (b) Wind noise. (c) Traffic noise.

As shown in Table III, all four features, the projection feature, the LBPV feature, the double feature, and the MFCC feature, achieve a high accuracy rate in the noiseless condition, but the double feature is slightly higher than the other three.

To simulate the real environment, we perform experiments under different noise environments with different SNRs. Wind, traffic, and rain noise are added to the testing samples at 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, and 30 dB SNRs. Table IV shows the average accuracy rates of the four features in the three noise environments: the average accuracy rate of the double feature is 37.86% higher than MFCC, 16.58% higher than the LBPV feature, and 5.71% higher than the projection feature. This illustrates that combining the two features effectively improves recognition performance, and that the LBPV and projection features are complementary.

Fig. 5 shows the recognition results of the four features in the three environments with different SNRs; different noise environments affect recognition performance differently. Comparing the three environments, traffic noise has the worst influence on recognition performance, while rain and wind noise have less influence. The accuracy rates of the double feature are significantly higher than those of the other three features in the SNR range of 0 dB to 15 dB, which shows that the proposed method is robust to noise. When the SNR is higher than 15 dB, the accuracy rates of the LBPV and projection features approach that of the double feature, but the double feature still remains the highest.

IV. CONCLUSION

This paper proposes an animal sound recognition method based on the double feature of the spectrogram in different real-world noise environments. The experimental results indicate that the proposed method not only has good recognition performance but is also robust to noise. In the next stage of this study, we will optimize feature extraction to further improve the accuracy rate under low SNR.

References

[1] D. K. Mellinger and C. W. Clark, "Recognizing transient low-frequency whale sounds by spectrogram correlation," The Journal of the Acoustical Society of America, vol. 107, no. 6, pp. 3518-3529, 2000.

[2] D. Gillespie, "Detection and classification of right whale calls using an 'edge' detector operating on a smoothed spectrogram," Canadian Acoustics, vol. 32, no. 2, pp. 39-47, 2004.

[3] D. Mitrovic, M. Zeppelzauer, and C. Breiteneder, "Discrimination and retrieval of animal sounds," in Proc. IEEE 12th Int. Multi-Media Modelling Conf., 2006.

[4] S. Fagerlund, "Bird species recognition using support vector machines," EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 1, pp. 64-64, May 2007.

[5] R. Bardeli, "Similarity search in animal sound databases," IEEE Trans. Multimedia, vol. 11, no. 1, pp. 68-76, Jan. 2009.

[6] D. C. Cugler, C. B. Medeiros, and L. F. Toledo, "An architecture for retrieval of animal sound recordings based on context variables," Concurrency and Computation: Practice and Experience, vol. 25, no. 16, pp. 2310-2326, Jun. 2013.

[7] V. Exadaktylos, M. Silva, and D. Berckmans, "Automatic identification and interpretation of animal sounds, applications to livestock production optimization," InTech, Mar. 2014.

[8] I. Potamitis, S. Ntalampiras, O. Jahn, and K. Riede, "Automatic bird sound detection in long real-field recordings: Applications and tools," Applied Acoustics, vol. 80, pp. 1-9, Jun. 2014.

[9] X. Zhang and Y. Li, "Adaptive energy detection for bird sound detection in complex environments," Neurocomputing, vol. 155, pp. 108-116, 2015.

[10] S. Deng, J. Han, C. Zhang, T. Zheng, and G. Zheng, "Robust minimum statistics project coefficients feature for acoustic environment recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 8232-8236.

[11] J. Ye, T. Kobayashi, M. Murakawa, and T. Higuchi, "Robust acoustic feature extraction for sound classification based on noise reduction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'14), 2014, pp. 5944-5948.

[12] Z. Guo, L. Zhang, and D. Zhang, "Rotation invariant texture classification using LBP variance (LBPV) with global matching," Pattern Recognition, vol. 43, no. 3, pp. 707-719, Mar. 2010.

[13] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, Jan. 1996.

[14] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, Jul. 2002.

[15] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.

[16] Universitat Pompeu Fabra, "Freesound: repository of sounds under Creative Commons licenses," http://www.freesound.org, accessed May 14, 2012.

TABLE III
THE RECOGNITION RATES (%) OF LBPV FEATURE AT VARIOUS SCALES

(P, R)            8,1    16,2   24,3   8,1+16,2   8,1+24,3   16,2+24,3   8,1+16,2+24,3
Recognition rate  96.37  97.80  97.03  97.82      97.32      97.25       97.55

TABLE IV
THE AVERAGE ACCURACY RATES (%) IN DIFFERENT ENVIRONMENTS

Noise type   MFCC    LBPV feature   Projection feature   Double feature
Rain         47.22   67.48          83.85                89.21
Wind         50.25   75.87          85.68                91.92
Traffic      47.79   65.75          72.20                77.71
Average      48.42   69.70          80.58                86.28

TABLE III
COMPARISON OF DIFFERENT METHODS IN NOISELESS CONDITION

Method            LBPV feature   Projection feature   Double feature   MFCC
Recognition rate  97.80          97.32                98.02            93.74
