A Method for Perceptual Attribute Transcription


2017 2nd International Conference on Advances in Management Engineering and Information Technology (AMEIT 2017) ISBN: 978-1-60595-457-8


Li-xin SHI¹,* and Chuan-ping TONG²

¹ College of Information and Communication Engineering, Dalian Nationalities University, 116600, China

² Dalian Air Force Communication Sergeant College, 116600, China

* Corresponding author

Keywords: Perceptual attribute, Basic attribute, Perceptual coordinate system.

Abstract. The perceptual attributes of timbre matter more than its basic and statistical attributes for music retrieval and emotion analysis. To transcribe perceptual attributes rapidly, a transcription method based on the basic attributes is proposed. Pitch, volume and duration are used to construct a 3D perceptual coordinate system in which perceptual attributes are transcribed. Experiments on common musical instruments show that the accuracy of this method exceeds 70%.

Introduction

Timbre is the perceived quality of a sound. It is multidimensional, comprising basic attributes, statistical attributes and perceptual attributes. The attributes at each layer are interrelated yet significantly different. Timbre analysis, as the basis of music signal processing, has a wide range of applications: the basic attributes suit note recognition and chord recognition, the statistical attributes suit rhythm recognition and melody recognition, and the perceptual attributes suit music retrieval and emotion analysis.

Domestic research on timbre mainly focuses on how playing technique changes the timbre of a musical instrument, and rarely addresses the various attributes of timbre. Liu et al. [1] summarized the basic features, complex features and overall characteristics of music in 2002, and the following year extracted melody, harmony, rhythm and other complex characteristics to control a musical fountain. Li et al. [2] analyzed music structure using a chroma feature based on the discrete cosine transform. Chen and Lei [3] explored visualizing multivariate musical content based on visual thumbnails.

Foreign scholars have been devoted to the research and application of timbre since the ASA defined timbre in 1960. Jensen [4] analyzed timbre attributes in detail in his doctoral thesis, from the aspects of statistics, minimum description and the definition of instruments. Since then, the basic attributes and statistical attributes have been widely used. Early research on perceptual attributes mainly focused on establishing the perceptual space. Alluri and Toiviainen [5] suggested three perceptual dimensions: activity, brightness and fullness. Elliott et al. [6] attempted to relate the perceptual dimensions of timbre to quantitative acoustical dimensions, and showed that a specific combination of acoustic properties uniquely determines the gestalt perception of timbre. Fritz et al. [7] investigated the relation between the acoustical properties of the violin and its perceptual qualities as expressed in the lexicon of timbre descriptors. Zacharakis et al. [8] analyzed perceptual variables through cluster analysis and factor analysis, and identified three salient, relatively independent perceptual dimensions. Zanoni et al. [9] used machine-intelligence regression techniques and audio features to model a set of high-level (semantic) descriptors for the automatic annotation of musical instruments in a training-based fashion.


Feature Extraction

Pitch

There are many approaches to pitch detection, such as the short-time autocorrelation method and the short-time average magnitude difference method in the time domain, the harmonic peak method in the frequency domain, and the wavelet analysis method in the time-frequency domain. Each method has its advantages and disadvantages. In this paper, the harmonic peak method with confidence is used to detect the pitch of a sound.

A musical sound consists of the fundamental frequency and a number of higher harmonic overtones. In its simplest form, pitch is taken to be the fundamental frequency. When the fundamental frequency is missing, it can be recovered from the spacing between the higher harmonic overtones.

Consider the following cases:

i) Although the amplitude of the fundamental frequency may not be the maximum, the frequency with the maximum amplitude is certainly a harmonic of the fundamental frequency.

ii) The more harmonic overtones a candidate frequency has, the more likely it is to be the fundamental frequency.

The fundamental frequency is calculated in the harmonic peak method with confidence using formulas (1) and (2):

$$L(N) = \frac{f_p}{N}, \quad 1 \le N \le 5 \tag{1}$$

$$B(N) = \sum_{i=1}^{m} P(i) \tag{2}$$

where L(N) is the candidate fundamental frequency, f_p is the frequency with the maximum peak, B(N) is the confidence, m is the number of harmonics, and P(i) is the amplitude of the i-th harmonic.
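The candidate search described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function name `harmonic_peak_pitch`, the Hann window, the fixed harmonic count `num_harmonics`, the relative amplitude floor `rel_floor` and the tie margin `tie_margin` are all assumptions introduced here. Each candidate is the strongest peak frequency divided by N, and its confidence is the sum of the spectral amplitudes found at its harmonics.

```python
import numpy as np

def harmonic_peak_pitch(signal, sample_rate, max_divisor=5,
                        num_harmonics=10, rel_floor=0.05, tie_margin=0.01):
    """Return (fundamental_hz, confidence) for a single note.

    Assumptions not stated in the paper: a Hann window, a relative
    amplitude floor, and a tie margin that favors the higher candidate.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    bin_width = freqs[1] - freqs[0]

    # f_p: frequency of the largest spectral peak, skipping the DC bin.
    f_p = freqs[1 + int(np.argmax(spectrum[1:]))]
    floor = rel_floor * spectrum[1:].max()

    best_freq, best_conf = f_p, -1.0
    for n in range(1, max_divisor + 1):
        candidate = f_p / n                      # L(N) = f_p / N
        confidence = 0.0                         # B(N) = sum of P(i)
        for i in range(1, num_harmonics + 1):
            k = int(round(i * candidate / bin_width))
            if k >= len(spectrum):
                break
            # P(i): strongest bin within one bin of the expected harmonic.
            amplitude = spectrum[max(k - 1, 0):k + 2].max()
            if amplitude >= floor:
                confidence += amplitude
        # A lower candidate must add clearly more harmonic amplitude to win.
        if confidence > best_conf * (1.0 + tie_margin):
            best_freq, best_conf = candidate, confidence
    return best_freq, best_conf
```

The tie margin matters because a sub-harmonic candidate such as f_p/4 covers the same partials as f_p/2; without some preference for the higher candidate, the plain sum of amplitudes cannot separate the two.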

Usually, the semitone is used to describe pitch in music. The relationship between the semitone number s and the fundamental frequency f (in Hz) is as follows:

$$s = 12 \log_2 \frac{f}{440} + 69 \tag{3}$$

Volume

The energy of an acoustic signal is easy to measure: it can be taken simply as the expected value of the squared samples inside a window of length w, and the volume can be approximated as a logarithmic function of this energy. Since the energy of an acoustic signal changes sharply from one stage to the next, the average energy is used to calculate the volume.

$$V = 10 \log_{10}\left(\frac{1}{w} \sum_{n=1}^{w} s[n]^2\right) \tag{4}$$
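As a minimal sketch of this volume measure, the snippet below averages the squared samples of a window and converts the result to decibels. The function name `volume_db` and the small constant `eps`, which guards against taking the logarithm of zero on silent input, are assumptions added for illustration.

```python
import numpy as np

def volume_db(samples, eps=1e-12):
    """Volume in dB as 10 * log10 of the mean squared sample value."""
    samples = np.asarray(samples, dtype=float)
    energy = np.mean(samples ** 2)      # average energy over the window
    return 10.0 * np.log10(energy + eps)
```

For a full-scale sine wave the mean square is 0.5, so this measure gives about -3 dB, and halving the amplitude lowers it by about 6 dB.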

Duration


then follow the derivative both backward and forward in time until it is smaller than a constant multiplied by the maximum of the amplitude. These two points are the start and end of the attack.

The same is done for the release. The minimum of the derivative, which corresponds to the middle of the release, is calculated as:

$$r_t = \min_t \frac{d}{dt}\,\mathrm{envelope}(t) \tag{6}$$

Then follow the derivative both backward and forward in time until it is bigger than a constant multiplied by the minimum of the amplitude. These two points are the start and end of the release.
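The release search can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the function name `release_bounds` and the parameter `ratio` are hypothetical, the derivative is approximated by first differences, and the stopping threshold is taken as `ratio` times the minimum derivative, which is one possible reading of the constant described above.

```python
import numpy as np

def release_bounds(envelope, ratio=0.1):
    """Return (start, middle, end) sample indices of the release.

    Assumption: the walk stops where the slope rises above
    ratio * (minimum slope); the paper does not give the constant.
    """
    envelope = np.asarray(envelope, dtype=float)
    slope = np.diff(envelope)            # slope[i] = envelope[i+1] - envelope[i]
    middle = int(np.argmin(slope))       # steepest decay: middle of the release
    threshold = ratio * slope[middle]    # negative for a decaying envelope

    start = middle
    while start > 0 and slope[start - 1] < threshold:
        start -= 1                       # walk backward in time
    end = middle
    while end < len(slope) - 1 and slope[end + 1] < threshold:
        end += 1                         # walk forward in time
    return start, middle, end + 1        # end + 1: last sample of the decay
```

The attack would be the mirror case, starting from the maximum of the derivative and stopping where the slope falls below the threshold.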

Perceptual Space

There are several ways to construct a perceptual space: i) obtain proximity ratings directly from a psychoacoustic experiment, or ii) calculate them from a collection of data, and then apply an appropriate multidimensional scaling model. In this paper, we focus specifically on constructing the perceptual space from the basic attributes of timbre.

Collection

We recorded 426 music sample segments in an ordinary room, from 4 pianos, 3 violins, 2 guitars, 2 flutes and 3 saxophones. Each segment contains only one note. We also collected 121 terms from descriptions of music in articles published between 2005 and 2015 and indexed by the China National Knowledge Infrastructure (CNKI). Some of the collected terms appear frequently and others only occasionally. Thirteen volunteers, including professional instrument players and ordinary listeners, were invited to describe the 426 segments using the 121 collected terms. Each volunteer could describe each segment with one or two terms. Of the 121 terms, 27 were mentioned only twice, 11 only once, and 12 were never used. These terms, which do not reflect the judgment of the majority, were removed from the collection.

Coordinate System

Pitch, volume and duration are the most common expression parameters used for isolated sounds in music. Pitch defines the perceived note of the sound, volume defines its perceived intensity, and duration defines its length. Since we aim to transcribe the perceptual attributes of timbre using basic attributes, pitch, volume and duration are selected as the coordinate axes. The pitch range of the piano is 27.5 Hz to 4186 Hz, that is, from A2 to c5. The pitch axis is scaled in semitones, from 21 to 108; the volume axis is scaled in normalized volume, from 0 to 1; and the duration axis is scaled in seconds.
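The semitone range of 21 to 108 follows from the piano's frequency range if semitone 69 is anchored at 440 Hz, which is the usual MIDI note-number convention. The sketch below checks both endpoints; the function name `hz_to_semitone` and its default arguments are illustrative assumptions.

```python
import math

def hz_to_semitone(freq_hz, ref_hz=440.0, ref_semitone=69):
    """Map a frequency in Hz to a semitone number (440 Hz -> 69)."""
    return 12.0 * math.log2(freq_hz / ref_hz) + ref_semitone

lowest = hz_to_semitone(27.5)      # lowest piano note
highest = hz_to_semitone(4186.0)   # highest piano note
```

Since 27.5 Hz is exactly four octaves below 440 Hz, `lowest` is 69 - 48 = 21; `highest` rounds to 108.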

Each music sample has its own pitch, volume and duration, and therefore corresponds to one point in the coordinate system. Samples described by the same term have similar pitch, volume and duration, so they lie near each other and occupy a specific region of the coordinate system. That region is therefore labeled with the term, and other regions are labeled with other terms. For many terms, some samples have pitch, volume and duration very different from the other samples of the same term, and may even be confused with another term. These samples should be removed from the term. In another case, the pitch, volume and duration of one term are very close to those of another term. The two terms then overlap and should be merged into one.
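One simple way to turn these term regions into a labeling rule is a nearest-centroid lookup, sketched below. The paper does not state how a region is delimited, so this is only an assumed stand-in: the function names, the min-max scaling of the three axes and the Euclidean distance are all choices made for illustration, and the sample values in the usage lines are invented.

```python
import numpy as np

def fit_term_centroids(points, terms):
    """Scale (pitch, volume, duration) points to [0, 1] and average per term."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    scaled = (points - lo) / span
    centroids = {}
    for term in sorted(set(terms)):
        mask = np.array([t == term for t in terms])
        centroids[term] = scaled[mask].mean(axis=0)
    return centroids, lo, span

def label_sample(point, centroids, lo, span):
    """Return the term whose centroid is closest to the scaled point."""
    scaled = (np.asarray(point, dtype=float) - lo) / span
    return min(centroids,
               key=lambda term: np.linalg.norm(scaled - centroids[term]))

# Hypothetical samples: (semitone, normalized volume, duration in seconds).
points = [[30, 0.8, 2.0], [32, 0.9, 2.2], [100, 0.3, 0.3], [98, 0.2, 0.4]]
terms = ["deep", "deep", "shrill", "shrill"]
centroids, lo, span = fit_term_centroids(points, terms)
```

Scaling the axes first keeps the semitone axis, whose range is far wider than the 0 to 1 volume axis, from dominating the distance.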


Figure 1. 3D coordinate system. Axes: Volume (dB), Pitch (semi), Duration (ms). Terms plotted: heavy, brisk, hard, deep, metal, rough, dreary, dim, uneven, lively, grim, closed, weak, light, complex, tinny, smooth, bleak, resonant, echoic, balanced, narrow, sonorous, nasal, unripe, dull, silent, numb, stuffy, informal, crude, quiet, rich, song, full, unbalanced, ringing, clear, sweet, gorgeous, velvety, clean, pure, attenuated, loud, mellow, soft, gloomy, brilliant, strong, slippery, mournful, cheerful, broad, silvery, bright, sharp, dynamic, keen, shrill, ripid, harsh, tinny, thin.

Experiments


From the sample library, 304 segments that were labeled consistently by the majority of volunteers (consistency > 80%) were selected, and the term chosen by the majority was taken as the standard perceptual attribute of each sample. Pitch, volume and duration were calculated for each sample with the formulas given in the Feature Extraction section, and the sample was then assigned a test perceptual attribute according to its position in the 3D coordinate system. The test perceptual attribute was compared with the standard perceptual attribute of the sample; the results are shown in Table 1.

Table 1. The relationship between test perceptual attribute and standard perceptual attribute.

Result        Coincident    Near     Far away
Samples       217           73       14
Percentage    71.4%         24.0%    4.6%

The test results show that:

(1) The transcription accuracy differs between kinds of musical instruments. Deviations for the violin, flute and saxophone are more frequent and larger, while those for the piano and guitar are fewer and smaller. Playing technique has a greater impact on the timbre of the violin, flute and saxophone, which leads to deviations in the transcribed perceptual attributes.


References

[1] Liu Dan, Zhang Naiyao, Zhu Hancheng. A Review on the Research of Music Features Recognition. Computer Engineering and Applications, 2002, 24.

[2] Li Xianglian, Li Ming, Liu Ruolun, Yan Yonghong. Music Structure Analysis Based on Timbre Unit Distribution. Acta Acustica, 2010, 35(2).

[3] Chen Yaxi, Lei Kaibin. Visualizing multivariate musical content based on visual thumbnails. Application Research of Computers, 2012, 29(7).

[4] K. Jensen. Timbre Models of Musical Sounds[D], 1997.

[5] V. Alluri, P. Toiviainen. Exploring Perceptual and Acoustical Correlates of Polyphonic Timbre. Music Perception, 2010, 27(3).

[6] T. M. Elliott, L. S. Hamilton, F. E. Theunissen. Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones. Journal of the Acoustical Society of America, 2013, 133(1).

[7] C. Fritz, A. F. Blackwell, I. Cross, J. Woodhouse, B. C. Moore. Exploring violin sound quality: Investigating English timbre descriptors and correlating resynthesized acoustical modifications with perceptual properties. Journal of the Acoustical Society of America, 2012, 131(1).

[8] A. Zacharakis, K. Pastiadis, J. Reiss, G. Papadelis. Analysis of Musical Timbre Semantics through Metric and Non-Metric Data Reduction Techniques. In Proc. of 12th International Conference on Perception and Cognition, 2012.