Application of Speech Emotion Recognition in the VoIP Call Center

Graduate Student: Chou-Huang Lin

Advisor: Prof. Tsang-Long Pao

Thesis for Master of Science

Department of Computer Science and Engineering

Tatung University

December 2008


ACKNOWLEDGMENTS

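For the completion of this thesis, I first thank my beloved family: my parents, my wife, my son, my brother, and my sister. Thank you for giving me the greatest emotional support and encouragement over these two years. It is because of your quiet devotion that I was able to complete my graduate studies smoothly.

I especially thank my advisor, Dr. Tsang-Long Pao, for his careful guidance over the past two years, from which I benefited greatly both in acquiring knowledge and in hands-on research. I also thank 白鎮宇, my senior in the Speech Emotion Recognition Laboratory of the Department of Computer Science and Engineering, for his technical support, as well as my classmates 正嘉, 啟豪, 信宏, 昆達, and 名明 and my juniors 士勛, 永護, 芳逸, and 政賢 for their mutual support and assistance. I dedicate this thesis to everyone who has helped me, to express my most sincere gratitude.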


ABSTRACT

Interactive voice response (IVR) in Voice over Internet Protocol (VoIP) networks has been studied in the past few years. In a traditional Call Center, an online customer interaction is transferred manually from the operator to the customer service manager. Because the interaction problem may not be resolved immediately, this can harm the corporate image. This thesis proposes an IVR mechanism based on a VoIP system with a built-in speech emotion recognition engine, which can automatically switch a conversation in which emotions are running out of control from the operator to a more experienced customer representative. The speech emotion recognition engine was developed by the Data Communication and Signal Processing Lab in the Department of Computer Science and Engineering, Tatung University. The engine works with the Conference Call Control of the IVR system on the VoIP-PBX server to perform an automatic call transfer when a predetermined emotional state is detected during the call with the customer representative.

To emulate an IVR system built on a VoIP-PBX, we installed the Asterisk® server, an open source, PC-based SIP-PBX. We use this system as the base for implementing an IVR system that performs crisis management with the emotion recognition engine. The emulation results show that such a system is feasible and valuable to call center service.

Keywords: interactive voice response (IVR), Voice over Internet Protocol (VoIP), Session Initiation Protocol (SIP)


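摘要 (CHINESE ABSTRACT)

The application of IVR in VoIP networks has been studied increasingly in recent years. In a traditional call center, the IVR mechanism relies on subjective human judgment: the online customer is transferred manually between the first-line customer service representative and a senior representative. Because communication barriers may not be removed immediately, the corporate image may suffer. This thesis proposes an IP-PBX IVR system with a speech emotion recognition mechanism that automatically resolves the problem of emotions running out of control during a call between a customer and a representative, and thereby preserves a good corporate image for the company. In this thesis, we use the speech emotion recognition engine developed in our laboratory to trigger the conference call control mechanism automatically when emotions run out of control during a call, so that a senior representative can step in at once to calm the customer.

To emulate the IVR system of an IP-PBX server, our experiment set up the open source communication platform Asterisk®. With this system we emulated a crisis-handling scenario between a customer and a representative to verify that the architecture of our automated IP-PBX IVR mechanism is feasible.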


LIST OF FIGURES

Figure 2.1: A simple session establishment example

Figure 2.2: The communication process of VoIP

Figure 2.3: VoIP multi-conference in many ways

Figure 2.4: Conversation structure of User Agent (UA)

Figure 2.5: Call center structure [6]

Figure 2.6: Block diagram of emotion recognition system [9]

Figure 2.7: Block diagram of emotion evaluation system [9]

Figure 2.8: Emotion radar chart of test data with angry emotion [9]

Figure 2.9: System interface (evaluation of test data from recording)

Figure 3.1: IP-PBX extension configuration

Figure 3.2: Interface between the speech emotion recognition engine and the IP-PBX server in a VoIP network

Figure 3.3: Architecture of the VoIP-IVR System

Figure 4.1: The inbound customer with extension 200 uses X-Lite software to connect to the Asterisk IP-PBX Server

Figure 4.2: The customer uses X-Lite software to dial extension 201, the CSR

Figure 4.3: The X-Lite software used by CSM

Figure 4.4: Wireshark capture interfaces

Figure 4.5: Wireshark capture options

Figure 4.6: The connection status of a SIP call redirection process


LIST OF TABLES

Table 2.1: Basic request methods and response codes used in SIP [3]

Table 2.2: Human performance confusion matrix (%) [11]

Table 2.3: Experimental results of weighted D-KNN (k=10, Fibonacci series weighted)


CHAPTER 1 INTRODUCTION

1.1 Call Center and IVR system

The Call Center is an important part of a modern enterprise. It is a useful tool that can improve both the competitiveness of an enterprise and customer satisfaction. Successful enterprises are usually those that pay attention to customer needs, and an enterprise must respond to those needs in order to survive and expand in a highly competitive world. The Call Center can also collect information and supply it to every department, so its contribution to the company is promising.

Although the Call Center is important to enterprises, the traditional Call Center lacks crisis management related to the emotion of the calling customer. Incorporating a speech emotion recognition engine into the IVR system of the Call Center therefore has potential applications. For example, it can avoid having the same representative serve customers in a bad mood consecutively. In this way the engine can reduce the stress on staff and may also reduce the turnover rate of representatives. Furthermore, by analyzing the conversation between the customer and the representative, we can perform content analysis and emotion detection to understand the customer's needs and then solve the customer's problem rapidly. An IVR system that incorporates a speech emotion recognition engine thus makes it possible to improve customer satisfaction and make the enterprise more competitive.


1.2 Motivation and Objective

A traditional automatic IVR system helps enterprises deal with a large number of daily customer requests. If the IVR system can detect anger in the conversation between the inbound customer and the first-line customer representative, and automatically redirect the call to an experienced customer service manager, it can prevent an emotional conflict between the customer and the first-line representative. This protects the company's corporate image and maintains a good relationship between the company and its customers. In this thesis, we propose an IVR system that incorporates a speech emotion recognition engine to improve the quality of customer service.

Failure to pacify a customer's emotion is often a main cause of customer loss. Although some IVR-PBX systems can analyze the customer's emotion after the conversation and notify the manager, the customer has usually already lost his or her temper during that conversation. We therefore think it is necessary to put forward a mechanism that can detect the customer's emotion and pacify the customer immediately. In addition, VoIP has become one of the new-generation telecommunication technologies, so for convenience and practicality our experiment takes VoIP as the baseline technology and uses the speech emotion recognition engine to detect anger in the conversation between customer and operator. Finally, we emulate a setup in which the call legs of the customer and the Customer Service Manager (CSM) are joined through the Third Party Call Control (3PCC) function that the Session Initiation Protocol (SIP) supports, so that the angry customer can be pacified.


1.3 Thesis Organization

The rest of the thesis is organized as follows. In Chapter 2, we review general guidelines for developing an IP-PBX IVR system and some background on SIP. The speech emotion recognition engine is also discussed in that chapter. In Chapter 3, we explain the construction of the IVR with automatic speech emotion recognition. In Chapter 4, we present the mechanism of the IVR with automatic speech emotion recognition. Finally, conclusions are given in Chapter 5.


CHAPTER 2 BACKGROUND

2.1 Introduction to SIP

Voice over Internet Protocol (VoIP) is gaining a lot of attention these days, as more companies and individuals switch from the Public Switched Telephone Network (PSTN) to phone service via the Internet. The reason is simple: a single network that carries both voice and data is easier to scale, maintain, and administer. Since 1998, parts of the networking industry have recognized that VoIP technology can substitute for the traditional PSTN and have begun to define IP telephony protocols such as H.323, SIP, and the Media Gateway Control Protocol (MGCP). Many issues need to be considered in developing a VoIP-IVR Call Center with speech emotion recognition, including VoIP, the PBX, and the speech emotion recognition function itself.

At first, SIP did not gain much attention, as H.323 was considered the protocol of choice for VoIP transport negotiation. However, as the business grew, SIP began to gain popularity, and many factors accelerated its growth. In particular, the success of SIP is due to its freely available specification [1].

2.1.1 SIP Definition

As defined in RFC 3261, a SIP URI identifies a communications resource. SIP is an application-layer signaling protocol that uses port 5060 by default. SIP can be transported over either UDP or TCP at the transport layer. However, SIP does not transport media between endpoints; RTP is used to transmit media (i.e., voice) between endpoints [1].

A SIP identity is a URI whose form is similar to an email address, typically containing a username and a host name. For example, a SIP identity looks like sip:[email protected], where domain.com is the domain of Bob's SIP service provider. SIP also provides a secure URI, called a SIPS URI. An example would be sips:[email protected] [2]. SIP and SIPS URIs may be placed in web pages, email messages, or printed literature. They contain sufficient information to initiate and maintain a communication session with the resource.
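The structure of such an identity can be illustrated with a minimal Python sketch. The function name parse_sip_uri, the class SipUri, and the address alice@example.com are hypothetical and chosen only for illustration. The sketch handles only the simple form scheme:user@host[:port] and ignores URI parameters, headers, and IPv6 hosts. It assumes the usual default ports, 5060 for sip and 5061 for sips.

```python
from dataclasses import dataclass


@dataclass
class SipUri:
    scheme: str  # "sip" or "sips"
    user: str    # user part, e.g. "alice"
    host: str    # host name or IPv4 address
    port: int    # explicit port, or the assumed default for the scheme


def parse_sip_uri(uri: str) -> SipUri:
    """Split a simple SIP or SIPS URI of the form scheme:user@host[:port]."""
    scheme, sep, rest = uri.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in ("sip", "sips"):
        raise ValueError(f"not a SIP URI: {uri!r}")
    user, at, hostport = rest.partition("@")
    if not at or not user or not hostport:
        raise ValueError(f"missing user or host: {uri!r}")
    host, colon, port_text = hostport.partition(":")
    # Assumed defaults: 5060 for sip, 5061 for sips.
    default_port = 5060 if scheme == "sip" else 5061
    port = int(port_text) if colon else default_port
    return SipUri(scheme, user, host, port)


if __name__ == "__main__":
    print(parse_sip_uri("sip:alice@example.com"))
```

The user part and host name are what a registrar or proxy needs in order to locate the resource; everything else in a real URI refines how the request is delivered.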

2.1.2 SIP Protocols

The SIP protocols are designed to establish an end-to-end voice call. Generally speaking, there are two types of protocol from the perspective of VoIP [3]:

• Signaling Protocols: These protocols perform the functions related to setting up and maintaining a call. These tasks include:

1. Callee Location: This involves finding out the current location (in the network) of a callee when being called. The signaling protocol provides a mechanism to communicate with the authority to locate the user.

2. Availability Determination: The protocol must decide whether the called party is available and where to redirect the call if not available. For example, all calls can be set to redirect to voicemail if the called party is busy. The signaling protocol should be able to configure the desired action.

3. Session Parameter Negotiation: The caller and callee must agree upon the parameters used in order to communicate. These parameters include the type of media, the codec to be used, etc. The signaling protocol will help both parties to agree upon the parameters.

4. Session Modification: While a call is ongoing, the parties may decide to change the content of the transfer media. For example, the codec might change to improve the quality if the end-to-end path has better bandwidth. Such an intent has to be signaled to the other party.


5. Session Termination: When the parties involved in the communication decide to end the call, they need to signal each other, and possibly other entities, that the call has ended.

• Media Transport Protocols: These protocols are used for the actual media transport in a call. The voice packets are transmitted between the parties using these protocols. Thus, encoding/decoding, packetization, and transport of the VoIP packets are done by these protocols.

2.1.3 SIP Operation

SIP is a request-response protocol in which each message is either a request or a response to a request. Its commands are clear-text and human-readable, and they are similar to HTTP, with many of the response codes being reused (e.g., 200 OK, 404 Not Found). A summary of basic SIP methods and response code prefixes is given in Table 2.1. Figure 2.1 shows a simple example of SIP session establishment.


Table 2.1: Basic request methods and response codes used in SIP [3]

Basic method   Purpose
REGISTER       Update the registrar's database with the user's location
INVITE         Call initiation
ACK            Acknowledgement for INVITE
CANCEL         Stop the execution of another command (such as INVITE)
OPTIONS        Find out a server's capability
BYE            Terminate a call

Response code prefix and purpose, with examples
1xx  Provisional: indicates that the request is being processed
       100 Trying (attempting to complete the request)
       180 Ringing (ringing signal has been sent)
2xx  Success: the request was successfully understood and accepted
       200 OK (request successfully accepted)
3xx  Redirection: further action is required to complete the request
       302 Moved Temporarily (temporarily redirected)
4xx  Client Error: bad syntax in the request, or the request cannot be completed at this server
       404 Not Found (user not found)
       487 Request Terminated (request was terminated)
5xx  Server Error: the server failed to execute a seemingly valid request
       504 Server Time-out (server timed out)
6xx  Global Failure: the request cannot be completed at any server
       600 Busy Everywhere (all lines busy)


Figure 2.1: A simple session establishment example

Figure 2.1 shows the SIP message exchange between two SIP-enabled devices. These two devices could be SIP phones, hand-held devices, palmtops, or cell phones. It is assumed that both devices are connected to an IP network and know each other's IP address. SIP establishes a call with a three-way handshake. The caller, John, sends an INVITE. The callee, Mary, sends a 200 OK to accept the call. Finally, the caller sends an ACK to indicate that the handshake is done and the call is established.
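The handshake can be sketched as a small state machine seen from the caller's side. This is a simplified illustration and not a SIP stack: the state names, the event strings, and the function run_handshake are our own assumptions, and error responses, retransmissions, and timers are left out.

```python
# Allowed transitions for the caller's view of one call setup (simplified).
# State and event names are illustrative, not taken from any SIP library.
TRANSITIONS = {
    ("idle", "send INVITE"): "calling",
    ("calling", "recv 180 Ringing"): "ringing",
    ("calling", "recv 200 OK"): "answered",   # callee may answer without ringing
    ("ringing", "recv 200 OK"): "answered",
    ("answered", "send ACK"): "established",
}


def run_handshake(events):
    """Replay a list of events and return the final call state."""
    state = "idle"
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected event {event!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


if __name__ == "__main__":
    flow = ["send INVITE", "recv 180 Ringing", "recv 200 OK", "send ACK"]
    print(run_handshake(flow))
```

Replaying the flow INVITE, 180 Ringing, 200 OK, ACK ends in the established state, while an ACK sent before any 200 OK is rejected as an unexpected event.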

The INVITE contains the details of the type of session or call that is requested. It could be a simple voice session or a multimedia session such as a video conference. Since SIP is a text-encoded protocol, the information shown in Fig. 2.1 is what the SIP messages actually look like on the wire when they are carried in UDP datagrams over the Internet.

INVITE is an example of a SIP request message. Five other methods, or types of SIP requests, are currently defined in the SIP specification, RFC 3261, and others are defined in extension RFCs. The next message in Fig. 2.1 is a 180 Ringing message sent in response to the INVITE. This message indicates that the called party, Mary, has received the INVITE and that alerting is taking place. The alerting could be ringing a phone, flashing a message on a screen, or any other method of attracting the attention of the called party.

The 180 Ringing is an example of a SIP response message. Responses are numerical and are classified by the first digit of the number. A 180 response is an "informational class" response, identified by the first digit being a 1. Informational responses are used to convey noncritical information about the progress of the call. Many SIP response codes are based on HTTP version 1.1 response codes, with some extensions and additions. For example, just as a web server answers with "404 Not Found", 404 Not Found is also a valid SIP "client error class" response to a request for an unknown user.
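Classification by the first digit can be written in a few lines of Python. The sketch below uses the class names of Table 2.1 (which lists 1xx as Provisional); the function name response_class is hypothetical.

```python
# Response classes keyed by the first digit of the status code (see Table 2.1).
RESPONSE_CLASSES = {
    1: "Provisional",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
    6: "Global Failure",
}


def response_class(status_code: int) -> str:
    """Return the class of a SIP response from its three-digit status code."""
    if not 100 <= status_code <= 699:
        raise ValueError(f"not a valid SIP status code: {status_code}")
    return RESPONSE_CLASSES[status_code // 100]


if __name__ == "__main__":
    for code in (100, 180, 200, 302, 404, 487, 504, 600):
        print(code, response_class(code))
```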

When the called party, Mary, decides to accept the call (i.e., the phone is answered), a 200 OK response is sent. This response also indicates that the type of media session proposed by the caller is acceptable. The 200 OK is an example of a "success class" response.

2.1.4 SIP components

The SIP protocol has several components. The key components of a SIP-based system are as follows.

• User Agent (UA): The user agent is an entity which forms an endpoint of a SIP session. In a client role, it initiates a SIP session, and in a server role, it listens to client requests. A UA can be implemented in hardware or software.

• SIP Registrar: The SIP Registrar is the entity with which a user registers its identity and location. The registrar maintains a location server to keep track of the current position of the user. This enables SIP to be used in presence-based applications where the current location and status of a user determine the actions to be taken with an incoming call.

• SIP Proxy Server: The SIP proxy server relays the incoming/outgoing signaling messages to the appropriate entities. On receiving a call request, it contacts the registrar to locate the target UA, forwards the request to the UA, and relays the response from the UA to the contacting client.
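The cooperation between the registrar and the proxy server can be sketched as follows. The classes Registrar and Proxy, their methods, and the example addresses are hypothetical and model only the lookup step: a real proxy also rewrites headers, handles transactions, and relays the responses.

```python
class Registrar:
    """Keeps track of where each registered user can currently be reached."""

    def __init__(self):
        self._locations = {}  # address-of-record -> current contact address

    def register(self, aor, contact):
        self._locations[aor] = contact

    def lookup(self, aor):
        return self._locations.get(aor)


class Proxy:
    """Decides where to forward an INVITE, using the registrar's bindings."""

    def __init__(self, registrar):
        self.registrar = registrar

    def route_invite(self, target_aor):
        contact = self.registrar.lookup(target_aor)
        if contact is None:
            # Unknown user: answer the caller with 404 Not Found.
            return ("respond", 404)
        # Known user: forward the request to the registered contact.
        return ("forward", contact)


if __name__ == "__main__":
    registrar = Registrar()
    registrar.register("sip:mary@example.com", "192.0.2.10:5060")
    proxy = Proxy(registrar)
    print(proxy.route_invite("sip:mary@example.com"))
    print(proxy.route_invite("sip:nobody@example.com"))
```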

2.1.5 Session Description Protocol (SDP)

SDP is used to describe the parameters of a session. When a session is initiated between two devices, the two parties must agree upon a common framework for information transfer. For example, if both parties can encode and decode a voice call with any of several codecs, they need to agree upon which codec to use.

SDP provides a simple way of negotiating such parameters. SDP forms part of the SIP INVITE message containing information that allows the participants to decide upon the format of communication. SDP has been designed for general session description and is not limited to a voice call. For example, it can be used to describe a conference. Among other fields, SDP includes information about the media that comprises the session (in form of the type of media, the transport protocol used, and the media format), and the information to receive the media (such as the IP address and port number). Using this information the parties can agree upon the baseline for their media transfer [3]. More details about SDP are described in RFC 2327 [4].
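The codec agreement described above can be illustrated with a minimal sketch. It assumes that each party lists the codecs it supports and that the caller's list is in order of preference; the function negotiate_codec is hypothetical and covers only this one decision, not the full SDP offer/answer exchange.

```python
def negotiate_codec(offered, supported):
    """Return the first codec in the caller's preference order that the
    callee also supports, or None if the two lists have nothing in common."""
    supported_set = set(supported)
    for codec in offered:
        if codec in supported_set:
            return codec
    return None


if __name__ == "__main__":
    caller_offer = ["G729", "PCMU", "PCMA"]  # caller's preference order
    callee_supports = ["PCMA", "PCMU"]
    print(negotiate_codec(caller_offer, callee_supports))
```

If no codec is shared, the function returns None, which corresponds to a session that cannot be set up with the proposed media description.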


Figure 2.2: The communication process of VoIP

2.2 Operations of VoIP

In VoIP, when end user A talks to end user B, user A produces an analog voice signal. This signal is converted to a digital signal by an analog-to-digital (A/D) converter. A coder-decoder (codec) then processes the digital signal according to the agreed protocol and compression format. The result is a series of IP packets suitable for transmission over the Internet.

On the receiving side, the received IP packets are fed to a codec, which decompresses and decodes them to recover the digital signal. A D/A converter then converts the digital signal into an analog voice signal, which user B hears through the earphone of the IP phone. The communication process is shown in Fig. 2.2.

The A/D converter shown in Fig. 2.2 takes an analog audio signal and converts it into a digital bitstream. A codec compresses and encodes the digital bitstream into a smaller size and into a form suitable for transmission over the IP network. Without a codec, no audio can be transmitted through the IP network. The choice of codec depends on the voice quality we can accept, the available bandwidth, and the available processing capacity. The use of a codec is always a compromise between these factors.
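The bandwidth side of this compromise can be made concrete with a short calculation. The sketch below assumes a constant-bitrate codec, one RTP packet per packetization interval, and 40 bytes of headers per packet (20 bytes IPv4, 8 bytes UDP, 12 bytes RTP); link-layer overhead is ignored. The function name and the 20 ms interval are illustrative assumptions, while 64 kbit/s and 8 kbit/s are the nominal rates of G.711 and G.729.

```python
def voip_bandwidth_bps(codec_bitrate_bps, packet_ms, header_bytes=40):
    """One-direction IP-layer bandwidth of a constant-bitrate voice stream.

    header_bytes defaults to IPv4 (20) + UDP (8) + RTP (12) = 40 bytes.
    """
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_bitrate_bps / 8 * packet_ms / 1000
    return (payload_bytes + header_bytes) * 8 * packets_per_second


if __name__ == "__main__":
    # G.711 at 64 kbit/s: 160 payload bytes every 20 ms.
    print(voip_bandwidth_bps(64_000, 20))
    # G.729 at 8 kbit/s: 20 payload bytes every 20 ms.
    print(voip_bandwidth_bps(8_000, 20))
```

Under these assumptions G.711 needs 80 kbit/s per direction and G.729 needs 24 kbit/s, so the lower-rate codec saves bandwidth at the cost of more processing and lower voice quality.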

2.3 Call Control by utilizing Multi-Conference in Many Ways

The third party call control (3PCC) of SIP defines a mechanism that allows a third party to set up and manage communications between two or more other parties. It is usually used in operator services and conferencing. Our system requires an operator intervention mechanism, and 3PCC is the control method we use. The structure of the multi-conference in many ways is shown in Fig. 2.3. The operator takes charge of the media of both sides of the conversation, which makes 3PCC a flexible control method.

Figure 2.3: VoIP multi-conference in many ways

2.3.1 The Model of SIP Redirect Call

The role of the Redirect Server (RS) is to direct the call to the location asked for by the caller, or to redirect it to the new address if the callee has moved. The function of the RS is simple, and it performs only a few basic tasks. It receives SIP REGISTER messages from User Agents (UAs), keeps track of registered users and their locations, and provides routing information for SIP INVITE messages [5].

After the Location Server has found the callee's address, the Redirect Server responds to the caller with the routing information so that the caller can contact the callee and establish the conversation. Once the conversation is set up, an RTP media stream is established between the caller and the callee so that they can talk to each other. The conversation structure of the User Agents (UAs) registered with the Registrar server in this research is shown in Fig. 2.4.


2.4 Private Branch Exchange (PBX)

Enterprises have been using PBXs for many years to fulfill their communication needs. Basically, a PBX is just another telephone exchange that serves a particular enterprise. We first give an overview of why PBXs are important to enterprises, along with the features they support. We then take a look at the IP-PBXs currently available.

2.4.1 Traditional PBX

Unisys, a call center system provider, defined the call center in 1999 according to its actual operation [6]. Figure 2.5 shows the components of a call center. The main components of a SIP call center are:

• PBX: Private Branch Exchange

• ACD: Automatic Call Distribution

• CTI: Computer Telephony Integration Server

• IVR: Interactive Voice Response

• FOD: Fax on Demand Server

• CDR: Customer call Detailed Recording


Figure 2.5: Call center structure [6]

Because the purpose of this study is to detect emotion in the VoIP voice stream flowing through the IP-PBX, we are not concerned with the other components of a call center, and our study concentrates on the PBX. A traditional PBX provides several invaluable features that enable efficient communication within the organization as well as significant savings in telecommunication cost. We discuss the basic functions and features provided by a PBX in detail. Along with the basic tasks of routing internal and external calls and multiplexing the external lines based on demand from internal devices, the PBX provides several features that are of interest to enterprises. The most important features are as follows:

Call Forwarding: This feature is used to route calls to another device. The call forwarding logic specifies which device to forward a call to when certain events happen. For example, when the original extension is not answering the call, the call can be forwarded to the voice-mail system.

Call Transfer: An ongoing call can be transferred between extensions connected to the PBX. This happens, for example, when the first-line representative cannot handle the customer's requests and needs the manager to take over the call.

Conference Calls: The PBX can connect more than two parties into a group communication session called a conference call.

Automatic Call Distribution (ACD): The ACD feature routes calls to appropriate agents, for example based on the specialization of the agent or the priority of the caller. ACD may be used in conjunction with an IVR system, which automatically guides users through menus on the telephone by asking them to choose among a set of options by pressing keys on the keypad or speaking certain phrases.

Voice Messaging: This involves recording voice messages for users and then playing them back. It can be used in conjunction with the Message Waiting Indicator, by which a user knows that he or she needs to access the voice message box where one or more messages are stored.

Call Queue: If a call arrives when all the extensions that can answer it are busy, then rather than dropping the call and making the user call again, the PBX can place the call in a queue, from which it is picked up when someone becomes available. This feature can be used in conjunction with the Music on Hold feature, so that a caller placed in the queue hears music or other information while waiting.

Least Cost Routing (LCR): The calls to the cellular devices can be routed via the PSTN network but will incur additional cost. Alternatively, the call can be routed via a cellular gateway directly making the call cheaper [3].
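The interaction between the Call Queue and call distribution features can be sketched with a first-come, first-served queue. The class CallQueue, its method names, and the extension numbers are hypothetical; a real PBX adds skills, priorities, timeouts, and Music on Hold.

```python
from collections import deque


class CallQueue:
    """First-come, first-served queue of callers waiting for a free agent."""

    def __init__(self, agents):
        self.idle_agents = deque(agents)  # agents free to take a call
        self.waiting = deque()            # callers on hold, oldest first
        self.active = {}                  # agent -> caller currently served

    def incoming(self, caller):
        """Connect the caller to an idle agent, or put the caller on hold."""
        if self.idle_agents:
            agent = self.idle_agents.popleft()
            self.active[agent] = caller
            return agent
        self.waiting.append(caller)
        return None

    def hangup(self, agent):
        """End the agent's call and hand over the oldest waiting caller."""
        del self.active[agent]
        if self.waiting:
            caller = self.waiting.popleft()
            self.active[agent] = caller
            return caller
        self.idle_agents.append(agent)
        return None


if __name__ == "__main__":
    queue = CallQueue(["201", "202"])
    print(queue.incoming("caller A"))  # served by the first idle agent
    print(queue.incoming("caller B"))
    print(queue.incoming("caller C"))  # no idle agent, so the caller waits
    print(queue.hangup("201"))         # the freed agent takes the waiting caller
```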

2.4.2 Introduction to IP-PBX

A PBX that can support VoIP calls is called an IP-PBX. From an enterprise's perspective, VoIP offers an excellent communication capability at a fraction of the infrastructure cost of traditional telephony. However, the enterprise's external customers may very well be connected to the PSTN through analog lines. Thus, while the enterprise wants an IP-PBX to handle its internal VoIP network, it also needs the capability to communicate with the PSTN. This means that, to be of practical use to enterprises, IP-PBX devices must be able to communicate with the PSTN [8].

Like a normal PBX, an IP-PBX can be implemented in hardware or software. In the earlier days, when computers were not as powerful, specialized hardware was an integral part of the core of PBX devices. Faster CPUs have enabled software to take over more and more tasks from the hardware in most of the system. We describe the functions of an IP-PBX using the most popular software-based IP-PBX, Asterisk.

The required functionalities are separated into two sets. The first set is called the PBX core, which implements the core functionality that determines the internal interconnections of a PBX independently of the specific protocols, codecs, etc. that are used to communicate. The PBX core functionality consists of:

• PBX Switching: This connects calls between various users and automated tasks. The switching is done without knowing the hardware/software interfaces used by the different parties involved in the call.

• Application Launcher: This launches applications which perform various services such as voicemail, music on hold, and directory listing. Users can write their own applications similar to CGI scripts.

• Codec Translator: The codec translator functionality is used to convert the encoded voice packets from one format to another. Each codec module has specified translation capabilities. This facility is exported in terms of encoding and decoding its communication format. The task of translation involves decoding from one format and encoding into another.


under different scenarios. This component is responsible for optimizing the performance of the system under different operation conditions.

To handle the specific hardware/software that endpoints use and to provide the ability to read and write different formats of data, the IP-PBX generally uses loadable modules. These loadable modules have the following APIs:

• Channel API: The channel API handles the type of connection on which a user connects. This could be a hardware channel, such as a connected device, or a software channel, such as SIP. Each channel abstraction implements the standard specification for that type of channel.

• Application API: The application API allows various task modules to perform functions such as voicemail, call transfer, etc. Therefore, if the processing for a particular call requires a particular application to be executed, this API is used to execute the desired application.

• Codec Translator API: This API is used to load the codec modules that support different types of audio encoding and decoding. The codec translator uses the encoding/decoding functions associated with each codec module to translate between codecs.

• File Format API: This handles the reading and writing of various file formats for storage in the file system. For example, the voice clips can be used to create a menu in an IVR.

Using these APIs, the IP-PBX separates the core functions of a PBX server system from the different technologies used in telephony. This modular structure allows it to integrate both traditional Public Switched Telephone Network (PSTN) technology and VoIP technologies. If a new protocol is to be added, the only requirement is to create a new channel for the protocol using the channel API. Similarly, in order to create user-specific


2.5 The Case of Call Centers Using “Emotion Detection”

Over the past several years, technology designed for eavesdropping has found a new home in corporate call centers. Call centers record and analyze millions of phone conversations between customer service agents and consumers in an effort to better digest and organize what customers are saying.

FedEx, a transportation service company, uses speech emotion software to raise employee morale. The software looks for the word "wow" in the conversation between a customer and the Call Center representative in order to detect gratification from a caller. The conversation is then recorded and shared with everyone in the Call Center. Wisconsin Physicians Service searched conversations between customers and Call Center representatives for the word "Medicare" combined with "confused". The system then flagged and recorded those conversations, and the sales staff called the customers to resolve their confusion or provide service. Both companies record and transcribe conversations and categorize them by words and phrases. The speech emotion systems use algorithms that measure a baseline of emotion in the first few seconds of a phone call. If the customer's voice deviates from that baseline, a supervisor is alerted [7]. These examples show that companies today need speech emotion recognition in conversations. They also suggest that future artificial-intelligence speech emotion recognition systems may analyze voice tone, pitch, and tempo to differentiate genuine anger from mimicked anger.
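The baseline idea can be illustrated with a simple sketch. This is not the algorithm of any commercial product. It assumes one pitch value per frame, treats the first few frames as the baseline, and raises an alert when a later frame lies more than a chosen number of standard deviations from the baseline mean. The function name, the threshold, and the pitch values are all made up for illustration.

```python
from statistics import mean, stdev


def first_deviation(values, baseline_count, threshold=3.0):
    """Return the index of the first frame after the baseline window whose
    value differs from the baseline mean by more than `threshold` baseline
    standard deviations, or None if no frame deviates that much."""
    baseline = values[:baseline_count]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline frames")
    mu = mean(baseline)
    sigma = stdev(baseline)
    for index in range(baseline_count, len(values)):
        deviation = abs(values[index] - mu)
        # With a perfectly flat baseline (sigma == 0) any change counts.
        if deviation > 0 and deviation > threshold * sigma:
            return index
    return None


if __name__ == "__main__":
    # Illustrative pitch values in Hz; the first five frames form the baseline.
    pitch = [200, 205, 195, 202, 198, 201, 199, 260, 270]
    print(first_deviation(pitch, baseline_count=5))
```

In a call center, the returned index would mark the moment at which a supervisor is alerted.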

This thesis uses a speech emotion recognition engine to provide nonverbal detection. We develop a mechanism that detects the customer's anger during the conversation, rather than a method that analyzes keywords.

2.6 Speech Emotion Recognition


everything that is done, said, thought or imagined. If computers can perceive and respond to human emotion, human-computer interaction will be more natural. Much research focuses on automatically assigning an emotion category, such as anger, happiness, or sadness, to a speech utterance [8].

To differentiate it from other IP-PBX systems that examine verbal information, our IP-PBX system integrates the speech emotion recognition system developed by the research laboratory in the Department of Computer Science and Engineering, Tatung University. Our IP-PBX system can thus redirect a call according to the emotion of the customer, in contrast to other IP-PBX systems that classify emotion from the content of the conversation, which is not very robust.

Because anger is the most common cause of bad communication between the Call Center and the customer, our IP-PBX system uses speech emotion recognition to detect anger. The recognition system uses weighted D-KNN with Fibonacci weighting [9].

2.6.1 The Method of Conveying Emotion

Emotions can be considered as communications to oneself and others. The vocal cue is one of the fundamental expressions of emotions. Tone is the variation of pitch within a syllable. Chinese uses tones to distinguish words. Each word and phrase must be spoken at the right tone or the meaning may change and probably will be misunderstood. Previous studies of emotional contents in speech used prosodic information such as the pitch, duration and intensity of the utterance. To date, most works have concentrated on the analysis of the human vocal emotions [9].

More recently, the increasing awareness that affective computing has important industrial potential has pushed research towards better performance in the automatic recognition of emotions in speech. In our research, detecting the emotional state of a person gives the IVR of the call center the ability to recognize the affective state of the person we are talking to. We describe speech emotion recognition in the following subsection.

2.6.2 Emotion Recognition from Mandarin Speech Signals

Speech signals convey not only words and meanings, but also emotions. Nowadays, the vocal expression of emotion is gaining importance in the computer speech community. Emotion recognition technology is very desirable because it adds to the appeal of electronic systems by contributing to the user’s perception of the system’s intelligence and adaptability [10].

Figure 2.6 shows the block diagram of the emotion recognition system. To perform emotion recognition, the system needs an emotional speech corpus. Five emotions are investigated: anger, happiness, sadness, boredom, and neutral.

Figure 2.6: Block diagram of emotion recognition system [9]

Table 2.2 shows the human performance confusion matrix. The rows and columns represent portrayed and evaluated categories, respectively. For example, the first row shows that 89.6% of the utterances that were portrayed as angry were evaluated as angry, 4.3% as happy, 0.9% as sad, 0.8% as bored, 3.3% as neutral, and 1.0% as none of the above. As can be seen from the table, the average accuracy is 80.9%; the most easily recognized category is anger and the poorest is happiness. These results reveal that even human listeners sometimes cannot recognize emotional states from speech alone, and sometimes confuse anger with happiness and boredom with sadness.

Table 2.2: Human performance confusion matrix (%) [9]

Feature set selection is one of the most critical parts of any recognition system. Various features relating to pitch, energy, formants, tunes, spectral intensity, etc., have been studied. The speech emotion recognition engine used here calculates several features. In particular, the engine uses a weighted discrete k-nearest neighbor (weighted D-KNN) classifier to improve on the performance of the modified k-nearest neighbor (M-KNN) classifier. The purpose of weighting is to find a vector of real-valued weights that optimizes the classification accuracy of the recognition system. Table 2.3 shows the results of the weighted D-KNN with different weighting series. The best accuracy obtained by the engine is 81.4%. The results reveal that the recognition rates of all categories are above 70%. In addition, the engine can distinguish angry from bored perfectly. As with all other classifiers, the most easily recognized category is anger and the most difficult one is happiness [9].
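The weighting idea can be illustrated with a small Python sketch. It is only one plausible reading of a weighted D-KNN with a Fibonacci weighting series, not the engine's actual implementation: we assume that, for each emotion class, the k nearest training samples of that class are found, their distances are combined with Fibonacci weights (largest weight on the nearest sample), and the class with the smallest weighted distance sum is chosen. The function names, the Euclidean distance, and the toy feature vectors are all hypothetical; the exact formulation in [9] may differ.

```python
import math
from collections import defaultdict


def fibonacci_weights(k):
    """Return k Fibonacci weights in descending order, e.g. k=5 -> [5, 3, 2, 1, 1]."""
    weights = [1, 1]
    while len(weights) < k:
        weights.append(weights[-1] + weights[-2])
    return sorted(weights[:k], reverse=True)


def weighted_dknn(train, query, k=3):
    """Classify `query` from `train`, a list of (feature_vector, label) pairs.

    Assumed rule (hypothetical): per class, take the k smallest distances,
    weight them with the Fibonacci series, and pick the class whose
    weighted distance sum is smallest.
    """
    distances_by_label = defaultdict(list)
    for features, label in train:
        distances_by_label[label].append(math.dist(features, query))

    weights = fibonacci_weights(k)
    scores = {}
    for label, distances in distances_by_label.items():
        if len(distances) < k:
            raise ValueError(f"class {label!r} has fewer than {k} samples")
        nearest = sorted(distances)[:k]
        scores[label] = sum(w * d for w, d in zip(weights, nearest))

    best_label = min(scores, key=scores.get)
    return best_label, scores
```

Under these assumptions, the per-class scores returned alongside the label could also serve as raw material for an evaluation step, since they express how close the utterance lies to each emotion category.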

Figure 2.7: Block diagram of emotion evaluation system [9]

Emotion evaluation assesses the emotional expression of a speaker. In the speech emotion recognition engine used, the evaluation method is based on weighted D-KNN classification. Figure 2.7 shows the block diagram of the emotion evaluation system.

After applying the weighted D-KNN, we obtain five evaluation values, one for each of the five emotion categories. These values are then plotted in the emotion radar chart, in which each axis stands for one emotion category [9], as shown in Fig. 2.8.

Figure 2.8: Emotion radar chart of test data with angry emotion [9]

An emotion radar chart is built to show the recognized emotion of the input utterance. This chart shows the relative strength of an utterance in each of the emotional categories. An example emotion radar chart is shown in Fig. 2.8.
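As a rough illustration of how five evaluation values map onto a radar chart, the following Python sketch places one axis per emotion category at equal angles and scales each value relative to the largest one. The normalization by the maximum, the axis order, and the function name are assumptions made for this sketch; the source does not specify how the engine scales its evaluation values.

```python
import math

# Axis order is an assumption; the five categories come from the text.
EMOTIONS = ["anger", "happiness", "sadness", "boredom", "neutral"]


def radar_vertices(values):
    """Map one evaluation value per emotion to (x, y) vertices of the radar polygon.

    Values are scaled so that the strongest category has radius 1.
    """
    peak = max(values[emotion] for emotion in EMOTIONS)
    if peak <= 0:
        raise ValueError("at least one evaluation value must be positive")

    vertices = {}
    for index, emotion in enumerate(EMOTIONS):
        # First axis points straight up; the others follow clockwise.
        angle = math.pi / 2 - 2 * math.pi * index / len(EMOTIONS)
        radius = values[emotion] / peak
        vertices[emotion] = (radius * math.cos(angle), radius * math.sin(angle))
    return vertices
```

For an utterance dominated by anger, the polygon produced this way bulges toward the anger axis, which matches the relative-strength reading of the chart described above.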

Figure 2.9 shows the human-computer interface of our emotion recognition system (ERS). The source of the test speech is chosen in the Source block; test speech can be loaded from disk or recorded in real time. The Result block shows the recognized emotion of the test speech. Similarly, the emotion radar chart can be shown to the CSM of the call center, so that he or she knows the emotion of the customer before talking to the customer on the line transferred by the IP-PBX.

CHAPTER 3 CONSTRUCTION OF IVR WITH AUTOMATIC SPEECH EMOTION RECOGNITION

One way to detect emotion in a telephone conversation is to pick out keywords that identify the emotion of the speaker. However, this technique has its limitations. The speech emotion recognition engine can detect the emotion of the speakers while the conversation between the customer and the representative is in progress. It can therefore improve customer satisfaction through a fast response from the manager when the service representative cannot pacify the customer. Our research can thus offer enterprises some suggestions for building Call Centers in the future.

3.1 Model of the Contact Interface

In order to build the model of the contact interface between the speech emotion recognition engine and the IP-PBX server in a VoIP network, we construct a mechanism that simulates how to resolve a conversation in which emotions between the customer and the call center operator are out of control.

3.1.1 Setup of a SIP-PBX Server

Asterisk is an open source SIP-PBX server that serves as a programmable platform. It supports several PBX features, including voice mail, conferencing, and call transfer [11]. After installing Asterisk on a Linux machine, we must register all the extensions. We write the extension information in /etc/asterisk/sip.conf in order to emulate the conversation between the inbound customer and the Call Center representative, as shown in Fig. 3.1.

Figure 3.1: IP-PBX extension configuration
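Since the figure itself is not reproduced here, the following Python sketch renders the kind of sip.conf entries that such a setup typically needs. Extensions 200, 201 and 202 are the ones used in the experiment; the `type`, `host`, `secret` and `context` options are standard chan_sip settings, but the secret values and the context name `callcenter` are hypothetical placeholders, and the actual configuration in Fig. 3.1 may contain different options.

```python
# Roles of the three extensions used in the emulation (from the text).
ROLES = {"200": "inbound customer", "201": "CSR", "202": "CSM"}


def sip_peer(extension, secret, context="callcenter"):
    """Render one sip.conf section for a softphone that registers dynamically."""
    lines = [
        f"[{extension}]",
        "type=friend",        # the peer may both place and receive calls
        "host=dynamic",       # the softphone registers its own IP address
        f"secret={secret}",   # placeholder password (hypothetical)
        f"context={context}", # dialplan context name (hypothetical)
    ]
    return "\n".join(lines)


def render_sip_conf(secrets):
    """Render sections for all extensions; `secrets` maps extension -> password."""
    sections = []
    for extension, role in ROLES.items():
        # sip.conf comments start with a semicolon.
        sections.append(f"; {role}\n" + sip_peer(extension, secrets[extension]))
    return "\n\n".join(sections) + "\n"
```

Generating the sections from one table keeps the three entries consistent; the rendered text would be placed in /etc/asterisk/sip.conf by hand.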

3.1.2 Consideration of the Speech Emotion Recognition Engine

In Fig. 3.2, the speech emotion recognition engine is designed to detect an angry event by analyzing SIP packets while the conversation between the customer and the Call Center representative is in progress. The ongoing call can be transferred to a different extension by the Redirect Server. This happens when the Customer Service Representative (CSR) cannot pacify the customer’s emotion.

In this thesis, we do not consider the technology for analyzing SIP packets to detect an angry event. The purpose of this thesis is to study the combination of speech emotion recognition and the VoIP Call Center system and to emulate such a situation in an IP-PBX.

3.1.3 Construction of the IVR System Mechanism

The IVR system of the Call Center with automatic speech emotion recognition is constructed by combining a SIP-PBX server and a speech emotion recognizer. The interface between the speech emotion recognition engine and the IP-PBX server in a VoIP network is shown in Fig. 3.2.

Figure 3.2: Interface between the speech emotion recognition engine and the IP-PBX server in a VoIP network

3.2 Emulation of the VoIP-IVR System

In order to emulate the IVR mechanism in a VoIP system with automatic speech emotion recognition, we build the system architecture shown in Fig. 3.3. The system can help a company avoid bad conversations between a customer and a service representative whose emotion is out of control.

In our experiment, we install a SIP-PBX server based on the open source platform, Asterisk®. The client VoIP phone is X-Lite.

CHAPTER 4 EXPERIMENTAL RESULTS

4.1 Emulation of Dialing Extensions of IP-PBX Server

In order to emulate the interaction between the customer and the CSR, we use three notebooks to emulate three-party call control, as shown in Figures 4.1 to 4.3.

First, we configure CounterPath’s X-Lite softphone to connect to our Asterisk IP-PBX server, whose IP address is 59.120.214.121, as shown in Fig. 4.1. The IP address of the inbound customer is 59.120.214.123.

Figure 4.1: The inbound customer with extension 200 uses the X-Lite software to connect to the Asterisk IP-PBX server

Figure 4.2: The customer uses the X-Lite software to dial extension 201, the CSR

The customer with extension 200 uses the X-Lite IP phone to dial extension 201, the extension of the CSR, as shown in Fig. 4.2. At this time, we assume that the SIP packet stream of the call is fed to a speech emotion recognition engine. When an angry emotion is detected, the engine issues a redirect call request to the Redirect Server (RS). The RS then creates a three-way conference that brings the CSM into the conversation. Once the CSM understands what has happened, he or she can ask the CSR to disconnect and take over the conversation from there. Or, if the conversation returns to normal, he or she can silently disconnect from the conference.
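The call flow described above can be summarized as a small state model. The Python sketch below is an illustration only: the class and method names are hypothetical, the emotion labels are assumed to arrive one per analyzed utterance, and no real Asterisk or Redirect Server interface is used. It merely tracks which extensions take part in the call as the described events occur.

```python
class ConferenceCall:
    """Toy model of the three-party call control described in the text."""

    def __init__(self, customer="200", csr="201", csm="202"):
        self.customer, self.csr, self.csm = customer, csr, csm
        # The call starts as a two-party conversation: customer and CSR.
        self.participants = {customer, csr}

    def on_emotion(self, label):
        """Feed one recognition result; return True if the CSM was just added."""
        if label == "angry" and self.csm not in self.participants:
            # Stands in for the redirect request sent to the RS,
            # which then builds the three-way conference.
            self.participants.add(self.csm)
            return True
        return False

    def csm_takes_over(self):
        """The CSM asks the CSR to disconnect and continues with the customer."""
        if self.csm not in self.participants:
            raise RuntimeError("the CSM has not joined the conference")
        self.participants.discard(self.csr)

    def csm_leaves(self):
        """The conversation is back to normal, so the CSM drops out silently."""
        self.participants.discard(self.csm)
```

A real deployment would probably require several consecutive angry results before escalating, to avoid reacting to a single misclassified utterance; that policy is outside what the source describes and is left out of the sketch.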

Figure 4.3: The X-Lite software used by CSM

Because the customer’s emotion is out of control, the X-Lite software used by the CSM is connected to the conference by the Redirect Server, as shown in Fig. 4.3. An emulated conversation is listed in the Appendix.

4.2 SIP Registration and Phone Call Setup Analysis

In order to analyze the call connection status, we install and use the free packet analyzer Wireshark. In our experiment, the IP address of the SIP server is 59.120.214.121, and the standard port for SIP traffic is UDP 5060. We want to capture traffic in both directions, that is, to the SIP server and to the softphone running on the same host as Wireshark [11]. Based on this information, we enter the string “host 59.120.214.121 and udp port 5060” into the capture filter field and then start the packet capture, as shown in Figs. 4.4 and 4.5.

Figure 4.4: Wireshark capture interfaces

When the incoming customer uses X-Lite to dial extension 201, the CSR, we see that the call is established from sip:[email protected] to sip:[email protected].

When the customer’s emotion is out of control, the X-Lite used by the CSM is brought into the conversation by the Redirect Server. We can see that a call is established from sip:[email protected] to sip:[email protected] as well, as shown in Fig. 4.6.

Figure 4.7: A three-way handshake of SIP between caller and callee

Figure 4.7 illustrates the connection establishment flow. The first area of the graph is the conversation between the inbound customer and the CSR of the Call Center, and the second area is the conversation between the inbound customer and the CSM. From the right portion of Fig. 4.7, we can observe that extension 202 has joined. This shows that our automatic forwarding mechanism of IVR in a VoIP network is feasible.
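The three-way handshake in a SIP call setup consists of an INVITE request, a 200 OK final response, and an ACK request, as defined in RFC 3261 [2]. A minimal Python sketch of how one might check a list of captured SIP start lines for this sequence is given below. The function name and the sample lines are hypothetical, the host `example.invalid` is a placeholder, and the check deliberately ignores Call-ID and CSeq matching, which a real analysis would need.

```python
def completes_invite_handshake(start_lines):
    """Return True if INVITE, a 200 response, and ACK appear in that order.

    Request start lines look like "METHOD uri SIP/2.0";
    response start lines look like "SIP/2.0 code reason".
    """
    expected = ["INVITE", "200", "ACK"]
    stage = 0
    for line in start_lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "SIP/2.0" and len(parts) > 1:
            token = parts[1]   # status code of a response
        else:
            token = parts[0]   # method of a request
        if stage < len(expected) and token == expected[stage]:
            stage += 1
    return stage == len(expected)


# Hypothetical start lines, in the order a capture might show them.
SAMPLE_TRACE = [
    "INVITE sip:201@example.invalid SIP/2.0",
    "SIP/2.0 100 Trying",
    "SIP/2.0 180 Ringing",
    "SIP/2.0 200 OK",
    "ACK sip:201@example.invalid SIP/2.0",
]
```

Provisional responses such as 100 Trying and 180 Ringing are skipped, because only the final 200 response and the ACK complete the handshake.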

4.3 Redirection of the conversation to CSM

The customer was redirected to extension 202, the CSM, because the speech emotion recognition engine detected an angry emotion while the customer was talking to the CSR, extension 201. Our simulation of a customer’s complaint against the operator at extension 201 is therefore similar to the situation in a real Call Center. This thesis focuses on the application of speech emotion recognition in the VoIP Call Center. Our experiment indicates that the emotion state can be detected by the speech emotion recognition engine, and that automatic forwarding of the SIP call in a VoIP PBX system is then feasible.

CHAPTER 5 CONCLUSION AND FUTURE WORK

In this thesis, we emulate an IVR mechanism in a VoIP system with a speech emotion recognition engine. We set up a VoIP PBX system to emulate how a Call Center handles a customer’s complaint. With the proposed mechanism, a Call Center could provide better service to its customers.

How customer complaints are handled is one of the factors that affect an enterprise’s image. Because the treatment of customers influences the enterprise’s overall image, there have been many investigations into improving it. An IVR PBX server based on VoIP with an embedded speech emotion recognition engine can be one of the solutions.

REFERENCES

[1] Van Meggelen, Madsen and Smith, Asterisk – The Future of Telephony, 2nd Edition, O'Reilly USA, 2007.

[2] IETF SIP Working Group, “RFC3261 SIP: Session Initiation Protocol,” USA, 2002.

[3] Samrat Ganguly, Sudeept Bhatnagar, VoIP: Wireless, P2P and New Enterprise Voice over IP, Wiley USA, 2008.

[4] IETF SIP Working Group, “RFC2327 SDP: Session Description Protocol,” April 1998.

[5] Dang, Jennings and Kelly, Practical VoIP, O'Reilly USA, 2002.

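[6] 林琬儒, 電話服務中心之電話品質分析:以中華電信障礙服務為例 [Analysis of Call Quality in a Telephone Service Center: A Case Study of the Chunghwa Telecom Fault Service], Master's thesis, Institute of Information Management, Sun Yat-sen University, June 2000.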

[7] Annys Shin, “Call Centers Using Emotion Detection,” Washington Post Staff Writer, October 18, 2006.

[8] Yu-Te Chen, “Comparison of Classification Methods for Detecting Emotion from Mandarin Speech,” IEICE Trans. Inf. & Syst., Vol. E91-D, April 2008.

[9] Tsang-Long Pao, Yute Chen and Junheng Yeh, “Emotion Recognition and Evaluation from Mandarin Speech Signals,” International Journal of Innovative Computing, Information and Control, Vol. 4, No. 7, July 2008.

[10] Tsang-Long Pao, Yun-Maw Cheng, Yu-Te Chen, and Jun-Heng Yeh, “Performance Valuation Of Different Weighting Schemes On KNN-Based Emotion Recognition In Mandarin Speech,” International Journal of Information Acquisition, Vol. 4, No. 4, September 2007.

APPENDIX

The following interaction is designed to emulate the conversation between the call center of 101 Network Telecommunication Company and the network department executive of East TV Company.

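1. [Conversation between the front-line representative and the customer, the network department executive of East TV Company]

Front-line representative (sweet, gentle tone): Hello! This is the customer service of 101 Network Telecommunication Company, representative No. 201, Anna Wang (王安娜). How may I help you?

East executive (gentle tone): Hello! I am John Lin (林約翰), head of the network department of East TV Company. After one month of testing and evaluating the video conferencing software that our company purchased from yours, we consider the communication quality of your product not stable enough. Under the terms of the contract, besides canceling the following US$1 million order, we will also claim US$100,000 in compensation for poor product quality. What solution do you propose?

Front-line representative (sweet, gentle tone): Hello, Mr. Lin. We hope you will let us arrange for our very best technical engineer to find and fix this problem for you. Please do not rush to cancel the contract and file a claim.

East executive (somewhat angry): We do not intend to let your company send another third-rate engineer, as it did last time, to ruin our global economic and trade conference two months from now!

Front-line representative (somewhat sharp and sarcastic tone): Mr. Lin, there must be some misunderstanding. Our engineers are not the third-rate engineers you describe. They can definitely solve the software quality problem for your company within the next month. Perhaps it really is as they reported after the last repair: your company's network planning is flawed, and that is why our video conferencing software has never run smoothly!

East executive (becoming very angry): Who do you think you are, daring to criticize our company? I have decided to return your product under test and cancel the following order, and to claim the US$100,000 penalty for poor product quality from your company.

"When an angry emotion is detected in the preceding conversation, the RS adds the senior representative to the three-way call to listen to the conflict between the front-line representative and the customer. Once the senior representative judges that the front-line representative cannot pacify the customer, the senior representative interrupts the call between the customer and the front-line representative and takes over to pacify the customer. Meanwhile, the front-line representative is notified and disconnects on her own."

2. [Conversation between the senior representative and the customer, the network department executive of East TV Company]

Senior representative (sweet, gentle tone): Hello, this is Mary Chen (陳瑪麗), senior representative No. 202 of 101 Network Telecommunication Company. May I help resolve the problem your company has encountered?

East executive (no longer so angry): I feel that not only is your product quality poor, but your service representative's attitude is also unsatisfactory. I have decided to return your product under test and cancel the following order, and to claim the US$100,000 penalty for poor delivery quality.

Senior representative (soft, gentle tone): Hello, Mr. Lin, it is a pleasure to serve you. Our company will bring in our most professional network engineers and top software designers to jointly diagnose and solve the problems encountered in the current product test. We guarantee to resolve the problems with the product you purchased with full effort and as quickly as possible, and we hope your company will once again let us serve you with our utmost sincerity. Would you be willing to accept our most sincere service?

East executive (calmed down): Thank you! I will re-evaluate your company's product in light of the service attitude shown this time. I hope our business will continue!

Senior representative (sweet, gentle tone): No problem. Our company will certainly provide you with the most satisfactory products and service!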
