Bio-inspired algorithms for pattern recognition in audio and image processing


University of Groningen

Bio-inspired algorithms for pattern recognition in audio and image processing

Strisciuglio, Nicola

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2016

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Strisciuglio, N. (2016). Bio-inspired algorithms for pattern recognition in audio and image processing [Groningen]: University of Groningen

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.


UNIVERSITY OF GRONINGEN

Johann Bernoulli Institute for Mathematics and Computer Science

UNIVERSITY OF SALERNO

Department of Information and Electrical Engineering and Applied Mathematics

BIO-INSPIRED ALGORITHMS FOR PATTERN RECOGNITION IN AUDIO AND IMAGE PROCESSING

A dissertation supervised by promotors

Prof. Dr. sc. techn. Nicolai Petkov

Prof. Dr. Mario Vento

and submitted by

Nicola Strisciuglio

in fulfillment of the requirements for the Degree of

Philosophiæ Doctor (Ph.D.)

May 2016

ISBN: 978-90-367-8931-8 (ISBN ebook: 978-90-367-8932-5)


Bio-inspired algorithms for pattern recognition in audio and image processing

PhD thesis

to obtain the degree of PhD at the

University of Groningen

on the authority of the

Rector Magnificus Prof. E. Sterken

and in accordance with

the decision by the College of Deans.

This thesis will be defended in public on

Friday 10 June 2016 at 09.00 hours

by

Nicola Strisciuglio

born on 16 November 1987

in Nocera Inferiore, Salerno, Italy


Supervisors

Prof. N. Petkov

Prof. M. Vento

Co-supervisor

Dr. G. Azzopardi

Assessment committee

Prof. A.C. Telea

Prof. C.N. Schizas

Prof. V. Loia

Prof. X. Jiang


This research has been conducted at the Intelligent Systems group of the Johann Bernoulli Institute for Mathematics and Computer Science (research institute: JBI) of the University of Groningen and at the MIVIA research group of the Department of Information and Electrical Engineering and Applied Mathematics (DIEM) of the University of Salerno.

This research has been supported by the University of Groningen through an "Ubbo Emmius" scholarship for international sandwich PhD programs and by the Department of Information and Electrical Engineering and Applied Mathematics of the University of Salerno through a research grant on the project "Embedded systems in critical domains" (cod. 4-17-12, P.O.R. Campania FSE 2007-2013).

Bio-inspired algorithms for pattern recognition in audio and image processing

Nicola Strisciuglio

ISBN: 978-90-367-8931-8 (printed version) ISBN: 978-90-367-8932-5 (electronic version)


Abstract

This thesis investigates the construction of pattern recognition systems based on features inspired by the characteristics of the human auditory and visual systems. The thesis addresses two important applications in the fields of intelligent audio surveillance and medical image analysis. In particular, we propose two algorithms for the detection of audio events that can occur at various levels of signal-to-noise ratio (SNR) and two algorithms for the delineation of blood vessels in retinal fundus images.

Audio analysis for the detection of events of interest has recently attracted considerable attention in the pattern recognition community, owing to the increasing demand for safety in public and private environments and the consequent demand for improved surveillance systems. Traditional applications of audio analysis concern speech recognition, speaker identification and music classification. They usually require the sound source to be close to the microphone, so that noise has little influence on the functioning of the overall system. In applications like event detection for audio surveillance, the source of the sound of interest can be at any distance from the microphone. Thus, the detection system has to be able to detect events at various levels of SNR, sometimes even negative ones. Another key requirement for an audio surveillance system is the ability to detect events of interest when they are mixed with different kinds of background noise. Such constraints make the problem at hand very different from traditional applications of audio analysis. Intelligent audio surveillance is a recent research field, and at the time of this work no public data sets were available for testing event detection algorithms. Thus, we constructed and publicly released two new data sets of abnormal events that can occur in everyday life, which we called the MIVIA audio event and MIVIA road event data sets.
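Because events must be detected at various, possibly negative, SNR levels, it helps to see how an event clip can be superimposed on background noise at a chosen SNR, using $\mathrm{SNR_{dB}} = 10\log_{10}(P_{event}/P_{noise})$. The sketch below is a minimal illustration under that definition; the function names are hypothetical and it is not necessarily the procedure used to build the MIVIA data sets.

```python
import math
import random

def power(x):
    """Mean power of a discrete signal: average of the squared samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(event, background, snr_db):
    """Scale `background` so that 10*log10(P_event / P_scaled_background)
    equals `snr_db`, then add it to `event`.
    Returns (mixture, gain applied to the background)."""
    if len(event) != len(background):
        raise ValueError("event and background must have the same length")
    # Required noise power for the target SNR, then the amplitude gain.
    target_noise_power = power(event) / 10 ** (snr_db / 10)
    gain = math.sqrt(target_noise_power / power(background))
    mixture = [s + gain * n for s, n in zip(event, background)]
    return mixture, gain

def snr_db(event, noise):
    """SNR in dB between a clean event and the noise added to it."""
    return 10 * math.log10(power(event) / power(noise))

if __name__ == "__main__":
    random.seed(0)
    fs = 8000
    event = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(fs)]
    background = [random.gauss(0.0, 1.0) for _ in range(fs)]
    for target in (15, 0, -5):  # -5 dB: background louder than the event
        _, g = mix_at_snr(event, background, target)
        print(target, round(snr_db(event, [g * n for n in background]), 3))
```

A negative target simply yields a gain that makes the background more powerful than the event, which is the hard case the text refers to.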

We start from the consideration that an audio stream is composed of small, atomic units of sound, similarly to a piece of text that is composed of a number of words. We propose a system for the detection of audio events based on the bag of features approach. Since the events of interest can be mixed with various types of background noise, we tailored the training phase of the proposed method in order to build a system that is robust to such variability. We tested the system for the detection of glass breaking, gunshot and scream events in public and private environments by using the audio clips in the MIVIA audio event data set. We achieved a high recognition rate (up to 86.7%) with a very low false positive rate (2.1% on the whole test set). Subsequently, we extended the system for the monitoring and surveillance of roads, with the aim of detecting anomalous situations such as car crash and tire skidding events. We designed a deployment strategy for different kinds of road (from very calm country roads to very busy cities or motorways), based on an internationally accepted road noise model. We carried out experiments on the MIVIA road event data set and achieved a recognition rate (82%) and a false positive rate (2.85%) that confirm the performance achieved on the MIVIA audio event data set.
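The text-analogy above can be made concrete: each short audio frame is mapped to its closest "word" in a codebook, and a time window is described by the histogram of word occurrences. The sketch below assumes frame-level feature vectors and a codebook (for example learned by k-means) are already available; both, and the function names, are hypothetical here, and the resulting histogram would then be fed to a classifier.

```python
import math
from collections import Counter

def nearest_word(feature, codebook):
    """Index of the codeword closest (Euclidean distance) to a frame's feature vector."""
    return min(range(len(codebook)), key=lambda k: math.dist(feature, codebook[k]))

def bag_of_features(frames, codebook):
    """Normalised histogram of codeword occurrences over a window of frames,
    i.e. the audio analogue of a bag-of-words text representation."""
    counts = Counter(nearest_word(f, codebook) for f in frames)
    return [counts[k] / len(frames) for k in range(len(codebook))]

if __name__ == "__main__":
    # Hypothetical 2-D frame features and a 3-word codebook.
    codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
    print(bag_of_features(frames, codebook))  # [0.25, 0.5, 0.25]
```

The histogram discards the order of the frames within the window, exactly as a bag of words discards word order; `math.dist` requires Python 3.8 or later.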

In a further study, we take inspiration from some characteristics of the human auditory system to propose trainable filters, which we call CoPE filters, that automatically determine the important features from training audio samples. Indeed, one of the critical steps in the construction of a pattern recognition system is the choice of the most appropriate set of features for the particular problem at hand, i.e. a feature engineering step. The CoPE filters are trainable, as their structure is not fixed in the implementation but is instead learned during a configuration process from training samples. This eliminates the need for a feature engineering step. The important features are learned directly from the events of interest, making the system easily adaptable to different sound recognition tasks and requiring less knowledge about the specific domain of application. We employ the responses of a bank of CoPE filters to build feature vectors that we use to describe the input audio stream. We train a classifier with such feature vectors in order to perform the detection task. We carried out experiments on the MIVIA audio event and MIVIA road event data sets, achieving a recognition rate (higher than 94%) and a false positive rate (less than 4%) that are considerably better than the results achieved by the approach based on the bag of features architecture.
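The abstract states only that the structure of a CoPE filter is learned from a training sample rather than fixed. The sketch below illustrates that idea under explicit assumptions: the input is a 2-D time-frequency energy map (rows are frequency channels, columns are time frames), configuration records the positions of the local energy peaks of a prototype relative to a reference point, and the response combines the input peaks found near those positions with a geometric mean, by analogy with COSFIRE. All names and the tolerance parameter `tol` are hypothetical.

```python
import math

def local_peaks(energy):
    """Keep only the local maxima of a time-frequency energy map
    (rows = frequency channels, columns = time frames); zero elsewhere."""
    rows, cols = len(energy), len(energy[0])
    peaks = [[0.0] * cols for _ in range(rows)]
    for f in range(rows):
        for t in range(cols):
            v = energy[f][t]
            neighbourhood = [energy[i][j]
                             for i in range(max(0, f - 1), min(rows, f + 2))
                             for j in range(max(0, t - 1), min(cols, t + 2))]
            if v > 0 and v >= max(neighbourhood):
                peaks[f][t] = v
    return peaks

def configure(prototype, ref):
    """Learn the filter structure: offsets (df, dt) of every energy peak of
    the prototype relative to the reference point ref = (f0, t0)."""
    f0, t0 = ref
    peaks = local_peaks(prototype)
    return [(f - f0, t - t0)
            for f, row in enumerate(peaks)
            for t, v in enumerate(row) if v > 0]

def response(energy, model, ref, tol=1):
    """Geometric mean, over the learned offsets, of the strongest input peak
    within +/- tol cells of each expected position; 0 if any peak is missing."""
    if not model:
        return 0.0
    peaks = local_peaks(energy)
    rows, cols = len(peaks), len(peaks[0])
    f0, t0 = ref
    log_sum = 0.0
    for df, dt in model:
        f, t = f0 + df, t0 + dt
        best = max((peaks[i][j]
                    for i in range(max(0, f - tol), min(rows, f + tol + 1))
                    for j in range(max(0, t - tol), min(cols, t + tol + 1))),
                   default=0.0)
        if best <= 0.0:
            return 0.0  # AND-like behaviour: a missing peak suppresses the output
        log_sum += math.log(best)
    return math.exp(log_sum / len(model))
```

The tolerance window lets the filter respond to slightly shifted or stretched versions of the prototype, while the geometric mean makes it silent when part of the learned pattern is absent. Responses of a bank of such filters, each configured on a different training sample, would form the feature vectors passed to the classifier.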

In the second part of the thesis we address an important application in the field of medical image analysis, i.e. the segmentation of blood vessels in retinal fundus images. Retinal fundus imaging is a non-invasive tool that is widely employed by medical experts to diagnose various pathologies such as glaucoma, age-related macular degeneration, diabetic retinopathy and atherosclerosis. There is also evidence that such images may contain signs of non-eye-related pathologies, including cardiovascular and systemic diseases. In recent years, medical communities have paid particular attention to the early diagnosis and monitoring of diabetic retinopathy, since it is one of the principal causes of blindness in the world. The manual inspection of retinal fundus images requires highly skilled people, which makes it an expensive and time-consuming process. Thus, the mass screening of a population is not feasible without the use of computer-aided diagnosis systems. Such systems could be used to refer only the patients with suspicious signs of disease to medical experts.

We introduce a novel method for the automatic segmentation of vessel trees in retinal fundus images. We propose a filter that selectively responds to vessels, which we call B-COSFIRE, with B standing for bar, an abstraction of a vessel. It is based on the existing COSFIRE (Combination Of Shifted Filter Responses) approach. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the outputs of a pool of Difference-of-Gaussians filters whose supports are aligned in a collinear manner. It achieves rotation invariance efficiently by simple shifting operations. The proposed filter is versatile, as its selectivity is determined from any given vessel-like prototype pattern in an automatic configuration process. The results that we achieve on three publicly available data sets, in terms of sensitivity (Se) and specificity (Sp) (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE DB1: Se = 0.7585, Sp = 0.9587), are higher than those of many state-of-the-art methods.
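The two mechanisms named above, a weighted geometric mean of collinear Difference-of-Gaussians responses and rotation by shifting, can be sketched as follows. This is a simplified illustration, not the thesis implementation: it assumes the DoG response map is already computed and rectified, the tuples `(rho, phi)` are written by hand instead of being configured from a prototype bar, the blurring of DoG responses used in COSFIRE is omitted, and the Gaussian weighting with `sigma_w` is an assumed choice.

```python
import math

def b_cosfire(dog, tuples, x, y, psi=0.0, sigma_w=10.0):
    """Weighted geometric mean of DoG responses sampled at the positions
    (x + rho*cos(phi + psi), y + rho*sin(phi + psi)); x is the column and
    y the row index of the filter centre."""
    # Gaussian weights: points closer to the centre count more.
    w = [math.exp(-rho ** 2 / (2 * sigma_w ** 2)) for rho, _ in tuples]
    rows, cols = len(dog), len(dog[0])
    log_sum = 0.0
    for (rho, phi), wi in zip(tuples, w):
        # Rotating the filter by psi only shifts each polar angle,
        # so no new filter has to be configured.
        col = x + round(rho * math.cos(phi + psi))
        row = y + round(rho * math.sin(phi + psi))
        if not (0 <= row < rows and 0 <= col < cols):
            return 0.0
        s = dog[row][col]
        if s <= 0.0:
            return 0.0  # any missing sub-response suppresses the output
        log_sum += wi * math.log(s)
    return math.exp(log_sum / sum(w))

def rotation_invariant(dog, tuples, x, y, n_orient=12, sigma_w=10.0):
    """Maximum response over n_orient rotations in [0, pi); a symmetric
    bar filter repeats after a rotation by pi."""
    return max(b_cosfire(dog, tuples, x, y, k * math.pi / n_orient, sigma_w)
               for k in range(n_orient))

if __name__ == "__main__":
    # Hypothetical 9x9 DoG map containing a vertical bar through column 4.
    bar_v = [[1.0 if c == 4 else 0.0 for c in range(9)] for _ in range(9)]
    # Hand-written tuples for a horizontal bar (rho, phi), centred on (4, 4).
    tuples = [(0, 0.0), (2, 0.0), (2, math.pi), (4, 0.0), (4, math.pi)]
    print(b_cosfire(bar_v, tuples, 4, 4), rotation_invariant(bar_v, tuples, 4, 4))
```

The horizontally configured filter gives 0 on the vertical bar, while the rotation-invariant version recovers a full response at the 90 degree rotation; only the angles change, which is what makes the approach efficient.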

In the last part of the thesis, we further investigate the flexibility and adaptability of the proposed B-COSFIRE filters and propose to employ them within a classification pipeline. The framework that we propose automatically determines the most appropriate subset of filters for the application at hand. Initially, we configure a bank of B-COSFIRE filters and use the responses obtained on training retinal images to form pixel-wise feature vectors, which describe vessel and non-vessel pixels. Then, we employ various techniques based on information theory and machine learning to select an optimal subset of B-COSFIRE filters. We finally train a classifier by using feature vectors constructed with the responses of the selected filters and employ it to classify every pixel in the test image. The improvement of the results that we achieve on the DRIVE and STARE data sets with respect to the unsupervised B-COSFIRE filters is statistically significant.
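One simple information-theoretic way to choose a subset of filters from pixel-wise feature vectors is to rank each filter by the mutual information between its (discretised) response and the vessel/non-vessel label. The sketch below shows only that ranking; the specific selection techniques used in the thesis are not detailed in this abstract, and this criterion ignores redundancy between filters, which more elaborate methods take into account. The function names and the equal-width binning are assumptions.

```python
import math
from collections import Counter

def discretize(values, bins):
    """Map continuous filter responses to `bins` equal-width integer bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from two discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def select_filters(features, labels, k, bins=8):
    """features[i][j] = response of filter j at training pixel i;
    labels[i] = 1 for vessel, 0 for background.
    Returns the indices of the k filters with the highest mutual information."""
    scores = []
    for j in range(len(features[0])):
        column = discretize([row[j] for row in features], bins)
        scores.append((mutual_information(column, labels), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest MI first, ties by index
    return [j for _, j in scores[:k]]
```

For example, with eight training pixels labelled `[0, 0, 0, 0, 1, 1, 1, 1]`, a filter whose responses are low on background and high on vessels scores 1 bit and is selected first, while a constant filter scores 0.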

We studied the computational requirements of the proposed algorithms in order to evaluate their applicability in real-world applications and whether they fulfil the real-time constraints imposed by the considered problems.

This thesis contributes to the development of bio-inspired algorithms for audio and image processing and promotes their use in higher-level pattern recognition systems.


Samenvatting (Summary)

This thesis investigates the construction of pattern recognition systems that are based on features inspired by the properties of the human visual and auditory systems. The thesis addresses two important applications in the fields of intelligent audio surveillance and medical image analysis. In particular, we present two algorithms for the detection of audio "events" that can occur at different levels of signal-to-noise ratio, and two algorithms for the segmentation of blood vessels in retinal fundus images.

Audio analysis for the detection of "events of interest" has recently gained interest in the field of pattern recognition, owing to the increasing need for safety in the public and private domain and the resulting demand for better surveillance and security systems. Typical applications of audio analysis include speech recognition, speaker identification and music classification. Normally, these applications require the sound source to be close to the microphone, in order to limit the influence of noise on the functioning of the whole system. In applications such as "event" detection for audio surveillance, the source of the sound can be at any distance from the microphone. Hence, the detection system must be able to detect events at different SNR levels. Another requirement for an audio surveillance system is the ability to detect "events of interest" when they are mixed with different kinds of background noise. Such constraints make the problem at hand very different from classical applications of audio analysis. Intelligent audio surveillance is a recent research field, and at the time of this research no public data sets were available for testing "event detection" algorithms. Therefore, we designed and released two new data sets of abnormal events. The data sets, named MIVIA audio event and MIVIA road event, contain abnormal events that can occur in everyday life.

Our starting point is the consideration that an audio stream consists of small, atomic units of sound, just as a piece of text consists of a number of words. We present a system for the detection of audio events that is based on the bag of features approach. Since the events of interest can be mixed with different kinds of background noise, we tuned the training phase of the proposed method in order to build a system that can withstand such variability. The system was tested on the detection of breaking glass, gunshots and screams in public and private environments, using the audio clips in the MIVIA audio event data set. We achieved a high recognition rate (up to 86.7%) with a very low false positive rate (2.1% on the whole test set). Subsequently, we extended the system for deployment in road monitoring, with the aim of detecting anomalous situations such as crashes or tire skidding. We designed a deployment strategy for different types of roads (from very quiet country roads to busy cities or motorways), based on an internationally recognized road noise model. The experiments on the MIVIA road event data set achieved a recognition rate (82%) and a false positive rate (2.85%) that corroborate the results achieved on the MIVIA audio event data set.

In a follow-up study inspired by some properties of the human auditory system, we present trainable filters, called CoPE filters, that automatically determine the important features of training audio samples. A crucial step in the construction of a pattern recognition system is the choice of the most suitable set of features for the task at hand, i.e. the feature engineering step. The CoPE filters are trainable, since their structure is not fixed in the implementation; instead, it is learned from training samples during a configuration process. This makes a "feature engineering" step unnecessary. The important features are learned directly from the events of interest, which makes the systems adaptable to different sound recognition tasks and reduces the required knowledge of the specific application domain. We use the responses of a bank of CoPE filters to build feature vectors that describe the input audio stream. A classifier is trained with such feature vectors to perform the detection task. We carried out experiments on the MIVIA audio event and MIVIA road event data sets, which achieved a recognition rate (higher than 94%) and a false positive rate (less than 4%) that considerably improve on the results obtained with the approach based on the bag of features architecture.

In the second part of this thesis we address an important application in the field of medical image analysis, namely the segmentation of blood vessels in retinal fundus images. Retinal fundus imaging is a non-invasive tool that is widely used by medical specialists to diagnose various diseases, including glaucoma, age-related macular degeneration, diabetic retinopathy and atherosclerosis. There is also evidence that such images can contain signs of non-eye-related diseases, including cardiovascular and systemic diseases. In recent years, medical communities have paid particular attention to the early diagnosis and monitoring of diabetic retinopathy, since it is one of the main causes of blindness in the world. Manual inspection of retinal fundus images requires highly skilled personnel, which makes it a very expensive and time-consuming process. Hence, mass screening of a population is not feasible without the use of computer-aided diagnosis systems. Such systems could be used to refer only the patients with suspicious symptoms of disease to medical specialists.

We introduce a new method for the automatic segmentation of vessel trees in retinal fundus images. We present a filter that responds selectively to blood vessels, called B-COSFIRE, with B referring to bar, an abstraction of a blood vessel. It is based on the existing COSFIRE (Combination of Shifted Filter Responses) approach. A B-COSFIRE filter achieves orientation selectivity by computing the weighted geometric mean of the outputs of a pool of Difference-of-Gaussians filters whose supports are aligned in a collinear manner. It achieves rotation invariance efficiently by means of simple shift operations. The proposed filter is versatile, since its selectivity is determined by any given vessel-like prototype pattern in an automatic configuration process. The results that we achieved on three publicly available data sets (DRIVE: Se = 0.7655, Sp = 0.9704; STARE: Se = 0.7716, Sp = 0.9701; CHASE DB1: Se = 0.7585, Sp = 0.9587) are higher than those of many state-of-the-art methods.

The last part of the thesis describes a follow-up study on the flexibility and adaptability of the proposed B-COSFIRE filters, in which we propose to employ them within a classification pipeline. The framework that we propose automatically determines the most suitable subset of filters for the application at hand. First, we configure a bank of B-COSFIRE filters and use the responses obtained on training retinal images to form pixel-wise feature vectors that describe vessel and non-vessel pixels. Then, we employ various techniques based on information theory and machine learning to select an optimal subset of B-COSFIRE filters. The improvement of the results that we achieve on the DRIVE and STARE data sets with respect to the unsupervised B-COSFIRE filters is statistically significant.

We studied the computational requirements of the proposed algorithms in order to evaluate both their applicability in real-world applications and the fulfilment of the real-time constraints imposed by the considered problems.

This thesis contributes to the development of bio-inspired algorithms for audio and image processing and promotes their application in higher-level pattern recognition systems.


6.2.5 Classification
6.2.6 Application phase
6.3 Materials
6.3.1 Data sets
6.3.2 B-COSFIRE implementation
6.4 Experiments
6.4.1 Pre-processing
6.4.2 Evaluation
6.4.3 Results
6.4.4 Statistical analysis
6.4.5 Comparison with existing methods

7 Summary and Outlook
7.1 Summary
7.2 Outlook

Bibliography

Research Activities
