Adaptive methods for score function modeling in blind source separation


Teknillinen korkeakoulu Signaalinkäsittelytekniikan laboratorio

Espoo 2002

Report 35

ADAPTIVE METHODS FOR SCORE FUNCTION MODELING IN

BLIND SOURCE SEPARATION


Helsinki University of Technology Signal Processing Laboratory


Juha Karvanen

Dissertation for the degree of Doctor of Science in Technology to be presented with due permission for public examination and debate in Auditorium S4 at Helsinki University of Technology (Espoo, Finland) on the 26th of August, 2002, at 12 o'clock noon.

Helsinki University of Technology

Department of Electrical and Communications Engineering

Signal Processing Laboratory

Teknillinen korkeakoulu

Sähkö- ja tietoliikennetekniikan osasto

Signaalinkäsittelytekniikan laboratorio


Distribution:

Helsinki University of Technology

Signal Processing Laboratory

P.O. Box 3000

FIN-02015 HUT

Tel. +358-9-451 2486

Fax. +358-9-460 224

E-mail: [email protected]

© Juha Karvanen

ISBN 951-22-5990-7

ISSN 1458-6401

Otamedia Oy

Espoo 2002


In signal processing and related fields, multichannel measurements are often encountered. Depending on the application, for instance, multiple antennas, multiple microphones or multiple biomedical sensors are used for the data acquisition. Such systems can be described using Multiple-Input Multiple-Output (MIMO) system models. In many cases, several source signals are present at the same time and there is only limited knowledge of their properties and how they contribute to each sensor output. If the source signals and the physical system are unknown and only the sensor outputs are observed, the processing methods developed for recovering the original signals are called blind.

In Blind Source Separation (BSS) the goal is to recover the source signals from the observed mixed signals (mixtures). Blindness means that neither the sources nor the mixing system is known. Separation can be based on the theoretically limiting but practically feasible assumption that the sources are statistically independent. This assumption connects BSS and Independent Component Analysis (ICA). The usage of mutual information as a measure of independence leads to iterative estimation of the score functions of the mixtures.

The purpose of this thesis is to develop BSS methods that can adapt to different source distributions. Adaptation makes it possible to separate sources without knowing the source distributions or even the characteristics of source distributions. Special attention is paid to methods that also allow asymmetric source distributions. Asymmetric distributions occur in important applications such as communications and biomedical signal processing. Adaptive techniques are proposed for the modeling of score functions or estimating functions. Three approaches based on the Pearson system, the Extended Generalized Lambda Distribution (EGLD) and adaptively combined fixed estimating functions are proposed. The Pearson system and the EGLD are parametric families of distributions and they are used to model


contain a wide class of distributions, including asymmetric distributions with positive and negative kurtosis, while the estimation of the parameters is still a relatively simple procedure. The methods may be implemented using existing ICA algorithms.

The reliable performance of the proposed methods is demonstrated in extensive simulations. In addition to symmetric source distributions, asymmetric distributions, such as Rayleigh and lognormal distribution, are utilized in simulations. The score adaptive methods outperform commonly used methods due to their ability to adapt to asymmetric


And I applied my heart to the study of wisdom and knowledge, of madness and folly, and I came to know that this, too, was chasing after the wind. Eccl. 1:17

The work reported in this thesis was carried out in the Signal Processing Laboratory, Helsinki University of Technology during the years 1999–2002. The period included three months in the Signal Processing Laboratory at University of Pennsylvania, Philadelphia, USA.

I wish to express my gratitude to my supervisor Prof. Visa Koivunen. I would like to thank my thesis reviewers Dr. Aapo Hyvärinen and Dr. Jyrki Möttönen for their constructive comments. I wish to thank Prof. Saleem Kassam for my visit to University of Pennsylvania. I am also grateful to my colleagues and co-workers, especially Jan Eriksson, Dr. Yinglu Zhang and Dr. Charles Murphy. In addition, I would like to thank my parents Toivo and Marja-Liisa for their support.

The research was funded by the Academy of Finland and the Graduate School in Electronics, Telecommunications and Automation (GETA). Additional financial support was provided by Elektroniikkainsinöörien säätiö, the Nokia Foundation and Tekniikan edistämissäätiö. These organizations and foundations are gratefully acknowledged for making this work possible.

Espoo, Finland

May 13, 2002


1 Introduction
  1.1 Motivation
  1.2 Scope of the Thesis
  1.3 Contribution of the Thesis
  1.4 Summary of Publications

2 Blind Source Separation
  2.1 Overview
  2.2 Independent Component Analysis Model
    2.2.1 The basic ICA model
    2.2.2 Extensions of the basic ICA model
  2.3 Anatomy of an ICA Method
  2.4 Measures of Independence
  2.5 Objective Functions and Estimating Functions
  2.6 Mutual Information and Source Adaptation
  2.7 Algorithms
    2.7.1 Natural gradient algorithm
    2.7.2 Fixed-point algorithm
    2.7.3 Jacobi algorithms
  2.8 Characterization of Source Distributions
  2.9 Discussion

3 Review of source adaptive ICA methods
  3.1 Overview
  3.2 Kernel estimation of densities
  3.3 Parametric models
    3.3.1 Distribution families
    3.3.2 Mixture of Densities
  3.4 Adaptive nonlinearities
    3.4.1 Polynomial expansions
    3.4.2 Basis functions
    3.4.3 Threshold functions and quantizers

4 Adaptive Score Models
  4.1 Overview
  4.2 Pearson System
    4.2.1 Estimation of the Pearson system parameters
    4.2.2 Extensions of the Pearson system
  4.3 Extended Generalized Lambda Distribution
    4.3.1 Parameter estimation via sample moments
    4.3.2 Parameter estimation via L-moments
    4.3.3 Other estimation techniques
  4.4 Adaptive Estimating Functions
    4.4.1 Estimating functions based on cumulants and absolute moments
    4.4.2 Gaussian moments based estimating functions
  4.5 Performance


I J. Karvanen, J. Eriksson and V. Koivunen. Pearson System based Method for Blind Separation. In Proc. of The Second International Workshop on Independent Component Analysis and Blind Signal Separation, ICA 2000, pages 585–590, June 2000.

II J. Eriksson, J. Karvanen and V. Koivunen. Source Distribution Adaptive Maximum Likelihood Estimation of ICA Model. In Proc. of The Second International Workshop on Independent Component Analysis and Blind Signal Separation, ICA 2000, pages 227–232, June 2000.

III J. Karvanen, J. Eriksson and V. Koivunen. Maximum Likelihood Estimation of ICA model for Wide Class of Source Distributions. In Proc. of the 2000 IEEE Workshop on Neural Networks for Signal Processing X, pages 445–454, December 2000.

IV J. Karvanen and V. Koivunen. Blind Separation of Communication Signals Using Pearson System Based Method. In Proc. of The Thirty-Fifth Annual Conference on Information Sciences and Systems, Volume II, pages 764–767, March 2001.

V J. Karvanen and V. Koivunen. Blind Separation Methods Based on Pearson system and its Extensions. Signal Processing, Volume 82, Issue 4, pages 663–673, April 2002.

VI J. Karvanen, J. Eriksson and V. Koivunen. Adaptive Score Functions for Maximum Likelihood ICA. Journal of VLSI Signal Processing, Volume 32, pages 83–92, 2002.

VII J. Karvanen and V. Koivunen. Blind Separation using Absolute Moments Based Adaptive Estimating Function. In Proc. of the Third International Conference on Independent Component Analysis and Signal Separation, ICA 2001, pages 218–223, December


symbols

Abbreviations

BER      bit error rate
BSS      blind source separation
cdf      cumulative distribution function
EGLD     extended generalized lambda distribution
FIR      finite impulse response
GBD      generalized beta distribution
GLD      generalized lambda distribution
GMSK     Gaussian minimum shift keying
ICA      independent component analysis
i.i.d.   independent identically distributed
I-MIMO   instantaneous multi-input multi-output
ISI      intersymbol interference
MIMO     multiple-input multiple-output
MSE      mean square error
PCA      principal component analysis
pdf      probability density function


Symbols

$x$    vector of output signals
$A$    mixing matrix
$s$    vector of source signals
$m$    number of sensors, dimension of the data
$y$    vector of source estimates
$W$    demixing matrix
$A^T$    transpose of matrix $A$
$A^{-1}$    inverse of matrix $A$
$\det(A)$    determinant of matrix $A$
$I$    identity matrix
$f$    probability density function
$f_G$    Gaussian density
$F$    cumulative distribution function
$K(f(y), g(y))$    Kullback-Leibler divergence between densities $f(y)$ and $g(y)$
$K_{y\|s}$    Kullback-Leibler divergence between densities of random variables $y$ and $s$
$\phi_{ML}$    maximum likelihood contrast
$\phi_{MI}$    mutual information contrast
$\Psi(x; W)$    matrix-valued estimating function
$\varphi_y(y)$    score function of $y$
$\varphi(y_i)$    (one-unit) estimating function
$\psi$    characteristic function of a distribution
$t$    time index
$T$    number of observations
$f'(x)$    derivative of $f(x)$
$\mu_1, \mu_2, \mu_3, \ldots$    central moments
$\kappa_1, \kappa_2, \kappa_3, \ldots$    cumulants
$L_1, L_2, L_3, \ldots$    L-moments
absolute moments
skewed absolute moments
$G_0, G_1, G_2, \ldots$    Gaussian moments
$\delta_3, \delta_4$    cumulant based skewness and kurtosis
$\tau_3, \tau_4$    L-moment based skewness and kurtosis
$\delta_2, \delta_3$    absolute moments based skewness and kurtosis
$\delta_0, \delta_1$    Gaussian moments based skewness and kurtosis
$E\{x\}$    expected value of $x$
$a_0, a_1, \ldots, a_p$    numerator polynomial parameters of the Pearson system
$b_0, b_1, \ldots, b_q$    denominator polynomial parameters of the Pearson system
$a, b$    vectors of the Pearson system parameters
$M, Q$    matrices containing central moments
$\lambda_1, \lambda_2, \lambda_3, \lambda_4$    parameters of the GLD
$\beta_1, \beta_2, \beta_3, \beta_4$    parameters of the GBD
$\omega_1, \omega_2, \ldots$    weighting parameters
$E_1$    performance index
$\vartheta_i$    local stability
variance of stability solution
BSS efficacy
kernel function

T


Introduction

1.1 Motivation

In signal processing and related fields, multichannel measurements are often encountered. The obtained data can be represented as multivariate time series. Depending on the application, for instance, multiple antennas, multiple microphones or multiple biomedical sensors are used for the data acquisition. Such systems can be described using Multiple-Input Multiple-Output (MIMO) system models. The observed sensor outputs are different because the sensors have different properties, e.g. separate locations. On the other hand, the sensor outputs are related because the sensors are observing the same source signals. In many cases, several source signals are present at the same time and there is only limited knowledge of their properties and how they contribute to each sensor output. If the source signals and the physical system are unknown and only the sensor outputs are observed, the processing methods developed for recovering the original signals are called blind. An illustration of an instantaneous mixing MIMO model is presented in Figure 1.1.

In Blind Source Separation (BSS, also known as Blind Signal Separation) the goal is to recover the source signals from the observed mixed signals. Blindness means that neither the sources nor the mixing system is known. Separation can be based on the theoretically limiting but practically feasible assumption that the sources are statistically independent. This assumption connects BSS and Independent Component Analysis (ICA). The terms

Figure 1.1: An illustration of an instantaneous noise-free mixing system. The system and the sources are unknown and only the sensor outputs are observed.

to separate certain transmitted signals whereas in ICA the goal is to find some components that are statistically as independent as possible. Thus, ICA can be seen as a tool to solve the BSS problem.

BSS and ICA have been applied, for example, in the following application domains:

- Audio and speech signal separation, e.g. [113, 102, 62]
- Multiple-Input Multiple-Output (MIMO) communications systems, e.g. [60, 19, 42, 10, 29, 123, 116]
- Biomedical signal processing, e.g. [82, 85, 66, 24, 118]
- Image processing and feature extraction, e.g. [9, 59, 23]
- Econometrics and financial applications, e.g. [8, 76, 49]

During the last ten years, a considerable amount of work has been focused on BSS/ICA. Conferences and special sessions concentrating on ICA have been organized. The theoretical

1.2 Scope of the Thesis

The purpose of this thesis is to develop ICA methods that can adapt to different source distributions. Adaptation makes it possible to separate sources without knowing the source distributions or even the characteristics of source distributions. Special attention is paid to methods that allow not only symmetric but also asymmetric source distributions. Asymmetric distributions occur in key application areas, such as communications and biomedical signal processing.

The ICA model has two groups of parameters: the mixing system and the source distributions. It has been shown that if the source distributions are known, optimal separation algorithms may be derived [17]. This is done by utilizing the score functions of the sources. Blindness means, however, that no explicit knowledge on the source distributions is available. It follows that the better the sources or the score functions of the sources are estimated, the better separation result we can expect.

The first goal of this thesis is to develop methods for learning the source distributions. In practice, it is adequate to concentrate on the approaches that capture the essential properties of the source distributions. The second goal is to find efficient implementations of the proposed methods. This includes the choice of the optimization algorithm, robustness considerations and simulation studies of practical performance. The objective is to show that adaptive estimation methods are necessary and, on the other hand, that the price paid for the increased flexibility is not too high.

1.3 Contribution of the Thesis

The contributions of this thesis are in developing new methods for ICA. Adaptive techniques are proposed for the modeling of score functions or estimating functions. Score functions are modeled using parametric families. The methods may be incorporated into existing ICA algorithms. The contributions can be summarized as follows:

- The relationship between score adaptive estimation and minimization of mutual

- An extended Pearson system model allowing multimodal distributions is introduced. The case of bimodal distributions is considered in more detail. The obtained score functions are bounded and defined everywhere. The parameters can be estimated using the method of moments.
- The use of the Extended Generalized Lambda Distribution (EGLD) in the ICA problem is introduced in co-operation with the co-authors [Publication II].
- The method of L-moments is proposed for the estimation of the parameters of the Generalized Lambda Distribution (GLD).
- The optimal weighting is derived for the adaptive estimating functions comprised of two fixed components using the concept of BSS efficacy.
- Absolute moments are proposed as estimating functions.
- The performance of the proposed methods is studied quantitatively and qualitatively in simulations. The simulations demonstrate the reliable performance of the methods.

1.4 Summary of Publications

This thesis consists of 7 publications and a summary. The summary part of the thesis is organized as follows: Chapter 2 introduces the basic concepts and methods of BSS. Chapter 3 contains an overview of the existing methods for source adaptive ICA. In Chapter 4 the main contribution of the thesis is summarized and the methods of Pearson-ICA, EGLD-ICA and adaptive estimating functions are presented. Chapter 5 provides a brief summary and outlines future research.

In Publication I a Pearson system based BSS method is introduced. An algorithm using the method of moments is proposed for finding the parameters of the Pearson system. The actual separation is performed using the fixed-point algorithm [58]. The simulation examples demonstrate that the method can separate super- and sub-Gaussian sources and even non-Gaussian sources with zero kurtosis.

In Publication II an EGLD based BSS method is introduced. An algorithm utilizing the

In Publication III the algorithms proposed in Publications II and I are further studied and compared. It is demonstrated in simulations that the standard BSS methods may perform poorly in the cases where the sources have asymmetric distributions. Due to source adaptation the EGLD and Pearson system based methods reliably separate the sources.

In Publication IV the applications of Pearson-ICA are considered from the viewpoint of telecommunications. Separation of binary sources and instantaneous mixing of Rayleigh or lognormal faded signals are used as examples. Simulation results are provided.

In Publication V the use of the Pearson system is further developed. The different types of distributions in the Pearson family are studied in the ICA context. It is shown using the results by Pham [97] that the minimization of the mutual information contrast leads to iterative use of score functions as estimating functions. An extension of the Pearson system that can model multimodal distributions is introduced. The applicability of the Pearson system based method is demonstrated in simulation examples, including blind equalization of GMSK signals.

Publication VI is an extended version of Publication III. The performance of the proposed methods is studied in more detail. The additional contribution is the method of L-moments proposed for the estimation of GLD parameters. It is argued that the L-moments are a more natural way to estimate the GLD parameters than the conventional sample moments. Additionally, the L-moments have attractive theoretical properties, including lower sample variance compared to the sample moments.

Publication VII considers the problem of adaptive score estimation from a different viewpoint. The proposed estimating functions, comprised of a symmetric and an asymmetric part, can capture the essential features of the source distributions. The optimal weighting between the symmetric and asymmetric part is derived using the concept of BSS efficacy. General results are derived and absolute moment based estimating functions are presented as an example.

The author derived all the equations, performed all the simulations and was mainly responsible for writing Publications I, III, IV, V, VI and VII. The co-authors contributed in steering the research, in designing experiments and in writing the papers.

The first author was mainly responsible for writing Publication II. The idea of using the

Blind Source Separation

2.1 Overview

This chapter provides a short overview of blind source separation (BSS) and independent component analysis (ICA). The key concepts and assumptions needed in ICA and BSS are described. The basic ICA model and its extensions are considered. The elements and principles of an ICA method are explained. More extensive overviews are given in several books and tutorial articles [60, 49, 50, 80, 17, 4].

2.2 Independent Component Analysis Model

2.2.1 The basic ICA model

In this thesis we consider the noiseless instantaneous ICA model

$$x = As, \tag{2.1}$$

where $s = [s_1, s_2, \ldots, s_m]^T$ is an unknown source vector and the $m \times m$ matrix $A$ is an unknown real-valued mixing matrix. The observed mixtures $x = [x_1, x_2, \ldots, x_m]^T$ are sometimes called sensor outputs. According to [27, 68], the following assumptions are needed for the model to be identifiable:

2. At most one of the sources has a Gaussian distribution.
3. The mixing matrix $A$ is invertible.
4. Moments of the sources exist up to the necessary order.

Separation means that we find a separating matrix $W$ that makes the components of

$$y = Wx \tag{2.2}$$

mutually independent. The $i$th row vector of $W$ is denoted by $w_i$. A solution can be found only up to scaling and permutation. If $W$ is a separating matrix, any matrix $\Lambda P W$, where $\Lambda$ is an invertible diagonal matrix and $P$ is a permutation matrix (in a permutation matrix exactly one element in every row and column is 1 and the other elements are 0), is also a separating matrix [27].
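The scaling and permutation ambiguity can be illustrated with a small numerical sketch. The 2 x 2 mixing matrix, the sources and the helper `matmul` below are hypothetical values chosen only for illustration; in a blind setting `A` would of course be unknown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical mixing matrix (known here only for illustration).
A = [[1.0, 0.5],
     [0.25, 1.0]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
W = [[A[1][1] / det, -A[0][1] / det],      # W = A^{-1} is one separating matrix
     [-A[1][0] / det, A[0][0] / det]]

Lam = [[2.0, 0.0], [0.0, -0.5]]            # invertible diagonal scaling
P = [[0.0, 1.0], [1.0, 0.0]]               # permutation matrix
W_alt = matmul(Lam, matmul(P, W))          # Lam P W separates equally well

S = [[1.0, -1.0, 0.5],                     # source 1 (row), three time samples
     [0.0, 2.0, -3.0]]                     # source 2
X = matmul(A, S)                           # observed mixtures, model (2.1)
Y = matmul(W, X)                           # recovers the sources, model (2.2)
Y_alt = matmul(W_alt, X)                   # recovers them swapped and rescaled
```

Here `Y_alt` contains the same two source waveforms as `Y`, but in the opposite order and multiplied by 2 and -0.5, which is all that independence alone can determine.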

The ICA model has two types of parameters: the mixing coefficients in $A$ and the source densities. Usually, we are interested in the mixing matrix $A$ or the actual source values, and the source densities are treated as nuisance parameters. Without any additional assumptions, the estimation of the densities is considered as a nonparametric problem. Together with the parametric estimation of the mixing matrix, the estimation of the ICA model is referred to as a semiparametric problem [3].

2.2.2 Extensions of the basic ICA model

The basic ICA model may be extended in several different ways. The noisy ICA model is expressed as

$$x = As + n, \tag{2.3}$$

where $n$ is a Gaussian noise vector independent of the sources. Adding the noise makes the model more realistic because there is always noise in physical sensor measurements. If the noise variances are small compared to the output variances, the methods for noiseless ICA can be utilized with good results. In the presence of heavy noise additional methods

Complex-valued sources and mixing matrices occur especially in communication problems. There exist ICA methods developed for the complex-valued problem [15, 11]. Sometimes the problem reduces to the real-valued problem, e.g. [123].

In some applications the mixing matrix $A$ is not a square matrix. In the case where the number of mixtures is higher than the number of sources we essentially have the basic problem with extra information. The rank of the model or the number of sources might be unknown and should also be estimated, e.g. [22]. The case where the number of mixtures is lower than the number of sources is a difficult problem. Since the mixing is not invertible, the identification of the mixing matrix and the recovery of the sources are separate problems. Generally, the sources cannot be recovered without additional assumptions. The problem has been considered in [108, 28, 20, 32, 79, 63].

In convolutive mixing, the observed discrete-time signals $x_i(t)$, $i = 1, \ldots, m$ are generated from the model

$$x_i(t) = \sum_{j=1}^{m} \sum_{k} a_{ijk} s_j(t - k). \tag{2.4}$$

This is a Finite Impulse Response Multi-input Multi-output (FIR-MIMO) model, whereas the basic instantaneous mixing model (2.1) can be seen as an instantaneous MIMO (I-MIMO) system. In model (2.4) each FIR filter (for fixed indices $i$ and $j$) is defined by the coefficients $a_{ijk}$. Convolutive models are considered e.g. in [6, 114, 93, 45, 46].
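As a minimal sketch of model (2.4), the following function generates convolutive mixtures from given FIR coefficients. The function name, the nested-list layout `a[i][j][k]` and the convention that $s_j(t) = 0$ for $t < 0$ are assumptions made for this illustration.

```python
def convolutive_mix(a, s):
    """x_i(t) = sum_j sum_k a[i][j][k] * s[j][t - k], with s_j(t) = 0 for t < 0."""
    n_out, T = len(a), len(s[0])
    x = [[0.0] * T for _ in range(n_out)]
    for i in range(n_out):
        for j in range(len(s)):
            for k, coeff in enumerate(a[i][j]):   # taps of the FIR filter (i, j)
                for t in range(k, T):
                    x[i][t] += coeff * s[j][t - k]
    return x

# Hypothetical filters: a[i][j] lists the taps from source j to sensor i.
a = [[[1.0, 0.5], [0.0, 0.2]],
     [[0.3], [1.0, -1.0]]]
s = [[1.0, 0.0, 0.0],      # source 1: unit impulse at t = 0
     [0.0, 1.0, 0.0]]      # source 2: unit impulse at t = 1
x = convolutive_mix(a, s)
```

When every filter has a single tap ($k = 0$ only), the function reduces to the instantaneous model (2.1).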

The nonlinear ICA model is given by

$$x = h(s), \tag{2.5}$$

where $h$ is an unknown $m$-component mixing function. If the space of the nonlinear functions $h$ is not limited, there exists an infinity of solutions [61, 40]. Recently, the interest towards nonlinear ICA has increased. The uniqueness problems are avoided using a Bayesian approach [117], regularization techniques [1] or structured models [109, 40]. An important special case of the general nonlinear model (2.5) is the post-nonlinear mixture model [111]

$$x_i = h_i\left(\sum_{j=1}^{m} a_{ij} s_j\right), \quad i = 1, \ldots, m, \tag{2.6}$$

where the nonlinear functions $h_i$, $i = 1, \ldots, m$, are applied to the linear mixtures.

2.3 Anatomy of an ICA Method

In this thesis ICA methods are studied in the following framework. An ICA method consists of three parts:

1. Measure for independence (theoretical contrast)
2. Estimator of the measure, or objective function
3. Algorithm for optimization

These parts are considered in the following sections. We make a distinction between theoretical measures of independence and the estimators of independence calculated from the data. From the theoretical point of view the linear instantaneous ICA problem is solved: independent components are found when the chosen measure for independence is minimized. However, the great number of proposed ICA methods shows that there is work to do with estimators and algorithms.

2.4 Measures of Independence

Mutual independence of random variables $y = [y_1, y_2, \ldots, y_m]^T$ means that the joint distribution can be factorized and presented as a product of the marginals. The factorization can be defined using cumulative distribution functions

$$F(y) = F_1(y_1) F_2(y_2) \cdots F_m(y_m), \tag{2.7}$$

probability densities

$$f(y) = f_1(y_1) f_2(y_2) \cdots f_m(y_m), \tag{2.8}$$

or characteristic functions

$$\psi(t) = \psi_1(t_1) \psi_2(t_2) \cdots \psi_m(t_m), \tag{2.9}$$

where the characteristic function is defined by

$$\psi(t) = \int e^{\jmath t y}\, dF = \int e^{\jmath t y} f(y)\, dy, \tag{2.10}$$

where $\jmath$ is the imaginary unit. These definitions characterize independence but they do not directly tell how to measure dependencies. A natural way to do this is to construct a measure using, for instance, the difference between the joint characteristic function and the product of the marginal characteristic functions [122, 38] or, alternatively, the difference between the joint cdf and the product of the marginal cdfs, e.g. the Kolmogorov-Smirnov [47] test statistic

$$KS = \sup_x \left| F(x) - F_1(x_1) F_2(x_2) \cdots F_m(x_m) \right|. \tag{2.11}$$
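A rough empirical version of (2.11) for two variables can be sketched as follows. Replacing the cdfs by empirical cdfs and taking the supremum only over the observed sample points are simplifying assumptions of this illustration, and the function name is hypothetical.

```python
def ks_independence(samples):
    """Largest gap between the empirical joint cdf and the product of the
    empirical marginal cdfs, evaluated at the sample points (two variables)."""
    T = len(samples)
    worst = 0.0
    for (u, v) in samples:
        joint = sum(1 for (a, b) in samples if a <= u and b <= v) / T
        f1 = sum(1 for (a, _) in samples if a <= u) / T
        f2 = sum(1 for (_, b) in samples if b <= v) / T
        worst = max(worst, abs(joint - f1 * f2))
    return worst

# A full grid factorizes exactly; points on a line are completely dependent.
grid = [(i, j) for i in range(5) for j in range(5)]
line = [(i, i) for i in range(5)]
ks_grid = ks_independence(grid)
ks_line = ks_independence(line)
```

The statistic is essentially zero for the grid, whose empirical distribution is a product of its marginals, and clearly positive for the points on the line.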

A contrast function, or briefly a contrast, is one of the key terms in ICA. A contrast is a function to be minimized in order to separate the sources. Formally a contrast function is defined as [27]

Definition 1 A contrast is a mapping $\phi$ from the set of densities $\{f_y,\ y \in \mathbb{R}^m\}$ to $\mathbb{R}$ satisfying the following three requirements:

1. $\phi(f_{Py}) = \phi(f_y)$, $\forall P$ permutation,
2. $\phi(f_{\Lambda y}) = \phi(f_y)$, $\forall \Lambda$ diagonal invertible,
3. If $y$ has independent components, then $\phi(f_{Ay}) \geq \phi(f_y)$, $\forall A$ invertible.

According to Definition 1 a contrast is a function of densities. Under the assumption that the densities are correctly estimated, a contrast becomes a function of the current mixture $y$ or, equivalently, a function of the separating matrix $W$.

Two fundamental ICA contrasts, the maximum likelihood contrast and the mutual information contrast, are based on the Kullback-Leibler divergence. The Kullback-Leibler divergence between the random variables $y_1$ and $y_2$ is defined as

$$K(y_1 \| y_2) = \int f_1(y) \log \frac{f_1(y)}{f_2(y)}\, dy. \tag{2.12}$$

The maximum likelihood contrast can be defined as the Kullback-Leibler divergence between $y$ and $s$

$$\phi_{ML}(y) = K_{y\|s} \tag{2.13}$$

and the mutual information contrast can be defined as

$$\phi_{MI}(y) = K_{y\|\tilde{y}}, \tag{2.14}$$

where $\tilde{y}$ denotes the vector with independent entries with each entry distributed as the corresponding marginal of $y$. Now the connection between mutual information and likelihood can be written as

$$K_{y\|s} = K_{y\|\tilde{y}} + K_{\tilde{y}\|s}. \tag{2.15}$$

Mutual information is a sufficient statistic in ICA [17]. Likelihood is a sum of mutual information and a nuisance term that gives the marginal mismatch between the output and the assumed sources.
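To make the divergence (2.12) concrete, the sketch below evaluates it numerically for two univariate Gaussian densities and compares the result with the well-known closed form for Gaussians. The trapezoidal rule, the integration limits and the function names are choices made for this illustration only.

```python
import math

def log_gauss(y, mean, std):
    """Logarithm of the Gaussian density with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (y - mean) ** 2 / (2 * std ** 2)

def kl_numeric(logf1, logf2, lo, hi, n=20001):
    """Trapezoidal approximation of (2.12) on the interval [lo, hi]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        y = lo + k * h
        term = math.exp(logf1(y)) * (logf1(y) - logf2(y))
        total += term * (0.5 if k in (0, n - 1) else 1.0)
    return total * h

# K(N(0, 1) || N(1, 2^2)), numerically and in closed form.
kl = kl_numeric(lambda y: log_gauss(y, 0.0, 1.0),
                lambda y: log_gauss(y, 1.0, 2.0), -12.0, 12.0)
closed_form = math.log(2.0 / 1.0) + (1.0 ** 2 + (0.0 - 1.0) ** 2) / (2 * 2.0 ** 2) - 0.5
```

Working with log-densities avoids dividing by very small density values in the tails.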

2.5 Objective Functions and Estimating Functions

An estimator of a contrast function is often called an objective function, criterion function or cost function. In addition, the term contrast is sometimes used also for the estimator calculated from the data. It should be mentioned that the meaning of contrast in ICA differs from the meaning contrast has in statistics [103]. The ICA terminology may be confusing here but the basic idea is that we have a measure of independence and an estimator for it.

The derivative of an objective function may be called an estimating function. Estimating functions are sometimes also called separating functions or activation functions. Since the objective functions must be minimized numerically, the estimating functions have an essential role in practical ICA algorithms. Formally, the estimating function [3] can be defined as a matrix-valued function $\Psi(x; W)$ such that

$$E\{\Psi(x; W)\} = 0, \tag{2.16}$$

$\Psi(x; W) = I - \varphi(y) y^T$, where $\varphi(y) = [\varphi_1(y_1), \varphi_2(y_2), \ldots, \varphi_m(y_m)]^T$ is a vector of one-unit estimating functions. The term 'estimating function' is commonly used to refer to these one-unit estimating functions, as is done also in this thesis. This definition of the estimating function is related to projection pursuit [53, 65] and the deflation approach [33, 54], where one-unit objective functions are used to extract the sources one by one.

If the source distributions are known, the maximum likelihood principle leads to estimating functions that are the score functions of the sources [17]:

$$\varphi_y(y) = -\frac{d}{dy} \log f_y(y). \tag{2.17}$$

This is a fundamental result but it applies only when the source densities are positive everywhere. For example, if uniformly distributed sources are mixed we cannot use score functions in separation because the score functions are zero in a finite interval and undefined elsewhere.
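As a small illustration of (2.17), the score function can be approximated by differentiating a log-density numerically. The central-difference step and the two example densities below are choices made for this sketch; the sign convention (minus the derivative of the log-density) is the one under which the Gaussian score is $y$ and the score of the density $1/(\pi\cosh y)$ is $\tanh(y)$.

```python
import math

def numeric_score(log_density, y, h=1e-5):
    """Score (2.17): minus the derivative of the log-density, by central difference."""
    return -(log_density(y + h) - log_density(y - h)) / (2 * h)

def log_gaussian(y):
    """Log-density of the standard Gaussian; its score is y."""
    return -0.5 * y * y - 0.5 * math.log(2 * math.pi)

def log_sech(y):
    """Log of the density 1 / (pi * cosh(y)); its score is tanh(y)."""
    return -math.log(math.pi * math.cosh(y))

score_gauss = numeric_score(log_gaussian, 1.3)
score_sech = numeric_score(log_sech, 0.7)
```

For a uniform density the same function would fail outside the support, where the logarithm of a zero density is undefined, which mirrors the limitation noted above.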

In practice, the source distributions are not known. The maximum likelihood contrast can be employed with some pre-chosen densities for the sources. An equivalent approach is to choose directly a suitable nonlinear function as the estimating function. We use the notation where $\varphi_y$ refers to the true score function of random variable $y$, as defined in equation (2.17). The notation without the subindex, $\varphi$, refers to an estimating function, or to the estimated score function. This emphasizes the close relationship between score function modeling and nonlinearity selection. If the estimating function $\varphi$ is used, we observe that the following expression for the assumed source density is obtained:

$$f(y_i) = \frac{\exp\left(-\int \varphi(y_i)\, dy_i\right)}{\int_{-\infty}^{\infty} \exp\left(-\int \varphi(y_i)\, dy_i\right) dy_i}. \tag{2.18}$$

It should be noted that (2.18) is not always a valid density in the traditional sense. For some typical choices of the estimating function, the denominator in (2.18) tends to infinity. This can be avoided by making the working assumption that $y_i$ belongs to a finite interval and evaluating the integrals over this interval.
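The working assumption of a finite interval can be sketched numerically: given an estimating function, the assumed density (2.18) is evaluated on a grid. The grid size, the interval $[-20, 20]$ and the function name are assumptions of this illustration; with $\varphi = \tanh$ the result should be close to the density $1/(\pi \cosh y)$.

```python
import math

def assumed_density(phi, lo, hi, n=4001):
    """Evaluate (2.18) on a grid over [lo, hi] using trapezoidal integration."""
    h = (hi - lo) / (n - 1)
    ys = [lo + k * h for k in range(n)]
    # Cumulative integral of phi, started at the left end of the interval.
    integral = [0.0]
    for k in range(1, n):
        integral.append(integral[-1] + 0.5 * h * (phi(ys[k - 1]) + phi(ys[k])))
    shift = min(integral)   # a constant shift cancels in the ratio and avoids overflow
    unnorm = [math.exp(-(v - shift)) for v in integral]
    norm = h * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return ys, [u / norm for u in unnorm]

ys, f = assumed_density(math.tanh, -20.0, 20.0)
f_at_zero = f[len(f) // 2]                                   # grid midpoint is y = 0
area = (ys[1] - ys[0]) * (sum(f) - 0.5 * (f[0] + f[-1]))     # should integrate to one
```

The integration constant of the inner integral is arbitrary, which is why only the ratio in (2.18) is meaningful.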

In linear ICA accurate estimation of source distributions is not always crucial. However, better separation may be achieved if the source distributions are estimated. This becomes

Cumulants have been used as objective functions since the early days of blind separation [67, 27, 15, 33]. Cumulants are employed also in some recent works; see e.g. [86] and [30]. Cumulants $\kappa_1, \kappa_2, \kappa_3, \ldots$ are defined via the characteristic function (2.10) by the identity

$$\exp\left(\sum_q \kappa_q \frac{(\jmath t)^q}{q!}\right) = \psi(t). \tag{2.19}$$

Cumulants may be estimated from the sample moments of the same and lower orders. The sample estimates of the central moments are obtained as follows:

$$\bar{x} = \sum_{t=1}^{T} x(t) / T \tag{2.20}$$
$$\hat{\mu}_2 = \hat{\sigma}^2 = \sum_{t=1}^{T} (x(t) - \bar{x})^2 / T \tag{2.21}$$
$$\hat{\mu}_3 = \sum_{t=1}^{T} (x(t) - \bar{x})^3 / T \tag{2.22}$$
$$\hat{\mu}_4 = \sum_{t=1}^{T} (x(t) - \bar{x})^4 / T, \tag{2.23}$$

where $T$ is the number of observations. In this thesis, the notation $\mu_1, \mu_2, \mu_3, \ldots$ is used for both the theoretical moments and their estimators. The cumulant-based skewness and kurtosis may be defined as follows:

$$\delta_3(y_i) = \frac{\kappa_3(y_i)}{\kappa_2(y_i)^{3/2}} = E\left\{\left(\frac{y_i - \mu_{y_i}}{\sigma_{y_i}}\right)^3\right\} \tag{2.24}$$
$$\delta_4(y_i) = \frac{\kappa_4(y_i)}{\kappa_2(y_i)^{2}} = E\left\{\left(\frac{y_i - \mu_{y_i}}{\sigma_{y_i}}\right)^4\right\} - 3, \tag{2.25}$$

where $\mu_{y_i}$ and $\sigma_{y_i}$ are the expected value and the standard deviation of $y_i$, respectively. The separation can be based on the fact that for the Gaussian distribution the higher order cumulants are equal to zero. Perhaps the simplest technique to separate the sources is to maximize or minimize kurtosis.
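The sample estimators (2.20)–(2.25) translate directly into code. The sketch below is a plain-Python illustration with hypothetical function names; it uses the divisor $T$ as in the equations above rather than any bias-corrected variant.

```python
def central_moment(x, order):
    """Sample central moment of the given order with divisor T, as in (2.21)-(2.23)."""
    T = len(x)
    mean = sum(x) / T
    return sum((v - mean) ** order for v in x) / T

def skewness(x):
    """Cumulant-based skewness (2.24)."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def kurtosis(x):
    """Cumulant-based kurtosis (2.25); zero for a Gaussian distribution."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
asymmetric = [0.0, 0.0, 0.0, 4.0]
```

The symmetric sample has zero skewness and negative kurtosis (sub-Gaussian), whereas the asymmetric sample has clearly positive skewness.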

When sample variance and robustness to outliers (in the noisy ICA model) are of concern, bounded nonlinear functions may be more advisable than cumulants. However, the practical

Objective function                              Estimating function
kurtosis, $y_i^4$                               cubic, $\varphi(y_i) = y_i^3$
skewness, $y_i^3$                               $\varphi(y_i) = y_i^2$
$\log(\cosh(y_i))$                              hyperbolic tangent, $\varphi(y_i) = \tanh(y_i)$
Gaussian moments, e.g. $e^{-y_i^2/2}$           $\varphi(y_i) = -y_i e^{-y_i^2/2}$
3rd absolute moment, $|y_i|^3$                  $\varphi(y_i) = y_i |y_i|$

Table 2.1: Some typical one-unit objective functions and the corresponding estimating functions. The scaling constants are omitted.

These simple estimating functions are a good benchmark for more advanced methods: they are easy to implement and they successfully separate most typical sources.

The objective functions in Table 2.1, except the skewness, employ only even moments or symmetric properties of the source distributions. This means that there is an implicit assumption that the sources have a symmetric distribution. The explicit connection can be found using equation (2.18). In Publication VII adaptive methods for finding objective functions with the optimal weighting between the symmetric and asymmetric properties have been proposed and they will be considered in Section 4.4 of this thesis.
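The pairing in Table 2.1 can be checked numerically: each estimating function should equal the derivative of its objective function up to the omitted scaling constant. The dictionary layout, the scaling constants written out explicitly and the test points are assumptions of this sketch.

```python
import math

# name: (objective function, estimating function, scaling constant omitted in Table 2.1)
pairs = {
    "kurtosis": (lambda y: y ** 4, lambda y: y ** 3, 4.0),
    "skewness": (lambda y: y ** 3, lambda y: y ** 2, 3.0),
    "log-cosh": (lambda y: math.log(math.cosh(y)), math.tanh, 1.0),
    "Gaussian moment": (lambda y: math.exp(-y * y / 2),
                        lambda y: -y * math.exp(-y * y / 2), 1.0),
    "3rd absolute moment": (lambda y: abs(y) ** 3, lambda y: y * abs(y), 3.0),
}

def derivative(g, y, h=1e-6):
    """Central-difference approximation of g'(y)."""
    return (g(y + h) - g(y - h)) / (2 * h)

# Largest mismatch between the numerical derivative and scale * estimating function.
max_error = max(abs(derivative(obj, y) - scale * est(y))
                for obj, est, scale in pairs.values()
                for y in (-1.5, -0.3, 0.4, 2.0))
```

All rows except the skewness pair an even objective function with an odd estimating function, which is the symmetry assumption discussed above.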

2.6 Mutual Information and Source Adaptation

The ICA methods proposed in this thesis are based on direct minimization of mutual information. The direct minimization of mutual information leads to the adaptive estimation of the score functions of the mixtures, as shown in [97] and in Publication V. Starting from the mutual information contrast $\phi_{MI}(W)$ defined as a function of $W$, the following gradient (called the relative gradient in [97]) is obtained:

$$\phi'_{MI}(W) = \int \varphi_y(y) y^T f_x(x)\, dx - I. \tag{2.26}$$

Using the relation $y = Wx$, where $W$ is orthogonal, we can write (2.26) in the form

$$\phi'_{MI}(W) = \int \varphi_y(y) y^T f_y(y)\, dy - I. \tag{2.27}$$

If $y(t)$ is an ergodic random process, where the individual samples are distributed according to $f_y(y)$, we obtain the following estimator:

$$\hat{\phi}'_{MI}(W) = \frac{1}{T} \sum_{t=1}^{T} \hat{\varphi}_y(y(t))\, y(t)^T - I, \tag{2.28}$$

where $\hat{\varphi}_y$ is an estimator for the score function of $y$ and $T$ is the sample size. In the case of the mutual information contrast, the estimating function is the score function of $y$. Because the output $y$ changes on every iteration of the optimization algorithm, the optimal estimating functions also change in each iteration.

A procedure for parametric minimization of mutual information may be given as follows. After the choice of a model family and a suitable algorithm, such as the natural gradient algorithm (2.29) or the fixed-point algorithm (2.30), the following steps are repeated until convergence:

1) Appropriate sample statistics (e.g. moments) are computed from the current data $\mathbf{y}_k = \mathbf{W}_k\mathbf{x}$.

2) The parameters of the score function are estimated for each component using the sample statistics.

3) The score functions are utilized as estimating functions in the ICA algorithm performing the separation.
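The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the publications: the "model family" is a toy two-member family (tanh or cubic, chosen from the sign of the sample kurtosis), the update is a batch natural gradient step, and the names `sample_kurtosis`, `fit_score` and `adaptive_separation` are hypothetical.

```python
import numpy as np

def sample_kurtosis(y):
    """Step 1: a sample statistic (excess kurtosis) of one output component."""
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

def fit_score(y):
    """Step 2: pick the score model from the statistic (toy two-member family)."""
    if sample_kurtosis(y) > 0:
        return np.tanh              # used for super-Gaussian components
    return lambda u: u**3           # used for sub-Gaussian components

def adaptive_separation(x, n_iter=200, eta=0.05):
    """Repeat steps 1-3 for whitened data x of shape (m, T)."""
    m, T = x.shape
    W = np.eye(m)
    for _ in range(n_iter):
        y = W @ x                                        # current outputs y_k = W_k x
        phi = np.vstack([fit_score(yi)(yi) for yi in y])  # re-estimated scores
        # Step 3: batch natural-gradient update with the current estimating functions
        W = W + eta * (np.eye(m) - phi @ y.T / T) @ W
    return W
```

The point of the sketch is the structure: the estimating function is re-fitted inside the loop, because the distribution of the outputs changes whenever the separating matrix changes.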

2.7 Algorithms

Numerical methods are needed in order to optimize an ICA objective function. In general, the choice of the algorithm is independent of the choice of the objective function. Of course, there may be differences in the computational complexity. It is commonly assumed that the data is centered and whitened (zero mean, uncorrelated, unit variance) prior to the actual separation. After whitening, the separating matrix is (asymptotically) orthogonal and the number of parameters to be estimated is smaller. Prewhitening improves the convergence


2.7.1 Natural gradient algorithm

A basic principle of gradient-type optimization methods is to move in the direction of the (negative) gradient. In ICA, the gradient can be adjusted to correspond to the geometry of the problem. This leads to the natural gradient [2] or relative gradient [17] algorithm. The updating rule for the separating matrix is the following

$$\mathbf{W}_{k+1} = \mathbf{W}_k + \eta\left(\mathbf{I} - \varphi(\mathbf{y})\mathbf{y}^T\right)\mathbf{W}_k, \tag{2.29}$$

where $\varphi(\mathbf{y}) = [\varphi_1(y_1), \varphi_2(y_2), \ldots, \varphi_m(y_m)]^T$ is the vector of estimating functions and $\eta$ is the learning rate.
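A minimal sketch of one natural gradient step may make the update rule concrete. It assumes a batch version in which the product of the estimating functions and the outputs is replaced by its sample average; the function name `natural_gradient_step` and the default learning rate are illustrative choices, not taken from the thesis.

```python
import numpy as np

def natural_gradient_step(W, x, phi, eta=0.1):
    """One batch update W <- W + eta * (I - mean(phi(y) y^T)) W for data x of shape (m, T)."""
    y = W @ x
    T = x.shape[1]
    return W + eta * (np.eye(W.shape[0]) - phi(y) @ y.T / T) @ W
```

The update vanishes when the sample average of phi(y) y^T equals the identity matrix. For example, with exactly whitened data and the linear function phi(y) = y, the step leaves W unchanged, which illustrates why a linear estimating function carries no information beyond whitening.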

2.7.2 Fixed-point algorithm

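The fixed-point algorithm [57, 58] can be seen as a computationally more efficient version of the natural gradient algorithm. The update rule can be expressed as

$$\mathbf{W}_{k+1} = \mathbf{W}_k + \mathbf{D}\left(E\{\varphi(\mathbf{y})\mathbf{y}^T\} - \mathrm{diag}(E\{\varphi(y_i)y_i\})\right)\mathbf{W}_k, \tag{2.30}$$

where $\mathbf{D} = \mathrm{diag}\left(1/(E\{\varphi(y_i)y_i\} - E\{\varphi'(y_i)\})\right)$. After every iteration, the separating matrix is projected to the set of orthogonal matrices (in the case of prewhitened data) using symmetric orthogonalization $\mathbf{W}_{orth} = (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$. The algorithm is considered to have converged when $\|\mathbf{W}_{k+1} - \mathbf{W}_k\| < \varepsilon$ with e.g. $\varepsilon = 0.0001$.

2.7.3 Jacobi algorithms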

Jacobi-type algorithms are based on the theorem [27] stating that in the case of the linear ICA model, pairwise independence implies mutual independence. This leads to algorithms where pairwise cost functions are sequentially optimized. Such algorithms converge when all the pairs are optimized within the limits of some predetermined convergence criterion. The best-known Jacobi-type algorithm is probably Joint Approximate Diagonalization of


2.8 Characterization of Source Distributions

In many applications the nature of the source signals is known even if the exact source distributions are unknown. Commonly, distributions are divided into super- and sub-Gaussian distributions. A symmetric zero mean distribution $f(x)$ is super-Gaussian (respectively sub-Gaussian) if $\exists x_0 > 0 \mid \forall x \geq x_0,\ f(x) > f_G(x)$ ($f(x) < f_G(x)$ for sub-Gaussian), where $f_G(x)$ is the normalized Gaussian pdf. In the case of unimodal symmetric sources the sign of the kurtosis (2.25) depends on super- and sub-Gaussianity [83]. The concept of super- and sub-Gaussianity is not very informative in the case of asymmetric or multimodal distributions. Measures of both the skewness and the kurtosis are needed to describe asymmetric distributions. Multimodal distributions may be characterized by the locations of the modes.
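As a small illustration of this characterization, the sample skewness and the sample (excess) kurtosis can be computed as below. The sketch uses the plain moment estimators; the function name `skewness_kurtosis` is a hypothetical helper, and the sign convention (kurtosis zero for a Gaussian) is assumed to match the cumulant-based kurtosis referred to in the text.

```python
import numpy as np

def skewness_kurtosis(x):
    """Sample skewness and excess kurtosis from central sample moments."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()
    m2 = np.mean(c**2)
    skew = np.mean(c**3) / m2**1.5
    kurt = np.mean(c**4) / m2**2 - 3.0   # negative: sub-Gaussian, positive: super-Gaussian
    return skew, kurt
```

For a symmetric unimodal sample, the sign of `kurt` gives the sub- or super-Gaussian label; a clearly nonzero `skew` signals that this label alone is not an adequate description.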

[Figure panels, each plotted on the horizontal range from -4 to 4 with vertical axis ticks from 0.1 to 0.5:
(a) A super-Gaussian distribution (the GGD (equation (3.2)) with a = 1.4)
(b) A sub-Gaussian distribution (the GGD (equation (3.2)) with a = 3.5)
(c) An asymmetric distribution (centered Rayleigh(2) distribution)
(d) An asymmetric bimodal distribution (mixture of two Gaussian distributions)]

2.9 Discussion

In this chapter the ICA models and terminology were reviewed. Measures for independence, their estimators and optimization algorithms were considered. When the family of the possible source distributions is expanded from symmetric unimodal distributions to asymmetric and multimodal distributions, the need for source adaptation becomes obvious. The connection between source adaptation and minimization of mutual information is established. This suggests the adaptive estimation of the score functions of the mixtures. Methods


Review of source adaptive ICA methods

3.1 Overview

As discussed in Chapter 2, optimal separation requires that the source distributions are known. In practice, the source distributions are not known and need to be estimated reliably. In the pure maximum likelihood approach the prior knowledge on the sources is refined to a density model or an estimating function. In the adaptive maximum likelihood approach or mutual information approach, densities or score functions are iteratively estimated from the data. In this chapter, methods for modeling and estimating the source distributions in ICA are reviewed. The estimation methods may be divided into three classes:

- nonparametric methods, e.g. kernel estimation,
- parametric models for densities and score functions,
- models for estimating functions.

In this chapter, models and methods suitable for the source adaptive approach are reviewed. The chapter provides a background for the score adaptive models that are presented


3.2 Kernel estimation of densities

An overview of kernel estimation and related nonparametric techniques is given in [104]. A separation method with kernel estimates for the source densities is proposed in [97]. Kernel estimation of densities is also applied to the nonlinear ICA problem [110, 111]. The kernel density estimate [104] is defined by

$$\hat{f}_i(u) = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{h_T}K\left(\frac{u - y_i(t)}{h_T}\right), \tag{3.1}$$

where $K$ is the kernel function and $h_T$ is a bin-width parameter depending on the number of observations $T$. To guarantee that $\hat{f}_i(u)$ is a density, it suffices to take $K$ to be a density itself. The bin-width parameter affects the smoothness of the estimate. Pham [97] provides a detailed theoretical analysis on the use of kernel estimates in ICA.

Some computational problems need to be solved in order to apply kernel estimation. The integrals in the gradient of the mutual information contrast must be discretized by choosing the spacing for the estimation grid, i.e. the points $u$ where the estimator (3.1) is computed. The computation can be made faster using the Fast Fourier Transform (FFT) [104]. The kernel-based method is further developed in some recent papers [119, 12].
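A direct (non-FFT) evaluation of the kernel estimate on a grid can be sketched as follows. The sketch assumes a Gaussian kernel and a user-supplied bin-width `h`, and it defines the score as the negative derivative of the log-density; the function name `kde_score` is hypothetical, and the method of [97] involves considerably more than this.

```python
import numpy as np

def kde_score(u, y, h):
    """Gaussian-kernel density estimate at points u, and the score estimate -f'/f."""
    z = (np.asarray(u, dtype=float)[:, None] - np.asarray(y, dtype=float)[None, :]) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # kernel values, shape (len(u), T)
    f = k.mean(axis=1) / h                            # density estimate
    df = (-z * k).mean(axis=1) / h**2                 # derivative of the estimate
    return f, -df / f
```

The cost of this direct form grows with the product of the grid size and the sample size, which is the computational burden that the FFT-based evaluation mentioned above is meant to reduce.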

3.3 Parametric models

3.3.1 Distribution families

The main contributions of this thesis are in using parametric families of distributions for modeling the score functions. These methods are considered in Chapter 4 and in Publications I-VI. Different parametric families for ICA are also employed in [21, 14, 41]. The models used in these papers are the Generalized Gaussian Distribution (GGD) and the t-distribution. Both are families of symmetric distributions with shape depending on the parameters. The pdf of the GGD is defined as

$$f(y; a, \lambda_a) = \frac{a\lambda_a}{2\Gamma(1/a)}\exp\left(-|\lambda_a y|^a\right), \tag{3.2}$$

where $a$ is the parameter of the distribution, $\lambda_a$ a scaling factor and $\Gamma(x)$ is the Gamma function given by

$$\Gamma(x) = \int_0^\infty u^{x-1}\exp(-u)\,du. \tag{3.3}$$

The parameter $a$ controls the peakiness of the distribution. If $a = 2$, the distribution is reduced to the Gaussian distribution; if $a < 2$, the distribution is super-Gaussian; and if $a > 2$, the distribution is sub-Gaussian. Examples are presented in Figure 2.1. The parameter $\lambda_a$ is a scaling factor controlling the variance. The score function of the GGD is given by

$$\varphi(y_i) = a\lambda_a\,\mathrm{sign}(y_i)\,|\lambda_a y_i|^{a-1}. \tag{3.4}$$

The parameters of the GGD can be solved from the following moment equations

$$\kappa_4 = \frac{\Gamma(5/a)\,\Gamma(1/a)}{\Gamma(3/a)^2} - 3, \tag{3.5}$$

$$\lambda_a = \sqrt{\frac{\Gamma(3/a)}{\mu_2\,\Gamma(1/a)}}, \tag{3.6}$$

where $\kappa_4$ is the kurtosis and $\mu_2$ is the second order moment. In practice, to estimate the parameters, the sample kurtosis is calculated from the data and the values of the parameters $a$ and $\lambda_a$ are solved numerically.
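The numerical solution can be sketched as a one-dimensional root search. The sketch below assumes the kurtosis relation and the scale relation in their standard GGD form (kurtosis 3 for the Laplacian case a = 1 and 0 for the Gaussian case a = 2), uses SciPy's `brentq` and `gammaln`, and restricts the shape parameter to an illustrative interval; the function names and the bounds `a_min`, `a_max` are assumptions, not values from the thesis.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln

def ggd_kurtosis(a):
    """Theoretical excess kurtosis of the GGD with shape parameter a."""
    return np.exp(gammaln(5.0 / a) + gammaln(1.0 / a) - 2.0 * gammaln(3.0 / a)) - 3.0

def fit_ggd(y, a_min=0.3, a_max=20.0):
    """Moment estimates of the shape a and the scale lam from a sample y."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    m2 = np.mean(y**2)
    k4 = np.mean(y**4) / m2**2 - 3.0
    # keep the sample kurtosis inside the range reachable on [a_min, a_max]
    k4 = np.clip(k4, ggd_kurtosis(a_max) + 1e-9, ggd_kurtosis(a_min) - 1e-9)
    a = brentq(lambda s: ggd_kurtosis(s) - k4, a_min, a_max)
    lam = np.sqrt(np.exp(gammaln(3.0 / a) - gammaln(1.0 / a)) / m2)
    return a, lam

def ggd_score(y, a, lam):
    """Score function of the GGD for the fitted parameters."""
    return a * lam * np.sign(y) * np.abs(lam * y) ** (a - 1.0)
```

Because the theoretical kurtosis is a monotonically decreasing function of the shape parameter, the bracketing root search has a unique solution once the sample kurtosis is clipped to the admissible range.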

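Another model, the t-distribution, is familiar from the t-test [107, 106]. The pdf of the t-distribution with $b$ degrees of freedom and the scaling factor $\lambda_b$ is

$$f(y; b, \lambda_b) = \frac{\lambda_b\,\Gamma\left(\frac{b+1}{2}\right)}{\sqrt{\pi b}\;\Gamma\left(\frac{b}{2}\right)}\left(1 + \frac{\lambda_b^2 y^2}{b}\right)^{-\frac{1}{2}(b+1)}. \tag{3.7}$$

The score function of the t-distribution can be written as

$$\varphi(y_i) = \frac{(1+b)\,y_i}{y_i^2 + \frac{b}{\lambda_b^2}}. \tag{3.8}$$

$$\kappa_4 = \frac{3\,\Gamma\left(\frac{b-4}{2}\right)\Gamma\left(\frac{b}{2}\right)}{\Gamma\left(\frac{b-2}{2}\right)^2} - 3, \tag{3.9}$$

$$\lambda_b = \sqrt{\frac{b\,\Gamma\left(\frac{b-2}{2}\right)}{2\mu_2\,\Gamma\left(\frac{b}{2}\right)}}. \tag{3.10}$$

A simpler way to estimate the parameters of the t-distribution using the Pearson system is presented later in Section 4.2.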

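In [21], only the GGD is employed as a model for the sources. In [14] the choice between the GGD and the t-distribution is made based on the sample kurtosis

$$\varphi(y_i) = \begin{cases} a\lambda_a\,\mathrm{sign}(y_i)\,|\lambda_a y_i|^{a-1}, & \text{if } \hat{\kappa}_4(y_i) \leq 0, \\[4pt] \dfrac{(1+b)\,y_i}{y_i^2 + \frac{b}{\lambda_b^2}}, & \text{if } \hat{\kappa}_4(y_i) > 0. \end{cases} \tag{3.11}$$

3.3.2 Mixture of Densities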

The Mixture of Gaussians (MOG) model is employed as the model of source densities especially in the Bayesian approach [7, 117, 78]. The density model is the following

$$f(x) = \frac{\sum_j \omega_j \frac{1}{\sigma_j\sqrt{2\pi}}\exp\left(-(x-\mu_j)^2/2\sigma_j^2\right)}{\sum_j \omega_j}, \tag{3.12}$$

where $\mu_j$ and $\sigma_j^2$ are the mean and variance and $\omega_j$ is a weighting parameter. Mixtures of Gaussians can approximate virtually any continuous source distribution, but the number of required Gaussians depends on the source distribution. For instance, several Gaussians are needed to approximate a uniform density. The expectation-maximization (EM) algorithm [34] is often used in the learning of the MOG parameters. Due to the computational complexity of MOG-based ICA, the number of Gaussians is usually fixed to some small number. This may limit the performance in some cases even though the performance of the method is generally good.

Mixture of densities models are also proposed in [120, 43, 48, 81]. In [120] a mixture of Gaussian or logistic densities is proposed. In [43] a closely related method of adaptive activation function neurons is studied. In [48] and [81] MOG and hyperbolic-Cauchy


3.4 Adaptive nonlinearities

The methods presented in Sections 3.2 and 3.3 started from the estimation of densities. The methods presented in this section approach the problem from a different viewpoint. Instead of densities, estimating functions are worked out directly. As mentioned in Section 2.5, these two approaches are theoretically equivalent. In practice, adaptive nonlinearities may have some appealing computational properties, although ad hoc adaptation rules are often needed.

3.4.1 Polynomial expansions

Edgeworth and Gram-Charlier expansions [106] provide approximations for densities in the vicinity of a Gaussian density. The expansions can be used to obtain approximations for negentropy [60]

$$\mathrm{Neg} = \frac{1}{12}\kappa_3^2 + \frac{1}{48}\kappa_4^2, \tag{3.13}$$

or for the score function of a symmetric density

$$\varphi(s) = s - \frac{\kappa_4}{6}(s^3 - 3s), \tag{3.14}$$

where $\kappa_3$ and $\kappa_4$ are the third and fourth cumulant, respectively. Polynomial expansions are considered e.g. in [65, 27, 5, 121]. The approximation of entropy can also be based on functions other than polynomials, as proposed in [56]. For instance, the Gaussian density and its derivatives may be employed. These approximations are usually more exact and more robust than the approximations based on polynomials.

3.4.2 Basis functions

A quasi-maximum likelihood approach employing a set of arbitrary basis functions is proposed by Pham [98] (see [17] for a brief summary). The score function is approximated by a linear combination

$$\varphi(y_i) = \sum_{n=1}^{N}\omega_n\varphi_n(y_i) \tag{3.15}$$

of a fixed set $\{\varphi_1, \varphi_2, \ldots, \varphi_N\}$ of arbitrary basis functions. It turns out that the weighting parameters $\omega_1, \omega_2, \ldots, \omega_N$ can be solved without knowing the true score function. The mean square error between the true score function and its approximation is minimized when

$$\varphi(y_i) = \left(E\{\mathbf{R}'(y_i)\}\right)^T\left(E\{\mathbf{R}(y_i)\mathbf{R}(y_i)^T\}\right)^{-1}\mathbf{R}(y_i), \tag{3.16}$$

where $\mathbf{R}(y_i) = [\varphi_1(y_i), \varphi_2(y_i), \ldots, \varphi_N(y_i)]^T$ is the $N \times 1$ column vector of basis functions and $\mathbf{R}'(y_i)$ is the column vector of their derivatives. In practice, the expectations are replaced by sample averages.
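With sample averages in place of the expectations, solving for the weights is a small linear system. The following sketch assumes the weights are obtained as the solution of E{R R^T} w = E{R'}; the function name `fit_basis_score` and the way the basis functions are passed (two lists of callables) are illustrative choices.

```python
import numpy as np

def fit_basis_score(y, basis, dbasis):
    """Weights of the linear score approximation from sample averages."""
    R = np.vstack([f(y) for f in basis])        # N x T values of the basis functions
    dR = np.vstack([df(y) for df in dbasis])    # N x T values of their derivatives
    # solve  mean(R R^T) w = mean(R')  instead of forming an explicit inverse
    return np.linalg.solve(R @ R.T / y.size, dR.mean(axis=1))
```

For Gaussian data and the basis {y, y^3}, the weights should come out close to (1, 0), since the true score of a standard Gaussian is the linear function y.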

Algorithms where the nonlinearities are adaptively chosen on the basis of sub/super-Gaussianity are used e.g. in [35, 49, 80]. Typically, the nonlinearities are based on functions such as $\tanh(y)$ and $y^3$, and the sign of the nonlinearity is chosen adaptively.

3.4.3 Threshold functions and quantizers

Very simple algorithms can be constructed using adaptive threshold functions. A threshold activation function [84] is defined as

$$\varphi(y_i) = \begin{cases} 0, & |y_i| < b_i, \\ a_i\,\mathrm{sign}(y_i), & |y_i| \geq b_i, \end{cases} \tag{3.17}$$

where $a_i$ and $b_i$ are data dependent parameters. The threshold $b_i$ may be chosen so that the local stability is maximized. However, this maximization requires knowledge of the source distribution. As a practical solution, the authors in [84] propose the following updating rules

$$a_i(t+1) = a_i(t) - \eta_a\left(1 - \hat{\sigma}^2(y_i; t)\right), \tag{3.18}$$

$$b_i(t+1) = b_i(t) - \eta_b\,\hat{\delta}_4(y_i; t), \tag{3.19}$$

where $\hat{\sigma}^2(y_i; t)$ is the sample variance of $y_i$ after $t$ observations, $\hat{\delta}_4(y_i; t)$ is the sample kurtosis and $\eta_a$ and $\eta_b$ are the learning rates. Additionally, the values of $b_i$ are forced to the interval $[0, 1.5]$.

The simple threshold function can be generalized by introducing more thresholds and levels.

of quantizers and threshold functions is that they can be easily implemented in digital signal processing.

3.5 Discussion

The presented estimation methods illustrate the trade-off between generality and simplicity. Nonparametric estimation is apparently the most flexible concept. However, a particular implementation with a fixed kernel is already a more restricted model. The critical part of kernel estimation is the choice of the kernel function and the bin-width parameter. There exist opposing opinions on the complexity and the computational cost of kernel estimation in ICA [97, 60]. The speed requirement depends of course on the particular application, but it seems that kernel estimation is a relatively complex method when compared to other methods.

The flexibility of parametric estimation depends on the chosen distribution family. Problems may occur if the chosen distribution family cannot model the essential features of the actual distribution. On the other hand, if an appropriate parametric model is used, the methods work efficiently.

The advantage of the adaptive nonlinearities is that they are computationally simple and easy to implement. The performance depends on the source distributions. Successful separation is expected if the nonlinearities can react to the essential features of the distributions.


Adaptive Score Models

4.1 Overview

In this chapter we introduce methods for estimating score functions adaptively. The parametric models employed are the Pearson system and the Generalized Lambda Distribution. Additionally, adaptive estimating functions using iterative weighting are presented. The guidelines used for choosing an appropriate parametric model are

1. The model should adapt to asymmetric or multimodal sources, but the performance should not degrade in the case of unimodal symmetric source distributions.

2. The parameters of the model should be easy to estimate from the data.

3. The functional form of the score function should be easy to compute and robust against outliers.

Asymmetric and multimodal source distributions are considered because blindness means that we cannot restrict ourselves to symmetric sources. Asymmetric and multimodal source distributions also occur in the key application areas, such as telecommunications and biomedical signal processing. The requirement of easy parameter estimation is natural from the point of view of computational efficiency and simplicity of the concept. A suitable functional form of the

4.2 Pearson System

The Pearson system is a four-parameter family of distributions defined by the differential equation

$$f'(x) = \frac{(x - a)f(x)}{b_0 + b_1 x + b_2 x^2}, \tag{4.1}$$

where $f(x)$ is a density function and $a$, $b_0$, $b_1$ and $b_2$ are the parameters of the distribution. The Pearson system has been extensively studied in statistics. Overviews are given in [90] and in [106]. The distribution family is named after Karl Pearson [94, 95]. The estimation of the Pearson parameters is considered e.g. in [105, 13, 25, 26, 51, 87, 74, 96]. Some related distributions are presented in [25, 88, 64, 89, 112].

An alternative parameterization is

$$f'(x) = \frac{(a_1 x - a_0)f(x)}{b_0 + b_1 x + b_2 x^2}, \tag{4.2}$$

where $a_0$, $a_1$, $b_0$, $b_1$ and $b_2$ are the parameters of the distribution. Both parameterizations (4.1) and (4.2) characterize the same distributions, but the expression (4.2) has the advantage that $a_1$ can be zero and the values of the parameters are bounded when the fourth cumulant exists. Thus, we use the parameterization (4.2). The score function of the Pearson system is easily solved from (4.2)

$$\varphi(x) = -\frac{f'(x)}{f(x)} = -\frac{a_1 x - a_0}{b_0 + b_1 x + b_2 x^2}. \tag{4.3}$$

The derivative of the score function is

$$\varphi'(x) = -\frac{a_1 b_0 + a_0 b_1 + 2a_0 b_2 x - a_1 b_2 x^2}{(b_0 + b_1 x + b_2 x^2)^2}. \tag{4.4}$$

Several well-known distributions belong to the Pearson family. For instance, for the Gaussian distribution with mean $\mu$ and variance $\sigma^2$ the values of the parameters are $a_0 = 12\mu(\sigma^2)^3$, $a_1 = 12(\sigma^2)^3$, $b_0 = -12(\sigma^2)^4$, $b_1 = 0$ and $b_2 = 0$. Also the Gamma, Beta and Student's t-distribution belong to the Pearson family. This is illustrated in Figure 4.1.

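[Figure 4.1: plot in the ($\alpha_3^2$, $\alpha_4$)-plane (horizontal axis $\alpha_3^2$ from 0.5 to 2, vertical axis $\alpha_4$ from 1 to 7) with regions labeled I(M), I(J,U), II, III, IV, V, VI and "Impossible for all distributions".]

Figure 4.1: An illustration of the Pearson system in the ($\alpha_3^2$, $\alpha_4$)-plane. The limit for all distributions is the line $\alpha_4 = \alpha_3^2 + 1$. The Latin numbers refer to the traditional classification of Pearson distributions. Types I and II are beta distributions of the first kind. The notation I(J,U) refers to J- and U-shaped distributions and I(M) to unimodal distributions. The boundary between I(J,U) and I(M) is the curve $4(4\alpha_4 - 3\alpha_3^2)(5\alpha_4 - 6\alpha_3^2 - 9)^2 = \alpha_3^2(\alpha_4 + 3)^2(8\alpha_4 - 9\alpha_3^2 - 12)$. Type III is the Gamma distribution, for which $\alpha_4 = \frac{3}{2}\alpha_3^2 + 3$. Type VI is the beta distribution of the second kind. Type V is characterized by the curve $\alpha_3^2(\alpha_4 + 3)^2 = 4(4\alpha_4 - 3\alpha_3^2)(2\alpha_4 - 3\alpha_3^2 - 6)$. Type IV is the case where the equation $b_0 + b_1 x + b_2 x^2 = 0$ has complex roots. Type VII is the Student's t-distribution.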

problem this classification is more useful than the traditional classification (types I-VII) [106]. The classification is presented and discussed in Publication V.

The Pearson system based blind separation algorithm, Pearson-ICA [71], was originally proposed in Publication I and further improved in Publication V. The implementation is based on the FastICA algorithm [55].

4.2.1 Estimation of the Pearson system parameters

The parameters of the Pearson system can be estimated using the method of moments [106]. The moment equations are derived directly from the definition (4.2)

$$x^n(b_0 + b_1 x + b_2 x^2)f'(x) = x^n(a_1 x - a_0)f(x). \tag{4.5}$$

When the left side is integrated by parts, (4.5) leads to a recursion formula

$$-n b_0\mu_{n-1} - (n+1)b_1\mu_n - (n+2)b_2\mu_{n+1} = a_1\mu_{n+1} - a_0\mu_n, \tag{4.6}$$

where $\mu_n$ is the $n$th theoretical central moment. When this recursion formula is successively applied for values $n = 0, 1, 2, 3$, the following relationship between the parameters $a_0$, $a_1$, $b_0$, $b_1$ and $b_2$ and the theoretical central moments $\mu_{-1} \equiv 0$, $\mu_0 \equiv 1$, $\mu_1 = 0$, $\mu_2$, $\mu_3$ and $\mu_4$ arises

$$a_1 = |10\mu_4\mu_2 - 12\mu_3^2 - 18\mu_2^3| \tag{4.7}$$

$$a_0 = b_1 = -\mu_3(\mu_4 + 3\mu_2^2) \tag{4.8}$$

$$b_0 = -\mu_2(4\mu_2\mu_4 - 3\mu_3^2) \tag{4.9}$$

$$b_2 = -(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3). \tag{4.10}$$

When the theoretical central moments are replaced by the sample moments, the moment estimators for the parameters $a_0$, $a_1$, $b_0$, $b_1$ and $b_2$ are obtained. The number of the parameters actually reduces to three because $b_1 = a_0$ and $a_1$ is a scaling term.

If the approximated density is symmetric (i.e. $\mu_3 = 0$), the estimated score reduces to

$$\varphi(x) = \frac{(5\mu_4 - 9\mu_2^2)\,x}{2\mu_2\mu_4 + (\mu_4 - 3\mu_2^2)\,x^2}. \tag{4.11}$$

It can be easily checked that when $\alpha_4 > 3$ this corresponds to the t-distribution defined in (3.8), (3.9) and (3.10).
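The moment relations translate almost directly into code. The sketch below takes central moments as input and returns the parameters and the score; it assumes the sign conventions of the parameterization in which the score is the negative of f'/f, and it keeps the absolute value in the expression for a_1 as printed. The function names are hypothetical, and the sketch does not include the handling of the distribution's domain discussed in the text.

```python
import numpy as np

def pearson_params(m2, m3, m4):
    """Moment estimators of (a0, a1, b0, b1, b2) from central moments m2, m3, m4."""
    a1 = abs(10 * m4 * m2 - 12 * m3**2 - 18 * m2**3)
    a0 = b1 = -m3 * (m4 + 3 * m2**2)
    b0 = -m2 * (4 * m2 * m4 - 3 * m3**2)
    b2 = -(2 * m2 * m4 - 3 * m3**2 - 6 * m2**3)
    return a0, a1, b0, b1, b2

def pearson_score(x, a0, a1, b0, b1, b2):
    """Score function -(a1 x - a0) / (b0 + b1 x + b2 x^2)."""
    return -(a1 * x - a0) / (b0 + b1 * x + b2 * x**2)

def central_moments(y):
    """Sample central moments of orders 2, 3 and 4."""
    c = np.asarray(y, dtype=float) - np.mean(y)
    return np.mean(c**2), np.mean(c**3), np.mean(c**4)
```

Two quick sanity checks are possible with exact moments: the Gaussian moments (1, 0, 3) give the linear score x, and the moments of a t-distribution with b degrees of freedom give (1 + b) x / (x^2 + b).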

The type of the distribution, (i), (ii) or (iii), must be recognized after the model is estimated. For types (i) and (ii) it is possible that the estimated density is not exactly correct and thus some observations lie outside the domain. In the ICA problem we are only interested in finding the score function, which makes it easy to solve this problem heuristically. First, the sample minimum and maximum can be utilized in the estimation. Alternatively, saturated score functions (the values of the score function are bounded between suitable

4.2.2 Extensions of the Pearson system

The estimation of the Pearson system parameters can be based on sample statistics other than the first four moments. For instance, in [87] the parameter estimation is based on the mean, the variance, the skewness and the left (or right) boundary.

The differential equation defining the Pearson system may also be generalized. A natural generalization is

$$\frac{f'(x)}{f(x)} = \frac{a(x)}{b(x)}, \tag{4.12}$$

where $a(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_p x^p$ and $b(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_q x^q$ are some polynomials of $x$. Some generalizations of this kind are considered in [25] and briefly discussed in Publication V.

In Publication V we propose a multimodal generalization of the Pearson system defined as follows

$$\frac{f'(x)}{f(x)} = \frac{a_3 x^3 + a_2 x^2 + a_1 x + a_0}{x^4 + 1}, \tag{4.13}$$

where $a_0$, $a_1$, $a_2$ and $a_3$ are the parameters of the system. The third order polynomial in the numerator enables modeling bimodal distributions. The fourth order polynomial in the denominator makes sure that the score function behaves robustly when outliers are encountered by bounding their influence. Since the denominator is always positive, the score function does not have points of discontinuity.

The method of moments can be used to estimate the parameters of (4.13). This leads to the use of the fifth and the sixth order sample moments, which are very sensitive to outliers. Fortunately, some simple heuristic solutions exist for stabilizing the estimates of the fifth and the sixth moments. One can simply set maximum values for the higher order moments

[Figure 4.2: plot with horizontal axis $\alpha_3^2$ (ticks 1 to 5) and vertical axis $\alpha_4$ (ticks 5 to 20), showing the GBD area, the GLD area and the impossible region, with the points or curves of the Uniform, Normal, Exponential, Gamma, Student's t, Lognormal, Weibull and Rayleigh distributions marked.]

Figure 4.2: Characterization of some standardized distributions by their third and fourth moments. The EGLD family covers the area above the shaded region, which is not valid for any distribution. The skewness and the kurtosis of many distributions occurring in engineering applications are pointed out.

4.3 Extended Generalized Lambda Distribution

The Extended Generalized Lambda Distribution (EGLD) is a large family of distributions covering the whole space of the third and the fourth moment. The lambda distribution was presented by Tukey [115] in 1960. The concept was generalized in the 1970s [100, 101, 99]. Its main use has been in fitting a distribution to empirical data, and in the computer generation of different distributions. The latest extension of the family by Karian and Dudewicz in 1996 [70] is a combination of the Generalized Lambda Distribution (GLD) and the Generalized Beta Distribution (GBD). The space of ($\alpha_3$, $\alpha_4$) values that is covered by the EGLD distribution family includes the values for all the most important distributions, including the normal, uniform, gamma and beta distributions, as illustrated in Figure 4.2.

The Generalized Lambda Distribution is defined by the inverse distribution function

$$F^{-1}(p) = \lambda_1 + \frac{p^{\lambda_3} - (1-p)^{\lambda_4}}{\lambda_2}, \tag{4.14}$$

where $0 \leq p \leq 1$ and $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the parameters of the distribution. Karian and Dudewicz [70] showed that the GLD is a valid distribution if and only if

$$\frac{\lambda_2}{\lambda_3 p^{\lambda_3 - 1} + \lambda_4(1-p)^{\lambda_4 - 1}} \geq 0. \tag{4.15}$$

The alternative Freimer-Mudholkar-Kollia-Lin (FMKL) parameterization [44] is given by

$$F^{-1}(p) = \lambda_1 + \left(\frac{p^{\lambda_3} - 1}{\lambda_3} - \frac{(1-p)^{\lambda_4} - 1}{\lambda_4}\right)\Big/\lambda_2. \tag{4.16}$$

The FMKL parameterization seems to have some advantages over the parameterization in equation (4.14), but so far it has not been used for fitting the distribution to data.
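Because the GLD is defined through its inverse distribution function, evaluating quantiles and generating random numbers is straightforward. The following sketch assumes the parameterization of equation (4.14); the function names and the example parameter values are illustrative only.

```python
import numpy as np

def gld_quantile(p, lam1, lam2, lam3, lam4):
    """Inverse distribution function of the GLD, parameterization of (4.14)."""
    p = np.asarray(p, dtype=float)
    return lam1 + (p**lam3 - (1.0 - p)**lam4) / lam2

def gld_sample(n, lams, rng):
    """Draw n GLD variates by transforming uniform(0, 1) random numbers."""
    return gld_quantile(rng.uniform(size=n), *lams)
```

For example, `gld_sample(1000, (0.0, 1.0, 0.5, 2.0), np.random.default_rng(0))` returns values in the interval from -1 to 1, the support implied by these particular (arbitrarily chosen) parameters.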

The EGLD based blind separation algorithm, EGLD-ICA [39], was originally proposed in Publication II. The L-moment based estimation was proposed in Publication VI. The implementation is similar to Pearson-ICA except for the score function calculation and parameter estimation.

4.3.1 Parameter estimation via sample moments

Estimation of the GLD parameters using the method of moments is proposed in [70]. The relationship between the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ and the moments $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ is established by four nonlinear equations [70] that can be solved numerically. However, due to the intricacy of the computational process, the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are tabulated in [69, 39] as functions of $\alpha_3$ and $\alpha_4$ for standardized data where $\alpha_1 = 0$ and $\alpha_2 = 1$. When the EGLD is fitted to the data, the choice between the GLD and the GBD is made based on the values of the kurtosis and the skewness, as explained in Publication II.

4.3.2 Parameter estimation via L-moments

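Other statistics can be utilized in the estimation of the parameters instead of the sample moments. Well-known drawbacks of the higher order sample moments are the high variance of the estimators and the lack of robustness. The concept of L-moments [52] can be seen as a

first four theoretical L-moments are defined as

$$L_1 = \int_0^1 F^{-1}(p)\,dp \tag{4.17}$$

$$L_2 = \int_0^1 F^{-1}(p)(2p - 1)\,dp \tag{4.18}$$

$$L_3 = \int_0^1 F^{-1}(p)(6p^2 - 6p + 1)\,dp \tag{4.19}$$

$$L_4 = \int_0^1 F^{-1}(p)(20p^3 - 30p^2 + 12p - 1)\,dp. \tag{4.20}$$

The L-moments exist if and only if the distribution has a finite mean. Furthermore, a distribution with a finite mean is characterized by its L-moments [52]. Analogously to the conventional moments, $L_1$ measures the location, $L_2$ measures the scaling, $L_3$ measures the skewness and $L_4$ measures the kurtosis. Scaling invariant measures are obtained by using L-moment ratios defined as

$$\tau_r = L_r / L_2, \quad r = 3, 4, \ldots \tag{4.21}$$

Unlike the conventional moments, the L-moments of the GLD may be expressed in a closed form

$$L_1 = \lambda_1 - \frac{1}{\lambda_2}\left(\frac{1}{1+\lambda_4} - \frac{1}{1+\lambda_3}\right) \tag{4.22}$$

$$L_2\lambda_2 = -\frac{1}{1+\lambda_3} + \frac{2}{2+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{2}{2+\lambda_4} \tag{4.23}$$

$$L_3\lambda_2 = \frac{1}{1+\lambda_3} - \frac{6}{2+\lambda_3} + \frac{6}{3+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{6}{2+\lambda_4} - \frac{6}{3+\lambda_4} \tag{4.24}$$

$$L_4\lambda_2 = -\frac{1}{1+\lambda_3} + \frac{12}{2+\lambda_3} - \frac{30}{3+\lambda_3} + \frac{20}{4+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{12}{2+\lambda_4} - \frac{30}{3+\lambda_4} + \frac{20}{4+\lambda_4}. \tag{4.25}$$

The details for the parameter estimation are presented in Publication VI.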
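A sketch of both sides of the estimation problem is given below: sample L-moments computed from order statistics, and the closed-form GLD L-moments for the inverse distribution function $\lambda_1 + (p^{\lambda_3} - (1-p)^{\lambda_4})/\lambda_2$. The sample estimator is the standard unbiased probability-weighted-moment form, which is an assumption here (the estimator used in Publication VI may differ), and the function names are hypothetical. Matching the two sets of values over the parameters would then be a separate numerical step that is not shown.

```python
import numpy as np

def sample_l_moments(x):
    """First four sample L-moments from order statistics (unbiased PWM form)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    return b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0, 20 * b3 - 30 * b2 + 12 * b1 - b0

def gld_l_moments(lam1, lam2, lam3, lam4):
    """Closed-form L-moments of the GLD."""
    def s(c, lam):
        # sum over k = 1..len(c) of c_k / (k + lam)
        return sum(ck / (k + lam) for k, ck in enumerate(c, start=1))
    L1 = lam1 - (1.0 / (1.0 + lam4) - 1.0 / (1.0 + lam3)) / lam2
    L2 = (s([-1, 2], lam3) + s([-1, 2], lam4)) / lam2
    L3 = (s([1, -6, 6], lam3) - s([1, -6, 6], lam4)) / lam2
    L4 = (s([-1, 12, -30, 20], lam3) + s([-1, 12, -30, 20], lam4)) / lam2
    return L1, L2, L3, L4
```

As a check of the closed form, the parameters (0, 1, 1, 1) give the uniform distribution on the interval from -1 to 1, for which the second L-moment is 1/3 and the third and fourth vanish.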

Since the L-moments are linear combinations of order statistics, the variances of the sample L-moments are usually smaller than the variances of the conventional sample

size is small. Additionally, the L-moments are more robust against outliers.

4.3.3 Other estimation techniques

In addition to the method of moments and the method of L-moments, some other techniques have recently been proposed for the estimation of the GLD parameters. Karian and Dudewicz [37] proposed the use of percentiles. The percentiles have similar desirable properties as the L-moments, but the difference is that in the percentile method only certain order statistics are used, whereas in the method of L-moments all order statistics are employed. This suggests that the L-moments based estimators are more efficient than the percentile based estimators.

Purely computational methods, such as the least squares fit (Öztürk and Dale method) [92] and the starship method [75, 91], are also applicable. The starship method has the following three steps [75]:

1. For a set of data and a range of $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ values, apply the reverse transformation, i.e. a data value $x$ is transformed to $F(x)$. (Note that as $F$ does not exist in closed form for the GLD, numerical methods are needed.)

2. Calculate the value of a suitable goodness-of-fit measure for the closeness of the resulting values to the uniform(0,1) distribution.

3. Choose the $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ values that minimize the chosen goodness-of-fit measure to the uniform, as the fitted values.

According to the simulation results in [75], the Öztürk and Dale method and the starship method give good estimates. The computational cost, however, is higher than in the method of moments or in the method of L-moments.

4.4 Adaptive Estimating Functions

The adaptive estimating functions proposed in Publication VII can be presented as a weighted sum of two estimating functions

$$\varphi(s_i) = \omega_1\varphi_1(s_i) + \omega_2\varphi_2(s_i), \tag{4.26}$$

where $\varphi_1(s_i)$ and $\varphi_2(s_i)$ are two fixed estimating functions and $\omega_1$ and $\omega_2$ are the weighting parameters. The corresponding objective function may be presented as

$$\rho(y_i; \omega_1, \omega_2) = \omega_1|\rho_1(y_i)| + \omega_2|\rho_2(y_i)|. \tag{4.27}$$

The idea is to iteratively update the weighting parameters in an optimal manner. The optimal weighting is solved by maximizing an efficacy measure based on the performance analysis [17, 18, 73, 16] of contrast functions. It is usually assumed in the analysis that all the sources are identically distributed. Local stability is found to depend on the following nonlinear moments

$$\vartheta_i = E\{\varphi'(s_i)\} - E\{s_i\varphi(s_i)\} \tag{4.28}$$

and the variance of the separation solution is found to depend on

$$\gamma_i = E\{\varphi(s_i)^2\} - E\{s_i\varphi(s_i)\}^2. \tag{4.29}$$

In [73] it is proposed that the following measure can be used as a performance criterion

$$\epsilon = \frac{\vartheta_i^2}{\gamma_i}. \tag{4.30}$$

This measure is called BSS efficacy and it is independent of the scaling of the estimating function $\varphi$. The BSS efficacy gives us an analytical way to compare contrast functions. The solution maximizing the BSS efficacy is given in [72] and Publication VII.
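The BSS efficacy of a candidate estimating function can be approximated from a sample of the source by replacing the expectations with sample averages. The sketch below assumes the differences in (4.28) and (4.29) as written there, namely E{phi'(s)} - E{s phi(s)} and E{phi(s)^2} - E{s phi(s)}^2; the function name `bss_efficacy` and the requirement to pass the derivative explicitly are illustrative choices.

```python
import numpy as np

def bss_efficacy(s, phi, dphi):
    """Sample BSS efficacy of the estimating function phi (derivative dphi) for source sample s."""
    s_phi = np.mean(s * phi(s))
    theta = np.mean(dphi(s)) - s_phi           # stability-related moment
    gamma = np.mean(phi(s) ** 2) - s_phi**2    # variance-related moment
    return theta**2 / gamma
```

Scaling the estimating function by a constant multiplies the numerator and the denominator by the same factor, so the returned value does not change, in line with the scaling independence noted above. For a unit-variance uniform source and the cubic function, the value is 7/3.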

4.4.1 Estimating functions based on cumulants and absolute moments

The simplest choice for the symmetric and the asymmetric objective function is to use the cumulant based kurtosis (2.25) and skewness (2.24). In Publication VII the cumulant

performance in practice. The absolute moment [106] of the order $q$ is defined by

$$\beta_q(y_i) = E\{|y_i - \mu|^q\}, \tag{4.31}$$

where $\mu$ is the expected value of the distribution. The even absolute moments are equal to the conventional central moments of the same order, but the odd absolute moments cannot be directly written in terms of the central moments. In addition, we may define the skewed absolute moments by

$$\nu_q(y_i) = E\{(y_i - \mu)|y_i - \mu|^{q-1}\} = E\{\mathrm{sign}(y_i - \mu)|y_i - \mu|^q\}. \tag{4.32}$$

Analogously to the absolute moments, the odd skewed absolute moments are equal to the conventional central moments of the same order, but the even skewed absolute moments cannot be directly written in terms of the central moments.

The kurtosis of a distribution with unit variance can be measured by the third absolute moment

$$\beta_3(y_i) = E\{|y_i|^3\}. \tag{4.33}$$

As a measure for skewness we can use the second skewed absolute moment

$$\nu_2(y_i) = E\{|y_i|(y_i)\}. \tag{4.34}$$

Exploiting $\beta_3$ and $\nu_2$ we may construct an ICA objective function. First, we find that for a Gaussian random variable $y_i$ with $\mu = 0$ and $\sigma^2 = 1$

$$\beta_3(y_i) = \int_{-\infty}^{\infty}|y_i|^3\frac{1}{\sqrt{2\pi}}e^{-y_i^2/2}\,dy_i = 2\sqrt{\frac{2}{\pi}} \approx 1.59577 \tag{4.35}$$

and $\nu_2(y_i) = 0$. Furthermore, we define measures resembling the cumulant based kurtosis and skewness

$$\delta_3(y_i) = \beta_3\left(\frac{y_i - \mu}{\sigma}\right) - 2\sqrt{\frac{2}{\pi}} \tag{4.36}$$

$$\delta_2(y_i) = \nu_2\left(\frac{y_i - \mu}{\sigma}\right). \tag{4.37}$$

Based on these measures the following objective function is proposed in Publication VII

$$\rho_\beta(y_i) = \omega_{\beta,1}|\delta_3(y_i)| + \omega_{\beta,2}|\delta_2(y_i)|. \tag{4.38}$$

The expressions for the optimal weighting parameters $\omega_{\beta,1}$ and $\omega_{\beta,2}$ and other details are provided in Publication VII.

4.4.2 Gaussian moments based estimating functions

The cumulant based approach can be generalized to other suitable nonlinearities [72]. The basic idea is that the objective function is a sum of the absolute values of symmetric and asymmetric functions. Theoretical results for arbitrary nonlinearities are difficult to obtain and thus the validity of the objective functions must be checked in simulations. We propose using the Gaussian moments as symmetric and asymmetric objective functions. The Gaussian moments of order zero to three are defined by

$$G_0(y_i; b) = e^{-by_i^2/2} - \frac{1}{\sqrt{b+1}} \tag{4.39}$$

$$G_1(y_i; b) = -by_i e^{-by_i^2/2} \tag{4.40}$$

$$G_2(y_i; b) = (b^2y_i^2 - b)e^{-by_i^2/2} \tag{4.41}$$

$$G_3(y_i; b) = (3b^2y_i - b^3y_i^3)e^{-by_i^2/2}, \tag{4.42}$$

where $b$ is a positive constant. The Gaussian moments form the basis of the Gram-Charlier and Edgeworth series [106]. Usually (4.39) is given in the form

$$G_0(y_i; b) = e^{-by_i^2/2}. \tag{4.43}$$

The rationale behind the constant $\frac{1}{\sqrt{b+1}}$ becomes obvious when we consider the expected value of $G_0(y_i)$ in the case where the distribution of $y_i$ is Gaussian ($\mu = 0$, $\sigma^2 = 1$)

$$E\{G_0(y_i)\} = \int_{-\infty}^{\infty}e^{-by_i^2/2}\frac{1}{\sqrt{2\pi}}e^{-y_i^2/2}\,dy_i - \frac{1}{\sqrt{b+1}} = 0. \tag{4.44}$$
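The definitions can be checked numerically. The sketch below writes each Gaussian moment as the derivative of the previous one (which fixes the signs and the powers of $b$) and approximates the Gaussian expectation by a Riemann sum on a wide grid; the helper name `gaussian_expectation` and the grid settings are assumptions made for illustration.

```python
import numpy as np

def G0(y, b=1.0):
    return np.exp(-b * y**2 / 2) - 1.0 / np.sqrt(b + 1.0)

def G1(y, b=1.0):
    return -b * y * np.exp(-b * y**2 / 2)                       # derivative of G0

def G2(y, b=1.0):
    return (b**2 * y**2 - b) * np.exp(-b * y**2 / 2)            # derivative of G1

def G3(y, b=1.0):
    return (3 * b**2 * y - b**3 * y**3) * np.exp(-b * y**2 / 2)  # derivative of G2

def gaussian_expectation(g, n=200001, lim=10.0):
    """E{g(y)} for y ~ N(0, 1), approximated by a Riemann sum on [-lim, lim]."""
    y = np.linspace(-lim, lim, n)
    pdf = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(g(y) * pdf) * (y[1] - y[0])
```

With these definitions, `gaussian_expectation(G0)` is zero up to numerical error for any positive `b`, which is the property that the constant subtracted in the zeroth-order moment is designed to give.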

In addition, we notice that the expected value of the asymmetric part equals zero, E{G_1(y_i

The nonlinearity in expression (4.43) may be employed as a robust ICA objective function as proposed in [57, 56]. We propose the use of $G_0$ and $G_1$ as measures of the kurtosis and the skewness in the ICA framework. We now define theoretical measures for the kurtosis and the skewness as follows

$$\delta_0(y_i; b) = E\left\{G_0\left(\frac{y_i - \mu}{\sigma}; b\right)\right\} \tag{4.45}$$

$$\delta_1(y_i; b) = E\left\{G_1\left(\frac{y_i - \mu}{\sigma}; b\right)\right\}, \tag{4.46}$$

where $\mu$ is the expected value of $y_i$ and $\sigma$ is the standard deviation. The measures $\delta_0$ and $\delta_1$ are analogous to $\kappa_4$ and $\kappa_3$ in the sense that they are zero for the Gaussian distribution and in general at least one of them is nonzero for other distributions. However, $\delta_0$ and $\delta_1$ do not measure the kurtosis and the skewness in the same sense as $\kappa_4$ and $\kappa_3$ or $\delta_3$ and $\delta_2$. For instance, the signs of $\delta_1$ and $\kappa_3$ may differ.

Now, the objective function based on Gaussian moments with $b = 1$ can be expressed as

$$\rho_G(y_i) = \omega_{G,1}|G_0(y_i)| + \omega_{G,2}|G_1(y_i)|. \tag{4.47}$$

The estimating function related to the objective function (4.47) and the derivative of the estimating function are

$$\varphi_G(y_i) = \omega_{G,1}\,\mathrm{sign}(\delta_0)\,G_1(y_i) + \omega_{G,2}\,\mathrm{sign}(\delta_1)\,G_2(y_i) \tag{4.48}$$

$$\varphi'_G(y_i) = \omega_{G,1}\,\mathrm{sign}(\delta_0)\,G_2(y_i) + \omega_{G,2}\,\mathrm{sign}(\delta_1)\,G_3(y_i). \tag{4.49}$$

The statistic $\mathrm{sign}(\delta_0)$ has a similar role as the sign of the kurtosis has in many algorithms. The sign of $\delta_0$ is either known in advance or, more practically, estimated from the data for each source.

4.5 Performance

Several simulations presented in the original publications demonstrate the reliable