Teknillinen korkeakoulu Signaalinkäsittelytekniikan laboratorio
Espoo 2002
Report 35
ADAPTIVE METHODS FOR SCORE FUNCTION MODELING IN
BLIND SOURCE SEPARATION
Helsinki University of Technology Signal Processing Laboratory
Juha Karvanen
Dissertation for the degree of Doctor of Science in Technology to be presented with due
permission for public examination and debate in Auditorium S4 at Helsinki University of
Technology (Espoo, Finland) on the 26th of August, 2002, at 12 o’clock noon.
Helsinki University of Technology
Department of Electrical and Communications Engineering
Signal Processing Laboratory
Teknillinen korkeakoulu
Sähkö- ja tietoliikennetekniikan osasto
Signaalinkäsittelytekniikan laboratorio
Distribution:
Helsinki University of Technology
Signal Processing Laboratory
P.O. Box 3000
FIN-02015 HUT
Tel. +358-9-451 2486
Fax. +358-9-460 224
E-mail: [email protected]
Juha Karvanen
ISBN 951-22-5990-7
ISSN 1458-6401
Otamedia Oy
Espoo 2002
In signal processing and related fields, multichannel measurements are often encountered.
Depending on the application, for instance, multiple antennas, multiple microphones or
multiple biomedical sensors are used for the data acquisition. Such systems can be described
using Multiple-Input Multiple-Output (MIMO) system models. In many cases, several source
signals are present at the same time and there is only limited knowledge of their properties
and how they contribute to each sensor output. If the source signals and the physical system
are unknown and only the sensor outputs are observed, the processing methods developed
for recovering the original signals are called blind.
In Blind Source Separation (BSS) the goal is to recover the source signals from the
observed mixed signals (mixtures). Blindness means that neither the sources nor the mixing
system is known. Separation can be based on the theoretically limiting but practically
feasible assumption that the sources are statistically independent. This assumption connects
BSS and Independent Component Analysis (ICA). The usage of mutual information as a
measure of independence leads to iterative estimation of the score functions of the mixtures.
The purpose of this thesis is to develop BSS methods that can adapt to different source
distributions. Adaptation makes it possible to separate sources without knowing the source
distributions or even the characteristics of source distributions. Special attention is paid to
methods that allow also asymmetric source distributions. Asymmetric distributions occur in
important applications such as communications and biomedical signal processing. Adaptive
techniques are proposed for the modeling of score functions or estimating functions. Three
approaches based on the Pearson system, the Extended Generalized Lambda Distribution
(EGLD) and adaptively combined fixed estimating functions are proposed. The Pearson
system and the EGLD are parametric families of distributions and they are used to model
contain a wide class of distributions, including asymmetric distributions with positive and
negative kurtosis, while the estimation of the parameters is still a relatively simple procedure.
The methods may be implemented using existing ICA algorithms.
The reliable performance of the proposed methods is demonstrated in extensive
simulations. In addition to symmetric source distributions, asymmetric distributions, such
as Rayleigh and lognormal distribution, are utilized in simulations. The score adaptive
methods outperform commonly used methods due to their ability to adapt to asymmetric
And I applied my heart to know wisdom and knowledge, madness and folly,
and I came to know that this too was chasing after wind. Eccl. 1:17
The work reported in this thesis was carried out in the Signal Processing Laboratory,
Helsinki University of Technology during the years 1999–2002. The period included three
months in the Signal Processing Laboratory at University of Pennsylvania, Philadelphia,
USA.
I wish to express my gratitude to my supervisor Prof. Visa Koivunen. I would like
to thank my thesis reviewers Dr. Aapo Hyvärinen and Dr. Jyrki Möttönen for their
constructive comments. I wish to thank Prof. Saleem Kassam for my visit to University of
Pennsylvania. I am also grateful to my colleagues and co-workers, especially Jan Eriksson,
Dr. Yinglu Zhang and Dr. Charles Murphy. In addition, I would like to thank my parents
Toivo and Marja-Liisa for their support.
The research was funded by the Academy of Finland and the Graduate School in
Electronics, Telecommunications and Automation (GETA). Additional financial support was
provided by Elektroniikkainsinöörien säätiö, the Nokia Foundation and Tekniikan edistämissäätiö.
These organizations and foundations are gratefully acknowledged for making this work possible.
Espoo, Finland
May 13, 2002
1 Introduction
1.1 Motivation
1.2 Scope of the Thesis
1.3 Contribution of the Thesis
1.4 Summary of Publications
2 Blind Source Separation
2.1 Overview
2.2 Independent Component Analysis Model
2.2.1 The basic ICA model
2.2.2 Extensions of the basic ICA model
2.3 Anatomy of an ICA Method
2.4 Measures of Independence
2.5 Objective Functions and Estimating Functions
2.6 Mutual Information and Source Adaptation
2.7 Algorithms
2.7.1 Natural gradient algorithm
2.7.2 Fixed-point algorithm
2.7.3 Jacobi algorithms
2.8 Characterization of Source Distributions
2.9 Discussion
3 Review of source adaptive ICA methods
3.1 Overview
3.2 Kernel estimation of densities
3.3 Parametric models
3.3.1 Distribution families
3.3.2 Mixture of Densities
3.4 Adaptive nonlinearities
3.4.1 Polynomial expansions
3.4.2 Basis functions
3.4.3 Threshold functions and quantizers
3.5 Discussion
4 Adaptive Score Models
4.1 Overview
4.2 Pearson System
4.2.1 Estimation of the Pearson system parameters
4.2.2 Extensions of the Pearson system
4.3 Extended Generalized Lambda Distribution
4.3.1 Parameter estimation via sample moments
4.3.2 Parameter estimation via L-moments
4.3.3 Other estimation techniques
4.4 Adaptive Estimating Functions
4.4.1 Estimating functions based on cumulants and absolute moments
4.4.2 Gaussian moments based estimating functions
4.5 Performance
4.6 Discussion
I J. Karvanen, J. Eriksson and V. Koivunen. Pearson System based Method for Blind
Separation. In Proc. of The Second International Workshop on Independent
Component Analysis and Blind Signal Separation, ICA2000, pages 585–590, June 2000.
II J. Eriksson, J. Karvanen and V. Koivunen. Source Distribution Adaptive Maximum
Likelihood Estimation of ICA Model. In Proc. of The Second International Workshop
on Independent Component Analysis and Blind Signal Separation, ICA2000, pages
227–232, June 2000.
III J. Karvanen, J. Eriksson and V. Koivunen. Maximum Likelihood Estimation of ICA
model for Wide Class of Source Distributions. In Proc. of the 2000 IEEE Workshop on
Neural Networks for Signal Processing X, pages 445–454, December 2000.
IV J. Karvanen and V. Koivunen. Blind Separation of Communication Signals Using
Pearson System Based Method. In Proc. of The Thirty-Fifth Annual Conference on
Information Sciences and Systems, Volume II, pages 764–767, March 2001.
V J. Karvanen and V. Koivunen. Blind Separation Methods Based on Pearson system
and its Extensions. Signal Processing, Volume 82, Issue 4, pages 663–673, April 2002.
VI J. Karvanen, J. Eriksson and V. Koivunen. Adaptive Score Functions for Maximum
Likelihood ICA. Journal of VLSI Signal Processing, Volume 32, pages 83–92, 2002.
VII J. Karvanen and V. Koivunen. Blind Separation using Absolute Moments Based
Adaptive Estimating Function. In Proc. of the Third International Conference on
Independent Component Analysis and Signal Separation, ICA2001, pages 218–223, December
symbols
Abbreviations
BER      bit error rate
BSS      blind source separation
cdf      cumulative distribution function
EGLD     extended generalized lambda distribution
FIR      finite impulse response
GBD      generalized beta distribution
GLD      generalized lambda distribution
GMSK     Gaussian minimum shift keying
ICA      independent component analysis
i.i.d.   independent identically distributed
I-MIMO   instantaneous multi-input multi-output
ISI      intersymbol interference
MIMO     multiple-input multiple-output
MSE      mean square error
PCA      principal component analysis
pdf      probability density function
Symbols
x                  vector of output signals
A                  mixing matrix
s                  vector of source signals
m                  number of sensors, dimension of the data
y                  vector of source estimates
W                  demixing matrix
$A^T$              transpose of matrix A
$A^{-1}$           inverse of matrix A
det(A)             determinant of matrix A
I                  identity matrix
f                  probability density function
$f_G$              Gaussian density
F                  cumulative distribution function
$K(f(y), g(y))$    Kullback-Leibler divergence between densities f(y) and g(y)
$K(y \| s)$        Kullback-Leibler divergence between densities of random variables y and s
$\phi_{ML}$        maximum likelihood contrast
$\phi_{MI}$        mutual information contrast
$\Psi(x; W)$       matrix-valued estimating function
$\varphi_y(y)$     score function of y
$\varphi(y_i)$     (one-unit) estimating function
$\psi$             characteristic function of a distribution
t                  time index
T                  number of observations
$f'(x)$            derivative of f(x)
$\mu_1, \mu_2, \mu_3, \ldots$            central moments
$\kappa_1, \kappa_2, \kappa_3, \ldots$   cumulants
$L_1, L_2, L_3, \ldots$                  L-moments
                   absolute moments
                   skewed absolute moments
$G_0, G_1, G_2, \ldots$                  Gaussian moments
$\delta_3, \delta_4$   cumulant based skewness and kurtosis
$\tau_3, \tau_4$       L-moment based skewness and kurtosis
$\delta_2, \delta_3$   absolute moments based skewness and kurtosis
$\delta_0, \delta_1$   Gaussian moments based skewness and kurtosis
$E\{x\}$           expected value of x
$a_0, a_1, \ldots, a_p$   numerator polynomial parameters of the Pearson system
$b_0, b_1, \ldots, b_q$   denominator polynomial parameters of the Pearson system
a, b               vectors of the Pearson system parameters
M, Q               matrices containing central moments
$\lambda_1, \lambda_2, \lambda_3, \lambda_4$   parameters of the GLD
$\beta_1, \beta_2, \beta_3, \beta_4$           parameters of the GBD
$\omega_1, \omega_2, \ldots$                   weighting parameters
$E_1$              performance index
$\vartheta_i$      local stability
                   variance of stability solution
                   BSS efficacy
                   kernel function
T
Introduction
1.1 Motivation
In signal processing and related fields, multichannel measurements are often encountered.
The obtained data can be represented as multivariate time series. Depending on the
application, for instance, multiple antennas, multiple microphones or multiple biomedical sensors
are used for the data acquisition. Such systems can be described using Multiple-Input
Multiple-Output (MIMO) system models. The observed sensor outputs are different
because the sensors have different properties, e.g. separate locations. On the other hand, the
sensor outputs are related because the sensors are observing the same source signals. In
many cases, several source signals are present at the same time and there is only limited
knowledge of their properties and how they contribute to each sensor output. If the source
signals and the physical system are unknown and only the sensor outputs are observed,
the processing methods developed for recovering the original signals are called blind. An
illustration of an instantaneous mixing MIMO model is presented in Figure 1.1.
In Blind Source Separation (BSS, also known as Blind Signal Separation) the goal is to
recover the source signals from the observed mixed signals. Blindness means that neither
the sources nor the mixing system is known. Separation can be based on the theoretically
limiting but practically feasible assumption that the sources are statistically independent.
This assumption connects BSS and Independent Component Analysis (ICA). The terms
Figure 1.1: An illustration of an instantaneous noise-free mixing system. The system and
the sources are unknown and only the sensor outputs are observed.
to separate certain transmitted signals whereas in ICA the goal is to find some components
that are statistically as independent as possible. Thus, ICA can be seen as a tool to solve
the BSS problem.
BSS and ICA have been applied, for example, in the following application domains:
Audio and speech signal separation, e.g. [113, 102, 62]
Multiple-Input Multiple-Output (MIMO) communications systems, e.g. [60, 19, 42, 10,
29, 123, 116]
Biomedical signal processing, e.g. [82, 85, 66, 24, 118]
Image processing and feature extraction, e.g. [9, 59, 23]
Econometrics and financial applications, e.g. [8, 76, 49]
During the last ten years, a considerable amount of work has been focused on BSS/ICA.
Conferences and special sessions concentrating on ICA have been organized. The theoretical
1.2 Scope of the Thesis
The purpose of this thesis is to develop ICA methods that can adapt to different source
distributions. Adaptation makes it possible to separate sources without knowing the source
distributions or even the characteristics of source distributions. Special attention is paid to
methods that allow not only symmetric but also asymmetric source distributions.
Asymmetric distributions occur in key application areas, such as communications and biomedical
signal processing.
The ICA model has two groups of parameters: the mixing system and the source
distributions. It has been shown that if the source distributions are known, optimal separation
algorithms may be derived [17]. This is done by utilizing the score functions of the sources.
Blindness means, however, that no explicit knowledge about the source distributions is
available. It follows that the better the sources or the score functions of the sources are estimated,
the better the separation result we can expect.
The first goal of this thesis is to develop methods for learning the source distributions.
In practice, it is adequate to concentrate on the approaches that capture the essential
properties of the source distributions. The second goal is to find efficient implementations of
the proposed methods. This includes the choice of the optimization algorithm, robustness
considerations and simulation studies of practical performance. The objective is to show
that adaptive estimation methods are necessary and, on the other hand, that the price
paid for the increased flexibility is not too high.
1.3 Contribution of the Thesis
The contributions of this thesis are in developing new methods for ICA. Adaptive techniques
are proposed for the modeling of score functions or estimating functions. Score functions
are modeled using parametric families. The methods may be incorporated into existing ICA
algorithms. The contributions can be summarized as follows:
The relationship between score adaptive estimation and minimization of mutual
An extended Pearson system model allowing multimodal distributions is introduced.
The case of bimodal distributions is considered in more detail. The obtained score
functions are bounded and defined everywhere. The parameters can be estimated
using the method of moments.
The use of the Extended Generalized Lambda Distribution (EGLD) in the ICA
problem is introduced in co-operation with the co-authors [Publication II].
The method of L-moments is proposed for the estimation of the parameters of the
Generalized Lambda Distribution (GLD).
The optimal weighting is derived for the adaptive estimating functions comprised of
two fixed components using the concept of BSS efficacy.
Absolute moments are proposed as estimating functions.
The performance of the proposed methods is studied quantitatively and qualitatively
in simulations. The simulations demonstrate the reliable performance of the methods.
1.4 Summary of Publications
This thesis consists of 7 publications and a summary. The summary part of the thesis is
organized as follows: Chapter 2 introduces the basic concepts and methods of BSS. Chapter
3 contains an overview of the existing methods for source adaptive ICA. In Chapter 4
the main contribution of the thesis is summarized and the methods of Pearson-ICA, EGLD-ICA
and adaptive estimating functions are presented. Chapter 5 provides a brief summary and
outlines future research.
In Publication I a Pearson system based BSS method is introduced. An algorithm using
the method of moments is proposed for finding the parameters of the Pearson system. The
actual separation is performed using the fixed-point algorithm [58]. The simulation examples
demonstrate that the method can separate super- and sub-Gaussian sources and even
non-Gaussian sources with zero kurtosis.
In Publication II an EGLD based BSS method is introduced. An algorithm utilizing the
In Publication III the algorithms proposed in Publications II and I are further studied
and compared. It is demonstrated in simulations that the standard BSS methods may
perform poorly in the cases where the sources have asymmetric distributions. Due to source
adaptation the EGLD and Pearson system based methods reliably separate the sources.
In Publication IV the applications of Pearson-ICA are considered from the viewpoint of
telecommunications. Separation of binary sources and instantaneous mixing of Rayleigh or
lognormal faded signals are used as examples. Simulation results are provided.
In Publication V the use of the Pearson system is further developed. The different
types of distributions in the Pearson family are studied in the ICA context. It is shown using the
results by Pham [97] that the minimization of the mutual information contrast leads to
iterative use of score functions as estimating functions. An extension of the Pearson system
that can model multimodal distributions is introduced. The applicability of the Pearson
system based method is demonstrated in simulation examples, including blind equalization
of GMSK signals.
Publication VI is an extended version of Publication III. The performance of the
proposed methods is studied in more detail. The additional contribution is the method of
L-moments proposed for the estimation of GLD parameters. It is argued that the L-moments
are a more natural way to estimate the GLD parameters than the conventional sample
moments. Additionally, the L-moments have attractive theoretical properties, including lower
sample variance compared to the sample moments.
Publication VII considers the problem of adaptive score estimation from a different
viewpoint. The proposed estimating functions, comprised of a symmetric and an asymmetric
part, can capture the essential features of the source distributions. The optimal weighting
between the symmetric and asymmetric part is derived using the concept of BSS efficacy.
General results are derived and absolute moment based estimating functions are presented
as an example.
The author derived all the equations, performed all the simulations and was mainly
responsible for writing in Publications I, III, IV, V, VI and VII. The co-authors contributed in
steering the research, in designing experiments and in writing the papers.
The first author was mainly responsible for writing Publication II. The idea of using the
Blind Source Separation
2.1 Overview
This chapter provides a short overview of blind source separation (BSS) and independent
component analysis (ICA). The key concepts and assumptions needed in ICA and BSS are
described. The basic ICA model and its extensions are considered. The elements and principles
of an ICA method are explained. More extensive overviews are given in several books and
tutorial articles [60, 49, 50, 80, 17, 4].
2.2 Independent Component Analysis Model
2.2.1 The basic ICA model
In this thesis we consider the noiseless instantaneous ICA model
\[ x = As, \tag{2.1} \]
where $s = [s_1, s_2, \ldots, s_m]^T$ is an unknown source vector and matrix $A_{m \times m}$ is an unknown
real-valued mixing matrix. The observed mixtures $x = [x_1, x_2, \ldots, x_m]^T$ are sometimes
called sensor outputs. The following assumptions are needed for the model to be identifiable
according to [27, 68]:
1. The sources are statistically independent.
2. At most one of the sources has a Gaussian distribution.
3. The mixing matrix A is invertible.
4. Moments of the sources exist up to the necessary order.
Separation means that we find a separating matrix W that makes the components of
\[ y = Wx \tag{2.2} \]
mutually independent. The $i$th row vector of W is denoted by $w_i$. It is possible to find the
solution only up to some scaling and permutation. If W is a separating matrix, any matrix
$\Lambda P W$, where $\Lambda$ is a diagonal matrix and P is a permutation matrix (in a permutation matrix
exactly one element in every row and column is 1 and the other elements are 0), is also a
separating matrix [27].
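The model (2.1)-(2.2) and the scaling and permutation ambiguity can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the dimensions, the random seed and the names `Lam`, `P`, `W2` and `G` are illustrative choices and are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen only for reproducibility
m, T = 3, 1000                           # number of sources and observations (illustrative)

s = rng.uniform(-1.0, 1.0, size=(m, T))  # independent non-Gaussian sources
A = rng.normal(size=(m, m))              # mixing matrix, unknown in a real blind setting
x = A @ s                                # observed mixtures, x = A s   (2.1)

W = np.linalg.inv(A)                     # an ideal separating matrix
y = W @ x                                # y = W x   (2.2) recovers the sources

Lam = np.diag([2.0, -0.5, 3.0])          # arbitrary invertible diagonal scaling
P = np.eye(m)[[2, 0, 1]]                 # permutation matrix: one 1 in every row and column
W2 = Lam @ P @ W                         # another separating matrix
G = W2 @ A                               # global matrix: a scaled permutation matrix
```

Here `W2 @ x` contains the same source waveforms as `y`, only reordered and rescaled, which is why separation quality is usually judged from the global matrix `G` rather than from `W` itself.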
The ICA model has two types of parameters: the mixing coefficients in A and the
source densities. Usually, we are interested in the mixing matrix A or the actual source
values, and the source densities are treated as nuisance parameters. Without any additional
assumptions, the estimation of the densities is considered as a nonparametric problem.
Together with the parametric estimation of the mixing matrix, the estimation of the ICA
model is referred to as a semiparametric problem [3].
2.2.2 Extensions of the basic ICA model
The basic ICA model may be extended in several different ways. The noisy ICA model is
expressed as
\[ x = As + n, \tag{2.3} \]
where n is a Gaussian noise vector independent of the sources. Adding the noise makes
the model more realistic because there is always noise in physical sensor measurements. If
the noise variances are small compared to the output variances, the methods for noiseless
ICA can be utilized with good results. In the presence of heavy noise additional methods
Complex-valued sources and mixing matrices occur especially in communication
problems. There exist ICA methods developed for the complex-valued problem [15, 11].
Sometimes the problem reduces to the real-valued problem, e.g. [123].
In some applications the mixing matrix A is not a square matrix. In the case where the
number of mixtures is higher than the number of sources, we essentially have the basic
problem with extra information. The rank of the model or the number of sources might
be unknown and should then also be estimated, e.g. [22]. The case where the number of
mixtures is lower than the number of sources is a difficult problem. Since the mixing
is not invertible, the identification of the mixing matrix and the recovery of the sources
are separate problems. Generally, the sources cannot be recovered without additional
assumptions. The problem has been considered in [108, 28, 20, 32, 79, 63].
In convolutive mixing, the observed discrete-time signals $x_i(t)$, $i = 1, \ldots, m$, are
generated from the model
\[ x_i(t) = \sum_{j=1}^{m} \sum_{k} a_{ijk} \, s_j(t - k). \tag{2.4} \]
This is a Finite Impulse Response Multi-input Multi-output (FIR-MIMO) model, whereas
the basic instantaneous mixing model (2.1) can be seen as an instantaneous MIMO
(I-MIMO) system. In model (2.4) each FIR filter (for fixed indices i and j) is defined by the
coefficients $a_{ijk}$. Convolutive models are considered e.g. in [6, 114, 93, 45, 46].
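The FIR-MIMO model (2.4) can be sketched in a few lines. The example below is an illustration only, assuming NumPy, sources that are zero before the first sample, and a coefficient array `a` of shape (m, m, K); the function name `fir_mimo_mix` is hypothetical.

```python
import numpy as np

def fir_mimo_mix(a, s):
    """Convolutive mixing x_i(t) = sum_j sum_k a[i, j, k] * s[j, t - k].

    a : array of shape (m_out, m_in, K), FIR coefficients a_ijk
    s : array of shape (m_in, T), source signals; s_j(t) is taken as 0 for t < 0
    """
    m_out, m_in, K = a.shape
    T = s.shape[1]
    x = np.zeros((m_out, T))
    for i in range(m_out):
        for j in range(m_in):
            # full convolution of filter (i, j) with source j, cut to the first T samples
            x[i] += np.convolve(a[i, j], s[j])[:T]
    return x
```

With filters of length K = 1 the function reduces to the instantaneous model x = As of (2.1), which is a convenient sanity check.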
The nonlinear ICA model is given by
\[ x = h(s), \tag{2.5} \]
where h is an unknown m-component mixing function. If the space of the nonlinear functions
h is not limited there exists an infinity of solutions [61, 40]. Recently, the interest towards
nonlinear ICA has increased. The uniqueness problems are avoided using a Bayesian approach
[117], regularization techniques [1] or structured models [109, 40]. An important special case
of the general nonlinear model (2.5) is the post-nonlinear mixture model [111]
\[ x_i = h_i\left( \sum_{j=1}^{m} a_{ij} s_j \right), \quad i = 1, \ldots, m, \tag{2.6} \]
where nonlinear functions $h_i$, $i = 1, \ldots, m$, are applied to the linear mixtures.
2.3 Anatomy of an ICA Method
In this thesis ICA methods are studied in the following framework. An ICA method consists
of three parts:
1. Measure of independence (theoretical contrast)
2. Estimator of the measure, or objective function
3. Algorithm for optimization
These parts are considered in the following sections. We make a distinction between
theoretical measures of independence and the estimators of independence calculated from the
data. From the theoretical point of view the linear instantaneous ICA problem is solved:
independent components are found when the chosen measure of independence is minimized.
However, the great number of proposed ICA methods shows that there is work to do
with estimators and algorithms.
2.4 Measures of Independence
Mutual independence of random variables $y = [y_1, y_2, \ldots, y_m]^T$ means that the joint
distribution can be factorized and presented as a product of the marginals. The factorization can
be defined using cumulative distribution functions
\[ F(y) = F_1(y_1) F_2(y_2) \cdots F_m(y_m), \tag{2.7} \]
probability densities
\[ f(y) = f_1(y_1) f_2(y_2) \cdots f_m(y_m), \tag{2.8} \]
or characteristic functions
\[ \psi(t) = \psi_1(t_1) \psi_2(t_2) \cdots \psi_m(t_m), \tag{2.9} \]
where the characteristic function is defined by
\[ \psi(t) = \int e^{\jmath t y} \, dF = \int e^{\jmath t y} f(y) \, dy, \tag{2.10} \]
where $\jmath$ is the imaginary unit. These definitions characterize independence but they do not
directly tell how to measure dependencies. A natural way to do this is to construct a measure
using, for instance, the difference between the joint characteristic function and the product
of marginal characteristic functions [122, 38] or alternatively, the difference between the joint
cdf and the product of the marginal cdfs, e.g. the Kolmogorov-Smirnov [47] test statistic
\[ KS = \sup_x \left| F(x) - F_1(x_1) F_2(x_2) \cdots F_m(x_m) \right|. \tag{2.11} \]
A contrast function, or briefly a contrast, is one of the key terms in ICA. A contrast is a
function to be minimized in order to separate the sources. Formally a contrast function is
defined as [27]
Definition 1 A contrast is a mapping $\phi$ from the set of densities $\{f_y, \; y \in \mathbb{R}^m\}$ to $\mathbb{R}$
satisfying the following three requirements
1. $\phi(f_{Py}) = \phi(f_y)$, $\forall P$ permutation,
2. $\phi(f_{\Lambda y}) = \phi(f_y)$, $\forall \Lambda$ diagonal invertible,
3. If y has independent components, then $\phi(f_{Ay}) \ge \phi(f_y)$, $\forall A$ invertible.
According to Definition 1 a contrast is a function of densities. Under the assumption that
the densities are correctly estimated, a contrast becomes a function of the current mixture
y or equivalently a function of the separating matrix W.
Two fundamental ICA contrasts, the maximum likelihood contrast and the mutual
information contrast, are based on the Kullback-Leibler divergence. The Kullback-Leibler divergence
between the random variables $y_1$ and $y_2$ is defined as
\[ K(y_1 \| y_2) = \int f_1(y) \log \frac{f_1(y)}{f_2(y)} \, dy, \tag{2.12} \]
where $f_1$ and $f_2$ are the densities of $y_1$ and $y_2$. The maximum likelihood
contrast can be defined as the Kullback-Leibler divergence between y and s
\[ \phi_{ML}(y) = K(y \| s) \tag{2.13} \]
and the mutual information contrast can be defined as
\[ \phi_{MI}(y) = K(y \| \tilde{y}), \tag{2.14} \]
where $\tilde{y}$ denotes the vector with independent entries with each entry distributed as the
corresponding marginal of y. Now the connection between mutual information and likelihood
can be written as
\[ K(y \| s) = K(y \| \tilde{y}) + K(\tilde{y} \| s). \tag{2.15} \]
Mutual information is a sufficient statistic in ICA [17]. Likelihood is a sum of mutual
information and a nuisance term that gives the marginal mismatch between the output and
the assumed sources.
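The decomposition of the likelihood into mutual information and a marginal mismatch term follows in a few lines when the assumed source density factorizes. The derivation below is a sketch under that assumption; the notation $f_y$ for the joint density of y, $f_{y_i}$ for its marginals and $f_{s_i}$ for the assumed source densities is introduced here for illustration.

```latex
\begin{align*}
K(y \,\|\, s)
 &= \int f_y(y) \log \frac{f_y(y)}{\prod_{i} f_{s_i}(y_i)} \, dy \\
 &= \int f_y(y) \log \frac{f_y(y)}{\prod_{i} f_{y_i}(y_i)} \, dy
  + \sum_{i=1}^{m} \int f_y(y) \log \frac{f_{y_i}(y_i)}{f_{s_i}(y_i)} \, dy \\
 &= K(y \,\|\, \tilde{y})
  + \sum_{i=1}^{m} \int f_{y_i}(y_i) \log \frac{f_{y_i}(y_i)}{f_{s_i}(y_i)} \, dy_i \\
 &= K(y \,\|\, \tilde{y}) + K(\tilde{y} \,\|\, s).
\end{align*}
```

The third line uses the fact that each summand depends on $y_i$ only, so the remaining coordinates integrate out to the marginal $f_{y_i}$. The first term does not involve the assumed source densities at all, which is why the mutual information contrast can be minimized without fixing a source model in advance.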
2.5 Objective Functions and Estimating Functions
An estimator of a contrast function is often called an objective function, criterion function
or cost function. In addition, the term contrast is sometimes used also for the estimator
calculated from the data. It should be mentioned that the meaning of contrast in ICA differs
from the meaning contrast has in statistics [103]. The ICA terminology may be confusing
here but the basic idea is that we have a measure of independence and an estimator for it.
The derivative of an objective function may be called an estimating function. Estimating
functions are sometimes also called separating functions or activation functions. Since the
objective functions must be minimized numerically, the estimating functions have an
essential role in practical ICA algorithms. Formally, the estimating function [3] can be defined
as a matrix-valued function $\Psi(x; W)$ such that
\[ E\{\Psi(x; W)\} = 0, \tag{2.16} \]
\[ \Psi(x; W) = I - \varphi(y) y^T, \]
where $\varphi(y) = [\varphi_1(y_1), \varphi_2(y_2), \ldots, \varphi_m(y_m)]^T$ is a vector of one-unit
estimating functions. The term 'estimating function' is commonly used to refer to these
one-unit estimating functions, as done also in this thesis. This definition of the estimating
function is related to projection pursuit [53, 65] and the deflation approach [33, 54]
where one-unit objective functions are used to extract the sources one by one.
If the source distributions are known, the maximum likelihood principle leads to the
estimating functions that are the score functions of the sources [17]:
\[ \varphi_y(y) = -\frac{d}{dy} \log f_y(y). \tag{2.17} \]
This is a fundamental result but it applies only when the source densities are positive
everywhere. For example, if uniformly distributed sources are mixed we cannot use score
functions in separation because the score functions are zero in a finite interval and undefined
elsewhere.
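As a quick numerical illustration of the score function, the sketch below approximates it by a central difference of a log-density. It assumes NumPy and the sign convention in which the score is the negative derivative of the log-density (so that the standard Gaussian score is y itself); the function names are hypothetical.

```python
import numpy as np

def score_from_logpdf(logpdf, y, h=1e-5):
    """Approximate the score function -d/dy log f(y) by a central difference."""
    return -(logpdf(y + h) - logpdf(y - h)) / (2.0 * h)

def gauss_logpdf(y):
    # standard Gaussian log-density: the score is phi(y) = y
    return -0.5 * y**2 - 0.5 * np.log(2.0 * np.pi)

def laplace_logpdf(y):
    # unit-scale Laplace log-density: the score is phi(y) = sign(y) for y != 0
    return -np.abs(y) - np.log(2.0)

y = np.array([-2.0, -0.5, 0.5, 2.0])
print(score_from_logpdf(gauss_logpdf, y))    # close to y
print(score_from_logpdf(laplace_logpdf, y))  # close to sign(y)
```

The linear Gaussian score carries no higher-order information, while the Laplace score is a bounded nonlinearity; this is one way to see why the choice of estimating function encodes an assumption about the source density.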
In practice, the source distributions are not known. The maximum likelihood contrast
can be employed with some pre-chosen densities for the sources. An equivalent approach is
to choose directly a suitable nonlinear function as the estimating function. We use the notation
where $\varphi_y$ refers to the true score function of random variable y, as defined in equation
(2.17). The notation $\varphi$ without the subindex refers to an estimating function, or to the
estimated score function. This emphasizes the close relationship between the score function
modeling and the nonlinearity selection. If estimating function $\varphi$ is used, we observe that
the following expression for the assumed source density is obtained
\[ f(y_i) = \frac{\exp\left( -\int \varphi(y_i) \, dy_i \right)}{\int_{-\infty}^{\infty} \exp\left( -\int \varphi(y_i) \, dy_i \right) dy_i}. \tag{2.18} \]
It should be noted that (2.18) is not always a valid density in the traditional sense. For some
typical choices of the estimating function, the denominator in (2.18) tends to infinity. This can
be avoided by making a working assumption that $y_i$ belongs to a finite interval and evaluating
the integrals over this interval.
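To make the relation between an estimating function and its implied density concrete, three common nonlinearities can be worked through by hand. This is a sketch assuming the convention that the implied density is proportional to $\exp(-\int \varphi(y_i)\,dy_i)$, with integration constants absorbed into the normalization.

```latex
\begin{align*}
\varphi(y_i) = \tanh(y_i): \quad
  & \exp\Big( -\int \tanh(y_i) \, dy_i \Big) = \frac{1}{\cosh(y_i)}, \qquad
    \int_{-\infty}^{\infty} \frac{dy_i}{\cosh(y_i)} = \pi, \qquad
    f(y_i) = \frac{1}{\pi \cosh(y_i)}, \\
\varphi(y_i) = y_i^3: \quad
  & f(y_i) \propto \exp\big( -y_i^4 / 4 \big), \\
\varphi(y_i) = y_i^2: \quad
  & \exp\big( -y_i^3 / 3 \big) \to \infty \ \text{as } y_i \to -\infty,
    \ \text{so the normalizing integral diverges.}
\end{align*}
```

The hyperbolic tangent thus corresponds to a heavy-tailed (super-Gaussian) hyperbolic secant density and the cubic to a light-tailed (sub-Gaussian) density, while the quadratic nonlinearity is an example of a choice that yields no valid density on the whole real line.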
In linear ICA accurate estimation of source distributions is not always crucial. However,
better separation may be achieved if the source distributions are estimated. This becomes
Cumulants have been used as objective functions since the early days of blind separation
[67, 27, 15, 33]. Cumulants are employed also in some recent works; see e.g. [86] and [30].
Cumulants $\kappa_1, \kappa_2, \kappa_3, \ldots$ are defined via the characteristic function (2.10) by the identity
\[ \exp\left( \sum_{q=1}^{\infty} \kappa_q \frac{(\jmath t)^q}{q!} \right) = \psi(t). \tag{2.19} \]
Cumulants may be estimated from the sample moments of the same and lower orders. The
estimates of the sample moments (central moments) are obtained as follows
\[ \bar{x} = \sum_{t=1}^{T} x(t) / T \tag{2.20} \]
\[ \hat{\mu}_2 = \hat{\sigma}^2 = \sum_{t=1}^{T} (x(t) - \bar{x})^2 / T \tag{2.21} \]
\[ \hat{\mu}_3 = \sum_{t=1}^{T} (x(t) - \bar{x})^3 / T \tag{2.22} \]
\[ \hat{\mu}_4 = \sum_{t=1}^{T} (x(t) - \bar{x})^4 / T, \tag{2.23} \]
where T is the number of observations. In this thesis, notation $\mu_1, \mu_2, \mu_3, \ldots$ is used for
both the theoretical moments and their sample estimators. The cumulant-based skewness
and kurtosis may be defined as follows
\[ \delta_3(y_i) = \frac{\kappa_3(y_i)}{\kappa_2(y_i)^{3/2}} = E\left\{ \left( \frac{y_i - \mu_{y_i}}{\sigma_{y_i}} \right)^3 \right\} \tag{2.24} \]
\[ \delta_4(y_i) = \frac{\kappa_4(y_i)}{\kappa_2(y_i)^2} = E\left\{ \left( \frac{y_i - \mu_{y_i}}{\sigma_{y_i}} \right)^4 \right\} - 3, \tag{2.25} \]
where $\mu_{y_i}$ and $\sigma_{y_i}$ are the expected value and the standard deviation of $y_i$, respectively. The
separation can be based on the fact that for the Gaussian distribution the higher order cumulants
are equal to zero. Perhaps the simplest technique to separate the sources is to maximize or
minimize kurtosis.
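The sample versions of the cumulant-based skewness and kurtosis take only a few lines. The sketch below assumes NumPy and uses the 1/T normalization of the sample moments; the function name is hypothetical.

```python
import numpy as np

def sample_skewness_kurtosis(x):
    """Cumulant-based skewness and kurtosis from central sample moments (1/T normalization)."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()                      # sample mean
    mu2 = np.mean((x - xbar) ** 2)       # second central moment (variance)
    mu3 = np.mean((x - xbar) ** 3)       # third central moment
    mu4 = np.mean((x - xbar) ** 4)       # fourth central moment
    skew = mu3 / mu2 ** 1.5              # kappa_3 / kappa_2^(3/2)
    kurt = mu4 / mu2 ** 2 - 3.0          # kappa_4 / kappa_2^2, zero for a Gaussian
    return skew, kurt
```

For a symmetric two-point sample such as `[-1, 1, -1, 1]` the function returns skewness 0 and kurtosis -2, the most sub-Gaussian value possible.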
When sample variance and robustness to outliers (in the noisy ICA model) are of concern,
bounded nonlinear functions may be more advisable than cumulants. However, the practical
typi-
Objective function                        Estimating function
kurtosis: $y_i^4$                         cubic: $\varphi(y_i) = y_i^3$
skewness: $y_i^3$                         $\varphi(y_i) = y_i^2$
$\log(\cosh(y_i))$                        hyperbolic tangent: $\varphi(y_i) = \tanh(y_i)$
Gaussian moments, e.g. $e^{-y_i^2/2}$     $\varphi(y_i) = -y_i e^{-y_i^2/2}$
3rd absolute moment: $|y_i|^3$            $\varphi(y_i) = y_i |y_i|$
Table 2.1: Some typical one-unit objective functions and the corresponding estimating
functions. The scaling constants are omitted.
These simple estimating functions are a good benchmark for more advanced methods: they
are easy to implement and they successfully separate most typical sources.
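The pairs of Table 2.1 are simple enough to write down directly. The sketch below, assuming NumPy, lists each objective function with its estimating function and checks numerically that the latter is the derivative of the former up to the omitted scaling constant; the dictionary keys and the helper `derivative_ratio` are hypothetical names.

```python
import numpy as np

# (objective, estimating function) pairs; scaling constants are omitted,
# so each estimating function is only proportional to the derivative.
PAIRS = {
    "kurtosis": (lambda y: y**4, lambda y: y**3),
    "skewness": (lambda y: y**3, lambda y: y**2),
    "logcosh": (lambda y: np.log(np.cosh(y)), np.tanh),
    "gaussian_moment": (lambda y: np.exp(-y**2 / 2), lambda y: -y * np.exp(-y**2 / 2)),
    "abs_moment_3": (lambda y: np.abs(y)**3, lambda y: y * np.abs(y)),
}

def derivative_ratio(name, y, h=1e-6):
    """Ratio of the numerical derivative of the objective to the estimating function."""
    obj, est = PAIRS[name]
    return (obj(y + h) - obj(y - h)) / (2.0 * h) / est(y)
```

The ratio is constant in `y` for every pair (4, 3, 1, 1 and 3 respectively), which is the sense in which the scaling constants are omitted.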
The objective functions in Table 2.1, except the skewness, employ only even moments
or symmetric properties of the source distributions. This means that there is an implicit
assumption that the sources have a symmetric distribution. The explicit connection can be
found using equation (2.18). In Publication VII adaptive methods for finding objective
functions with the optimal weighting between the symmetric and asymmetric properties
have been proposed and they will be considered in Section 4.4 of this thesis.
2.6 Mutual Information and Source Adaptation
The ICA methods proposed in this thesis are based on direct minimization of mutual
information. The direct minimization of mutual information leads to the adaptive estimation of
the score functions of the mixtures as shown in [97] and in Publication V. Starting from the
mutual information contrast $\phi_{MI}(W)$ defined as a function of W, the following gradient
(called the relative gradient in [97]) is obtained
\[ \phi'_{MI}(W) = \int \varphi_y(y) \, y^T f_x(x) \, dx - I. \tag{2.26} \]
Using the relation y = Wx, where W is orthogonal, we can write (2.26) in the form
\[ \phi'_{MI}(W) = \int \varphi_y(y) \, y^T f_y(y) \, dy - I. \tag{2.27} \]
If y(t) is an ergodic random process, where the individual samples are distributed according
to $f_y(y)$, we obtain the following estimator
\[ \hat{\phi}'_{MI}(W) = \frac{1}{T} \sum_{t=1}^{T} \hat{\varphi}_y(y(t)) \, y(t)^T - I, \tag{2.28} \]
where $\hat{\varphi}_y$ is an estimator for the score function of y and T is the sample size. In the case of the
mutual information contrast, the estimating function is the score function of y. Because the
output y changes on every iteration of the optimization algorithm, the optimal estimating
functions also change in each iteration.
A procedure for parametric minimization of mutual information may be given as follows:
After the choice of the model family and some suitable algorithm, such as the natural gradient (2.29)
or the fixed-point algorithm (2.30), the following steps are repeated until convergence:
1) Appropriate sample statistics (e.g. moments) are computed from the current data
$y_k = W_k x$.
2) The parameters of the score function are estimated for each component using the sample
statistics.
3) The score functions are utilized as estimating functions in the ICA algorithm performing
the separation.
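The three steps above can be sketched as a loop. The example below is an illustration only, assuming NumPy and data `x` of shape (m, T) that has ideally been centered and whitened. In place of the Pearson system or the EGLD it uses a deliberately simple two-member score family, chosen by the sign of the sample kurtosis in the style of the extended infomax rule; this stand-in, the function names and the default step size are assumptions, not the method of the thesis.

```python
import numpy as np

def fit_score_functions(y):
    """Steps 1-2: compute the sample kurtosis of each row of y and pick a score model.

    Stand-in family (not the Pearson system or EGLD): phi(u) = u + k * tanh(u),
    with k = +1 for non-negative and k = -1 for negative sample kurtosis.
    """
    yc = y - y.mean(axis=1, keepdims=True)
    mu2 = np.mean(yc**2, axis=1)
    mu4 = np.mean(yc**4, axis=1)
    k = np.where(mu4 / mu2**2 - 3.0 >= 0.0, 1.0, -1.0)[:, None]
    return lambda u: u + k * np.tanh(u)

def adaptive_ica(x, n_iter=200, eta=0.1):
    """Repeat steps 1-3, using a natural gradient update as the ICA algorithm."""
    m, T = x.shape
    W = np.eye(m)
    for _ in range(n_iter):
        y = W @ x                                  # current source estimates
        phi = fit_score_functions(y)               # steps 1 and 2
        grad = np.eye(m) - (phi(y) @ y.T) / T      # I - sample mean of phi(y) y^T
        W = W + eta * grad @ W                     # step 3
    return W
```

The point of the sketch is the structure: the score model is re-estimated from the current outputs on every pass, so the nonlinearity follows the data instead of being fixed in advance. A richer family, estimated from more sample statistics, would slot into `fit_score_functions` without changing the loop.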
2.7 Algorithms
Numerical methods are needed in order to optimize an ICA objective function. In general,
the choice of the algorithm is independent of the choice of the objective function. Of
course, there may be differences in the computational complexity. It is commonly assumed
that the data is centered and whitened (zero mean, uncorrelated, unit variance) prior to the
actual separation. After whitening the separating matrix is (asymptotically) orthogonal and
the number of parameters to be estimated is smaller. Prewhitening improves the convergence
2.7.1 Natural gradient algorithm
A basic principle of gradient-type optimization methods is to move in the direction of
the (negative) gradient. In ICA, the gradient can be adjusted to correspond to the geometry
of the problem. This leads to the natural gradient [2] or relative gradient [17] algorithm. The
updating rule for the separating matrix is the following:

\[ \mathbf{W}_{k+1} = \mathbf{W}_k + \eta\left(\mathbf{I} - \boldsymbol{\varphi}(\mathbf{y})\,\mathbf{y}^T\right)\mathbf{W}_k, \tag{2.29} \]

where $\boldsymbol{\varphi}(\mathbf{y}) = [\varphi_1(y_1), \varphi_2(y_2), \ldots, \varphi_m(y_m)]^T$ is the vector of estimating functions and $\eta$ is
the learning rate.
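As an illustration of the update rule (2.29), the following minimal sketch performs one batch step in plain Python. It assumes that the expectation of $\boldsymbol{\varphi}(\mathbf{y})\mathbf{y}^T$ is replaced by a sample average as in (2.28) and that the same scalar estimating function is used for every component; the function name, the argument names and the default learning rate are hypothetical.

```python
def natural_gradient_step(W, X, score, lr=0.1):
    """One batch update W <- W + lr * (I - mean(score(y) y^T)) W, cf. (2.29).

    W     : m x m separating matrix, given as a list of rows
    X     : list of observed mixture vectors x(t), each of length m
    score : scalar estimating function applied to every component (assumption)
    lr    : learning rate (hypothetical default value)
    """
    m, T = len(W), len(X)
    # Current outputs y(t) = W x(t)
    Y = [[sum(W[i][k] * x[k] for k in range(m)) for i in range(m)] for x in X]
    # Sample average of score(y) y^T replaces the expectation
    C = [[sum(score(y[i]) * y[j] for y in Y) / T for j in range(m)]
         for i in range(m)]
    # G = I - C is the negative of the estimated gradient (2.28)
    G = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(m)]
         for i in range(m)]
    # W_new = W + lr * G W
    return [[W[i][j] + lr * sum(G[i][k] * W[k][j] for k in range(m))
             for j in range(m)] for i in range(m)]
```

When the sample average of $\boldsymbol{\varphi}(\mathbf{y})\mathbf{y}^T$ equals the identity matrix, the step leaves $\mathbf{W}$ unchanged, which is the stationarity condition of the estimating equations.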
2.7.2 Fixed-point algorithm
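The fixed-point algorithm [57, 58] can be seen as a computationally more efficient version of
the natural gradient algorithm. The update rule can be expressed as

\[ \mathbf{W}_{k+1} = \mathbf{W}_k + \mathbf{D}\left(E\{\boldsymbol{\varphi}(\mathbf{y})\,\mathbf{y}^T\} - \mathrm{diag}(E\{\varphi(y_i)\,y_i\})\right)\mathbf{W}_k, \tag{2.30} \]

where $\mathbf{D} = \mathrm{diag}\left(1/(E\{\varphi(y_i)\,y_i\} - E\{\varphi'(y_i)\})\right)$. After every iteration, the separating
matrix is projected to the set of orthogonal matrices (in the case of prewhitened data)
using the symmetric orthogonalization $\mathbf{W}_{orth} = (\mathbf{W}\mathbf{W}^T)^{-1/2}\mathbf{W}$. The algorithm is considered
to have converged when $\|\mathbf{W}_{k+1} - \mathbf{W}_k\| < \varepsilon$ with e.g. $\varepsilon = 0.0001$.
2.7.3 Jacobi algorithms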
Jacobi-type algorithms are based on the theorem [27] stating that in the case of the linear
ICA model, pairwise independence implies mutual independence. This leads to
algorithms where pairwise cost functions are sequentially optimized. Such algorithms converge
when all the pairs are optimized within the limits of some predetermined convergence criterion.
The best-known Jacobi-type algorithm is probably Joint Approximate Diagonalization of
2.8 Characterization of Source Distributions
In many applications the nature of the source signals is known even if the exact source
distributions are unknown. Commonly, distributions are divided into super- and sub-Gaussian
distributions. A symmetric zero-mean distribution $f(x)$ is super-Gaussian (respectively
sub-Gaussian) if $\exists x_0 > 0 \mid \forall x \geq x_0,\ f(x) > f_G(x)$ ($f(x) < f_G(x)$ for sub-Gaussian), where
$f_G(x)$ is the normalized Gaussian pdf. In the case of unimodal symmetric sources the sign
of the kurtosis (2.25) depends on super- and sub-Gaussianity [83]. The concept of super- and
sub-Gaussianity is not very informative in the case of asymmetric or multimodal
distributions. Measures of both the skewness and the kurtosis are needed to describe asymmetric
distributions. Multimodal distributions may be characterized by the locations of the modes.
[Figure: examples of source densities, plotted over the interval from -4 to 4. Panels: (a) a super-Gaussian distribution (the GGD, equation (3.2), with a = 1.4); (b) a sub-Gaussian distribution (the GGD, equation (3.2), with a = 3.5); (c) an asymmetric distribution (centered Rayleigh(2) distribution); (d) an asymmetric bimodal distribution (mixture of two Gaussian distributions).]
2.9 Discussion
In this chapter the ICA models and terminology were reviewed. Measures for independence,
their estimators and optimization algorithms were considered. When the family of the
possible source distributions is expanded from symmetric unimodal distributions to asymmetric
and multimodal distributions, the need for source adaptation becomes obvious. The
connection between source adaptation and the minimization of mutual information was established.
This suggests the adaptive estimation of the score functions of the mixtures. Methods
Review of source adaptive ICA methods
3.1 Overview
As discussed in Chapter 2, optimal separation requires that the source distributions are
known. In practice, the source distributions are not known and need to be estimated reliably.
In the pure maximum likelihood approach the prior knowledge of the sources is refined into
a density model or an estimating function. In the adaptive maximum likelihood approach
or the mutual information approach, densities or score functions are iteratively estimated from
the data. In this chapter, methods for modeling and estimating the source distributions in
ICA are reviewed. The estimation methods may be divided into three classes:
- nonparametric methods, e.g. kernel estimation,
- parametric models for densities and score functions,
- models for estimating functions.
In this chapter, models and methods suitable for the source adaptive approach are
reviewed. The chapter provides a background for the score adaptive models that are presented
3.2 Kernel estimation of densities
An overview of kernel estimation and related nonparametric techniques is given in [104]. A
separation method with kernel estimates for the source densities is proposed in [97]. Kernel
estimation of densities is also applied to the nonlinear ICA problem [110, 111]. The kernel
density estimate [104] is defined by

\[ \hat{f}_i(u) = \frac{1}{T}\sum_{t=1}^{T} \frac{1}{h_T}\, K\!\left(\frac{u - y_i(t)}{h_T}\right), \tag{3.1} \]

where $K$ is the kernel function and $h_T$ is a bin-width parameter depending on the number
of observations $T$. To guarantee that $\hat{f}_i(u)$ is a density, it suffices to take $K$ to be a density itself.
The bin-width parameter affects the smoothness of the estimate. Pham [97] provides a
detailed theoretical analysis of the use of kernel estimates in ICA.
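The kernel estimate (3.1) can be sketched in a few lines. The example below assumes a Gaussian kernel and a user-supplied bin-width; it also derives a score estimate $-\hat{f}'/\hat{f}$ by a central difference, which is only one hypothetical way of obtaining a score from a density estimate and is not the discretization scheme of [97]. All names are illustrative.

```python
import math

def gaussian_kernel(v):
    """Standard Gaussian density, used here as the kernel (assumption)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def kernel_density(u, samples, h):
    """Kernel estimate (3.1) at the point u with bin-width h."""
    T = len(samples)
    return sum(gaussian_kernel((u - y) / h) for y in samples) / (T * h)

def kernel_score(u, samples, h, eps=1e-5):
    """Score -f'(u)/f(u) of the estimate, using a central difference for f'."""
    f_plus = kernel_density(u + eps, samples, h)
    f_minus = kernel_density(u - eps, samples, h)
    return -(f_plus - f_minus) / (2.0 * eps * kernel_density(u, samples, h))
```

Because the Gaussian kernel is itself a density, the estimate integrates to one for any sample, in line with the remark following (3.1).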
Some computational problems need to be solved in order to apply kernel estimation. The
integrals in the gradient of the mutual information contrast must be discretized by choosing the
spacing for the estimation grid, i.e. the points $u$ where the estimator (3.1) is computed. The
computation can be made faster using the Fast Fourier Transform (FFT) [104]. The kernel-based
method is further developed in some recent papers [119, 12].
3.3 Parametric models
3.3.1 Distribution families
The main contributions of this thesis are in using parametric families of distributions for
modeling the score functions. These methods are considered in Chapter 4 and in Publications
I-VI. Different parametric families for ICA are also employed in [21, 14, 41]. The models
used in these papers are the Generalized Gaussian Distribution (GGD) and the t-distribution.
Both are families of symmetric distributions with shape depending on the parameters. The
pdf of the GGD is defined as

\[ f(y; a, \lambda_a) = \frac{a\,\lambda_a}{2\,\Gamma(\frac{1}{a})} \exp\left(-|\lambda_a y|^a\right), \tag{3.2} \]

where $a$ is the parameter of the distribution, $\lambda_a$ a scaling factor and $\Gamma(x)$ is the Gamma function
given by

\[ \Gamma(x) = \int_0^{\infty} u^{x-1} \exp(-u)\,du. \tag{3.3} \]

The parameter $a$ controls the peakiness of the distribution. If $a = 2$, the distribution
reduces to the Gaussian distribution; if $a < 2$, the distribution is super-Gaussian; and if $a > 2$,
the distribution is sub-Gaussian. Examples are presented in Figure 2.1. The parameter $\lambda_a$
is a scaling factor controlling the variance. The score function of the GGD is given by

\[ \varphi(y_i) = a\,\lambda_a\,\mathrm{sign}(y_i)\,|\lambda_a y_i|^{a-1}. \tag{3.4} \]

The parameters of the GGD can be solved from the following moment equations:

\[ \kappa_4 = \frac{\Gamma(\frac{5}{a})\,\Gamma(\frac{1}{a})}{\Gamma(\frac{3}{a})^2} - 3, \tag{3.5} \]

\[ \lambda_a = \sqrt{\frac{\Gamma(\frac{3}{a})}{\mu_2\,\Gamma(\frac{1}{a})}}, \tag{3.6} \]

where $\kappa_4$ is the kurtosis and $\mu_2$ is the second order moment. In practice, to estimate the
parameters, the sample kurtosis is calculated from the data and the values of the parameters
$a$ and $\lambda_a$ are solved numerically.
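The numerical solution mentioned above can be sketched as follows. This is a minimal example, assuming that the shape $a$ is found by bisection on the kurtosis relation (3.5), which is decreasing in $a$ over the search interval, and that the scaling factor then follows from (3.6). The search interval, the function names and the convention used at $y = 0$ are assumptions made for the sketch.

```python
import math

def ggd_kurtosis(a):
    """Theoretical kurtosis of the GGD as a function of the shape a, cf. (3.5)."""
    return math.gamma(5.0 / a) * math.gamma(1.0 / a) / math.gamma(3.0 / a) ** 2 - 3.0

def solve_ggd_shape(kurt, lo=0.3, hi=20.0, iters=100):
    """Solve ggd_kurtosis(a) = kurt by bisection on [lo, hi] (hypothetical bounds)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ggd_kurtosis(mid) > kurt:
            lo = mid   # kurtosis still too high: a must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ggd_scale(a, mu2):
    """Scaling factor from the second order moment mu2, cf. (3.6)."""
    return math.sqrt(math.gamma(3.0 / a) / (mu2 * math.gamma(1.0 / a)))

def ggd_score(y, a, lam):
    """Score function (3.4)."""
    if y == 0.0:
        # Convention of this sketch: return 0 at the origin
        # (the expression is singular there when a < 1).
        return 0.0
    return a * lam * math.copysign(1.0, y) * abs(lam * y) ** (a - 1.0)
```

For a sample kurtosis of zero the solver returns $a \approx 2$, and with unit variance the resulting score is the linear Gaussian score $\varphi(y) = y$. Sample kurtosis values outside the range attained on the search interval are clipped to its end points.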
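Another model, the t-distribution, is familiar from the t-test [107, 106]. The pdf of the t-distribution
with $b$ degrees of freedom and the scaling factor $\lambda_b$ is

\[ f(y; b, \lambda_b) = \frac{\lambda_b\,\Gamma(\frac{b+1}{2})}{\sqrt{\pi b}\,\Gamma(\frac{b}{2})} \left(1 + \frac{\lambda_b^2 y^2}{b}\right)^{-\frac{1}{2}(b+1)}. \tag{3.7} \]

The score function of the t-distribution can be written as

\[ \varphi(y_i) = \frac{(1+b)\,y_i}{y_i^2 + \frac{b}{\lambda_b^2}}. \tag{3.8} \]

The corresponding moment equations are

\[ \kappa_4 = \frac{3\,\Gamma(\frac{b-4}{2})\,\Gamma(\frac{b}{2})}{\Gamma(\frac{b-2}{2})^2} - 3, \tag{3.9} \]

\[ \lambda_b = \sqrt{\frac{b\,\Gamma(\frac{b-2}{2})}{2\,\mu_2\,\Gamma(\frac{b}{2})}}. \tag{3.10} \]

A simpler way to estimate the parameters of the t-distribution using the Pearson system is
presented later in Section 4.2.
In [21], only the GGD is employed as a model for the sources. In [14] the choice between
the GGD and the t-distribution is made based on the sample kurtosis:

\[ \varphi(y_i) = \begin{cases} a\,\lambda_a\,\mathrm{sign}(y_i)\,|\lambda_a y_i|^{a-1}, & \text{if } \hat{\kappa}_4(y_i) \leq 0, \\[4pt] \dfrac{(1+b)\,y_i}{y_i^2 + \frac{b}{\lambda_b^2}}, & \text{if } \hat{\kappa}_4(y_i) > 0. \end{cases} \tag{3.11} \]

3.3.2 Mixture of Densities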
The Mixture of Gaussians (MOG) model is employed as the model of source densities especially
in the Bayesian approach [7, 117, 78]. The density model is the following:

\[ f(x) = \frac{\sum_j \omega_j \frac{1}{\sigma_j\sqrt{2\pi}} \exp\left(-(x - \mu_j)^2 / 2\sigma_j^2\right)}{\sum_j \omega_j}, \tag{3.12} \]

where $\mu_j$ and $\sigma_j^2$ are the mean and variance of the $j$th component and $\omega_j$ is a weighting parameter. Mixtures of
Gaussians can approximate virtually any continuous source distribution, but the number of
required Gaussians depends on the source distribution. For instance, several Gaussians are
needed to approximate a uniform density. The expectation-maximization (EM) algorithm [34]
is often used in the learning of the MOG parameters. Due to the computational complexity of
MOG-based ICA, the number of Gaussians is usually fixed to some small number. This may
limit the performance in some cases even though the performance of the method is generally
good.
Mixture of densities models are also proposed in [120, 43, 48, 81]. In [120] a mixture
of Gaussian or logistic densities is proposed. In [43] a closely related method of adaptive
activation function neurons is studied. In [48] and [81] MOG and hyperbolic-Cauchy
3.4 Adaptive nonlinearities
The methods presented in Sections 3.2 and 3.3 started from the estimation of densities. The
methods presented in this section approach the problem from a different viewpoint. Instead
of densities, estimating functions are directly worked out. As mentioned in Section 2.5, these
two approaches are theoretically equivalent. In practice, adaptive nonlinearities may have
some appealing computational properties, although ad hoc adaptation rules are often needed.
3.4.1 Polynomial expansions
Edgeworth and Gram-Charlier expansions [106] provide approximations for densities in the
vicinity of a Gaussian density. The expansions can be used to obtain approximations for
negentropy [60]

\[ \mathrm{Neg} = \frac{1}{12}\kappa_3^2 + \frac{1}{48}\kappa_4^2, \tag{3.13} \]

or for the score function of a symmetric density

\[ \varphi(s) = s - \frac{\kappa_4}{6}\left(s^3 - 3s\right), \tag{3.14} \]

where $\kappa_3$ and $\kappa_4$ are the third and fourth cumulant, respectively. Polynomial expansions
are considered e.g. in [65, 27, 5, 121]. The approximation of entropy can also be based on
functions other than polynomials, as proposed in [56]. For instance, the Gaussian density and
its derivatives may be employed. These approximations are usually more exact and more
robust than the approximations based on polynomials.
3.4.2 Basis functions
A quasi-maximum likelihood approach employing a set of arbitrary basis functions is proposed
by Pham [98] (see [17] for a brief summary). The score function is approximated by a linear
combination

\[ \varphi(y_i) = \sum_{n=1}^{N} \omega_n \varphi_n(y_i) \tag{3.15} \]

of a fixed set $\{\varphi_1, \varphi_2, \ldots, \varphi_N\}$ of arbitrary basis functions. It turns out that the weighting
parameters $\omega_1, \omega_2, \ldots, \omega_N$ can be solved without knowing the true score function. The mean
square error between the true score function and its approximation is minimized when

\[ \varphi(y_i) = \left(E\{\mathbf{R}'(y_i)\}\right)^T \left(E\{\mathbf{R}(y_i)\mathbf{R}(y_i)^T\}\right)^{-1} \mathbf{R}(y_i), \tag{3.16} \]

where $\mathbf{R}(y_i) = [\varphi_1(y_i), \varphi_2(y_i), \ldots, \varphi_N(y_i)]^T$ is the $N \times 1$ column vector of basis functions and
$\mathbf{R}'(y_i)$ is the column vector of their derivatives. In practice, the expectations are replaced
by sample averages.
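As a small worked instance of (3.16), the sketch below uses the two basis functions $\varphi_1(y) = y$ and $\varphi_2(y) = y^3$, which are chosen here only for illustration and are not prescribed by [98]. The weights solve the $2 \times 2$ system $E\{\mathbf{R}\mathbf{R}^T\}\,\boldsymbol{\omega} = E\{\mathbf{R}'\}$ with expectations replaced by sample averages.

```python
def basis_weights(samples):
    """Weights of the basis {y, y^3} from (3.16), using sample averages.

    Solves E{R R^T} w = E{R'} with R(y) = [y, y^3]^T and R'(y) = [1, 3 y^2]^T.
    The choice of basis is an assumption made for this example.
    """
    T = len(samples)
    m2 = sum(y ** 2 for y in samples) / T
    m4 = sum(y ** 4 for y in samples) / T
    m6 = sum(y ** 6 for y in samples) / T
    r1, r2 = 1.0, 3.0 * m2          # sample version of E{R'(y)}
    det = m2 * m6 - m4 * m4         # determinant of E{R R^T}
    w1 = (m6 * r1 - m4 * r2) / det  # 2 x 2 system solved by Cramer's rule
    w2 = (m2 * r2 - m4 * r1) / det
    return w1, w2

def approx_score(y, w):
    """Score approximation (3.15) for the basis {y, y^3}."""
    return w[0] * y + w[1] * y ** 3
```

With the theoretical moments of a standard Gaussian variable ($m_2 = 1$, $m_4 = 3$, $m_6 = 15$) the system gives $\omega_1 = 1$ and $\omega_2 = 0$, i.e. the linear Gaussian score.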
Algorithms where the nonlinearities are adaptively chosen on the basis of
sub/super-Gaussianity are used e.g. in [35, 49, 80]. Typically, the nonlinearities are based on functions
such as $\tanh(y)$ and $y^3$, and the sign of the nonlinearity is chosen adaptively.
3.4.3 Threshold functions and quantizers
Very simple algorithms can be constructed using adaptive threshold functions. A threshold
activation function [84] is defined as

\[ \varphi(y_i) = \begin{cases} 0, & |y_i| < b_i, \\ a_i\,\mathrm{sign}(y_i), & |y_i| \geq b_i, \end{cases} \tag{3.17} \]

where $a_i$ and $b_i$ are data dependent parameters. The threshold $b_i$ may be chosen so that the
local stability is maximized. However, this maximization requires knowledge of the source
distribution. As a practical solution, the authors in [84] propose the following updating rules:

\[ a_i(t+1) = a_i(t) - \eta_a\left(1 - \hat{\sigma}^2(y_i; t)\right), \tag{3.18} \]

\[ b_i(t+1) = b_i(t) - \eta_b\,\hat{\kappa}_4(y_i; t), \tag{3.19} \]

where $\hat{\sigma}^2(y_i; t)$ is the sample variance of $y_i$ after $t$ observations, $\hat{\kappa}_4(y_i; t)$ is the sample kurtosis
and $\eta_a$ and $\eta_b$ are the learning rates. Additionally, the values of $b_i$ are forced to the interval
$[0, 1.5]$.
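The threshold function and its adaptation can be sketched as below. The signs of the two update rules follow the reconstruction of (3.18) and (3.19) given here and should be checked against [84]; the learning-rate defaults and all names are hypothetical.

```python
def threshold_score(y, a, b):
    """Threshold activation function (3.17): 0 inside (-b, b), a * sign(y) outside."""
    if abs(y) < b:
        return 0.0
    sign = (y > 0) - (y < 0)
    return a * sign

def update_threshold_params(a, b, var_hat, kurt_hat, lr_a=0.01, lr_b=0.01):
    """One step of the update rules (3.18) and (3.19), signs as assumed above.

    var_hat and kurt_hat are the current sample variance and sample kurtosis
    of the output; lr_a and lr_b are hypothetical learning-rate values.
    """
    a_new = a - lr_a * (1.0 - var_hat)
    b_new = b - lr_b * kurt_hat
    # The threshold is forced to the interval [0, 1.5]
    b_new = min(max(b_new, 0.0), 1.5)
    return a_new, b_new
```

Under this sign convention a negative sample kurtosis raises the threshold and a positive one lowers it towards zero, where (3.17) becomes a scaled sign function.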
The simple threshold function can be generalized by introducing more thresholds and levels.
of quantizers and threshold functions is that they can be easily implemented in digital signal
processing.
3.5 Discussion
The presented estimation methods illustrate the trade-off between generality and simplicity.
Nonparametric estimation is apparently the most flexible concept. However, a certain
implementation with a fixed kernel is already a more restricted model. The critical part of
kernel estimation is the choice of the kernel function and the bin-width parameter. There
exist opposing opinions on the complexity and the computational cost of kernel estimation
in ICA [97, 60]. The speed requirement depends of course on the particular application,
but it seems that kernel estimation is a relatively complex method when compared to other
methods.
The flexibility of parametric estimation depends on the chosen distribution family.
Problems may occur if the chosen distribution family cannot model the essential features of the
actual distribution. On the other hand, if an appropriate parametric model is used, the
methods work efficiently.
The advantage of the adaptive nonlinearities is that they are computationally simple and
easy to implement. The performance depends on the source distributions. Successful
separation is expected if the nonlinearities can react to the essential features of the distributions.
Adaptive Score Models
4.1 Overview
In this chapter we introduce methods for estimating score functions adaptively. The
parametric models employed are the Pearson system and the Generalized Lambda Distribution.
Additionally, adaptive estimating functions using iterative weighting are presented. The
guidelines used for choosing an appropriate parametric model are:
1. The model should adapt to asymmetric or multimodal sources, but the performance
should not degrade in the case of unimodal symmetric source distributions.
2. The parameters of the model should be easy to estimate from the data.
3. The functional form of the score function should be easy to compute and robust against
outliers.
Asymmetric and multimodal source distributions are considered because blindness means
that we cannot restrict ourselves to symmetric sources. Asymmetric and multimodal source
distributions also occur in the key application areas, such as telecommunications and biomedical
signal processing. The requirement of easy parameter estimation is natural from the point of view
of computational efficiency and simplicity of the concept. A suitable functional form of the
4.2 Pearson System
The Pearson system is a four-parameter family of distributions defined by the differential
equation

\[ f'(x) = \frac{(x - a)\,f(x)}{b_0 + b_1 x + b_2 x^2}, \tag{4.1} \]

where $f(x)$ is a density function and $a$, $b_0$, $b_1$ and $b_2$ are the parameters of the distribution.
The Pearson system has been extensively studied in statistics. Overviews are given in [90]
and in [106]. The distribution family is named after Karl Pearson [94, 95]. The estimation
of the Pearson parameters is considered e.g. in [105, 13, 25, 26, 51, 87, 74, 96]. Some related
distributions are presented in [25, 88, 64, 89, 112].
An alternative parameterization is

\[ f'(x) = \frac{(a_1 x - a_0)\,f(x)}{b_0 + b_1 x + b_2 x^2}, \tag{4.2} \]

where $a_0$, $a_1$, $b_0$, $b_1$ and $b_2$ are the parameters of the distribution. Both parameterizations
(4.1) and (4.2) characterize the same distributions, but the expression (4.2) has the advantage
that $a_1$ can be zero and the values of the parameters are bounded when the fourth cumulant
exists. Thus, we use the parameterization (4.2). The score function of the Pearson system
is easily solved from (4.2):

\[ \varphi(x) = -\frac{f'(x)}{f(x)} = -\frac{a_1 x - a_0}{b_0 + b_1 x + b_2 x^2}. \tag{4.3} \]

The derivative of the score function is

\[ \varphi'(x) = -\frac{a_1 b_0 + a_0 b_1 + 2 a_0 b_2 x - a_1 b_2 x^2}{(b_0 + b_1 x + b_2 x^2)^2}. \tag{4.4} \]

Several well-known distributions belong to the Pearson family. For instance, for the Gaussian
distribution with mean $\mu$ and variance $\sigma^2$ the values of the parameters are
$a_0 = 12\mu(\sigma^2)^3$, $a_1 = 12(\sigma^2)^3$, $b_0 = -12(\sigma^2)^4$, $b_1 = 0$ and $b_2 = 0$. The Gamma, Beta and Student's
t-distributions also belong to the Pearson family. This is illustrated in Figure 4.1.
Figure 4.1: An illustration of the Pearson system in the ($\alpha_3^2$, $\alpha_4$)-plane. The limit for all
distributions is the line $\alpha_4 = \alpha_3^2 + 1$; the region beyond it is labeled "Impossible for all distributions". The Roman numerals refer to the traditional classification of Pearson
distributions. Types I and II are beta distributions of the first kind. The notation I(J,U) refers
to J- and U-shaped distributions and I(M) to unimodal distributions. The boundary between
I(J,U) and I(M) is the curve $4(4\alpha_4 - 3\alpha_3^2)(5\alpha_4 - 6\alpha_3^2 - 9)^2 = \alpha_3^2(\alpha_4 + 3)^2(8\alpha_4 - 9\alpha_3^2 - 12)$. Type III
is the Gamma distribution, for which $\alpha_4 = \frac{3}{2}\alpha_3^2 + 3$. Type VI is the beta distribution of the second
kind. Type V is characterized by the curve $\alpha_3^2(\alpha_4 + 3)^2 = 4(4\alpha_4 - 3\alpha_3^2)(2\alpha_4 - 3\alpha_3^2 - 6)$. Type
IV is the case where the equation $b_0 + b_1 x + b_2 x^2 = 0$ has complex roots. Type VII is the
Student's t-distribution.
problem this classification is more useful than the traditional classification (types I-VII)
[106]. The classification is presented and discussed in Publication V.
The Pearson system based blind separation algorithm, Pearson-ICA [71], was originally
proposed in Publication I and further improved in Publication V. The implementation is based
on the FastICA algorithm [55].
4.2.1 Estimation of the Pearson system parameters
The parameters of the Pearson system can be estimated using the method of moments [106].
The moment equations are derived directly from the definition (4.2):

\[ x^n (b_0 + b_1 x + b_2 x^2)\,f'(x) = x^n (a_1 x - a_0)\,f(x). \tag{4.5} \]

When the left side is integrated by parts, (4.5) leads to the recursion formula

\[ -n b_0 \mu_{n-1} - (n+1) b_1 \mu_n - (n+2) b_2 \mu_{n+1} = a_1 \mu_{n+1} - a_0 \mu_n, \tag{4.6} \]

where $\mu_n$ is the $n$th theoretical central moment. When this recursion formula is successively
applied for the values $n = 0, 1, 2, 3$, the following relationship between the parameters $a_0$, $a_1$, $b_0$,
$b_1$ and $b_2$ and the theoretical central moments $\mu_{-1} \equiv 0$, $\mu_0 \equiv 1$, $\mu_1 = 0$, $\mu_2$, $\mu_3$ and $\mu_4$ arises:

\[ a_1 = \left|10\mu_4\mu_2 - 12\mu_3^2 - 18\mu_2^3\right| \tag{4.7} \]
\[ a_0 = b_1 = -\mu_3\left(\mu_4 + 3\mu_2^2\right) \tag{4.8} \]
\[ b_0 = -\mu_2\left(4\mu_2\mu_4 - 3\mu_3^2\right) \tag{4.9} \]
\[ b_2 = -\left(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3\right). \tag{4.10} \]

When the theoretical central moments are replaced by the sample moments, the moment
estimators for the parameters $a_0$, $a_1$, $b_0$, $b_1$ and $b_2$ are obtained. The number of parameters
actually reduces to three because $b_1 = a_0$ and $a_1$ is a scaling term.
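If the approximated density is symmetric (i.e. $\mu_3 = 0$), the estimated score reduces to

\[ \varphi(x) = \frac{\left(5\mu_4 - 9\mu_2^2\right)x}{2\mu_2\mu_4 + \left(\mu_4 - 3\mu_2^2\right)x^2}. \tag{4.11} \]

It can be easily checked that when the kurtosis satisfies $\mu_4/\mu_2^2 \geq 3$, this corresponds to the t-distribution defined in (3.8),
(3.9) and (3.10).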
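The correspondence with the t-distribution can be checked numerically. The sketch below evaluates the symmetric Pearson score in the form $(5\mu_4 - 9\mu_2^2)x / (2\mu_2\mu_4 + (\mu_4 - 3\mu_2^2)x^2)$ and the t-score $(1+b)x/(x^2 + b/\lambda_b^2)$, using the theoretical moments of a scaled t-distribution with $b > 4$ degrees of freedom. The function names and the parameter values in the usage note are illustrative assumptions.

```python
def pearson_symmetric_score(x, mu2, mu4):
    """Pearson score for a symmetric density (mu3 = 0), cf. (4.11)."""
    return (5.0 * mu4 - 9.0 * mu2 ** 2) * x / (
        2.0 * mu2 * mu4 + (mu4 - 3.0 * mu2 ** 2) * x ** 2)

def t_score(x, b, lam):
    """Score of the scaled t-distribution, cf. (3.8)."""
    return (1.0 + b) * x / (x ** 2 + b / lam ** 2)

def t_moments(b, lam):
    """Second and fourth central moments of the scaled t-distribution (b > 4)."""
    mu2 = b / ((b - 2.0) * lam ** 2)
    mu4 = 3.0 * mu2 ** 2 * (b - 2.0) / (b - 4.0)
    return mu2, mu4
```

For example, with $b = 7$ and $\lambda_b = 1.3$ the two scores agree to rounding error at any $x$, and with Gaussian moments ($\mu_2 = 1$, $\mu_4 = 3$) the Pearson score reduces to $\varphi(x) = x$.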
The type of the distribution, (i), (ii) or (iii), must be recognized after the model is
estimated. For types (i) and (ii) it is possible that the estimated density is not exactly correct
and thus some observations lie outside the domain. In the ICA problem we are only
interested in finding the score function, which makes it easy to solve this problem heuristically.
First, the sample minimum and maximum can be utilized in the estimation. Alternatively,
saturated score functions (the values of the score function are bounded between suitable
4.2.2 Extensions of the Pearson system
The estimation of the Pearson system parameters can be based on sample statistics other
than the first four moments. For instance, in [87] the parameter estimation is based on the
mean, the variance, the skewness and the left (or right) boundary.
The differential equation defining the Pearson system may also be generalized. A natural
generalization is

\[ \frac{f'(x)}{f(x)} = \frac{a(x)}{b(x)}, \tag{4.12} \]

where $a(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_p x^p$ and $b(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_q x^q$ are
some polynomials of $x$. Some generalizations of this kind are considered in [25] and briefly
discussed in Publication V.
In Publication V we propose a multimodal generalization of the Pearson system defined
as follows:

\[ \frac{f'(x)}{f(x)} = \frac{a_3 x^3 + a_2 x^2 + a_1 x + a_0}{x^4 + 1}, \tag{4.13} \]

where $a_0$, $a_1$, $a_2$ and $a_3$ are the parameters of the system. The third order polynomial in
the numerator enables modeling of bimodal distributions. The fourth order polynomial in
the denominator makes sure that the score function behaves robustly when outliers are
encountered by bounding their influence. Since the denominator is always positive, the
score function does not have points of discontinuity.
The method of moments can be used to estimate the parameters of (4.13). This leads to
the use of the fifth and the sixth order sample moments, which are very sensitive to outliers.
Fortunately, some simple heuristic solutions exist for stabilizing the estimates of the fifth
and the sixth moments. One can simply set maximum values for the higher order moments
Figure 4.2: Characterization of some standardized distributions by their third and fourth
moments (axes $\alpha_3^2$ and $\alpha_4$). The EGLD family covers the area above the shaded region, which is not valid
for any distribution ("Impossible Region"). The skewness and the kurtosis of many distributions occurring in
engineering applications are pointed out: uniform, normal, exponential, gamma, Student's t, lognormal, Weibull and Rayleigh. The regions covered by the GBD and the GLD are marked "GBD Area" and "GLD Area".
4.3 Extended Generalized Lambda Distribution
The Extended Generalized Lambda Distribution (EGLD) is a large family of distributions
covering the whole space of the third and the fourth moment. The lambda distribution
was presented by Tukey [115] in 1960. The concept was generalized in the 1970s [100, 101, 99].
Its main use has been in fitting a distribution to empirical data, and in the computer
generation of different distributions. The latest extension of the family by Karian and
Dudewicz in 1996 [70] is a combination of the Generalized Lambda Distribution (GLD) and the
Generalized Beta Distribution (GBD). The space of ($\alpha_3$, $\alpha_4$) values covered by
the EGLD distribution family includes the values for all the most important distributions,
including the normal, uniform, gamma and beta distributions, as illustrated in Figure 4.2.
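The Generalized Lambda Distribution is defined by the inverse distribution function

\[ F^{-1}(p) = \lambda_1 + \frac{p^{\lambda_3} - (1-p)^{\lambda_4}}{\lambda_2}, \tag{4.14} \]

where $0 \leq p \leq 1$ and $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the parameters of the distribution. Karian and
Dudewicz [70] showed that the GLD is a valid distribution if and only if

\[ \frac{\lambda_2}{\lambda_3\,p^{\lambda_3 - 1} + \lambda_4\,(1-p)^{\lambda_4 - 1}} \geq 0. \tag{4.15} \]

The alternative Freimer-Mudholkar-Kollia-Lin (FMKL) parameterization [44] is given by

\[ F^{-1}(p) = \lambda_1 + \left(\frac{p^{\lambda_3} - 1}{\lambda_3} - \frac{(1-p)^{\lambda_4} - 1}{\lambda_4}\right) \Big/ \lambda_2. \tag{4.16} \]

The FMKL parameterization seems to have some advantages over the parameterization in
equation (4.14), but so far it has not been used for fitting the distribution to data.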
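Because the GLD is defined through its inverse distribution function, it is straightforward to evaluate and to sample from. The sketch below implements the form $F^{-1}(p) = \lambda_1 + (p^{\lambda_3} - (1-p)^{\lambda_4})/\lambda_2$ of (4.14), inverse-transform sampling, and a grid-based check of condition (4.15). The grid check is only a numerical approximation of the condition, and the parameter values in the usage note are hypothetical.

```python
import random

def gld_quantile(p, l1, l2, l3, l4):
    """Inverse distribution function (4.14) of the GLD."""
    return l1 + (p ** l3 - (1.0 - p) ** l4) / l2

def gld_sample(n, l1, l2, l3, l4, seed=0):
    """Draw n values by inverse-transform sampling (assumes l3, l4 > 0)."""
    rng = random.Random(seed)
    return [gld_quantile(rng.random(), l1, l2, l3, l4) for _ in range(n)]

def gld_is_valid(l2, l3, l4, grid=999):
    """Check condition (4.15) on an interior grid of p values only."""
    for k in range(1, grid + 1):
        p = k / (grid + 1.0)
        denom = l3 * p ** (l3 - 1.0) + l4 * (1.0 - p) ** (l4 - 1.0)
        if denom == 0.0 or l2 / denom < 0.0:
            return False
    return True
```

For instance, with $\lambda_1 = 0$, $\lambda_2 = 0.2$ and $\lambda_3 = \lambda_4 = 0.13$ the quantile function is antisymmetric around $p = 0.5$, so the median is zero and the distribution is symmetric.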
The EGLD based blind separation algorithm, EGLD-ICA [39], was originally proposed
in Publication II. The L-moment based estimation was proposed in Publication VI. The
implementation is similar to Pearson-ICA except for the score function calculation and
parameter estimation.
4.3.1 Parameter estimation via sample moments
Estimation of the GLD parameters using the method of moments is proposed in [70]. The
relationship between the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ and the moments
$\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ is
established by four nonlinear equations [70] that can be solved numerically. However, due to
the intricacy of the computational process, the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are tabulated
in [69, 39] as functions of $\alpha_3$ and $\alpha_4$ for standardized data where $\alpha_1 = 0$ and $\alpha_2 = 1$. When
the EGLD is fitted to the data, the choice between the GLD and the GBD is made based
on the values of the kurtosis and the skewness as explained in Publication II.
4.3.2 Parameter estimation via L-moments
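Other statistics can be utilized in the estimation of the parameters instead of the sample
moments. Well-known drawbacks of the higher order sample moments are the high variance
of the estimators and the lack of robustness. The concept of L-moments [52] can be seen as a
first four theoretical L-moments are defined as

\[ L_1 = \int_0^1 F^{-1}(p)\,dp \tag{4.17} \]
\[ L_2 = \int_0^1 F^{-1}(p)\,(2p - 1)\,dp \tag{4.18} \]
\[ L_3 = \int_0^1 F^{-1}(p)\,(6p^2 - 6p + 1)\,dp \tag{4.19} \]
\[ L_4 = \int_0^1 F^{-1}(p)\,(20p^3 - 30p^2 + 12p - 1)\,dp. \tag{4.20} \]

The L-moments exist if and only if the distribution has a finite mean. Furthermore, a
distribution with a finite mean is characterized by its L-moments [52]. Analogously to the
conventional moments, $L_1$ measures the location, $L_2$ measures the scaling, $L_3$ measures the
skewness and $L_4$ measures the kurtosis. Scaling invariant measures are obtained by using
L-moment ratios defined as

\[ \tau_r \triangleq L_r / L_2, \quad r = 3, 4, \ldots \tag{4.21} \]

Unlike the conventional moments, the L-moments of the GLD may be expressed in a closed
form:

\[ L_1 = \lambda_1 - \frac{1}{\lambda_2}\left(\frac{1}{1+\lambda_4} - \frac{1}{1+\lambda_3}\right) \tag{4.22} \]
\[ L_2\lambda_2 = -\frac{1}{1+\lambda_3} + \frac{2}{2+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{2}{2+\lambda_4} \tag{4.23} \]
\[ L_3\lambda_2 = \frac{1}{1+\lambda_3} - \frac{6}{2+\lambda_3} + \frac{6}{3+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{6}{2+\lambda_4} - \frac{6}{3+\lambda_4} \tag{4.24} \]
\[ L_4\lambda_2 = -\frac{1}{1+\lambda_3} + \frac{12}{2+\lambda_3} - \frac{30}{3+\lambda_3} + \frac{20}{4+\lambda_3} - \frac{1}{1+\lambda_4} + \frac{12}{2+\lambda_4} - \frac{30}{3+\lambda_4} + \frac{20}{4+\lambda_4}. \tag{4.25} \]

The details of the parameter estimation are presented in Publication VI.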
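The closed-form expressions can be cross-checked against the defining integrals (4.17)-(4.20). The following sketch does this with a simple midpoint rule applied to the GLD quantile function $\lambda_1 + (p^{\lambda_3} - (1-p)^{\lambda_4})/\lambda_2$; the function names, the number of integration points and the parameter values in the usage note are assumptions made for the illustration.

```python
def gld_quantile(p, l1, l2, l3, l4):
    """Inverse distribution function (4.14) of the GLD."""
    return l1 + (p ** l3 - (1.0 - p) ** l4) / l2

def gld_l_moments(l1, l2, l3, l4):
    """Closed-form L-moments (4.22)-(4.25) of the GLD."""
    L1 = l1 - (1.0 / (1 + l4) - 1.0 / (1 + l3)) / l2
    L2 = (-1.0 / (1 + l3) + 2.0 / (2 + l3)
          - 1.0 / (1 + l4) + 2.0 / (2 + l4)) / l2
    L3 = (1.0 / (1 + l3) - 6.0 / (2 + l3) + 6.0 / (3 + l3)
          - 1.0 / (1 + l4) + 6.0 / (2 + l4) - 6.0 / (3 + l4)) / l2
    L4 = (-1.0 / (1 + l3) + 12.0 / (2 + l3) - 30.0 / (3 + l3) + 20.0 / (4 + l3)
          - 1.0 / (1 + l4) + 12.0 / (2 + l4) - 30.0 / (3 + l4)
          + 20.0 / (4 + l4)) / l2
    return L1, L2, L3, L4

def numeric_l_moments(quantile, n=20000):
    """Midpoint-rule evaluation of the defining integrals (4.17)-(4.20)."""
    polys = (lambda p: 1.0,
             lambda p: 2 * p - 1,
             lambda p: 6 * p * p - 6 * p + 1,
             lambda p: 20 * p ** 3 - 30 * p * p + 12 * p - 1)
    sums = [0.0] * 4
    for k in range(n):
        p = (k + 0.5) / n
        q = quantile(p)
        for r in range(4):
            sums[r] += q * polys[r](p) / n
    return tuple(sums)
```

For $\lambda_1 = 0$ and $\lambda_2 = \lambda_3 = \lambda_4 = 1$ the GLD is the uniform distribution on $[-1, 1]$, and the closed form gives $L_1 = 0$, $L_2 = 1/3$, $L_3 = 0$ and $L_4 = 0$, as expected for a uniform distribution.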
Since the L-moments are linear combinations of order statistics, the variances of the
sample L-moments are usually smaller than the variances of the conventional sample
size is small. Additionally, the L-moments are more robust against outliers.
4.3.3 Other estimation techniques
In addition to the method of moments and the method of L-moments, some other techniques have
recently been proposed for the estimation of the GLD parameters. Karian and Dudewicz [37]
proposed the use of percentiles. The percentiles have desirable properties similar to those of the
L-moments, but the difference is that in the percentile method only certain order statistics are
used, whereas in the method of L-moments all order statistics are employed. This suggests
that the L-moment based estimators are more efficient than the percentile based estimators.
Purely computational methods, such as the least squares fit (Öztürk and Dale method) [92]
and the starship method [75, 91], are also applicable. The starship method has the following
three steps [75]:
1. For a set of data and a range of $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ values, apply the reverse
transformation, i.e. a data value $x$ is transformed to $F(x)$. (Note that as $F$ does not exist in
closed form for the GLD, numerical methods are needed.)
2. Calculate the value of a suitable goodness-of-fit measure for the closeness of the
resulting values to the uniform(0,1) distribution.
3. Choose the $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ values that minimize the chosen goodness-of-fit measure
to the uniform, as the fitted values.
According to the simulation results in [75], the Öztürk and Dale method and the starship method
give good estimates. The computational cost, however, is higher than in the method of
moments or in the method of L-moments.
4.4 Adaptive Estimating Functions
The adaptive estimating functions proposed in Publication VII can be presented as a weighted
sum of two estimating functions

\[ \varphi(s_i) = \omega_1\varphi_1(s_i) + \omega_2\varphi_2(s_i), \tag{4.26} \]

where $\varphi_1(s_i)$ and $\varphi_2(s_i)$ are two fixed estimating functions and $\omega_1$ and $\omega_2$ are the weighting
parameters. The corresponding objective function may be presented as

\[ \Phi(y_i; \omega_1, \omega_2) = \omega_1|\Phi_1(y_i)| + \omega_2|\Phi_2(y_i)|. \tag{4.27} \]

The idea is to iteratively update the weighting parameters in an optimal manner. The optimal
weighting is solved by maximizing an efficacy measure based on the performance analysis [17,
18, 73, 16] of contrast functions. It is usually assumed in the analysis that all the sources
are identically distributed. Local stability is found to depend on the following nonlinear
moments

\[ \vartheta_i = E\{\varphi'(s_i)\} - E\{s_i\varphi(s_i)\} \tag{4.28} \]

and the variance of the separation solution is found to depend on

\[ \rho_i = E\{\varphi(s_i)^2\} - E\{s_i\varphi(s_i)\}^2. \tag{4.29} \]

In [73] it is proposed that the following measure can be used as a performance criterion:

\[ \xi = \frac{\vartheta_i^2}{\rho_i}. \tag{4.30} \]

This measure is called BSS efficacy and it is independent of the scaling of the estimating function
$\varphi$. The BSS efficacy gives us an analytical way to compare contrast functions. The solution
maximizing BSS efficacy is given in [72] and Publication VII.
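A sample version of the efficacy measure is easy to compute once an estimating function and its derivative are given. The sketch below evaluates $\vartheta_i = E\{\varphi'(s_i)\} - E\{s_i\varphi(s_i)\}$, $\rho_i = E\{\varphi(s_i)^2\} - E\{s_i\varphi(s_i)\}^2$ and their ratio $\vartheta_i^2/\rho_i$ with expectations replaced by sample averages. The symbol names, the function signature and the assumption of zero-mean, unit-variance source samples are choices made for this illustration.

```python
def bss_efficacy(samples, phi, dphi):
    """Sample version of (4.28)-(4.30) for one source.

    samples : zero-mean, unit-variance source samples (assumption)
    phi     : estimating function
    dphi    : derivative of the estimating function
    """
    T = len(samples)
    e_dphi = sum(dphi(s) for s in samples) / T       # E{phi'(s)}
    e_sphi = sum(s * phi(s) for s in samples) / T    # E{s phi(s)}
    e_phi2 = sum(phi(s) ** 2 for s in samples) / T   # E{phi(s)^2}
    theta = e_dphi - e_sphi          # stability moment (4.28)
    rho = e_phi2 - e_sphi ** 2       # variance-related moment (4.29)
    return theta ** 2 / rho          # BSS efficacy (4.30)
```

Multiplying the estimating function by a constant $c$ scales $\vartheta_i$ by $c$ and $\rho_i$ by $c^2$, so the returned value does not change, in agreement with the scaling invariance stated above.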
4.4.1 Estimating functions based on cumulants and absolute moments
The simplest choice for the symmetric and the asymmetric objective function is to use
the cumulant based kurtosis (2.25) and skewness (2.24). In Publication VII the cumulant
performance in practice. The absolute moment [106] of the order $q$ is defined by

\[ \beta_q(y_i) = E\{|y_i - \mu|^q\}, \tag{4.31} \]

where $\mu$ is the expected value of the distribution. The even absolute moments are equal to
the conventional central moments of the same order, but the odd absolute moments cannot
be directly written in terms of the central moments. In addition, we may define the
skewed absolute moments by

\[ \gamma_q(y_i) = E\left\{(y_i - \mu)\,|y_i - \mu|^{q-1}\right\} = E\{\mathrm{sign}(y_i - \mu)\,|y_i - \mu|^q\}. \tag{4.32} \]

Analogously to the absolute moments, the odd skewed absolute moments are equal to the
conventional central moments of the same order, but the even skewed absolute moments
cannot be directly written in terms of the central moments.
The kurtosis of a distribution with unit variance can be measured by the third absolute
moment

\[ \beta_3(y_i) = E\left\{|y_i - \mu|^3\right\}. \tag{4.33} \]

As a measure for skewness we can use the second skewed absolute moment

\[ \gamma_2(y_i) = E\{|y_i - \mu|\,(y_i - \mu)\}. \tag{4.34} \]

Exploiting $\beta_3$ and $\gamma_2$ we may construct an ICA objective function. First, we find that for a
Gaussian random variable $y_i$ with $\mu = 0$ and $\sigma^2 = 1$

\[ \beta_3(y_i) = \int_{-\infty}^{\infty} |y_i|^3 \frac{1}{\sqrt{2\pi}}\, e^{-y_i^2/2}\,dy_i = 2\sqrt{\frac{2}{\pi}} \approx 1.59577 \tag{4.35} \]

and $\gamma_2(y_i) = 0$. Furthermore, we define measures resembling the cumulant based kurtosis
and skewness:

\[ \delta_3(y_i) = \beta_3\!\left(\frac{y_i - \mu}{\sigma}\right) - 2\sqrt{\frac{2}{\pi}} \tag{4.36} \]
\[ \delta_2(y_i) = \gamma_2\!\left(\frac{y_i - \mu}{\sigma}\right). \tag{4.37} \]

Based on these measures the following objective function is proposed in Publication VII:

\[ \Phi_\beta(y_i) = \omega_{\beta,1}|\delta_3(y_i)| + \omega_{\beta,2}|\delta_2(y_i)|. \tag{4.38} \]

The expressions for the optimal weighting parameters $\omega_{\beta,1}$ and $\omega_{\beta,2}$ and other details are
provided in Publication VII.
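Sample versions of the two measures can be computed directly from standardized data. The sketch below estimates the third absolute moment minus its Gaussian value $2\sqrt{2/\pi}$ and the second skewed absolute moment $E\{|z|z\}$ of the standardized variable $z$; the function and variable names are hypothetical, and the optimal weights of (4.38) are not computed here.

```python
import math

# Third absolute moment of a standard Gaussian variable, cf. (4.35)
GAUSS_ABS3 = 2.0 * math.sqrt(2.0 / math.pi)

def abs_moment_measures(samples):
    """Sample versions of the measures (4.36) and (4.37)."""
    T = len(samples)
    mean = sum(samples) / T
    std = math.sqrt(sum((y - mean) ** 2 for y in samples) / T)
    z = [(y - mean) / std for y in samples]   # standardized data
    delta3 = sum(abs(v) ** 3 for v in z) / T - GAUSS_ABS3
    delta2 = sum(abs(v) * v for v in z) / T
    return delta3, delta2
```

For a sample that is symmetric around its mean the second measure vanishes, while a right-skewed sample gives a positive value.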
4.4.2 Gaussian moments based estimating functions
The cumulant based approach can be generalized to other suitable nonlinearities [72]. The
basic idea is that the objective function is a sum of the absolute values of symmetric and
asymmetric functions. Theoretical results for arbitrary nonlinearities are difficult
to obtain and thus the validity of the objective functions must be checked in simulations.
We propose using the Gaussian moments as symmetric and asymmetric objective functions.
The Gaussian moments of order zero to three are defined by

\[ G_0(y_i; b) = e^{-b y_i^2/2} - \frac{1}{\sqrt{b+1}} \tag{4.39} \]
\[ G_1(y_i; b) = -b\,y_i\,e^{-b y_i^2/2} \tag{4.40} \]
\[ G_2(y_i; b) = \left(b^2 y_i^2 - b\right)e^{-b y_i^2/2} \tag{4.41} \]
\[ G_3(y_i; b) = \left(3b^2 y_i - b^3 y_i^3\right)e^{-b y_i^2/2}, \tag{4.42} \]

where $b$ is a positive constant. The Gaussian moments form the basis of Gram-Charlier and
Edgeworth series [106]. Usually (4.39) is given in the form

\[ G_0(y_i; b) = e^{-b y_i^2/2}. \tag{4.43} \]

The rationale behind the constant $1/\sqrt{b+1}$ becomes obvious when we consider the expected
value of $G_0(y_i)$ in the case where the distribution of $y_i$ is Gaussian ($\mu = 0$, $\sigma^2 = 1$):

\[ E\{G_0(y_i)\} = \int_{-\infty}^{\infty} e^{-b y_i^2/2} \frac{1}{\sqrt{2\pi}}\, e^{-y_i^2/2}\,dy_i - \frac{1}{\sqrt{b+1}} = 0. \tag{4.44} \]
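The properties of the Gaussian moments can be verified numerically. The sketch below implements $G_0$ to $G_3$ and approximates Gaussian expectations with a midpoint rule. It assumes that each function is the derivative of the previous one, so $G_2$ is written as $(b^2y^2 - b)e^{-by^2/2}$; the integration limits and the number of grid points are arbitrary choices for the illustration.

```python
import math

def G0(y, b=1.0):
    """Zeroth Gaussian moment with the centering constant, cf. (4.39)."""
    return math.exp(-b * y * y / 2.0) - 1.0 / math.sqrt(b + 1.0)

def G1(y, b=1.0):
    """First Gaussian moment, the derivative of G0, cf. (4.40)."""
    return -b * y * math.exp(-b * y * y / 2.0)

def G2(y, b=1.0):
    """Second Gaussian moment, written as the derivative of G1 (assumption)."""
    return (b * b * y * y - b) * math.exp(-b * y * y / 2.0)

def G3(y, b=1.0):
    """Third Gaussian moment, the derivative of G2, cf. (4.42)."""
    return (3.0 * b * b * y - b ** 3 * y ** 3) * math.exp(-b * y * y / 2.0)

def gaussian_expectation(g, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule approximation of E{g(y)} for y ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += g(y) * math.exp(-y * y / 2.0) / math.sqrt(2.0 * math.pi) * h
    return total
```

For a standard Gaussian variable the expectations of both $G_0$ and $G_1$ evaluate to zero up to numerical error, for any positive $b$.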
In addition, we notice that the expected value of the asymmetric part equals zero, $E\{G_1(y_i)\} = 0$.
The nonlinearity in expression (4.43) may be employed as a robust ICA objective function
as proposed in [57, 56]. We propose the use of $G_0$ and $G_1$ as measures of the kurtosis and
the skewness in the ICA framework. We now define theoretical measures for the kurtosis
and the skewness as follows:

\[ \delta_0(y_i; b) = E\left\{G_0\!\left(\frac{y_i - \mu}{\sigma}; b\right)\right\} \tag{4.45} \]
\[ \delta_1(y_i; b) = E\left\{G_1\!\left(\frac{y_i - \mu}{\sigma}; b\right)\right\}, \tag{4.46} \]

where $\mu$ is the expected value of $y_i$ and $\sigma$ is the standard deviation. The measures
$\delta_0$ and $\delta_1$ are analogous to $\kappa_4$ and $\kappa_3$ in the sense that they are zero for the Gaussian distribution and in
general at least one of them is nonzero for other distributions. However, $\delta_0$ and $\delta_1$ do not
measure the kurtosis and the skewness in the same sense as $\kappa_4$ and $\kappa_3$ or $\delta_3$ and $\delta_2$. For
instance, the signs of $\delta_1$ and $\kappa_3$ may differ.
Now, the objective function based on Gaussian moments with $b = 1$ can be expressed as

\[ \Phi_G(y_i) = \omega_{G,1}|G_0(y_i)| + \omega_{G,2}|G_1(y_i)|. \tag{4.47} \]

The estimating function related to the objective function (4.47) and the derivative of the
estimating function are

\[ \varphi_G(y_i) = \omega_{G,1}\,\mathrm{sign}(\delta_0)\,G_1(y_i) + \omega_{G,2}\,\mathrm{sign}(\delta_1)\,G_2(y_i) \tag{4.48} \]
\[ \varphi'_G(y_i) = \omega_{G,1}\,\mathrm{sign}(\delta_0)\,G_2(y_i) + \omega_{G,2}\,\mathrm{sign}(\delta_1)\,G_3(y_i). \tag{4.49} \]

The statistic $\mathrm{sign}(\delta_0)$ has a role similar to that of the sign of the kurtosis in many algorithms.
The sign of $\delta_0$ is either known in advance or, more practically, estimated from the data for
each source.
4.5 Performance
Several simulations presented in the original publications demonstrate the reliable