ValentinRobu ???
,HanLaPoutr´e,andSanderBohte
CWI,CenterforMathematicsandComputerScience
Kruislaan413,NL-1098SJAmsterdam,TheNetherlands
frobu,hlp,sbohteg@cwi.nl
Abstract. Thispaperprovidesacomprehensivestudyofthestructureand
dy-namicsofonlineadvertisingmarkets,mostlybasedontechniquesfromthe
emer-gentdisciplineofcomplexsystems analysis.First,welookathowthe display
rankofaURLlinkinuencesitsclickfrequency,forbothsponsoredsearchand
organicsearch.Second,westudythemarketstructurethatemergesfromthese
queries,especiallythemarketsharedistributionofdifferentadvertisers.Weshow
thatthesponsoredsearchmarketishighlyconcentrated,withlessthan5%ofall
advertisersreceivingover2/3oftheclicksinthemarket.Furthermore,weshow
thatboththenumberofadimpressionsandthenumberofclicksfollowpower
lawdistributionsofapproximately thesamecoefcient.However,wendthis
resultdoesnotholdwhenstudyingthesamedistributionofclicksperrank
posi-tion,whichshowsconsiderablevariance,mostlikelyduetothewayadvertisers
dividetheirbudgetondifferentkeywords.Finally,weturnourattentiontohow
suchsponsoredsearchdatacouldbeusedtoprovidedecisionsupporttoolsfor
biddingforcombinationsofkeywords. Weprovideamethodtovisualize
key-wordsofinterestingraphicalform,aswellasamethodtopartitionthesegraphs
toobtaindesirablesubsetsofsearchterms.
1 Introduction
Sponsoredsearch, the payment by advertisersfor clicks on text-only ads displayed
alongsidesearch engineresults, hasbecomean importantpart ofthe Web.It
repre-sents themain sourceofrevenue forlargesearch engines,suchas Google;Yahoo!;
andMicrosoft'sLive.com,andsponsoredsearchisreceivingarapidlyincreasingshare
ofadvertisingbudgetsworldwide.Buttheproblemsthatarisefromsponsoredsearch
alsopresentexcitingresearchopportunities,foreldsasdiverseaseconomics,articial
intelligenceandmulti-agentsystems.
Intheeldofmulti-agentsystems,researchershavebeenworkingforsometime
ontopicssuchasdesigningautomatedauctionbiddingstrategiesinuncertainand
com-petitiveenvironments(e.g.[4,21]).Anotheremergenteldwhichstudiedsuchtopic
isagent-basedcomputationaleconomics(ACE)[15],wheresignicantresearcheffort
hasfocusedonthe dynamicsofelectronicmarketsthroughagent-basedsimulations.
OneparticulartopicofresearchfortheACEcommunityishoworderandmacro-level
marketstructurecanemergefromthemicro-levelactionsofindividualusers.However,
???
CurrentlyintheIntelligence,Agents,MultimediaGroup,UniversityofSouthampton,UK.
?
mostexistingworkhasbeenbasedonsimulations,as thereare fewsourcesof
large-scale,empiricaldatafromreal-worldautomatedmarkets.Inthiscontext,empiricaldata
madeavailablefromsponsoredsearchprovidesanexcellentopportunitytotestthe
as-sumptionsmadeinsuchmodelsinarealmarket.
Inthispaper,whichisbasedonlarge-scaleMicrosoftsponsoredsearchdata,we
pro-videadetailedempiricalanalysisofsuchdata.Todothis,wemakeuseofseveral
tech-niquesderivedfromcomputationaleconomics,andespeciallycomplexsystemstheory.
Complexsystemsanalysis(whichwebrieyreviewbelow)hasbeenshowntobean
excellenttoolforanalyzinglargesocial,technologicalandeconomicsystems,including
websystems[19,11,6].
1.1 Thedataset
Thestudyprovidedinthispaperisbasedonalargedatasetofsponsoredsearchqueries,
obtainedfromthewebsiteLive.com 1
.Thesearchdataprovidedconsistsoftwodistinct
datasets:asetofsponsoredsearchdataset(URLsreturnedareallocatedtoadvertisers,
throughanauctionmechanism)andanorganicsearchdataset(standard,unbiasedweb
search).The sponsoredsearch dataconsistsof101,171,081distinctimpressions(i.e.
singledisplaysofadvertiserlinks,correspondingtoonewebquery),whichintotal
re-ceived7,822,292clicks.Thissponsoreddatasetwascollectedforaroughly3-month
pe-riodintheautumnof2007.Theorganicsearchdatasetconsistsof12,251,068queries,
andwascollectedinadifferent3-monthintervalin2006(thereforethetwodatasets
arechronologicallydisjoint).
Itisimportanttostressthatintheresultsreportedinthispaperarebasedmostlyon
thesponsoredsearchdataset 2
.Furthermore,thesponsoredsearchdatawehadavailable
onlyprovidespartialinformation,inordertoprotecttheprivacyofMicrosoftLive.com
customersandbusinesspartners.Forexample,wehavenoinformationaboutnancial
issues, suchthe pricesofdifferentkeywords,howmuchdifferentadvertisersbidfor
thesekeywords,thebudgetstheyallocateetc.Furthermore,whilethedatabaseprovides
ananonymizedidentierforeachuser performingaquery,thisdoesnotallowusto
traceindividualusersforanylengthoftime.
Nevertheless,onecanextractagreatdealofusefulinformationfromthedata.For
example,theidentitiesofthebidders;forwhichkeywordcombinationstheiradswere
shown(i.e.theimpressions);forwhichofthesecombinationstheyreceivedaclick;the
positiontheirsponsoredlinkwasinwhenclickedetc...Insightsgainedfromanalyzing
thisinformationformsthemaintopicofthispaper.
2 Complexsystemsanalysisappliedtothewebandeconomics
Complexsystemsrepresentsanemergingresearchdiscipline,attheintersectionof
di-verseeldssuchasAI,economics,multi-agentsimulations,butalsophysicsand
biol-ogy[2].Thegeneraltopicofstudiesintheeldofcomplexsystemsishowmacro-level
1
ThisdatawaskindlyprovidedtousbyMicrosoftresearchthroughBeyondSearchaward
2
structurecanemergefromindividual,micro-levelactionsperformedbyalargenumber
ofindividualagents(suchasinanelectronicmarket).Forwebphenomena,complex
systems techniqueshavebeen successfully usedbeforetostudyphenomena suchas
collaborativetagging[11,10]ortheformationofonlinesocialgroups[1].
Oneofthe phenomenathatare indicativetosuchcomplexdynamicsisthe
emer-genceofscale-freedistributions,suchaspowerlaws.Theemergenceofpowerlawsin
suchasystemusuallyindicatesthatsomesortofcomplexfeedbackphenomena(e.g.
suchasapreferentialattachementphenomena)isatwork.Thisisusuallyoneofthe
cri-teriausedfordescribingthesystemascomplex[2,6].Researchindisciplinessuchas
econophysicsandcomputationaleconomicsdiscusseshowsuchpowerlawscanemerge
inlarge-scaleeconomicsystems(see[6,19]foradetaileddiscussion).
2.1 Powerlaws:denition
Apowerlawisarelationshipbetweentwoscalarquantitiesxandyoftheform:
y=cx
(1)
where andc are constantscharacterizing the given powerlaw. Eq.1can also be
writtenas:
logy=logx+logc (2)
Whenwritteninthisform,afundamentalpropertyofpowerlawsbecomes
appar-ent;whenplottedinlog-logspace,powerlawsappearas straightlines.Asshownby
Newman[19]andothers,themainparameterthatcharacterizesapowerlawisitsslope
parameter.(Onalog-logscale,theconstantparameterconlygivestheverticalshift
ofthedistributionwithrespecttothey-axis.).Verticalshiftcanvarysignicantly
be-tweendifferentphenomenameasured(inthiscase,clickdistributions),whichotherwise
followthesamedynamics.Furthermore,sincethelogarithmisappliedtobothsidesof
theequation,thesizeoftheparameterdoesnotdependonthebasischosenforthe
logarithm(althoughtheshiftingconstantcisaffected).Inthelog-logplotsshowninthis
paper,wehavechosenthebasisofthelogarithmtobe2,sincewefoundgraphswith
thislowbasisthe moregraphicallyintuitive.But, inprinciple,thesameconclusions
shouldholdifwechoosethelogarithmbasistobe,e.g.eor10.
3 Inuenceofdisplayrankonclickingbehavior
Therstissuethatwestudied(forbothsponsoredandorganicsearchdata)ishowthe
positionthataURLlinkisdisplayedininuencesitschancesofreceivingaclick.Note
thatthisparticularissuehasreceivedmuchattentioninexistingliterature[8].Tobriey
explain,Microsoft'sLive.comsearchinterface(fromwhichthedatawascollected),is
structuredasfollows:
Forsponsoredsearchthereareupto8availableslots(positions)inwhichsponsored
URLlinkscanbeplaced.Threeofthesepositions(rankedas1-3)appearatthetop
ofthepage,abovetheorganicsearchresults,butdelimitedfromthosebyadifferent
0
0.5
1
1.5
2
2.5
3
16
17
18
19
20
21
22
23
Display rank (position) of sponsored link (or ad)
Number of clicks the link receives
Relative ad position vs. no of clicks (log2−log2 scales)
0
1
2
3
4
5
6
4
6
8
10
12
14
16
18
20
22
24
Relative position of clicked link (normal search link, not sponsored)
Number clicks the link receives
Relative position of clicked links vs. no of clicks (log2−log2 scales)
Fig.1.DistributionofclicksreceivedbyaURL,relativetoitspositiononthedisplay,for
spon-soredandorganicsearch.A(left-side,sponsoredsearch):Thereareupto8sponsoredadvertiser
linksdisplayed:3onthetopofthepage,and5inasidebar.B(right,organicsearch):Thereare
usually10positionsdisplayedperpage,withmultipleresultpagesappearingasplateaus.
Theorganicsearchresultsareusuallyreturnedas10URLlinks/page(ausercan
opttochangethissetting,butveryfewactuallydo).
Allthesponsoredlinksareallocatedbasedonanauction-likemechanismbetween
the setofinterestedadvertisers(suchadisplay,inanyposition iscalledin
impres-sion).However,theadvertisersonlypayiftheirlinkactuallygetsclicked-i.e.pay
perclickmodel.Theexactalgorithmusedbytheenginetodeterminethewinnersand
whichadvertisergetswhichpositionisacomplexmechanismdesignproblemandnot
alldetailsaremadepublic.However,ingeneral,itdependsonsuchfactorsastheprice
thebidderiswillingtopayperclick,therelevanceofthequerytohersetofterms,and
herpastperformanceintermsofclickthroughrate(i.e.howoftenlinksofthatuser
wereclickedinthepast,foragivenkeyword).Bycontrast,inorganicsearch,returned
resultsarerankedsimplybasedonrelevancetotheuser'squery.
3.1 Resultsondisplaypositionbiasandinterpretation
ResultsforthepositionbiasonclickdistributionareplottedinFig.1:partA(leftside)
forsponsoredsearchandpart B(rightside)fortheorganicsearch.Notethatbothof
thesearecumulativedistributions:theywereobtainedbyaddingthenumberofclicks
foralinkineachposition,irrespectiveoftheexactcontextofthequeriesorlinksthat
generatedthem.Furthermore,botharedrawninthelog-logspace.
Therearetwomainconclusionstobedrawnfromthesepictures.Forthesponsored
searchresults(Fig.1.A),thedistributionacrossthe8slotsseemstoresembleastraight
line,withaslopeparameteraprox. =2.However,suchaconclusionwouldbetoo
simplistic:thereis,infact,adifferencebetweentheslopebetweentherst3positions
positionsisaround
1
=1:4,whileforthelast5isaround
2
=2:5.Themostlikely
reason forthisdropcomesfrom theway theLive.comsearch interface isdesigned.
Therst3slotsforsponsoredsearchlinksareshownonthetopofthepage,abovethe
organicsearchresults,whilethelast5areshowninasidebarontherightofthepage.
Fig.1.Bcorrespondstothe sameplot fororganicsearch results,the maineffect
onenotices isthe presenceofseverallevels(thresholds),correspondingtoclickson
differentsearchpages.Westressthat,sincethisisalog-logplot,thedropinattention
betweensubsequentsearchpagesisindeedverylarge-abouttwoordersofmagnitude
(i.e.thetop-rankedlinkonthesecondsearchpageis,onaverage,about65timesless
likelytobeclickedthanthelast-rankedlinkontherstpage).Thedistributionof
intra-pageclicks,however,atleastfortherstpageofresults,couldberoughlyapproximated
byapowerlawofcoefcient=1:25.
Allthisraisesofcoursethequestion:whatdothesedistributionsmeanandwhat
kindofuserbehaviourcouldaccountfortheemergenceofsuchdistributionsin
spon-soredsearchresults?First, weshould pointoutthatthe fact thatwe ndpowerlaw
distributionsinthiscontextisnotcompletelysurprising.Suchdistributionshavebeen
observedinmanywebandsocialphenomena(togivejustoneexample,in
collabora-tivetaggingsystems,intheworkbyoneoftheco-authorsofthispaper[11]andothers).
Infact,anymodeloftoptobottomprobabilisticattentionbehaviour,suchasauser
scanningthelistofresultsfromtoptobottomandleavingthesitewithacertain
prob-abilitybyclickingoneofthemcouldgiverisetosuchadistribution.Ofcourse,more
ne-grainedmodelsofuserbehavior areneededtoexplainingclick behaviorinthis
context(anexampleofsuchamodelis[8]).Butfornowweleavethisissuetofurther
research,andwelookatthemaintopicofthispaperwhichisexaminingthestructure
ofthesponsoredsearchmarketitself.
4 Marketstructureattheadvertiserlevel
InthisSection,welookathowsponsoredsearchmarketsarestructured,fromthe
per-spectiveoftheparticipants(i.e.advertisersthatbuysearchslotsfortheirURLs).More
specically,westudyhowrelativemarketsharesaredistributedacrosslink-based
ad-vertisers.Wenotethatinmanymarkets,anoftencitedrule,alsoinformallyattributed
toPareto,isthat20%ofparticipantsinamarket(e.g.customersinamarketplace)drive
80%oftheactivity.Here,wecallthiseffectthemarketconcentration.
Inasponsored search market,the maincommodity whichproduces valuefor
marketparticipants(eitheradvertisersandthesearchengine)isthenumberofclicks.
Therefore,therstthingthatweplotted(rst,usingnormal,i.e.non-logarithmicaxes)
isthecumulativeshareofdifferentadvertisers(seeFig.2.A.-leftsidegraph).From
thisgraph,onecanalreadyseethatjustthe top500advertisersgetroughly66%(or
abouttwo-thirds)ofthetotal7.8millionclicksintheavailabledataset 3
.
3
Notethatanadvertiserwastaken,followingtheavailabledata,bythe domainURLof the
sponsoredlink.Thisisareasonableassumption,inthiscase.Forexample,Ebayusesmany
0
1000
2000
3000
4000
5000
0
10
20
30
40
50
60
70
80
Cumulative share of the click market of highest rated advertisers (%)
Advertiser rank by no. of clicks
Percentage of total clicks
0
2
4
6
8
10
12
14
0
5
10
15
20
25
Rank of the advertiser (log
2
scale)
No. of times the URL of the advertiser shown / clicked
Number of impressions and number of clicks received (log
2
−log
2
scale)
Impressions
Clicks
Fig.2.A(left-side):Cumulativepercentagedistributionofthenumberofclicksadvertisersin
themarketreceive,wrt.totheirrankposition,consideringthetop5000advertisersinthemarket
(normalscales).B(right):Log-logscaledistributionsofthenumberofimpressions,respectively
numberofclicks,receivedbythetop10000advertisersinthemarket.Notethatbothdistributions
followapproximately parallelpowerlaws,buttheclickdistributionslevelsoffinalongtail
aftertherst4000advertisers,whiletheimpressiondistributionhasamuchlongertail(notall
appearinginthegure).
Sinceinourdata,thereareatleast10000distinctadvertisers(mostlikely,thereare
manymore,butweonlyconsideredthetop10000),thismeansthatapercentageofless
than5%ofalladvertisershaveatwo-thirdsmarketshare.Thissuggeststhatsponsored
searchmarketsareindeedveryconcentrated,perhapsevenmoresothantraditional
real-worldmarkets.
4.1 Distributionofimpressionsvs.distributionofclicksforthetopadvertisers
Next,westudiedthedetaileddistributionofthenumbersofimpressions(i.e.displayed
URLs)andclicksontheseimpressions,forthetop10000distinctadvertisers.Results
areshowninFig.2.B.(right-handsidegraph),usingalog-logplot.
ThemaineffectthatonecanseefromFig.2.B.isthatthedistributionofimpressions
andthedistributionforclicksreceivedbytheadvertisersformtwoapproximately
paral-lel,straightlinesinthelog-logspace(i.e.theyaretwopowerlawsofapproximatelythe
sameslopecoefcient).Thereisoneimportantdifference,though,whichisthesize
ofthelong tailofthedistribution.Thedistributionofthe numberofclicks(lower
line), levelsoffafterabout 4000-5000positions.Basically,indataterms,thismeans
thatadvertisersbeyondthetop5000eachreceiveanegligiblenumberofclicks,atleast
inthedatasetweexamined.Thereason forthismaybe thattheiradsalmostalways
appearinthelowerdisplayranks,orsimplythattheybidonasetofrarelyused(or
highlyspecialised)searchkeywords.Bycontrast,thedistributionofimpressionsstill
0
2
4
6
8
10
12
6
8
10
12
14
16
18
20
Rank of the advertiser by total number of times her links were clicked
No. of times the links were clicked (all positions)
Ranking of advertisers by number of clicks received (log
2
−log
2
scale)
0
2
4
6
8
10
12
0
2
4
6
8
10
12
14
16
18
20
Rank of the advertiser by the number of times her links were clicked
Number of times links of the advertiser were clicked
Ranking of advertisers by number of clicks received, in total and for different positions
Total clicks
Position 1
Position 2
Position 3
Position 4
Position 5
Position 6
Position 7
Position 8
Fig.3.Distributionofadvertisermarketshare,basedon theirorderedrankvs.the numberof
clickstheirlinksreceive(log-logscales).Theleft-handsideplot(partA)givesthetotalnumber
ofclicksanadvertiserreceivedforallimpressionsofherlinks,regardlessofthepositionthey
werein.Theright-handside(partB)givesthenumberofclicksreceived,bothintotal,butalso
whenheradsweredisplayedonaspecicpositiononthepage(amongthe8rankedslotsofthe
sponsoredsearchinterface).
4.2 Distributionofmarketshareperdisplayrankposition
ThepreviousSectionexaminedthepowerlawdistributionsofthenumberofclickseach
advertisergetsinaggregate(i.e.overalldisplayrankshis/herlinksareshownin).Here,
welookhowanadvertiser'smarketsharedistributionisaffectedwhenbrokendownper
displayrank(anissuewealreadytouchedoninSect.3).
However,werstmakeaslightrestrictioninthenumberofadvertisersweconsider.
AsshowninSect4.1above,thereisapowerlawdistributionintheclicksreceivedby
thetop4000advertisers,advertisersrankedbeyondthispositioneachreceivea
negligi-blenumberofclicks.Therefore,inthisSection,werestrictourattentiontothetop4000
advertisers.Asthese4000advertisersreceiveover80%ofall7.8millionclicksinthe
dataset(seeFig.2.A),wedonotriskloosingmuchusefulinformation.
ResultsareshowninFig.3. First,inFig.3.A.we showagain,moreclearly,the
powerlawdistributionofthenumberofclicksforthetop4000advertisers.Notethat
thisisawidedistribution,inthesensethatitcovers4000positionsandseveralorders
ofmagnitude.On theright-handsidegraph(Fig.3.B),weshowthe samegraph,but
now,foreach advertiser,we also break down the number ofclicks received by the
positionhis/hersponsoredURLwasinwhenitwasclicked.
Surprisingly,perhaps,thesmoothpowerlawshapeisnotfollowedatthelevelof
thedisplayrank-infact,forthelowerlevelsthevariancebecomessogreatthatthe
dis-tributionbreaksdown,atthedisplayranklevel.Wehypothesizethemostlikelyreason
forthis varianceis the wayeachindividualadvertiser doesthe biddingforthe
pre-ferredkeywordsatdifferentpointsintime,orthewayhespeciesthewayhiskeyword
hencegettingthetopspot.Bycontrast,othersmayprefertohavelonger-runningads,
eveniftheydon'tgetthe topspoteverytime.Someanecdotalevidencefromonline
marketingsuggeststhatevenjusttherepeateddisplayofalinkofacertainmerchant
onthescreenmaycount:ifauserseesanadrepeatedlyinhis/herattentionspace,that
mayestablishthebrandasmoretrustworthy.
InFig.3.B,bylokingthethetop4advertisersinthisdataset,onecanalreadysee
thatusersranked2and3utilizearatherdifferentstrategythanthetrendrepresented
byusers1and4.Whiletheirtotalnumerofclicksdoesfollow,approximatelythepower
law,theyseemtoget,proportionallyspeaking,moreclicksonthetop-rankedslotonthe
pagethantherest.While,inordertopreservetheprivacyofthedata,wecannot
men-tionwhothesecompaniesare,itdoesseemthatusers2and3areactuallyaggregators
ofadvertisingdemand.Bythis,wemeanonlineadvertisingagenciesorengines(or
au-tomatedservicesofferedbytheplatformitself)thataggregatedemandfromdifferent
advertisersand dothe biddingon theirbehalf.Apparently,thisallowsthem to
cap-ture,proportionally,moreoftenthetopslotfortherequiredkeyword.Unfortunately,
however,wecannotinvestigatethisaspectfurther,sincethedatasetprovideddoesnot
containanyinformationaboutbidding,budgetsornancialinformationingeneral.
InthefollowingandlastSectionofthispaper,weturnourattentiontoasomewhat
differentproblem:howcouldweuseinsightsgainedfromanalyzingthisquerydatato
provideabiddingdecisionsupportforadvertiserstakingpartinthismarket.
5 Usingclickdatatoderivesearchtermrecommendations
ThepreviousSectionsofthispaperusedcomplexsystemsanalysistoprovidea
high-levelexaminationofthe dynamicsofsponsoredsearch markets.InthisSection,we
lookathowsuchquerylogdatacouldbeusedtooutputrecommendationstoindividual
advertisers.Suchanapproachshouldleadtoanswerstoquestionssuchas:Whatkindof
keywordcombinationslookmostpromisingtospendone'sbudgeton,suchastoattract
amaximumnumberofrelevantuserclicks?Whilethepreviousanalysisofpower-law
formationwasdoneatamacro-level,inthisSectionwetakeamorelocalperspective.
Thatis,wedonotconsiderthesetofallpossiblesearchterms,butratherasetthatis
specictoadomain.Thisisareasonablemodel:inpractice,mostadvertisers(which
aretypicallyonlinemerchants),areonlyconcernedwitharestrictedsetofkeywords
whicharerelatedtowhattheyareactuallytryingtosell.
Fortheanalysis inthispaper,wehavechosenasadomain 50keywordsrelated
tothetourismindustry(i.e.onlinebookingsoftickets,travelpackagesandsuch).The
reasonforthisisthatmuchofthisactivity isalreadyfastmovingonline(e.g.avery
substantialproportionof,forexample,ightticketsandhotelreservationsarenow
car-riedoutonline).Furthermore-andperhapsmoreimportant-therearelowbarriersof
entryandtheeldisnotdominatedbyonemajorplayer.Thiscontrasts,forexample,
otherdomains,suchas thesaleofIpods andaccessories,whereAppleStorescanbe
5.1 Derivingdistancesfromco-occurrenceinsponsoredclicklogs
Givenalarge-scalequerylog,oneofthemostusefulpiecesofinformationitprovidesis
theco-occurenceofwordsindifferentqueries.Muchpreviousworkhasobservedthat
thefactthattwosearchkeywordsfrequentlyappeartogetherinthesamequerygives
risetosomeimplicitsemanticdistancebetweenthem[11].
Inthispaper,wetakeaslightlydifferentperspectiveonthisissue,since,in
com-putingthedistances,weonlyusethosequerieswhichreceivedatleastonesponsored
searchclickforthetextads(i.e.URLs)displayedalongsidetheresults.Wearguethis
is asubtlebutvery importantdifferencefromsimplyusingco-occurenceinorganic
search logs.Thefact thatqueries containingsomecombinationofquerywordslead
toaclickonasponsoredURLimpliesnot onlyapurelysemanticdistancebetween
thosekeywords,moreimportantforanadvertiser,thefactthatuserssearchingonthose
combinationsofkeywordshavethepossibleintentionofbuyingthingsonline.
Formally,let N(T
i ;T
j
)denotethe numberoftimes twosearch terms T
i andT
j
appearjointlyinthesamequery,ifthatqueryreceivedatleastonesponsoredsearch
click.LetN(T
i
)andN(T
j
)denotethesamenumberofqueriesleadingtoaclick,in
whichtermsT
i
,respectivelyT
j
appearintotal(regardlessofothertermstheyco-occur
with).ThecosinesimilaritydistancebetweentermsT
i andT j isdenedas: Sim(T i ;T j )= N(T i ;T j ) p N(T i )N(T j ) (3)
Notethat cosinesimilarityisnottheonlywaytodenesuchadistance,butitis
averypromisingoneinmanyonlineapplicationsettings, suchas showninprevious
workbytheseauthorsandothers[11,16,23,22].
5.2 Constructingkeywordcorrelationgraphs
Themostintuitivewaytorepresentsimilaritydistancesisthroughakeywordcorrelation
graph.Theresults fromoursubsetof50travel-relatedtermsare shown inFig.4.In
thisgraph,thesize ofeachnode(representingonequeryterm)isproportionaltothe
absolutefrequencyofthekeywordinallqueriesinthelog.Thedistancesbetweenthe
nodesareproportionaltothesimilaritydistancebetweeneachpairofterms,computed
Eq.3,wherethe wholegraphisdrawnaccording toaso calledspring
embedder-typealgorithm.Inthistypeofalgorithm,edgescanbeconceivedassprings,whose
strengthisindirectlyproportionaltotheirsimilaritydistance,leadingtoclusterofedges
similartoeachothertobeshowninthesamepartofthegraph.
Thereareseveralcommercialandacademicpackagesavailabletodrawsuch
com-plexnetworks.Theonewethinkismostsuitable-andwhichwasusedforgraphFig.
4-isPajek(see[3]foradescription).Notethatnotall edgesare consideredinthe
nalgraph.Evenfor50nodes,thereare
50
2
=1225possiblepairwisesimilarities
(edges),oneforeachpotentialkeywordpair.Mostofthesedependenciesare,however,
spurious(theyrepresentjustnoiseinthedata),andouranalysisbenetsfrom using
onlythetopfraction,correspondingtothestrongestdependencies.Inthegraphshown
holiday{15.4%}
beach{5.4%}
luxury{7.1%}
hotel{8.2%}
island{3.7%}
party{9.9%}
package{4.1%}
resort{3.8%}
sun{4.4%}
mountain{5.2%}
ocean{6.7%}
nightlife{1.3%}
weather{4.8%}
vacation{8.7%}
relaxation{4.6%}
holidays{2.5%}
hiking{7.1%}
climbing{6.3%}
getaway{3.2%}
view{3.9%}
sea{6.7%}
sand{6%}
diving{5.3%}
cruise{8.6%}
show{2.6%}
swimming{10.6%}
trip{10.7%}
entertainment{4.6%}
tickets{4.6%}
ticket{11.6%}
exotic{5.6%}
tropical{6.8%}
destination{5.9%}
flight{8.8%}
cheap{29.1%}
last{7.9%}
minute{13.7%}
deal{2.9%}
visit{4%}
romantic{6%}
warm{6.5%}
tour{3.3%}
fun{6.5%}
offer{3%}
great{3.9%}
sunrise{5.1%}
sunset{4.6%}
Hawaii{5.1%}
Oahu{6.1%}
Fig.4.Visualizationofasearchtermcorrelationgraph,forasetofsearchtermsrelatedtothe
tourismindustry. Eachsearchtermisassignedonecoloreddot.Thesizeofthedots givesits
relativeweight(intotalnumberofclicks received),whilethedistances betweenthe dots are
obtainedthroughaspring-embeddertypealgorithmandareproportionaltotheco-occurrenceof
thetwosearchtermsinaquery.Eachdotismarkedwithitssuccessrate(percentageofthetotal
numberofimpressionsassociatedwiththatquerywordthatreceivedaclick).
5.3 Graphcorrelationgraphs:results
ThereareseveralconclusionsthatcanbedrawnfromthevisualizationinFig.4
con-structedbased on the Live.com sponsoredsearch query logs.First, noticethat each
nodewaslabellednotonlywiththetermorkeyworditcorrespondsto,butalsowith
theaggregateclick-throughrate(CTR),specicforthatkeyword.Basicallly,thisisthe
percentageofallthequeriesthatusedthetermwhichgeneratedatleastoneclicktoa
sponsoredsearchURLdisplayedwiththatquery.
Notethat these click-throughrates may,ata rst glance,seemon the low side:
ingeneralonlyafewpercent ofallqueries actuallylead toaclick onansponsored
(i.e.advertiser)link.Nevertheless,asasearchenginereceivesmillionsofqueriesina
rathershortperiodoftime,evena5%-10%click-throughratecanbequitesignicant.
Notethatsomekeywords(suchacheap)haveahigherclick-throughratethanothers.
Thereasonforthismaybethatpeoplesearchingforcheapthings(e.g.cheapairline
However,themostinterestingeffecttoobserveinFig.4arethetermclustersthat
emergeindifferentpartsofthegraph,fromtheapplicationofthespring-embedder
vi-sualizationalgorithm.Forexample,theleftmostpartofthegraphhas4termsrelated
toweather,suchaswarm,tropicalandexotic.Onthetopleftpartofthegraph,
onecanndtermssuchasentertainment,nightlife,partyandfun,whilevery
bottompartincludesrelatedtermsassuchclimbing,hikingandmountain.The
top-rightpartincludescommercialtermssuchas:ticket,tickets,ight,cheap,
last,minute.Thecentralpartofthegraphincludestermssuchabeach,sand,
sea,resort,ocean,islandetc.Additionally,pairsoftermsonewouldnaturally
associatedoindeedappearclosetogether,suchasromanticandgetawayand
sun-setandsunriseandocean.
Inthefollowing,wediscussanalgorithmthatcandetectsuchclustersautomatically.
Moreprecisely,wewouldlikeanalgorithmthatselectscombinationsoftagsthatlook
promisinginattractingqueriesandclicks.
5.4 Automaticidenticationofsetsofkeywords
InthisSection,weshowhowkeywordgraphscouldbeautomaticallypartitionedinto
relevantkeywordclusters.Thetechniqueweuseforthispurposeisthesocalled
com-munitydetectionalgorithm[20],alsoinspiredbycomplexsystemstheory.Innetwork
orgraph-theoreticterms,acommunityisdenedasasubsetofnodesthatareconnected
morestronglytoeachotherthantotherestofthenetwork(i.e.adisjointcluster).Ifthe
networkanalyzedisasocialnetwork(i.e.vertexesarepeople),thencommunityhas
anintuitiveinterpretation.However,thenetwork-theoreticnotionofcommunity
detec-tionalgorithmisbroader,hasbeensuccesfullyappliedtodomainssuchasnetworksof
itemsonEbay[12],publicationsonarXiv,foodwebs[20]etc.
Communitydetection: aformal discussion Let the network consideredbe
repre-sented a graph G = (V;E), when jVj = n and jEj = m. The community
de-tectionproblemcanbeformalizedas apartitioningproblem,subjecttoaconstraint.
Each v 2 V must be a assigned to exactly one group (i.e. communityor cluster)
C 1 ;C 2 ;:::C nC
,whereallclustersaredisjoint.
Inorder tocomparewhichpartitionisoptimal, themetric usedis modularity,
henceforthdenotedbyQ.Intuitively,anyedgethatinagivenpartition,hasbothendsin
thesameclustercontributestoincreasingmodularity,whileanyedgethatcutsacross
clustershasanegativeeffectonmodularity.Formally,lete
ij
;i;j=1::n
C
bethe
frac-tionofalledgeweightsinthegraphthatconnectclustersiandjandleta
i = 1 2 P j e ij
bethefractionoftheendsofedgesinthegraphthatfallwithinclusteri.Themodularity
QofagraphjGjwithrespecttoapartitionCisdenedas:
Q(G;C)= X i (e i;i a 2 i ) (4)
Informally,Q isdened as thefraction ofedges inthe network thatfall within
Algorithm1GreedyQ Partitioning:GivenagraphG = (V;E);jVj = n;jEj = m returnspartition<C 1 ;:::C nC > 1. Ci=fvig,8i=1;n 2. n C =n
3. 8i;j,eijinitializedasinEq.5
4. repeat 5. <C i ;C j >=argmax c i ;c j (e ij +e ji 2a i a j ) 6. Q=maxc i ;c j
(eij+eji 2aiaj)
7. C i =C i S C j ,C j =;//mergeC i andC j 8. nC =nC 1 9. untilQ0 10.maxQ=Q(C1;::Cn C )
Asshownin[20],ifQ=0,thenthechosenpartitioncshowsthesamemodularity
asarandomdivision.AvalueofQcloserto1isanindicatorofstrongercommunity
structure-inrealnetworks,however,thehighestreportedvalueisQ=0:75.In
prac-tice, [20]found(basedonawiderangeofempiricalstudies)thatvaluesofQabove
around0.3indicateastrongcommunitystructureforthegivennetwork.Inourcase,
theedgesthatweconsideredinthegraph(recallthatonlythestrongest150edgesare
considered)haveaweight,denedas showninEq.3above. Forthepurposeofthe
clusteringalgorithm,thisweighthastobenormalizedbythesumofallweightsinthe
system,thusweassigninitialvaluestoe
ij as: e ij = 1 P ij sim ij sim ij (5)
5.5 Thegraphpartitioningalgorithm
Thealgorithmweusetodeterminetheoptimalpartitionisthecommunity
identica-tionalgorithmdescribedin[20],formallyspecied asAlg.1above.Informally
de-scribed,thealgorithmrunsasfollows.Initially,eachofthevertexes(inourcase,each
keyword)isassignedtoitsownindividualcluster.Then,ateachiterationofthe
algo-rithm, twoclustersare selectedwhich,if merged,lead tothe highestincrease inthe
modularityQofthe partition.Ascanbe seenfromlines5-6 ofAlg. 1,because
ex-actlytwoclustersaremergedateachstep,itiseasytocomputethisincreaseinQas:
Q=(e ij +e ji 2a i a j )orQ=2(e ij a i a j )(thevalueofe ij beingsymmetric).
ThealgorithmstopswhennofurtherincreaseinQispossiblebyfurthermerging.
Notethat itis possibletospecifyanother stoppingcriteria inAlg.1, line9,e.g.
itis possibletoaskthe algorithmtoreturnaminimumnumberofclusters(subsets),
byletting thealgorithmrun untiln
C
reachesthisminimumvalue.Furthermore,this
Cluster1 Cluster2 Cluster3 Cluster4Cluster5 Cluster6 Cluster7Cluster8Cluster9
beach party package weather getaway diving cruise show last
luxury entertainment vacation exotic romanticswimming sunrise tickets minute
hotel nightlife holidays tropical sunset ticket visit
island fun destination warm cheap
resort Hawaii deal ight
sun Oahu tour
mountain offer ocean great hiking climbing sea sand
Keywordseliminatedtoincreasemodularity:holiday,holidays,relaxation,trip.
Fig.5.Optimalpartitionofthesetoftraveltermsinsemanticclusters,whenthetop150edgesare
considered.ThepartitionwasobtainedbyapplyingNewman'sautomatedcommunitydetection
algorithmtothegraphfromFig4.ThispartitionhasaclusteringcoefcientQ=0.59.
5.6 Discussionofgraphpartitioningresults
Theresultsfromthegraphpartitioningalgorithm,showingthepartitionmaximisesthe
modularityQforthissetting,isshowninFig.5.Notethatthisisnottheonlypossible
waytopartitionthisgraph-ifonewouldconsideradifferentnumberofstrongest
de-pendenciestobeginwith(inthiscaseweselectedthetop150edges,for50keywords),
oradifferentstoppingcriteria,onemaygetasomewhatdifferentresult.Furthermore,
note that somekeywords,which were very general andcould t in severalclusters
(shownbelowthegure),wereprunedinordertoimprovemodularity,througha
sepa-ratealgorithmnotshownhere.
Still,thepartitionresultsshowninFig.5matchwellwhatourintuitionwould
de-scribeasinterestingcombinationsofsearchterms,forsuchasetting.Thereisonelarge
centralcluster,oftermsthatallhavereasonablystrongrelationstoeachother,andaset
ofsmall,marginalclustersontheside.Thelargeclusterinthemiddlecouldbefurther
brokenbythepartitionalgorithm, butonlyif we forcesomeotherstop criteriathan
maximummodularity(suchasacertainnumberofdistrinctclusters).
ThepartitioninFig.5tswellwithwhatcanbegraphicallyobservedinFig.4:
actually,mostoftheclustersobtainedautomaticallyafterpartitioncanbeidentiedon
differentpartsofthegraph.Thisdoesnothavetobeaone-to-onemapping,however,
becauseina2Ddrawing,thelayoutofthenodesafterspringembeddingmayvary
6 Discussion
6.1 Contributionofthepaper&relatedwork
Ourworkcanbeseenas relatedtoseveralotherdirectionsofresearch.Similar
tech-niquestotheonesusedinthispaperhavebeensuccesfullyappliedtoanalyzelarge-scale
collaborativetaggingsystems[11]andpreferencenetworksforEbayitems[12].
Theamountofworkwhichisspecicallygearedtosponsoredsearchauctions,
es-peciallyempiricalstudies,hassofarbeenratherlimited(probablynotleastduetolack
ofextensivedatasetsinthiseld).Muchofthepreviouswork,e.g.[8]looksmostlyat
thebiasintroducedbyalink'sdisplayrankonclickingbehaviour(suchas discussed
inSect.3ofthispaper).Anotherimportantdirectionofworkusesexistingintuitions
aboutuserclickingbehaviourtodesigndifferentallocationmechanismsforthis
prob-lem-theworkof[5]isagoodexampleofthisapproach.Bycomparisontoourwork,
theapproachtakenby[5]studiesmostlyatmechanismdesignissuesarisingfrom
com-putationaladvertising,ratherthanperformanempiricalexaminationofsuchmarkets.
Onepaperthatisrelatedinscopetoours,sinceitalsoprovidesanempiricalstudyof
searchengineadvertisingmarketsis[9].Thisworktakes,however,adifferent
perspec-tiveonthisproblem,alsoduetothedifferenttypeofdatatheauthorshadavailable.By
contrasttoourwork,thedatathat[9]usecomesfromasingle,large-scaleadvertiser.
Thismeanstheydogetaccesstomoredetailedinformation(includingnancialone)
andcansaymoreaboutactualbiddingbehaviour.Bycomparison,thedataavailableto
usforthisstudydoesnotcontainanydetailednancialinformation,but,unlike[9]it
allowsustohaveagloballevelviewofthewholemarket(fromtheperspectiveofthe
searchengine,notjustasingleadvertiser).Thisprovidesveryimportantinsightsabout
thestructureofsponsoredsearchmarkets.
Finally,thereexistspreviousworkthathasappliedsimilarco-occurence-based
tech-niques toorganicsearchlogs ortaggingsystems [7,11]. However,ourfocusinthis
paper isdifferent: wedo notaimtotomerelydeducewhat isthesemantic distance
betweenkeywordsinthegeneralsense,butwhatkindofcombinationsofkeywordsare
nancially interestingforasponsoredsearchadvertisertobid on.Thisisthe reason
whythesizeofthenodesanddistancescomputedinFig.4arebuiltusingonlyqueries
whichleadtoanactualclickonasponsoredad.Basically,thisisequivalenttoltering
onlytheopinion(expressedthroughqueries)ofthesubsetofusersthatarelikelyto
buysomethingonline,ratherthanallsearchengineusers.Toourknowledge,thisisthe
rstpapertousesponsoredsearchclickdatainthisway.
6.2 Futurework
Thiswork,beingsomewhatpreliminary,leavesmanyaspectsopentofutureresearch.
Onsuchaspectwouldbeistheissueofexternalities:howthepresenceoflinksby
com-petingadvertisersinuencestheclickthroughratesofotherbidders.Asthecompetition
isbasically oncustomers'attentionspace,externalitiesplayanimportantrole inthe
efcacityofsponsoredsearchimpressions.
atthelevelofindividualsetsofkeywords.Infact,sponsoredsearchcanbeseennotonly
asonemarket,asanetworkofmarkets,sincemostadvertisersareinterestedin(andbid
on)aspecicsetofkeywordsrelatedtowhattheyareselling.Forexample,wecould
applycommunitydetectionalgorithmstopartitionnotonlysets ofsearch keywords,
butalsosetsofbidders(advertisers)interestedinthosekeywords.Thisshouldallowus
toderivemorein-depthinsightsintothestructureofsponsoredsearch.
Finally,exploringhowspecializedmechanismssuchaspricedoptions[13,14,18,
17]couldimprovetheallocationofattentionspaceinsponsoredsearchmarkets
repre-sentsanotherpromisinglineforfurtherwork.
Acknowledgements
TheauthorsthankMicrosoftResearchfortheirsupport,intheframeworkofa`Beyond
Searchaward.WealsowishtothankNicoleImmorlicaandRenatoGomes
(Nortwest-ernUniversity)formanyusefuldiscussionsinthepreliminarystagesofthiswork.
References
1. A.Baldassarri,A.Barrat,A.Cappocci,H.Halpin,U.Lehner,J.Ramasco,V.Robu,and
D.Taraborelli. Theberners-leehypothesis:Powerlawsandgroupstructureinickr,2008.
Proc.ofDagstuhlSeminaronSocialWebCommunities,DagstuhlDROPS08391.
2. Y.Bar-Yam.Thedynamicsofcomplexsystems.WestviewPress,2003.
3. V. BatageljandA. Mrvar. Pajek -Aprogram forlargenetworkanalysis. Connections,
21:4757,1998.
4. S. M.Bohte, E.Gerding, andJ.L.Poutr´e. Market-basedrecommendation:Agents that
competeforconsumerattention.ACMTrans.InternetTechnology,4(4):420448,2004.
5. C.Borgs,J.Chayes,N.Immorlica,K.Jain,O.Etesami,andM.Mahdian. Dynamicsofbid
optimizationinonlineadvertisementauctions. InWWW'07:Proc.16thInt.Conf.World
WideWeb,pages531540.ACMPress,2007.
6. T.Carter. Ashorttripthroughentropytopowerlaws,2007. ComplexSystemsSummer
School,SantaFeInstitute,NM.
7. R.L.CilibrasiandP.M.B.Vitanyi.Thegooglesimilaritydistance.IEEETrans.onKnowl.
andDataEng.,19(3):370383,2007.
8. N.Craswell,O.Zoeter,M.Taylor,andB.Ramsey. Anexperimentalcomparisonofclick
position-biasmodels.InProcof.WSDM'08,pages8794.ACMPress,2008.
9. A.GhoseandS.Yang.Analyzingsearchengineadvertising:rmbehaviorandcross-selling
inelectronicmarkets. InWWW'08:Proc.ofthe17thInt.Conf.onWorldWideWeb,pages
219226.ACMPress,2008.
10. H.Halpin,V.Robu,andH.Shepherd.Thedynamicsandsemanticsofcollaborativetagging.
InProc.ofthe1stSemanticAuthoringandAnnotationWorkshop(SAAW'06),2006.
11. H.Halpin,V.Robu,andH.Shepherd. Thecomplexdynamicsofcollaborativetagging. In
Proc.16thInt.WorldWideWebConference(WWW'07),pages211220.ACM,2007.
12. R.K.-X.Jin,D.C.Parkes,andP.J.Wolfe.AnalysisofbiddingnetworksineBay:Aggregate
preferenceidenticationthroughcommunitydetection. InProc.AAAIWorkshoponPlan,
ActivityandIntentRecognition,2007.
13. A.I.JudaandD.C.Parkes. Anoptions-basedmethodtosolvethecomposabilityproblem
14. A.I.JudaandD.C.Parkes.Thesequentialauctionproblemonebay:Anempiricalanalysis
andasolution. InProc.7thACMConf.onElectr.Commerce,pages180189.ACMPress,
June2006.
15. K.L.JuddandL.Tesfatsion.HandbookofcomputationaleconomicsII:Agent-Based
com-putationaleconomics. HandbooksinEconomicsSeries,North-Holland,the Netherlands,
2005.
16. G.Linden,B.Smith,andJ.York.Amazon.comrecommendations:item-to-itemcollaborative
ltering.InIEEEInternetComputing7(1),pages7680,2003.
17. L.Mous,V.Robu,andH.LaPoutr´e. Canpricedoptionssolvethe exposure problemin
sequentialauctions?InACMSIGEcomExchanges7(2).ACMPress,2008.
18. L.Mous,V.Robu,andH.LaPoutr´e.Usingpricedoptionstosolvetheexposureproblemin
sequentialauctions. InAgent-MediatedElectronicCommerce(AMEC'08).SpringerLNAI
(toappear),2008.
19. M. Newman. Power laws, pareto distributions and zipf's law. ContemporaryPhysics,
46:323351,2005.
20. M.E.J.Newman.Fastalgorithmfordetectingcommunitystructureinnetworks.Phys.Rev.,
E69,066133,2004.
21. V.RobuandH.LaPoutr´e.Designingbiddingstrategiesinsequentialauctionsforriskaverse
agents.InAgent-MediatedElectronicCommerce,,pages7689.SpringerLNBIP13,2007.
22. V.RobuandH.L.Poutr´e.Learningthestructureofutilitygraphsusedinmulti-issue
nego-tiationthroughcollaborativeltering. InProc.ofthePacicRimWorkshoponMulti-Agent
Systems(PRIMA'05),2005.
23. V.Robu,D.Somefun,andJ.A.L.Poutr´e.Modelingcomplexmulti-issuenegotiationsusing
utilitygraphs. InProc.ofthe4thInt.Conf.onAutonomousAgents&MultiAgentSystems