• No results found

The Complex Dynamics of Sponsored Search Markets

N/A
N/A
Protected

Academic year: 2021

Share "The Complex Dynamics of Sponsored Search Markets"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

ValentinRobu ???

,HanLaPoutr´e,andSanderBohte

CWI,CenterforMathematicsandComputerScience

Kruislaan413,NL-1098SJAmsterdam,TheNetherlands

frobu,hlp,sbohteg@cwi.nl

Abstract. Thispaperprovidesacomprehensivestudyofthestructureand

dy-namicsofonlineadvertisingmarkets,mostlybasedontechniquesfromthe

emer-gentdisciplineofcomplexsystems analysis.First,welookathowthe display

rankofaURLlinkinuencesitsclickfrequency,forbothsponsoredsearchand

organicsearch.Second,westudythemarketstructurethatemergesfromthese

queries,especiallythemarketsharedistributionofdifferentadvertisers.Weshow

thatthesponsoredsearchmarketishighlyconcentrated,withlessthan5%ofall

advertisersreceivingover2/3oftheclicksinthemarket.Furthermore,weshow

thatboththenumberofadimpressionsandthenumberofclicksfollowpower

lawdistributionsofapproximately thesamecoefcient.However,wendthis

resultdoesnotholdwhenstudyingthesamedistributionofclicksperrank

posi-tion,whichshowsconsiderablevariance,mostlikelyduetothewayadvertisers

dividetheirbudgetondifferentkeywords.Finally,weturnourattentiontohow

suchsponsoredsearchdatacouldbeusedtoprovidedecisionsupporttoolsfor

biddingforcombinationsofkeywords. Weprovideamethodtovisualize

key-wordsofinterestingraphicalform,aswellasamethodtopartitionthesegraphs

toobtaindesirablesubsetsofsearchterms.

1 Introduction

Sponsoredsearch, the payment by advertisersfor clicks on text-only ads displayed

alongsidesearch engineresults, hasbecomean importantpart ofthe Web.It

repre-sents themain sourceofrevenue forlargesearch engines,suchas Google;Yahoo!;

andMicrosoft'sLive.com,andsponsoredsearchisreceivingarapidlyincreasingshare

ofadvertisingbudgetsworldwide.Buttheproblemsthatarisefromsponsoredsearch

alsopresentexcitingresearchopportunities,foreldsasdiverseaseconomics,articial

intelligenceandmulti-agentsystems.

Intheeldofmulti-agentsystems,researchershavebeenworkingforsometime

ontopicssuchasdesigningautomatedauctionbiddingstrategiesinuncertainand

com-petitiveenvironments(e.g.[4,21]).Anotheremergenteldwhichstudiedsuchtopic

isagent-basedcomputationaleconomics(ACE)[15],wheresignicantresearcheffort

hasfocusedonthe dynamicsofelectronicmarketsthroughagent-basedsimulations.

OneparticulartopicofresearchfortheACEcommunityishoworderandmacro-level

marketstructurecanemergefromthemicro-levelactionsofindividualusers.However,

???

CurrentlyintheIntelligence,Agents,MultimediaGroup,UniversityofSouthampton,UK.

?

(2)

mostexistingworkhasbeenbasedonsimulations,as thereare fewsourcesof

large-scale,empiricaldatafromreal-worldautomatedmarkets.Inthiscontext,empiricaldata

madeavailablefromsponsoredsearchprovidesanexcellentopportunitytotestthe

as-sumptionsmadeinsuchmodelsinarealmarket.

Inthispaper,whichisbasedonlarge-scaleMicrosoftsponsoredsearchdata,we

pro-videadetailedempiricalanalysisofsuchdata.Todothis,wemakeuseofseveral

tech-niquesderivedfromcomputationaleconomics,andespeciallycomplexsystemstheory.

Complexsystemsanalysis(whichwebrieyreviewbelow)hasbeenshowntobean

excellenttoolforanalyzinglargesocial,technologicalandeconomicsystems,including

websystems[19,11,6].

1.1 Thedataset

Thestudyprovidedinthispaperisbasedonalargedatasetofsponsoredsearchqueries,

obtainedfromthewebsiteLive.com 1

.Thesearchdataprovidedconsistsoftwodistinct

datasets:asetofsponsoredsearchdataset(URLsreturnedareallocatedtoadvertisers,

throughanauctionmechanism)andanorganicsearchdataset(standard,unbiasedweb

search).The sponsoredsearch dataconsistsof101,171,081distinctimpressions(i.e.

singledisplaysofadvertiserlinks,correspondingtoonewebquery),whichintotal

re-ceived7,822,292clicks.Thissponsoreddatasetwascollectedforaroughly3-month

pe-riodintheautumnof2007.Theorganicsearchdatasetconsistsof12,251,068queries,

andwascollectedinadifferent3-monthintervalin2006(thereforethetwodatasets

arechronologicallydisjoint).

Itisimportanttostressthatintheresultsreportedinthispaperarebasedmostlyon

thesponsoredsearchdataset 2

.Furthermore,thesponsoredsearchdatawehadavailable

onlyprovidespartialinformation,inordertoprotecttheprivacyofMicrosoftLive.com

customersandbusinesspartners.Forexample,wehavenoinformationaboutnancial

issues, suchthe pricesofdifferentkeywords,howmuchdifferentadvertisersbidfor

thesekeywords,thebudgetstheyallocateetc.Furthermore,whilethedatabaseprovides

ananonymizedidentierforeachuser performingaquery,thisdoesnotallowusto

traceindividualusersforanylengthoftime.

Nevertheless,onecanextractagreatdealofusefulinformationfromthedata.For

example,theidentitiesofthebidders;forwhichkeywordcombinationstheiradswere

shown(i.e.theimpressions);forwhichofthesecombinationstheyreceivedaclick;the

positiontheirsponsoredlinkwasinwhenclickedetc...Insightsgainedfromanalyzing

thisinformationformsthemaintopicofthispaper.

2 Complexsystemsanalysisappliedtothewebandeconomics

Complexsystemsrepresentsanemergingresearchdiscipline,attheintersectionof

di-verseeldssuchasAI,economics,multi-agentsimulations,butalsophysicsand

biol-ogy[2].Thegeneraltopicofstudiesintheeldofcomplexsystemsishowmacro-level

1

ThisdatawaskindlyprovidedtousbyMicrosoftresearchthrough“BeyondSearch”award

2

(3)

structurecanemergefromindividual,micro-levelactionsperformedbyalargenumber

ofindividualagents(suchasinanelectronicmarket).Forwebphenomena,complex

systems techniqueshavebeen successfully usedbeforetostudyphenomena suchas

collaborativetagging[11,10]ortheformationofonlinesocialgroups[1].

Oneofthe phenomenathatare indicativetosuchcomplexdynamicsisthe

emer-genceofscale-freedistributions,suchaspowerlaws.Theemergenceofpowerlawsin

suchasystemusuallyindicatesthatsomesortofcomplexfeedbackphenomena(e.g.

suchasapreferentialattachementphenomena)isatwork.Thisisusuallyoneofthe

cri-teriausedfordescribingthesystemas“complex”[2,6].Researchindisciplinessuchas

econophysicsandcomputationaleconomicsdiscusseshowsuchpowerlawscanemerge

inlarge-scaleeconomicsystems(see[6,19]foradetaileddiscussion).

2.1 Powerlaws:denition

Apowerlawisarelationshipbetweentwoscalarquantitiesxandyoftheform:

y=cx

(1)

where andc are constantscharacterizing the given powerlaw. Eq.1can also be

writtenas:

logy=logx+logc (2)

Whenwritteninthisform,afundamentalpropertyofpowerlawsbecomes

appar-ent;whenplottedinlog-logspace,powerlawsappearas straightlines.Asshownby

Newman[19]andothers,themainparameterthatcharacterizesapowerlawisitsslope

parameter.(Onalog-logscale,theconstantparameterconlygivesthe“verticalshift”

ofthedistributionwithrespecttothey-axis.).Verticalshiftcanvarysignicantly

be-tweendifferentphenomenameasured(inthiscase,clickdistributions),whichotherwise

followthesamedynamics.Furthermore,sincethelogarithmisappliedtobothsidesof

theequation,thesizeoftheparameterdoesnotdependonthebasischosenforthe

logarithm(althoughtheshiftingconstantcisaffected).Inthelog-logplotsshowninthis

paper,wehavechosenthebasisofthelogarithmtobe2,sincewefoundgraphswith

thislowbasisthe moregraphicallyintuitive.But, inprinciple,thesameconclusions

shouldholdifwechoosethelogarithmbasistobe,e.g.eor10.

3 Inuenceofdisplayrankonclickingbehavior

Therstissuethatwestudied(forbothsponsoredandorganicsearchdata)ishowthe

positionthataURLlinkisdisplayedininuencesitschancesofreceivingaclick.Note

thatthisparticularissuehasreceivedmuchattentioninexistingliterature[8].Tobriey

explain,Microsoft'sLive.comsearchinterface(fromwhichthedatawascollected),is

structuredasfollows:

– Forsponsoredsearchthereareupto8availableslots(positions)inwhichsponsored

URLlinkscanbeplaced.Threeofthesepositions(rankedas1-3)appearatthetop

ofthepage,abovetheorganicsearchresults,butdelimitedfromthosebyadifferent

(4)

0

0.5

1

1.5

2

2.5

3

16

17

18

19

20

21

22

23

Display rank (position) of sponsored link (or ad)

Number of clicks the link receives

Relative ad position vs. no of clicks (log2−log2 scales)

0

1

2

3

4

5

6

4

6

8

10

12

14

16

18

20

22

24

Relative position of clicked link (normal search link, not sponsored)

Number clicks the link receives

Relative position of clicked links vs. no of clicks (log2−log2 scales)

Fig.1.DistributionofclicksreceivedbyaURL,relativetoitspositiononthedisplay,for

spon-soredandorganicsearch.A(left-side,sponsoredsearch):Thereareupto8sponsoredadvertiser

linksdisplayed:3onthetopofthepage,and5inasidebar.B(right,organicsearch):Thereare

usually10positionsdisplayedperpage,withmultipleresultpagesappearingasplateaus.

– The“organic”searchresultsareusuallyreturnedas10URLlinks/page(ausercan

opttochangethissetting,butveryfewactuallydo).

Allthesponsoredlinksareallocatedbasedonanauction-likemechanismbetween

the setofinterestedadvertisers(suchadisplay,inanyposition iscalledin

“impres-sion”).However,theadvertisersonlypayiftheirlinkactuallygetsclicked-i.e.“pay

perclick”model.Theexactalgorithmusedbytheenginetodeterminethewinnersand

whichadvertisergetswhichpositionisacomplexmechanismdesignproblemandnot

alldetailsaremadepublic.However,ingeneral,itdependsonsuchfactorsastheprice

thebidderiswillingtopayperclick,therelevanceofthequerytohersetofterms,and

herpastperformanceintermsof“clickthroughrate”(i.e.howoftenlinksofthatuser

wereclickedinthepast,foragivenkeyword).Bycontrast,inorganicsearch,returned

resultsarerankedsimplybasedonrelevancetotheuser'squery.

3.1 Resultsondisplaypositionbiasandinterpretation

ResultsforthepositionbiasonclickdistributionareplottedinFig.1:partA(leftside)

forsponsoredsearchandpart B(rightside)fortheorganicsearch.Notethatbothof

thesearecumulativedistributions:theywereobtainedbyaddingthenumberofclicks

foralinkineachposition,irrespectiveoftheexactcontextofthequeriesorlinksthat

generatedthem.Furthermore,botharedrawninthelog-logspace.

Therearetwomainconclusionstobedrawnfromthesepictures.Forthesponsored

searchresults(Fig.1.A),thedistributionacrossthe8slotsseemstoresembleastraight

line,withaslopeparameteraprox. =2.However,suchaconclusionwouldbetoo

simplistic:thereis,infact,adifferencebetweentheslopebetweentherst3positions

(5)

positionsisaround

1

=1:4,whileforthelast5isaround

2

=2:5.Themostlikely

reason forthisdropcomesfrom theway theLive.comsearch interface isdesigned.

Therst3slotsforsponsoredsearchlinksareshownonthetopofthepage,abovethe

organicsearchresults,whilethelast5areshowninasidebarontherightofthepage.

Fig.1.Bcorrespondstothe sameplot fororganicsearch results,the maineffect

onenotices isthe presenceofseverallevels(thresholds),correspondingtoclickson

differentsearchpages.Westressthat,sincethisisalog-logplot,thedropinattention

betweensubsequentsearchpagesisindeedverylarge-abouttwoordersofmagnitude

(i.e.thetop-rankedlinkonthesecondsearchpageis,onaverage,about65timesless

likelytobeclickedthanthelast-rankedlinkontherstpage).Thedistributionof

intra-pageclicks,however,atleastfortherstpageofresults,couldberoughlyapproximated

byapowerlawofcoefcient=1:25.

Allthisraisesofcoursethequestion:whatdothesedistributionsmeanandwhat

kindofuserbehaviourcouldaccountfortheemergenceofsuchdistributionsin

spon-soredsearchresults?First, weshould pointoutthatthe fact thatwe ndpowerlaw

distributionsinthiscontextisnotcompletelysurprising.Suchdistributionshavebeen

observedinmanywebandsocialphenomena(togivejustoneexample,in

collabora-tivetaggingsystems,intheworkbyoneoftheco-authorsofthispaper[11]andothers).

Infact,anymodelof“toptobottom”probabilisticattentionbehaviour,suchasauser

scanningthelistofresultsfromtoptobottomandleavingthesitewithacertain

prob-abilitybyclickingoneofthemcouldgiverisetosuchadistribution.Ofcourse,more

ne-grainedmodelsofuserbehavior areneededtoexplainingclick behaviorinthis

context(anexampleofsuchamodelis[8]).Butfornowweleavethisissuetofurther

research,andwelookatthemaintopicofthispaperwhichisexaminingthestructure

ofthesponsoredsearchmarketitself.

4 Marketstructureattheadvertiserlevel

InthisSection,welookathowsponsoredsearchmarketsarestructured,fromthe

per-spectiveoftheparticipants(i.e.advertisersthatbuysearchslotsfortheirURLs).More

specically,westudyhowrelativemarketsharesaredistributedacrosslink-based

ad-vertisers.Wenotethatinmanymarkets,anoftencitedrule,alsoinformallyattributed

toPareto,isthat20%ofparticipantsinamarket(e.g.customersinamarketplace)drive

80%oftheactivity.Here,wecallthiseffectthe“marketconcentration”.

Inasponsored search market,the main“commodity” whichproduces valuefor

marketparticipants(eitheradvertisersandthesearchengine)isthenumberofclicks.

Therefore,therstthingthatweplotted(rst,usingnormal,i.e.non-logarithmicaxes)

isthecumulativeshareofdifferentadvertisers(seeFig.2.A.-leftsidegraph).From

thisgraph,onecanalreadyseethatjustthe top500advertisersgetroughly66%(or

abouttwo-thirds)ofthetotal7.8millionclicksintheavailabledataset 3

.

3

Notethatanadvertiserwastaken,followingtheavailabledata,bythe domainURLof the

sponsoredlink.Thisisareasonableassumption,inthiscase.Forexample,Ebayusesmany

(6)

0

1000

2000

3000

4000

5000

0

10

20

30

40

50

60

70

80

Cumulative share of the click market of highest rated advertisers (%)

Advertiser rank by no. of clicks

Percentage of total clicks

0

2

4

6

8

10

12

14

0

5

10

15

20

25

Rank of the advertiser (log

2

scale)

No. of times the URL of the advertiser shown / clicked

Number of impressions and number of clicks received (log

2

−log

2

scale)

Impressions

Clicks

Fig.2.A(left-side):Cumulativepercentagedistributionofthenumberofclicksadvertisersin

themarketreceive,wrt.totheirrankposition,consideringthetop5000advertisersinthemarket

(normalscales).B(right):Log-logscaledistributionsofthenumberofimpressions,respectively

numberofclicks,receivedbythetop10000advertisersinthemarket.Notethatbothdistributions

followapproximately parallelpowerlaws,buttheclickdistributionslevelsoffina“longtail”

aftertherst4000advertisers,whiletheimpressiondistributionhasamuchlongertail(notall

appearinginthegure).

Sinceinourdata,thereareatleast10000distinctadvertisers(mostlikely,thereare

manymore,butweonlyconsideredthetop10000),thismeansthatapercentageofless

than5%ofalladvertisershaveatwo-thirdsmarketshare.Thissuggeststhatsponsored

searchmarketsareindeedveryconcentrated,perhapsevenmoresothan“traditional”

real-worldmarkets.

4.1 Distributionofimpressionsvs.distributionofclicksforthetopadvertisers

Next,westudiedthedetaileddistributionofthenumbersofimpressions(i.e.displayed

URLs)andclicksontheseimpressions,forthetop10000distinctadvertisers.Results

areshowninFig.2.B.(right-handsidegraph),usingalog-logplot.

ThemaineffectthatonecanseefromFig.2.B.isthatthedistributionofimpressions

andthedistributionforclicksreceivedbytheadvertisersformtwoapproximately

paral-lel,straightlinesinthelog-logspace(i.e.theyaretwopowerlawsofapproximatelythe

sameslopecoefcient).Thereisoneimportantdifference,though,whichisthesize

ofthe“long tail”ofthedistribution.Thedistributionofthe numberofclicks(lower

line), levelsoffafterabout 4000-5000positions.Basically,indataterms,thismeans

thatadvertisersbeyondthetop5000eachreceiveanegligiblenumberofclicks,atleast

inthedatasetweexamined.Thereason forthismaybe thattheiradsalmostalways

appearinthelowerdisplayranks,orsimplythattheybidonasetofrarelyused(or

highlyspecialised)searchkeywords.Bycontrast,thedistributionofimpressionsstill

(7)

0

2

4

6

8

10

12

6

8

10

12

14

16

18

20

Rank of the advertiser by total number of times her links were clicked

No. of times the links were clicked (all positions)

Ranking of advertisers by number of clicks received (log

2

−log

2

scale)

0

2

4

6

8

10

12

0

2

4

6

8

10

12

14

16

18

20

Rank of the advertiser by the number of times her links were clicked

Number of times links of the advertiser were clicked

Ranking of advertisers by number of clicks received, in total and for different positions

Total clicks

Position 1

Position 2

Position 3

Position 4

Position 5

Position 6

Position 7

Position 8

Fig.3.Distributionofadvertisermarketshare,basedon theirorderedrankvs.the numberof

clickstheirlinksreceive(log-logscales).Theleft-handsideplot(partA)givesthetotalnumber

ofclicksanadvertiserreceivedforallimpressionsofherlinks,regardlessofthepositionthey

werein.Theright-handside(partB)givesthenumberofclicksreceived,bothintotal,butalso

whenheradsweredisplayedonaspecicpositiononthepage(amongthe8rankedslotsofthe

sponsoredsearchinterface).

4.2 Distributionofmarketshareperdisplayrankposition

ThepreviousSectionexaminedthepowerlawdistributionsofthenumberofclickseach

advertisergetsinaggregate(i.e.overalldisplayrankshis/herlinksareshownin).Here,

welookhowanadvertiser'smarketsharedistributionisaffectedwhenbrokendownper

displayrank(anissuewealreadytouchedoninSect.3).

However,werstmakeaslightrestrictioninthenumberofadvertisersweconsider.

AsshowninSect4.1above,thereisapowerlawdistributionintheclicksreceivedby

thetop4000advertisers,advertisersrankedbeyondthispositioneachreceivea

negligi-blenumberofclicks.Therefore,inthisSection,werestrictourattentiontothetop4000

advertisers.Asthese4000advertisersreceiveover80%ofall7.8millionclicksinthe

dataset(seeFig.2.A),wedonotriskloosingmuchusefulinformation.

ResultsareshowninFig.3. First,inFig.3.A.we showagain,moreclearly,the

powerlawdistributionofthenumberofclicksforthetop4000advertisers.Notethat

thisisa“wide”distribution,inthesensethatitcovers4000positionsandseveralorders

ofmagnitude.On theright-handsidegraph(Fig.3.B),weshowthe samegraph,but

now,foreach advertiser,we also break down the number ofclicks received by the

positionhis/hersponsoredURLwasinwhenitwasclicked.

Surprisingly,perhaps,thesmoothpowerlawshapeisnotfollowedatthelevelof

thedisplayrank-infact,forthelowerlevelsthevariancebecomessogreatthatthe

dis-tributionbreaksdown,atthedisplayranklevel.Wehypothesizethemostlikelyreason

forthis varianceis the wayeachindividualadvertiser doesthe biddingforthe

pre-ferredkeywordsatdifferentpointsintime,orthewayhespeciesthewayhiskeyword

(8)

hencegettingthetopspot.Bycontrast,othersmayprefertohavelonger-runningads,

eveniftheydon'tgetthe topspoteverytime.Someanecdotalevidencefromonline

marketingsuggeststhatevenjusttherepeateddisplayofalinkofacertainmerchant

onthescreenmaycount:ifauserseesanadrepeatedlyinhis/herattentionspace,that

mayestablishthebrandasmoretrustworthy.

InFig.3.B,bylokingthethetop4advertisersinthisdataset,onecanalreadysee

thatusersranked2and3utilizearatherdifferentstrategythan“thetrend”represented

byusers1and4.Whiletheirtotalnumerofclicksdoesfollow,approximatelythepower

law,theyseemtoget,proportionallyspeaking,moreclicksonthetop-rankedslotonthe

pagethantherest.While,inordertopreservetheprivacyofthedata,wecannot

men-tionwhothesecompaniesare,itdoesseemthatusers2and3areactually“aggregators”

ofadvertisingdemand.Bythis,wemeanonlineadvertisingagenciesorengines(or

au-tomatedservicesofferedbytheplatformitself)thataggregatedemandfromdifferent

advertisersand dothe biddingon theirbehalf.Apparently,thisallowsthem to

cap-ture,proportionally,moreoftenthetopslotfortherequiredkeyword.Unfortunately,

however,wecannotinvestigatethisaspectfurther,sincethedatasetprovideddoesnot

containanyinformationaboutbidding,budgetsornancialinformationingeneral.

InthefollowingandlastSectionofthispaper,weturnourattentiontoasomewhat

differentproblem:howcouldweuseinsightsgainedfromanalyzingthisquerydatato

provideabiddingdecisionsupportforadvertiserstakingpartinthismarket.

5 Usingclickdatatoderivesearchtermrecommendations

ThepreviousSectionsofthispaperusedcomplexsystemsanalysistoprovidea

high-levelexaminationofthe dynamicsofsponsoredsearch markets.InthisSection,we

lookathowsuchquerylogdatacouldbeusedtooutputrecommendationstoindividual

advertisers.Suchanapproachshouldleadtoanswerstoquestionssuchas:Whatkindof

keywordcombinationslookmostpromisingtospendone'sbudgeton,suchastoattract

amaximumnumberofrelevantuserclicks?Whilethepreviousanalysisofpower-law

formationwasdoneatamacro-level,inthisSectionwetakeamorelocalperspective.

Thatis,wedonotconsiderthesetofallpossiblesearchterms,butratherasetthatis

specictoadomain.Thisisareasonablemodel:inpractice,mostadvertisers(which

aretypicallyonlinemerchants),areonlyconcernedwitharestrictedsetofkeywords

whicharerelatedtowhattheyareactuallytryingtosell.

Fortheanalysis inthispaper,wehavechosenasadomain 50keywordsrelated

tothetourismindustry(i.e.onlinebookingsoftickets,travelpackagesandsuch).The

reasonforthisisthatmuchofthisactivity isalreadyfastmovingonline(e.g.avery

substantialproportionof,forexample,ightticketsandhotelreservationsarenow

car-riedoutonline).Furthermore-andperhapsmoreimportant-therearelowbarriersof

entryandtheeldisnotdominatedbyonemajorplayer.Thiscontrasts,forexample,

otherdomains,suchas thesaleofIpods andaccessories,whereAppleStorescanbe

(9)

5.1 Derivingdistancesfromco-occurrenceinsponsoredclicklogs

Givenalarge-scalequerylog,oneofthemostusefulpiecesofinformationitprovidesis

theco-occurenceofwordsindifferentqueries.Muchpreviousworkhasobservedthat

thefactthattwosearchkeywordsfrequentlyappeartogetherinthesamequerygives

risetosomeimplicitsemanticdistancebetweenthem[11].

Inthispaper,wetakeaslightlydifferentperspectiveonthisissue,since,in

com-putingthedistances,weonlyusethosequerieswhichreceivedatleastonesponsored

searchclickforthetextads(i.e.URLs)displayedalongsidetheresults.Wearguethis

is asubtlebutvery importantdifferencefromsimplyusingco-occurenceinorganic

search logs.Thefact thatqueries containingsomecombinationofquerywordslead

toaclickonasponsoredURLimpliesnot onlyapurelysemanticdistancebetween

thosekeywords,moreimportantforanadvertiser,thefactthatuserssearchingonthose

combinationsofkeywordshavethepossibleintentionofbuyingthingsonline.

Formally,let N(T

i ;T

j

)denotethe numberoftimes twosearch terms T

i andT

j

appearjointlyinthesamequery,ifthatqueryreceivedatleastonesponsoredsearch

click.LetN(T

i

)andN(T

j

)denotethesamenumberofqueriesleadingtoaclick,in

whichtermsT

i

,respectivelyT

j

appearintotal(regardlessofothertermstheyco-occur

with).ThecosinesimilaritydistancebetweentermsT

i andT j isdenedas: Sim(T i ;T j )= N(T i ;T j ) p N(T i )N(T j ) (3)

Notethat cosinesimilarityisnottheonlywaytodenesuchadistance,butitis

averypromisingoneinmanyonlineapplicationsettings, suchas showninprevious

workbytheseauthorsandothers[11,16,23,22].

5.2 Constructingkeywordcorrelationgraphs

Themostintuitivewaytorepresentsimilaritydistancesisthroughakeywordcorrelation

graph.Theresults fromoursubsetof50travel-relatedtermsare shown inFig.4.In

thisgraph,thesize ofeachnode(representingonequeryterm)isproportionaltothe

absolutefrequencyofthekeywordinallqueriesinthelog.Thedistancesbetweenthe

nodesareproportionaltothesimilaritydistancebetweeneachpairofterms,computed

Eq.3,wherethe wholegraphisdrawnaccording toaso called“spring

embedder”-typealgorithm.Inthistypeofalgorithm,edgescanbeconceivedas“springs”,whose

strengthisindirectlyproportionaltotheirsimilaritydistance,leadingtoclusterofedges

similartoeachothertobeshowninthesamepartofthegraph.

Thereareseveralcommercialandacademicpackagesavailabletodrawsuch

com-plexnetworks.Theonewethinkismostsuitable-andwhichwasusedforgraphFig.

4-isPajek(see[3]foradescription).Notethatnotall edgesare consideredinthe

nalgraph.Evenfor50nodes,thereare

50

2

=1225possiblepairwisesimilarities

(edges),oneforeachpotentialkeywordpair.Mostofthesedependenciesare,however,

spurious(theyrepresentjustnoiseinthedata),andouranalysisbenetsfrom using

onlythetopfraction,correspondingtothestrongestdependencies.Inthegraphshown

(10)

holiday{15.4%}

beach{5.4%}

luxury{7.1%}

hotel{8.2%}

island{3.7%}

party{9.9%}

package{4.1%}

resort{3.8%}

sun{4.4%}

mountain{5.2%}

ocean{6.7%}

nightlife{1.3%}

weather{4.8%}

vacation{8.7%}

relaxation{4.6%}

holidays{2.5%}

hiking{7.1%}

climbing{6.3%}

getaway{3.2%}

view{3.9%}

sea{6.7%}

sand{6%}

diving{5.3%}

cruise{8.6%}

show{2.6%}

swimming{10.6%}

trip{10.7%}

entertainment{4.6%}

tickets{4.6%}

ticket{11.6%}

exotic{5.6%}

tropical{6.8%}

destination{5.9%}

flight{8.8%}

cheap{29.1%}

last{7.9%}

minute{13.7%}

deal{2.9%}

visit{4%}

romantic{6%}

warm{6.5%}

tour{3.3%}

fun{6.5%}

offer{3%}

great{3.9%}

sunrise{5.1%}

sunset{4.6%}

Hawaii{5.1%}

Oahu{6.1%}

Fig.4.Visualizationofasearchtermcorrelationgraph,forasetofsearchtermsrelatedtothe

tourismindustry. Eachsearchtermisassignedonecoloreddot.Thesizeofthedots givesits

relativeweight(intotalnumberofclicks received),whilethedistances betweenthe dots are

obtainedthroughaspring-embeddertypealgorithmandareproportionaltotheco-occurrenceof

thetwosearchtermsinaquery.Eachdotismarkedwithitssuccessrate(percentageofthetotal

numberofimpressionsassociatedwiththatquerywordthatreceivedaclick).

5.3 Graphcorrelationgraphs:results

ThereareseveralconclusionsthatcanbedrawnfromthevisualizationinFig.4

con-structedbased on the Live.com sponsoredsearch query logs.First, noticethat each

nodewaslabellednotonlywiththetermorkeyworditcorrespondsto,butalsowith

theaggregateclick-throughrate(CTR),specicforthatkeyword.Basicallly,thisisthe

percentageofallthequeriesthatusedthetermwhichgeneratedatleastoneclicktoa

sponsoredsearchURLdisplayedwiththatquery.

Notethat these click-throughrates may,ata rst glance,seemon the low side:

ingeneralonlyafewpercent ofallqueries actuallylead toaclick onansponsored

(i.e.advertiser)link.Nevertheless,asasearchenginereceivesmillionsofqueriesina

rathershortperiodoftime,evena5%-10%click-throughratecanbequitesignicant.

Notethatsomekeywords(sucha“cheap”)haveahigherclick-throughratethanothers.

Thereasonforthismaybethatpeoplesearchingfor“cheap”things(e.g.cheapairline

(11)

However,themostinterestingeffecttoobserveinFig.4arethetermclustersthat

emergeindifferentpartsofthegraph,fromtheapplicationofthespring-embedder

vi-sualizationalgorithm.Forexample,theleftmostpartofthegraphhas4termsrelated

toweather,suchas“warm”,“tropical”and“exotic”.Onthetopleftpartofthegraph,

onecanndtermssuchas“entertainment”,“nightlife”,“party”and“fun”,whilevery

bottompartincludesrelatedtermsassuch“climbing”,“hiking”and“mountain”.The

top-rightpartincludescommercialtermssuchas:“ticket”,“tickets”,“ight”,“cheap”,

“last”,“minute”.Thecentralpartofthegraphincludestermssucha“beach”,“sand”,

“sea”,“resort”,“ocean”,“island”etc.Additionally,pairsoftermsonewouldnaturally

associatedoindeedappearclosetogether,suchas“romantic”and“getaway”and

“sun-set”and“sunrise”and“ocean”.

Inthefollowing,wediscussanalgorithmthatcandetectsuchclustersautomatically.

Moreprecisely,wewouldlikeanalgorithmthatselectscombinationsoftagsthatlook

promisinginattractingqueriesandclicks.

5.4 Automaticidenticationofsetsofkeywords

InthisSection,weshowhowkeywordgraphscouldbeautomaticallypartitionedinto

relevantkeywordclusters.Thetechniqueweuseforthispurposeisthesocalled

“com-munitydetection”algorithm[20],alsoinspiredbycomplexsystemstheory.Innetwork

orgraph-theoreticterms,acommunityisdenedasasubsetofnodesthatareconnected

morestronglytoeachotherthantotherestofthenetwork(i.e.adisjointcluster).Ifthe

networkanalyzedisasocialnetwork(i.e.vertexesarepeople),then“community”has

anintuitiveinterpretation.However,thenetwork-theoreticnotionofcommunity

detec-tionalgorithmisbroader,hasbeensuccesfullyappliedtodomainssuchasnetworksof

itemsonEbay[12],publicationsonarXiv,foodwebs[20]etc.

Communitydetection: aformal discussion Let the network consideredbe

repre-sented a graph G = (V;E), when jVj = n and jEj = m. The community

de-tectionproblemcanbeformalizedas apartitioningproblem,subjecttoaconstraint.

Each v 2 V must be a assigned to exactly one group (i.e. communityor cluster)

C 1 ;C 2 ;:::C nC

,whereallclustersaredisjoint.

Inorder tocomparewhichpartitionis“optimal”, themetric usedis modularity,

henceforthdenotedbyQ.Intuitively,anyedgethatinagivenpartition,hasbothendsin

thesameclustercontributestoincreasingmodularity,whileanyedgethat“cutsacross”

clustershasanegativeeffectonmodularity.Formally,lete

ij

;i;j=1::n

C

bethe

frac-tionofalledgeweightsinthegraphthatconnectclustersiandjandleta

i = 1 2 P j e ij

bethefractionoftheendsofedgesinthegraphthatfallwithinclusteri.Themodularity

QofagraphjGjwithrespecttoapartitionCisdenedas:

Q(G;C)= X i (e i;i a 2 i ) (4)

Informally,Q isdened as thefraction ofedges inthe network thatfall within

(12)

Algorithm1GreedyQ Partitioning:GivenagraphG = (V;E);jVj = n;jEj = m returnspartition<C 1 ;:::C nC > 1. Ci=fvig,8i=1;n 2. n C =n

3. 8i;j,eijinitializedasinEq.5

4. repeat 5. <C i ;C j >=argmax c i ;c j (e ij +e ji 2a i a j ) 6. Q=maxc i ;c j

(eij+eji 2aiaj)

7. C i =C i S C j ,C j =;//mergeC i andC j 8. nC =nC 1 9. untilQ0 10.maxQ=Q(C1;::Cn C )

Asshownin[20],ifQ=0,thenthechosenpartitioncshowsthesamemodularity

asarandomdivision.AvalueofQcloserto1isanindicatorofstrongercommunity

structure-inrealnetworks,however,thehighestreportedvalueisQ=0:75.In

prac-tice, [20]found(basedonawiderangeofempiricalstudies)thatvaluesofQabove

around0.3indicateastrongcommunitystructureforthegivennetwork.Inourcase,

theedgesthatweconsideredinthegraph(recallthatonlythestrongest150edgesare

considered)haveaweight,denedas showninEq.3above. Forthepurposeofthe

clusteringalgorithm,thisweighthastobenormalizedbythesumofallweightsinthe

system,thusweassigninitialvaluestoe

ij as: e ij = 1 P ij sim ij sim ij (5)

5.5 Thegraphpartitioningalgorithm

Thealgorithmweusetodeterminetheoptimalpartitionisthe“community

identica-tion”algorithmdescribedin[20],formallyspecied asAlg.1above.Informally

de-scribed,thealgorithmrunsasfollows.Initially,eachofthevertexes(inourcase,each

keyword)isassignedtoitsownindividualcluster.Then,ateachiterationofthe

algo-rithm, twoclustersare selectedwhich,if merged,lead tothe highestincrease inthe

modularityQofthe partition.Ascanbe seenfromlines5-6 ofAlg. 1,because

ex-actlytwoclustersaremergedateachstep,itiseasytocomputethisincreaseinQas:

Q=(e ij +e ji 2a i a j )orQ=2(e ij a i a j )(thevalueofe ij beingsymmetric).

ThealgorithmstopswhennofurtherincreaseinQispossiblebyfurthermerging.

Notethat itis possibletospecifyanother stoppingcriteria inAlg.1, line9,e.g.

itis possibletoaskthe algorithmtoreturnaminimumnumberofclusters(subsets),

byletting thealgorithmrun untiln

C

reachesthisminimumvalue.Furthermore,this

(13)

Cluster1 Cluster2 Cluster3 Cluster4Cluster5 Cluster6 Cluster7Cluster8Cluster9

beach party package weather getaway diving cruise show last

luxury entertainment vacation exotic romanticswimming sunrise tickets minute

hotel nightlife holidays tropical sunset ticket visit

island fun destination warm cheap

resort Hawaii deal ight

sun Oahu tour

mountain offer ocean great hiking climbing sea sand

Keywordseliminatedtoincreasemodularity:holiday,holidays,relaxation,trip.

Fig.5.Optimalpartitionofthesetoftraveltermsinsemanticclusters,whenthetop150edgesare

considered.ThepartitionwasobtainedbyapplyingNewman'sautomated“communitydetection”

algorithmtothegraphfromFig4.ThispartitionhasaclusteringcoefcientQ=0.59.

5.6 Discussionofgraphpartitioningresults

Theresultsfromthegraphpartitioningalgorithm,showingthepartitionmaximisesthe

modularityQforthissetting,isshowninFig.5.Notethatthisisnottheonlypossible

waytopartitionthisgraph-ifonewouldconsideradifferentnumberofstrongest

de-pendenciestobeginwith(inthiscaseweselectedthetop150edges,for50keywords),

oradifferentstoppingcriteria,onemaygetasomewhatdifferentresult.Furthermore,

note that somekeywords,which were very general andcould t in severalclusters

(shownbelowthegure),wereprunedinordertoimprovemodularity,througha

sepa-ratealgorithmnotshownhere.

Still,thepartitionresultsshowninFig.5matchwellwhatourintuitionwould

de-scribeasinterestingcombinationsofsearchterms,forsuchasetting.Thereisonelarge

centralcluster,oftermsthatallhavereasonablystrongrelationstoeachother,andaset

ofsmall,marginalclustersontheside.Thelargeclusterinthemiddlecouldbefurther

brokenbythepartitionalgorithm, butonlyif we forcesomeotherstop criteriathan

maximummodularity(suchasacertainnumberofdistrinctclusters).

ThepartitioninFig.5tswellwithwhatcanbegraphicallyobservedinFig.4:

actually,mostoftheclustersobtainedautomaticallyafterpartitioncanbeidentiedon

differentpartsofthegraph.Thisdoesnothavetobeaone-to-onemapping,however,

becauseina2Ddrawing,thelayoutofthenodesafter“springembedding”mayvary

(14)

6 Discussion

6.1 Contributionofthepaper&relatedwork

Ourworkcanbeseenas relatedtoseveralotherdirectionsofresearch.Similar

tech-niquestotheonesusedinthispaperhavebeensuccesfullyappliedtoanalyzelarge-scale

collaborativetaggingsystems[11]andpreferencenetworksforEbayitems[12].

Theamountofworkwhichisspecicallygearedtosponsoredsearchauctions,

es-peciallyempiricalstudies,hassofarbeenratherlimited(probablynotleastduetolack

ofextensivedatasetsinthiseld).Muchofthepreviouswork,e.g.[8]looksmostlyat

thebiasintroducedbyalink'sdisplayrankonclickingbehaviour(suchas discussed

inSect.3ofthispaper).Anotherimportantdirectionofworkusesexistingintuitions

aboutuserclickingbehaviourtodesigndifferentallocationmechanismsforthis

prob-lem-theworkof[5]isagoodexampleofthisapproach.Bycomparisontoourwork,

theapproachtakenby[5]studiesmostlyatmechanismdesignissuesarisingfrom

com-putationaladvertising,ratherthanperformanempiricalexaminationofsuchmarkets.

Onepaperthatisrelatedinscopetoours,sinceitalsoprovidesanempiricalstudyof

searchengineadvertisingmarketsis[9].Thisworktakes,however,adifferent

perspec-tiveonthisproblem,alsoduetothedifferenttypeofdatatheauthorshadavailable.By

contrasttoourwork,thedatathat[9]usecomesfromasingle,large-scaleadvertiser.

Thismeanstheydogetaccesstomoredetailedinformation(includingnancialone)

andcansaymoreaboutactualbiddingbehaviour.Bycomparison,thedataavailableto

usforthisstudydoesnotcontainanydetailednancialinformation,but,unlike[9]it

allowsustohaveagloballevelviewofthewholemarket(fromtheperspectiveofthe

searchengine,notjustasingleadvertiser).Thisprovidesveryimportantinsightsabout

thestructureofsponsoredsearchmarkets.

Finally,thereexistspreviousworkthathasappliedsimilarco-occurence-based

tech-niques toorganicsearchlogs ortaggingsystems [7,11]. However,ourfocusinthis

paper isdifferent: wedo notaimtotomerelydeducewhat isthesemantic distance

betweenkeywordsinthegeneralsense,butwhatkindofcombinationsofkeywordsare

nancially interestingforasponsoredsearchadvertisertobid on.Thisisthe reason

whythesizeofthenodesanddistancescomputedinFig.4arebuiltusingonlyqueries

whichleadtoanactualclickonasponsoredad.Basically,thisisequivalenttoltering

onlythe“opinion”(expressedthroughqueries)ofthesubsetofusersthatarelikelyto

buysomethingonline,ratherthanallsearchengineusers.Toourknowledge,thisisthe

rstpapertousesponsoredsearchclickdatainthisway.

6.2 Futurework

Thiswork,beingsomewhatpreliminary,leavesmanyaspectsopentofutureresearch.

Onsuchaspectwouldbeistheissueofexternalities:howthepresenceoflinksby

com-petingadvertisersinuencestheclickthroughratesofotherbidders.Asthecompetition

isbasically oncustomers'attentionspace,externalitiesplayanimportantrole inthe

efcacityofsponsoredsearchimpressions.

(15)

atthelevelofindividualsetsofkeywords.Infact,sponsoredsearchcanbeseennotonly

asonemarket,asanetworkofmarkets,sincemostadvertisersareinterestedin(andbid

on)aspecicsetofkeywordsrelatedtowhattheyareselling.Forexample,wecould

applycommunitydetectionalgorithmstopartitionnotonlysets ofsearch keywords,

butalsosetsofbidders(advertisers)interestedinthosekeywords.Thisshouldallowus

toderivemorein-depthinsightsintothestructureofsponsoredsearch.

Finally,exploringhowspecializedmechanismssuchaspricedoptions[13,14,18,

17]couldimprovetheallocationofattentionspaceinsponsoredsearchmarkets

repre-sentsanotherpromisinglineforfurtherwork.

Acknowledgements

TheauthorsthankMicrosoftResearchfortheirsupport,intheframeworkofa`Beyond

Search”award.WealsowishtothankNicoleImmorlicaandRenatoGomes

(Nortwest-ernUniversity)formanyusefuldiscussionsinthepreliminarystagesofthiswork.

References

1. A.Baldassarri,A.Barrat,A.Cappocci,H.Halpin,U.Lehner,J.Ramasco,V.Robu,and

D.Taraborelli. Theberners-leehypothesis:Powerlawsandgroupstructureinickr,2008.

Proc.ofDagstuhlSeminaronSocialWebCommunities,DagstuhlDROPS08391.

2. Y.Bar-Yam.Thedynamicsofcomplexsystems.WestviewPress,2003.

3. V. BatageljandA. Mrvar. Pajek -Aprogram forlargenetworkanalysis. Connections,

21:47–57,1998.

4. S. M.Bohte, E.Gerding, andJ.L.Poutr´e. Market-basedrecommendation:Agents that

competeforconsumerattention.ACMTrans.InternetTechnology,4(4):420–448,2004.

5. C.Borgs,J.Chayes,N.Immorlica,K.Jain,O.Etesami,andM.Mahdian. Dynamicsofbid

optimizationinonlineadvertisementauctions. InWWW'07:Proc.16thInt.Conf.World

WideWeb,pages531–540.ACMPress,2007.

6. T.Carter. Ashorttripthroughentropytopowerlaws,2007. ComplexSystemsSummer

School,SantaFeInstitute,NM.

7. R.L.CilibrasiandP.M.B.Vitanyi.Thegooglesimilaritydistance.IEEETrans.onKnowl.

andDataEng.,19(3):370–383,2007.

8. N.Craswell,O.Zoeter,M.Taylor,andB.Ramsey. Anexperimentalcomparisonofclick

position-biasmodels.InProcof.WSDM'08,pages87–94.ACMPress,2008.

9. A.GhoseandS.Yang.Analyzingsearchengineadvertising:rmbehaviorandcross-selling

inelectronicmarkets. InWWW'08:Proc.ofthe17thInt.Conf.onWorldWideWeb,pages

219–226.ACMPress,2008.

10. H.Halpin,V.Robu,andH.Shepherd.Thedynamicsandsemanticsofcollaborativetagging.

InProc.ofthe1stSemanticAuthoringandAnnotationWorkshop(SAAW'06),2006.

11. H.Halpin,V.Robu,andH.Shepherd. Thecomplexdynamicsofcollaborativetagging. In

Proc.16thInt.WorldWideWebConference(WWW'07),pages211–220.ACM,2007.

12. R.K.-X.Jin,D.C.Parkes,andP.J.Wolfe.AnalysisofbiddingnetworksineBay:Aggregate

preferenceidenticationthroughcommunitydetection. InProc.AAAIWorkshoponPlan,

ActivityandIntentRecognition,2007.

13. A.I.JudaandD.C.Parkes. Anoptions-basedmethodtosolvethecomposabilityproblem

(16)

14. A.I.JudaandD.C.Parkes.Thesequentialauctionproblemonebay:Anempiricalanalysis

andasolution. InProc.7thACMConf.onElectr.Commerce,pages180–189.ACMPress,

June2006.

15. K.L.JuddandL.Tesfatsion.HandbookofcomputationaleconomicsII:Agent-Based

com-putationaleconomics. HandbooksinEconomicsSeries,North-Holland,the Netherlands,

2005.

16. G.Linden,B.Smith,andJ.York.Amazon.comrecommendations:item-to-itemcollaborative

ltering.InIEEEInternetComputing7(1),pages76–80,2003.

17. L.Mous,V.Robu,andH.LaPoutr´e. Canpricedoptionssolvethe exposure problemin

sequentialauctions?InACMSIGEcomExchanges7(2).ACMPress,2008.

18. L.Mous,V.Robu,andH.LaPoutr´e.Usingpricedoptionstosolvetheexposureproblemin

sequentialauctions. InAgent-MediatedElectronicCommerce(AMEC'08).SpringerLNAI

(toappear),2008.

19. M. Newman. Power laws, pareto distributions and zipf's law. ContemporaryPhysics,

46:323–351,2005.

20. M.E.J.Newman.Fastalgorithmfordetectingcommunitystructureinnetworks.Phys.Rev.,

E69,066133,2004.

21. V.RobuandH.LaPoutr´e.Designingbiddingstrategiesinsequentialauctionsforriskaverse

agents.InAgent-MediatedElectronicCommerce,,pages76–89.SpringerLNBIP13,2007.

22. V.RobuandH.L.Poutr´e.Learningthestructureofutilitygraphsusedinmulti-issue

nego-tiationthroughcollaborativeltering. InProc.ofthePacicRimWorkshoponMulti-Agent

Systems(PRIMA'05),2005.

23. V.Robu,D.Somefun,andJ.A.L.Poutr´e.Modelingcomplexmulti-issuenegotiationsusing

utilitygraphs. InProc.ofthe4thInt.Conf.onAutonomousAgents&MultiAgentSystems

References

Related documents

And we didn't, did we, Hiccup?&#34; When my father got to the bit about how Humungous the Hero had appeared out of nowhere after all those years when everybody thought he was dead,

More specifically, can the federal govern- ment set standards of performance, encourage experi- mentation in the delivery o f health care, coordinate existing

impact of past traumatic experiences of MKs on their adjustment to college their freshman year.. First, it was hypothesized that missionary kids experience more challenges

Eftekhari, On Some Iterative Methods with Memory and High Efficiency Index for Solv- ing Nonlinear Equations, International Journal of Differential Equations, vol. Eftekhari,

Хат уу буудай нъ ТгШсит ёигит зүйлд хамаарагддаг нэг настай ихэечлэн зусах хэлбэртэй үет ургамал бөгөөд уураг ихтэй, шилэрхэг үртэй, натур жин их байдгаараа

A third set of processes constituting the ePortfolio is that of the current ‘obligation to documentation’ (Rose, 1999, p. Many areas of life, from work and education to social

In this PhD thesis new organic NIR materials (both π-conjugated polymers and small molecules) based on α,β-unsubstituted meso-positioning thienyl BODIPY have been

To test whether CVC funds have different investment amounts, duration periods, and exit strategies compared with IVC funds, we construct three dependent variables at the start-up