Characterizing Locality, Evolution, and Life Span of Accesses in Enterprise Media Server Workloads
Ludmila Cherkasova
Hewlett-Packard Laboratories 1501 Page Mill Road, Palo Alto, CA 94303, USA
[email protected]
Minaxi Gupta
College of Computing, Georgia Institute of Technology
Atlanta, GA 30332, USA
[email protected]
Abstract The main issue we address in this paper is the
workloadanalysisoftoday'senterprisemediaservers. This
analysisaimstoestablishasetofproperties specicforen-
terprise media serverworkloads andto compare them with
wellknownrelatedobservationsaboutwebserverworkloads.
We propose twonew metrics tocharacterize the dynamics
and evolution of the accesses, and the rate of change in
thesiteaccesspattern,andillustratethemwiththeanalysis
oftwodierententerprisemediaserverworkloadscollected
overasignicantperiodoftime. Anothergoalofourwork-
loadanalysis studyis to developa media serverlog analy-
sistool,calledMediaMetrics,thatproducesamediaserver
traÆcaccessproleanditssystemresource usageinaway
usefultoserviceproviders.
Keywords:workloadanalysis,enterprisemediaservers,static
locality,temporallocality,sharingpatterns,dynamics,CDNs.
1. INTRODUCTION
Streamingmedia representsa newwave of richInternet
content. Video fromnews, sports, andentertainmentsites
is more popular than ever. Media servers are being used
foreducationalandtrainingpurposesbymanyuniversities.
Use of the media servers inthe enterprise environment is
catching momentum too. Enterprises are using rich me-
dia toattract prospective customers,improve eectiveness
ofonline advertising, web marketing,customerinteraction
centers,collaboration,andtraining.
Real-timenatureofmultimediacontentmakesitsensitive
tocongestion conditions inthe Internet. Moreover, multi-
mediastreamsconsumesignicantbandwidthsandrequire
orders of magnitude larger amount of storage at the me-
diaserversandproxycaches. Understandingthenatureof
mediaserverworkloadsiscrucialtoproperlydesigningand
provisioningcurrentandfutureservices.
Recently, therehave been several studiesattempting to
uncoverthemultimediaworkloadscharacteristics.However,
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
NOSSDAV’02, May 12-14, 2002, Miami, Florida, USA.
Copyright 2002 ACM 1-58113-512-2/02/0005 ...
$5.00.
mostofthestudiesaredevotedtotheanalysisofworkloads
for educational mediaservers[1, 2, 3,10, 11, 14]. Onere-
centstudy[9] characterizes the workloadofa mediaproxy
ofalargeuniversity. Thispaperpresentsandanalyzes the
enterprise mediaserverworkloadsbasedontheaccess
logs from two dierent media servers in Hewlett-Packard
Corporation. Both logs are collected over long period of
time(2.5yearsand1year 9months). Thedurationof the
logsmakesthemquiteuniqueandallowsustodiscovertyp-
icaland specicclient accesspatterns, mediaserveraccess
trends,anddynamicsandevolutionofthemediaworkload
overtime.
Webworkloadstudieshaveidentieddierenttypesoflo-
calityinwebtraÆc. Staticlocalityorconcentrationofrefer-
ences[5]observesthat10%ofthelesaccessedontheserver
typicallyaccountfor 90%oftheserverrequestsand90%of
thebytestransferred. Temporallocalityofreferences[4]im-
pliesthatrecentlyaccesseddocumentsaremorelikelytobe
referenced inthe nearfuture. Locality properties strongly
in uencethetraÆcaccesspatternsseenbythewebservers.
Onegoalofouranalysisistocharacterizelocality proper-
ties inmedia serverworkloads and to compare themwith
traditionalwebworkloadscharacterization. Understanding
the nature of locality will help indesigning moreeÆcient
caching,loadbalancing,andcontentdistributionsystems.
Accesspatternsanddynamicsofthesitehavetobetaken
intoaccountwhenmakingadecisionaboutdierentcaching
or content distributionsystems. Forexample, ifthe siteis
verydynamic,i.e. alargeportionoftheclientrequestsare
accessingnewcontent,(newswebsitesbeingaprimeexam-
ple),thenCDNsareclearlyagoodchoicetohandletheload,
becausetraditionalcachingsolutionswillbelesseÆcientin
distributing the load due to time involved in propagating
the content through the network caches. Thus, the other
questionweaddressinthispaperishowtocharacterizethe
dynamicsandevolution ofaccessesatmediasites. The
rstnaturalstepistoobservetheintroductionofnewlesin
thelogs,andtoanalyzetheportionofallrequestsdestinate
for thoseles. Wedenenew lesimpactmetricthataims
to characterize thesite evolution dueto newcontent. We
proposeasecondmetric,calledlifespanmetric,tomeasure
therateofchangeintheaccesspatternofthesite.
WehavedevelopedatoolcalledMediaMetricsthatchar-
acterizes a media server access prole and its system re-
sourceusageinbothaquantitativeand qualitativeway. It
extractsandreportsinformationthatcouldbeusedbyser-
viceproviderstoevaluatecurrent solutionsandtoimprove
Key newobservationsfromouranalysisinclude:
Despite the fact that thetwo studiedworkloads had
signicantlydierentlesizedistribution(onesethad
well represented groups of short, medium, and long
videos, whilethe otherset was skewedinlongvideos
range), the clients' viewing behavior was similar for
both sets: 77-79% of media sessions being less than
10 minlong, 7-12% of sessionsbeing 10-30min, and
6-13% of sessions continued for more than 30 min.
This re ects the browsing nature of the most enter-
priseclientaccesses.
Most of the incomplete sessions (i.e. terminated by
clients before the video was nished) are accessing
the initial segments of media les. The percentage
of sessions with interactive requests (such as pause,
rewind, or fast forward during the media session) is
muchhigherfor mediumandlongvideos.
Likewebworkloads,boththemediaworkloadsexhibit
ahighlocalityofaccesses: 14-30%ofthelesaccessed
on the server account for 90% of the media sessions
and92-94%ofthebytestransferred,andwereviewed
by96-97%oftheuniqueclients.
While there is a signicant numberof les that are
rarely accessed(16% to19% of the lesare accessed
only once), these numbersare somewhat lower com-
paredtowebserverworkloads.
Thedistributionofclientsaccessestomediales can
beapproximatedbyZipf-likedistributionforbothwork-
loads. However, noteworthy is that the time scale
playsimportantroleinthisapproximation. Weconsid-
ered 1-month,6-month,1-yearand awholelog dura-
tionasatimescaleforourexperiments.Foronework-
load,distributionofclientsaccessestomedialesona
6-monthscalestartstotZipf-likedistribution. While
for the otherworkload, le popularity on a monthly
basis can be approximated by Zipf-like distribution.
For longer timescale in thesame workloads, the le
access frequency distributiondoes not follow Zipan
distribution.
Accessestothenewlesconstitutemostoftheaccesses
inanygivenmonth. Also,thebytestransferreddueto
accessestonewlesaredominantinbothworkloads.
It makesthe access patternof enterprise media sites
resembletheaccesspatternofthenewswebsiteswhere
themostoftheclientaccessestargetnewinformation.
We introducethe new les impact metricto measure
sitedynamicsduetonewles. Moreover,weobserved
that forenterprisemediaservers,thetendencyofthe
numberof accesses to be increasing or decreasing in
natureisstronglycorrelatedwiththenumberofnewly
addedles.
Forbothworkloads,51-52%ofaccessestomediales
occurduringtherstweekoftheirintroduction. First
veweeksoftheles'existenceaccountfor70-80%of
alltheaccesses. Wedenealifespanmetrictore ect
therateofchangeinaccessestonewlyintroducedles.
diainformationonasiteislesstimelyandhavemore
consistent percentileof accessesoverlongerperiodof
time.
The remainder of the paper presents our results in more
detail. Section1discussesrelated workandintroducesthe
sitesweusedinourstudy.Section2describesthemediales
lengthand the distributionofthe accesses, media lesen-
coding,availablebandwidthto thesessions,completedand
abortedsessioncharacteristics. Section3providesinsightin
locality characteristics ofstudied workloads. Section4 in-
troducesthenewlesimpactmetric,anddemonstratesthe
trendsinaccesspatternsduetothenewles. Section5de-
nesthelifespanmetricandmeasurestherateofthesite's
access patternchanges. Finally, section 6presentsconclu-
sionandfuturework.
Acknowledgments: Boththetoolandthestudywould
nothave beenpossiblewithoutmedia accesslogsand help
providedbyNicLyons,WraySmallwood,BrettBausk,Mag-
nusKarlsson,WentingTang,YunFu,JohnApostolopoulos,
andSusieWee. Theirhelpishighlyappreciated.
1.1 Related Work
Whilewebserverworkloadshavebeenstudiedextensively,
therehavebeenrelativelyfewerpaperswrittenaboutmul-
timedia workloadanalysis. Acharyaetal. [1]characterized
non-streaming multimedia content stored on web servers.
In their laterwork [2], authors present the analysis of the
six-month trace data from mMOD system (the multicast
MediaonDemand)whichhadamixofeducationalanden-
tertainmentvideos. Theyobservedhightemporallocalityof
accesses,thespecialclientbrowsingpatternshowingclients
preferenceto preview theinitialportion ofthevideos, and
thatrankingsofvideotitlesbypopularity donottaZip-
andistribution.
Recent studieson client access toMANIC system audio
content [14] and low-bit ratevideos in theClassroom2000
system [11] providethe analysis of accessesto educational
media servers in terms of daily variation in server loads,
distributionofmediasessiondurations,andsomeclient in-
teractivity analysis.
Extensiveanalysisofeducationalmediaserverworkloads
isdonein[3]. Theirstudyisbasedontwomediaserversin
useatmajorpublicuniversitiesintheUnitedStates:eTeach
andBIBS.Theauthorsprovideadetailedstudyofclientses-
sionarrivalprocess: the clientsessions arrivalinBIBScan
be characterized as Poisson, and arrivals ineTeach work-
load are closer to heavy-tailed Pareto distribution. They
also observed that mediadeliveredper sessiondependson
the media lelength. They discovered dierent client in-
teractivitypatternsforfrequentlyandinfrequentlyaccessed
les: any videosegmentis equallylikely tobeaccessedfor
frequentles,whileaccessfrequencyishigherforearlierseg-
mentsintheinfrequentvideos. Themaingoalof[3]wasto
identifythe important parametersfor generatingsynthetic
workloads.
While all the above papers used media server logs, the
studybyChesireetal[9]analyzedthemediaproxyworkload
atalargeuniversity.Theauthorspresentedadetailedchar-
acterization ofsessionduration(mostofthemediastreams
arelessthan10min),objectpopularity(78%ofobjectsare
accessedonlyonce),serverpopularity,andsharingpatterns
asthehigh-speedaccessmethodsbecomemoreubiquitous,
streaming media starts to occupy more sizable fraction of
the Internet's bandwidth. Few recent papers [13, 12, 16]
analyzetheimpactofstreamingmediaontheInternettraÆc
andtheperformanceofpopularInternetreal-timestreaming
technologies.
Our paperbuilds upon this previous work in a number
of signicant ways. To our knowledge, this paper is the
rststudyofenterprisemedia serverworkloads. Our data
iscollected oversignicant periodof time,which makesit
unique. Thedurationofthisdataallowedustoconcentrate
ontheanalysisofmediaserveraccesstrends,accesslocality,
dynamicsand evolution of the media workload over time,
andtoproposetwonewmetricstomeasuretheseproperties.
2. WORKLOAD CHARACTERISTIZATION 2.1 Data Collection Sites
Weuseaccesslogsfromtwodierentservers:
HP Corporate Media Solutions server (HPC)
hostsdiverseinformationaboutHP:videocoverageof
majorevents,keynotespeeches,addressesandpresen-
tations,meetingswithindustryanalysts,promotional
events, product introduction, information related to
softwareandhardwareproducts,anddemosforprod-
uctsusage. Additionally,ithassometraininganded-
ucation information. The logscoveralmost 2.5 years
ofduration: fromthemiddleofNovember,1998tothe
middleof April,2001. TheHPCcontent is delivered
byWindowsMediaServer[17].
HPLabs Media server (HPLabs) provides infor-
mation about HP Laboratories, videos of prominent
presentations, seminars, meetings, HP wide business
relatedevents,Cooltown 1
promotionalmaterials,and
some trainingandeducational information. Thelogs
cover1yearand9monthsduration: fromthemiddle
ofJuly,1999 tothemiddleofApril,2001. Itisanin-
ternalserveravailableonlyforaccessestoHPemploy-
ees. The HPLabs content is delivered by RealServer
G2fromRealNetworks[15].
Themediaaccesslogsrecordtheinformationaboutallthe
requestsand responsesprocessed by amedia server. Each
lineoftheaccesslogsprovidesadescriptionofauserrequest
foraparticularmediale. WindowsMediaServerand Re-
alNetworksMediaServerhavedierentlog formatsbutthe
typical elds contain information about the IP-address of
theclient machine making the request, the timestamp at
whichtherequestwas made,thelenameof therequested
document,the advertisedduration ofthe le(in seconds),
thesizeoftherequested le(inbytes), theelapsedtimeof
therequested media lewhentheplay ended, theaverage
bandwidth(Kb/s) available to the user while the lewas
playing, etc. (for more details on the access log formats
see[7]). Clientscanpause,rewind, fastforward,orskipto
apredenedpointusingaslidebarduring theirviewingof
therequestedmediales.
1
HP'svisionofthefuture, aworldwhereeveryoneandev-
erythingisconnectedto thewebthroughwiredorwireless
thesameleaccess. Wewillexplicitlydistinguishtheusage
of theterm: asession isthe accessof aparticular leand
therecanbemultiplerequestswithinthesamesession,due
toclient'sinteractivity.
TheoverallworkloadstatisticsforHPCandHPLabsme-
diaserversissummarizedinthefollowing Table1.
HPC HPLabs
Duration 29months 21months
Totalsessions 666,074 14,489
TotalRequests 1,179,814 NA
UniqueFiles 2,999 412
UniqueClients 131,161 2,482
StorageRequirement 42GB 48GB
BytesTransferred 2,664GB 172GB
Table1: Statistics summaryfor twosites.
A glance at the basic statistics shows that HPC media
server witnesses more activities and reaches larger client
populationthanHPLabsserver. HPLabsserverclearlytar-
getsmorespecic,smallerresearchcommunityatHP,and
asaresulthasaverydierent,\modest"prole. HPCrep-
resentsareasonablybusymedia serverwith300-800client
sessions per weekday and occasional peaksreaching 12000
sessions. HPLabsserverismuchlighterloaded.Bynoticing
this very obvious dierence, it becomes even more inter-
esting whether we can nd common properties typicalfor
enterpriseworkloadsingeneral.
2.2 Files and Session Characteristics
Theadvertisedmedialedurationre ectsthetotallength
ofthevideo,whiletheclientcanstopviewingordownload-
ing the leby hittingstop buttonbefore the video is n-
ished, and it cando so after asequence of pause,rewind,
fastforwardtospecicsectionsofthevideo.
Figure1showsthe distributionofstoredvideosfor both
workloads,andpercentageofcorrespondingaccessestothose
les. Tosimplifythemedialedurationanalysis,wecreated
6 classesfor consideredles: threegroups ofshort videos:
1) less than2 min, 2) 2-5 min, 3) 5-10 min; onegroup of
mediumsize videos: 4)10-30 min,and two groupsof long
videos: 5) 30-60min,and 6)longerthan60min. Figure2
shows the distribution of storedvideos indened above 6
duration classes, and percentage of corresponding sessions
tothoseles.
OuranalysisshowsthatforHPCworkload,thecontentis
wellrepresentedbyvideosofdierentdurations:42%ofles
belongtoashortvideogroup,23%oflesareinamedium
videogroup,and34%oflesbelongtoalongvideogroup.
HPLabsworkloadisstronglyskewedinfavoroflongvideos:
7% ofvideos areinmediumgroup,and79%oflesbelong
toalongvideogroup.
Theinterestingcharacterizationisthatthepercentageof
clients accessesisproportionalto thepercentage oflesin
each of those le duration categories for both workloads!
Thisimpliesthateachoftheledurationgroupsisequally
likelytobeaccessedbyclients. Thispropertyisveryuseful
for syntheticworkloadgeneration, sinceit proposesasim-
plemodelofdeningamedialedurationdistributionand
percentageofcorrespondingclientaccessestothoseles.
a)
0 10 20 30 40 50 60 70 80 90 100
0 20 40 60 80 100 120 140 160
% of Total
File Duration (min) HPC
Files File Accesses (Sessions)
b)
0 10 20 30 40 50 60 70 80 90 100
0 20 40 60 80 100 120 140
% of Total
File Duration (min) HPLabs
Files File Accesses (Sessions)
Figure 1: Distribution of le durations and distribution of client sessions to those les: a) HPC and b)
HPLabs.
a)
0-2min 2-5min 5-10min 10-30min 30-60min >60min 5
10 15 20 25
% of Total
HPC
Files Sessions
b)
0-2min 2-5min 5-10min 10-30min 30-60min >60min 10
20 30 40 50 60
% of Total
HPLabs
Files Sessions
Figure2: Sixclassesofledurationsandpercentageofclientsessionstothoseles: a)HPCandb)HPLabs.
clientsviewedthevideos,thestatisticschangesdramatically
forbothworkloadsasshowninFigure3.Noticethatstatis-
ticspresentedbythisgraphre ectstheoverallclientviewing
timedistribution,itisnotcorrelatedwiththeactualmedia
lesduration. Mostoftheviewedmediasessions,50%-60%,
werelessthan2minlong.
0-2min 2-5min 5-10min 10-30min 30-60min >60min 10
20 30 40 50 60
% of Total
Sessions Duration
HPC HPLabs
Figure 3: Sessionduration characterization.
In spite of signicant dierence in the original le size
distribution, the actual duration for which clients viewed
thevideoswassimilarforbothsets: with77-79%ofmedia
sessionsbeing lessthan10minlong,7-12%of thesessions
being10-30 min,and6-13%ofsessionscontinuedformore
than30 min. It re ects the browsing nature of the most
enterpriseclientaccesses, andthatoftenclientsare looking
for a specic fragment of content ina video, and are not
interestedinwatchingitcompletely. Knowledgeoftheap-
proximate percent of \browsing" clients helps to estimate
2.3 Encoding and Available Bandwidth, Com- pleted and Aborted Sessions
Bothservers, HPCand HPLabs, had videos encoded at
dierentrate. VideosstoredatHPCserverhadmostofthe
les (59%) encoded at 56 Kb/s rateand lower. However,
over the years, the trend is to add more les encoded at
higher rate: for example, in 1999 year, only 1.7% of the
videoswereencodedataratebetween128-256Kb/s,while
in2001thisgroupofvideosconstitutesalreadyupto27.8%
oftotal. HPLabsserverhasmostofthelesencodedathigh
bit rate: 67% ofall the lesare encodedat256Kb/s and
higher.
Mediaaccesslogsreporttheaveragebandwidthavailable
tothe userwhile thelewas playing. HPCmedia sessions
overallhad higheravailable bandwidthto theclients com-
paredtotheHPLabssessions: 57.7%ofsessionshadanav-
erageavailablebandwidthabove56Kb/s(wewillcallthese
sessionsashigh-bandwidthsessions). ForHPLabsworkload,
high-bandwidthsessionsconstitutedonly25%oftotal.
For HPC workload, most of le encodings and average
availablebandwidthpersessionshowagoodallingment as
shown in Figure 4 a). Only the group of videos encoded
atrates between128-256Kb/scouldnotmeettherequire-
ments. Whilefor HPLabsworkload,where themostofthe
leswereencodedat256Kb/sandhigher,thegapbetween
the demandandavailablebandwidthis veryhigh: most of
the sessions havesignicant mismatchbetweenthe leen-
codingand the available bandwidthas showninshownin
Figure4b). Thisinformation,providedbyMediaMetrics,
couldbeusedbytheserviceproviderstoanalyzetheclient
bandwidthavailabilityforchoosingtherightencodingrates.
a)
1 10 100 1000
0 500 1000 1500 2000 2500
File Encoding / Bandwidth (Kb/s)
File Number HPC
Average Bandwidth File Encoding
b)
1 10 100 1000
50 100 150 200 250 300 350 400
File Encoding / Bandwidth (Kb/s)
File Number HPLabs
Average Bandwidth File Encoding
Figure 4: File encoding rates and average available bandwidth of client sessions to those les for both
workloads.
ItexplainswhyahigherpercentageofHPCsessionswere
completedcomparedtoHPLabsworkload(29%ofHPCses-
sions versus 12.6% of HPLabs sessions were completed).
However, the dierence in bandwidth between completed
andabortedsessionsisnotsignicantlydierent.
Most oftheabortedsessionsaccessedinitial segmentsof
mediales. Thenumberofsessions which hadincomplete
accesses to any other segments of the le other than the
beginning,dependonthesize ofthe video: lessthan1.5%
ofsessionsinshortvideogroupaccessedanysegmentofthe
videootherthanthebeginning,2.4%ofsessionsinamedium
videogroup, 4%-7% of sessions inlong videogroup. Such
knowledge about the client viewing patterns is benecial
whendesigningmediacachingstrategies.
Serverlogformat hasaseparateentryforeachclientre-
quest. As aresult,we are able toget information suchas
pause, rewind or fast forward activity by the client dur-
ingthe mediasession. Unfortunately,similardatawasnot
available for HPLabs workload. Analysis ofclient interac-
tivityforHPClogsproducedveryinterestingresults. First
ofall,itrevealedthat99.9%ofthesessionswithinteractive
requests were high-bandwidth sessions. Second, that the
percentageof sessionsthat accessmediumand longvideos
havemuchhigherinteractivity.
3. LOCALITY CHARACTERIZATION
Inthissection,werevisitapreviouslyidentiedinvariant
for web server workloads that web traÆc exhibits strong
concentration ofreferencesand\10%oflesaccessedfrom
theservertypically accountfor 90% of the serverrequests
and90%ofthebytestransferred".
For locality characterization of our logs, we use atable
of all les accessed along withtheir frequency(numberof
times a lewas accessed during the observedperiod) and
the lesizes. This table is ordered in decreasing orderof
frequency.
Figure5a)showsthereferencelocalityforthemediaserver
accesslogsusedinourstudy. Ouranalysisrevealsthat90%
ofthemediasessionstarget14%ofthelesforHPCserver,
and 30% of the les for HPLabs server. This shows high
locality ofclientaccesses, thoughlowerthanforwebwork-
loads. Figure5b)showsthecorrespondingbytestransferred
duetothesemediasessions: 94%forHPCsiteand92%for
Observedgraphsfor bothworkloadsareremarkablysim-
ilar. Figure5c)showsclientslocalityforbothworkloads. It
can be interpreted inthe following way: 14% of the most
popular lesareaccessedby96%ofclientsatHPCserver;
30% of the most popular les are viewed by 97% of the
clientsatHPLabssite.
Wealsoanalyzedworkloadlocalityfromadierentangle:
whatpercentageofactivestoragedidthemostpopularles
account for. Here, the activestorageset is dened by the
combinedsizeofallthemedialesaccessedinthelogs. For
bothworkloads,weobserve ahighactivestoragesetlocal-
ity: 80% to 88% of all sessions are to les that constitute
only 20% of the total active storage setas canbe seen in
Figure 6a). Similarly, 82%to 92% ofall transferred, most
popularbytesareduetolesthatconstituteonly20%ofthe
total active storagesetas canbeseen inFigure6b). This
typeofanalysishelpsinestimatingthestoragerequirements
and potentialbandwidthsavingswhenusingoptimizations
for thepopular portionof the media content. Since these
metricsarenormalizedwithrespecttothesite'sactivestor-
ageset,it allowsus tocomparedierentworkloadsand to
identifythesimilarityinherenttothoseworkloads,indepen-
dentoftheabsolutenumbersforstorageineachworkload.
Answeringthequestion: howdoesthelocalitycharacteri-
zationinworkloadvarywithatimedurationofthelogscol-
lection, wefound that independent onduration (1-month,
6-monthor 12-month durations) both workloads exhibit a
highlocalityofclientaccesses.
Previousstudiesonwebserversandwebproxies[6]ledto
almostuniversalconsensusthatwebpagepopularityfollows
Zipf-likedistribution,wherethepopularityofthei-thmost
popular leis proportional to 1=i
. For web proxies, the
valueofistypicallylessthan1,rangingfrom0.64to0.83,
for web servers the reportedtypical value of is varying
between1.4-1.6. Paper[9],whichanalyzesthemediaproxy
workload,reports aZipf-likedistributionfor theleaccess
frequencies in their study with = 0:47. Paper [3] ap-
proximatededucationalmediaserverdailyworkloadsusing
concatenationoftwoZipf-likedistributions.
Sinceour workloads understudy cover along period of
time, we decidedto investigate whetherthe leaccessfre-
quencies exhibit the same behaviour on a dierent time
scale. Weconsidered1-month,6-month,1-yearandtheen-
tiredurationofthelogsasatimescaleforourexperiments.
Inordertocharacterizethedistributionoftheleaccess
a)
0 10 20 30 40 50 60 70 80 90 100
55 60 65 70 75 80 85 90 95 100
Accessed Files (% of Total)
Media Sessions (% of Total) HPC
HPLabs
b)
50 55 60 65 70 75 80 85 90 95 100
55 60 65 70 75 80 85 90 95 100
Bytes Transfered (% of Total)
Media Sessions (% of Total) HPC
HPLabs
c)
75 80 85 90 95 100
55 60 65 70 75 80 85 90 95 100
Clients (% of Total)
Media Sessions (% of Total) HPC
HPLabs
Figure5: Twoworkloads compared: a)lesetlocality,b)bytes-transferredlocality c) client setlocality.
a)
0 20 40 60 80 100
5 20 40 60 80 100
Media Sessions (% of Total)
% of Site’ Active Storage Set HPC HPLabs
b)
0 20 40 60 80 100
5 20 40 60 80 100
Bytes Transfered (% of Total)
% of Site’ Active Storage Set HPC HPLabs
Figure6: Twoworkloadscompared: storage-setlocality and bytes-transferred-storage locality.
a)
1 10 100 1000 10000 100000
1 10 100 1000 10000
Number of Accesses (log)
File Number (log) HPC HPLabs
b)
1 10 100 1000 10000 100000
1 10 100 1000 10000
Number of Accesses (log)
File Number (log) 2000 year
HPC-2000 HPC-1-half HPC-2-half HPLabs-2000 HPLabs-1-half HPLabs-2-half
c)
1 10 100 1000 10000
1 10 100
Number of Accesses (log)
File Number (log) HPLabs
HPLabs-1-half HPLabs-2-half alpha=1.6
Figure 7: File popularity distribution for both workloads a) over entire duration of the logs, b) over 2000
yearand corresponding rstand second6monthsin2000,c) HPLabsworkload6-monthperiodswithcorre-
sponding Zipf-like functiontting.
a)
1 10 100 1000 10000 100000
1 10 100 1000
Number of Accesses (log)
File Number (log) HPC
HPC-2000-1 HPC-2000-2 HPC-2000-3 HPC-2000-4 HPC-2000-5 HPC-2000-6
b)
1 10 100 1000 10000 100000
1 10 100 1000
Number of Accesses (log)
File Number (log) HPC
HPC-2000-7 HPC-2000-8 HPC-2000-9 HPC-2000-10 HPC-2000-11 HPC-2000-12
c)
1 10 100 1000 10000 100000
1 10 100 1000
Number of Accesses (log)
File Number (log) HPC
HPC-2000-8 alpha=1.5
Figure8: File popularity distribution for HPCworkload a)monthlyperiods,1st to 6thmonths b)monthly
periods,7th to12th months c)8th month withcorrespondingZipf-like functiontting.
a)
0 2000 4000 6000 8000 10000 12000
1 10 100 1000 10000
Number of Unique Clients
File Number (log) HPC
Clients
b)
0 50 100 150 200 250 300 350
1 10 100 1000
Number of Unique Clients
File Number (log) HPLabs
Clients
Figure9: Files sharingstatistics: a)HPCand b)HPLabs.
frequencies for workloads understudy,we rankedthe les
bypopularity(i.e. thenumberofaccessestoeachle),and
plottedthe results onthelog-log scale. Figure7a) shows
thelepopularityoverentiredurationthelogs. Bothwork-
loads exhibitvery similar distribution: the HPLabs curve
\follows" the HPCcurve, but ona lower scale. This can
be explained by almost two orders smaller numberof ac-
cessesand lesintheHPLabs workload. However,bothof
thesecurvesare far from ttingastraight line of Zipf-like
distribution.
Figure 7b)shows lespopularity for HPCand HPLabs
workloadsfortheyearlyperiod(of2000year),aswellas6-
monthsintervals(thecorrespondingrsthalf-yearandsec-
ond half-year periods of 2000 year). HPCcurves(both 1-
yearand6-month)arestillfarfromttingastraightlineof
Zipf-likedistribution.
However,6-monthcurvesfor HPLabst reasonablywell
withthestraightlineofZipf-likedistributionwhenignoring
therst15-20les(in[6],authorsmakesimilarassumptions
aboutignoringthetop100documentsanda attailatthe
end of the curve). The straight line on the log-log scale
impliesthattheleaccessfrequencyisproportionalto1=i
.
Weobtained thevaluesof usingleastsquaretting: for
both6-month curves=1:6 worksverywell. Figure 8c)
shows lepopularity distributionforthe HPLabsworkload
correspondingtothe6-monthperiodsof2000,approximated
byZipf-likefunction1=i
,with=1:6.
Finally,Figure8a)and b)shows lespopularity forthe
HPCworkload on a monthly basis. Most of the monthly
curves t straight line reasonably well when ignoring the
rst10-15lesandfewlastles. Fordierentmonths,value
ofisrangingfrom1.4to1.6. Figure8c)showslepopu-
laritydistributionforHPCworkloadduringAugustof2000,
approximatedbyZipf-likefunction1=i
,with=1:5.
Theobservationthattheleaccessfrequenciesfortheme-
diaworkloadsunderstudycanbeapproximatedbyZipf-like
distributionisveryusefulforsyntheticworkloadgeneration.
Itisinterestingthatthetimescaleplaysanimportantrole
inthisapproximation.
Thehighlocalityofaccessestospecicsubsetoflesand
thehigh concentration ofthe clientsaccessing thesepopu-
larles showninFigure9imply thatthe popular lesare
widelyaccessedbymanydierentclients. InHPCworkload,
rst70lesareaccessedbymorethan1000uniqueclients,
withsomefrequentlesaccessedby10;000 12;000unique
clients. ForHPLabsserver, degreeofsharingislower(itis
expected,becauseofthesmallerclientspopulation),butfor
themostfrequentlesitisstillverysignicant: rst17les
areaccessedby113 341uniqueclients.Thesharingexhib-
itedbytheclients'accesspatternsisessentialfordesigning
aneÆcientcachinginfrastructure.
Complementary to the characterization of the most fre-
quentlyaccessedles,itisusefultohavestatisticsaboutthe
\opposites": thepercentageofthelesthatwererequested
onlyafewtimes,andthepercentageofactivestoragethese
lesaccountfor:
FilesRequested Storage
upto Requirementsfor
1/5/10times CorrespondingFiles
HPC 16%/38%/47% 10%/26%/34%
HPLabs 19%/45%/59% 17%/39%/52%
Table2: Rarelyaccessed lesstatistics.
AstheTable2shows,16%to19%ofthelesareaccessed
only once, and 47% to 59% of the les are accessed less
than10times. Theserarelyaccessedlesaccountfor quite
signicant amount of storage: 34% to 52% of total active
storage set. Thesenumbersaresomewhat lowercompared
to the web server workloads. For web server workloads,
\onetimers"(lesaccessedonlyonce)mayaccountfor20%-
40%ofthelesandtheactivestorage.
4. EVOLUTION OF MEDIA SITES
Inthissection,weinvestigatespecicleaccesspatterns
discoveredthroughtheanalysisoftwoworkloadsunderstudy.
WeobservedthatthetraÆctoboththeHPCandtheHPLabs
sitesisverybursty.Somedaysexhibittwoordersofmagni-
tudehighernumberof sessions. Wewill relatethis bursti-
nesstosessionsaccessingnewleslaterinthesection. Other
studiesofdierentmediaworkloads[9, 3]containedsimilar
observations aboutmediatraÆcburstiness,butthedegree
ofburstinessobservedwassmaller,andmorecorrelatedwith
thedayoftheweek,especiallyforeducationalmediawork-
loads.
Since our logs provide information about the client ac-
cessesoveralongperiodoftime,oneofthemaingoalsofthis
studyistocharacterizethedynamicsandevolutionofmedia
a)
0 100 200 300 400 500 600 700 800
0 5 10 15 20 25 30
Number of Files per Month
Month Number HPC
All Files New Files
b)
0 10000 20000 30000 40000 50000 60000 70000 80000
0 5 10 15 20 25 30
Number of Sessions per Month
Month Number HPC All Sessions Sessions to New Files
c)
0 5000 10000 15000 20000 25000
0 5 10 15 20 25 30
Unique Clients per Month
Month Number HPLabs Unique Clients
Figure10: HPCworkload: a) all andnewles, b)all sessionsand sessionstonewles, c)uniqueclients.
a)
0 20 40 60 80 100 120
0 5 10 15 20 25
Number of Files per Month
Month Number HPLabs
All Files New Files
b)
0 500 1000 1500 2000 2500 3000
0 5 10 15 20 25
Number of Sessions per Month
Month Number HPLabs
All Sessions Sessions to New Files
c)
0 100 200 300 400 500 600
0 5 10 15 20 25
Unique Clients per Month
Month Number HPLabs
Unique Clients
Figure11: HPLabsworkload: a)all and newles,b) allsessionsand sessionstonewles, c) uniqueclients.
ductionofnewlesinthelogs,andtoanalyzetheportion
ofall requestsdestinateforthose les. Wedene ametric
callednewlesimpact,tocharacterizethesiteevolution
duetonewcontent,by computingtheratio ofthe accesses
targetingthesenewlesovertime. Figures10a)and11a)
showtwocurvesforHPCandHPLabsworkloadrespectively.
Thecurvesshowalltheleswhichwereaccessedinapartic-
ularmonth,andallthenewleswhichwereaccessedinthe
samemonth. We denealebeingnew if itwasnot ever
accessedbefore,basedontheinformationintheaccesslogs.
HPCsitehasanexplicitgrowthtrendwithrespectoftotal
numberoflesaccessedpermonth,andconsistentlysteady
amountofnewlesaddedtothesiteduringeachmonth.
Thegrowthoftotalnumberoflesaccessedeachmonth
for HPLabs siteis \negative". Sincethis was unexpected,
we askedtheteamsupportingthis sitewhethertherewere
specicreasons for thetrend we observed. Specically, we
wantedtoknowifthereisasignicantnumberofnewvideo
lesthat\nobodywatches"andhencethelogsdon'tcontain
anyinformationaboutthemoriftheactualnewmediacon-
tentonthatsitedecreasedovertime. Theteamexplained
that lately they had been adding only a limited number
ofnew lesbecause theyare working onatransition plan
to upgrade the entire site design and equipment. So, the
\negative"trendintheadditionofnewlestothesitewas
observedcorrectly.
Figures10b)and11b)showgraphsforHPCandHPLabs
workloadrespectively: thenumberofallsessionspermonth
andthe numberof sessionsto thenew lesinthis month.
Thesegraphsre ectthat theaccessestothenewlescon-
stitutethemostoraverysignicantportionofallaccesses,
excludingafewmonthsthatwereexceptions. Additionally,
both workloads exhibitsimilar trends for the bytes trans-
ferred permonthand the bytes transferreddue tothe ac-
cessestonewles. Sincethenumberofnewlesaddedper
monthplaysacrucialroleindeningthesitedynamics,evo-
lution,andgrowthtrends,evaluatingthenewlesimpact
metricbecomesimportant.
Figures10c)and11c)showthenumberofuniqueclients
permonthaccessingeachoftheHPCandHPLabssitecor-
respondingly. Again, the graphs are correlated with the
trends of the sessions to each site's new les. Thus, the
client populationof enterprise mediasitestronglydepends
on the amount of new information regularly added to the
site.
Dynamicsoftheenterprisewebsites exhibitsmuchmore
stability in terms of the accesses to the \old" documents.
Onlyabout2%ofthemonthlyrequestsaretothenewles
added that month as shownin [8]. Dierently, the access
patternofenterprise media sitesresembleswiththe access
patternof the newsweb sites where most ofthe client ac-
cessestargetnewlyaddedinformation.
5. LIFE SPAN OF FILE ACCESSES
Inthissection,weattempttoanswerthefollowing ques-
tion: howmuchdoesthepopularityoftheleandfrequency
ofleaccesseschangesovertime? Theanswertothisques-
tioniscriticalfordesigningprefetchingorserver-pushalgo-
rithms,aswellasfordesignofeÆcientcontentdistribution
strategiesinCDNnetworkformediacontent.
Enterprisemediaserverworkloadsexhibithighlocalityof
references. AshasbeenshowninSection3,itwasobserved
that90%ofthemediaserversessionstargetonly14%-30%
oftheles. Thus,thissmallsetofleshasthestrongimpact
onthemediasiteperformanceand itsaccesspatterns. We
a)
0 10 20 30 40 50 60 70 80 90 100
50 100 150 200 250 300 350 400 450 500
Files (%)
Days (between the first and the last file accesses) HPC
Files Files (Core-90%)
b)
0 10 20 30 40 50 60 70 80 90 100
50 100 150 200 250 300 350 400 450 500
Files (%)
Days (between the first and the last file accesses) HPLabs
Files Files (Core-90%)
Figure 12: Daysbetweentherst andthe last leaccesses.
a)
30 40 50 60 70 80 90 100
20 40 60 80 100 120 140 160 180 200
Sessions (%)
Days (since the files introduction) HPC
All Sessions Sessions (Core-90%) Sessions (Non-Core)
b)
30 40 50 60 70 80 90 100
20 40 60 80 100 120 140 160 180 200
Sessions (%)
Days (since the files introduction) HPLabs
All Sessions Sessions (Core-90%) Sessions (Non-Core)
Figure13: Percentof sessionson daysbetweentherst andthe last leaccesses.
lesthatmakesupfor90%ofallthemediasessions. From
theperformancepointofviewitisthesecorelesoneshould
concentrate onto obtain goodperformanceas most ofthe
accessesaretothem. Alongwithunderstandingthedynam-
icsof all thelesat thesite, we wouldliketo seewhether
thecorelesexhibitsomespecicproperties.
We dene a life duration for a particular le to be the
time between the rst and the last accesses to this le in
thegivenworkload.
Figure 12 shows the distribution of lelife duration for
bothworkloads. Therearetwocurvesonthegraphsrepre-
sentinga lifeduration distribution for all the les and for
thecore les. Ouranalysis shows that high percentage of
allleshaveashortlifeduration: lesthat\live"lessthan
amonthconstitute37%ofallthelesintheHPCworkload
and50%ofallthelesintheHPLabsworkload(thisnum-
berispartially so high,because 16-19%of all theles are
accessedonlyonce,whichistypicalasformediaasforweb
serverworkloads). 73% of all the les for bothworkloads
have alife duration less than6 months. Only10% of the
lesfor the HPL siteand 8% for the HPCsitelivelonger
thanayear. Asforthefrequentlyaccessedles,muchhigher
percentageofthemlivelongercomparedtothelifeduration
ofalltheles. Andadditionally,the\short-lived"frequent
les in the graphs are mostly represented by the recently
introducedles.
Forthelesofdierentlifeduration,weintroduceanew
metric, called a life span metric, which is dened as the
cumulative distribution of accesses to the les since their
introductionatasite.
Figure13showsthelifespanofthele'saccessesforboth
workloads. The x-axis re ects the days since the les in-
troduction;they-axisrepresentsthecumulativepercentage
ofall thele accessesup to this day (relativeto the total
numberof all the sessions over the entire duration of the
logs).
ForHPC (HPLabs)workload,52% (51%) of all theses-
sionsoccurduringtherstweekoflesexistence,68%(61%)
of all thesessions occur duringtwoweeksofthe les exis-
tence, 74% (66%) - during three weeks of les existence,
77%(69%)-duringfourweeksoflesexistence,80%(70%)
-duringveweeksofthelesexistence. Thus,HPLabssite
haslongerlifespanfor theirlesthanHPCsite.
Abovestatisticscanbeinterpretedinadierentway,re-
ectingtherateofchangesoftheaccessesinagivenwork-
load: 52% (51%) of all the sessions occur during the rst
week of le existence, followed by only 16% (10%) of ac-
cesses during the second week, decreasing 6% (5%) of ac-
cessesduringthethirdweek,andonly3%(1%)-ofaccesses
for4thand5thweekssincetheleintroduction.
Life span of core-90% les is almost identical with life
span of all the les. It is notsurprising, because by de-
nition the core-90% les represent 90% of all the accesses
to the site. Their properties have major impact on char-
acteristicsoflifespanforthe wholesite. Asfortherest of
theles(non-coreles),theirpropertiesaredierentforthe
HPCandtheHPLabsworkloads. Forexample,fortheHPC
workload,70%ofthesessionstonon-corelesoccurduring
rst42daysafterthelesintroduction,while forthe HPL
workload, 70% of corresponding sessions occur during the
rst21daysaftertheintroductionoftheles.
The life span metric is a normalized metric. The les
could havebeen individuallyintroduced at dierent times.
Themetricre ectstherateofchangeoftheleaccesspat-
ternduringthelesexistenceatthesite. Moreover,thelife
span metric re ects the timeliness of the introduced les.
Longerlifespanmeansthat mediainformationonasiteis
less timely and has more consistent percentile of accesses
tointerpolatetheintensityoftheclientaccessestothenew
andtheexisting lesoverafutureperiodoftime.
Webelievethatlocalityproperties,accesspatternsofnewly
introduced les, and their life span are critical metrics in
deningtheeÆcientcachinginfrastructureandfuturecon-
tentdeliverysystems.
6. CONCLUSION AND FUTURE WORK
Mediaserveraccesslogsareinvaluablesourceofinforma-
tion notonly to extract business related information, but
alsoforunderstandingtraÆcaccesspatternsandsystemre-
sourcerequirementsofdierentmediasites. OurtoolMe-
diaMetricsisspeciallydesignedforsystemadministrators
andserviceproviderstounderstandthenatureoftraÆcto
theirmediasites. Issuesofworkloadanalysisarecrucialto
properlydesigningthe site, and itssupportinfrastructure,
especiallyforlarge,busymediasites.
Our analysis aimed to establish a set of properties spe-
cicfortheenterprisemediaserverworkloadsandcompare
them with the well known related observations about the
webserverworkloads. Inparticular,weobservedhighlocal-
ity of references inmedia leaccesses for bothworkloads.
Similarto previous web workloads studies, ouranalysis of
the media le popularity distribution revealedthat it can
be approximated by Zipf-like distribution with parame-
ter ina range 1.4-1.6. The interesting newobservation is
thatthetimescale playsanimportantroleinthisapprox-
imation. Weconsidered1-month,6-month,1-year andthe
entire duration of the logs as a time scale for our experi-
ments. For the HPLabs workload, the distribution of the
clientsaccessestothemedialesona6-monthscalestarts
to t Zipf-like distribution. Whilefor theHPC workload,
thelepopularityonamonthlybasiscanbeapproximated
byaZipf-likedistribution. Forlongertimescaleinthesame
workloads {the leaccess frequencydistributiondoesnot
followZipandistribution.
Weintroducedthe newles impactmetricfor enterprise
media workloads, which re ects that accesses to the new
lesconstitutemostofthemonthlyaccesses,andthebytes
transferredduetothe accessestothenewlesaccount for
most ofthe transferred bytes. Also, we observedthat the
growth trend in the site accesses directly depend on the
amountofthenewlyaddedles.
Wedenedthelifespanmetrictore ecttherateofchange
intheaccessestothenewlyintroducedles. Forthestudied
workloads,51%-52%oftheaccessestothemedialesoccur
duringthe rstweekof their introduction. Thisstressesa
hightemporallocalityofaccessesinmediaserverworkloads
which is consistent with the observations in other media
workloadstudies.
Additionally, we also discovered some interesting facts
aboutthe clients' viewingbehavior. Despite the fact that
thetwostudiedworkloadshadsignicantlydierentlesize
distribution,theclients'viewingbehaviorwas verysimilar
forthebothsets: 77%-79%ofmediasessionswerelessthan
10minlong,7%-12%ofthesessionsbetween10-30min,and
only6%-13% of sessions continued for more than 30 min.
Thisre ects thebrowsingnature ofmostof the enterprise
client accesses. Wealso found that the percentage of ses-
sionswithinteractiverequestsaremuchhigherformedium
andlongvideos.
viewingbehaviorfordesigningeÆcientmediaproxycaching
strategies,appropriatecontentplacementatdistributedme-
diaserversandmediaproxies,aswellas inusingmulticast
forbetterbandwidthutilization.
7. REFERENCES
[1] S.Acharya,B.Smith.Anexperimenttocharacterizevideos
storedontheweb.InProc.ofACM/SPIEMultimedia
ComputingandNetworking1998,January1998.
[2] S.Acharya,B.Smith,P.Parnes.CharacterizingUserAccess
toVideosontheWorldWideWeb.InProc.ofACM/SPIE
MultimediaComputingandNetworking.SanJose,CA,
January2000.
[3] J.Almeida,J.Krueger,D.Eager,andM.Vernon.Analysis
ofEducationalMediaServerWorkloads,Proc.11thInt.
WorkshoponNetworkandOperatingSystemSupportfor
DigitalAudioandVideo,June2001.
[4] V.Almeida,A.Bestavros,M.Crovella,A.Oliviera.
CharacterizingreferencelocalityintheWWW.InProc.of
the4thInt.Conf.ParallelandDistributedInformation
Systems,IEEEComp.Soc.Press,1996.
[5] M.ArlittandC.Williamson.Webserverworkload
characterization:thesearchforinvariants.InProc.ofthe
ACMSIGMETRICS'96,Philadelphia,PA,May1996.
[6] L.Breslau,P.Cao,L.Fan,G.Phillips,S.Shenker.Web
CachingandZipf-likeDistributions: Evidenceand
Implications.InProc.ofIEEEINFOCOM,March1999.
[7] L.Cherkasova,M.Gupta.AnalysisofEnterpriseMedia
ServerWorkloads: AccessPatterns,Locality,Dynamics,
andRateofChange.HPLaboratoriesReport,No.
HPL-2002-56,March2002.
[8] L.Cherkasova,M.Karlsson:DynamicsandEvolutionof
WebSites:Analysis,MetricsandDesignIssues.InProc.of
the6thInt.Symp.onComputersandCommunications
(ISCC'01),Hammamet,Tunisia,July2001.
[9] M.Chesire,A.Wolman,G.Voelker,H.Levy.Measurement
andAnalysisofaStreamingMediaWorkload.InProc.of
the3rdUSENIXSymposiumonInternetTechnologiesand
Systems,SanFrancisco,CA,March2001.
[10] L.He,J.Grudin,A.Gupta.DesigningPresentationsfor
On-DemandViewing.InProc.ofACM2000Conferenceon
ComputerSupportedCooperativeWork,Philadelphia,PA,
Dec.,2000.
[11] N.Harel,V.Vellanki,A.Chervenak,G.Abowd,U.
Ramachandran.WorkloadofaMedia-Enchanced
ClassroomServer.InProc.ofIEEEonWorkload
Characterization,October,1999.
[12] D.Loguinov,H.Radha.MeasurementStudyofLow-bitrate
InternetVideoStreaming.InProc.oftheACMSIGCOMM
InternetMeasurementWorkshop,SanFrancisco,CA,USA,
November2001.
[13] A.Mena,J.Heidemann.AnEmpiricalStudyofRealAudio
TraÆc.InProc.oftheIEEEInfocom,Tel-Aviv,Israel,
March2000.
[14] J.Padhye,J.Kurose.AnEmpiricalStudyofClient
InteractionswithContinuous-MediaCousewareServer.In
Proc.ofthe8thInt'l.WorkshoponNetworkandOperating
SystemSupportforDigitalAudioandVideo,July1998.
[15] RealServeradministrationGuide{RealSyztemG2.
RealNetworks,Inc.,Nov.1998.
http://docs.real.com/docs/serveradminguideg2.pdf
[16] Wang,M.Claypool,Z.Zuo.AnEmpiricalStudyof
RealVideoPerformanceAcrosstheInternet.InProc.ofthe
ACMSIGCOMMInternetMeasurementWorkshop,San
Francisco,CA,USA,November2001.
[17] WindowsMediaServicesSDK,Version4.1.
MicrosoftCorporation.
http://msdn.microsoft.com/workshop
/imedia/windowsmedia/sdk/wmsdk.asp