Data
Trajectories:
Tracking
Reuse
of
Published
Data
for
Transitive
Credit
Attribution
PaoloMissier SchoolofComputingScience
NewcastleUniversity,UK
Abstract
Theabilitytomeasuretheuseandimpactofpublisheddatasetsiskeytothesuccessof theopendata/openscienceparadigm. Adirectmeasureofimpactwouldrequiretracking data(re)useinthewild,whichisdifficulttoachieve.Thisisthereforecommonlyreplaced bysimplermetricsbasedondatadownloadandcitationcounts.Inthispaperwedescribe ascenario whereit ispossible totrackthe trajectoryof adatasetafter its publication, and show how thisenables the design of accuratemodels for ascribing credit to data originators.ADataTrajectory(DT)isagraphthatencodesknowledgeofhow,bywhom, andinwhichcontextdatahasbeenre-used,possiblyafterseveralgenerations. Weprovide atheoreticalmodelofDTsthatisgroundedintheW3CPROVdatamodelforprovenance, andwe showhow DTscanbe usedto automaticallypropagateafraction of the credit associatedwithtransitivelyderiveddatasets,backtooriginaldatacontributors. Wealso showthismodeloftransitivecreditinactionbymeansofaDataReuseSimulator.Inthe longerterm,ourultimatehopeisthatcreditmodelsbasedondirectmeasuresofdatareuse willprovidefurtherincentivestodatapublication. Weconcludebyoutliningaresearch agendatoaddressthehardquestionsofcreating,collecting,andusingDTssystematically acrossalargenumberofdatareuseinstancesinthewild.
Received29June2016 | Revisionreceived20September2016 | Accepted21September2016
Correspondence should be addressed to Paolo Missier, Claremont Road, Newcastle upon Tyne. Email:
Anearlierversionofthispaperwaspresentedatthe11thInternationalDigitalCurationConference.
TheInternationalJournalofDigitalCurationisaninternationaljournalcommittedtoscholarlyexcellenceanddedicated
totheadvancementofdigitalcurationacrossawiderangeofsectors. TheIJDCispublishedbytheUniversityof EdinburghonbehalfoftheDigitalCurationCentre.ISSN:1746-8256.URL:http://www.ijdc.net/
Copyrightrestswiththeauthors.ThisworkisreleasedunderaCreativeCommonsAttribution4.0 InternationalLicence.Fordetailspleaseseehttp://creativecommons.org/licenses/by/4.0/
InternationalJournalofDigitalCuration
Introduction
Thepracticeofpublishingresearchdatahasbeenmaturingrapidly,followingincreasing evidence that the combination of data sharing and emerging data citation practices representnewopportunitiesforextendingthevaluechainofthedata,ratherthanathreat to its owners (Piwowar & Vision, 2013). Reasons for publishing data, andscientific datasetsinparticular,includefaciltatingitsre-useandenablingitsvalidation. Aplethora of data repositories are availablewhere scientists can publish their datasets, assign a persistentidentifiertothem,andmakethemdiscoverable. Muchlessisknownaboutthe lifetimeofthosedatasetsaftertheirpublication,namelytheknowledgeofhow,bywhom,
andinwhatcontexts theyhavebeenre-used,andwhethersuchinstancesofre-usehave producedinteresting deriveddataproducts,possiblyafterseveralgenerations. Werefer tothisnewtypeofknowledgeasthetrajectoriesofpublisheddata(DataTrajectories,or
DT).ThemainhypothesisthatmotivatesourresearchisthatknowledgeofDTsmakesit possibletoquantifytheimpactandinfluenceofresearchdatathroughseveralgenerations ofreuseandderivation, transitively. Inturn, thiswilllead tonew notionsoftransitive credittodataowners,whichmayinformandextend currentdatacitationpractices. We
areawareofveryfewattemptsatdefiningtransitivecreditinthecontextofdatacitation. Amongst these is (Katz, 2014), where the concept is not fully formalised nor made operationalthroughmetadatamanagementandanalysis.
ChallengesinTrackingDataReuseandtheRoleofDataCitation
Whilecountingdatadownloadsfromrepositoriesisstraightforward,trackingtheirusage inthewildismuchmorechallenging. Datacanbereusedinendlesswaysthroughprogram logic, entirely or in part, on its own or combined with other datasets. Furthermore, suchderivationscan extendover severalgenerations, andmay takeplace on different, autonomousinformationsystemsanddataprocessingenvironments.
(Robinson-García,Jiménez-Contreras&Torres-Salinas,2015)describedatacitation practicesthatgobeyondsimpledownloadcountasvalid surrogatestodirecttrackingof datause. (Callaghanetal.,2012)recommendthatdatacitationshouldbebasedonsimilar reviewstagesasjournalarticles, asanecessaryfirststeptotreatingdataasafirstclass scientificobject. However,recognisingthecomplexityofdataderivation,theyalsoargue thatfurther mechanisms are neededto facilitate data transparencyand scrutiny. Even whendatacitationisprimarilyviewedasanextensionoftraditionalarticlepublication, trackingdata citationrequiresdifferentandmoresophisticated processesthan tracking datadownloads(Mayernik,2013).
EffortsinthisdirectionincludeThomson’sdatacitationindex1,aswellascommunity
effortssuchasthePublishingDataBibliometricsWorking Group2 attheResearchData
Alliance (RDA)3; the Snowball Metrics project4; Altmetrics; and Elsevier’s Metrics
1 WebofScienceDataCitationIndex:http://wokinfo.com/products_tools/multidisciplinary/dci/ 2 RDA/WDS Publishing Data Bibliometrics Working Group: https://rd-alliance.org/groups/rdawds
-publishing-data-bibliometrics-wg.html
3 ResearchDataAlliance:http://rd-alliance.org 4 SnowballMetrics:http://snowballmetrics.com
DevelopmentProgram5. In2014, theNSFfundedthe“Make your datacount”project,
managed by the PloS Open Access journal in collaborationwith DataONE6 and the
California Digital Library, to elicit ideas on data metrics from researchers. Earlier on, the MESUR project (Bollen, Van de Sompel & Rodriguez, 2008) focused on collectingevidence ofusage throughmany typesofevents, butmostlythoseassociated withreferences toarticles. OrganisationslikeDataCite7 promotethe useof persistent
identifiers,likeDOIs,fordata,whilethePublishingDataServicesWorkingGroup8atthe
RDAstudieswaystolinkdataandarticlepublications.
Contributions
Thispaperaimsto contributeto theunderstanding ofdirectdata reusemodels, andits implicationsforthe designofnew creditmodels basedondata reuse. Specifically,we makethefollowingcontributions.
Firstly, from the well-known notion of data provenance we derive a definition of thetrajectory ofa dataset D. Informally, this is the graph ofall direct and transitive
derivationsfromDtoanyotherD0,suchthatthereisaprovenancegraphthatincludesD
andD0,andD isreachablefromD0throughderivationandusage/generationpathsinthe
graph.
Secondly,we showhow perfectknowledge ofallsuchdependenciescanbeusedto formallydefinetransitivecredit,andwearegoingpresentonesuchcreditmodelindetail
asa concrete example. Transitive credit is basedon the principle that any credit that isassignedto derivativework D0 shouldpropagate transitively“upstream” to every D
suchthat D0isinthetrajectoryof D, i.e.,toany Dthatcontributedtothederivationof D0. Importantly,thismodelalsoaccommodatesanydirectcreditattributionthatmaybe
definedbythecommunity,beitbasedondatacitations,downloadcounts,orotherindirect criteria. Specifically,howmuchof D’screditshouldbeapportionedto D0isdetermined
bythedependencyrelationshipsalongthetrajectorypathfromDtoD0. Thus,weusethe
provenanceofD0toassignfaircredittoD,andthustoitspublisher(theAgentresponsible
forD,inprovenanceparlance).
Thirdly,wepresentaninstanceofourcreditmodelatworkonasimulateddatareuse scenario. With theunderstanding that many possiblesuch modelscan bedefined, we haveimplementedadatareusesimulator,9whichweuseasaresearchtoolforexploring
differentcreditmodels,andforunderstandingtheirimplicationsfordatapublishers. Thesecontributionsaredesignedtolaythefoundationsforfurtherresearchinthearea ofdatareuseanalysisbasedonprovenance. Inthissense,weareawarethattheconcepts presentedinthe paperarestillonly theoretical. Thereality oftracking datausage isa visionthatpresentsmanychallengesbecauseofthebroaddiversityofwaysinwhichpublic datacanbe usedwithout control, andthe lackof metadatamanagement infrastructure forgeneratingandcollectingprovenanceacrossmanyindependentinformationsystems. Implementingtheseideasin thewild istherefore along-term researchproposition, for
5 MetricsDevelopmentProgram:http://emdp.elsevier.com 6 DataONE:http://dataone.org
7 DataCite: http://datacite.org
8 Publishing Data Services Working Group: https://rd-alliance.org/groups/rdawds-publishing-data
-services-wg.html
9 DataReuseSimulator:https://github.com/PaoloMissier/DRS
whichsimulateddatareuseisonlythebeginning.
Thus, as our final contribution,we highlightsome of the challengesand set out a researchagendaforapracticalrealisationofourvisionofpervasivetrackingofpublished datathroughitslifetime.
Provlets
and
Data
Trajectories
Wenowoutlineanideal,theoreticalscenariowhereweassumethat(i)publisheddatasets areencapsulated as Research Objects (RO) (Bechhofer,De Roure, Gamble, Goble & Buchan,2010), whicharegiven uniqueandpersistent identifiersthroughcertified data repositorymanagers,and(ii)completeprovenancemetadataisavailable,whichdescribes eachinstanceofROreuse,atleastatahighlevel. Theresearch implicationsofrelaxing theseassumptionsarediscussedinthefinalsectionofthepaper.
ResearchObjects
Followingemerging practice for data preservation, specifically for scientific datasets, it is now becoming realistic to assume that units of publishable data be represented asResearch Objects. These arethe mainentities whosetrajectories we want totrack. ROsareencapsulationsofdataandmetadataofany type,describedbyaResource Map inORE format. Metadataartifacts may includethe description ofthe process(script, workflow)used togenerate thedata, theprovenance ofthedata, andother metadataof varyingtypes. Differentvocabularies,or ontologies, canbeusedin theResource Map tobestdescribe suchdiverse metadatacontent. Wealsoassume,following forexample DataCite and FigShare practices amongst others, that data publishers assign unique persistentID(PIDs),suchasDOIs,toROsuponpublication,andthatsuchPIDsareused consistentlythroughoutthederivationchain. ROformats may vary,rangingfromtheir original,complex,specification10,tothesimplernotionofDataPackagesasdefinedby
theDataONEproject11, totheevensimplerbutmoreradicalnotionofnanopublications
(Monsetal.,2011).
ThePROVModelforProvenance
We use the PROV provenance model (Moreau et al., 2012) as a foundation for a formal and machine-processable definition of Data Trajectories. A recent book on PROVdescribesthe W3Crecommendationthroughanumber ofcasestudies(Moreau & Groth, 2013). Using PROV, we can express derivation dependencies of the form “RO2 wasDerivedFrom RO1”, where RO1, RO2 are PROVEntities, i.e., data orother
artifacts to whichwe can associate a provenance. Further, if a program P is known
tohave used RO1 as input, and havegenerated a new RO2 as output, we can express
thederivation ofRO2 fromRO1 through P using thefollowing two PROV assertions:
< Pused RO1 >,< RO2 wasGeneratedBy P>,whichcollectivelyforma(verybasic)
PROVdocument. Here, PisanexampleofanActivity,i.e.,“somethingthatoccursover
10See:http://researchobjects.org
11See:https://releases.dataone.org/online/api-documentation-v1.2.0/design/DataPackage.html
aperiodoftimeandactsuponorwithentities”(Moreauetal.,2012).
WecanalsousePROVtoexplicitlyassociatebothEntitiesandActivitieswithActors, i.e,peoplebut also, possibly,automated systems,whohavebeen responsiblefor those EntitiesandActivities. Thefollowing PROVdocument extendsthe exampleaboveby includingattributionannotationsconcerningtwoactors A1, A2:
< PusedRO1 >,< RO2wasGeneratedByP> (1)
< RO1wasAttributedTo A2 >,< RO2wasAttributedTo A1 >,< PwasAssociatedWith A1> (2)
whereassertionsonline(1)describedependenciesamongsttheROs,andthoseonline (2)associatetheROsandtheprogramPwithAgents.
PROVdefinesthreetypesofsets: (i)Entities(En),i.e.,data,documents;(ii)Activities
(Act),whichrepresenttheexecutionofsomeprocessoveraperiodoftime,and(iii)Agents
(Ag),i.e.,humans,computingsystems,software. Wearegoingtousethefollowingsubset
ofrelationsamongstthesesets:
usage:used ⊆Act×En generation:wasGeneratedBy⊆En×Act derivation:wasDerivedFrom ⊆En×En
association:wasAssociatedWith ⊆Act×Ag attribution:wasAttributedTo⊆ En×Ag
Furthermore,to eachactivity a ∈ Act we associatea type, τact(a). Activity types are
usefultodescribepropertiesthatarecommontoasetofactivities,suchastheparameters used to compute transitive credit for ROs, as defined later. Finally, we represent a provenance documentas aDirected AcyclicGraph(DAG), where nodesdenote either Entities, Activities, or Agents, and an arc of the form x →−r y denotes the directed
relationshipr(x,y),wherer isoneoftherelationtypesabove.
Provlets
Wehavemade theideal assumptionthat completeprovenance is available todescribe eachinstance ofdatareuse. Moreprecisely,eachderivation/reuseeventinvolving ROs isdescribedby a smallPROVdocument, suchasthose shownabove. Wehavecoined theterm provlets to denote suchdocuments. Althoughin reality eachof theseevents
mayoccuronadifferentInformationSystemandatdifferenttimes,wealsoassumethat provlets,possiblycreatedindependentlyofeachother,areavailableforeachreuseevent. Takenindividually, eachprovlettellsalimitedstory ofanRO’s lifetime,aseachis concernedwithasinglederivationstep. However,aslongasthereisagreementamongst thesystemon consistentlyusingthe PIDsassignedto eachRO,it isstraightforward to combinea collectionof provlets thatcontain references to thesame RO,into a larger PROVdocument.
APublication-ReuseScenario
WeshowtheprovletsideaonasimpleROpublication-reusescenario,depictedinFigure 1. The scenario involves an initial RO, RO1, which is created andthen published by
Alicetodatarepository DR1. This ROislater discovered,downloaded,andreusedby
BobthroughaprocessP1,andindependentlybyCharliethroughprocessP2, resultingin
derivativeobjectsRO2, RO3, and RO4, respectively. Thesenew ROsmay bepublished
DR1
DR3
DR2
Alice
RO1 RO1
RO3 S2 RO3 RO4 RO3 RO5 RO2 Bob Charlie RO1 P2 P1 1⃣ 2⃣ 3⃣
Figure1. A hypothetical sequence of
publish-reuseactions. P1 Px P2 RO1 RO2 RO3 RO4 RO5 used used used used genBy genBy genBy genBy
Figure2. PROV graph for the
se-quenceontheleft.
intodifferentandseparatedatarepositories,e.g. DR2, DR3asinthefigure. HereAlice,
Bob,andCharliearemodelledasPROVAgents, andP1, P2asActivities. Notalldetails
about a derivation are always available. For instance, in this example RO2 and RO3
arelater themselvesreusedby some unknownAgent throughsome unknown Activity, generatingRO5asaresult. Table1liststheROreuseeventsforthisscenario,alongwith
thecorrespondingprovletsintextualandgraphform.
DataTrajectories
Givenaprovenance DAGp,considerthegraph p0obtainedbyreversingthedirectionof
thearcsin p. ForeachnodeROof p0, wedefinethetrajectory DT(RO)ofRO tobethe
treeobtainedbytraversing p0startingfromRO. WewriteDT.e(RO) andDT.a(RO) to denotethe setofEntity(i.e. RO)nodesandActivitynodes, respectively,thatappearin theDT(RO) tree. As anexample, the trajectoriesofeachof theROsfor the complete provenancegraphinFigure2arepresentedinFigure3. Notethatthisdefinitionallows anROtoappearinthetrajectoryofanotherROmorethanonce,forinstanceRO5appears
twiceinDT(RO1),becauseitisreachablefromRO1boththroughRO2andRO3.
P1 P2 RO1 DT(RO1): DT(RO3) P2 Px RO3 DT(RO3): DT(RO4) DT(RO5) RO5 DT(RO5): Px RO2 DT(RO5) DT(RO2) RO4 DT(RO4) DT(RO2): DT(RO4): Credit propagation
Figure3. SummaryoftrajectorytreesforeachoftheROsintherunningexample.
From
Data
Trajectories
to
Transitive
Credit
for
Data
Owners
To illustrate how this simple notion of data trajectories provides a foundation for experimentingwithmodelsoftransitivecredit,wedefineonesuchmodelasanexample.
Datareuseevent Provfragment
Alice
generatesRO1 Alice
RO1 DR1 RO1wasAttributedToAlice
Alice
RO1
wasAttributedTo
BobreusesRO1,
generatingRO2,RO3
DR1 RO1 P1 DR3 DR2 RO2 RO3 Bob
P1usedRO1,
RO2wasGeneratedByP1,
RO3wasGeneratedByP1,
RO2wasAttributedToBob,
RO3wasAttributedToBob,
P1wasAssociatedWithBob
wasAttributedTo P1 RO1 RO2 RO3 used genBy genBy Bob
CharliereusesRO1 and RO3,
generatingRO4through
P2
DR1
RO1 P2 DR3
RO4 RO3 Charlie
P1usedRO1,
P2usedRO1,
P2usedRO3,
RO4wasGeneratedByP2,
RO4wasAttributedToCharlie,
P2wasAssociatedWithCharlie
wasAttributedTo P2 RO1 RO3 RO4 used genBy Charlie
Unknown Agent reuses
RO2andRO3,
generatingRO5
through an unknown activity DR2 RO2 Px DR3 RO5 RO3
PxusedRO2,
PxusedRO3,
RO5wasGeneratedByPx
Px RO2
RO3
RO5
used genBy
Table1. ROreuseeventsandcorrespondingprovletsfortherunningexample.
Themodel isunderpinned by a simple principle: whena deriveddata productRO0 is
credited,i.e. bythecommunity,asavaluable researchdatacontribution,thenallofthe otherROsthatmadeRO0possibleshouldreceivesomeofthatcredit,inaproportionthat
dependson theirimportanceoncreatingRO0. ThemoreindispensableROis perceived
toRO0sderivation,themorecreditROshouldreceive. Thisprincipleappliestransitively
toaccountformultiplegenerationsofreuseandderivation. WeuseDataTrajectoriesto determinehow creditpropagates “upstream” from derivedROs,possibly several steps removed fromthe originalRO.We introduceanumber ofparameters, one foreach of thetypesτact(a) ofactivitiesathataccountfor theROtransformations,toquantifythe
notionofrelativeimportanceoftheupstreamROsinthederivationprocess. Ultimately, credittransfersfromtheROs totheAgents whoareresponsibleforthem, accordingto theEntityattributionassertionsinthePROVdocument.
Followingthisrationale,weseparatethetotalcreditascribedtoRO,denotedcr(RO), into two separate components. The first is the external credit, denoted crext(RO).
Thiscomponentaccommodatesany criteriathata communitymay decidetoadoptfor associatinga score toa published RO,and whichis independent onthe reuse history oftheRO.Suchscoremay,for example, reflectemergingcommunitypracticesondata citationsinrepositories. Thesecondcomponentofcr(RO) reflectsthereusehistoryof
RO. ItallowseachROintheprovenance graphtoreceiveafractionofthecreditthatis
ascribedtoeach“downstream”RO0∈DT.e(RO). Forthesakeoftheexample,weassume thatdownstreamcreditscombinelinearlytoprovidecredittoupstreamnodes.
Notethat thisis adefinition by induction, following thetree structureof DT(RO). The base case is that of a RO0 that has not been reused at all. In this case, only the
external, baseline credit component crext(RO0) applies. For the induction, we now
distinguishseveralPROVpatternsofreuse. Asummaryofthesepatterns,alongwiththeir correspondingcreditpropagationrulesandthetrajectorypatterns,isdepictedinFigure4.
RO reused p times
single-input, single-output activity
many-input, many-output activity
RO derivation with unknown activity
RO used genBy Ap A1 … … … … … genBy used credit
cr(RO) =
p
k:1
crak(RO)
RO used genBy A credit RO’
αa credit
cra(RO) = (a)·cr(RO ) +crext(RO)
used genBy A αa RO’m RO’1
RO1 ROi ROn
βa
γa
cra(RO) = a· ia· m
j:1
a
j·cr(ROj) +crext(RO)
wasDerivedFrom αder
RO’ RO
cr(RO) =
der
n ·cr(RO ) +crext(RO)
Figure4. ROreusepatterns,trajectories,andcreditpropagationrules
Tobegin,considerthemostgeneralcase, wherewe assumethatROhasbeenreused
byr differentactivities,a1...ar,possiblyatdifferenttimes,asinFigure4(a). Following
thestructureofDT(RO)fromFigure3,wedefinecr(RO)tobethesumofrdistinctcredit
components,cra1(RO)...crar(RO),eachduetooneactivityak thathasreusedRO:
cr(RO)=
r
X
k:1
crak(RO) (3)
We now progressively build up to a general definition of cra(RO), for a generic
activitya. WebeginwiththesimplestcasewhereROis usedby atogenerate asingle
newRO,RO0,asinFigure4(b). Asmentioned,wewantRO toreceiveafractionofRO0s
credit. To modeltheextenttowhichcreditpropagatesthrougha,we introduceacredit transferparameterα(a),with0 ≤ α(a) ≤ 1. Toexplainitsfunction,recallthattheideaof
creditpropagation throughareusepattern < aused RO >,< RO0wasGeneratedBy a >
isbasedupon theintuition thatRO0 owesitsvalue to bothRO, andthe transformation
a. Introducing α(a) allows us to explicitly model the value contribution due to the
transformation a, relative to that of its input data RO. For instance, consider a data
cleaningalgorithmthattakes noisydataRO andproducesacleanerversion,RO0, ofthe
samedata. OnemayarguethatmuchofthevalueinRO0isduetothealgorithm,rather
thantothe data. Wemodelthis by only transferringasmall portion ofcr(RO0) credit backtoRO,i.e.,by settingalowvalueforα(a). Notethatdiscussingspecificcriteriafor
settingthevaluesofthisandotherparametersintroducedinthemodelisbeyondthescope ofthispaperandleftforfurtherresearch,asmentionedinthelastsectionofthepaper.
Formally,wedefinethecreditpropagationruleforthegraphpatterninFigure4(b)as:
cra(RO) = α(a)·cr(RO0)+crext(RO) (4)
where cra(RO) is defined inductively in terms of cr(RO0), with the external credit
crext(RO0) asthebasecase.
Next, we extend Equation (4) to the case where RO is only one of n > 1 inputs
usedby a. Thisnewpattern isshowninFigure4(c). Inthisscenario,inadditiontothe
transferparameterα(a),wealsoaccountfortherelativeimportanceofeachoftheninputs
RO1...ROn. Wethereforeintroducennewfactors,0< βi(a) ≤1,i :1...n,subjectto:
n
X
i:1
β(a)
i =1
anddefine:
cra(ROi)= α(a) · βi(a) ·cr(RO0)+crext(ROi) (5)
Withthisnewdefinition,ROaccruesaproportionofthetotalcreditofRO,whichaccounts
for itsperceived importance in computingRO0using a. Notethat whenthere is only
oneinput,Equation(5)reducestoEquation(4)asexpected,andwhenallinputstoa are
equallyimportant,i.e. β(a)
i =
1
n foralli,Equation(5)becomes
cra(ROi)=
αa
n ·cr(RO
0
)+crext(ROi) (6)
Finally, weextend Equation(5)onemoretime, toaccountfor themostgeneral pattern wherenot only is RO only oneof the inputs, but also, a generates m > 1outputs, as
showninFigure4(d). Inthissituation,ROreceivescreditfromeachoftheoutputsRO0,
whichareallpartofDT(RO). Again,wemodelthedifferentimportanceascribedtoeach ofthesederiveddataproductsbyintroducingmnewfactorsγ(ja),subjectto
m
X
j:1
γa
j = m
anddefine:
cra(ROi) =αa· βia· m
X
j:1
γa
j ·cr(RO
0
j)+crext(ROi) (7)
Weconcludebyaddingthespecialcasewheretheactivitythataccountsforthe RO reuseisunknown. Inthiscase,weusethegenericdataderivationrelationship:
RO0wasDerivedFromRO (8)
where of course more than one RO0 may have been derived from RO. According to
thePROV constraints document (Cheney,Missier & Moreau, 2012), frompattern (8) we can infer the existenceof an activity a, suchthat both assertions < a used RO > ,< RO0wasGeneratedBy a > hold. Weintroduceafinalcredittransferparameter, αder,
to modelcredit propagation due to derivation. In this case, when there are n known
derivationsofRO,rule(4)becomes:
cr(RO) = α
der
n ·cr(RO
0
)+crext(RO) (9)
Finally, we stipulatethattheAgents Ag thatarementioned inthePROVdocument
accrueacreditcrag(Ag)thatissimplythesumofeverycreditassociatedtotheROsthey areresponsiblefor:
crag(Ag)= X
r
cr(r) overallROrs.t. < rwasAttributedTo Ag> holds. (10)
ModelSummary
We have shown how a formal notion of a data trajectory DT(RO), derived from a compositionof multiple, independently generated provlets, can be used to apportion credittodatapublishers. Asanexample,wehavepresentedamodelthatconsistsofthree mainelements:
• An external credit function, crext(RO), which associates a value to each RO
that appears in the compound provenance graph. Such value can follow any community-basedscoringschemeofdatarelevance;
• Asetofcreditpropagationrules(3)through(9)thatarecomputedinductivelyfrom
DT(RO)andwhichformalisethenotionoftransitivecredit,cr(RO);
• A setof credittransferparameters, whichaccountfor thenature oftheactivities involves inthetrajectoryofRO, including,wherethisinformationisavailable,the
relativeimportanceofeachofitsinputsandoutputs.
Simulating
Data
Trajectories
and
Credit
Propagation
Realisingan information management infrastructurethat iscapable ofgenerating data trajectoriesforallinstancesofdatareuseisalong-term,challengingresearchproposition, whichwearticulateinthefinalsectionofthispaper. Asastartingpointfortheresearch, wehaveimplementedaDataReuseSimulator,whichweuseasatoolforexperimenting
with various assumptions regarding the completeness of data trajectories, and with
differentcreditmodels.12. The simulatoriscapable ofgenerating two typesofevents:
(i)newinstances of datareuseandderivation, and(ii)updatesto theexternal creditof oneormoreoftheROs,ontheassumptionthatcommunity-ascribed creditmaychange over time. Data reuse events cause the generation of more ROs, the creation of the correspondingprovlets,and theupdateofdatatrajectoriestoreflectthenew derivation andusage/generationrelationships,asshownintheexampleofFigure3. Theyalsotrigger thepropagationoftheinitialexternal creditassociatedwithnew ROs,backwardsalong eachof therelevanttrajectories. Thesecondtype ofevents,changes toexternal credit, alsotriggersthepropagationofthecreditupdates.
Thesimulatorcanbeusedtoexploremanyscenariosofpossibletrajectorystructures and credit propagation dynamics, through the generation of random interleavings of events,withsomeusercontrol. Hereweshow thesimulatorin action,toreproducethe scenarioinFigure1. Wehavealsopresentedamorecomplexdatareusescenariointhe Appendix,toprovideabetterintuitionforthesimulator’scapabilities. TheplotinFigure5 showshowcreditchangesfortheROs,inresponsetokeyeventsinourexample,shown atthebottom. Initially, allnew ROshave thesameexternal credit value 1. Following thereference scenario,these valuespropagate through activitiesP1 andP2, as well as throughathirdunknownactivity.
Figure5. TotalcreditchangestoROsfollowingreuseandexternalcreditadjustmentevents.
Inthesimulator,we makethesimplifyingassumptionthatallinputstoanactivitya
areequallyimportant, i.e. weuseEquation(5) where β(a)
i = 1n foralli. Similarly,we
useasinglevalue γ(a) = m, thenumberofinputstoa. Withtheseassumptions,wecan expressthetype τact(a) ofanactivity a asatriple τact(a) = [α,β,γ]. Intheexample,
wehaveused τact(P1) = [0.5,1,0.5], and τact(P2) = [0.8,0.5,1]. Theimplicit activity
dt:act_297isassignedτact(P1) bydefault.
ThefigureillustratesthedifferentwaysthatthetotalcreditofeachROprogresses,ata fasterorslowerpacethanthatofothers,dependingontheamountofreuseandthetypeof
12Thecurrentversionofthesimulatorsoftwareisavailableathttp://github.com/PaoloMissier/DRS. Itis
implementedinPythonandmakesuseoftheSouthamptonprovenancesuite: http://provenance.ecs.soton .ac.uk
activitythatconsumestheRO.Asexpected,theoldestRO,RO1acquiresthehighestcredit
asits trajectoryextends over time, and asits descendentsacquire recognitionthrough additionalexternalcredit. NotethatcreditcanbetransferredfromROstotheagentsthat areresponsibleforthem,byusingtheattributionandassociationPROVrelationships.
Data
Trajectories
in
Practice:
Challenges
and
Research
The data trajectoriesand the transitive credit modelillustrated in this paper are both theoretical. Inreality,becauseofthebroaddiversityofwaysinwhichpublicdatacanbe usedwithoutcontrol,thevisionoftrackingdatausageinthewildfacesmanychallenges.
Weconcludebyhighlightingsomeofthesechallenges,andsetoutaresearchagendafor realisingtransitivecreditinpractice.
Trajectories are compositions of independently created provlets, which must be systematically generated by multiple, diverse, autonomous information systems, to theextent possiblethrough observationof data transformation processes. This is not unrealistic,asprovenancerecordersexistforlanguageslikePython(Murta,Braganholo, Chirigati, Koop& Freire, 2014)and R(Liu &Pounds, 2014; Lerner& Boose, 2014), aswellasformanyworkflowmanagementsystemsincludingTaverna,eScienceCentral, SciCumulus,Pegasus, Kepler. However,no systemtoday systematically harveststhese traces in a central place, where trajectories can be computed. This is a long-term infrastructureproblem,requiringconcertedeffortsacrossdatarepositoriesorganisations. Also,thegranularityatwhichprovenanceisrecordedvaries,dependingonthesystems’ provenancecapturecapabilities. Further,provletcompositionrequirestheconsistentuse ofdataidentifiersacrossinstancesofdatareuseandacrosssystems. Thisisbynomeans thenormtoday,althoughstandardsfor dataPIDs,likethosepromotedbyDataCite, are gaining acceptancein forums like the DigitalCuration Centre inthe UK13, and more
globally,theRDA.However,evenwhenidentifiersareavailabledataconsumershaveno obligationtoacknowledgetheirprimarysourceofdata. Thisisparticularlyproblematicin theso-calledlongtailofscience(Wallis,Rolando&Borgman,2013),whereconsumers
are less likely to record reuse in any systematic way. Credit management is further complicatedwhenROsareonlypartiallyreused,asthisviolatestheassumptionthatROs
areatomicdataentities.
To someextent,these issuescanbeaddressed throughalong-term plantodevelop infrastructuretosupportthenotionofdatatrajectoriesacrossthebroadresearchscience community. Morefundamentally,however,weshouldassumethattrajectoriesarealways boundtobefragmentedandincompleterepresentationsof actualdatareuse,leadingin turntounrealisticcreditassignments. Oursuggestedresearchagendaisthereforefocused onaddressingthefollowingkeyresearchquestions.
• Firstly, underwhatcircumstancesitispossibletoestimatethelikelihood ofsome of the missing derivations (for instance, using machine learning and predictive analyticstechniques)?
• Secondly, to what extent can the resulting probabilistic provenance graphs and trajectoriesbeusedtosupportuseful,fair,andcredibletransitivecreditmodels?
13DigitalCurationCentre: http://dcc.ac.uk
• Thirdly,whenusingacreditmodelthatreliesoncredittransferparameters, aswe haveshown,howarethesedetermined? Cantheybelearnt,oradjusteddynamically followingfeedbackfromthecommunity?
References
Bechhofer,S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010). Research objects: Towards exchange andreuse ofdigital knowledge. Nature Precedings. doi:10.1038/npre.2010.4626.1
Bollen, J., Van de Sompel, H. & Rodriguez, M. A. (2008). Towards usage-based impact metrics. In Proceedings of the 8thACM/IEEE-CS Joint Conference on DigitalLibraries -JCDL’08(p. 231). NewYork, NewYork, USA: ACMPress. doi:10.1145/1378889.1378928
Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., ... Wright, D. (2012). Makingdataa firstclassscientificoutput: Datacitationand publicationbyNERC’senvironmentaldatacentres. doi:10.2218/ijdc.v7i1.218
Cheney,J.,Missier,P.&Moreau, L. (2012). Constraintsoftheprovenancedatamodel
(Tech.Rep.). Retrievedfromhttp://www.w3.org/TR/prov-constraints/
Katz, D. S. (2014). Transitive credit asa means to address social and technological concerns stemming from citationand attributionof digitalproducts. Journal of OpenResearchSoftware,2(1),e20.
Lerner,B. S.&Boose, E. R. (2014). Collectingprovenance inaninteractive scripting environment. ProceedingsofTAPP’14.
Liu, Z. &Pounds, S. (2014). AnR package that automatically collects andarchives detailsforreproduciblecomputing. BMCBioinformatics,15(1),138. doi:10.1186/ 1471-2105-15-138
Mayernik,M. S. (2013). Bridgingdatalifecycles: Trackingdatauseviadatacitations workshopreport(Tech.Rep.). NationalCenterforAtmosphericResearch.
Mons,B.,vanHaagen,H.,Chichester,C.,Hoen,P.-B.T.,denDunnen,J.T.,vanOmmen, G., ... Schultes, E. (2011). The value ofdata. NatureGenetics, 43(4), 281–3. doi:10.1038/ng0411-281
Moreau, L. & Groth, P. (2013). Provenance: An introduction to PROV. Synthesis LecturesontheSemanticWeb: TheoryandTechnology,3(4),1–129.
Moreau,L.,Missier,P.,Belhajjame,K.,B’Far,R.,Cheney,J.,Coppens,S.,... Tilmes, C. (2012). PROV-DM: ThePROV datamodel (Tech.Rep.). World WideWeb
Consortium. Retrievedfromhttp://www.w3.org/TR/prov-dm/
Murta, L., Braganholo, V., Chirigati, F., Koop, D. & Freire, J. (2014). noWorkflow: Capturingandanalyzingprovenanceofscripts. ProceedingsofIPAW’14.
Piwowar,H.A.&Vision,T.J. (2013). Datareuseandtheopendatacitationadvantage.
PeerJ,1,e175. doi:10.7717/peerj.175
Robinson-García, N., Jiménez-Contreras, E. & Torres-Salinas, D. (2015). Analyzing datacitationpracticesusingthedatacitationindex. JournaloftheAssociationfor InformationScienceandTechnology. doi:10.1002/asi.23529
Wallis,J. C.,Rolando,E. &Borgman,C. L. (2013). Ifwe sharedata, willanyoneuse them? Datasharing and reusein thelongtail ofscience andtechnology. PLoS ONE,8(7),e67332. doi:10.1371/journal.pone.0067332
Appendix:
A
More
Complex
Instance
of
Simulated
Data
Reuse
Figure 6 shows a more complex simulated data reuse scenario, which includes 15 ROs, managed by nine Data Operators (the Agents at the top of the figure), with a randomcombinationoftenderivationandusage/generationevents. Theseare(randomly) interleavedwithtenexternalcreditupdateevents. Theresultingprogressionoftotalcredit overtimeisshowninFigure7.
Figure6. Theglobalprovenancegraphfortheentirereusehistory.
Figure7. ROtotalcreditprogressionforthedatareusescenarioofFigure6.