• No results found

Download Download PDF

N/A
N/A
Protected

Academic year: 2020

Share "Download Download PDF"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Data

Trajectories:

Tracking

Reuse

of

Published

Data

for

Transitive

Credit

Attribution

PaoloMissier SchoolofComputingScience

NewcastleUniversity,UK

Abstract

Theabilitytomeasuretheuseandimpactofpublisheddatasetsiskeytothesuccessof theopendata/openscienceparadigm. Adirectmeasureofimpactwouldrequiretracking data(re)useinthewild,whichisdifficulttoachieve.Thisisthereforecommonlyreplaced bysimplermetricsbasedondatadownloadandcitationcounts.Inthispaperwedescribe ascenario whereit ispossible totrackthe trajectoryof adatasetafter its publication, and show how thisenables the design of accuratemodels for ascribing credit to data originators.ADataTrajectory(DT)isagraphthatencodesknowledgeofhow,bywhom, andinwhichcontextdatahasbeenre-used,possiblyafterseveralgenerations. Weprovide atheoreticalmodelofDTsthatisgroundedintheW3CPROVdatamodelforprovenance, andwe showhow DTscanbe usedto automaticallypropagateafraction of the credit associatedwithtransitivelyderiveddatasets,backtooriginaldatacontributors. Wealso showthismodeloftransitivecreditinactionbymeansofaDataReuseSimulator.Inthe longerterm,ourultimatehopeisthatcreditmodelsbasedondirectmeasuresofdatareuse willprovidefurtherincentivestodatapublication. Weconcludebyoutliningaresearch agendatoaddressthehardquestionsofcreating,collecting,andusingDTssystematically acrossalargenumberofdatareuseinstancesinthewild.

Received29June2016 | Revisionreceived20September2016 | Accepted21September2016

Correspondence should be addressed to Paolo Missier, Claremont Road, Newcastle upon Tyne. Email:

[email protected]

Anearlierversionofthispaperwaspresentedatthe11thInternationalDigitalCurationConference.

TheInternationalJournalofDigitalCurationisaninternationaljournalcommittedtoscholarlyexcellenceanddedicated

totheadvancementofdigitalcurationacrossawiderangeofsectors. TheIJDCispublishedbytheUniversityof EdinburghonbehalfoftheDigitalCurationCentre.ISSN:1746-8256.URL:http://www.ijdc.net/

Copyrightrestswiththeauthors.ThisworkisreleasedunderaCreativeCommonsAttribution4.0 InternationalLicence.Fordetailspleaseseehttp://creativecommons.org/licenses/by/4.0/

InternationalJournalofDigitalCuration

(2)

Introduction

Thepracticeofpublishingresearchdatahasbeenmaturingrapidly,followingincreasing evidence that the combination of data sharing and emerging data citation practices representnewopportunitiesforextendingthevaluechainofthedata,ratherthanathreat to its owners (Piwowar & Vision, 2013). Reasons for publishing data, andscientific datasetsinparticular,includefaciltatingitsre-useandenablingitsvalidation. Aplethora of data repositories are availablewhere scientists can publish their datasets, assign a persistentidentifiertothem,andmakethemdiscoverable. Muchlessisknownaboutthe lifetimeofthosedatasetsaftertheirpublication,namelytheknowledgeofhow,bywhom,

andinwhatcontexts theyhavebeenre-used,andwhethersuchinstancesofre-usehave producedinteresting deriveddataproducts,possiblyafterseveralgenerations. Werefer tothisnewtypeofknowledgeasthetrajectoriesofpublisheddata(DataTrajectories,or

DT).ThemainhypothesisthatmotivatesourresearchisthatknowledgeofDTsmakesit possibletoquantifytheimpactandinfluenceofresearchdatathroughseveralgenerations ofreuseandderivation, transitively. Inturn, thiswilllead tonew notionsoftransitive credittodataowners,whichmayinformandextend currentdatacitationpractices. We

areawareofveryfewattemptsatdefiningtransitivecreditinthecontextofdatacitation. Amongst these is (Katz, 2014), where the concept is not fully formalised nor made operationalthroughmetadatamanagementandanalysis.

ChallengesinTrackingDataReuseandtheRoleofDataCitation

Whilecountingdatadownloadsfromrepositoriesisstraightforward,trackingtheirusage inthewildismuchmorechallenging. Datacanbereusedinendlesswaysthroughprogram logic, entirely or in part, on its own or combined with other datasets. Furthermore, suchderivationscan extendover severalgenerations, andmay takeplace on different, autonomousinformationsystemsanddataprocessingenvironments.

(Robinson-García,Jiménez-Contreras&Torres-Salinas,2015)describedatacitation practicesthatgobeyondsimpledownloadcountasvalid surrogatestodirecttrackingof datause. (Callaghanetal.,2012)recommendthatdatacitationshouldbebasedonsimilar reviewstagesasjournalarticles, asanecessaryfirststeptotreatingdataasafirstclass scientificobject. However,recognisingthecomplexityofdataderivation,theyalsoargue thatfurther mechanisms are neededto facilitate data transparencyand scrutiny. Even whendatacitationisprimarilyviewedasanextensionoftraditionalarticlepublication, trackingdata citationrequiresdifferentandmoresophisticated processesthan tracking datadownloads(Mayernik,2013).

EffortsinthisdirectionincludeThomson’sdatacitationindex1,aswellascommunity

effortssuchasthePublishingDataBibliometricsWorking Group2 attheResearchData

Alliance (RDA)3; the Snowball Metrics project4; Altmetrics; and Elsevier’s Metrics

1 WebofScienceDataCitationIndex:http://wokinfo.com/products_tools/multidisciplinary/dci/ 2 RDA/WDS Publishing Data Bibliometrics Working Group: https://rd-alliance.org/groups/rdawds

-publishing-data-bibliometrics-wg.html

3 ResearchDataAlliance:http://rd-alliance.org 4 SnowballMetrics:http://snowballmetrics.com

(3)

DevelopmentProgram5. In2014, theNSFfundedthe“Make your datacount”project,

managed by the PloS Open Access journal in collaborationwith DataONE6 and the

California Digital Library, to elicit ideas on data metrics from researchers. Earlier on, the MESUR project (Bollen, Van de Sompel & Rodriguez, 2008) focused on collectingevidence ofusage throughmany typesofevents, butmostlythoseassociated withreferences toarticles. OrganisationslikeDataCite7 promotethe useof persistent

identifiers,likeDOIs,fordata,whilethePublishingDataServicesWorkingGroup8atthe

RDAstudieswaystolinkdataandarticlepublications.

Contributions

Thispaperaimsto contributeto theunderstanding ofdirectdata reusemodels, andits implicationsforthe designofnew creditmodels basedondata reuse. Specifically,we makethefollowingcontributions.

Firstly, from the well-known notion of data provenance we derive a definition of thetrajectory ofa dataset D. Informally, this is the graph ofall direct and transitive

derivationsfromDtoanyotherD0,suchthatthereisaprovenancegraphthatincludesD

andD0,andD isreachablefromD0throughderivationandusage/generationpathsinthe

graph.

Secondly,we showhow perfectknowledge ofallsuchdependenciescanbeusedto formallydefinetransitivecredit,andwearegoingpresentonesuchcreditmodelindetail

asa concrete example. Transitive credit is basedon the principle that any credit that isassignedto derivativework D0 shouldpropagate transitively“upstream” to every D

suchthat D0isinthetrajectoryof D, i.e.,toany Dthatcontributedtothederivationof D0. Importantly,thismodelalsoaccommodatesanydirectcreditattributionthatmaybe

definedbythecommunity,beitbasedondatacitations,downloadcounts,orotherindirect criteria. Specifically,howmuchof D’screditshouldbeapportionedto D0isdetermined

bythedependencyrelationshipsalongthetrajectorypathfromDtoD0. Thus,weusethe

provenanceofD0toassignfaircredittoD,andthustoitspublisher(theAgentresponsible

forD,inprovenanceparlance).

Thirdly,wepresentaninstanceofourcreditmodelatworkonasimulateddatareuse scenario. With theunderstanding that many possiblesuch modelscan bedefined, we haveimplementedadatareusesimulator,9whichweuseasaresearchtoolforexploring

differentcreditmodels,andforunderstandingtheirimplicationsfordatapublishers. Thesecontributionsaredesignedtolaythefoundationsforfurtherresearchinthearea ofdatareuseanalysisbasedonprovenance. Inthissense,weareawarethattheconcepts presentedinthe paperarestillonly theoretical. Thereality oftracking datausage isa visionthatpresentsmanychallengesbecauseofthebroaddiversityofwaysinwhichpublic datacanbe usedwithout control, andthe lackof metadatamanagement infrastructure forgeneratingandcollectingprovenanceacrossmanyindependentinformationsystems. Implementingtheseideasin thewild istherefore along-term researchproposition, for

5 MetricsDevelopmentProgram:http://emdp.elsevier.com 6 DataONE:http://dataone.org

7 DataCite: http://datacite.org

8 Publishing Data Services Working Group: https://rd-alliance.org/groups/rdawds-publishing-data

-services-wg.html

9 DataReuseSimulator:https://github.com/PaoloMissier/DRS

(4)

whichsimulateddatareuseisonlythebeginning.

Thus, as our final contribution,we highlightsome of the challengesand set out a researchagendaforapracticalrealisationofourvisionofpervasivetrackingofpublished datathroughitslifetime.

Provlets

and

Data

Trajectories

Wenowoutlineanideal,theoreticalscenariowhereweassumethat(i)publisheddatasets areencapsulated as Research Objects (RO) (Bechhofer,De Roure, Gamble, Goble & Buchan,2010), whicharegiven uniqueandpersistent identifiersthroughcertified data repositorymanagers,and(ii)completeprovenancemetadataisavailable,whichdescribes eachinstanceofROreuse,atleastatahighlevel. Theresearch implicationsofrelaxing theseassumptionsarediscussedinthefinalsectionofthepaper.

ResearchObjects

Followingemerging practice for data preservation, specifically for scientific datasets, it is now becoming realistic to assume that units of publishable data be represented asResearch Objects. These arethe mainentities whosetrajectories we want totrack. ROsareencapsulationsofdataandmetadataofany type,describedbyaResource Map inORE format. Metadataartifacts may includethe description ofthe process(script, workflow)used togenerate thedata, theprovenance ofthedata, andother metadataof varyingtypes. Differentvocabularies,or ontologies, canbeusedin theResource Map tobestdescribe suchdiverse metadatacontent. Wealsoassume,following forexample DataCite and FigShare practices amongst others, that data publishers assign unique persistentID(PIDs),suchasDOIs,toROsuponpublication,andthatsuchPIDsareused consistentlythroughoutthederivationchain. ROformats may vary,rangingfromtheir original,complex,specification10,tothesimplernotionofDataPackagesasdefinedby

theDataONEproject11, totheevensimplerbutmoreradicalnotionofnanopublications

(Monsetal.,2011).

ThePROVModelforProvenance

We use the PROV provenance model (Moreau et al., 2012) as a foundation for a formal and machine-processable definition of Data Trajectories. A recent book on PROVdescribesthe W3Crecommendationthroughanumber ofcasestudies(Moreau & Groth, 2013). Using PROV, we can express derivation dependencies of the form “RO2 wasDerivedFrom RO1”, where RO1, RO2 are PROVEntities, i.e., data orother

artifacts to whichwe can associate a provenance. Further, if a program P is known

tohave used RO1 as input, and havegenerated a new RO2 as output, we can express

thederivation ofRO2 fromRO1 through P using thefollowing two PROV assertions:

< Pused RO1 >,< RO2 wasGeneratedBy P>,whichcollectivelyforma(verybasic)

PROVdocument. Here, PisanexampleofanActivity,i.e.,“somethingthatoccursover

10See:http://researchobjects.org

11See:https://releases.dataone.org/online/api-documentation-v1.2.0/design/DataPackage.html

(5)

aperiodoftimeandactsuponorwithentities”(Moreauetal.,2012).

WecanalsousePROVtoexplicitlyassociatebothEntitiesandActivitieswithActors, i.e,peoplebut also, possibly,automated systems,whohavebeen responsiblefor those EntitiesandActivities. Thefollowing PROVdocument extendsthe exampleaboveby includingattributionannotationsconcerningtwoactors A1, A2:

< PusedRO1 >,< RO2wasGeneratedByP> (1)

< RO1wasAttributedTo A2 >,< RO2wasAttributedTo A1 >,< PwasAssociatedWith A1> (2)

whereassertionsonline(1)describedependenciesamongsttheROs,andthoseonline (2)associatetheROsandtheprogramPwithAgents.

PROVdefinesthreetypesofsets: (i)Entities(En),i.e.,data,documents;(ii)Activities

(Act),whichrepresenttheexecutionofsomeprocessoveraperiodoftime,and(iii)Agents

(Ag),i.e.,humans,computingsystems,software. Wearegoingtousethefollowingsubset

ofrelationsamongstthesesets:

usage:usedAct×En generation:wasGeneratedByEn×Act derivation:wasDerivedFromEn×En

association:wasAssociatedWithAct×Ag attribution:wasAttributedToEn×Ag

Furthermore,to eachactivity a ∈ Act we associatea type, τact(a). Activity types are

usefultodescribepropertiesthatarecommontoasetofactivities,suchastheparameters used to compute transitive credit for ROs, as defined later. Finally, we represent a provenance documentas aDirected AcyclicGraph(DAG), where nodesdenote either Entities, Activities, or Agents, and an arc of the form x →−r y denotes the directed

relationshipr(x,y),wherer isoneoftherelationtypesabove.

Provlets

Wehavemade theideal assumptionthat completeprovenance is available todescribe eachinstance ofdatareuse. Moreprecisely,eachderivation/reuseeventinvolving ROs isdescribedby a smallPROVdocument, suchasthose shownabove. Wehavecoined theterm provlets to denote suchdocuments. Althoughin reality eachof theseevents

mayoccuronadifferentInformationSystemandatdifferenttimes,wealsoassumethat provlets,possiblycreatedindependentlyofeachother,areavailableforeachreuseevent. Takenindividually, eachprovlettellsalimitedstory ofanRO’s lifetime,aseachis concernedwithasinglederivationstep. However,aslongasthereisagreementamongst thesystemon consistentlyusingthe PIDsassignedto eachRO,it isstraightforward to combinea collectionof provlets thatcontain references to thesame RO,into a larger PROVdocument.

APublication-ReuseScenario

WeshowtheprovletsideaonasimpleROpublication-reusescenario,depictedinFigure 1. The scenario involves an initial RO, RO1, which is created andthen published by

Alicetodatarepository DR1. This ROislater discovered,downloaded,andreusedby

BobthroughaprocessP1,andindependentlybyCharliethroughprocessP2, resultingin

derivativeobjectsRO2, RO3, and RO4, respectively. Thesenew ROsmay bepublished

(6)

DR1

DR3

DR2

Alice

RO1 RO1

RO3 S2 RO3 RO4 RO3 RO5 RO2 Bob Charlie RO1 P2 P1 1⃣ 2⃣ 3⃣

Figure1. A hypothetical sequence of

publish-reuseactions. P1 Px P2 RO1 RO2 RO3 RO4 RO5 used used used used genBy genBy genBy genBy

Figure2. PROV graph for the

se-quenceontheleft.

intodifferentandseparatedatarepositories,e.g. DR2, DR3asinthefigure. HereAlice,

Bob,andCharliearemodelledasPROVAgents, andP1, P2asActivities. Notalldetails

about a derivation are always available. For instance, in this example RO2 and RO3

arelater themselvesreusedby some unknownAgent throughsome unknown Activity, generatingRO5asaresult. Table1liststheROreuseeventsforthisscenario,alongwith

thecorrespondingprovletsintextualandgraphform.

DataTrajectories

Givenaprovenance DAGp,considerthegraph p0obtainedbyreversingthedirectionof

thearcsin p. ForeachnodeROof p0, wedefinethetrajectory DT(RO)ofRO tobethe

treeobtainedbytraversing p0startingfromRO. WewriteDT.e(RO) andDT.a(RO) to denotethe setofEntity(i.e. RO)nodesandActivitynodes, respectively,thatappearin theDT(RO) tree. As anexample, the trajectoriesofeachof theROsfor the complete provenancegraphinFigure2arepresentedinFigure3. Notethatthisdefinitionallows anROtoappearinthetrajectoryofanotherROmorethanonce,forinstanceRO5appears

twiceinDT(RO1),becauseitisreachablefromRO1boththroughRO2andRO3.

P1 P2 RO1 DT(RO1): DT(RO3) P2 Px RO3 DT(RO3): DT(RO4) DT(RO5) RO5 DT(RO5): Px RO2 DT(RO5) DT(RO2) RO4 DT(RO4) DT(RO2): DT(RO4): Credit propagation

Figure3. SummaryoftrajectorytreesforeachoftheROsintherunningexample.

From

Data

Trajectories

to

Transitive

Credit

for

Data

Owners

To illustrate how this simple notion of data trajectories provides a foundation for experimentingwithmodelsoftransitivecredit,wedefineonesuchmodelasanexample.

(7)

Datareuseevent Provfragment

Alice

generatesRO1 Alice

RO1 DR1 RO1wasAttributedToAlice

Alice

RO1

wasAttributedTo

BobreusesRO1,

generatingRO2,RO3

DR1 RO1 P1 DR3 DR2 RO2 RO3 Bob

P1usedRO1,

RO2wasGeneratedByP1,

RO3wasGeneratedByP1,

RO2wasAttributedToBob,

RO3wasAttributedToBob,

P1wasAssociatedWithBob

wasAttributedTo P1 RO1 RO2 RO3 used genBy genBy Bob

CharliereusesRO1 and RO3,

generatingRO4through

P2

DR1

RO1 P2 DR3

RO4 RO3 Charlie

P1usedRO1,

P2usedRO1,

P2usedRO3,

RO4wasGeneratedByP2,

RO4wasAttributedToCharlie,

P2wasAssociatedWithCharlie

wasAttributedTo P2 RO1 RO3 RO4 used genBy Charlie

Unknown Agent reuses

RO2andRO3,

generatingRO5

through an unknown activity DR2 RO2 Px DR3 RO5 RO3

PxusedRO2,

PxusedRO3,

RO5wasGeneratedByPx

Px RO2

RO3

RO5

used genBy

Table1. ROreuseeventsandcorrespondingprovletsfortherunningexample.

Themodel isunderpinned by a simple principle: whena deriveddata productRO0 is

credited,i.e. bythecommunity,asavaluable researchdatacontribution,thenallofthe otherROsthatmadeRO0possibleshouldreceivesomeofthatcredit,inaproportionthat

dependson theirimportanceoncreatingRO0. ThemoreindispensableROis perceived

toRO0sderivation,themorecreditROshouldreceive. Thisprincipleappliestransitively

toaccountformultiplegenerationsofreuseandderivation. WeuseDataTrajectoriesto determinehow creditpropagates “upstream” from derivedROs,possibly several steps removed fromthe originalRO.We introduceanumber ofparameters, one foreach of thetypesτact(a) ofactivitiesathataccountfor theROtransformations,toquantifythe

notionofrelativeimportanceoftheupstreamROsinthederivationprocess. Ultimately, credittransfersfromtheROs totheAgents whoareresponsibleforthem, accordingto theEntityattributionassertionsinthePROVdocument.

Followingthisrationale,weseparatethetotalcreditascribedtoRO,denotedcr(RO), into two separate components. The first is the external credit, denoted crext(RO).

Thiscomponentaccommodatesany criteriathata communitymay decidetoadoptfor associatinga score toa published RO,and whichis independent onthe reuse history oftheRO.Suchscoremay,for example, reflectemergingcommunitypracticesondata citationsinrepositories. Thesecondcomponentofcr(RO) reflectsthereusehistoryof

RO. ItallowseachROintheprovenance graphtoreceiveafractionofthecreditthatis

(8)

ascribedtoeach“downstream”RO0∈DT.e(RO). Forthesakeoftheexample,weassume thatdownstreamcreditscombinelinearlytoprovidecredittoupstreamnodes.

Notethat thisis adefinition by induction, following thetree structureof DT(RO). The base case is that of a RO0 that has not been reused at all. In this case, only the

external, baseline credit component crext(RO0) applies. For the induction, we now

distinguishseveralPROVpatternsofreuse. Asummaryofthesepatterns,alongwiththeir correspondingcreditpropagationrulesandthetrajectorypatterns,isdepictedinFigure4.

RO reused p times

single-input, single-output activity

many-input, many-output activity

RO derivation with unknown activity

RO used genBy Ap A1 … … … … … genBy used credit

cr(RO) =

p

k:1

crak(RO)

RO used genBy A credit RO’

αa credit

cra(RO) = (a)·cr(RO ) +crext(RO)

used genBy A αa RO’m RO’1

RO1 ROi ROn

βa

γa

cra(RO) = a· ia· m

j:1

a

j·cr(ROj) +crext(RO)

wasDerivedFrom αder

RO’ RO

cr(RO) =

der

n ·cr(RO ) +crext(RO)

Figure4. ROreusepatterns,trajectories,andcreditpropagationrules

Tobegin,considerthemostgeneralcase, wherewe assumethatROhasbeenreused

byr differentactivities,a1...ar,possiblyatdifferenttimes,asinFigure4(a). Following

thestructureofDT(RO)fromFigure3,wedefinecr(RO)tobethesumofrdistinctcredit

components,cra1(RO)...crar(RO),eachduetooneactivityak thathasreusedRO:

cr(RO)=

r

X

k:1

crak(RO) (3)

We now progressively build up to a general definition of cra(RO), for a generic

activitya. WebeginwiththesimplestcasewhereROis usedby atogenerate asingle

newRO,RO0,asinFigure4(b). Asmentioned,wewantRO toreceiveafractionofRO0s

credit. To modeltheextenttowhichcreditpropagatesthrougha,we introduceacredit transferparameterα(a),with0 ≤ α(a) ≤ 1. Toexplainitsfunction,recallthattheideaof

(9)

creditpropagation throughareusepattern < aused RO >,< RO0wasGeneratedBy a >

isbasedupon theintuition thatRO0 owesitsvalue to bothRO, andthe transformation

a. Introducing α(a) allows us to explicitly model the value contribution due to the

transformation a, relative to that of its input data RO. For instance, consider a data

cleaningalgorithmthattakes noisydataRO andproducesacleanerversion,RO0, ofthe

samedata. OnemayarguethatmuchofthevalueinRO0isduetothealgorithm,rather

thantothe data. Wemodelthis by only transferringasmall portion ofcr(RO0) credit backtoRO,i.e.,by settingalowvalueforα(a). Notethatdiscussingspecificcriteriafor

settingthevaluesofthisandotherparametersintroducedinthemodelisbeyondthescope ofthispaperandleftforfurtherresearch,asmentionedinthelastsectionofthepaper.

Formally,wedefinethecreditpropagationruleforthegraphpatterninFigure4(b)as:

cra(RO) = α(a)·cr(RO0)+crext(RO) (4)

where cra(RO) is defined inductively in terms of cr(RO0), with the external credit

crext(RO0) asthebasecase.

Next, we extend Equation (4) to the case where RO is only one of n > 1 inputs

usedby a. Thisnewpattern isshowninFigure4(c). Inthisscenario,inadditiontothe

transferparameterα(a),wealsoaccountfortherelativeimportanceofeachoftheninputs

RO1...ROn. Wethereforeintroducennewfactors,0< βi(a) ≤1,i :1...n,subjectto:

n

X

i:1

β(a)

i =1

anddefine:

cra(ROi)= α(a) · βi(a) ·cr(RO0)+crext(ROi) (5)

Withthisnewdefinition,ROaccruesaproportionofthetotalcreditofRO,whichaccounts

for itsperceived importance in computingRO0using a. Notethat whenthere is only

oneinput,Equation(5)reducestoEquation(4)asexpected,andwhenallinputstoa are

equallyimportant,i.e. β(a)

i =

1

n foralli,Equation(5)becomes

cra(ROi)=

αa

n ·cr(RO

0

)+crext(ROi) (6)

Finally, weextend Equation(5)onemoretime, toaccountfor themostgeneral pattern wherenot only is RO only oneof the inputs, but also, a generates m > 1outputs, as

showninFigure4(d). Inthissituation,ROreceivescreditfromeachoftheoutputsRO0,

whichareallpartofDT(RO). Again,wemodelthedifferentimportanceascribedtoeach ofthesederiveddataproductsbyintroducingmnewfactorsγ(ja),subjectto

m

X

j:1

γa

j = m

anddefine:

cra(ROi) =αa· βia· m

X

j:1

γa

j ·cr(RO

0

j)+crext(ROi) (7)

(10)

Weconcludebyaddingthespecialcasewheretheactivitythataccountsforthe RO reuseisunknown. Inthiscase,weusethegenericdataderivationrelationship:

RO0wasDerivedFromRO (8)

where of course more than one RO0 may have been derived from RO. According to

thePROV constraints document (Cheney,Missier & Moreau, 2012), frompattern (8) we can infer the existenceof an activity a, suchthat both assertions < a used RO > ,< RO0wasGeneratedBy a > hold. Weintroduceafinalcredittransferparameter, αder,

to modelcredit propagation due to derivation. In this case, when there are n known

derivationsofRO,rule(4)becomes:

cr(RO) = α

der

n ·cr(RO

0

)+crext(RO) (9)

Finally, we stipulatethattheAgents Ag thatarementioned inthePROVdocument

accrueacreditcrag(Ag)thatissimplythesumofeverycreditassociatedtotheROsthey areresponsiblefor:

crag(Ag)= X

r

cr(r) overallROrs.t. < rwasAttributedTo Ag> holds. (10)

ModelSummary

We have shown how a formal notion of a data trajectory DT(RO), derived from a compositionof multiple, independently generated provlets, can be used to apportion credittodatapublishers. Asanexample,wehavepresentedamodelthatconsistsofthree mainelements:

• An external credit function, crext(RO), which associates a value to each RO

that appears in the compound provenance graph. Such value can follow any community-basedscoringschemeofdatarelevance;

• Asetofcreditpropagationrules(3)through(9)thatarecomputedinductivelyfrom

DT(RO)andwhichformalisethenotionoftransitivecredit,cr(RO);

• A setof credittransferparameters, whichaccountfor thenature oftheactivities involves inthetrajectoryofRO, including,wherethisinformationisavailable,the

relativeimportanceofeachofitsinputsandoutputs.

Simulating

Data

Trajectories

and

Credit

Propagation

Realisingan information management infrastructurethat iscapable ofgenerating data trajectoriesforallinstancesofdatareuseisalong-term,challengingresearchproposition, whichwearticulateinthefinalsectionofthispaper. Asastartingpointfortheresearch, wehaveimplementedaDataReuseSimulator,whichweuseasatoolforexperimenting

with various assumptions regarding the completeness of data trajectories, and with

(11)

differentcreditmodels.12. The simulatoriscapable ofgenerating two typesofevents:

(i)newinstances of datareuseandderivation, and(ii)updatesto theexternal creditof oneormoreoftheROs,ontheassumptionthatcommunity-ascribed creditmaychange over time. Data reuse events cause the generation of more ROs, the creation of the correspondingprovlets,and theupdateofdatatrajectoriestoreflectthenew derivation andusage/generationrelationships,asshownintheexampleofFigure3. Theyalsotrigger thepropagationoftheinitialexternal creditassociatedwithnew ROs,backwardsalong eachof therelevanttrajectories. Thesecondtype ofevents,changes toexternal credit, alsotriggersthepropagationofthecreditupdates.

Thesimulatorcanbeusedtoexploremanyscenariosofpossibletrajectorystructures and credit propagation dynamics, through the generation of random interleavings of events,withsomeusercontrol. Hereweshow thesimulatorin action,toreproducethe scenarioinFigure1. Wehavealsopresentedamorecomplexdatareusescenariointhe Appendix,toprovideabetterintuitionforthesimulator’scapabilities. TheplotinFigure5 showshowcreditchangesfortheROs,inresponsetokeyeventsinourexample,shown atthebottom. Initially, allnew ROshave thesameexternal credit value 1. Following thereference scenario,these valuespropagate through activitiesP1 andP2, as well as throughathirdunknownactivity.

Figure5. TotalcreditchangestoROsfollowingreuseandexternalcreditadjustmentevents.

Inthesimulator,we makethesimplifyingassumptionthatallinputstoanactivitya

areequallyimportant, i.e. weuseEquation(5) where β(a)

i = 1n foralli. Similarly,we

useasinglevalue γ(a) = m, thenumberofinputstoa. Withtheseassumptions,wecan expressthetype τact(a) ofanactivity a asatriple τact(a) = [α,β,γ]. Intheexample,

wehaveused τact(P1) = [0.5,1,0.5], and τact(P2) = [0.8,0.5,1]. Theimplicit activity

dt:act_297isassignedτact(P1) bydefault.

ThefigureillustratesthedifferentwaysthatthetotalcreditofeachROprogresses,ata fasterorslowerpacethanthatofothers,dependingontheamountofreuseandthetypeof

12Thecurrentversionofthesimulatorsoftwareisavailableathttp://github.com/PaoloMissier/DRS. Itis

implementedinPythonandmakesuseoftheSouthamptonprovenancesuite: http://provenance.ecs.soton .ac.uk

(12)

activitythatconsumestheRO.Asexpected,theoldestRO,RO1acquiresthehighestcredit

asits trajectoryextends over time, and asits descendentsacquire recognitionthrough additionalexternalcredit. NotethatcreditcanbetransferredfromROstotheagentsthat areresponsibleforthem,byusingtheattributionandassociationPROVrelationships.

Data

Trajectories

in

Practice:

Challenges

and

Research

The data trajectoriesand the transitive credit modelillustrated in this paper are both theoretical. Inreality,becauseofthebroaddiversityofwaysinwhichpublicdatacanbe usedwithoutcontrol,thevisionoftrackingdatausageinthewildfacesmanychallenges.

Weconcludebyhighlightingsomeofthesechallenges,andsetoutaresearchagendafor realisingtransitivecreditinpractice.

Trajectories are compositions of independently created provlets, which must be systematically generated by multiple, diverse, autonomous information systems, to theextent possiblethrough observationof data transformation processes. This is not unrealistic,asprovenancerecordersexistforlanguageslikePython(Murta,Braganholo, Chirigati, Koop& Freire, 2014)and R(Liu &Pounds, 2014; Lerner& Boose, 2014), aswellasformanyworkflowmanagementsystemsincludingTaverna,eScienceCentral, SciCumulus,Pegasus, Kepler. However,no systemtoday systematically harveststhese traces in a central place, where trajectories can be computed. This is a long-term infrastructureproblem,requiringconcertedeffortsacrossdatarepositoriesorganisations. Also,thegranularityatwhichprovenanceisrecordedvaries,dependingonthesystems’ provenancecapturecapabilities. Further,provletcompositionrequirestheconsistentuse ofdataidentifiersacrossinstancesofdatareuseandacrosssystems. Thisisbynomeans thenormtoday,althoughstandardsfor dataPIDs,likethosepromotedbyDataCite, are gaining acceptancein forums like the DigitalCuration Centre inthe UK13, and more

globally,theRDA.However,evenwhenidentifiersareavailabledataconsumershaveno obligationtoacknowledgetheirprimarysourceofdata. Thisisparticularlyproblematicin theso-calledlongtailofscience(Wallis,Rolando&Borgman,2013),whereconsumers

are less likely to record reuse in any systematic way. Credit management is further complicatedwhenROsareonlypartiallyreused,asthisviolatestheassumptionthatROs

areatomicdataentities.

To someextent,these issuescanbeaddressed throughalong-term plantodevelop infrastructuretosupportthenotionofdatatrajectoriesacrossthebroadresearchscience community. Morefundamentally,however,weshouldassumethattrajectoriesarealways boundtobefragmentedandincompleterepresentationsof actualdatareuse,leadingin turntounrealisticcreditassignments. Oursuggestedresearchagendaisthereforefocused onaddressingthefollowingkeyresearchquestions.

• Firstly, underwhatcircumstancesitispossibletoestimatethelikelihood ofsome of the missing derivations (for instance, using machine learning and predictive analyticstechniques)?

• Secondly, to what extent can the resulting probabilistic provenance graphs and trajectoriesbeusedtosupportuseful,fair,andcredibletransitivecreditmodels?

13DigitalCurationCentre: http://dcc.ac.uk

(13)

• Thirdly,whenusingacreditmodelthatreliesoncredittransferparameters, aswe haveshown,howarethesedetermined? Cantheybelearnt,oradjusteddynamically followingfeedbackfromthecommunity?

References

Bechhofer,S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010). Research objects: Towards exchange andreuse ofdigital knowledge. Nature Precedings. doi:10.1038/npre.2010.4626.1

Bollen, J., Van de Sompel, H. & Rodriguez, M. A. (2008). Towards usage-based impact metrics. In Proceedings of the 8thACM/IEEE-CS Joint Conference on DigitalLibraries -JCDL’08(p. 231). NewYork, NewYork, USA: ACMPress. doi:10.1145/1378889.1378928

Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., ... Wright, D. (2012). Makingdataa firstclassscientificoutput: Datacitationand publicationbyNERC’senvironmentaldatacentres. doi:10.2218/ijdc.v7i1.218

Cheney,J.,Missier,P.&Moreau, L. (2012). Constraintsoftheprovenancedatamodel

(Tech.Rep.). Retrievedfromhttp://www.w3.org/TR/prov-constraints/

Katz, D. S. (2014). Transitive credit asa means to address social and technological concerns stemming from citationand attributionof digitalproducts. Journal of OpenResearchSoftware,2(1),e20.

Lerner,B. S.&Boose, E. R. (2014). Collectingprovenance inaninteractive scripting environment. ProceedingsofTAPP’14.

Liu, Z. &Pounds, S. (2014). AnR package that automatically collects andarchives detailsforreproduciblecomputing. BMCBioinformatics,15(1),138. doi:10.1186/ 1471-2105-15-138

Mayernik,M. S. (2013). Bridgingdatalifecycles: Trackingdatauseviadatacitations workshopreport(Tech.Rep.). NationalCenterforAtmosphericResearch.

Mons,B.,vanHaagen,H.,Chichester,C.,Hoen,P.-B.T.,denDunnen,J.T.,vanOmmen, G., ... Schultes, E. (2011). The value ofdata. NatureGenetics, 43(4), 281–3. doi:10.1038/ng0411-281

Moreau, L. & Groth, P. (2013). Provenance: An introduction to PROV. Synthesis LecturesontheSemanticWeb: TheoryandTechnology,3(4),1–129.

Moreau,L.,Missier,P.,Belhajjame,K.,B’Far,R.,Cheney,J.,Coppens,S.,... Tilmes, C. (2012). PROV-DM: ThePROV datamodel (Tech.Rep.). World WideWeb

Consortium. Retrievedfromhttp://www.w3.org/TR/prov-dm/

Murta, L., Braganholo, V., Chirigati, F., Koop, D. & Freire, J. (2014). noWorkflow: Capturingandanalyzingprovenanceofscripts. ProceedingsofIPAW’14.

(14)

Piwowar,H.A.&Vision,T.J. (2013). Datareuseandtheopendatacitationadvantage.

PeerJ,1,e175. doi:10.7717/peerj.175

Robinson-García, N., Jiménez-Contreras, E. & Torres-Salinas, D. (2015). Analyzing datacitationpracticesusingthedatacitationindex. JournaloftheAssociationfor InformationScienceandTechnology. doi:10.1002/asi.23529

Wallis,J. C.,Rolando,E. &Borgman,C. L. (2013). Ifwe sharedata, willanyoneuse them? Datasharing and reusein thelongtail ofscience andtechnology. PLoS ONE,8(7),e67332. doi:10.1371/journal.pone.0067332

Appendix:

A

More

Complex

Instance

of

Simulated

Data

Reuse

Figure 6 shows a more complex simulated data reuse scenario, which includes 15 ROs, managed by nine Data Operators (the Agents at the top of the figure), with a randomcombinationoftenderivationandusage/generationevents. Theseare(randomly) interleavedwithtenexternalcreditupdateevents. Theresultingprogressionoftotalcredit overtimeisshowninFigure7.

(15)

Figure6. Theglobalprovenancegraphfortheentirereusehistory.

(16)

Figure7. ROtotalcreditprogressionforthedatareusescenarioofFigure6.

References

Related documents

Own mailchimp form style block and songwriter of satisfaction guaranteed live dvd sales sound recording industry association of children at the rock songs about the dvd of

Funding: Black Butte Ranch pays full coost of the vanpool and hired VPSI to provide operation and administra- tive support.. VPSI provided (and continues to provide) the

Abstract In this paper the well-known minimax theorems of Wald, Ville and Von Neumann are generalized under weaker topological conditions on the payoff function ƒ and/or extended

In Avesta, nouns, adjectives, participles and other parts of speech are formed by adding suffixes to roots. These nouns and adjectives are crude forms. If they have to

Whether grown as freestanding trees or wall- trained fans, established figs should be lightly pruned twice a year: once in spring to thin out old or damaged wood and to maintain

Players can create characters and participate in any adventure allowed as a part of the D&amp;D Adventurers League.. As they adventure, players track their characters’

А для того, щоб така системна організація інформаційного забезпечення управління існувала необхідно додержуватися наступних принципів:

As you may recall, last year Evanston voters approved a referendum question for electric aggregation and authorized the city to negotiate electricity supply rates for its residents