Simplifying the Design of Workflows
for Large- Scale
Data Exploration and Visualization
Juliana Freire Claudio Silva
http://www .cs. utah. edu/ "-J juliana http://www .cs. utah. edu/ "-J csilva
Workflows and Com putational Processes
• Workflows are em erging as a paradigm forrepresenting and managing com plex com putations
- Simulations, data analysis, visualization, data integration
• They capture computation and analysis processes, enabling
- Automation, reproducibility, result sharing
• Workflows are rapidly replacing prim itive shell scr i pt s
- Apple's Mac OS X Autom ator, Microsoft Windows Workflow Foundation, and Yahoo! Pipes
• Business Workflows => Scientific Workflows
- Important differences!
Workflows: Scientific vs. Business
• Express sequence of data transformations • Dataflow: Stateless, fu nct ion al • Data intensive,com puting intensive • Cater to a broad set of
users
• Ensure rules and
prescribed processes are followed
• Control flow (e.g., BPEL): State and side effects
c
Exploration and Workflows
• Workflows have been traditionally used to automate repet it ive tasks
• I n exploratory tasks, change is the norm!
Data analysis and exploration are iterative processes
Data Data ~ Workflow Specification .... ~ Data Product Manipulation
Microsoft eScience Workshop, 2008
T ~ Perception & Cognition Exploration .... ~ Knowledge User
Figure modified from J. van Wijk, IEEE Vis 2005
•
Data Exploration and Workflows
workflowraw data: CT scan
Files (workflow specifications) anon4877 _voxel_scale_1_zspace_20060331.srn anon4877 _textureshading_20060331.srn anon4877 _textureshading_planeO_20060331.srn anon4877 _goodxferfunction_20060331.srn Notes r lI\..~t~til L V~SUIA tLZlAtLDIA ,
WI.: Aototeot textuye til V1 Aototeot pLtilll\..e to
VI . ,
Explorat ion and Creat ivity Su pport
• Exploratory processes require reflective reasoning
• "Reflective reasoning requires the ability to store temporary results, to make inferences from stored knowledge, and to follow chains of reasoning
backward and forward, sometimes backtracking when a prom ising line of thought proves to be unfruitful .... the process is slow and laborious"
Donald A. Norm an
• Need external aids-tools to facilitate this process
- Creativity support tools [Shneiderman, CACM 2002]
• Need aid from people-collaboration
Data Exploration and Workflows: Issues
• Hard to assem ble and iteratively refine workflows
• Combine many tools and libraries: Need in-depth knowledge to weave them together
• No support for reflective reasoning
- E.g., history of the exploration trail maintained manually through file-nam ing conventions and detailed notes
- Hard to understand the exploratory process and relationships am ong workflows
• Lack of support for collaboration
Existing systems fail to provide the necessary
infrastructure for exploratory tasks. As a result, the
generation and maintenance of workflows is a major
VisTrails: Managing Scientific Exploration
• Goal: reduce time to insight
• Build infrastructure to streamline exploratory tasks
such as data analysis and visualization
• Support for collaboration
• Usability-provide tools and intuitive interfaces
• The VisTrails System: an open-source provenance-enabled scientific workflow system
- > 6,000 downloads since 2007
- Used in many applications: environmental modeling (OHSU), physics sim ulation (Cornell, LANL), medical studies (University of Utah), ...
Outline
• Using provenance to support reflective reasoning
• Explori ng and re- u si ng proven ance
- Querying workflows by exam pie - Creating workflows by analogy - Auto-com pletion for workflows
• Emerging applications
c
Change-Based Provenance
• Captures provenance of workflow
evolution
• Records user actions
• Provenance
=
changes tocom putational tasks
- Add a module, add a connection, change a parameter value
addModule deleteConnection addConnection addConnection setParameter
Q
= ~~n;V
--E~~
•
Change-Based Provenance
• Records user actions
• Provenance
=
changes to com putational tasks- Add a module, add a connection, change a parameter value
• Extensible change algebra
• A vistrail node vt corresponds to the workflow that is constructed by the sequence of actions from the root to vt
vt
=
xn
0xn_
1 ° ... 0
x
1 0
0
[Freire et ai, I PAW 2006]
Microsoft eScience Workshop, 2008
vistrail
Exploring the Change Space
• Script i ng workflows: Param eter explorat ions are
sim pie to specify and apply
• Exploration of param eter space for a workflow
v
t(setParameter(idn, valuen) 0 • • • 0 (setParameter(idt , value t ) 0 v
t )
l .... lVi.TraiI>Oulidor lorminalor,lIml g~~
"lkCan..era Set'/\eWUP(O,O, ,I) set9osIIon(7~S. ~S3, 369) 5«Foc-<13S. 135, ISO) vtkContourfilt ... 5et-.iue(O.67) vtkStructu,edPointsRe_
Exploring the Change Space
• Script i ng workflows: Param eter explorat ions are sim pie to specify and apply
• Exploration of param eter space for a workflow
v
t(setParameter(idn , valuen) 0 • • • 0 (setParameter(idt , valuet ) 0 v
t )
• Exploration of multiple workflow specifications
(addModule(id;, ..
J
0 (deleteModule(id;) 0 v1 )
(addModule(id;, ..
J
0 (deleteModule(id;) 0 vn )• Results can be conveniently compared in the VisTrails spreadsheet
• Can create an im at ions too!
• Caching to avoid redundant com putations [Bavoil et aI., I EEE Vis 2005]
Com puting Workflow Differences
• No need to com pute subgraph
isom orph ism!
• A vistrail is a rooted tree: all
nodes have a com m on ancestor-diffs are
well-defined and simple to
compute
vt1
=
Xio Xi_tO ... oxto0
vt
2=
Xj 0 Xj -1 0 • • • 0 X 1 00
vt 1-vt2 = {Xi' X i- t ' ... , X t , 0)
( Xj , xj - 1, •• .,X 1 , 0)
• Different sem ant ics:
Collaborative Exploration
• Collaboration is key to data exploration
- Translational, integrative approaches to science
• Store provenance information in a database
• Synchronize concurrent updates through locking
- Real-tim e collaboration [Ellkvist et aI., I PAW 2008]
• Asynchronous access: similar to version control
system s
- Check out, work offline, synchronize - Users exchange patches
• Synchronization is sim pie-provenance is
monotonic
• No need for a central repository-support for
distributed collaboration
- For details see Callahan et ai, SCI I nstitute Technical Report, No. UUSCI-2006-016 2006
Change- Based Provenance: Su m mary
• General: Works with any system that has undo/redo!
• Concise representation
• Uniform Iy captures data and workflow provenance
- Data provenance: where does a specific data product come from?
- Workflow evolution: how has workflow structure changed over time?
• Results can be reproduced
• Detailed information about the exploration process
• Provenance beyond reproducibility:
- Support for reflective reasoning
- Scalable exploration of the par am eter space-results can be com pared side-by-side in the spreadsheet
• Storing detailed information is important, but not
enough!
• Need appropriate user
interface and operations to leverage information
- Understand and re- use the history
• Simplify the creation of new workflows
Looking for Exam pies
• Need to query workflow collection:
- Find workflows that process a particular type of file - Find workflows that output a particular data product
- Find workflows that contain a given module or sequence of modules
• Workflow are graphs: hard to specify queries using text
Querying Workflows by Exam pie
• WYSIWYQ --
What You See Is
What You Query
• I nterface to create workflow is sam e as to query
Pipeline Interface 000000 vtkStructuredPolntsReader o 00000000 vtkOpenGLVolumeTextureMapper3D
Microsoft eScience Workshop, 2008
o
[Scheidegger et al., TVCG 2007
Query Result
.,-;..
---Query Interface
Refining Workflows
• Com plex workflows are hard to create
- Dam ain knowledge
- Fam iliarity with different tools Steep learning curve
r
... T ... ...
Bit fl:Iit »I'll' &M1 ~~ ~
JW ~ ~ PI' 0 ~Imt\l"'ton .... lQof:'Y p ... ~ .. r:~a!lon
:~DJ-·~- i~:
~~ e~ COl 00) PO~ l a--
0.4 005. 0.' o,~.. ,--""': .. =n-" ,...,.0..' 0 ... "10 0 +1 I. "to 0 +1 • :OMetrO.flSlOVom;: , ~:0".
~'
.
j !:: 1004 ." .. • O!Refining Workflows by Analogy
• Leverage the wisdom of the crowds in shared
provenance
- Sam e workflow refinem ents are com man, e. g., change the rendering technique, publish im age on the Web
• Apply refinem ents by analogy, autom atically
[Scheidegger et ai, IEEE TVCG 2007]
AnalogyTemplate
MicrosoTt e~clence vvorKsnop, ~UUts
~
1 - - - 1 1 1 1 1 l _ _ _ _ _ ... ~ ... '·r. r , :,.-.. .. ' --.
. ---. ...Automatically constructed visualizations
o
Refining Workflows by Analogy
poe Report
ProteIn Title
Authors
Atom Count
is to
NEURAL CELL ADHESION MOLECULE, MODULE 2, NMR, 20 STRUCTURES P.H.JENSEN, V.SOROKA, N.K.THOMSEN, V.BEREZIN, E.BOCK, F.M.POULSEN C:9560 H: 15440 N:2580 0: 2680 5:60
Links POB Entry
as
is to7
o
The Analogy Algorithm
1. Com pute difference: 8(A,8)
- Just like a patch! - But ...
D = 8(A,B) 0 C may not be a valid
workflow
2. Find correspondences between A
and C: map(A,C)
Diffuse similarity scores across the
product graph Axe using Eigenvalue
is to
as
lIJ
istoD
111 = DIFF(A, 8) 1 ~.
A
decom pos ition s PDBReport
~~---~~~temT~me-=STRU~CTU~REO~FTH~E ~
3. Com pute mapped difference Authors ~~~!f!~f~~f~~~:~~TER
1\ (A 8) (A C) 1\( A 8) Atom Count C:5934
UAC '
=
m ap , U , ~::]~~4. D
=
8 AC(A,8) 0C
Microsoft eScience Workshop, 2008
C: 20 N: 12 NA: 4 0:44 P: 6 Links .E.Qa..E.o1[y. ~ ________ ~ ______ ~ D
The Analogy Operation
• Allows workflows to be refined without requiring users to direct Iy m od ify the speeifieat io n
• Basis for scalable updates
• Analogies are not foolproof
- If it works, great. If it doesn't, it may help - User can edit and fix the new version
• I m prove by
- Using domain knowledge
The Need
for
Guidance
Workflow Design
vtkCeliTypes vtkCenteredSliderWidget •In
vtkChacoReader 00000000 000000000 OOOOOOOO[ 0 DO
vtkCh, 000000 lateErode31
1 vtkWarpVector vtkExtractEdgE vtkFieldDat; vtklmplicitSum n
vtkCh. vtkPlecewlseFunction DO
~~rD 0 J - ouuuuuu __ :.JUUUUL 0000000000 JOO 00000000
vtk vtkRungeKutta2 IUnaY;jl)OO :one vtkLODActor vtkAxes vtkTransformFilter metl'1 vtkSphereSource
vtk 0 0 DO . DO
vtk L-JL-JL-J nn~ooo ~UUUUUOOOOOOOO
D~D[tmlc..tL"J'- tklmageActor
I
0 DataSetReader 10 [ vtkOpenG LVolu meTextureMapper3Do vtkCylin vtkProperty2D - - 0
vtkTransform 01000000 0 0000000000 ...JL-JL-JL-JL-JL-JL-JL-JL-JL-JL-JL-J&:::J
DO I ... JrowSource vtkPolyDataMapper vtklmageReslice
VTKC~~tTnCnnr.rAtA r'r 0 DO
~kD'Dc10nOOOOI 0 IctorStyle 00000000 0_______ Juuu ...
vt vtkElevation vtkCylindricalTransform -.JUUUL 0 vtkDICOMlmageRe vtkLlneSourc~r:J !IIDataToPointDat~n
~ ~LJLJLJLJLJLJLJLJ 0 vtkActor DO JOOOOOOOOOO
vtkConeSource vtkBVUReader ""] 0 0 0 0 0 0 0 0 vtkActor2D 0 vtkGlyph3D 000
vtkConnectivityFilter vtkP rt I vtklmageFFT vtkTextActor
~~~~IODOOOOOOOOOOO ~ rope y 0 JGL-JL-JL-J oog~ooooo 0 0
vtkCo vtklmageResample b Ax LJ
A t 2D vtkHyperOctree vtkWarpScalar
vtkCo DO e es c or - vtkParametricSuperEllipsoid 0 L
vt~ 00000000 [
00-vtkCoordinate 0000______ S bd' .. F'lt VII\\,onnec;;t uuuuuuuu
)erOctrE vtkLoop u IVlslon I er
0000000 0 DO vtkCamer;l vtklmageDataGeometryFilter vtkStructuredGridReader 00000000 DO DO :J 0 0 0 0 0 0 0 0 vtkM-000000000 . edGraalem~naaer- 000000000 vtkCursor2[ WWWWWWWW vtkDelaunay2D c . . [ vtkXVZMolReader vtkCursor3[ vtklmageClip~~ vtkWelghtedTranSformFllteor 0 0000000 DO vtkCurvatun vtkCutter 00000000000 tkFrustumCoverage\,ul u u u u u u u u u u u u t..J 0 0 0 0 0 0 0 0 0 U U U
vtkCylinder . .&1,"1 n'T'''''U~eader vtkMergeFllter vtkRungeKutta4
V~CYI~n~e.rsource 00000000 DO. nnnnnoooo 0
~ , vtkPla 00000000 rammableF 00000000 0000000
00000 vtkButterflySubdivisionFilter 100000 vtkSurfaceReconstructionFilter vtkRenderer
vtkScalarE DO DO 0
u VlK:")ueamTracer
DO
VisCom plete: A Workflow
Recom m endation System
• Mine provenance collection: Identify graph
fragments that co-occur in a collection of workflows
• Predict sets of likely workflow additions to a given
part ial workflow
• Similar to a Web browser suggesting URL
com pletions ~kOSPFiIt'rGroup \<tkOataOb,ectlbDataSlt .. IAkOttaS.I'SutfaceFdtlt ~kOatiiSeIToDataObl·ct ... -.tkOataSelltlanoleFlhv oA.kOecunal.pO~ln.F,'t.r \rtkElevatlonFllter IAkFleldDiltaTDAttribute. IAkGenerlcContoutFll!.er 'A.kGenerICGeomttryRlter-\AkGenerlcG/yptlJDAlter IotkGlnelicOUtllMRlttr \otkGener\cProbeFlller \OtkGtometf)flll:er IAkGradlentFiller IAkG~phLayOUtl1lt.r tA.kH~chlc.lOatilLewt., tAkH\et.rchicalDataS.tG.
[Keep et al., IEEE Vis 2008]
Builder
.BI. ~dlt lilI-am ~ ElIlr.n Help
1I1kDataSeITo08IaOb,eCI ... \o1kDalaSft"',an~F"" VlI:Declm.teP~nef1lter ..n.kElevauonAlttr vtkFI.ldO.t.ToAttnbut .... vtkGtnlncContourFllttf vllffia"aNCGoomltl)'flhaf vtkGenlncGlyphlOFliter vtkG"n.rkOut~n"FNt.r vtkGenencProbeFllter IItkGtOrMll)fiIttr \/Iir:GJadl.ntAIt,t \'tkG~hl.¥lIlRIt.,. vt"HlerarchlcaIOalaLew:l •• vtkHler.rchlc .. fO .. tiiSetG ... vtkIottP8fOdr . . Contourf •••
VisCom plete: A Workflow
Recom m endation System
• Identify graph fragments that co-occur in a
collect ion of workflows
• Predict sets of likely workflow additions to a given
part ial workflow
• Similar to a Web browser suggesting URL
com pletions
OOOOOD
Microsoft eScience Workshop, 2008
o 0000
vtkOataSetRaadar
D
VisCom plete: Dem
0http://www.cs.utah.edu/''-Jjuliana/videos/viscomplete_h_264.mov
[Koop et al., IEEE Vis2008]
Resu Its Su m mary
• Eliminates over 50% of actions
• Selected completions are almost always in the first
four suggestions
• A database of sim pie pipelines can aid users
constructing more complex pipelines
• See [Koop et al., TVCG 2008] for details on how the
path database is constructed and on the completion algorithm
Conclusions and Future Work
• Appropriate support for exploratory tasks is
essential for a wider adoption and more effective use of scientific workflow systems
• Provenance can be used to support reflective
.
reasoning
• Intuitive interfaces for simplifying the construction and refinem ent of workflows
• Sharing workflows provenance at a large scale creates new opportunities
- Workflow/ provenance repositories; provenance- enabled publications
- Scientists can learn by exam pie; expedite their scientific
Acknowledgments: Funding
• This work is partially supported by the National
Science Foundation, the Department of Energy, an IBM Facu Ity Award, and a Un iversity of Utah Seed Grant.
Microsoft eScience Workshop, 2008
DOE SciDAC VISUALIZATION AND ANALYTICS CENTER FOR ENABLING TECHNOLOGIES
1 • •
1
.
1
-.
. 1. - . .' I • •
-- -- -- ----- ------ - -
---- --
---
Acknowledgments: People
• VisTrails Group - Claudio Silva - Erik Anderson - Jason Callahan - Steven P. Callahan - Lorena Carlo - David Koop - Lauro Lins - Emanuele Santos - Carlos E. Scheidegger - Huy T. Vo - Geoff DraperFor more info about VisTrails
Visit: http://www.vistrails.org