iDigBio Technology, Cloud and Appliances
Jose Fortes (on behalf of the iDigBio IT team)
iDigBio External Advisory Board Meeting 2012 (Project Year 1)
Advanced Computing and Information Systems laboratory
CI Stakeholders
Stakeholder categories: Domain Data Producers; Infrastructure Providers; Domain Service Providers; Domain Data Consumers; National/Global Data Aggregators
[Figure: iDigBio stakeholder diagram. Labels: Museums, Amazon WS, Google, Microsoft Azure, DataONE, TCNs (appears several times), Collectors, GBIF, ALA, Researchers, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, EOL, Government, Translation, OCR, BISON, NESCent, Data Conservancy, iPlant, Teachers, Citizens]
Stakeholders APIs
Stakeholder categories: Domain Data Producers; Infrastructure Providers; Domain Service Providers; Domain Data Consumers; National/Global Data Aggregators
[Figure: iDigBio stakeholder diagram with API flows. Stakeholder labels: Museums, Amazon WS, Google, Microsoft Azure, TCNs (appears several times), Collectors, GBIF, ALA, Researchers, Citizens, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, EOL, Government, Translation, OCR, BISON, DataONE, Data Conservancy, NESCent, iPlant, Teachers]
[Flow labels: Domain data; Appliances; BLOBs; Updates; Notification; Query results; Customer requests; Processed data; Domain-level data; Usage track]
Interface Model for iDigBio and TCNs
Infrastructure Providers, National/Global Data Aggregators, Domain Service Providers, Domain Data Consumers
[Figure: interface model around "iDigBio + Resources"; labels as extracted]
Standards and protocols: TDWG, XMPP, OCCI-WG, REST, WS, WS-I, TAPIR, HTTP, SQL, UTF-8, RDF, XML, X.509, OpenID, SAML, TCP, JPEG2000, ODBC
Resources and services: Virtual Appliances, Machines, Storage, Networking, Learning Modules, Archiving, Data Collections, Structured Data Services, Wiki, Workshop Resources, Workflow Engines, Taxonomic Validation, Data Conversion, Geographical Mapping, Collaboration Tools, Non-structured Data Services
Providers and partners: TCNs, National History Museums, Google App Engine, XSEDE, Microsoft Live, Amazon EC2/S3, Applied Innovations, Microsoft Azure, Google Apps, Federal Collections
Building the iDigBio Cloud
• Cloud-based strategy
o Providing useful services/APIs (programmatic and web-based)
o Federated, scalable object storage and information processing
o Digitization-oriented virtual appliances
• Reliance on standards, proven solutions and sustainable software
• Continuous consultation with stakeholders
Unique UF+FSU record
Track record of building cyberinfrastructure
PUNCH and In-VIGO
nanoHUB, Netcare, In-VIGO Blast …
Morphbank, AFRESH
Telecenter, Archer
Keeping our eyes on the ball
Common/frequent needs: archival storage, server hosting, feedback on the data, data-intensive transformations …
10-year tsunami of requirements: from being on Facebook to multilingual search-and-compute across multiple data sets…
Evolution of iDigBio capabilities
[Figure: capabilities added over time; time axis marked Q3/2012, Q3/2013, Q3/2014, Q3/2015]
Data ingestion
Data access, provision and visualization
Provide and enable data feedback
Data linking and federation
Process and visualize integrated data
Increasing storage and server hosting in support of the above
Increasing number of appliances in support of the above
Web site for interaction with public, community, education and above
• Textual data
o JSON document database
o Data ingestion via DwC-A files
o Get / Set API
• Image Data
o Internet-accessible object storage
o Upload appliance
o Limited access to low-level APIs
[Figure labels: Textual Data (Riak); Image Data (Swift); API Gateway; Internet access]
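The textual-data path above (DwC-A files ingested into a JSON document database with a get/set API) can be illustrated with a minimal sketch. It assumes a Darwin Core Archive whose meta.xml describes a single delimited core file; `DocumentStore`, `read_dwca_core` and `build_example_archive` are hypothetical names introduced here for illustration and are not the iDigBio API, and the in-memory dictionary only stands in for a real document database.

```python
import csv
import io
import json
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive metafiles (meta.xml).
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def read_dwca_core(archive_bytes):
    """Yield one dict per record of the core file of a Darwin Core Archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find(DWC_TEXT_NS + "core")
        location = core.find(DWC_TEXT_NS + "files").find(DWC_TEXT_NS + "location").text
        delimiter = core.get("fieldsTerminatedBy", ",")
        if delimiter == "\\t":  # meta.xml usually spells a tab as the two characters \t
            delimiter = "\t"
        skip = int(core.get("ignoreHeaderLines", "0"))
        # Map column index -> short term name (last path segment of the term URI).
        columns = {}
        id_el = core.find(DWC_TEXT_NS + "id")
        if id_el is not None:
            columns[int(id_el.get("index"))] = "id"
        for field in core.findall(DWC_TEXT_NS + "field"):
            if field.get("index") is None:  # fields that only carry a default value
                continue
            columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
        text = zf.read(location).decode("utf-8")
    for n, row in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter)):
        if n < skip:
            continue
        yield {name: row[i] for i, name in columns.items() if i < len(row)}


class DocumentStore:
    """In-memory stand-in for a JSON document database (hypothetical get/set API)."""

    def __init__(self):
        self._docs = {}

    def set(self, key, doc):
        self._docs[key] = json.dumps(doc, sort_keys=True)

    def get(self, key):
        return json.loads(self._docs[key])


def build_example_archive():
    """Build a tiny, made-up archive in memory so the sketch is self-contained."""
    meta = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence"
        fieldsTerminatedBy="," ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/country"/>
  </core>
</archive>"""
    data = "id,scientificName,country\nrec-1,Quercus alba,US\nrec-2,Pinus taeda,US\n"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", meta)
        zf.writestr("occurrence.csv", data)
    return buf.getvalue()


if __name__ == "__main__":
    store = DocumentStore()
    for record in read_dwca_core(build_example_archive()):
        store.set(record["id"], record)
    print(store.get("rec-1"))
```

Keying each document by the archive's record id keeps "set" idempotent, so re-ingesting the same archive overwrites records rather than duplicating them.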
• Textual Data
o JSON document database
o Data ingestion via DwC-A files
o Rich RESTful API
• Image Data
o Web-accessible object storage
o Upload appliance
o Fully abstracted storage
• Indexing and Search
o Extract EXIF data from images
o Limited but useful set of indexes
o Intuitive search UI
o Search available via API
• Portal
o Consumes and interfaces with the text, image and search APIs (minimal server-side code)
o Web-based mapping: client-side JavaScript limits the usable record count to about 50k records at a time
[Figure labels: Textual Data (Riak); Image Data (Swift); API Gateway; Internet access; Filter Set; Query interface; EXIF extraction; iDigBio Portal]
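Given the limit of roughly 50k records at a time noted for web-based mapping, one way a client might stay under it is to consume search results in bounded batches. The sketch below is only an illustration: `batches` and `MAX_MAP_RECORDS` are hypothetical names, and the 50,000 figure is taken from the approximate limit quoted above rather than from any documented setting.

```python
from itertools import islice

# Approximate per-view limit quoted for client-side mapping (assumed value).
MAX_MAP_RECORDS = 50_000


def batches(records, size=MAX_MAP_RECORDS):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Stand-in for a stream of search results.
    for number, chunk in enumerate(batches(range(120_000)), start=1):
        print("batch", number, "holds", len(chunk), "records")
```

Because the function works on any iterable, the same loop applies whether results arrive from an API page by page or from a local file.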
Virtual Appliances in iDigBio
Packaging of software and dependencies in virtual machines
End user/desktop (e.g. VMware, VirtualBox)
Infrastructure-as-a-Service clouds (e.g. OpenStack)
Enhance user experience, facilitate integration with cloud
Image ingestion appliances (short term)
Batch upload of images from a local storage to cloud
Generate GUID/URLs for later processing
Reliable transfers using cloud APIs (e.g. Swift/iDigBio)
Post-processing appliances
(OCR tools; end-user or batch)
Geo-referencing appliances
(Training/verification)
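"Reliable transfers using cloud APIs" can be sketched as retry with backoff plus a checksum comparison. This is a minimal illustration under stated assumptions: `put_object` is a hypothetical callable that uploads bytes and returns the checksum reported by the store (Swift, for example, returns an MD5-based ETag for ordinary objects), and `FlakyStore` is a made-up stand-in used only to exercise the retry path.

```python
import hashlib
import time


class TransferError(Exception):
    """Raised when an upload cannot be completed and verified."""


def reliable_upload(put_object, name, data, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Upload `data` under `name`, retrying on failure or checksum mismatch.

    `put_object(name, data)` is a hypothetical callable returning the hex MD5
    that the object store reports for the stored object.
    """
    expected = hashlib.md5(data).hexdigest()
    last_error = None
    for attempt in range(attempts):
        try:
            etag = put_object(name, data)
            if etag == expected:
                return etag
            last_error = TransferError("checksum mismatch for %s" % name)
        except OSError as exc:  # network-style failures
            last_error = exc
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between tries
    raise TransferError("upload of %s failed after %d attempts" % (name, attempts)) from last_error


class FlakyStore:
    """Made-up object store that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.objects = {}

    def put(self, name, data):
        if self.failures > 0:
            self.failures -= 1
            raise OSError("simulated network failure")
        self.objects[name] = data
        return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    store = FlakyStore(failures=2)
    print(reliable_upload(store.put, "1/100.tif", b"image bytes", base_delay=0.01))
```

Passing the upload function in as a parameter keeps the retry logic independent of any particular cloud client.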
Archer cyber-infrastructure
Hundreds of distributed compute/router nodes; 24/7 operation, 650+ cores
Custom appliance image for computer architecture community
Job scheduling across participating institutions
Now: appliance proposal process
By users/developers through the iDigBio Web portal
Requirements: demonstrated usage/buy-in, software license, documentation, etc.
Queue of appliances for integration
iDigBio will prioritize and work with developers
Leverage expertise in appliance development
Focus on images that users can download and run on VMware or VirtualBox
Short term
[Figure: ingestion appliance. Labels: Images captured (e.g. HD/flash media); File interface; Web-based UI; Web server; Cloud client; Batch upload, Cloud APIs; iDigBio object storage cloud (Swift)]
Example files: /images/1/100.tif, /1/101.tif, /2/200.tif, …
Example mapping: /1/100.tif → GUID1, /1/101.tif → GUID2
Facilitate data ingestion, interface with iDigBio
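The file-to-GUID mapping shown for the ingestion appliance can be sketched as a manifest built by walking a local image directory. The slide does not say how GUIDs are generated, so random UUIDs are an assumption here, as are the `build_manifest` name and the list of image extensions.

```python
import os
import uuid

# File types treated as images (assumed list, for illustration only).
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def build_manifest(root, new_guid=lambda: str(uuid.uuid4())):
    """Map each image path under `root` (relative, "/"-separated) to a new GUID."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            if os.path.splitext(filename)[1].lower() not in IMAGE_EXTENSIONS:
                continue
            full_path = os.path.join(dirpath, filename)
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            manifest[relative] = new_guid()
    return manifest


if __name__ == "__main__":
    import sys

    # Usage: python manifest.py /path/to/images
    for path, guid in build_manifest(sys.argv[1]).items():
        print(path, guid)
```

Recording the manifest before upload lets later processing steps refer to an image by its GUID regardless of where the file originally lived.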
Medium-term – “Marketplace”
[Figure labels: iDigBio Portal; Users/Developers; Proposals; Community appliances; iDigBio appliances; iDigBio Personnel; End users]
Long-term – information processing
[Figure labels: iDigBio Portal; Users/Developers; Community appliances; End users; iDigBio]
Summary
iDigBio cloud
Service-oriented standards-based cyberinfrastructure focused on the ADBC community needs
Scalable data management and information processing using standard interfaces, data formats, protocols, tools
Toolboxes as appliances
Evolving collection of community-selected tools
Built-in interfaces for effortless iDigBio integration
Embedded best practices and standards in biocollections work
Software re-use when open-source, well maintained, manageable, sustainable and efficient to re-purpose
Feedback and suggestions welcome
Acknowledgments
National Science Foundation
Judith Skog and Anne Maglia
iDigBio team at University of Florida and Florida State University
Extras
iDigBio IT Vision
Cyberinfrastructure to enable
the collaborative creation, integration and management of digitized biocollections,
their use in scientific research, education and outreach
Visible as a collection of persistent Internet-accessible
services, data and resources
For biocollection “producers”
For biocollection “consumers”
For biocollection service providers
For cyberinfrastructure providers
For national/global data aggregators