Data Lab Architecture
[Figure: Data Lab layered architecture diagram]
Presentation Layer: Astronomer’s Desktop (Legacy Apps, User Code, Cmdline Tools, Web Page); Data Lab Ops (User Mgmt, Monitoring)
Services Layer: Public Services (Resource Resolver, Storage Mgr, Query Manager, Job Manager, Authentication); Private Services (Ops Monitor, Private Repo, Public Repo)
Data Access Layer: Data Access Services (VOSpace, SCS, SSA, SIA, TAP, SQL Service, with UWS)
Resources Layer: Storage Resource (User Space, Virtual Space); Compute Resource (Compute Jobs); External Resources (VO Data, VO Svcs, NSA); Databases (Data Pub, Ops DBs, Large Cats, MyDB)
Presentation Layer
This layer contains the primary user interfaces.
Astronomer’s Desktop
– Web clients -- data query forms, content browsers, monitors, etc.
– Command-line tools -- for local desktop access
– Legacy Apps -- incl. scripting environments such as Python
– User-written code -- custom science clients
– Login shells
Operator Tools
– System Monitoring / Administration
– User and Resource management
Services Layer
This layer provides interfaces used mostly by software.
Public Services
– Authentication / Authorization – controlled access to D/L
– Job Manager – manage compute jobs
– Query Manager – manage large data queries
– Storage Manager – manage virtual storage resources
– Resource Resolver – locate services / resources within D/L
Private Services
Data Access Layer
This layer provides interfaces to data services.
Simple VO data services
– Catalog/images/spectra – positional (+constraint) based query
– Anonymous access allowed
Advanced Catalog Services
– Full SQL query capability
• VO standard interface (public access)
• Custom SQL interface (authorized access)
Virtual storage
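A positional query against a simple VO data service can be sketched as building a request URL. This is a minimal illustration, assuming the Simple Cone Search convention of RA, DEC and SR parameters in decimal degrees; the endpoint URL and the handling of extra constraints are hypothetical and not taken from these slides.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the actual Data Lab service URL is not given here.
SCS_BASE_URL = "https://datalab.example.org/scs/catalog"


def cone_search_url(ra_deg, dec_deg, radius_deg, base_url=SCS_BASE_URL, **constraints):
    """Build a cone-search style URL for a positional (+constraint) query."""
    if not 0.0 <= ra_deg < 360.0:
        raise ValueError("RA must be in [0, 360) degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("Dec must be in [-90, 90] degrees")
    if radius_deg <= 0.0:
        raise ValueError("search radius must be positive")
    # RA, DEC and SR follow the Simple Cone Search parameter names.
    params = {"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg}
    # Extra constraints are service-specific; passing them through is an assumption.
    params.update(constraints)
    return base_url + "?" + urlencode(params)
```

No credentials appear in the request, which matches the anonymous-access model of the simple services.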
Service vs. Access Layers
Why the need for different layers?
Service Layer Access Layer
Astronomer Friendly X
Authorized Access X
Anonymous Access X X
Direct VO Protocols X
Job Control X Depends
Data Lab API X X
Virtual Observatory API X
Web Interface X* X
Programmatic (Desktop) Interface X* X*
Resources Layer
This layer describes physical / logical resources in the D/L.
Databases
– Large (distributed) Catalog DB
– Personal DB (similar to SDSS MyDB)
– User-published datasets
– Operational DB
Physical Storage
– Persistent user storage
– Virtual storage
Compute Resources
– Servers for processing workflows
External Services
Large Catalogs
• Require a low-cost, scalable and reliable solution
• No viable turnkey system available
• The LSST QServ project will give us valuable experience
• Presents a “normal” DB interface to client
– Can put TAP/SQL service in front of it
QServ
• Can optimize data partitioning through experimentation
• Requires dedicated hardware for each catalog instance
Virtual Storage
• Implemented using disk filesystem as back-end
– Simplifies exported service for use on local user file systems
– Provides options for D/L operations:
• User-based partition scheme
• Legacy code can bypass VOSpace protocols (via FUSE mounted filesystem)
• Cons: Potential synchronization issues
• Containers used to package service
– Bundle dependencies
– FUSE mounts for other containers
• Exploit protocol’s support of:
– Capabilities
– Views
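The user-based partition scheme on a disk back-end can be sketched as a mapping from a user's virtual path to a backing filesystem path. This is only an illustration: the storage root, the function name and the one-directory-per-user layout are assumptions, not the Data Lab implementation.

```python
from pathlib import PurePosixPath

# Hypothetical root of the disk back-end.
STORAGE_ROOT = PurePosixPath("/data/vospace")


def user_path(user, virtual_path, root=STORAGE_ROOT):
    """Map a user's virtual path to a path inside that user's partition."""
    if not user or "/" in user or user in (".", ".."):
        raise ValueError("invalid user name")
    parts = PurePosixPath(virtual_path.lstrip("/")).parts
    # Refuse any path that could climb out of the user's partition.
    if ".." in parts:
        raise ValueError("path escapes the user's partition")
    return root.joinpath(user, *parts)
```

Because the result is an ordinary filesystem path, legacy code reading from a FUSE mount and the VOSpace service would see the same files, which is also where the synchronization concern noted above comes from.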
[Figure: Virtual Storage Service Container -- labels: Image/Table Support Apps, Data Lab Interfaces, Python, VOSpace, Database, Base Docker OS, Local Disk Container]
Example - Bringing It All Together
[Figure: example deployment -- NOAO Data Lab (DL Tasks, Virtual Storage Svcs, Large Catalog Svcs, Data Publication Svcs, MyDB), PI/Survey, NSA, User 1 Desktop (Virtual Storage Svc, DL Tasks, MyDB), User 2 Laptop (Virtual Storage Svc, Legacy Tools, Data Publication Svc); steps labeled 1(a), 1(b), 1(c), 2(a)]
Compute Services / Virtualization
Task Containers
• Why are they interesting?
– Provide task-level virtualization
– Much smaller in size, faster to start up
– Bundles / isolates dependencies
– Container images can be layered
• E.g. a “base Python 2.7 environment”
– Containers have their own IP address
– Users can “login” to a container
– Can be deployed to other Clouds easily
– Growing user / developer community
– Repository of public containers available
[Figure: Task Container -- labels: Tasking Interface, <<Task>>, Params, Results, Data Lab Support Code, Base OS Image, Disk Cache Mount, Virtual Storage (FUSE)]
Compute Services / Virtualization
Task Containers
• What can you contain?
– Web applications
– Desktop Tools
– Almost anything….
Tasking Interface
– Handles UWS communications with the Job Manager
• Allows for setting of parameters, results collection, timeouts
– Redirects stdio streams back to calling client
Container Storage
– Persistent cache container shared in a workflow
– Virtual storage can be mounted as part of environment
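The Tasking Interface behavior described above (set parameters, enforce a timeout, collect results, hand stdio back to the caller) can be sketched with a child process. This is a minimal stand-in, not the Data Lab code: `run_task`, the `name=value` argument convention and the result dictionary are hypothetical, and the phase strings merely borrow UWS phase names.

```python
import subprocess
import sys


def run_task(argv, params=None, timeout_s=60.0):
    """Run a task as a child process and return its phase and stdio streams."""
    # Parameters are passed as name=value arguments (an assumed convention).
    cmd = list(argv) + [f"{key}={value}" for key, value in (params or {}).items()]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run when the timeout expires.
        return {"phase": "ABORTED", "stdout": "", "stderr": "timed out", "returncode": None}
    phase = "COMPLETED" if proc.returncode == 0 else "ERROR"
    return {
        "phase": phase,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }


if __name__ == "__main__":
    # Example: a trivial "task" that echoes its first parameter.
    result = run_task([sys.executable, "-c", "import sys; print(sys.argv[1])"], {"ra": 10})
    print(result["phase"], result["stdout"].strip())
```

In a container setting the command would be the container's entry point rather than a local interpreter; the capture-and-return pattern stays the same.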
Compute Services / Job Manager
Job Manager
• Parallelizes a request based on user parameters
– User-defined independent input list to parallelize
• Initializes a job on the remote compute server
• Executes as sync or async job
– UWS for job control
• Polls for completion
• Gets result objects
• Returns results to client
– Or, creates new transfer job
• Manages hundreds of jobs
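The fan-out over a user-defined list of independent inputs can be sketched as follows. Local threads stand in for jobs on a remote compute server, so this shows only the parallelize-and-collect pattern; `run_parallel` and the result records are hypothetical names, and the phase strings borrow UWS phase names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(task, inputs, max_workers=4):
    """Apply `task` to each independent input; return results in input order."""
    inputs = list(inputs)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One job per input; the inputs are assumed independent of each other.
        futures = [pool.submit(task, item) for item in inputs]
        for item, future in zip(inputs, futures):
            try:
                results.append({"input": item, "phase": "COMPLETED", "result": future.result()})
            except Exception as exc:
                # A failed job is reported without stopping the others.
                results.append({"input": item, "phase": "ERROR", "result": str(exc)})
    return results
```

Collecting results per input keeps one failed job from hiding the results of the rest, which matters once hundreds of jobs are in flight.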
[Figure: Job Manager and Tasking Interface -- labels: UWS Client, Job Manager, ssh, Tasking Interface, fork(), <<Task>>, stdio streams]
Query Manager / SQL Service
Query Manager
• Provides a high-level, uniform interface for clients to query data services
– Hides the sync/async job handling and VO protocols from clients
– Orchestrates result handling (download, save to virtual storage, etc.)
SQL Service
• Provides job control for queries by implementing UWS
• Offers options for query-result handling
– Store to personal database, virtual storage, direct download, etc.
– Download format options (FITS, etc.)
• Offers alternative to VO TAP
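Hiding async job handling from clients amounts to a blocking wait over job phases. The sketch below assumes a caller-supplied `get_phase` callable that returns the current UWS phase string; how that phase is actually fetched from the SQL Service is not covered in these slides, so the function name and parameters are hypothetical.

```python
import time

# Phases after which a UWS job will not change state again.
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}


def wait_for_job(get_phase, poll_interval_s=1.0, max_polls=60, sleep=time.sleep):
    """Poll a job until it reaches a terminal phase, then return that phase."""
    for _ in range(max_polls):
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        # Still PENDING, QUEUED or EXECUTING: wait before asking again.
        sleep(poll_interval_s)
    raise TimeoutError("job did not reach a terminal phase")
```

A client calling such a wrapper sees a single blocking call whether the underlying query ran synchronously or as an async job.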
Data Publication
• Capability is used in multiple contexts
– Public access to high-level data products (static)
– Private access used in workflows (transient)
– Semi-private access within a collaboration (shared)
• Shared responsibility between D/L and Users
– D/L provides tools, resources and a publishing framework
– Users provide the content and the scientific curation
• Low-cost, simple services for all datasets
Storage Manager
• Provides a simple interface for user applications
– Hides details of the Virtual Storage implementation (VOSpace)
– Can map to idiomatic filesystem interfaces easily (e.g. get, put, list)
• Abstracts easily to web, desktop and programmatic APIs
• Provides authenticated access to data holdings
• Manages the details for other Data Lab services
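The get / put / list mapping can be sketched as a small facade. Here a local directory stands in for the virtual storage back-end; `LocalStorageManager` and its method signatures are hypothetical and are not the Data Lab API.

```python
import shutil
import tempfile
from pathlib import Path


class LocalStorageManager:
    """Hypothetical get/put/list facade; a directory stands in for virtual storage."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.root.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name):
        target = (self.root / name).resolve()
        # Reject names that would land outside the storage root.
        if target != self.root and self.root not in target.parents:
            raise ValueError("path escapes the storage root")
        return target

    def put(self, local_path, name):
        dest = self._resolve(name)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_path, dest)

    def get(self, name, local_path):
        shutil.copyfile(self._resolve(name), local_path)

    def list(self, prefix=""):
        base = self._resolve(prefix) if prefix else self.root
        return sorted(p.relative_to(self.root).as_posix()
                      for p in base.rglob("*") if p.is_file())


def demo():
    """Store one file and list the storage contents."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "local.txt"
        src.write_text("hello")
        store = LocalStorageManager(Path(tmp) / "store")
        store.put(src, "alice/notes.txt")
        return store.list()
```

Callers only see names relative to the store, so the back-end could be swapped for VOSpace without changing client code.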
Authentication / Authorization
• Implementation deferred in Year-1 due to potential landmines in a changing landscape
– General user support not needed, trusted users only
– Y1 services call a null interface, marking where authentication is needed in the code without requiring a working service
– Various authentication methods under discussion
• Requests to public services are passed through automatically
– Implies the service knows which services are public and which are private
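The null-interface idea can be sketched as an authenticator that always approves, so call sites exist in the code before a real service does. Everything here is hypothetical: the class and function names, and in particular the set of public services, which is an assumption made for illustration.

```python
# Assumed set of public services; the real list is not given in these slides.
PUBLIC_SERVICES = {"scs", "sia", "ssa", "tap"}


class NullAuthenticator:
    """Placeholder that approves every request (trusted users only)."""

    def is_authorized(self, user, service):
        return True


NULL_AUTH = NullAuthenticator()


def handle_request(user, service, auth=NULL_AUTH):
    """Pass public requests through; ask the authenticator for everything else."""
    if service in PUBLIC_SERVICES:
        return "pass-through"
    if auth.is_authorized(user, service):
        return "authorized"
    raise PermissionError(f"{user!r} may not use {service!r}")
```

Replacing `NullAuthenticator` with a real implementation later would not require touching the call sites, which is the point of routing Year-1 services through the null interface.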