Data Appliance
– Sailing to Data Islands
By
Simon Ellwood-Thompson
SAIL Databank – Swansea, WALES
Following the PechaKucha, just some clarity: Wales and Scotland are the most interesting
SAIL DATABANK – Recent Developments
Medical Research Council (MRC) - Centre of Excellence
• SAIL DATABANK major asset
• CIPHER - one of the four co-ordinating centres of the Farr Institute
Economic and Social Research Council (ESRC)
• CADRE – one of four Administrative Data Research Centres (ADRCs)
Bio-Informatics award – Large compute cluster for Genetic Research
[Map labels. Farr Institute centres: Wales, Scotland, Manchester, UCL London. ADRCs: Wales, Scotland, Southampton (England), Northern Ireland]
FARR @ Swansea – Capital Investment
Additional capital investment. Our goals:
1. UKSeRP: Offer an expanded version of our infrastructure as a service (IaaS) to other major programmes (non-SAIL)
2. Data Appliance: Provide local capabilities to manage datasets so that dataset discovery and availability become easier
3. Natural Language Processing
Context: A large amount of automation has already been developed, but a massive increase in workload is predicted without any increase in staffing
FARR – UKSeRP (quick overview)
• Existing infrastructure
  • Large IBM DB2 data warehouse (database and management/processing code)
  • Remote access technology: SAIL Gateway, based on VMware View
  • Policies and procedures
  • Hosting, power, cooling, IT staff to support infrastructure
• Expand Technical Platform
  • Double SAIL Gateway and increase power of each desktop
  • Add software, e.g. SAS & BI tools
  • Add SQL Server 2012 3-node AG Cluster
  • Add HADOOP cluster (big data)
  • HyperV clusters
  • Additional 10 racks
• Additional management requirements
  • Now three database platforms
  • Selective delegated management and control
  • Multiple configuration and security models
[Diagram: UKSeRP platform components and storage. IBM DB2 Warehouse (Nodes 1-4 on two 24TB SANs, Backup, DB2 Head); SAIL VDI (VDI hosts on two 60TB SANs); NLP (two NLP servers on a 21TB SAN); SQL 2012 Availability Group (three SQL 2012 Enterprise servers, each with 21TB DAS); HADOOP (four Hadoop nodes on SAN); HyperV (four HyperV hosts on a 100TB SAN); Virtual Tape]
FARR – Data Appliance
• Goal: “Development of a hardware and software appliance for deployment into the NHS, Local Government and within SAIL to provide dataset collection, management, documentation and local linkage. These appliances will bring the capabilities previously only found in large data linkage centres to the organisation in which they are deployed”
• “A key outcome is to make documented datasets visible to the wider research community for inclusion in national projects, subject to information governance approval. These units are designed to be as low a unit cost as possible.”
• Funded to provide 15 Appliances to NHS/Government free of charge
• What's the point: provide a carrot, not a stick
• Give business benefit to an organisation by allowing it to create and manage datasets
• Provide locally linked datasets
• Create identifiable and anonymised views for its staff
• Provide documentation and validation of datasets
Data Appliance – simplistic viewpoint
Web Based Application
Empower end users to create and manage datasets.
No database expertise required, lowering the technical bar.
[Diagram: a Web Front End and FTP / ETL feed multiple DATASETs. Each DATASET provides: Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files]
Data “schema” automatically computed based on data contained in uploaded file
Publish based on permissions, configuration & capabilities
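The idea of computing a dataset “schema” from the contents of an uploaded file can be illustrated with a minimal sketch. This is not the appliance's implementation (which the slides describe as C#); the function names `infer_type` and `infer_schema`, the four type labels and the single ISO date format are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "text"

    def all_parse(parser):
        # True only if the parser accepts every non-empty value.
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    # Assumption: dates arrive as YYYY-MM-DD; real files need more formats.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def infer_schema(csv_text):
    """Map each column header to an inferred type and a nullable flag."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    schema = {}
    for column in reader.fieldnames or []:
        values = [(row.get(column) or "") for row in rows]
        schema[column] = {
            "type": infer_type(values),
            "nullable": any(v.strip() == "" for v in values),
        }
    return schema
```

A computed schema like this would only be a starting point; the Schema Editor listed among the dataset features suggests the user can then correct it.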
UKSeRP Key deliverable: Permission / Configuration / Capabilities
[Diagram: Data Appliance components. The Web Front End and FTP / ETL feed multiple DATASETs (each with Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files), all sitting on the Security, Configuration & Capability Model. Modules: Publishing, Local Data Catalogue, Linkage & Matching, Database Loader (File Splitter), Sharing & IG, Data Quality and Metrics. Databases: MS SQL, PostgreSQL. External: MS SQL, IBM DB2, HADOOP, PostgreSQL, Trusted Third Party Linkage & Matching, Other Appliance, Regional / Global]
Data Catalogue – Key Component
Additional points following previous sessions:
• Every Data Appliance (DA) carries a Data Catalogue (DC)
• A dataset (DS) can inherit DC entries from other datasets
• The DC is related to a Programme / Security domain
• DCs replicate to the Regional / Global DC
• Road map: the DC will be used to define and create datasets
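The inheritance of catalogue entries between datasets could look something like the following sketch. The `CatalogueEntry` class, its attribute names and the "local entries override inherited ones" rule are assumptions for illustration; the slides do not say how the appliance resolves inherited entries.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record describing one dataset."""
    name: str
    domain: str  # programme / security domain the entry belongs to
    fields: Dict[str, str] = field(default_factory=dict)
    parent: Optional["CatalogueEntry"] = None

    def resolved_fields(self) -> Dict[str, str]:
        """Merge inherited documentation; local entries override the parent's."""
        inherited = self.parent.resolved_fields() if self.parent else {}
        return {**inherited, **self.fields}
```

With this shape, a new extract of an existing dataset only needs to document the variables that differ from its parent.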
A Dataset
• Specific version & date
• Files can be attached to every section
• Contact, Request, VIMO, Theme / Type / Level, Tags
A Dataset (cont.)
Data Appliance very modular and configurable
Physical Server running a set of virtualised servers configured and scaled appropriately for the environment.
Architecture is based on loosely coupled async message passing between code blocks (Presented at SHIP 2013)
[Diagram labels: UKSeRP, Presentation; Data Appliance, Presentation]
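Loosely coupled asynchronous message passing between code blocks can be sketched as below. The appliance's software list includes RabbitMQ as the broker; this sketch uses in-process `asyncio.Queue` objects as a stand-in so that it runs without any external service. The stage names ("profiled", "loaded") and the `None` shutdown sentinel are assumptions for illustration.

```python
import asyncio


async def worker(inbox, outbox, handle):
    """Consume messages until a None sentinel arrives, then forward it."""
    while True:
        message = await inbox.get()
        if message is None:
            await outbox.put(None)  # tell the next stage to stop too
            break
        await outbox.put(handle(message))


async def run_pipeline(files):
    """Push file messages through two independent stages joined by queues."""
    uploaded, profiled, done = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(
            worker(uploaded, profiled, lambda m: {**m, "profiled": True})),
        asyncio.create_task(
            worker(profiled, done, lambda m: {**m, "loaded": True})),
    ]
    for name in files:
        await uploaded.put({"file": name})
    await uploaded.put(None)
    await asyncio.gather(*tasks)

    results = []
    while True:
        item = done.get_nowait()
        if item is None:
            break
        results.append(item)
    return results
```

Each stage knows only its inbox and outbox, so a module can be replaced or scaled without the others changing, which is the property the slide attributes to the architecture.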
3 initial configurations – plug and play, single cable
Small (Development / Demo)
• Single server with everything on it. 4 cores, 6GB
• Web site, Workflow engine, Modules, SQL Express, MongoDB, RabbitMQ
Medium (Single Physical Server)
• Single HyperV server, multiple virtual servers for different roles. Dual 10-core CPU (40 virtual cores), 160GB memory, 6TB disk
• SQL Express replaced by SQL Server 2012 Standard
• 10 special versions with extra modules for CliniThink NLP
Large (Four Physical Clustered Servers)
• Dual HyperV servers, dual 10-core CPU (40 virtual cores), 160GB memory, shared 24TB disk
• Dual SQL Server 2012 Enterprise, dual 8-core CPU, 96GB memory, 22 x 300GB local disk, SQL Server 2012 AG Cluster
Software: Custom software in C# .NET 4 / MVC 4. RabbitMQ, MongoDB, RavenDB for Large version, GoodSync FTP Replication
The Appliance is a disruptive technology
GAME CHANGER
• Challenge: fit everything that a large data linkage centre does into a single shrink-wrapped product
• Opportunity: Look back on what we have done and question the design. A unique opportunity for reflection, and rare in successful operational systems
Challenges
• UKSeRP: additional database systems. Need to support Microsoft SQL Server, PostgreSQL and Cloudera Hadoop as well as IBM DB2 Warehouse
• Opportunity to make these system-agnostic: remove vendor tie-in, allowing for options in the future
• Need a probabilistic matching engine to do data linkage
• Opportunity: our existing system is very slow and cannot be supported well by our trusted third party because of its age. It is also heavily biased towards gold-standard matching
• Partnership with Curtin University, Australia, allowing us to embed their system in the appliance and replace the trusted third party system. Additional benefit of increasing the capabilities of our matching beyond gold-standard matching. Looking forward to a continued partnership to explore Bloom filter matching, automated matching tuning, and dynamic recompilation of matching relationships based on project needs
• Special thanks to James Boyd & James Semmens, Curtin University
• Replace our residential matching and anonymisation system with Experian AddressBase, allowing integration with the matching engine and finer matching down to flats in multi-occupancy residences
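Bloom filter matching, listed above as an area to explore, is commonly described as encoding the character bigrams of an identifier into a bit array and comparing two encodings with a Dice coefficient. The sketch below shows that general technique only; it is not the Curtin University system, and the filter size, number of hash functions and function names are illustrative assumptions. A production scheme would also use secret keys in the hashing.

```python
import hashlib


def bigrams(text):
    """Return the set of padded character bigrams of a string."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, size=1024, num_hashes=4):
    """Return the set of bit positions switched on for the bigrams of text."""
    bits = set()
    for gram in bigrams(text):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two bit-position sets, from 0.0 to 1.0."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```

Because similar strings share most bigrams, near-matches such as spelling variants of a surname score close to 1 while unrelated names score near 0, and the comparison is made on the encodings rather than on the names themselves.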
Linkage: Migrating from ALF to New ALF2 and RALF2
SAIL uses a trusted third party; an additional benefit is the inclusion of process monitoring and remote reporting. End users will be able to see where in the process their requests are, which is a much better user experience
Challenges
• Need delegable/devolved account management, for both the appliance and UKSeRP
• Authentication, Authorisation, Accounting
• Opportunity to develop a new security model which covers all aspects of the infrastructure, not just the database, allowing the model to be validated and used in many ways
• Modular and extensible provisioning system that takes the model and applies the intention
• Event-driven rather than time-based: better service to the user
• Modular user activation, e.g. SAIL DAA, HR system lookup
• Ability to support multiple two-factor authentication systems
Challenges
• Dataset documentation is patchy and vague
• Structured documentation is now mandatory for a dataset to be loaded into the appliance. Datasets can only be loaded into SAIL using the appliance.
• Ability to attach artefacts (supporting documents) to a dataset
• Ability to load data as reference / lookup data within a dataset
• Opportunity: Partnership with Manitoba Centre for Health Policy, Canada. Special thanks to Mark Smith. Bring the automated data quality reporting to the appliance and turn the appliance back on the SAIL Databank to automatically collate and measure the quality metrics. We now have a database-system-agnostic data quality module. Looking forward to a continued partnership to look at measuring not just variable quality but relationships between variables and other datasets, as well as creating a pluggable architecture for dataset-specific statistical analysis
• Opportunity: Create an automatic “Data Catalogue” based on the dataset's documentation, computed metrics and validation rules. Link into the IHDLN Working Group on Metadata
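A database-agnostic quality module can be pictured as code that works on plain rows and per-variable validation rules, whatever platform the rows came from. The sketch below is an assumption-laden illustration, not the Manitoba or SAIL module: the three counts (missing, valid, invalid), the completeness ratio and the names `profile_column` and `profile_dataset` are all hypothetical.

```python
def profile_column(values, is_valid):
    """Count missing, valid and invalid entries for one variable."""
    counts = {"missing": 0, "valid": 0, "invalid": 0}
    for value in values:
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
        elif is_valid(value):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    total = len(values)
    # Share of rows that carry any value at all.
    counts["completeness"] = (total - counts["missing"]) / total if total else 0.0
    return counts


def profile_dataset(rows, rules):
    """Apply one validation rule per column to a list of row dictionaries."""
    return {
        column: profile_column([row.get(column) for row in rows], rule)
        for column, rule in rules.items()
    }
```

The same rules could double as the validation rules that feed an automatic catalogue entry, since both are expressed per variable.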
Challenges
• Data movement and routing, for both data and metadata
• Automated splitting of data files
• Trusted third parties
• Data refreshes
• Organisations outside NHS
• Sharing / Subscriptions
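Automated splitting of a data file can be sketched as follows, assuming the split separates identifying columns from the remaining content and ties the two halves together with a generated row key so they can be rejoined after linkage. The `split_file` function, the `row_key` column name and the use of random UUIDs are assumptions for illustration, not the appliance's File Splitter.

```python
import csv
import io
import uuid


def split_file(csv_text, identifying_columns):
    """Split CSV text into an identifiers file and a content file.

    Both outputs carry the same generated row_key on matching rows.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    identifying = [c for c in columns if c in identifying_columns]
    content = [c for c in columns if c not in identifying_columns]

    id_out, content_out = io.StringIO(), io.StringIO()
    id_writer = csv.DictWriter(
        id_out, fieldnames=["row_key"] + identifying, lineterminator="\n")
    content_writer = csv.DictWriter(
        content_out, fieldnames=["row_key"] + content, lineterminator="\n")
    id_writer.writeheader()
    content_writer.writeheader()

    for row in reader:
        key = uuid.uuid4().hex  # random key, carries no information itself
        id_writer.writerow({"row_key": key, **{c: row[c] for c in identifying}})
        content_writer.writerow({"row_key": key, **{c: row[c] for c in content}})
    return id_out.getvalue(), content_out.getvalue()
```

In this arrangement the identifiers file is the only part a trusted third party would need for matching, while the content file never carries the identifying columns.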
Why such a disruptive technology
(6 months to build!!)
The system was fine and everybody was happy
As we started designing, it became obvious how the system should be reconfigured with the Data Appliance as the central component.
[Diagram: the original SAIL DATABANK. Labels: IBM DB2, Remote Access, Security / IG, Additional Data, SAIL Technical]
Why such a disruptive technology
(6 months to build!!)
SAIL Databank is becoming an instance of UKSeRP and fully dependent on the Data Appliance
[Diagram: the reconfigured system. SAIL DATABANK and Other Major Programme sit on IBM DB2, HADOOP, MS SQL and NLP, with Remote Access / VDI and BI / Compute. The Data Appliance provides: Data Catalogue, Versioning, Data Management and Loading, Security / IG, Auto Documentor, File Splitting & Separation, Data transportation, Additional Data, Data Metrics and Quality reporting]
Why such a disruptive technology
(6 months to build!!)
• Deployment of SAIL Databank Satellites / SAIL Mini
  • 7 NHS trusts of Wales
  • 4 NHS trusts of Bristol (England)
  • 1 NHS Trust of North Devon
• Major upgrades to our trusted third party (NWIS)
• SAIL Databank major technology upgrade
Data Appliance
Simon Ellwood-Thompson SWANSEA UNIVERSITY