
Data Appliance Sailing to Data Islands


Academic year: 2021


(1)

Data Appliance

– Sailing to Data Islands

By

Simon Ellwood-Thompson

(2)

SAIL Databank – Swansea, WALES

(3)

SAIL Databank – Swansea, WALES

Following the PechaKucha, a point of clarification: Wales and Scotland are the most interesting

(4)


(5)

SAIL DATABANK – Recent Developments

Medical Research Council (MRC) - Centre of Excellence

• SAIL DATABANK major asset

• CIPHER - one of the four co-ordinating centres of the Farr Institute

Economic and Social Research Council (ESRC)

• CADRE – one of four Administrative Data Research Centres (ADRCs)

Bio-Informatics award – Large compute cluster for Genetic Research

Map labels (Farr Institute centres): Wales, Scotland, Manchester, UCL London

Map labels (ADRCs): Wales, Scotland, Southampton (England), Northern Ireland

(6)

FARR @ Swansea – Capital Investment

Additional capital investment. Our goals:

1. UKSeRP: Offer an expanded version of our infrastructure as a service (IaaS) to other major programmes (non-SAIL)

2. Data Appliance: Provide local capabilities to manage datasets so that dataset discovery and availability become easier

3. Natural Language Processing

Context: A large amount of automation has already been developed, but a massive increase in workload is predicted without an increase in staffing

(7)

FARR – UKSeRP (quick overview)

• Existing infrastructure

  • Large IBM DB2 data warehouse (database and management/processing code)

  • Remote access technology – SAIL Gateway, based on VMware View

  • Policies and procedures

  • Hosting, power, cooling, IT staff to support infrastructure

• Expand Technical Platform

  • Double SAIL Gateway and increase power of each desktop

  • Add software, e.g. SAS & BI tools

  • Add SQL Server 2012 3 node AG Cluster

  • Add HADOOP cluster – big data

  • HyperV clusters

  • Additional 10 racks

• Additional management requirements

  • Now three database platforms

  • Selective delegated management and control

  • Multiple configuration and security models

Diagram (storage and nodes per platform): IBM DB2 Warehouse: 2 x SAN 24TB, nodes 1 to 4, backup DB2 head. SAIL VDI: 2 x SAN 60TB, 6 VDI hosts. NLP: SAN 21TB, 2 NLP nodes. SQL 2012 Availability Group: 3 x SQL 2012 Enterprise, each with DAS 21TB. Virtual tape. HADOOP: SAN, 4 nodes. HyperV: SAN 100TB, 4 nodes.

(8)

FARR – Data Appliance

• Goal: “Development of a hardware and software appliance for deployment into the NHS, Local Government and within SAIL to provide dataset collection, management, documentation and local linkage. These appliances will bring the capabilities previously only found in large data linkage centres to the organisation in which they are deployed”

• “A key outcome is to make documented datasets visible to the wider research community for inclusion in national projects, subject to information governance approval. These units are designed to be as low a unit cost as possible.”

• Funded to provide 15 appliances to NHS/Government free of charge

• What's the point – provide a carrot not a stick

• Give business benefit to an organisation by allowing them to create and manage datasets

• Provide locally linked datasets

• Create identifiable and anonymised views for their staff

• Provide documentation and validation of datasets

(9)

Data Appliance – simplistic viewpoint

Web Based Application

Empower end user to create and manage datasets.

No database expertise required. Lowering the technical bar.

Diagram: a Web Front End and FTP / ETL feed each DATASET, which provides:

• Access Control

• Data storage

• Documentation

• Schema Editor

• ER Diagram

• Metrics and Validation

• Artefacts / Files

(10)
(11)
(12)

Data “schema” automatically computed based on data contained in uploaded file
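As an illustration of computing a schema from uploaded data, the following is a minimal sketch only, not the appliance's actual code. It assumes a CSV upload with a header row and tries progressively wider types per column; the function names, the type names and the sample columns are all hypothetical.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "unknown"

    def all_parse(parser):
        # True only if the parser accepts every non-empty value.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def infer_schema(csv_text):
    """Map each column header to an inferred type name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [() for _ in header]
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Hypothetical upload: column names and values are invented for illustration.
sample = (
    "patient_id,dob,weight_kg,gp_code\n"
    "1,1980-02-14,72.5,W123\n"
    "2,1975-11-30,,W456\n"
)
schema = infer_schema(sample)
```

Empty cells are ignored when choosing a type, so a sparsely populated column still gets a useful type that a user could then correct in a schema editor.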

(13)

Publish based on permissions, configuration & capabilities

UKSeRP Key deliverable: Permission / Configuration / Capabilities

Diagram: the Web Front End and FTP / ETL feed each DATASET (Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files). Other components shown:

• Security, Configuration & Capability Model

• Publishing

• Local Data Catalogue

• Linkage & Matching

• Database Loader (File Splitter)

• Sharing & IG

• Data Quality and Metrics

• MS SQL, PostgreSQL

• External: MS SQL, IBM DB2, HADOOP, PostgreSQL, Trusted Third Party Linkage & Matching, Other Appliance, Regional / Global

(14)
(15)

Data Catalogue – Key Component

Additional points following previous sessions:

• Every Data Appliance (DA) carries a Data Catalogue (DC)

• A dataset (DS) can inherit DC entries from other datasets

• A DC is related to a Programme/Security domain

• DCs replicate to the Regional/Global DC

• Road map: DC used to define and create datasets
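One way to picture a dataset inheriting Data Catalogue entries from another dataset is sketched below. This is an assumption about how inheritance could behave (local documentation overrides inherited documentation), not a description of the appliance's data model; the class, its attributes and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CatalogueEntry:
    """One Data Catalogue entry for a dataset (hypothetical structure)."""
    dataset: str
    domain: str                                   # programme / security domain
    docs: dict = field(default_factory=dict)      # documentation held locally
    parent: Optional["CatalogueEntry"] = None     # entry inherited from, if any

    def resolved(self):
        """Merge inherited documentation with local entries (local wins)."""
        inherited = self.parent.resolved() if self.parent is not None else {}
        return {**inherited, **self.docs}


# Invented example values.
base = CatalogueEntry("admissions", "SAIL", {"owner": "Health Board A", "refresh": "monthly"})
derived = CatalogueEntry("admissions_2013", "SAIL", {"refresh": "frozen"}, parent=base)
merged = derived.resolved()
```

Here `merged` keeps the inherited `owner` while the derived dataset's own `refresh` value takes precedence.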

(16)

A Dataset

Specific version & Date

All sections can have files attached

Contact Request VIMO Theme / Type / Level Tags

(17)

A Dataset (cont.)

(18)
(19)

Data Appliance very modular and configurable

Physical Server running a set of virtualised servers configured and scaled appropriately for the environment.

Architecture is based on loosely coupled async message passing between code blocks (Presented at SHIP 2013)

UKSeRP

Presentation Data Appliance Presentation
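The loosely coupled, asynchronous message passing between code blocks can be sketched as follows. This is an illustration under stated assumptions rather than the appliance's implementation: in-process `asyncio` queues stand in for a message broker such as RabbitMQ, and the module names, event names and dataset names are hypothetical.

```python
import asyncio


async def loader(inbox, outbox):
    """Consume 'file.uploaded' messages and emit 'dataset.loaded' messages."""
    while True:
        msg = await inbox.get()
        if msg is None:                 # sentinel: pass it on and stop
            await outbox.put(None)
            break
        await outbox.put({"event": "dataset.loaded", "dataset": msg["dataset"]})


async def documenter(inbox, log):
    """Consume 'dataset.loaded' messages and record that documentation ran."""
    while True:
        msg = await inbox.get()
        if msg is None:
            break
        log.append(f"documented {msg['dataset']}")


async def main():
    uploads, loaded, log = asyncio.Queue(), asyncio.Queue(), []
    workers = [
        asyncio.create_task(loader(uploads, loaded)),
        asyncio.create_task(documenter(loaded, log)),
    ]
    for name in ("gp_events", "admissions"):
        await uploads.put({"event": "file.uploaded", "dataset": name})
    await uploads.put(None)             # tell the pipeline to shut down
    await asyncio.gather(*workers)
    return log


log = asyncio.run(main())
```

Neither module calls the other directly; each only knows the shape of the messages it reads and writes, which is what allows modules to be added, replaced or scaled independently.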

(20)

3 initial configurations – plug and play single cable

Small (Development / Demo)

• Single server with everything on it: 4 cores, 6 GB memory

• Web site, Workflow engine, Modules, SQL Express, MongoDB, RabbitMQ

Medium (Single Physical Server)

• Single HyperV server, multiple v-servers for different roles

• Dual 10 core CPU (40 virtual cores), 160 GB memory, 6 TB disk

• SQL Express replaced by SQL Server 2012 Standard

• 10 special versions have extra modules for CliniThink NLP

Large (Four Physical Clustered Servers)

• Dual HyperV servers: dual 10 core CPU (40 virtual cores), 160 GB memory, shared 24 TB disk

• Dual SQL Server 2012 Enterprise: dual 8 core, 96 GB memory, 22 x 300 GB local disk, SQL Server 2012 AG Cluster

Software: Custom software in C# .NET 4 / MVC 4. RabbitMQ, MongoDB, RavenDB for Large version, GoodSync FTP Replication

(21)

The Appliance is a disruptive technology

GAME CHANGER

• Challenge: fit everything that a large data linkage centre does into a single shrink-wrapped product

• Opportunity: look back on what we have done and question the design; a unique opportunity for reflection, rare in successful operational systems

(22)

Challenges

• UKSeRP: additional database systems – need to support Microsoft SQL Server, PostgreSQL and Cloudera HADOOP as well as IBM DB2 Warehouse

• Opportunity to become database system agnostic: remove vendor tie-in, allowing for options in the future

• Need a probabilistic matching engine to do data linkage

• Opportunity: our existing system is very slow and, due to its age, cannot be supported well by our trusted third party. It is also heavily biased towards gold standard matching

• Partnership with Curtin University, Australia, allowing us to embed their system in the appliance and replace the trusted third party system. An additional benefit is increasing the capabilities of our matching beyond gold standard matching. We look forward to a continued partnership to explore Bloom filter matching, automated matching tuning, and dynamic recompilation of matching relationships based on project needs

Special thanks to James Boyd & James Semmens, Curtin University

• Replace our residential matching and anonymisation system with Experian AddressBase, allowing integration with the matching engine and finer matching down to flats in multi-occupancy residences
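To give a feel for the Bloom filter matching mentioned above, here is a generic textbook-style sketch, not the Curtin University system or the SAIL matching engine. It encodes the character bigrams of a name into a set of bit positions and compares two encodings with the Dice coefficient. The filter size, number of hashes and unkeyed SHA-256 hashing are assumptions chosen for brevity; a real deployment would typically use keyed hashes.

```python
import hashlib


def bigrams(name):
    """Split a padded, lower-cased name into overlapping two-character tokens."""
    padded = f"_{name.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, size=256, num_hashes=4):
    """Return the set of bit positions switched on for a name."""
    bits = set()
    for gram in bigrams(name):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Invented example names.
smith = bloom_encode("Smith")
close = dice(smith, bloom_encode("Smyth"))   # shares most bigrams with "Smith"
far = dice(smith, bloom_encode("Jones"))     # shares no bigrams with "Smith"
```

Because similar names share most of their bigrams, their encodings overlap heavily, so approximate matching can be done on the encoded values without exchanging the names themselves.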

(23)

Linkage: Migrating from ALF to New ALF2 and RALF2

SAIL uses a trusted third party. An additional benefit is the inclusion of process monitoring and remote reporting. End users will be able to see where in the process their requests are – a much better user experience

(24)

Challenges

• Need delegable/devolved account management, for both the appliance and UKSeRP

• Authentication, Authorisation, Accounting

• Opportunity to develop a new security model which covers all aspects of the infrastructure, not just the database – allowing the model to be validated and used in many ways

• Modular and extensible provisioning system that takes the model and applies the intention

• Event-driven rather than time-based: better service to the user

• Modular user activation, e.g. SAIL DAA, HR system lookup

• Ability to support multiple two-factor authentication systems

(25)

Challenges

• Dataset documentation is patchy and vague

• Structured documentation is now mandatory for a dataset to be loaded into the appliance. Datasets can only be loaded into SAIL using the appliance.

• Ability to attach artefacts (supporting documents) to a dataset

• Ability to load data as reference / lookup data within a dataset

• Opportunity: Partnership with Manitoba Centre for Health Policy, Canada – special thanks to Mark Smith. Bring the automated data quality reporting to the appliance and turn the appliance back on the SAIL Databank to automatically collate and measure the quality metrics. We now have a database system agnostic data quality module. We look forward to a continued partnership to measure not just variable quality but also relationships between variables and other datasets, and to create a pluggable architecture for dataset-specific statistical analysis

• Opportunity: Create an automatic “Data Catalogue” based on the datasets' documentation, computed metrics and validation rules. Link into the IHDLN Working Group on Metadata
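A database system agnostic quality module could, as a rough sketch, work on plain rows rather than on any one database's SQL dialect. The example below is an assumption about what per-variable metrics might look like (completeness and distinct counts); it is not the Manitoba Centre for Health Policy tooling or the appliance's module, and the column names and values are invented.

```python
def variable_metrics(rows, missing=("", None)):
    """Compute row count, missing count, completeness and distinct count per variable."""
    metrics = {}
    columns = rows[0].keys() if rows else []
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v not in missing]
        metrics[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "completeness": len(present) / len(values),
            "distinct": len(set(present)),
        }
    return metrics


# Invented rows, as any database driver could return them as dictionaries.
rows = [
    {"person_key": "A1", "sex": "F", "area_code": "X001"},
    {"person_key": "A2", "sex": "", "area_code": "X001"},
    {"person_key": "A3", "sex": "M", "area_code": None},
    {"person_key": "A4", "sex": "F", "area_code": "X002"},
]
report = variable_metrics(rows)
```

Metrics computed this way could feed both a quality report and the catalogue entry for the dataset, since they depend only on the rows and not on where the rows are stored.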

(26)

Challenges

• Data movement and routing of both data and metadata

• Automated splitting of data files

• Trusted third parties

• Data refreshes

• Organisations outside NHS

• Sharing / Subscriptions
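The automated splitting of data files listed above might look something like the sketch below. It assumes the purpose of the split is to separate identifying fields (for a trusted third party) from the remaining content, with a meaningless per-record key to rejoin them later; that purpose, the field names and the function names are assumptions, not details given in the slides.

```python
import uuid

# Assumed names of identifying fields; a real deployment would configure these.
IDENTIFYING = {"nhs_number", "name", "dob", "postcode"}


def split_record(record, identifying=IDENTIFYING):
    """Split one record into an identifying part and a content part sharing a key."""
    key = uuid.uuid4().hex  # random per-record join key, carries no meaning
    demographics = {k: v for k, v in record.items() if k in identifying}
    content = {k: v for k, v in record.items() if k not in identifying}
    demographics["record_key"] = key
    content["record_key"] = key
    return demographics, content


def split_file(records):
    """Split every record; return (demographic_rows, content_rows)."""
    pairs = [split_record(r) for r in records]
    return [d for d, _ in pairs], [c for _, c in pairs]


# Invented example record.
records = [
    {"nhs_number": "0000000000", "name": "A Person", "diagnosis": "J45", "admitted": "2013-05-01"},
]
demographic_rows, content_rows = split_file(records)
```

Each output can then be routed separately, and neither part on its own holds both the identity and the content of a record.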

(27)

Why such a disruptive technology

(6 months to build !!)

The system was fine and everybody was happy.

As we started designing, it became obvious how the system should be reconfigured with the data appliance as the central component.

Diagram labels: IBM DB2, Remote Access, Security / IG, SAIL DATABANK, Additional Data, SAIL Technical

(28)

Why such a disruptive technology

(6 months to build !!)

SAIL Databank is becoming an instance of UKSeRP and fully dependent on the Data Appliance

Diagram labels: Other Major Programme; IBM DB2; SAIL DATABANK; HADOOP; MS SQL; NLP; Data Appliance; Data Catalogue; Versioning; Data Management and Loading; Security / IG; Auto Documentor; Remote Access / VDI; BI / Compute; File Splitting & Separation; Data transportation; Additional Data; Data Metrics and Quality reporting

(29)

Why such a disruptive technology

(6 months to build !!)

Deployment of SAIL Databank Satellites / SAIL Mini

• 7 NHS trusts of Wales

• 4 NHS trusts of Bristol (England)

• 1 NHS Trust of North Devon

Major upgrades to our trusted third party (NWIS)

SAIL Databank major technology upgrade

(30)

Data Appliance

Simon Ellwood-Thompson SWANSEA UNIVERSITY
