19th January 2015
BDA Technologies & Selected Case Studies
Ettikan Kandasamy Karuppiah (Ph.D),
Principal Researcher & Director of Accelerative Technologies Lab, MIMOS Berhad
SEMINAR INTERNET COMPUTING TECHNOLOGY “Theme: Delivering Values From Hyperconnectivities”
Big data is defined by the high volume, velocity, variety, veracity and value of data generated every second, minute, hour and day by devices, humans, etc.
Turning big data into Value: ECONOMIC BENEFITS | GOVERNMENT BENEFITS | SOCIETAL BENEFITS
VOLUME: Growing data. 90% of world’s data generated over last 2 years
VELOCITY: Increasing data. 175,000 tweets per second
VARIETY: Broadening data. 80% of the world’s data is unstructured (text, geospatial, audio, video)
VERACITY: Big Data technology allows us to establish the quality and accuracy of big data sources, especially in unstructured data
Big Data Computing in ICT Sector
The Malaysian ICT services sub-sector has huge growth potential, with a projected share of 35% in the nation’s Digital Economy in 2020...
Requires Transformative Platform
Source: MDeC, as taken from APeJ Big Data MaturityScape Assessment 2013 by IDC
Software Solutions and Support is the Key GDP Contributor
Business Value
MIMOS BigData Technologies R&D
• Data Modeling & Visualization for PDRM Workforce Planning & GPGPU Data Security Library
• Establish work on General Purpose Graphics Processing Unit for text manipulation, Hadoop Trainings
• MultiCore Java Compiler
• Acquire Train
• Conducted Workshop, Hadoop Programming training to Malaysian Research Community
• Collaboration R&D
• MiAccLib Cleansing
• MiAccLib Finance
• Data Cleansing Engine for PERKESO & Data Warehouse for PERKESO
• MiAccLib Algo/Map
• nVidia COE for GPGPU Established
• MiAccLib Crypto
• Sentiment Analysis Model & Data Modeling & Data Warehouse for PIK MOH & GPGPU Video Data Analytics Library R&D
• Data Encryption/Decryption for National Data Protection
• MiAccLib Video
• GPU Accelerated Libraries for Data Cleansing & Financial Risk Modeling
• MiAccLib BigData
• Accelerated Libraries for Database Accelerator Library (Galactica)
2014 MIMOS Berhad. All Rights Reserved.
GE’13 Electoral Roll Analysis with Hadoop & GPU
MiAccLib Cleansing
ESRI Inc/US MoU Established
Acquire Train Intel Malaysia
/US MoU /US/Europe MoU AMD Malaysia
High Risk Profiling, Illicit, Taxable & Drugs Detection (PoC)
MiAccLib Image
RM10 -> Foundation & Early Adaptation for Heterogeneous Computing
RM11 -> Maturation & Progressive Deployment of Scalable Heterogeneous Computing
Assisting Both Government & Private Sector Needs
Private Sector to Go Global
National Public Sector
Source: MDeC
DECISIONS REQUESTED
FCC is requested to:
1. Take note of data science upskilling for civil servants
2. Take note of MAMPU developing the Government Open Data framework by 2015
3. Endorse the DG Lab on BDA to identify use cases and pilot projects that address societal wellbeing
4. Take note of MIMOS defining and developing the Big Data technology platform for Government by 2015.
5. Mandate opening up of all relevant data (Open/Non-Open) to the DG Lab on BDA for the pilot projects
Data classification levels: Rahsia Besar (Top Secret) | Rahsia (Secret) | Sulit (Confidential) | Terhad (Restricted) | Terbuka (Open)
Opening Up Non-Sensitive Government Data
Policy for all government agencies to open up data categorised under Terbuka (Open)
– E.g. non-sensitive data such as meteorology, transport timetables and pricing of essential goods, based on Open Data criteria
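The policy above can be sketched as a simple filter over a dataset catalogue. This is only a minimal illustration: the `Classification` enum, the `open_datasets` helper and the catalogue entries are hypothetical names invented here, not part of any MAMPU or MIMOS system.

```python
from enum import Enum


class Classification(Enum):
    """Classification levels named on the slide, most to least restricted."""
    RAHSIA_BESAR = "Rahsia Besar"
    RAHSIA = "Rahsia"
    SULIT = "Sulit"
    TERHAD = "Terhad"
    TERBUKA = "Terbuka"


def open_datasets(catalogue):
    """Return the names of datasets categorised as Terbuka."""
    return [name for name, level in catalogue if level is Classification.TERBUKA]


# Hypothetical catalogue entries, for illustration only.
catalogue = [
    ("meteorology", Classification.TERBUKA),
    ("transport_timetables", Classification.TERBUKA),
    ("essential_goods_prices", Classification.TERBUKA),
    ("internal_audit_notes", Classification.SULIT),
]

print(open_datasets(catalogue))
```

Keeping the classification as an explicit field on every dataset means the open-data rule stays a one-line check that each agency can audit.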
+
Developing BDA Open Innovation Platform
An open-innovation platform between Government, businesses and the Rakyat (the public) to improve e-participation and user satisfaction. Prioritisation is through the development of high-impact, low-cost, demand-driven life-event solutions
POCs, pilots & apps
Secure environment (sandbox) for Government Data
BDA DG (Digital Government) LAB: Expertise - Community Data - Government Data - Project Sponsor
Sector-specific use cases /life-events: eg. Welfare, Education, Healthcare, Transportation
BDA Technology Platform
DATA OUTCOMES
Open Data
DATA Community Government
Research & Development on KEY Data Extraction, Processing & Analytics Components
i. National Data Sovereignty
ii. Trusted Data
iii. Secured Data
Localized Entity (i.e. MIMOS, Cybersecurity)
Key Values
Data Visualization | Data Staging | Cleansing | Harmonisation | Anonymisation | Data Model & Analytics | Security | Infrastructure Management | Data DB Store | Data Extraction | Traceability
Machine Learning - Malaysian Context (BM, English, Chinese, Tamil) | Accelerated Computing | Secured Cloud Services | Visualization - Malaysian Perspective
Mi-Cloud Mi-Harmony Mi-UAP Mi-Mobile Mi-MOCHA Mi-Helio Mi-Morphe Mi-Harvester Mi-CLIP
Mi-Doc Mi-Scrambler Mi-Portal
Mi-BIS Mi-ARMC
Mi-Trust
Mi-SP (Video Analytics) Mi-STP Mi-Target Mi-HPDW Mi-AccLytics Mi-DSS Mi-AccLib Mi-Trace Mi-ROSS Mi-DW Mi-Market Galactica Customization
3rd Party Systems & Hardware
Data Security
Data Extraction
Data Staging
Data DB Store
Data Visualization
Data Model & Analytics
Security | Data Management | Infrastructure Management | Traceability
Cleansing | Harmonisation | Anonymisation
Data Source: Structured + Open Linked Data | Unstructured
Applications
Extracting Value from Data
Data Sharing | Data Visualization | Scrambled database & Datamarts | Granular Primary Database | Data Anonymisation | Published Data Marts | Harmonisation | Data Harmonisation | Harmonisation Terminologies | Cleansing | Data Cleansing | Data Correction | Staging Data | Data Harvesting | Unstructured Data Sources | Structured Data Sources
Virtualized Platform & Integrity Manager: Mi-CLOUD + Mi-Mocha
Unstructured Data Collector: Mi-Clip
Data Harmonisation: Mi-Harmony + Mi-Semantics
Detect, Correction, Exception: Mi-Morphe + Mi-AccLib
Data Anonymisation: Mi-Scramble + Mi-Crypto + MiAccLib
Authentication & Authorization: Mi-UAP, Mi-ARMC
Data Warehouse Platform: Mi-Galactica, Mi-AccConnect, Mi-HPDW
Data Modeling
Data Statistics: Mi-AccStat
Sentiment Analytics: Mi-Intelligence; Mi-NLP
Data Visualization: Mi-HELIO; Mi-BIS
Social Network Analytics: Mi-Visualitic
Knowledge Harvester (LOD): Mi-Harvester
Data Analytics: Mi-Portal | Mi-HPDW | Mi-Target
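The staging steps named on this slide (cleansing, harmonisation, anonymisation) can be pictured as a chain of small transformations. The sketch below is plain Python and does not use or imitate any Mi-Morphe, Mi-Harmony or Mi-Scramble API; the field names, the terminology map and the salt are hypothetical. Note that a salted hash gives pseudonymisation only, which is weaker than full anonymisation.

```python
import hashlib

# Hypothetical terminology map used for harmonisation (illustrative only).
LOCATION_TERMS = {
    "kl": "Kuala Lumpur",
    "k. lumpur": "Kuala Lumpur",
    "kuala lumpur": "Kuala Lumpur",
}


def cleanse(record):
    """Trim whitespace and turn empty strings into None."""
    cleaned = {}
    for key, value in record.items():
        value = value.strip() if isinstance(value, str) else value
        cleaned[key] = value if value != "" else None
    return cleaned


def harmonise(record):
    """Map free-text location variants onto one canonical term."""
    out = dict(record)
    location = out.get("location")
    if location is not None:
        out["location"] = LOCATION_TERMS.get(location.lower(), location)
    return out


def anonymise(record, salt):
    """Replace the direct identifier with a salted SHA-256 pseudonym."""
    out = dict(record)
    identifier = out.pop("id_number")
    digest = hashlib.sha256((salt + str(identifier)).encode("utf-8")).hexdigest()
    out["pseudonym"] = digest[:16]
    return out


def stage(records, salt):
    """Apply cleansing, then harmonisation, then anonymisation to each record."""
    return [anonymise(harmonise(cleanse(r)), salt) for r in records]


raw = [
    {"id_number": "A001", "location": "  KL ", "remarks": ""},
    {"id_number": "A002", "location": "Kuala Lumpur", "remarks": "ok"},
]
staged = stage(raw, salt="demo-salt")
for row in staged:
    print(row)
```

Running cleansing before harmonisation lets the terminology lookup work on trimmed values, and anonymising last keeps the identifier available for any earlier matching step.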
Mi-Cloud Mi-Harmony Mi-UAP Mi-Mobile Mi-MOCHA Mi-Helio Mi-Morphe Mi-Harvester Mi-CLIP
Mi-Doc Mi-Scrambler Mi-Portal
Mi-BIS Mi-ARMC
Mi-Trust
Mi-SP (Video Analytics) Mi-STP Mi-Target Mi-HPDW Mi-AccLytics Mi-DSS Mi-AccLib Mi-Trace Mi-ROSS Mi-DW Mi-Market Galactica
New Platforms & Revisions
Technology Challenges Ahead (11th Malaysia Plan)
NEWER Channels of Consumption (e.g. omni-channel data market)
NEWER Sources of Data (e.g. high-speed streams)
NEWER Methods of Visualization (e.g. multi-dimensional view)
NEWER Paradigms on Computing (e.g. Docker)
Technology Pull | Technology Push
IoA (Internet of Anything) | II (Industrial Internet) | IoE (Internet of Everything) | IoT (Internet of Things)
Software Defined Network | Big Data Processing | Mobile Systems | Wearables | Cloud Computing | Cyber-biological systems | Cyber-physical systems | Internet of Humans
Open Platform & BDA Middleware Architecture
Data Source: Structured, Semi-structured & Un-structured Data Sources | Open Linked Data | Web & Social Media | RDBMS | Files
Data Extraction: Flume | Sqoop | Kafka | Mi-Clip | Mi-Harvester | Mi-Morphe
Data Staging: Data Cleansing (Mi-Morphe, Mi-AccLib) | Data Anonymisation (Mi-Scramble, Mi-Crypto, Mi-AccLib) | Data Harmonization (Mi-Harmony)
Data Model: Mi-HPDW
Data Storage: RDBMS | Galactica FS | HDFS, NoSQL | Galactica | Hadoop | Data warehouse / Data mart | Mi-HPDW STORAGE
Infrastructure: Mi-Cloud | Mi-Mocha
Galactica | YARN | Mi-AccConnect | Pig | Hive | Impala | Shark | Galactica Connector | Apache Drill | Spark/Shark | Hue | Cloudera Search & Solr | RDF Graph DB | Mi-Intelligence
Data Analytics Tools (Machine Learning): R | Mahout | ML-Lib (Spark) | Mi-NLP | Mi-AccStat
Data Visualisation: Mi-Helio | Mi-BIS | Mi-Portal | Mi-Visualitics
Data Security: Mi-UAP | Mi-Trust | Sentry
Data Management: Cloudera Manager/Falcon | ZooKeeper | Oozie
Mi-HPDW | Mi-Target | GIS
Legend: MIMOS Solution | 3rd Party Solution
MIMOS BigData Stack With Reference to Hadoop Stack
Data Type (Data Sources Type): RDBMS | Streaming (Twitter, logs, etc.) | NoSQL
Stream: Spark | Kafka | Spring XD & Storm
Search: Cloudera Search & Solr
Application Program Interface: Thrift | REST | Java API | Avro
Management: YARN (resource management) | Big Data Orchestration Engine/Layer | ZooKeeper (configuration and synchronization) | Oozie (workflow scheduler) | Cloudera Manager | Management for Lustre
Storage: HDFS | HPDW-Storage | Galactica FS | NoSQL (HBase) | Distributed Database (Cassandra) | RDBMS (PostgreSQL, MySQL)
Visualization: Mi-Helio | Mi-Portal | Mi-BIS (Mi-AccConnect) | 3rd Party Apps
Batch Query: MapReduce v2 | Pig | Hive
Real Time Query: Mi-BIS with Impala through Mi-AccConnect | Hue | Galactica | Apache Drill | Spark/Shark | HPDW-BigData DB
Machine Learning: Mi-BIS (Weka) | Accstats (R and Cloudera C++) | ML-Lib (Spark) | Revolution R, Weka
Processing: Mi-Morphe | Morphlines | Mi-AccLib MapReduce v2 (Accelerated ETL) | HPDW Data Model Plugin (for Mi-Morphe v3/Pentaho)
Analytics: Simulator | Planning Tool | Predictive, Prescriptive | Prediction Algorithm | Mi-BIS (Mi-Accstats) | Mi-BIS (Data Mining) | Revolution R (3rd party) | GIS (3rd party)
Security and Authentication: Sentry | Mi-UAP | Mi-ARMC | Mi-Trust
Data Management: Sqoop | Flume
Multi & Many Cores Processors (CPU + GPU)
Legend: Complete 3rd Party | 3rd Party & MIMOS Offering | MIMOS Technologies
Proof of Concepts
Selected Use Cases
Proof of Concepts
-Mixed Scenario-
Challenges to be Addressed
During Initial Roll-Outs
Data Challenges (Stage 1)
• Data is stored in partial & distributed locations
• Data exists in both digital and non-digital formats; some is still paper-based
• Incomplete data sets (quality issues)
• Cleanliness of the data
– Missing values, Random, Non-Random, CR, Noise
– Cleaning while maintaining integrity & ‘value’
• Extracting the ‘features’
• Data in plural languages (at least English & Malay)
• Structured data has longer historical value to be acquired
– Data storage media & format for extraction and usage
• How to authenticate the key values? Where is the reference point?
• As for unstructured data (e.g. social media), current technology is adequate to support the pre-processing, analytics…
– With some local challenges
• Who is the data owner? How to ensure the security level of the data for sharing? PDP compliance confusion…
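The point about cleaning "while maintaining integrity" can be made concrete with a small sketch: first profile how much is missing, then impute in a way that leaves a trace of which values were filled in. This is one possible approach, not the method used in the projects described; the column names and sample rows are hypothetical, and median imputation is chosen only for simplicity.

```python
from statistics import median


def missing_profile(rows, columns):
    """Count missing (None) values per column."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def impute_with_flag(rows, column):
    """Fill missing numeric values with the column median and flag them.

    Assumes the column has at least one observed value.
    """
    observed = [row[column] for row in rows if row.get(column) is not None]
    fill = median(observed)
    result = []
    for row in rows:
        new_row = dict(row)  # leave the original rows untouched
        was_missing = new_row.get(column) is None
        if was_missing:
            new_row[column] = fill
        new_row[column + "_imputed"] = was_missing
        result.append(new_row)
    return result


# Hypothetical sample data, for illustration only.
rows = [
    {"age": 30, "income": 2500},
    {"age": None, "income": 3100},
    {"age": 50, "income": None},
    {"age": 40, "income": 2800},
]
print(missing_profile(rows, ["age", "income"]))
cleaned = impute_with_flag(rows, "age")
print(cleaned[1])
```

The extra `_imputed` flag lets later analytics exclude or down-weight filled values, so the cleaning step does not silently change what the data says.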
Analytics Challenges (Stage 2)
• Tools are available, but the right approach is still critical for evaluation
• Which are the best/right algorithms to be used?
• Can you identify the right ‘domain expert’ within the organization?
• Who are the local ‘domain experts’ to be consulted for the methods/algorithms selection?
– A specific gov. organization may not have a data scientist, but how to form an ‘analytics team’ (external + internal)?
• What exactly are the data owners’ ‘business needs’?
– Why do they need to do this?
– Headache for them… best to leave the data to ‘rest in peace’!!
• Which data is to be included, which excluded, and what is to be ‘anonymized’?
– Concern of ‘meaning/trend’ extraction
• Plurality of languages & interpretation accuracy
– ‘Semantification’ of the language-specific analytics
• Bottlenecks need to be identified, and an accelerated approach is required for the specific processing
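Identifying the bottleneck before choosing what to accelerate can be as simple as timing each stage of the processing chain. The sketch below is a generic illustration; the stage functions are hypothetical stand-ins (the `analyse` stage merely sleeps to simulate heavy computation) and do not represent any MIMOS component.

```python
import time


def profile_stages(stages, data):
    """Run stages in order and record wall-clock seconds for each."""
    timings = {}
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def bottleneck(timings):
    """Return the name of the slowest stage."""
    return max(timings, key=timings.get)


# Hypothetical stages standing in for real extraction/cleansing/analytics work.
def extract(n):
    return list(range(n))


def cleanse(values):
    return [v for v in values if v % 2 == 0]


def analyse(values):
    time.sleep(0.05)  # simulated heavy computation
    return sum(values)


stages = [("extract", extract), ("cleanse", cleanse), ("analyse", analyse)]
result, timings = profile_stages(stages, 1000)
print(result, bottleneck(timings))
```

Only the stage reported as slowest is a candidate for acceleration (for example on GPU); speeding up the others would barely change the end-to-end time.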
Results Challenges (Stage 3)
• Visualization of the results in a simple, ‘actionable’ and ‘communicable’ form
• How to handle continuously changing analytics (and the results) due to
– New data inclusion
– New ‘domain expert’ inclusion
– New additional factors to be considered
• Who validates the results?
• How to translate results to value for the (gov) organization?
• How to translate the ‘value’ to actions?
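On the question of results that keep changing as new data is included, one common technique is to maintain summary statistics incrementally so that each new record updates the result without reprocessing the history. The sketch below uses Welford's online algorithm for mean and variance; it is offered as an illustration of the idea, not as the approach taken in the projects described, and the sample values are made up.

```python
class RunningStats:
    """Online mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for value in [10.0, 12.0, 14.0]:
    stats.update(value)
print(stats.mean, stats.variance)

# A new record arrives later; only the running state is touched.
stats.update(20.0)
print(stats.mean, stats.variance)
```

The same pattern does not cover changes caused by new domain experts or new factors, which alter the model itself and still require re-validation.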