Major financial partner
Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics
BY FRANÇOYS LABONTÉ, GENERAL MANAGER, JUNE 16, 2015
ABOUT CRIM
• Applied research centre in IT
• Dual mission:
– Provide expertise in IT to support enterprises and organisations in developing innovative products and solutions
– Contribute to the creation of new knowledge through scientific activities and publications
THREE MAJOR AREAS OF EXPERTISE
1. INTERACTION AND HUMAN-SYSTEMS INTERFACES
• Voice, movement, emotions
• Augmented reality
• User activity-related aspects
2. ADVANCED DATA ANALYTICS
• Analysis and processing of video, imagery, audio, text
• Semantics, natural language processing
• Geospatial imaging
3. ADVANCED ARCHITECTURES AND TECHNOLOGIES FOR DEVELOPMENT AND TESTING
• Client / cloud / mobile architectural approaches
• Test modeling and automation
• Code generation, model inference
• Development, test and technological management methodologies
THE BIG DATA HYPE
• From Gartner:
• At CRIM, for many years:
– Volume: we have been dealing with large data sets: videos, satellite imagery, large text corpora
– Variety: we have been processing multi-modal data sets (text, images, audio, video)
– Velocity: we have been working on analyzing continuous data streams (surveillance)
– Visualisation: we have been investigating and developing human-machine interfaces
– Value (actionable items): we have been developing “intelligent” decision-support systems
• SO WHAT IS IT ALL ABOUT?
BIG DATA TECHNOLOGIES
• Open up new possibilities to solve complex problems in much simpler ways than before
– Hadoop and other related technologies:
No limitation on computing resources
No need to worry about “scaling up”
– NoSQL and other related technologies:
No need to know in advance the relations between the elements in a database
Capacity to combine “as needed” various heterogeneous data sources
– Dynamic data processing (streams):
Moving away from the “batch processing” approach
Capacity to develop more adaptive and reactive systems
Emergence of machine-to-machine / connected objects / Internet of Things applications
– Data centers and cloud technologies:
Data storage and file management are simplified
• Promising technologies that do not yet offer simple, stable and mature solutions.
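The move away from batch processing toward streams can be illustrated with a minimal sketch in plain Python. It is not tied to any specific Big Data framework; the function names and the sensor readings are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


def batch_mean(readings: list[float]) -> float:
    """Batch approach: the whole data set must be available before computing."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming approach: update the result incrementally as each value arrives."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


if __name__ == "__main__":
    sensor_values = [2.0, 4.0, 6.0, 8.0]  # hypothetical sensor readings
    print(batch_mean(sensor_values))            # one answer, at the end
    print(list(streaming_mean(sensor_values)))  # one answer per incoming value
```

The streaming version keeps only a counter and the current mean, so a system built this way can react to each new value instead of waiting for a complete data set.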
CRIM AND BIG DATA
• To continue developing our expertise by leveraging Big Data technologies in advanced analytics, but also in human-systems interactions and in architectures and advanced technologies for software development and testing
– New ways to think about complex problems
– Emphasis on problems involving unstructured data
– Empirical knowledge of Big Data technologies to accompany enterprises and organisations
– Application-driven, with concrete use cases
– Looking for the 5th V: Value
• We prefer talking about SMART DATA
• Multidisciplinary approach:
– Data science
– Advanced analytics / machine learning
– Visualisation and interaction
– Business analysts
– Governance and data quality
– Product management
– Data governance
– Architecture and software development
SMART DATA: ADVANCED ANALYTICS
Figure: four levels of analytics, plotted by value against difficulty
• Descriptive: What happened?
• Diagnostic: Why?
• Predictive: What will happen?
• Prescriptive: How to make it happen?
THE A2DI PROJECT (ADVANCED ANALYTICS FOR DATA INTELLIGENCE)
– Goals
Develop a practical expertise with Big Data Technologies (analytics, interaction, visualisation)
Consolidate CRIM’s advanced analytics components
Build concrete use cases that can serve as an interactive technology showcase (“vitrine technologique”)
Foster multidisciplinary projects
Develop new collaborations and partnerships
THE A2DI INFRASTRUCTURE
Data collection and preparation
Storage
Data enrichment
Metadata
Analytics, data mining, machine learning, inference, fusion, statistical, heuristics
Visualisation
Decision support
Configurable environment: specific deployments for selected use cases
OpenStack
Data analytics tools
Hadoop / Spark
Partners and external environment
DATA SET FROM OCEAN NETWORKS CANADA
Streaming Data
Text Data
Multi-dimensional
Time Series
Geospatial
Video & Image
Audio
Relational
Social Network
RT Monitoring
Manual annotations, log files
Spectrogram, echo sounder, hydrophone
Vertical profiling system, sonar
Navigation information, bathymetry, maps
Fixed cameras and cameras mounted on a rover
Narrative description Ontologies
Video & audio streams
USE-CASE # 1
– Keyword detection from the audio track of submarine maintenance videos
Approximately 300 hours to process
Specialized vocabulary in biology and submarine navigation
– Apache Spark
High-level library for the processing of very large data sets
Developed at AMPLab (UC Berkeley) in 2009
Generalized MapReduce paradigm: 30x faster, with low latency for streaming applications
Distributed in-memory computing
Now more popular than Hadoop
Native integration with: Hadoop, Elasticsearch, Cassandra, RDBMS, Play!, etc.
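The map/reduce pattern behind this use case can be sketched in plain Python. This is an illustration only, not the project's pipeline and not Spark code: it assumes the audio has already been transcribed into text segments, and the segments, the keyword list and all function names are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical transcript segments and specialized vocabulary (illustration only).
SEGMENTS = [
    "rover approaching the hydrophone mount",
    "checking the hydrophone cable and the connector",
    "sonar is clear, rover moving on",
]
KEYWORDS = {"rover", "hydrophone", "sonar"}


def map_segment(segment: str) -> Counter:
    """Map step: count keyword occurrences within a single segment."""
    words = (w.strip(".,;:!?").lower() for w in segment.split())
    return Counter(w for w in words if w in KEYWORDS)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_keywords(segments: list[str]) -> Counter:
    """Apply the map step to every segment, then fold the partial results."""
    return reduce(reduce_counts, map(map_segment, segments), Counter())


if __name__ == "__main__":
    print(count_keywords(SEGMENTS))
```

Because each segment is mapped independently and the reduce step is associative, the same logic can in principle be distributed across many workers, which is what makes the pattern attractive for roughly 300 hours of material.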
ELASTICSEARCH
– Distributed search engine
NoSQL document database
High-availability
Linear horizontal scalability
– Widely used in industry:
– Features:
Full-text advanced search (Lucene)
Geospatial queries
Approximate string matching
Real-time analytics
– Native integration with: Hadoop (HDFS), Spark, etc.
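Approximate string matching, listed among the features above, can be illustrated with Python's standard `difflib` module. This sketch only stands in for the idea: it is not how Elasticsearch implements fuzzy search, and the vocabulary and the `suggest` helper are hypothetical.

```python
import difflib

# Hypothetical specialized vocabulary (illustration only).
VOCABULARY = ["hydrophone", "bathymetry", "echo sounder", "spectrogram", "sonar"]


def suggest(term: str, cutoff: float = 0.8) -> list[str]:
    """Return vocabulary entries whose similarity to `term` is at least `cutoff`."""
    return difflib.get_close_matches(term.lower(), VOCABULARY, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(suggest("hydrofone"))   # misspelled query
    print(suggest("bathymetri"))  # misspelled query
    print(suggest("xyz"))         # nothing similar in the vocabulary
```

Tolerance to misspellings matters for this kind of data, since automatic transcription of a specialized vocabulary is likely to produce near-miss spellings.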
USE-CASE # 2
– Integration of geolocation data
Keyword positions
Rover position
Satellite imagery
Sonar location
– GeoMesa: spatio-temporal layer for Accumulo (NoSQL)
GeoMesa + Accumulo = big data + PostGIS + PostgreSQL
Storage, querying and processing of spatio-temporal vector big data
OGC standards support: WMS, WFS, WPS
– Use-cases:
Density heatmaps
Batch or streaming analytics
Spatio-temporal predictive analytics
– Native integration with:
Spark (analytics and clustering)
GeoServer (web mapping) and OpenLayers (frontend)
GeoTrellis for raster geospatial data (satellite imagery, etc.)
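The density heatmap use case combines a time filter with spatial binning. A minimal plain-Python sketch of that idea follows. It does not use GeoMesa or Accumulo; the `Observation` type, the `density_grid` function, the coordinates and the time unit are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    lon: float
    lat: float
    t: float  # seconds since the start of a dive (hypothetical unit)


def density_grid(observations, cell_size: float, t_start: float, t_end: float) -> Counter:
    """Count observations per grid cell for timestamps in [t_start, t_end)."""
    counts: Counter = Counter()
    for obs in observations:
        if t_start <= obs.t < t_end:  # temporal filter
            cell = (math.floor(obs.lon / cell_size),  # spatial binning
                    math.floor(obs.lat / cell_size))
            counts[cell] += 1
    return counts


if __name__ == "__main__":
    points = [  # hypothetical keyword detections
        Observation(0.2, 0.7, 10.0),
        Observation(0.9, 0.1, 20.0),
        Observation(1.5, 0.5, 30.0),
        Observation(0.5, 0.5, 100.0),  # outside the time window below
    ]
    print(density_grid(points, cell_size=1.0, t_start=0.0, t_end=60.0))
```

Each cell count can then be mapped to a colour to render the heatmap; a spatio-temporal index serves the same kind of query without scanning every observation.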
USE-CASE # 3
– Keyword search enhancement with ontologies from Web resources
Natural language processing
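One simple way to enhance keyword search with an ontology is query expansion: the searched term is extended with related terms before matching. The sketch below assumes a tiny hand-written ontology; the terms, the transcripts and the function names are hypothetical, and this is not presented as the method used in the project.

```python
# Hypothetical mini-ontology mapping a concept to more specific terms.
ONTOLOGY = {
    "cephalopod": ["octopus", "squid"],
}

# Hypothetical transcripts (illustration only).
TRANSCRIPTS = [
    "an octopus is hiding under the frame",
    "adjusting the camera angle",
    "a squid passes in front of the rover",
]


def expand_query(keyword: str, ontology: dict[str, list[str]]) -> list[str]:
    """Return the keyword followed by its related terms from the ontology."""
    key = keyword.lower()
    terms = [key]
    for related in ontology.get(key, []):
        if related not in terms:
            terms.append(related)
    return terms


def search(keyword: str, transcripts: list[str], ontology: dict[str, list[str]]) -> list[int]:
    """Return indices of transcripts containing the keyword or a related term."""
    terms = expand_query(keyword, ontology)
    return [i for i, text in enumerate(transcripts)
            if any(term in text.lower() for term in terms)]


if __name__ == "__main__":
    print(search("cephalopod", TRANSCRIPTS, {}))        # literal search only
    print(search("cephalopod", TRANSCRIPTS, ONTOLOGY))  # with expansion
```

Without the ontology the literal query finds nothing, while the expanded query retrieves the segments that mention specific animals.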
PLATFORM DEMONSTRATION
ANOTHER BIG DATA PROJECT
• VESTA
– Video Evaluation System for Task Analysis
• LEADS research network: Learning Environments Across Disciplines
– Education sciences: How do students learn?
– 6 universities and 11 partner organizations (Canada)
– 13 universities and 4 partner organizations (North America, Europe, Australia)
– Led by Dr Susanne Lajoie (McGill University)
LEADS CONTEXT
• Video analysis of students in learning situations
– Video content: typically one student, many tasks
– Audio content: Think aloud, reading, conversation, answering questions
– Video
– Local sources
– Access rights management
– Manual transcripts
– Manual coding
– Data sharing
THE VESTA PLATFORM FEATURES
• A Web-based platform relying on some of the most recent HTML5 features
• 5 semi-automated annotation services
– Speaker identification
– Transcription
– Audio-text correspondence
– Transition detection (video)
– Face detection
• 3 utility services
– Annotation storage
– Load balancing / task dispatching
– Multimedia file storage
• Access rights management taking into account ethics approval for research protocols
THE VESTA PLATFORM
CONCLUSIONS
• Big Data offers huge potential, largely underexploited at this time
• As with many fundamental changes, expect a long journey
• Establish an ambitious vision, then take modest first steps that deliver tangible value
• There is no “one size fits all” approach; it must be tailored to the specific use case
• The question is not “Too Big or not Too Big”; what matters is data intelligence (“Smart Data”) that brings concrete value to the organisation
• Big Data technologies can also be used in other contexts
• On top of technological challenges, human challenges will dominate and determine the success or failure of specific initiatives.
PITFALLS
“A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”
Herbert Simon: Designing Organizations for an Information-Rich World; 1