Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics

(1)

Major financial partner

Leveraging Big Data Technologies to Support Research in Unstructured Data Analytics

BY FRANÇOYS LABONTÉ, GENERAL MANAGER

JUNE 16, 2015

(2)

ABOUT CRIM

Applied research centre in IT

Dual mission:

Provide expertise in IT to support enterprises and organisations in developing innovative products and solutions

Contribute to the creation of new knowledge through scientific activities and publications

(3)

THREE MAJOR AREAS OF EXPERTISE

1. INTERACTION AND HUMAN-SYSTEMS INTERFACES

• Voice, movement, emotions

• Augmented reality

• User activity-related aspects

2. ADVANCED DATA ANALYTICS

• Analysis and processing of video, imagery, audio, text

• Semantics, natural language processing

• Geospatial imaging

3. ADVANCED ARCHITECTURES AND TECHNOLOGIES FOR DEVELOPMENT AND TESTING

• Client / cloud / mobile architectural approaches

• Test modeling and automation

• Code generation, model inference

• Development, test and technological management methodologies

(4)

THE BIG DATA HYPE

• From Gartner:

• At CRIM, for many years:

– Volume: we have been dealing with large data sets: videos, satellite imagery, large text corpora

– Variety: we have been processing multi-modal data sets (text, images, audio, video)

– Velocity: we have been working on analyzing continuous data streams (surveillance)

– Visualisation: we have been investigating and developing human-machine interfaces

– Value (actionable items): we have been developing “intelligent” decision-support systems

• SO WHAT IS IT ALL ABOUT?

(5)

BIG DATA TECHNOLOGIES

Open up new possibilities to solve complex problems in much simpler ways than before

– Hadoop and other related technologies:

No limitation on computing resources

No need to worry about “scaling up”

– NoSQL and other related technologies:

No need to know in advance the relations between the elements in a database

Capacity to combine “as needed” various heterogeneous data sources

– Dynamic data processing (streams):

Moving away from the “batch processing” approach

Capacity to develop more adaptive and reactive systems

Emergence of machine-to-machine / connected objects / Internet of Things applications

– Data centers and cloud technologies:

Data storage and file management are simplified

Promising technologies that do not yet offer simple, stable and mature solutions.
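The move away from batch processing can be illustrated with an aggregate that is updated as each reading arrives, instead of being recomputed over a stored data set. The following is a minimal Python sketch, assuming a hypothetical stream of numeric sensor readings; `running_mean` is an illustrative name and is not part of any framework mentioned in these slides.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one result per input value."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier values do not need to be kept in memory.
        mean += (value - mean) / count
        yield mean


if __name__ == "__main__":
    readings = [4.0, 6.0, 8.0]  # hypothetical sensor readings
    for current in running_mean(readings):
        print(current)
```

Because the function is a generator, a consumer can react to each intermediate result as soon as it is produced, which is the property a streaming system relies on.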

(6)

CRIM AND BIG DATA

To continue developing our expertise by leveraging Big Data technologies in advanced analytics, but also in human-systems interactions and in architectures and advanced technologies for software development and testing

– New ways to think about complex problems

– Emphasis on problems involving unstructured data

– Empirical knowledge of Big Data technologies to accompany enterprises and organisations

– Application-driven with concrete use cases

– Looking for the 5th V: Value

• We prefer talking about SMART DATA

• Multidisciplinary approach:

– Data science

– Advanced analytics / machine learning

– Visualisation and interaction

– Business analysts

– Governance and data quality

– Product management

– Data governance

– Architecture and software development

(7)

SMART DATA: ADVANCED ANALYTICS

Four levels of analytics, with value and difficulty both increasing from the first to the last:

• Descriptive: What happened?

• Diagnostic: Why?

• Predictive: What will happen?

• Prescriptive: How to make it happen?
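The gap between the descriptive and predictive levels can be shown on a toy series. The sketch below is an illustration only, assuming a hypothetical list of evenly spaced measurements: `describe` answers "what happened?" with an average, and `predict_next` answers "what will happen?" by extrapolating an ordinary least-squares line. Both function names are assumptions introduced for this example.

```python
from statistics import mean
from typing import Sequence


def describe(values: Sequence[float]) -> float:
    """Descriptive level: summarise what happened with the average."""
    return mean(values)


def predict_next(values: Sequence[float]) -> float:
    """Predictive level: extrapolate a least-squares line one step ahead."""
    n = len(values)
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    # Slope and intercept of the ordinary least-squares fit y = a + b * x.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


if __name__ == "__main__":
    series = [10.0, 12.0, 14.0, 16.0]  # hypothetical measurements
    print(describe(series), predict_next(series))
```

The diagnostic and prescriptive levels are not covered here: they require causal reasoning and a model of available actions, which is part of why difficulty rises along with value.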

(8)

THE A²DI PROJECT (ADVANCED ANALYTICS FOR DATA INTELLIGENCE)

– Goals

▪ Develop practical expertise with Big Data technologies (analytics, interaction, visualisation)

▪ Consolidate CRIM’s advanced analytics components

▪ Build concrete use cases that can serve as an interactive technology showcase (« vitrine technologique »)

▪ Foster multidisciplinary projects

▪ Develop new collaborations and partnerships

(9)

THE A²DI INFRASTRUCTURE

Data collection and preparation

Storage

Data enrichment

Metadata

Analytics, data mining, machine learning, inference, fusion, statistical, heuristics

Visualisation

Decision support

Configurable environment: specific deployments for selected use-cases

Openstack

Data analytics tools

Hadoop / Spark

Partners and external environment

(10)

DATA SET FROM OCEAN NETWORKS CANADA

Data types: Streaming Data, Text Data, Multi-dimensional Time Series, Geo Spatial, Video & Image, Audio, Relational, Social Network, RT Monitoring

Manual annotations, log files

Spectrogram, echo sounder, hydrophone

Vertical profiling system, sonar

Navigation information, bathymetry, maps

Fixed cameras and cameras mounted on a rover

Narrative description Ontologies

Video & audio streams

(11)

USE-CASE # 1

– Keyword detection from the audio track of submarine maintenance videos

▪ Approximately 300 hours to process

▪ Specialized vocabulary in biology and submarine navigation

– Apache Spark

▪ High-level library for processing very large data sets

▪ Developed at AMPLab (UC Berkeley) in 2009

▪ Generalized MapReduce paradigm: 30x faster, with low latency for streaming applications

▪ Distributed in-memory computing

▪ Now more popular than Hadoop

▪ Native integration with: Hadoop, Elasticsearch, Cassandra, RDBMS, Play!, etc.
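The MapReduce pattern behind this use case can be sketched in plain Python. This is not Spark code and not CRIM's pipeline: it assumes the audio has already been transcribed to text segments, and the keyword set, function names and sample sentences are all hypothetical. Each segment is mapped to partial keyword counts, and the partial counts are then reduced into one total, which is the step a cluster framework would distribute across machines.

```python
from collections import Counter
from functools import reduce
from typing import Iterable

# Hypothetical domain vocabulary; the real keyword list is not given in the slides.
KEYWORDS = {"rover", "sonar", "hydrophone"}


def map_segment(transcript: str) -> Counter:
    """Map step: count the keywords found in one transcript segment."""
    words = (w.strip(".,;:!?").lower() for w in transcript.split())
    return Counter(w for w in words if w in KEYWORDS)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


def count_keywords(transcripts: Iterable[str]) -> Counter:
    """Apply the map step to every segment, then fold the results together."""
    return reduce(reduce_counts, map(map_segment, transcripts), Counter())


if __name__ == "__main__":
    segments = [
        "The rover approaches the sonar.",
        "Sonar ping received; rover holding.",
    ]
    print(count_keywords(segments))
```

Because `reduce_counts` is associative and commutative, partial counts can be merged in any order, which is what allows the work to be split over many workers.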

(12)

ELASTICSEARCH

– Distributed search engine

▪ NoSQL document database

▪ High availability

▪ Linear horizontal scalability

– Widely used in industry

– Features:

▪ Full-text advanced search (Lucene)

▪ Geospatial queries

▪ Approximate string matching

▪ Real-time analytics

– Native integration with: Hadoop (HDFS), Spark, etc.
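Two of the features listed above, approximate string matching and geospatial queries, can be combined in a single request body. The sketch below only builds the JSON body using the Elasticsearch Query DSL (`match` with `fuzziness`, and a `geo_distance` filter); it does not contact a cluster. The field names `transcript` and `location`, the coordinates and the function name are assumptions, and the `location` field would need to be mapped as a `geo_point` in the index.

```python
import json
from typing import Any, Dict


def build_search_body(
    text: str,
    lat: float,
    lon: float,
    distance: str = "5km",
    text_field: str = "transcript",  # hypothetical field name
    geo_field: str = "location",  # hypothetical geo_point field name
) -> Dict[str, Any]:
    """Build a query body: fuzzy full-text match restricted to a radius."""
    return {
        "query": {
            "bool": {
                # Fuzzy matching tolerates small spelling differences.
                "must": {
                    "match": {text_field: {"query": text, "fuzziness": "AUTO"}}
                },
                # The filter keeps only documents within the given distance.
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        geo_field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_search_body("hydrophon", lat=48.3, lon=-126.0)
    print(json.dumps(body, indent=2))
```

The resulting dictionary is what a client would send as the body of a search request against an index holding the annotated segments.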

(13)

USE-CASE # 2

– Integration of geolocation data

▪ Keyword positions

▪ Rover position

▪ Satellite imagery

▪ Sonar location

– GeoMesa: spatio-temporal layer for Accumulo (NoSQL)

▪ GeoMesa + Accumulo = big-data equivalent of PostGIS + PostgreSQL

▪ Storage, querying and processing of spatio-temporal vector big data

▪ OGC standards support: WMS, WFS, WPS

– Use cases:

▪ Density heatmaps

▪ Batch or streaming analytics

▪ Spatio-temporal predictive analytics

– Native integration with:

▪ Spark (analytics and clustering)

▪ GeoServer (web mapping) and OpenLayers (front end)

▪ GeoTrellis for raster geospatial data (satellite imagery, etc.)
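A density heatmap boils down to a spatio-temporal filter followed by counting points per grid cell. The sketch below shows that idea in plain Python; it does not use the GeoMesa API, and the `Observation` type, the function name, the bounding box and the sample points are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Observation(NamedTuple):
    lon: float
    lat: float
    t: float  # seconds since an arbitrary reference time


def density_grid(
    observations: Iterable[Observation],
    bbox: Tuple[float, float, float, float],  # (min_lon, min_lat, max_lon, max_lat)
    t_range: Tuple[float, float],  # (start, end), end excluded
    cell_size: float,
) -> Counter:
    """Count observations per (row, col) cell inside a box and a time window."""
    min_lon, min_lat, max_lon, max_lat = bbox
    t_start, t_end = t_range
    grid: Counter = Counter()
    for obs in observations:
        in_space = min_lon <= obs.lon < max_lon and min_lat <= obs.lat < max_lat
        in_time = t_start <= obs.t < t_end
        if in_space and in_time:
            col = int((obs.lon - min_lon) // cell_size)
            row = int((obs.lat - min_lat) // cell_size)
            grid[(row, col)] += 1
    return grid


if __name__ == "__main__":
    points = [
        Observation(0.1, 0.1, 0.0),
        Observation(0.2, 0.3, 5.0),
        Observation(1.5, 0.2, 5.0),
        Observation(0.1, 0.1, 100.0),  # outside the time window
        Observation(5.0, 5.0, 1.0),  # outside the bounding box
    ]
    print(density_grid(points, (0.0, 0.0, 2.0, 2.0), (0.0, 10.0), 1.0))
```

A spatio-temporal store applies the same filter through an index, so only candidate rows are read rather than every observation being scanned as in this loop.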

(14)

USE-CASE # 3

– Keyword search enhancement with ontologies from Web resources

▪ Natural language processing
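One simple way ontologies can enhance keyword search is query expansion: a search term is extended with the narrower terms the ontology lists under it. The sketch below assumes a tiny hand-written ontology stored as a dictionary; the terms, the structure and the `expand_query` name are hypothetical and do not reflect the resources used in the project.

```python
from typing import Dict, List

# Hypothetical ontology fragment: broader term -> narrower terms.
ONTOLOGY: Dict[str, List[str]] = {
    "fish": ["rockfish", "sablefish"],
    "crab": ["tanner crab"],
}


def expand_query(terms: List[str], ontology: Dict[str, List[str]]) -> List[str]:
    """Return the query terms followed by their narrower terms, without duplicates."""
    expanded: List[str] = []
    for term in terms:
        key = term.lower()
        for candidate in [key, *ontology.get(key, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(["Fish", "sonar"], ONTOLOGY))
```

A search for "fish" would then also retrieve segments where only a specific species is mentioned, which matters for a specialized vocabulary.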

(15)

PLATFORM DEMONSTRATION

(16)

ANOTHER BIG DATA PROJECT

• VESTA

– Video Evaluation System for Task Analysis

• LEADS research network: Learning Environments Across Disciplines

– Education sciences: How do students learn?

– 6 universities and 11 partner organizations (Canada)

– 13 universities and 4 partner organizations (North America, Europe, Australia)

– Led by Dr Susanne Lajoie (McGill University)

(17)

LEADS CONTEXT

• Video analysis of students in learning situations

– Video content: typically one student, many tasks

– Audio content: Think aloud, reading, conversation, answering questions

– Video

– Local sources

– Access rights management

– Manual transcripts

– Manual coding

– Data sharing

(18)

THE VESTA PLATFORM FEATURES

• A Web-based platform relying on some of the most recent HTML5 features

• 5 semi-automated annotation services

– Speaker identification

– Transcription

– Audio-text correspondence

– Transition detection (video)

– Face detection

• 3 utility services

– Annotation storage

– Load balancing / task dispatching

– Multimedia file storage

• Access rights management taking into account ethics approval for research protocols

(19)

THE VESTA PLATFORM

(20)

CONCLUSIONS

• Big Data offers huge potential, largely underexploited at this time

• Like numerous fundamental changes, expect a long journey

• Establish an ambitious vision, accomplish modest first steps but with a tangible value

• There is no “one size fits all” approach; it must be tailored to the specific use case

• The question is not “Too Big or not Too Big”, what is important is data intelligence (“Smart Data”) that brings concrete value to the organisation

• Big Data technologies can also be used in other contexts

• On top of technological challenges, human challenges will dominate and determine the success or failure of specific initiatives.

(21)

PITFALLS

“A wealth of information creates a poverty of attention and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”

Herbert Simon: Designing Organizations for an Information-Rich World; 1

• Not planning enough

• Planning too much

• Weak commitment

• Thinking it will be easy to implement

• Minimising issues related to change management
