
Language Technology for

Big Data Analytics

Connecting Europe for New Horizons

Berlin, 20 September 2013

Wolfgang Wahlster

Saarbrücken, Kaiserslautern, Bremen, Berlin, Osnabrück Phone: +49 (681) 85775-5252/4162

Email: [email protected]

(2)

© DFKI GmbH

Big Data: Data as Tradeable Assets

•  Life Log Data for Individuals and Objects, Digital Product Memories
•  Financial Messaging Data: Supervision of Banks, Stock Exchange and High Speed Trade
•  Mass Data from Social Networks
•  Mass Data from Smart Grid and Smart Metering
•  Sensor Data for Weather, Climate, Smart City and Smart Home
•  Production and Machine Data from Industry 4.0
•  Mass Data from Individualized Medicine, from Genome Analysis and Imaging Methods
•  3D-Internet-Data and Media Streams
•  Mass Data from Mobility by Car2Car-Communication and

95% of the 1.2 zettabytes of worldwide digital data are unstructured, with a data growth of 62% per year.
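The 62% annual growth figure quoted on this slide compounds quickly. As a rough illustration only, the sketch below projects the 1.2 zettabyte baseline forward under the assumption that the growth rate stays constant, which the slide itself does not claim. The function name `project_volume` is hypothetical.

```python
def project_volume(base_zb: float, annual_growth: float, years: int) -> float:
    """Return the data volume (in zettabytes) after `years` of compound growth."""
    return base_zb * (1.0 + annual_growth) ** years


if __name__ == "__main__":
    # Figures from the slide: 1.2 ZB baseline, 62% growth per year.
    # Assumption: the rate is constant over the whole period.
    for year in range(0, 6):
        volume = project_volume(1.2, 0.62, year)
        print(f"after {year} year(s): {volume:.2f} ZB")
```

Under this constant-rate assumption the volume roughly doubles every year and a half.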

(3)

Outline of the Talk

1. Human- and Machine-generated Big Data
2. Major German and European Big Data Initiatives
3. The Role of Language Technology for Big Data Analytics
4. The Need for Real-Time NL Analysis and Generation for Application Impact

(4)

Development of Global Data Volumes by 2020 (in Zettabyte)

Source: AT Kearney 2013

Mainly machine-generated data, but also encoded in natural language for human inspection. Multilingual natural language generation becomes more and more important.

(5)

Human-Generated Internet Content: Zettabyte of Unstructured BIG Data

Exponential Growth of Internet Data, Every Minute:

•  More than 2 Million Queries
•  3125 new Photos uploaded
•  284,166,667 new Emails sent
•  68.4478 messages
•  100,000 New Tweets
•  571 new homepages
•  47,000 Apps downloaded
•  48 Hours of Videos uploaded

Commercial Spoken Language Access:

•  Siri (Apple)
•  Google Now (Google)
•  S Voice (Samsung)
•  Cortana (Microsoft)

Still Missing: Crosslingual Information Retrieval

(6)

From DB to BD: New BIG DATA Services

Language Technology as a Key Enabler for BIG DATA

•  Machine Learning
•  Multimodal Interaction
•  Information Extraction from Text and Video

BIG DATA: Databases, Data Warehouses, WWW

New Services based on Cloud Nets: Decision Support, Prediction, Simulation, Knowledge Discovery, Information Trading, Fusion, Optimization, Modeling

Classical Data Material: High Information Density, Much used for ICT

New Data Material: Low Information Density, Less used for ICT

[Diagram axes: Complexity; degree of structure (structured, lower degree of structure, unstructured); data volume (10^12 = TERA to PETA = 10^15, 10^18 = EXA to ZETTA = 10^21)]

(7)

The Era of BIG DATA: Increasing Importance of Real-Time Natural Language Processing

~1960: Digital Data (Documentation)
•  Digital data collection
•  First databases

~1980: Data Warehousing (Enterprise Management)
•  Data cubes
•  Relational databases
•  Financial data

~1990: Data Mining (Process Optimization)
•  Statistics
•  Artificial Intelligence
•  Machine learning
•  Knowledge discovery
•  Unstructured data

Today: Big Data Analytics (Real-time Decision Support and Control)
•  Stream processing
•  Collective intelligence
•  Massively distributed analytics
•  NoSQL databases
•  Heterogeneous data and knowledge
•  Petabytes and Zettabytes of data

Smart Data Engineering

(8)

Focus of Data Analytics Is Changing from Description of Past to Decision Support of Today

Current penetration across all industries (according to Gartner 2013), ordered by increasing value and complexity:

•  Descriptive Analytics: What happened? 99%, adopted by the vast majority but not on all data
•  Diagnostic Analytics: Why did it happen? 30%, adopted by minorities
•  Predictive Analytics: What will happen? 13%, still few adopters
•  Prescriptive Analytics: What shall we do? 3%, very few early adopters

Stages: Inform, Analyze, Act

(9)

The Processing Cycle for BIG DATA

Language Technology

Search for BIG DATA Sources
•  Linked Data Potential
•  Data Fusion Potential
•  Sensor Selection and Sensor Positioning
•  Search for Open Data Sources and Streams

BIG DATA Collection
•  Unstructured Data
•  Multimodal Data
•  Uncertain Data
•  Complex Events
•  Sensor Data
•  Data Streams
•  Deep Web Data

BIG DATA Analysis
•  Data Cleansing
•  Information Extraction
•  Semantic Analysis
•  Sentiment Analysis
•  Data Correlation
•  Pattern Recognition
•  Real-Time Analytics
•  Machine Learning

BIG DATA Maintenance
•  Backtracing Data Origins
•  Data Enrichment
•  Annotation and Tagging
•  Data Validation
•  Redundancy Avoidance
•  Consistency Checking
•  Inference & Abstraction

Data Storage & Retrieval
•  In-Memory Technologies
•  HANA, TERRACOTTA
•  Column Databases
•  NoSQL
•  Cloud Storage
•  Densification Technology
•  Aggregation Procedures
•  Compression Techniques

BIG DATA Exploitation
•  Decision Support in Real Time
•  Prediction
•  Simulation
•  Exploration
•  Modeling
•  Monitoring, Alerting, Reporting
•  Controlling
•  Smart Data Engineering

(10)

Preparation of Second ICT Future Project for Service@Digital for Smart Services: Internet-based Services for Economy

Cloud Infrastructure: Virtualization Chain
1 Hardware
2 Software
3 Information & Knowledge (Big Data)

Increased Value Creation:
•  KaaS: Knowledge as a Service
•  BDaaS: Big Data as a Service
•  SaaS: Software as a Service
•  IaaS: Infrastructure as a Service
•  BPaaS: Business Process as a Service
•  PaaS: Platform as a Service

Roles: Knowledge Worker, Clerk, Application Developer, System Administrator

Source: DFKI Business Designer

(11)

German BIG DATA Conferences

•  First National Conference, 3 June 2013, Berlin. BIG DATA: Exploring Data Treasures in Science and Industry
•  Second National Conference, 11-12 November 2013, Berlin. BIG DATA: Potentials for Germany
•  BIG DATA Summit, 24 June 2013, Bonn
•  BIG DATA Platform CEO Forum, 10 December 2013, Hamburg

(12)

German BIG DATA Funding Programs

2-3 National BIG DATA Competence Centers
•  Submission Deadline: 12 July 2013
•  5-10 Years Perspective
•  Application Oriented Basic Research
•  Finalist Selection: 12 Sept. 2013
•  Senior Expert Jury: November 2013
•  Announcement: IT Summit, 10 Dec 2013

Consortia Projects between Industry and Academia
•  Program Announcement: November 2013
•  Emphasis on SME Involvement

Consortia Projects between Industry and Academia
•  Submission Deadline: 12 July 2013
•  Applied Research

(13)


BIG coordination and support action

Strategic Director:

Wolfgang Wahlster, DFKI ! #JH%BUBJTBOFNFSHJOHöFMEXIFSFJOOPWBUJWFUFDIOPMPHZPòFST BMUFSOBUJWFTUPSFTPMWFUIFJOIFSFOUQSPCMFNTUIBUBQQFBSXIFO XPSLJOHXJUIIVHFBNPVOUTPGEBUBQSPWJEJOHOFXXBZTUP SFVTFBOEFYUSBDUWBMVFGSPNJOGPSNBUJPO #JH%BUBPòFSTUSFNFOEPVTVOUBQQFEQPUFOUJBMWBMVFGPSNBOZ TFDUPST)PXFWFSOPTQFDJöDJOUFMMJHFOUMBSHFEBUBIBOEMJOH CSPLFSJOHJOEVTUSJBMTFDUPSFYJTUT'VSUIFSNPSFGSPNBOJOEVTU SJBMBEPQUJPOQPJOUPGWJFX&VSPQFJTMBHHJOHCFIJOE64JO#JH %BUBUFDIOPMPHJFT"DMFBSTUSBUFHZUPBMJHOTVQQMZBOEEFNBOE JTOFFEFEBTBXBZPGJODSFBTJOHDPNQFUJUJWFOFTTPG&VSPQFBO JOEVTUSJFT #VJMEJOHBOJOEVTUSJBMDPNNVOJUZBSPVOE#JH%BUBJO&VSPQF XJMMCFUIFQSJPSJUZPGUIJTQSPKFDUUPHFUIFSXJUITFUUJOHVQUIF OFDFTTBSZDPMMBCPSBUJPOBOEEJTTFNJOBUJPOJOGSBTUSVDUVSFUPMJOL UFDIOPMPHZTVQQMJFSTJOUFHSBUPSTBOEMFBEJOHVTFSPSHBOJTBUJPOT #*(BJNTUPQSPWJEFBQMBUGPSNGPSJOEVTUSZSFTFBSDIQPMJDZNB LFSTBOEDPNNVOJUZJOJUJBUJWFTUPEJTDVTTUIFDIBMMFOHFTPG#JH %BUBBOEUIFFNFSHJOH%BUB&DPOPNZBOEUPEFWFMPQTVJUBCMF BDUJPOQMBOTGPSBEESFTTJOHUIFTFDIBMMFOHFT 3FQSFTFOUBUJWFHSPVQTGSPNSFTFBSDIBOEJOEVTUSZXJMMTFUVQ UPHFUIFSUIFOFDFTTBSZDPMMBCPSBUJPOBOEEJTTFNJOBUJPOJOGSB TUSVDUVSFUPMJOLUFDIOPMPHZTVQQMJFSTJOUFHSBUPSTBOEMFBEJOH VTFSPSHBOJTBUJPOT#JH%BUB1VCMJD1SJWBUF'PSVN #*(XJMMXPSL UPXBSETUIFEFöOJUJPOBOEJNQMFNFOUBUJPOPGBDMFBSTUSBUFHZUIBU UBDLMFTUIFOFDFTTBSZFòPSUTJOUFSNTPGSFTFBSDIBOEJOOPWBUJPO BOE̓XJMMBMTP̓QSPWJEFBNBKPSCPPTUGPSUFDIOPMPHZBEPQUJPO BOETVQQPSUJOHBDUJPOTGSPNUIF&VSPQFBO$PNNJTTJPOJOUIF TVDDFTTGVMJNQMFNFOUBUJPOPGUIF#JH%BUBFDPOPNZ 1SPKFDUDPGVOEFECZUIF&VSPQFBO$PNNJTTJPOXJUIJOUIF UI'SBNFXPSL1SPHSBNNF (SBOU"HSFFNFOU/P

5IFNBJOFWFOUDPPSHBOJTFECZBIGJTUIF&VSPQFBO%BUB

'PSVNTFSJFT POMJOFBUIUUQEBUBGPSVNFVXJUIJUTOFYU JOTUBMMBUJPOUBLJOHQMBDFJO"QSJMJO%VCMJO*SFMBOE 5IFEuropean Data Forum (EDF)JTBNFFUJOHQMBDFGPS

JOEVTUSZSFTFBSDIQPMJDZNBLFSTBOEDPNNVOJUZJOJUJBUJWFT UPEJTDVTTUIFDIBMMFOHFTPG#JH%BUBBOEUIFFNFSHJOH%BUB &DPOPNZBOEUPEFWFMPQTVJUBCMFBDUJPOQMBOTGPSBEESFTTJOH UIFTFDIBMMFOHFT 0GTQFDJBMGPDVTGPSUIF&%'BSF4NBMMBOE.FEJVNTJ[FE&OUFS QSJTFT 4.&TTJODFUIFZBSFESJWJOHJOOPWBUJPOBOEDPNQFUJUJ POJONBOZEBUBESJWFOFDPOPNJDTFDUPST 5IFSBOHFPGUPQJDTEJTDVTTFEBUUIF&VSPQFBO%BUB'PSVNSBO HFTGSPNOPWFMEBUBESJWFOCVTJOFTTNPEFMT FHEBUBDMFBSJOH IPVTFTBOEUFDIOPMPHJDBMJOOPWBUJPOT FH-JOLFE%BUB8FC UPTPDJFUBMBTQFDUT FHPQFOHPWFSONFOUBMEBUBBTXFMMBT EBUBQSJWBDZBOETFDVSJUZ

"50441"*/4"

4QBJO

5IF1SFTT"TTPDJBUJPO-JNJUFE

6OJUFE,JOHEPN

4JFNFOT"LUJFOHFTFMMTDIBGU

(FSNBOZ

"(5(SPVQ 3%(NC)

(FSNBOZ

6OJWFSTJUZPG*OOTCSVDL

"VTUSJB

/BUJPOBM6OJWFSTJUZPG*SFMBOE(BMXBZ

*SFMBOE

*OTUJUVUGàS"OHFXBOEUF*OGPSNBUJLF7

BOEFS6OJWFSTJUÊU-FJQ[JH

(FSNBOZ

%FVUTDIFT'PSTDIVOHT[FOUSVN

GàS,àOTUMJDIF*OUFMMJHFO[

(FSNBOZ

0QFO,OPXMFEHF'PVOEBUJPO%FVUTDIMBOE

(FSNBOZ

45**OUFSOBUJPOBM$POTVMUJOHVOE

3FTFBSDI(NC)

"VTUSJB

&YBMFBE

'SBODF

http://data-forum.eu

BIG DATA

BIG PARTNER

EUROPEAN DATA FORUM

EU consortium (11 partners, including ATOS)

The goal is to help the EC define a roadmap for Big Data

(14)

Key facts about BIG Project

Type of project: CSA
Project start date: September 2012
Duration: 26 months
Call: FP7-ICT-2011-8
Effort: 552.5 PM
Budget: 3.038 M
Max EC contribution: 2.499 M
Consortium: 11 partners

(15)

Major Activities of the BIG Forum

•  Identification of Sector’s requisites
•  Applicability of Big Data technology in each Sector
•  Elaboration of Sector Roadmap
•  Requirements and objectives from all Sectors
•  Industry-driven sector forums
•  Big Data technologies and their capabilities
•  Technical Working Groups
•  Technical White Papers
•  Sectorial roadmap (elaborate a roadmap per sector)
•  Contributions towards integrated roadmap

(16)

Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off

http://www.wiwo.de/so-waehlt-das-netz/

“The Net elects”

•  Analyzing Twitter and Facebook feeds for the German Federal elections 2013, 22 September
•  Analysis for candidates, parties, and top tweets
•  Online feature of the German weekly magazine “Wirtschafts-Woche”, powered by Attensity

(17)

Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off

http://www.wiwo.de/so-waehlt-das-netz/

•  Selection of 10,000 Tweets from 400 Million every day
•  Plus 17,000 Facebook entries per week

Advanced NLP:

•  Negation analysis
•  Analysis of counterfactuals and comparatives
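To make the role of negation analysis concrete, here is a minimal sketch of a lexicon-based sentiment scorer that flips the polarity of a word when it follows a negator. It is not the system behind the election feature. The word lists, the three-token negation scope and the function name `sentiment_score` are all illustrative assumptions.

```python
import re

# Tiny hypothetical lexicons; a production system would use far richer resources.
POSITIVE = {"good", "great", "strong", "convincing"}
NEGATIVE = {"bad", "weak", "poor", "disappointing"}
NEGATORS = {"not", "no", "never", "n't"}


def sentiment_score(text: str, scope: int = 3) -> int:
    """Sum word polarities, flipping any polarity word that appears
    within `scope` tokens after a negator."""
    # Split contractions such as "wasn't" into "was" + "n't".
    tokens = re.findall(r"n't|[a-z]+", text.lower().replace("n't", " n't"))
    score = 0
    negation_left = 0  # number of upcoming tokens still inside the negation scope
    for token in tokens:
        if token in NEGATORS:
            negation_left = scope
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity and negation_left > 0:
            polarity = -polarity
        score += polarity
        negation_left = max(0, negation_left - 1)
    return score
```

Without the negation step, "The speech was not good" would be scored like "The speech was good". Counterfactuals and comparatives need more than a fixed token window and are outside the scope of this sketch.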

(18)

Example of Technology Transfer via Brains

(19)

                   

Multilingual Tweet Analysis for Disaster Management

11 March 2011, Fukushima

Users broadcast their experience immediately: the number of tweets increased immediately after the earthquake and tsunami. 1.5 million Twitter messages were analyzed by Collier and his group (cf. Son Doan, Bao-Khanh Ho Vo, and Nigel Collier, "An analysis of Twitter messages in the 2011 Tohoku Earthquake", National Institute of Informatics).

The first Japanese tweets on the earthquake are as follows:

2011-03-11T05:48:08 "地震!" [Earthquake!]
2011-03-11T05:48:08 "地震だ〜縦揺れ!" [Earthquake ~ vertical shake!]
2011-03-11T05:48:14 "地震!!!!" [Earthquake!!!!]

First two English tweets sent from an iPhone:

2011-03-11T05:48:54 Huge earthquake in TK we are affected!
2011-03-11T05:49:01 BIG EARTHQUAKE!!!
2011-03-11T05:50:00 Massive quake in Tokyo

The first tweet about a tsunami was an eyewitness tweet 6 minutes after the earthquake occurred at its epicentre:

2011-03-11T05:52:23 "オレ、津波の様子見てくるわ!!!!" [I can see tsunami is coming!!!!]

The first concerns about nuclear plants appeared right after the earthquake:

2011-03-11T09:50:49 "福島原発ヤバい状況らしい" [The Fukushima plant is in a really bad situation.]

Challenge: Post-hoc Analysis must be turned into Real-Time Analysis
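One way to picture the step from post-hoc to real-time analysis is a sliding-window burst detector that raises an alert as soon as enough matching messages arrive within a short interval. The sketch below is only an illustration under stated assumptions: the 60-second window, the threshold of three messages and the function name `detect_burst` are hypothetical, and it presumes that keyword and language filtering has already happened upstream.

```python
from collections import deque
from datetime import datetime, timedelta


def detect_burst(timestamps, window=timedelta(seconds=60), threshold=3):
    """Return the time at which `threshold` messages first fall inside one
    sliding `window`, or None if that never happens.

    `timestamps` must arrive in ascending order, as in a live stream.
    """
    recent = deque()
    for ts in timestamps:
        recent.append(ts)
        # Drop messages that have fallen out of the window.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return ts
    return None


# Timestamps of the earthquake tweets quoted on the slide.
tweets = [datetime.fromisoformat(t) for t in (
    "2011-03-11T05:48:08",
    "2011-03-11T05:48:08",
    "2011-03-11T05:48:14",
    "2011-03-11T05:48:54",
    "2011-03-11T05:49:01",
    "2011-03-11T05:50:00",
)]
alert_time = detect_burst(tweets)
```

With these assumed settings the alert fires at the third tweet. Because each message is processed once as it arrives, the same logic works on a live stream and not only on an archived corpus.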

(20)

BioCaster: Early Alerting for Public Health Events - detects seasonal influenza and hay fever.

born.nii.ac.jp:

•  Trend graphs
•  Event maps
•  Event alerts
•  Ontology browsing
•  Email/GeoRSS alerting
•  Real-time Twitter analysis
•  Up to date news in multiple languages
•  Event database search

GHSAG partners: US, UK, FR, DE, WHO, IT, JP, CA

(21)

Alignment of European and German National Projects Dealing with BIG DATA

•  BIG DATA PPP Forum
•  PPPs
•  Service@Digital
•  Chair of Advisory Committee

(22)

European and German Software Platforms for BIG DATA Processing

•  Generic Enabler for BIG DATA: for batch and online stream processing of BIG DATA
•  In-memory databases:
   BigMemory MAX: Real-time Access to 100s of TBs
   SAP HANA: BIG DATA Platform with up to 250 TBs
•  Open-source cluster/cloud computing framework for BIG DATA analytics

(23)

BIG DATA Generic Enabler of the FI-PPP

The key technical concept of the FI-PPP is the provisioning of Generic Service Platforms supported by reusable, standardized and commonly shared key technologies and components which shall be termed “Generic Enablers”, which can be applied by a multiplicity of “Smart Application” usage domains across multiple sectors.

[Diagram: Generic Enablers from the FI-WARE Catalogue (among them the Complex Event Generic Enabler, the Cloud Generic Enabler and the BIG DATA Generic Enabler) are assembled into a FI-WARE Instance for a Future Internet Smart Application.]

BIG Data Generic Enabler

•  Streaming and batch processing functionalities both in one single platform.
•  Automatic deployment capabilities in a cloud-based cluster of nodes.
•  Wide range of available data injectors.
•  High speed access to the resulting insights via a NoSQL database.
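One reading of "streaming and batch processing in one single platform" is that the same processing logic serves both modes. The sketch below illustrates that idea in plain Python and does not use or represent the FI-WARE API. The hashtag-counting task and the names `count_hashtags` and `batch_result` are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def count_hashtags(messages: Iterable[str]) -> Iterator[Counter]:
    """Streaming mode: yield the running hashtag counts after each message."""
    counts: Counter = Counter()
    for message in messages:
        counts.update(w.lower() for w in message.split() if w.startswith("#"))
        yield counts.copy()


def batch_result(messages: Iterable[str]) -> Counter:
    """Batch mode: run the same logic over a finite input, keep the final counts."""
    final: Counter = Counter()
    for snapshot in count_hashtags(messages):
        final = snapshot
    return final


# Hypothetical input; in streaming mode this could be any iterator over live data.
sample = ["#BigData meets #NLP", "more #bigdata from sensors"]
snapshots = list(count_hashtags(sample))
totals = batch_result(sample)
```

Writing the logic once against an iterator means a finite list and an unbounded feed are handled by the same code path, and only the consumer decides whether it wants intermediate snapshots or the final result.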

(24)

[Diagram: FI-WARE Catalogue, FI-WARE Open Innovation Lab, FI-WARE Shared Trial Facility, Specific Use Case Trial Facility]

The FI-PPP promotes and enables large-scale experimentation and validation of the platforms in real-life application contexts involving a range of actors across domains, including large companies, SMEs, the research community as well as public administrations and citizens.

The open platform approach further creates novel opportunities for entrepreneurship, new businesses and innovative value creation models based on cross-sector industrial partnerships.

(25)

BIG DATA Analytics for Financial Fraud Detection and Prevention

Use of the Terracotta Platform of DFKI Shareholder Software AG

•  Mitigated 100s of millions in fraudulent credit card transactions
•  Reduced fraud detection processing time from 45 minutes to less than 4 seconds
•  NLP Analysis of Sales Items
•  99.999% completed transactions with 4,000 fraud detection rules checked
•  Reduced fraud processing time from 800 ms
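Checking a transaction against a set of fraud detection rules can be pictured as evaluating a list of predicates. The sketch below is a generic illustration and not the Terracotta-based system mentioned on the slide. The `Transaction` fields, the three example rules and their thresholds are all hypothetical, and the last rule only stands in for text analysis of sales items with a plain keyword match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transaction:
    amount: float
    country: str
    home_country: str
    item_description: str


@dataclass
class Rule:
    name: str
    matches: Callable[[Transaction], bool]


# Hypothetical example rules; the slide speaks of thousands of rules per transaction.
RULES: List[Rule] = [
    Rule("large_amount", lambda t: t.amount > 5000),
    Rule("foreign_country", lambda t: t.country != t.home_country),
    Rule("gift_card_item", lambda t: "gift card" in t.item_description.lower()),
]


def triggered_rules(tx: Transaction, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules that fire for this transaction."""
    return [rule.name for rule in rules if rule.matches(tx)]
```

Because every rule is an independent predicate, rules can be evaluated in parallel against data held in memory, which is the kind of workload where in-memory platforms are typically applied.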

(26)

New Business Models of DHL Exploiting BIG DATA Collected by their Employees during the Delivery of Parcels

GEOVISTA: BIG DATA TOOL for estimating earnings opportunities and analyzing business potential.

•  Prepare a realistic sales forecast.
•  Evaluate a desired location by using high-quality geodata and NLP CRM reports provided by the subsidiary Deutsche Post Direct.
•  Local competitors are analyzed with the aid of up-to-date data provided by beDirect.
•  Business-location factors and the area being studied are visualized on a digital map.

(27)

Trento EIT ICT Labs

BIG DATA Analytics for Intelligent Urban Management

(28)

Trentino Open Living Data (TOLD)

DFKI Is a Founding Core Partner of EIT ICT Labs and Has a Strong Collaboration with the Trento Node

(29)

Living Big Data: The Trentino Territory as a BIG DATA Lab

Telecommunication

Energy

(30)

EIT ICT Labs Business Framework for BIG DATA at Trento Rise

(31)

DFKI's Social Media BIG DATA Analytics App for Crowd Management

London Lord Mayor's Show, Olympics 2012

(32)

Conclusions

1. Big data technologies are an innovation motor for industry, science and government.

Real-time multilingual natural language analysis and generation, as well as translation technologies, are a key enabler for big data analytics.

2. Key research challenges are smart tools for real-time analytics and decision support based on intelligent information extraction from unstructured data.

3. The strengths of Europe, and in particular of Germany, in big data technology are commercial in-memory computing platforms like HANA or Terracotta, open-source platforms like FI-WARE GEs and Stratosphere, and multilingual language technologies.

4. In Germany, the focus is on big data applications for Industry 4.0, smart grids, advanced mobility and personalized medicine.
