Language Technology for
Big Data Analytics
Connecting Europe for New Horizons
Berlin, 20 September 2013
Wolfgang Wahlster
Saarbrücken, Kaiserslautern, Bremen, Berlin, Osnabrück
Phone: +49 (681) 85775-5252/4162
Email: [email protected]
© DFKI GmbH
Big Data: Data as Tradeable Assets
• Life Log Data for Individuals and Objects: Digital Product Memories
• Financial Messaging Data: Supervision of Banks, Stock Exchange and High Speed Trade
• Mass Data from Social Networks
• Mass Data from Smart Grid and Smart Metering
• Sensor Data for Weather, Climate, Smart City and Smart Home
• Production and Machine Data from Industry 4.0
• Mass Data from Individualized Medicine, from Genome Analysis and Imaging Methods
• 3D-Internet-Data and Media Streams
• Mass Data from Mobility by Car2Car-Communication and ...
95% of the 1.2 zettabytes of worldwide digital data are unstructured, with data growth of 62% per year.
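As a rough illustration of what the quoted figures imply, the sketch below compounds the 1.2 zettabyte volume at 62% per year and derives the doubling time. The function names and the assumption of a constant growth rate are illustrative only; the slide does not state a projection method.

```python
import math


def project_volume(start_zb: float, annual_growth: float, years: int) -> float:
    """Compound a starting data volume (in zettabytes) over whole years."""
    return start_zb * (1.0 + annual_growth) ** years


def doubling_time(annual_growth: float) -> float:
    """Years needed for the volume to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    total_zb = 1.2          # worldwide digital data, as quoted on the slide
    growth = 0.62           # 62% growth per year, as quoted on the slide
    unstructured_share = 0.95

    print(f"unstructured today: {total_zb * unstructured_share:.2f} ZB")
    for years in range(1, 6):
        print(f"after {years} year(s): {project_volume(total_zb, growth, years):.2f} ZB")
    print(f"doubling time: {doubling_time(growth):.2f} years")
```

Under these assumptions the volume doubles roughly every year and a half, which is the practical meaning of the growth figure for storage and analytics planning.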
Outline of the Talk
1. Human- and Machine-generated Big Data
2. Major German and European Big Data Initiatives
3. The Role of Language Technology for Big Data Analytics
4. The Need for Real-Time NL Analysis and Generation for Application Impact
Development of Global Data Volumes by 2020 (in Zettabytes)
Source: AT Kearney 2013
Mainly machine-generated data, but also encoded in natural language for human inspection: multilingual natural language generation is becoming more and more important.
Human-Generated Internet Content: Zettabytes of Unstructured BIG Data
Exponential Growth of Internet Data:
• More than 2 Million Queries Every Minute
• 3,125 new Photos uploaded
• 284,166,667 new Emails sent
• 684,478 messages
• 100,000 New Tweets
• 571 new homepages
• 47,000 Apps downloaded
• 48 Hours of Videos uploaded
Commercial Spoken Language Access:
• Siri (Apple)
• Google Now (Google)
• S Voice (Samsung)
• Cortana (Microsoft)
Still Missing: Crosslingual Information Retrieval
Machine Learning
Multimodal Interaction
Information Extraction from Text and Video
BIG DATA
Databases, Data Warehouses, WWW
New Services based on Cloud Nets: Decision Support, Prediction, Simulation, Knowledge Discovery, Information Trading, Fusion, Optimization, Modeling
Degree of structure: structured, lower degree of structure, unstructured
Data volume: GIGA (10^9) to PETA (10^15), EXA (10^18) to ZETTA (10^21)
New Data Material: Low Information Density, Less used for ICT
Classical Data Material: High Information Density, Much used for ICT
Complexity
From DB to BD: New BIG DATA Services
Language Technology as a Key Enabler for BIG DATA
The Era of BIG DATA: Increasing Importance of Real-Time Natural Language Processing
~1960: Digital Data (Documentation)
• Digital data collection
• First databases
~1980: Data Warehousing (Enterprise Management)
• Data cubes
• Relational databases
• Financial data
~1990: Data Mining (Process Optimization)
• Statistics
• Artificial Intelligence
• Machine learning
• Knowledge discovery
• Unstructured data
Today: Big Data Analytics (Real-time Decision Support and Control)
• Stream processing
• Collective intelligence
• Massively distributed analytics
• NoSQL databases
• Heterogeneous data and knowledge
• Petabytes and Zettabytes of data
Smart Data Engineering
Focus of Data Analytics Is Changing from Description of Past to Decision Support of Today
Current penetration across all industries (according to Gartner 2013), ordered by increasing value and complexity:
• Descriptive Analytics (What happened?): 99%, adopted by the vast majority, but not for all data
• Diagnostic Analytics (Why did it happen?): 30%, adopted by minorities
• Predictive Analytics (What will happen?): 13%, still few adopters
• Prescriptive Analytics (What shall we do?): 3%, very few early adopters
Inform, Analyze, Act
The Processing Cycle for BIG DATA
Language Technology
BIG DATA Collection
• Unstructured Data
• Multimodal Data
• Uncertain Data
• Complex Events
• Sensor Data
• Data Streams
• Deep Web Data
BIG DATA Analysis
• Data Cleansing
• Information Extraction
• Semantic Analysis
• Sentiment Analysis
• Data Correlation
• Pattern Recognition
• Real-Time Analytics
• Machine Learning
BIG DATA Maintenance
• Backtracing Data Origins
• Data Enrichment
• Annotation and Tagging
• Data Validation
• Redundancy Avoidance
• Consistency Checking
• Inference & Abstraction
Data Storage & Retrieval
• In-Memory Technologies
• HANA, TERRACOTTA
• Column Databases
• NoSQL
• Cloud Storage
• Densification Technology
• Aggregation Procedures
• Compression Techniques
BIG DATA Exploitation
• Decision Support in Real Time
• Prediction
• Simulation
• Exploration
• Modeling
• Monitoring, Alerting, Reporting
• Controlling
• Smart Data Engineering
Search for BIG DATA Sources
• Linked Data Potential
• Data Fusion Potential
• Sensor Selection and Sensor Positioning
• Search for Open Data Sources and Streams
Preparation of Second ICT Future Project for Service@Digital for Smart Services: Internet-based Services for Economy
Increased Value Creation
Users: Knowledge Worker, Clerk, Application Developer, System Administrator
• KaaS: Knowledge as a Service
• BDaaS: Big Data as a Service
• SaaS: Software as a Service
• IaaS: Infrastructure as a Service
• BPaaS: Business Process as a Service
• PaaS: Platform as a Service
Cloud Infrastructure
Virtualization Chain: 1 Hardware, 2 Software, 3 Information & Knowledge (Big Data)
Source: DFKI Business Designer
German BIG DATA Conferences
• First National Conference, 3 June 2013, Berlin: "BIG DATA: Exploring Data Treasures in Science and Industry"
• Second National Conference, 11-12 November 2013, Berlin: "BIG DATA: Potentials for Germany"
• BIG DATA Summit, 24 June 2013, Bonn
• BIG DATA Platform CEO Forum, 10 December 2013, Hamburg
German BIG DATA Funding Programs
2-3 National BIG DATA Competence Centers
• Submission Deadline: 12 July 2013
• 5-10 Years Perspective
• Application Oriented Basic Research
• Finalist Selection: 12 Sept. 2013
• Senior Expert Jury: November 2013
• Announcement: IT Summit, 10 Dec 2013
Consortia Projects between Industry and Academia
• Program Announcement: November 2013
• Emphasis on SME Involvement
Consortia Projects between Industry and Academia
• Submission Deadline: 12 July 2013
• Applied Research
BIG coordination and support action
Strategic Director: Wolfgang Wahlster, DFKI
EU consortium (11 partners, including ATOS)
The goal is to help the EC define a roadmap for Big Data
BIG DATA
Big Data is an emerging field where innovative technology offers alternatives to resolve the inherent problems that appear when working with huge amounts of data, providing new ways to reuse and extract value from information.
Big Data offers tremendous untapped potential value for many sectors. However, no specific intelligent large data handling/brokering industrial sector exists. Furthermore, from an industrial adoption point of view, Europe is lagging behind the US in Big Data technologies. A clear strategy to align supply and demand is needed as a way of increasing the competitiveness of European industries.
Building an industrial community around Big Data in Europe will be the priority of this project, together with setting up the necessary collaboration and dissemination infrastructure to link technology suppliers, integrators and leading user organisations.
BIG aims to provide a platform for industry, research, policy makers and community initiatives to discuss the challenges of Big Data and the emerging Data Economy and to develop suitable action plans for addressing these challenges.
Representative groups from research and industry will set up together the necessary collaboration and dissemination infrastructure to link technology suppliers, integrators and leading user organisations (Big Data Public Private Forum). BIG will work towards the definition and implementation of a clear strategy that tackles the necessary efforts in terms of research and innovation, and will also provide a major boost for technology adoption and supporting actions from the European Commission in the successful implementation of the Big Data economy.
Project co-funded by the European Commission within the 7th Framework Programme (Grant Agreement No. …)
EUROPEAN DATA FORUM
The main event co-organised by BIG is the European Data Forum series (online at http://data-forum.eu), with its next installation taking place in April in Dublin, Ireland.
The European Data Forum (EDF) is a meeting place for industry, research, policy makers and community initiatives to discuss the challenges of Big Data and the emerging Data Economy and to develop suitable action plans for addressing these challenges.
Of special focus for the EDF are Small and Medium-sized Enterprises (SMEs), since they are driving innovation and competition in many data-driven economic sectors.
The topics discussed at the European Data Forum range from novel data-driven business models (e.g. data clearing houses) and technological innovations (e.g. Linked Data Web) to societal aspects (e.g. open governmental data as well as data privacy and security).
BIG PARTNER
• ATOS SPAIN SA, Spain
• The Press Association Limited, United Kingdom
• Siemens Aktiengesellschaft, Germany
• AGT Group (R&D) GmbH, Germany
• University of Innsbruck, Austria
• National University of Ireland, Galway, Ireland
• Institut für Angewandte Informatik e.V. an der Universität Leipzig, Germany
• Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
• Open Knowledge Foundation Deutschland, Germany
• STI International Consulting und Research GmbH, Austria
• Exalead, France
http://data-forum.eu
Key facts about BIG Project
▶ Type of project: CSA
▶ Project start date: September 2012
▶ Duration: 26 months
▶ Call: FP7-ICT-2011-8
▶ Effort: 552.5 PM
▶ Budget: 3.038 M €
▶ Max EC contribution: 2.499 M €
▶ Consortium: 11 partners
Major Activities of the BIG Forum
Identification of Sector's requisites
▶ Requirements and objectives from all Sectors
▶ Industry-driven sector forums
Applicability of Big Data technology in each Sector
▶ Big Data technologies and their capabilities
▶ Technical Working Groups
▶ Technical White Papers
Elaboration of Sector Roadmap
▶ Sectorial roadmap (elaborate a roadmap per sector)
▶ Contributions towards integrated roadmap
Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off
http://www.wiwo.de/so-waehlt-das-netz/
"The Net elects"
Analyzing Twitter and Facebook feeds for the German federal elections on 22 September 2013
Analysis for candidates, parties, and top tweets
Online feature of the German weekly magazine "Wirtschafts-Woche", powered by Attensity
Social Media Analysis: BIG DATA NLP and Sentiment Analysis by the DFKI Spin-Off
http://www.wiwo.de/so-waehlt-das-netz/
Selection of 10,000 tweets from 400 million every day, plus 17,000 Facebook entries per week
Advanced NLP:
• Negation analysis
• Analysis of counterfactuals
• Analysis of comparatives
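To make the idea of negation analysis concrete, here is a minimal lexicon-based sketch in which a negator flips the polarity of the next few tokens. It is not the spin-off's method: the word lists, the window size `NEGATION_SCOPE`, and the function names are hypothetical, and counterfactuals and comparatives are not handled.

```python
import re

# Hypothetical toy lexicon; a real system would use a large, curated resource.
POSITIVE = {"good", "great", "strong", "convincing"}
NEGATIVE = {"bad", "weak", "poor", "disappointing"}
NEGATORS = {"not", "no", "never", "n't"}
NEGATION_SCOPE = 3  # assumed number of tokens affected by a negator


def tokenize(text):
    """Lowercase the text and split contractions so "n't" is its own token."""
    text = text.lower().replace("n't", " n't")
    return re.findall(r"n't|[a-z]+", text)


def sentiment(text):
    """Sum word polarities, flipping those that fall inside a negation scope."""
    score = 0
    remaining_scope = 0
    for token in tokenize(text):
        if token in NEGATORS:
            remaining_scope = NEGATION_SCOPE
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if remaining_scope > 0:
            polarity = -polarity
            remaining_scope -= 1
        score += polarity
    return score


if __name__ == "__main__":
    for tweet in ("A strong debate", "The speech wasn't convincing", "not a bad result"):
        print(tweet, "->", sentiment(tweet))
```

A plain keyword count would rate "wasn't convincing" as positive; tracking the negation scope is the smallest change that avoids this kind of error.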
Example of Technology Transfer via Brains
Multilingual Tweet Analysis for Disaster Management
11 March 2011, Fukushima
Users broadcast their experience immediately: the number of tweets rises right after the earthquake and tsunami. 1.5 million Twitter messages were analyzed by Collier and his group (cf. Son Doan, Bao-Khanh Ho Vo, and Nigel Collier: An analysis of Twitter messages in the 2011 Tohoku Earthquake, National Institute of Informatics).
The first Japanese tweets on the earthquake are as follows:
2011-03-11T05:48:08 "地震!" [Earthquake!]
2011-03-11T05:48:08 "地震だ〜縦揺れ!" [Earthquake ~ vertical shake!]
2011-03-11T05:48:14 "地震!!!!" [Earthquake!!!!]
First two English tweets sent from an iPhone:
2011-03-11T05:48:54 Huge earthquake in TK we are affected!
2011-03-11T05:49:01 BIG EARTHQUAKE!!!
2011-03-11T05:50:00 Massive quake in Tokyo
The first tweet about a tsunami was an eyewitness tweet 6 minutes after the earthquake occurred at its epicentre:
2011-03-11T05:52:23 "オレ、津波の様子見てくるわ!!!!" [I can see tsunami is coming!!!!]
The first concerns about nuclear plants followed right after the earthquake:
2011-03-11T09:50:49 "福島原発ヤバい状況らしい" [The Fukushima plant is in a really bad situation.]
Challenge: Post-hoc Analysis must be turned into Real-Time Analysis
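One way to picture the step from post-hoc to real-time analysis is a single pass over the incoming stream that records the first mention of each topic as it arrives. The sketch below replays some of the tweets quoted above; the keyword lists and the names `TOPIC_KEYWORDS` and `first_mentions` are assumptions for illustration, not the method used in the cited study.

```python
from datetime import datetime

# Hypothetical multilingual keyword lists (English and Japanese) per topic.
TOPIC_KEYWORDS = {
    "earthquake": ("earthquake", "quake", "地震"),
    "tsunami": ("tsunami", "津波"),
    "nuclear": ("nuclear", "原発"),
}

# A few of the tweets quoted on the slide, in arrival order.
TWEETS = [
    ("2011-03-11T05:48:08", "地震!"),
    ("2011-03-11T05:48:54", "Huge earthquake in TK we are affected!"),
    ("2011-03-11T05:50:00", "Massive quake in Tokyo"),
    ("2011-03-11T05:52:23", "オレ、津波の様子見てくるわ!!!!"),
    ("2011-03-11T09:50:49", "福島原発ヤバい状況らしい"),
]


def first_mentions(stream):
    """Scan tweets once, in arrival order, and record each topic's first mention."""
    first_seen = {}
    for timestamp, text in stream:
        lowered = text.lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            if topic not in first_seen and any(k in lowered for k in keywords):
                first_seen[topic] = datetime.fromisoformat(timestamp)
    return first_seen


if __name__ == "__main__":
    seen = first_mentions(TWEETS)
    start = seen["earthquake"]
    for topic, when in seen.items():
        delay = (when - start).total_seconds()
        print(f"{topic}: first mention {delay:.0f} s after the first earthquake tweet")
```

Because each tweet is inspected once and only a small dictionary is kept, the same loop could consume a live feed; the hard part in practice is robust multilingual matching, which simple substring tests do not provide.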
BioCaster: Early Alerting for Public Health Events
Detects seasonal influenza and hay fever.
born.nii.ac.jp
• Trend graphs
• Event maps
• Event alerts
• Ontology browsing
• Email/GeoRSS alerting
• Real-time Twitter analysis
• Up-to-date news in multiple languages
• Event database search
GHSAG partners: US, UK, FR, DE, WHO, IT, JP, CA
Alignment of European and German National Projects Dealing with BIG DATA
BIG DATA PPP Forum
PPPs
Service@Digital
Chair of Advisory Committee
European and German Software Platforms for BIG DATA Processing
Generic Enabler for BIG DATA: batch and online stream processing of BIG DATA
In-memory databases:
• BigMemory MAX: Real-time Access to 100s of TBs
• SAP HANA: BIG DATA Platform with up to 250 TBs
Open-source cluster/cloud computing framework for BIG DATA analytics
BIG DATA Generic Enabler of the FI-PPP
The key technical concept of the FI-PPP is the provisioning of Generic Service Platforms supported by reusable, standardized and commonly shared key technologies and components, termed "Generic Enablers", which can be applied by a multiplicity of "Smart Application" usage domains across multiple sectors.
[Figure: Generic Enablers from the FI-WARE Catalogue, such as the Complex Event Generic Enabler and the Cloud Generic Enabler, are assembled into a FI-WARE Instance for a Future Internet Smart Application]
BIG Data Generic Enabler
• Streaming and batch processing functionalities both in one single platform.
• Automatic deployment capabilities in a cloud-based cluster of nodes.
• Wide range of available data injectors.
• High speed access to the resulting insights via a NoSQL database.
FI-WARE Catalogue, FI-WARE Open Innovation Lab, FI-WARE Shared Trial Facility, Specific Use Case Trial Facility
The FI-PPP promotes and enables large-scale experimentation and validation of the platforms in real-life application contexts involving a range of actors across domains, including large companies, SMEs, the research community as well as public administrations and citizens.
The open platform approach further creates novel opportunities for entrepreneurship, new businesses and innovative value creation models based on cross-sector industrial partnerships.
BIG DATA Analytics for Financial Fraud Detection and Prevention
Use of the Terracotta Platform of DFKI Shareholder Software AG
• Mitigated 100s of millions of € in fraudulent credit card transactions
• Reduced fraud detection processing time from 45 minutes to less than 4 seconds
• NLP Analysis of Sales Items
• 99.999% completed transactions with 4,000 fraud detection rules checked
• Reduced fraud processing time from 800 ms
New Business Models of DHL Exploiting BIG DATA Collected by their Employees during the Delivery of Parcels
GEOVISTA: BIG DATA TOOL for estimating earnings opportunities and analyzing business potential.
• Prepare a realistic sales forecast.
• Evaluate a desired location by using high-quality geodata and NLP CRM reports provided by the subsidiary Deutsche Post Direct.
• Local competitors are analyzed with the aid of up-to-date data provided by beDirect.
• Business-location factors and the area being studied are visualized on a digital map.
Trento EIT ICT Labs: BIG DATA Analytics for Intelligent Urban Management
Trentino Open Living Data (TOLD)
DFKI Is a Founding Core Partner of EIT ICT Labs and Has a Strong Collaboration with the Trento Node
Living Big Data: The Trentino Territory as a BIG DATA Lab
• Telecommunication
• Energy
EIT ICT Labs Business Framework for BIG DATA at Trento Rise
DFKI's Social Media BIG DATA Analytics App for Crowd Management
• London Lord Mayor's Show, Olympics 2012
Conclusions
1. Big data technologies are an innovation motor for industry, science and government. Real-time multilingual natural language analysis and generation as well as translation technologies are a key enabler for big data analytics.
2. Key research challenges are smart tools for real-time analytics and decision support based on intelligent information extraction from unstructured data.
3. Europe's, and in particular Germany's, strengths in big data technology are commercial in-memory computing platforms like HANA or Terracotta, open-source platforms like FI-WARE GEs and Stratosphere, and multilingual language technologies.
4. In Germany, the focus is on big data applications for Industry 4.0, smart grids, advanced mobility and personalized medicine.