
Data Search. Searching and Finding information in Unstructured and Structured Data Sources


(1)
(2)

Searching and Finding information in Unstructured and Structured Data Sources

Data Search

Erik Fransen

Senior Business Consultant
Centennium BI expertisehuis
The Hague, The Netherlands
e.fransen@centennium.nl

11.00-12.00 P.M. November, 3 IRM UK, DW/BI 2009, London

(3)

Agenda

• Introduction;

• Industry models;

• Combining structured & unstructured data

– “Pure Portal”

– “Index it all”

– “Structure it all”

(4)

Profile

• Erik Fransen

• Background: Knowledge Engineering, Middlesex University;

• Expertise areas:

– Business Intelligence

– Knowledge engineering

– Knowledge & Content management

– Data warehousing

– Analytics

• CBIP.

(5)
(6)

Combining BI with unstructured data

• Integrated access to relevant information (‘provide complete picture’);

• Unstructured data like documents provide valuable context to numerical data;

– Customer complaints

– Competitor’s press releases

– Marketing documents

– …

• Insurance fraud analysis (i.e. claim statistics and claim forms);

• Competitive Intelligence (i.e. market share data and competitor news);

• Customer retention (i.e. sales data and customer complaints);

• Data Search acts as a bridge between structured and unstructured data.

(7)

(un)structured data keeps growing….

Figure: data volume in gigabytes for 1999, 2001, 2005 and 2009, with the label "> 80% unstructured", set against a timeline of information technologies: cave paintings and bone tools (40,000 BC), writing (3500 BC), paper (105), printing (1450), electricity and telephone (1870), transistor (1947), computing (1950), Internet (DARPA, late 1960s), the Web (1993); SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03.

(8)

Industry Model:

Bill Inmon’s DW 2.0™

• Hold data at the lowest detail;

• Hold data to infinity;

• Have integrity of data and have online high-performance transaction processing;

• Tightly couple metadata to the data warehouse environment;

• …

• Link structured data and unstructured data;

Text Data

(9)

Industry Model:

(10)

Industry Model:

Enterprise Search Platform (Forrester)

(11)

Data Search Scenarios

Searching and Finding information in Unstructured and Structured Data Sources

(12)
(13)
(14)

Scenario 1: Pure Portal

Many portlets, one user interface;

Business user manually combines content from several independent sources;

Risk: too complex for user.
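The "Pure Portal" idea can be illustrated with a minimal Python sketch. It is not taken from the presentation: the portlet functions, their sample content and the `render_portal` helper are all hypothetical. Each portlet queries one independent source, and the portal only places the results side by side, so combining them is left to the user.

```python
from typing import Callable, Dict, List

# Hypothetical portlets: each one queries a single, independent source.
def sales_report_portlet(query: str) -> List[str]:
    reports = ["Q3 sales by region", "Q3 sales by product"]
    return [r for r in reports if query.lower() in r.lower()]

def news_portlet(query: str) -> List[str]:
    news = ["Competitor opens new sales office", "Weather update"]
    return [n for n in news if query.lower() in n.lower()]

def render_portal(portlets: Dict[str, Callable[[str], List[str]]],
                  query: str) -> Dict[str, List[str]]:
    # One user interface, but every portlet keeps its own result box;
    # nothing is merged or ranked across sources.
    return {name: portlet(query) for name, portlet in portlets.items()}

page = render_portal(
    {"BI reports": sales_report_portlet, "News": news_portlet},
    "sales",
)
for name, hits in page.items():
    print(name, hits)
```

The result is a dictionary with one entry per portlet, which mirrors the risk named on the slide: the user still has to relate the report titles to the news items by hand.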

(15)
(16)

Integrate news with BI information

Source: Aruba

(17)
(18)

… and Photos, Files and Maps

(19)

Scenario 2: “Index it all”

Enterprise Search from one user interface;

Business user knows what to look for and expects

a “complete picture” as a result;

(20)
(21)

Scenario 2: “Index it all”

Diagram labels: Unstructured data sources; Structured data sources; Data warehouse Architecture; Reports; Search index.
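As a rough illustration of "Index it all", the sketch below builds one inverted index over both report descriptions and free-text documents, so that a single query returns hits from both worlds. The sample items, identifiers and function names are assumptions made for this example; commercial enterprise search products work at a very different scale and add ranking, security and connectors.

```python
import re
from collections import defaultdict

# Hypothetical content: report descriptions from the BI side
# and free-text documents from unstructured sources.
items = [
    {"id": "report:1", "kind": "structured",
     "text": "Claim statistics per region 2009"},
    {"id": "doc:1", "kind": "unstructured",
     "text": "Claim form submitted after water damage"},
    {"id": "doc:2", "kind": "unstructured",
     "text": "Press release about a new product"},
]

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(items):
    # Map every token to the set of item ids that contain it.
    index = defaultdict(set)
    for item in items:
        for token in tokenize(item["text"]):
            index[token].add(item["id"])
    return index

def search(index, query):
    # AND semantics: an item must contain every query token.
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)

index = build_index(items)
results = search(index, "claim")
print(results)
```

A query for "claim" matches both the report and the claim form, which is the "complete picture" the scenario aims for.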

(22)

Example: IBM Cognos 8 Go! Search

Integration with enterprise search applications (IBM OmniFind, Google OneBox for Enterprise, Yahoo, Autonomy)

Search results return all relevant structured content (reports, analyses, etc.) and unstructured content (Word documents, PDFs, etc.) within a single interface.

(23)
(24)

Example: IBM OmniFind

(25)
(26)


SAP BusinessObjects Intelligent Search

(27)

Scenario 3: “Structure it all”

Generate structure using document warehousing

and text mining;

(28)
(29)

Generating structure in document warehouse

• Identify Sources

– Sources are not fixed

– Iterative process, sources lead to new sources

• Retrieve Documents

– Internal sources retrieval: file servers, CMS/DMS

– External source retrieval, using crawlers, spiders

• Preprocess Documents

– Format documents in a consistent manner

– Files must be in suitable form for text analysis

• Text Mining

– Linguistic analysis

– Key features are extracted

– Indexing documents

– Summarizing documents

• Compile Metadata

– Carefully attach metadata to document

– Used for querying, matching, navigation support

– Store in document warehouse

Source: Dan Sullivan
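The steps on this slide (retrieve, preprocess, text mining, compile metadata, store) can be sketched as a small Python pipeline. Everything here is a simplified assumption for illustration: the in-memory `SOURCES` dictionary stands in for real file servers or crawlers, and keyword counting with a first-sentence summary stands in for real linguistic analysis.

```python
import re
from collections import Counter

# Hypothetical in-memory "sources"; a real retrieval step would read
# file servers, a CMS/DMS, or crawl external sites.
SOURCES = {
    "fileserver/complaint_17.txt":
        "  The delivery was LATE.\nThe delivery arrived damaged. ",
}
STOPWORDS = {"the", "was", "a", "an", "and"}

def retrieve(source_ids):
    # Retrieve Documents: fetch the raw text for each identified source.
    return {sid: SOURCES[sid] for sid in source_ids}

def preprocess(raw):
    # Preprocess Documents: bring text into one consistent form
    # (here: collapse all whitespace runs into single spaces).
    return " ".join(raw.split())

def mine(text, top_n=3):
    # Text Mining (toy version): frequent non-stopwords as key features,
    # first sentence as the summary.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    keywords = [w for w, _ in Counter(words).most_common(top_n)]
    summary = text.split(".")[0].strip() + "."
    return keywords, summary

def compile_metadata(source_id, text):
    # Compile Metadata: attach extracted metadata to the document.
    keywords, summary = mine(text)
    return {"source": source_id, "keywords": keywords,
            "summary": summary, "text": text}

# Store in document warehouse (here: a plain list of records).
warehouse = [
    compile_metadata(sid, preprocess(raw))
    for sid, raw in retrieve(list(SOURCES)).items()
]
print(warehouse[0]["keywords"], warehouse[0]["summary"])
```

Because the slide describes source identification as iterative, a fuller version would feed newly discovered sources back into `retrieve`; that loop is omitted here.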

(30)

Document warehouse

• Contains complete documents or URLs

• Metadata about documents: summaries, authors’ names, publication dates, titles, sources, keywords, etc.

• Translations of documents

• Thematic clustering of similar documents

• Topical or thematic indexes

• Extracted key features (structure)

– Dimensions and Facts, linked to documents, summaries etc.

– Combine with the data warehouse

Document warehouse Architecture
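To make "Dimensions and Facts, linked to documents" more concrete, here is a minimal Python sketch that combines facts extracted from documents with sales facts from the data warehouse on a shared customer key. The fact layouts, customer ids and document ids are invented for this example and are not part of the presentation.

```python
from collections import defaultdict

# Hypothetical sales facts from the data warehouse: (customer_id, amount).
sales_facts = [("C1", 1200.0), ("C2", 300.0), ("C1", 800.0)]

# Hypothetical facts extracted from complaint documents by text mining;
# each fact keeps a link back to its source document.
complaint_facts = [
    {"customer_id": "C1", "topic": "late delivery", "doc_id": "doc:17"},
    {"customer_id": "C1", "topic": "damaged goods", "doc_id": "doc:21"},
]

def combine(sales_facts, complaint_facts):
    # Aggregate both fact sets per customer (the shared dimension).
    summary = defaultdict(
        lambda: {"revenue": 0.0, "complaints": 0, "documents": []}
    )
    for customer_id, amount in sales_facts:
        summary[customer_id]["revenue"] += amount
    for fact in complaint_facts:
        row = summary[fact["customer_id"]]
        row["complaints"] += 1
        row["documents"].append(fact["doc_id"])
    return dict(summary)

report = combine(sales_facts, complaint_facts)
for customer_id, row in report.items():
    print(customer_id, row)
```

Keeping the `doc_id` on each extracted fact is what lets a report user drill from a number, such as a complaint count, back to the underlying documents.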

(31)

BI reporting on dimensional model

Diagram labels: Sales Facts; Call Facts.

(32)

Generate structure using text mining tools

Example taken from SPSS PASW Text Analytics; many other tools are available (IBM, SAS, Oracle, SAP BO, Microsoft, etc.).

(33)

Generating structure using UIMA

• Unstructured Information Management Architecture

• Originates from IBM, now Apache UIMA

Source: IBM

http://incubator.apache.org/uima/

(34)

Example: Generating structure using UIMA

• Analyzed by a collection of text analytics

• Detected Semantic Entities and Relations Highlighted

• Represented in UIMA Common Analysis Structure (CAS)
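The idea behind this example can be approximated in a few lines of Python. This is not the Apache UIMA API: `SimpleCas`, `Annotation` and the two annotator functions are hypothetical stand-ins, and the sample sentence is made up. The sketch only shows the general pattern of a shared analysis structure that holds the text plus typed annotations with begin and end offsets, filled in by a sequence of annotators.

```python
import re
from dataclasses import dataclass, field
from typing import List

@dataclass
class Annotation:
    type: str   # e.g. "Company" or "Year"
    begin: int  # start offset in the text
    end: int    # end offset (exclusive)

@dataclass
class SimpleCas:
    # Simplified stand-in for a common analysis structure:
    # the document text plus all annotations made on it.
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]

def company_annotator(cas: SimpleCas) -> None:
    # Toy dictionary-based entity detection.
    for m in re.finditer(r"\b(IBM|SAP|Oracle)\b", cas.text):
        cas.annotations.append(Annotation("Company", m.start(), m.end()))

def year_annotator(cas: SimpleCas) -> None:
    # Toy pattern-based detection of four-digit years.
    for m in re.finditer(r"\b(?:19|20)\d{2}\b", cas.text):
        cas.annotations.append(Annotation("Year", m.start(), m.end()))

cas = SimpleCas("Oracle and SAP were mentioned in a 2009 press release.")
for annotator in (company_annotator, year_annotator):
    annotator(cas)

found = [(a.type, cas.covered_text(a)) for a in cas.annotations]
print(found)
```

Because annotations refer to the text by offsets instead of copying it, several independent analytics can add their results to the same structure without interfering with each other.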

(35)

Summary

• Growing business need for combining BI with

unstructured data;

• Data Search bridges the gap between both

worlds

– Scenario 1: “Pure Portal”

– Scenario 2: “Index it all”

– Scenario 3: “Structure it all”

• Scenarios can be combined.
