Searching and Finding information in
Unstructured and Structured Data Sources
Data Search
Erik Fransen
Senior Business Consultant
Centennium BI expertisehuis
The Hague, The Netherlands
e.fransen@centennium.nl
11.00-12.00 P.M. November, 3 IRM UK, DW/BI 2009, London
Agenda
• Introduction;
• Industry models;
• Combining structured & unstructured data
– “Pure Portal”
– “Index it all”
– “Structure it all”
Profile
• Erik Fransen
• Background: Knowledge Engineering,
Middlesex University;
• Expertise areas:
– Business Intelligence
– Knowledge engineering
– Knowledge & Content management
– Data warehousing
– Analytics
• CBIP.
Combining BI with unstructured data
• Integrated access to relevant information (‘provide complete picture’);
• Unstructured data like documents provide valuable context to numerical data;
– Customer complaints
– Competitors’ press releases
– Marketing documents
– …
• Insurance fraud analysis (e.g. claim statistics and claim forms);
• Competitive Intelligence (e.g. market share data and competitor news);
• Customer retention (e.g. sales data and customer complaints);
• Data Search acts as a bridge between structured and unstructured data.
[Figure: timeline of information milestones: cave paintings and bone tools (40,000 BC), writing (3500 BC), paper (105), printing (1450), electricity and telephone (1870), transistor (1947), computing (1950), Internet/DARPA (late 1960s), the Web (1993); database milestones SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03; data volume in gigabytes for 1999, 2001, 2005 and 2009, of which > 80% is unstructured]
(un)structured data keeps growing….
Industry Model:
Bill Inmon’s DW 2.0™
• Hold data at the lowest detail;
• Hold data to infinity;
• Have integrity of data and have online high-performance transaction processing;
• Tightly couple metadata to the data warehouse environment;
• …
• Link structured data and unstructured data;
Text Data
Industry Model:
Enterprise Search Platform (Forrester)
Data Search Scenarios
Scenario 1: Pure Portal
Many portlets, one user interface;
Business user manually combines content
from several independent sources;
Risk: too complex for user.
Integrate news with BI information
Source: Aruba
… and Photos, Files and Maps
Scenario 2: “Index it all”
Enterprise Search from one user interface;
Business user knows what to look for and expects
a “complete picture” as a result;
Scenario 2: “Index it all”
[Diagram: unstructured data sources feed a search index; structured data sources feed the data warehouse architecture and reports]
Example: IBM Cognos 8 Go! Search
Integration with enterprise search applications (IBM OmniFind, Google OneBox for Enterprise, Yahoo, Autonomy)
Search results return all relevant structured content (reports, analyses, etc.) and unstructured content (Word documents, PDFs, etc.) within a single interface.
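The "index it all" idea can be illustrated with a small sketch: one inverted index covers both report descriptions (structured side) and document text (unstructured side), so a single query returns both kinds of content. This is only an illustration under stated assumptions, not how Cognos Go! Search or OmniFind work internally; the sample items, ids and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample content: report descriptions (structured side)
# and document text (unstructured side).
ITEMS = [
    {"id": "report:claims-by-region", "kind": "structured",
     "text": "Claim statistics by region and quarter"},
    {"id": "doc:claim-form-1042", "kind": "unstructured",
     "text": "Claim form: water damage reported in the kitchen"},
    {"id": "doc:press-release", "kind": "unstructured",
     "text": "Competitor announces new insurance product"},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(items):
    """Map each term to the set of item ids that contain it."""
    index = defaultdict(set)
    for item in items:
        for token in tokenize(item["text"]):
            index[token].add(item["id"])
    return index

def search(index, query):
    """Return ids of items containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

INDEX = build_index(ITEMS)
print(search(INDEX, "claim"))
```

A query for "claim" returns both the report and the claim form, which is the "complete picture" the business user expects from one search box.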
Example: IBM OmniFind
Example: SAP BusinessObjects Intelligent Search
Scenario 3: “Structure it all”
Generate structure using document warehousing
and text mining;
Identify Sources
– Sources are not fixed
– Iterative process, sources lead to new sources
Retrieve Documents
– Internal source retrieval: file servers, CMS/DMS
– External source retrieval, using crawlers, spiders
Preprocess Documents
– Format documents in a consistent manner
– Files must be in suitable form for text analysis
Text Mining
– Linguistic analysis
– Key features are extracted
– Indexing documents
– Summarizing documents
Compile Metadata
– Carefully attach metadata to document
– Used for querying, matching, navigation support
– Store in document warehouse
Source: Dan Sullivan
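The document warehousing steps on this slide (preprocess, text mining, compile metadata) can be sketched in a few lines. The sketch below is a rough illustration only: the frequency-based keyword extraction stands in for real linguistic analysis, and the stopword list, field names and sample complaint are assumptions, not part of the original process description.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a linguistic resource).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "was", "for"}

def preprocess(raw):
    """Bring a document into a consistent form: drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip()

def extract_key_features(text, top_n=3):
    """Crude stand-in for text mining: the most frequent non-stopword terms."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return [term for term, _ in Counter(tokens).most_common(top_n)]

def compile_metadata(doc_id, source, text):
    """Attach metadata to the document before storing it in the document warehouse."""
    return {
        "doc_id": doc_id,
        "source": source,
        "summary": text[:60],  # naive summary: the first 60 characters
        "keywords": extract_key_features(text),
    }

# Hypothetical customer complaint retrieved from a file server.
raw = "<p>Delivery was late.   The delivery arrived damaged and the invoice was wrong.</p>"
record = compile_metadata("complaint-001", "file-server", preprocess(raw))
print(record)
```

In a real document warehouse the keyword step would be replaced by a text mining tool; the point here is only that each document ends up as a record of metadata that can be queried, matched and navigated.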
Generating structure in document warehouse
Document warehouse
• Contains complete documents or URLs
• Metadata about documents: summaries, authors’ names, publication dates, titles, sources, keywords, etc.
• Translations of documents
• Thematic clustering of similar documents
• Topical or thematic indexes
• Extracted key features (structure)
• Dimensions and Facts, linked to documents, summaries etc.
Combine with the data warehouse
[Diagram: document warehouse architecture; BI reporting on a dimensional model with Sales Facts and Call Facts]
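Once text mining has produced facts that share dimensions with the data warehouse, both can be reported on together. The sketch below uses an in-memory SQLite database to show the idea: sales facts and call facts (derived from documents) hang off the same customer dimension. All table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (customer_id INTEGER, amount REAL);
-- Call facts come from text mining; doc_id links back to the document warehouse.
CREATE TABLE fact_call (customer_id INTEGER, doc_id TEXT, topic TEXT);

INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO fact_sales VALUES (1, 1200.0), (1, 300.0), (2, 800.0);
INSERT INTO fact_call VALUES
    (1, 'call-17', 'late delivery'),
    (1, 'call-23', 'wrong invoice');
""")

# Aggregate each fact table separately per customer; joining both fact
# tables directly would multiply rows and inflate the sales total.
QUERY = """
SELECT c.name,
       (SELECT COALESCE(SUM(s.amount), 0)
          FROM fact_sales s WHERE s.customer_id = c.customer_id) AS total_sales,
       (SELECT COUNT(*)
          FROM fact_call k WHERE k.customer_id = c.customer_id) AS complaint_calls
FROM dim_customer c
ORDER BY c.name
"""
rows = conn.execute(QUERY).fetchall()
for name, total_sales, complaint_calls in rows:
    print(name, total_sales, complaint_calls)
```

Because each call fact keeps its `doc_id`, a report user could drill from the complaint count back to the underlying documents or summaries.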
Generate structure using text mining tools
Example taken from SPSS PASW Text Analytics; many other tools are available from IBM, SAS, Oracle, SAP BO, Microsoft, etc.
Generating structure using UIMA
• Unstructured Information Management Architecture
• Originates from IBM, now Apache UIMA
Source: IBM
http://incubator.apache.org/uima/
• Analyzed by a collection of text analytics
• Detected Semantic Entities and Relations Highlighted
• Represented in UIMA Common Analysis Structure (CAS)
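The idea behind the CAS can be illustrated without the UIMA framework itself: the original text is kept unchanged, and each analysis step adds typed annotations that point into it by begin and end offsets. The sketch below is plain Python and does not use the Apache UIMA API; the class names, the two toy annotators and the sample sentence are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A typed span over the document text, identified by character offsets."""
    type: str
    begin: int
    end: int

@dataclass
class SimpleCas:
    """Simplified stand-in for a CAS: the text plus the annotations over it."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]

def org_annotator(cas, known_orgs=("Acme", "Globex")):
    """Toy entity detector: marks names from a fixed, hypothetical list."""
    for org in known_orgs:
        for m in re.finditer(re.escape(org), cas.text):
            cas.annotations.append(Annotation("Organization", m.start(), m.end()))

def money_annotator(cas):
    """Toy entity detector: marks dollar amounts."""
    for m in re.finditer(r"\$\d+(?:\.\d+)?", cas.text):
        cas.annotations.append(Annotation("Money", m.start(), m.end()))

# The same structure is passed through a collection of annotators in turn.
cas = SimpleCas("Acme paid $250 to Globex.")
for annotator in (org_annotator, money_annotator):
    annotator(cas)

found = [(a.type, cas.covered_text(a)) for a in cas.annotations]
print(found)
```

Each annotator only adds to the shared structure, which is what lets independently developed text analytics be chained over the same document.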