EC Wise Report:
Unlocking the Value of Deeply Unstructured Data
By Frank Teklitz
The Challenge: Gaining Knowledge from Deeply Unstructured Data
Many industries live, breathe and depend on deeply unstructured data: business agreements and warranties; scientific papers and business and economic research; engineering and IT documents; customer social conversations; and corporate and medical communications.
Getting clean, clear, accurate and meaningful data from deeply unstructured data is a significant challenge. Here are examples that illustrate the point:
A faulty water pipe cracks and causes significant flooding. The work history for building, delivering, installing and maintaining this pipeline lives in contracts, work orders, field notes and documentation, spread across many files and document stores. The research department needs to see all documents for work done on this water pipe over the last 20 years. Unfortunately, these data stores are deeply ambiguous: many words, phrases and statements have unclear or vague meanings, often with many spellings, many vocabularies and even multiple languages. There is no reliable programmatic way to query the documents.
A surge of warranty work occurs in Atlanta for a defective battery. The warranty, manufacturing and legal departments need to analyze all warranties, supporting work orders, contractor agreements, and supporting documents pertaining to the defective battery. Unfortunately these unstructured document stores do not lend themselves to data search techniques because of their deeply unstructured nature and high levels of ambiguity.
Feedback from the Market:
“Forest Rim enables significant improvements in the quality of semantic information derived from text data. This is critical to delivering more accurate understandings of social, business and scientific information”
-- Jack Hakim, CEO of EC Wise
Key Capabilities:
Forest Rim unlocks the value of Deeply Unstructured Data - business contracts and warranties; scientific research, business and economic analysis; engineering and IT documents; customer social conversations; corporate communications
Forest Rim is business oriented - delivers a new generation of knowledge management capabilities that work in conjunction with the enterprise data warehouse to deliver a new generation of analysis capabilities
Medical records, doctors' notes and patient charts, while often captured electronically, have limited analytical and research value because of conflicting medical vocabularies, structures and language embedded in the data itself.
Insurance claims data is deeply unstructured with each claims statement, customer interaction and remedy statement being a unique, individual document produced by an individual agent (or multiple agents) with individual understandings, vocabularies, and background levels. Each individual document source has properties and characteristics that must be understood and managed.
What is Deeply Unstructured Data?
Deeply unstructured data has four components:
1. The data is significantly ambiguous
2. The data has little or no structure
3. The data is highly valuable
4. The data and data interactions are complex
The data is significantly ambiguous
In each of the examples above the data sources have many words and statements that mean the same thing; with many spellings; and many vocabularies. Let’s present a basic example to illustrate the point: Look at the statements below:
I love my mustang
My mustang is brown
My mustang is fast
Is the “mustang” a car or a horse or a toy or something entirely different? Well, you don’t know without contextual tags and pointers. The same is true for a computer doing textual analysis – if there is no context in the data text, there is very limited value that can be obtained from this very valuable data. And this includes big data.
Forest Rim makes this data valuable by contextualizing it with proper tagging and pointers using powerful semantic transformation tools. A typical result from Forest Rim would look like the following:
I love my mustang (horse, taxonomy.horses.mustang)
My mustang is brown (toy, glossary.toys.mustang)
My mustang is fast (car, ontology.cars.mustang)
This is a simple example, but it makes the point: if there is no context, there is only limited value in the text. Every industry has many vocabularies to manage (sometimes called syntaxes, taxonomies or ontologies), especially healthcare, insurance, manufacturing and software.
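As a rough illustration of what contextual tagging involves, the sketch below picks a sense for "mustang" by counting cue words in the surrounding text. It is not Forest Rim's implementation: the `SENSES` table, its cue words and the `tag_term` function are all hypothetical, and only the tag strings are taken from the example above.

```python
import re

# Hypothetical mini-taxonomies: each tag (format borrowed from the example
# above) maps to invented cue words that suggest that sense of "mustang".
SENSES = {
    "horse, taxonomy.horses.mustang": {"ride", "saddle", "stable", "gallop", "hay"},
    "toy, glossary.toys.mustang": {"plastic", "toybox", "batteries", "playroom"},
    "car, ontology.cars.mustang": {"drive", "engine", "garage", "horsepower"},
}

def tag_term(sentence, context, term="mustang"):
    """Append a sense tag for `term`, chosen from cue words in `context`."""
    if term not in sentence.lower():
        return sentence
    words = set(re.findall(r"[a-z]+", context.lower()))
    # Score each sense by how many of its cue words appear in the context.
    scores = {tag: len(cues & words) for tag, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # No cue word matched: without context the term stays ambiguous.
        return f"{sentence} (unresolved)"
    return f"{sentence} ({best})"

print(tag_term("My mustang is fast", "I drive it out of the garage every morning."))
print(tag_term("I love my mustang", ""))
```

The second call shows the point made above: with no surrounding context, the sentence cannot be tagged at all.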
The data has little or no structure
Deeply unstructured data has little or no structure. For example, even a simple contract will have many dates embedded within it: Contract Date, Signing Date, Start Date, Launch Date, End Date, Close Date, Expiration Date, Termination Date, Completion Date, Warranty Date, Insurance Date, etc. Warranty, insurance and financial contracts can have hundreds of dates embedded anywhere in the document. Today, even asking a basic question of deeply unstructured data, like “show all contracts that expire this month”, is a daunting task, typically beyond the capabilities of document management and big data solutions.
Using Forest Rim’s Knowledge ETL capabilities, corporations can quickly index their deeply unstructured data and store the resulting indexes in a relational database. These knowledge indexes are easily queried with conventional business intelligence tools. The knowledge indexes can also be stored as part of the corporate data warehouse to enable unstructured and structured data to be queried and analyzed together. The ability to leverage unstructured and structured data together is what Bill Inmon refers to as DW 2.0.
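To make the idea of a relational knowledge index concrete, here is a minimal sketch that pulls labeled dates out of contract text, stores them in a table, and answers the "expire this month" question with plain SQL. It assumes dates appear as `<Label> Date: YYYY-MM-DD`, which real contracts rarely do; the `date_index` table, the function names and the sample documents are hypothetical and do not describe Forest Rim's actual schema.

```python
import re
import sqlite3
from datetime import date

# Assumed pattern: "<Label> Date: YYYY-MM-DD". Real contracts vary far more.
DATE_RE = re.compile(r"(\w+) Date:\s*(\d{4}-\d{2}-\d{2})")

def build_index(docs):
    """Store every labeled date found in each document in a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE date_index (doc_id TEXT, label TEXT, value TEXT)")
    for doc_id, text in docs.items():
        for label, value in DATE_RE.findall(text):
            conn.execute("INSERT INTO date_index VALUES (?, ?, ?)",
                         (doc_id, label.lower(), value))
    return conn

def expiring_in_month(conn, today):
    """Return ids of documents whose Expiration Date falls in today's month."""
    month = today.strftime("%Y-%m")
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM date_index "
        "WHERE label = 'expiration' AND substr(value, 1, 7) = ? "
        "ORDER BY doc_id", (month,))
    return [r[0] for r in rows]

# Hypothetical sample documents.
docs = {
    "contract-A": "Signing Date: 2020-01-15. Start Date: 2020-02-01. "
                  "Expiration Date: 2024-06-30.",
    "contract-B": "Start Date: 2021-03-01 ... Expiration Date: 2025-03-01",
}
conn = build_index(docs)
print(expiring_in_month(conn, date(2024, 6, 10)))
```

Once the dates sit in a table, the question becomes an ordinary query that any reporting tool can issue; the hard part in practice is the extraction step that this sketch reduces to a single regular expression.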
The data is highly valuable
Typically, deeply unstructured data is also very valuable. In the Utilities, Energy, Transportation, Pipeline, Manufacturing and Commodity industries, where business is driven by delivery, development, maintenance and pricing contracts; scientific and engineering journals and specifications; and supporting documentation of results, the ability to leverage this highly unstructured data is critical. Unfortunately, this data, while easy to store, is almost impossible to tag, categorize, query or analyze, which significantly undermines its value.
By being able to effectively tag, index and analyze deeply unstructured data as easily as structured data, Forest Rim exponentially increases the value of deeply unstructured data to the enterprise.
The data and data interactions are complex
Deeply unstructured data and the underlying data interactions of deeply unstructured data are complex. Corporations are veritable Towers of Babel. Every division, and every line of business within a division, is often siloed, each with its own vocabularies, inside speak, business practices and, of course, politics. Many financial research journals have cited the siloing of information between and within divisions in the banking industry as a major cause of the crash of 2008.
Forest Rim has been called a “silo cracker.” The Forest Rim Knowledge ETL engine leverages syntaxes and taxonomies to quickly build living knowledge bases of the enterprise. These knowledge bases are living because Forest Rim enables a knowledge infrastructure that constantly scans, manages and evaluates all information silos – both technical and business – to build sharable and consumable semantic models of the enterprise. Forest Rim not only provides a tool for knowledge management, but an architecture for making knowledge a living entity within the enterprise. In this way Forest Rim cracks silos before they can even form.
The Solution
Forest Rim is a tool for business users. Unlike other unstructured data management and analysis technologies, which are typically technology driven, Knowledge ETL has a business-oriented interface that delivers a new generation of data warehouse capabilities for structured and unstructured data, or Data Warehouse 2.0.
Forest Rim Knowledge ETL is integration and transformation technology for unstructured text data, again including big data. It is NOT a search, text query or data mining technology. Unlike search technologies, Knowledge ETL makes the assumption that taxonomical, ontological and semantic transformations must be performed on the unstructured text to increase its usefulness and value. Forest Rim empowers text with context.
Forest Rim Knowledge ETL extends legacy ETL into the unstructured world. Traditional ETL integrates structured legacy data into the data warehouse, while textual ETL integrates unstructured text and context into the data warehouse. Turning text into context is VERY different from performing traditional ETL.
Most importantly Knowledge ETL is an enabling technology. Knowledge ETL enables the building of analytical applications that leverage both text and context for analysis.
The Value
Unlock the value of unstructured data - emails, medical records, contracts, warranties, reports, call centers, and so forth. Most estimates show that 80% of the data in the corporation is in the form of unstructured text, not numbers.
Evaluate contract performance - for warranty contracts, determine the exposure to failure in the product or any part of a product; for subcontractor and vendor contracts, evaluate contractor performance versus the contract; determine contractual relationships across and between contracts of similar vocabularies, ontologies and taxonomies; understand the value, types and volume of contracts
Determine warranty performance – determine volume and causes of “warranty storms” and patterns; understand patterns and characteristics of product failures over the long-term; understand categories of customers that are submitting warranties; evaluate performance of warranty work
Improve medical healthcare analysis - determine correlations between different medical diseases and between different medical conditions; determine common patterns of medical conditions, even though different doctors call the same thing by different names (for example, when the condition “goiter” appears, does it appear in conjunction with “hypertension”?); evaluate the context of patient occurrences, events and outcomes for things like “smoking”; determine how many patients can be generally described as “healthy” and why; determine all textual references for people who are “overweight”
Enhance insurance claims processing - determine how often a condition appears on a claim and why; determine how frequently the same product or class of products appears on claims; analyze the types of claims that appear the most frequently and for what reasons; evaluate the causes for claims, for example, when there is a claim for “broken lamplights”, is there also mention of “recreational weapons”?
Improve scientific research – scan scientific documents for information on specific chemicals, molecules, substance characteristics (like melting points); find references between subject areas, for example, show all the places where “carbon” is discussed in conjunction with “benzene”; find the naturally occurring correlations that occur in scientific documents; rationalize the terminology of documents so that the research can be analyzed in a consistent manner
Better leverage big data, document and text management systems – the tools that exist for doing analysis in the Big Data environment are crude and still evolving. Forest Rim enables proven, contemporary business intelligence tools for unstructured data analysis to determine what valuable information is in your document files; integrate your big data with the data warehouse, remove “blather” and stop words
Unlocks the value of proven, contemporary business intelligence tools for unstructured data - today, textual data just doesn’t lend itself to easy and facile analysis using contemporary business intelligence tools which are almost 100% dedicated to handling well-structured numeric data. Forest Rim unlocks the value of business intelligence tools by enabling unstructured data for the data warehouse
Unlocking the value of data scientists – data scientists focus on doing experiments against big data stores. Because there are a limited number of data scientists, they become the bottleneck for producing analytical results. With Forest Rim, data scientists have tools to purify and cleanse big data stores for consumption by all business analysts in the enterprise.
Improve email/call center management – understand volume of unhappy customers and why; analyze recurring topics or product associations; understand volume of customer orders and why customers prefer one product or service over another; understand promotional responsiveness
Optimize social marketing and commerce – determine social relationships and interactions in purchasing behavior, online activity, and social networks - including clickstreams and social media messages - for behavioral analysis, influencer marketing, “virality” analysis, crowd sourcing, and similar applications to drive sales growth
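Several of the questions in the list above ("goiter" with "hypertension", "carbon" with "benzene", "broken lamplights" with "recreational weapons") are co-occurrence questions. The sketch below shows one simple way to count them, after dropping stop words, at document level. The `STOP_WORDS` list, the `cooccurrence` function and the sample sentences are illustrative assumptions, not part of the Forest Rim product.

```python
import re
from collections import Counter
from itertools import combinations

# A tiny illustrative stop-word list; production lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "is", "with", "was", "to"}

def cooccurrence(docs, terms):
    """Count, for each pair of terms, the documents that mention both."""
    counts = Counter()
    for text in docs:
        words = set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS
        # Sort so each pair is always stored in the same order.
        present = sorted(t for t in terms if t in words)
        counts.update(combinations(present, 2))
    return counts

# Hypothetical sample documents.
docs = [
    "Benzene is a ring of six carbon atoms.",
    "The melting point of benzene is 5.5 C.",
    "Carbon bonds with hydrogen in benzene and in methane.",
]
pairs = cooccurrence(docs, {"carbon", "benzene", "methane"})
print(pairs[("benzene", "carbon")])
```

This counts exact word matches only; it would miss alternate spellings and synonyms, which is the gap that terminology rationalization is meant to close.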
Forest Rim Applications and Major Features Overview
Forest Rim Knowledge ETL Process
The Forest Rim Knowledge ETL process enables the input of legacy knowledge and taxonomies that are consumed by the Forest Rim Knowledge ETL Engine. The result from the Knowledge ETL process is structured, semantic, business knowledge that is stored in the Forest Rim Enterprise Knowledge Store, which in turn can be consumed by proprietary semantic models, BI applications and visualization tools. The Forest Rim Knowledge ETL process is specified below:
Legacy Knowledge Input
The input into Knowledge ETL is basically electronic text. This text can be in English, Spanish, German, French, Italian or Portuguese. Forest Rim Knowledge ETL can handle ANY form of legacy knowledge – business contracts and warranties; scientific research, business and economic analysis; engineering and IT documents; customer social conversations; corporate communications, legacy programs, applications, stored procedures, data definition, queries, reports, legacy metadata models, proprietary semantic models and documentation. Regarding documentation, Forest Rim Knowledge ETL can consume formal text, informal text, notes, shorthand, email, blogs, tweets, etc. The most common forms of electronic documentation are files that have the extension type of .txt, .doc, .docx, or .pdf.
Knowledge ETL
At the heart of knowledge transformation is Forest Rim Knowledge ETL. In knowledge transformation, legacy knowledge is ingested and transformed into a form that is suitable for a semantic model, XML knowledge store or relational data base.
In order to manage structured knowledge in a sophisticated manner, the legacy knowledge must be contextually transformed. For example:
Terminologies, rules and definitions from multiple sources must be analyzed, transformed and categorized to yield consistent knowledge, even though the original legacy knowledge is different from each source,
Alternate spellings and common misspellings must be accounted for,
Words need to be stemmed (antonyms, synonyms, homonyms) to their Latin or Greek roots, and so forth.
Forest Rim today has over 40 types of contextual knowledge transformations. Each of these transforms adds value, context and understanding that cannot be obtained from the legacy knowledge alone in programs, applications, stored procedures, data definitions, documents, etc.
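As a small illustration of one such transformation, the sketch below maps alternate spellings and misspellings to a canonical term so that text from different sources yields consistent tokens. The `CANONICAL` table and the `normalize` function are hypothetical; this is a sketch of the general idea, not one of Forest Rim's actual transforms.

```python
import re

# Hypothetical variant table: alternate spellings and misspellings mapped
# to one canonical term.
CANONICAL = {
    "warrantee": "warranty",
    "warrenty": "warranty",
    "guarantee": "warranty",  # treated as a synonym for this illustration only
    "colour": "color",
    "pipe-line": "pipeline",
}

def normalize(text):
    """Lowercase the text and replace each known variant with its canonical term."""
    # Tokens are runs of letters, optionally joined by single hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [CANONICAL.get(tok, tok) for tok in tokens]

print(normalize("Warrenty claim on the pipe-line"))
print(normalize("Guarantee for colour"))
```

After normalization, a query for "warranty" finds documents that originally said "warrenty" or "guarantee", which a literal text search would miss.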
Taxonomies and Syntaxes
Taxonomies and syntaxes are important inputs for most Knowledge ETL processing. Taxonomies and syntaxes are useful in resolving terminology, developing contextual classifications, and in filtering text (like documents, code and data definitions). Forest Rim Technology can operate with taxonomies that have been built by the client or Forest Rim. Forest Rim has access to over 29,000 professionally built and maintained taxonomies and syntaxes. In most cases it is simply a matter of selecting the 4 or 5 taxonomies and syntaxes that are the most relevant and installing them. This is done in a matter of minutes.
Forest Rim Enterprise Knowledge Store
The Forest Rim Enterprise Knowledge Store provides the shared environment for storing and managing business, analysis, security, data, workflow, quality and governance rules / knowledge definitions.
The Forest Rim Enterprise Knowledge Store can also be consumed by business level semantic models and XML knowledge stores, and the business and operating applications that consume them (read the EC Wise paper “Harvesting of Valuable Knowledge from Legacy Systems” for details on leveraging semantic models).
Knowledge ETL creates DB2/UDB, Oracle, SQL Server, Teradata, and other relational data bases.
Knowledge ETL is agnostic to the type of relational data base that is created. Knowledge ETL creates up to 35 different types of analysis tables. These tables are designed for analytical joins and when taken together are much more powerful analytically than any one given table. These tables are designed to be easily read and manipulated by standard Business Intelligence software.
The Forest Rim Enterprise Knowledge Store can be stored as part of the enterprise data warehouse, enabling unstructured and structured data to be queryable from one environment. This is what Bill Inmon refers to as DW 2.0.
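The sketch below illustrates what querying structured and unstructured data from one environment can look like: a conventional warehouse table is joined to a table of terms tagged from contract text. Both schemas (`contracts`, `term_index`), the sample rows and the `defect_exposure` function are hypothetical and are not taken from the Forest Rim Enterprise Knowledge Store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured warehouse table (hypothetical schema).
    CREATE TABLE contracts (
        contract_id TEXT PRIMARY KEY, region TEXT, value_usd INTEGER);
    -- Index derived from contract text: one row per tagged term
    -- (hypothetical schema).
    CREATE TABLE term_index (contract_id TEXT, term TEXT, context TEXT);
    INSERT INTO contracts VALUES
        ('C1', 'Atlanta', 50000),
        ('C2', 'Denver', 75000);
    INSERT INTO term_index VALUES
        ('C1', 'battery', 'product.defect'),
        ('C2', 'battery', 'product.spec'),
        ('C2', 'pipe', 'product.defect');
""")

def defect_exposure(conn, term):
    """Total contract value per region where `term` is tagged as a defect."""
    rows = conn.execute(
        "SELECT c.region, SUM(c.value_usd) FROM contracts c "
        "JOIN term_index t ON t.contract_id = c.contract_id "
        "WHERE t.term = ? AND t.context = 'product.defect' "
        "GROUP BY c.region ORDER BY c.region", (term,))
    return rows.fetchall()

print(defect_exposure(conn, "battery"))
```

The context column is what separates a contract that merely specifies a battery from one that reports a defective battery; the join then attaches the structured contract value to that textual finding.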
Business Intelligence
The Forest Rim Enterprise Knowledge Store is queryable with traditional business intelligence tools, including Business Objects, Cognos, MicroStrategy, SAS, Crystal Reports or Tableau.
Knowledge Visualizations
Forest Rim provides visualizations to help understand and analyze rules, data definitions, information flows and interrelationships
Textual Business Intelligence
Forest Rim can also contextualize unstructured, textual (big) data to enable one analytical environment for both structured and unstructured data. Forest Rim enables textual data to be contextualized (classified, organized and categorized) using the ontologies and taxonomies in the corporation, just like traditional business intelligence, creating one version of truth for structured and unstructured data.
Forest Rim Cloud Based Knowledge ETL with EC Wise
EC Wise provides a cloud based Forest Rim Knowledge ETL facility on Microsoft Azure that hosts the Forest Rim Knowledge Engine and Text Semantic Engine with support for Search Metadata Integration
Text Inputs – Contracts, Email, Spreadsheets, Documents; Metadata Samples; Search Data Indexes Samples; Compliance Laws
IT Knowledge Inputs - legacy programs, applications, stored procedures, data definition, queries, reports, proprietary semantic models and documentation
Outputs
o Forest Rim Enterprise Knowledge Store
o Microsoft Business Intelligence Semantic Model (BISM)
Cloud Infrastructure Options
o Hosted by FRT – Limited data size
o In the Cloud – You pick the data volume – We support Microsoft Azure Data Transfer – Secure FTP or Dropbox
Additional interfaces - available via a professional services engagement
Forest Rim and EC Wise Proof of Value
Forest Rim and EC Wise offer a free proof of value for samples of 50 or fewer stored procedures and/or data definitions. For more than 50 elements we will do a proof of value for a nominal fee. You can contact us at [email protected]