

Operates more like a search engine than a database
Scoring and ranking IP allows for fuzzy searching
Best-result candidate sets returned
Contextual analytics to correctly disambiguate entities

Embedded inside the database
No need for Hadoop or custom-code analytics
True real-time analytics, done per transaction and in aggregate
On-the-fly linking IP

A new kind of in-memory platform, built for in-memory applications
Proprietary compression enables in-memory at scale
Datasets reduced to 16% of original size


1M documents to petabyte scale; streaming, constantly changing data, or more of the same type of data

Questions are unique to users; analytics are driven by the information that comes through on the query

Looking for the “best” answer, not a definitive one. Consider how, if, and to what extent the data changes.

Need flexibility in query formation and fuzzy search; the DBMS must perform like a search engine as well as a database

Finch = up to 16% of original size

Need sub-second response times, enabling analytics per transaction.

Need embedded models.

Need storage costs reduced; must run on commodity hardware


Fraud Detection: Monitoring financial transactions to identify patterns that could indicate fraud

Internet of Things: Collecting high-volume, high-velocity sensor and telemetry data to improve performance, meet customer needs or support new product development

Digital Communication/Message Traffic: Monitoring streaming feeds of message traffic to identify patterns, risks, trends

CRM/Customer Service Engagement: Aggregating customer information from multiple sources with different data models to improve the customer experience

Personalization: Ingesting clickstream data at high throughput rates to create and refine visitor profiles, serving up relevant content upon each return site visit

Real-Time Big Data: Ingesting a streaming feed of data to perform real-time analytics that inform business-critical decisions

Cyber Security: Protecting data from breaches, theft or misuse


Diagram labels: Query; Candidate Set; Aggregate Analytics (optional); Best Answer (derived from analytic processing); Answer

Compression IP: Makes in-memory feasible at scale

On-the-Fly Linking IP: Enables true real-time analytics inside Finch


Analytics Outside the Database

Batch Processing

(Look Up Known, Precomputed Info)

*Predetermined answers to predetermined questions… about things you know you want to know




Search Today:

(HP Autonomy, Solr, and even commercial search engines)

Query

Candidate Set

Ranked Results

Not in-memory, but FinchDB is.

No analytics, but FinchDB has them.


A question we often encounter is how FinchDB handles streaming data, in addition to static data, and how it differs from the popular Apache Spark product. The primary difference is our ability to apply transactional, predictive analytics on the fly, inside the database, using all available data. Below is a side-by-side comparison.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

• Apply predictive models

• Analyze on the fly

• Compute answers

• Go beyond look-up

Models inside the database


Diagram labels: Wires; Original Content; Corporate Blogs; Online Media; Stream Processing; Entity Extraction; KB Inserts; Queries




Running on a four-node cluster in AWS

Processing a streaming feed of news with 800,000 documents per day

Disambiguating roughly 10 entities per document

Leveraging a Person-KB of 500M features describing 3M unique people

A Geo-KB with more than 30M unique places in the world

And an Org-KB of more than 380M features describing more than 1.3 million unique companies, non-profits, governments and criminal organizations.
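To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted on this slide. The even spread of work across nodes and a uniform arrival rate over the day are simplifying assumptions, not claims about how FinchDB schedules work.

```python
# Figures quoted on the slide
DOCS_PER_DAY = 800_000
ENTITIES_PER_DOC = 10          # "roughly 10", so every result below is approximate
NODES = 4
PERSON_FEATURES, PEOPLE = 500_000_000, 3_000_000
ORG_FEATURES, ORGS = 380_000_000, 1_300_000   # both quoted as "more than"

SECONDS_PER_DAY = 24 * 60 * 60

disambiguations_per_day = DOCS_PER_DAY * ENTITIES_PER_DOC
docs_per_second = DOCS_PER_DAY / SECONDS_PER_DAY
disambiguations_per_second = disambiguations_per_day / SECONDS_PER_DAY

# Assumption: load is spread evenly across the four nodes
per_node_per_second = disambiguations_per_second / NODES

# Average number of KB features available per entity
features_per_person = PERSON_FEATURES / PEOPLE
features_per_org = ORG_FEATURES / ORGS

print(f"Disambiguations per day:    {disambiguations_per_day:,}")
print(f"Documents per second:       {docs_per_second:.2f}")
print(f"Disambiguations per second: {disambiguations_per_second:.1f}")
print(f"Per node, per second:       {per_node_per_second:.1f}")
print(f"Features per person:        {features_per_person:.1f}")
print(f"Features per organization:  {features_per_org:.1f}")
```

Under these assumptions the feed averages roughly 9 documents and 93 entity disambiguations per second, each drawing on a couple of hundred features per knowledge-base entity.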




Every query has search specifications and scoring/ranking specifications.

We look at both to return a candidate set.
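The idea of a query carrying both a search specification and a scoring/ranking specification can be sketched as follows. This is a minimal illustration, not FinchDB's query API: the `SearchSpec`, `ScoringSpec`, and `run_query` names, the fuzzy-match threshold, and the weights are all hypothetical, and Python's standard `difflib` stands in for whatever matching FinchDB actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SearchSpec:
    """Hypothetical search specification: which records may enter the candidate set."""
    field: str
    text: str
    min_similarity: float  # fuzzy-match threshold in [0, 1]


@dataclass
class ScoringSpec:
    """Hypothetical scoring/ranking specification: how candidates are ordered."""
    match_weight: float    # weight on the fuzzy-match similarity
    field_weights: dict    # record field name -> weight on that numeric field
    top_k: int             # size of the returned candidate set


def similarity(a, b):
    # Case-insensitive string similarity in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def run_query(records, search, scoring):
    candidates = []
    for rec in records:
        sim = similarity(search.text, rec[search.field])
        if sim < search.min_similarity:
            continue  # fails the search specification
        score = scoring.match_weight * sim + sum(
            weight * rec.get(name, 0.0)
            for name, weight in scoring.field_weights.items()
        )
        candidates.append((score, rec))
    # Rank by score, best first, and keep the top_k candidates
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[: scoring.top_k]


records = [
    {"name": "Jon Smith", "prominence": 0.2},
    {"name": "John Smith", "prominence": 0.9},
    {"name": "Jane Doe", "prominence": 0.5},
]
search = SearchSpec(field="name", text="John Smith", min_similarity=0.8)
scoring = ScoringSpec(match_weight=0.7, field_weights={"prominence": 0.3}, top_k=2)

results = run_query(records, search, scoring)
for score, rec in results:
    print(f"{score:.3f}  {rec['name']}")
```

The search specification decides which records are close enough to be considered at all, and the scoring specification decides their order, so a near-miss spelling such as "Jon Smith" still appears in the candidate set but ranks below the exact match.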



In an entity disambiguation use case, we do that by calculating a disambiguation score based on:



Name Score



Topic Vector Score



Context Vector Score



Prominence Score



And we do that in less than a millisecond around every event. In this use case, an “event” is a new document coming into the system.

The same would be true in other use cases. In a cybersecurity use case, an “event” would be an attack. In this scenario, you could take what’s happening in your environment and include that data as part of the query.
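One way to picture a disambiguation score built from the four components listed above is a weighted sum. The slides do not say how FinchDB combines the components, so the weights, the use of cosine similarity for the vector scores, the `difflib`-based name score, and all function and field names below are assumptions made purely for illustration.

```python
import math
from difflib import SequenceMatcher

# Hypothetical weights; the source does not state how the components are combined
WEIGHTS = {"name": 0.4, "topic": 0.2, "context": 0.3, "prominence": 0.1}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def name_score(mention_name, kb_name):
    """String similarity in [0, 1] between the mention and the KB entry name."""
    return SequenceMatcher(None, mention_name.lower(), kb_name.lower()).ratio()


def disambiguation_score(mention, candidate, weights=WEIGHTS):
    parts = {
        "name": name_score(mention["name"], candidate["name"]),
        "topic": cosine(mention["topic_vec"], candidate["topic_vec"]),
        "context": cosine(mention["context_vec"], candidate["context_vec"]),
        "prominence": candidate["prominence"],
    }
    return sum(weights[key] * parts[key] for key in parts)


def best_candidate(mention, candidates):
    return max(candidates, key=lambda c: disambiguation_score(mention, c))


# Toy two-dimensional vectors: (sports, technology)
mention = {"name": "Jordan Lee", "topic_vec": [0.1, 0.9], "context_vec": [0.0, 1.0]}
candidates = [
    {"id": "athlete", "name": "Jordan Lee", "prominence": 0.9,
     "topic_vec": [1.0, 0.0], "context_vec": [0.9, 0.1]},
    {"id": "professor", "name": "Jordan Lee", "prominence": 0.4,
     "topic_vec": [0.0, 1.0], "context_vec": [0.1, 0.9]},
]

best = best_candidate(mention, candidates)
print(best["id"])
```

In this toy example both candidates share the same name, and the better-known athlete has the higher prominence, but the topic and context of the incoming document point to the professor, which is the kind of contextual disambiguation the slide describes.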

Diagram labels: Query; Aggregate Analytics; Best Answer; Answer


JSON-style, doc database

Not in-memory, no embedded analytics, open-source

In-memory, multiple deployment models, distributed architecture, no embedded analytics

In-memory, HTAP processing use cases

Only works on structured data

In-memory, handles unstructured text

As a “data fabric,” GridGain takes in SQL, NoSQL and Hadoop-analytic data. FinchDB does on-the-fly analytics inside the database, meaning the need for Hadoop could be eliminated altogether.

HTAP processing use cases

Only works on structured data. Not true in-memory: uses a built-in, on-demand caching scheme. All transactional operations are done on in-memory data.

Doc database

Open source, cannot be cloud deployed/DBaaS

JSON-style, doc database, distributed
