• No results found

Databases for text storage

N/A
N/A
Protected

Academic year: 2021

Share "Databases for text storage"

Copied!
24
0
0

Loading.... (view fulltext now)

Full text

(1)

Databases for text storage

Jonathan Ronen

New York University [email protected]

(2)

Overview

1 Introduction

2 PostgresSQL

(3)

Why Databases?

Structured way to store your data Accessible, shareable

Manage growing volumes of data

You cannot keep all of your data in working memory... indexing

(4)

Basic issues with databases

Inserting data Schema Querying Indexing

(5)

I’ll show you how to do this in

PostgresSQL MongoDB

(6)

PostgreSQL

Relational DB

Which means we define tables with columns and relations Queried using Structured Query Language

ES-QUE-ELL, or SEQUEL, but not SQUEAL

opensource, free, very fast, advanced text search capabilities Friendly elephant logo

(7)
(8)
(9)
(10)

Basics of SQL

SELECT statement

SELECT * FROM tweets WHERE user id=2170941466;

SELECT statement with time range

SELECT * FROM tweets WHERE timestamp >’2014-12-2’;

SELECT statement with LIKE

(11)

Indexing

Imagine searching through a table:

id user id timestamp text

1 1 2014-11-30 10:23:40 I love the biebsssss!

2 2 2014-11-30 11:33:44 Bieberboy make me a baby!

3 1 2014-11-30 10:23:23 God if biebs dont come i shoot myself!!!

4 3 2014-11-30 9:12:11 I love bieber so much i have bieber sandwiches

5 2 2014-11-30 12:33:10 RT if you love biebsbs as much ias me!! or you die!!!!

(12)

Indexing

Imagine searching through a table:

id user id timestamp text

4 3 2014-11-30 9:12:11 I love bieber so much i have bieber sandwiches

3 1 2014-11-30 10:23:23 God if biebs dont come i shoot myself!!!

1 1 2014-11-30 10:23:40 I love the biebsssss!

2 2 2014-11-30 11:33:44 Bieberboy make me a baby!

5 2 2014-11-30 12:33:10 RT if you love biebsbs as much ias me!! or you die!!!!

(13)

Indexing

An index is a sorted copy of a column.

timestamp id 2014-11-30 9:12:11 4 2014-11-30 10:23:23 3 2014-11-30 10:23:40 1 2014-11-30 11:33:44 2 2014-11-30 12:33:10 5

(14)

Text search in postgres

SELECT statement using PG text search

SELECT * FROM tweets WHERE to tsvector(’english’, text) @@ to tsquery(’obama’);

to tsvector to tsquery

(15)

Text indexing

CREATE INDEX statement

CREATE INDEX text idx ON tweets USING gin(to tsvector(’english’, text));

SELECT statement using text index

SELECT * FROM tweets WHERE to tsvector(’english’, text) @@ to tsquery(’obama’);

(16)

Aggregation

GROUP BY statement

(17)

MongoDB

Document store

noSQL doesn’t mean query language isn’t structured (but it’s different..)

(18)

JSON documents

{

” c r e a t e d a t ” : ”Wed Aug 13 1 5 : 2 0 : 4 6 +0000 2 0 1 4 ” , ” l a n g ” : ” en ” ,

” r e t w e e t c o u n t ” : 0 ,

” t e x t ” : ” P e n n s y l v a n i a USA P h i l a d e l p h i a \ u00bb MikeBrown 545 Mike Brown : S t . L o u i s P o l i c e S h o o t amp K i l l Unarmed 18−Year−Old −− S\ u2026 h t t p : / / t . co /RgDpM8M881 ” , ” u s e r ” : { ”name ” : ” J e f f ” , ” s c r e e n n a m e ” : ” j e f f e r s o n d o l ” , ” s t a t u s e s c o u n t ” : 2 0 7 8 4 5 , ” d e s c r i p t i o n ” : ”#a n d r o i d , #a n d r o i d g a m e s ,# i p h o n e , #i p h o n e g a m e s , #i p a d , #i p a d g a m e s , #app ” , ” f o l l o w e r s c o u n t ” : 8 1 0 , ” l a n g ” : ” en ” , ” g e o e n a b l e d ” : f a l s e ,

(19)

MongoDB is a document database

MongoDB lets you store these documents directly No need to flatten to tabular form!

Comes with its own query syntax Also uses indexing to speed queries

SQL Mongo

Database Database

Table Collection

Row Document

(20)

MongoDB Query Syntax

Regex matching db . c o l l e c t i o n . f i n d ( { ’ t e x t ’ : /obama / } ) Date range db . c o l l e c t i o n . f i n d ( { t i m e s t a m p : { $ g t : new Date ( 2 0 1 4 , 1 0 , 6 ) } } )

(21)

Text search in MongoDB

Creating a text index

db.tweets.ensureIndex({text: ”text” })

Using text search

(22)

Aggregation in MongoDB

Aggregation framework db . t w e e t s . a g g r e g a t e ( { $ g r o u p : { i d : ” $ u s e r . s c r e e n n a m e ” , number : { $sum : 1 } } } )

(23)

SMAPP

Some info on the smapp backend:

MongoDB with index on tweet id, timestamp, random number (for sampling)

No text index (yet!)

(24)

References

Related documents

RISS also provides web-based and email client advanced search capabilities that support nearly instantaneous searches of even very large amounts of archived data, giving

A single dtSearch index could include both the SQL database and the separate document repository, including searching with all the above advanced search features,

In this paper, we propose to explicitly incorpo- rate the column value polarities as external knowl- edge in text-to-SQL models. Our goal is to scale the capabilities of existing

The planned method involves automatically summarization from free-text databases to n-gram models and spatial language models, automatically filtering spatial

The Advanced Search function allows you to filter your search by the fields that exist in a Close Call record (except for the free text fields. See section 2.2.1 on searching the

thumbnail view, full-text search and hit highlighting capabilities, a JPEG2000 image server, a highly scalable back-end common infrastructure for storage, batch image processing,

Splitted document to replace text search for an advanced document into aspose components and extract it finds and excel documents in a node.. Small application with new document