• No results found

Data Management in the Cloud

N/A
N/A
Protected

Academic year: 2021

Share "Data Management in the Cloud"

Copied!
39
0
0

Loading.... (view fulltext now)

Full text

(1)

Data  Management  in  the  Cloud  

Lecture  8  

Data  Models  

Document:  MongoDB  

1  

With  thanks  to  Michael  Grossniklaus!  

“…

I’ve  failed  over  and  over  and  over  again  in  my  life.  

And  that  is  why  I  succeed.

”  –  Michael  Jordan  

(2)

Trend  Towards  Specializa=on  

2   Non-­‐rela@onal  

Opera@onal  Datastores   (“NoSQL”)  

New  Genera@on  OLAP  

(ver@ca,  aster,  greenplum)   (Oracle,  MySQL)  RDBMS  

MongoDB’s  defini=on  of  “NoSQL”:  

(3)

Horizontally  Scalable  Architectures  

No  joins  and  no  complex  mul@-­‐row  transac@ons    

–  hard  to  implement  both  when  scaling  horizontally  

–  abandon  the  rela@onal  data  model  

No  SQL  

New  data  models  

–  key/value:  memcached,  Dynamo  

–  document-­‐oriented:  MongoDB,  CouchDB,  JSON  stores  

–  tabular:  BigTable  

Improved  ways  to  develop  applica@ons?  

–  “easier  than  a  rela@onal  database  management  system”  

–  data  modeling:  begin  with  normalized  model  and  refine  model  by  

embedding  documents    

(4)

Spectrum  of  Systems  

4   sc al ab ili ty  a nd  p er fo rma nc e  

depth  of  func;onality  

•  memcached  

• key/value   •  MongoDB  

(5)

Document3   Document2  

MongoDB  vs.  RDBMS  

What  is  missing?  

–  joins  

–  complex  mul@-­‐row  transac@ons  

–  SQL  (but  MongoDB  has  document-­‐based  query  “language”)  

Terminology  

–  database  →  database  or  schema  

–  table  →  collec@on   –  row  →  document   –  column  →  field   Field2   Document1   Collec@on1   Database   Index   Field1   Collec@on2   Doc1   Doc2  

(6)

MongoDB  vs.  RDBMS  

“rela@onal  databases  define  columns  at  the  table  level  

whereas  a  document-­‐oriented  database  defines  its  fields  at  the  

document  level”  -­‐  

h]p://openmymind.net/mongodb.pdf

 (p.5)  

–  Table  -­‐>  Collec@on;  Row  -­‐>  Document  

6  

Brand   Size   Color  

Vans   8   Red/Gold   DC   8   White/Blue/Green   Nike   8.5   YellowGreen   {“Brand”: “Vans”, “Size”: “8”, “Color”: “Red/Gold”} {“Brand”: “DC”, “Size”: “8”, “Color”: “White/Blue/Green”} {“Brand”: “Nike”, “Size”: “8.5”, “Color”: “YellowGreen”}

(7)

MongoDB  vs.  RDBMS  

“rela@onal  databases  define  columns  at  the  table  level  

whereas  a  document-­‐oriented  database  defines  its  fields  at  the  

document  level”  -­‐  

h]p://openmymind.net/mongodb.pdf

 (p.5)  

–  Table  -­‐>  Collec@on;  Row  -­‐>  Document  

7  

Brand   Size   Color  

Vans   8   Red/Gold   DC   8   White/Blue/Green   Nike   8.5   YellowGreen   {“Brand”: “Vans”, “Size”: “8”, “Color”: “Red/Gold”,

“Note”: “Looks like mermaid”}

{“Brand”: “DC”, “Size”: “8”, “Color”: “White/Blue/Green”} {“Brand”: “Nike”, “Size”: “8.5”, “Color”: “YellowGreen”}

(8)

MongoDB  Data  Model  &  Features  

Advantages  of  document  model  

–  Documents  (objects)  correspond  to  na@ve  types  in  many  programming  

languages  

–  Embedded  documents  and  arrays  reduce  need  for  expensive  joins  

–  Dynamic  schema  supports  fluent  polymorphism  

High  Performance  

High  Availability  

Automa@c  Scaling  

8   h]p://docs.mongodb.org/manual/core/introduc@on/  

(9)

Data  Model  

MongoDB  stores  JSON-­‐

style

 documents    

–  internally  represented  as  BSON  

–  {“hello”: “world”}

↓    

\x16\x00\x00\x00\x02hello\x00

\x06\x00\x00\x00world\x00\x00

–  light-­‐weight  and  traversable  

–  driver  converts  programming  language  objects  to  BSON  

–  document  size  limited  to  4  MB  

Flexible  “schemas”  

–  the  “big  difference”  between  a  collec@on  and  a  table  

9  

denotes  an  a<ribute   of  type  string   {“author”: “fred”, “text”: “...”} {“author”: “mary”, “text”: “...”, “tags”: [“mongodb”]} Source:  h<p://bsonspec.org  

(10)

Use  Cases  

World  Wide  Web  

–  scaling  out  

–  high  volume  

–  caching  

Less  good  at…  

–  highly  transac@onal  

–  ad-­‐hoc  business  intelligence  

–  problems  that  require  SQL  

(11)

Storing  Documents  

post = {author: “mike”, date: new Date(),

text: “my blog post...”, tags: [“mongodb”, “intro”]}

db.posts.save(post)

Example  uses  JavaScript  

–  MongoDB  supports  JavaScript  on  the  client  and  server  side  

–  other  language  bindings  exist  

BSON  is  the  basis  for  a  rich  type  system  

–  e.g.,  (binary)  date  values  are  not  possible  in  JSON  

Collec@ons  are  created  implicitly  on  demand  

11  

database  

collec;on  

(12)

Document  Iden=fiers  

Special  key  _id

–  present  in  all  documents:  user  or  system-­‐defined  

–  unique  across  a  collec@on  

–  can  be  any  type  

System-­‐defined  default  _id  is  added  if  not  specified  by  user  

–  similar  to  GUID/UUID,  lightweight  (only  12  bytes)  and  fast  to  generate  

–  generated  at  the  client  

–  ObjectId(“4bface1a2231316e04f3c434”)

 @mestamp      machine  id        process  id          incremen@ng  counter   12   Source:  h<p://www.mongodb.org/display/DOCS/Object+IDs  

(13)

Queries  

Query  evalua@on  invoked  by  method  find()  on  collec@on  

–  posts  by  author  

 db.posts.find({author: “mike”}) –  query-­‐by-­‐example:  problems?  

Query  “language”  

–  query-­‐by-­‐example  plus  “$  modifiers”:  $gt,  $lt,  $gte,  $lte,  $ne,   $all,  $in,  and  $nin

–  age  between  20  and  40  

 {author: “mike”, age: {$gte: 20, $lt: 40}}

Advanced  queries  

–  keyword  $where


 db.posts.find({$where: “this.author == ‘mike’ ||

  this.title == ‘dbms’”})

13  

(14)

Complex  Queries  

Queries  involving  dates  

–  posts  since  April  1  

 april1 = new Date(2013, 3, 1)

 db.posts.find({date: {$gt: april1}})

Sor@ng  and  limit  

–  last  ten  posts  

 db.posts.find()

  .sort({date: -1})

  .limit(10) –  lazy  evalua@on  (cursors)  

14  

descending   all  posts  

JavaScript  starts   coun;ng  months  at  0  

(15)

Cursors  

var c = db.posts.find({author: “mike”}) .skip(20) .limit(10)) c.next() c.next() ... 15   next  result   query  

first  N  results  and  cursor  id  

next  with  cursor  id  

next  N  results  and  cursor  id  or  eof   ...  

omit  first  20  results   only  return  10  results  

(16)

Queries  over  Text  and  Collec=ons  

Use  regular  expression  in  find()  method  

–  posts  ending  with  “Tech”  

 db.posts.find({text: /Tech$/})

Members  of  collec@on-­‐valued  a]ributes  are  queried  in  the  

same  way  as  a]ributes  with  atomic  values  

–  posts  with  a  specific  tag  

 db.posts.find({tags: “mongodb”})

Use  of  a  mul@-­‐key  index  speeds  up  collec@on-­‐oriented  queries  

–   db.posts.ensureIndex({tags: 1})

16  

(17)

Aggrega=on  Func=ons  

Coun@ng  the  number  of  documents  in  a  collec@on  

–  total  posts  

 db.posts.count() –  total  posts  authored  by  Mike  

 db.posts.find({author: “mike”}).count()

Method  

distinct()

 displays  the  dis@nct  values  found  for  a  

specified  key  in  a  collec@on  

Method  

group()

 groups  documents  in  a  collec@on  by  the  

specified  key  and  performs  simple  aggrega@on  

More  ways  to  compute  aggregated  values  

–  aggrega=on  framework  using  aggregate  command  

–  map/reduce  aggrega=on  for  large  data  sets  using  mapReduce  

command  

(18)

Atomic  Update  Modifiers  

var c = {author: “eliot”, date: new Date(), text: “Great post!”} db.posts.update({_id: post._id},

{$push: {comments: c}})

MongoDB  has  no  transac@ons  

–  complex  func@onality  exposed  through  update  modifiers  

–  $push  adds  (appends)  an  object  to  an  array  

Client  does  not  need  to  worry  about  

–  locking  of  documents  

–  conten@on  of  updates    

18  

query  selector  

(19)

Document  Evolu=on  

post = {author: “mike”, date: new Date(),

text: “another blog post...”, tags: [“mongodb”],

title: “Introduction to MongoDB”} db.posts.save(post)

Flexible  schema  

–  documents  within  the  same  collec@on  can  have  different  a]ributes  

–  document  a]ributes  can  simply  be  added  or  removed  

–  evolu@on  performed  when  save()  is  invoked  

19  

(20)

Horizontally  Scalable  Architectures  

Remember…they  said…  

No  joins  and  no  complex  mul@-­‐row  transac@ons    

–  hard  to  implement  both  when  scaling  horizontally  

–  abandon  the  rela@onal  data  model  

(21)

Parallel  Database  Execu=on  -­‐  Join  

SELECT  brand,  descrip@on,  size,  shelfnumber,   shelfposi@on  

FROM  shoes,  shoestorage  

WHERE  shoes.id  =  shoestorage.id                                AND  color  =  ‘Green’  

                             AND  lastworn  <  ‘1-­‐25-­‐2014’  

shoes   shoestorage  

Case  1:  Tuples  from  shoes  and   shoestorage  are  randomly  

par@@oned  (sharded)  across  the   three  disks.   shoes   shoestorage   JOIN   JOIN   SELECT   SELECT   RePar@@on   RePar@@on   SCAN   RePar@@on   RePar@@on   SCAN  

Bold  lines  indicate   transfer  across   network  

Computer  1  

(22)

Transac=ons  Across  Systems…  

22  

Brand   Size   Color  

Vans   8   Red/Gold  

DC   8   White/Blue/Green   Nike   8.5   YellowGreen  

Brand   Size   Color  

Converse   8   White   Converse   8   Blue  

Converse   8   Black/Red   T1  –  Update  Converse  Black/Red  -­‐>  Black/Red/White  

T2  –  Update  Nikes  YellowGreen  -­‐>  ReallyBrightYellow   T1  –  Update  Nike  Yellow/Green  -­‐>  NeonYellow  

T2  –  Update  Nike  Yellow/Green  -­‐>  NeonYellow  

T1   T2   T1  wai@ng  on  T2  

T2  wai@ng  on  T1  

(23)

Atomic  Update  Modifiers  

var c = {author: “eliot”, date: new Date(), text: “Great post!”} db.posts.update({_id: post._id},

{$push: {comments: c}})

MongoDB  has  no  transac@ons  

–  complex  func@onality  exposed  through  update  modifiers  

–  $push  adds  (appends)  an  object  to  an  array  

Client  does  not  need  to  worry  about  

–  locking  of  documents  

–  conten@on  of  updates    

23  

query  selector  

(24)

So,  no  transac=ons…  

24  

No  begin,  commit,  rollback  

Atomic  only  at  the  single  document  level…  

(25)

Write  Concern…  

Errors  Ignored  

–  MongoDB  does  not  acknowledge  write  opera@ons.    

–  Client  cannot  detect  failed  write  opera@ons  –  including  connec@on  

errors,  duplicate  key  excep@ons…  

–  But  fast  performance…  

Unacknowledged  

–  MongoDB  does  not  acknowledge  the  receipt  of  write  opera@on.  

–  Similar  to  errors  ignored;  however,  drivers  a]empt  to  receive  and  

handle  network  errors  when  possible.    

–  Used  to  be  the  default  write  concern.  

Acknowledged  

–  Mongodb  confirms  the  receipt  of  the  write  opera@on.  

–  Allows  clients  to  catch  network,  duplicate  key,  and  other  errors.  

–  Is  now  the  default  write  concern.    

25   h]p://docs.mongodb.org/manual/core/write-­‐concern/  

(26)

Write  Concern…  

Journaled  

–  Mongodb  acknowledges  the  write  opera@on  only  auer  commivng  the  

data  to  the  journal.  Write  is  recoverable.    

–  Write  opera@ons  must  wait  for  next  journal  commit.  To  reduce  latency,  

mongodb  increases  the  frequency  that  it  commits  to  the  journal…  

Replica  Acknowledged  

–  With  replica  acknowledged  you  can  guarantee  that  the  write  opera@on  

propagates  to  the  members  of  a  replica  set.  See    

–  To  set  replica  acknowledged  write  concern,  specify  w  values  greater  

than  1  to  your  driver.  

26   h]p://docs.mongodb.org/manual/core/write-­‐concern/  

(27)

Indexing  

(28)

Indexing  

MongoDB  has  support  for  user-­‐defined  secondary  indexes  

–  Indices  defined  at  the  collec@on  level  

–  B-­‐tree  indexes  for  keys  and  compound  keys  

–  db.posts.ensureIndex({tags: 1})

Support  for  mul@-­‐key  indexes  

–  index  document  for  every  “key”  value  in  an  array  

Geospa@al,  text,  hash  indices;  Unique  indices  

28  

(29)

Embedded  Documents  

Document  model  supports  nested  or  embedded  documents  

–   {author: “mike”,

  title: “Teach Yourself MongoDB”,

  tags: [“mongodb”, “intro”],

  comments: [{author: “eliot”,

  text: “Awesome tutorial!”},

  {author: “fred”,

  text: “Thanks Mike!”}],

  text: “MongoDB is an open-source...”}

“Dot  nota@on”  to  index  and  query  embedded  documents  

–  create  an  index  on  an  embedded  document  

 db.posts.ensureIndex({comments.author: 1}) –  find  posts  with  a  comment  by  Eliot  

 db.posts.find({comments.author: “eliot”})

29  

Embedded   Documents  

(30)

Query  Op=mizer  

No  algebraic  or  cost-­‐based  query  op@miza@on  

–  evaluator  processes  alterna@ve  plans  in  parallel  

–  first  plan  to  finish  is  remembers  as  most  efficient  

–  other  plans  are  terminated  

–  query  plan  is  recorded  –  re-­‐evaluate  plan  auer  1000  writes,  add,  drop  

or  rebuild  index  

Find  posts  by  Joe  about  database  management  systems  

–   db.posts.find({author: “joe”, tags: “dbms”})

linear  scan   index  on  tags   index  on  author  

terminate   terminate   remember  

(31)

Data  File  Alloca=on  

Memory-­‐mapped  storage  engine  

–  en@re  files  are  mapped  into  memory  

MongoDB  stores  each  database  as  a  number  of  files  

–  namespace  contains  the  catalog  

–  data  files  contain  the  collec@ons  and  documents  

–    16MB blog.ns   64MB blog.0  128MB blog.1   16MB test.ns   ... 31   double  in  size  

(up  to  2  GB)   allocated  for  each  database  

(32)

Extent  Alloca=on  

Extents  allocated  into  

namespaces  per  collec@on  

–     blog.posts

–     blog.comments –     blog.users

–     blog.$freelist –     pre-­‐allocated  space  

Namespace  metadata  details  

stored  in  blog.ns

Empty  file  is  kept  in  order  to  

avoid  blocking  on  alloca@on  

32   0000000000000 0000000000000 0000000000000 0000000000000 0000000000000 0000000000000 0000000000000 0000000000000 blog.0 blog.1 blog.2 000

(33)

Supported  Pla[orms  and  Languages  

Drivers  

–  PHP,  Perl,  Python,  Ruby,  Java,  C++/C,  C#,  Erlang,  Haskell  

Plazorms  

–  Window  (x86,  x64)   –  Linux  (x86,  x64)   –  OS  X  (x86,  x64)   –  Solaris   –  FreeBSD   33  
(34)

Replica=on  

Replica@on  provides  redundancy  and  increases  data  availability  

Replica  sets  added  in  version  1.6  

–  “master/slave”-­‐style  replica@on,  but  roles  can  change  

–  primary  receives  all  writes  

–  if  primary  fails,  secondaries  elect  a  new  primary  

34  

primary  

secondary   secondary   secondary  

primary   secondary   primary   primary   secondary   secondary   primary   primary   h]p://docs.mongodb.org/manual/replica@on/   client   applica@on   driver   writes   reads  

(35)

Read  Preferences  

35  

Read  Preference  Mode   Descrip=on  

primary   Default  mode.  All  opera@ons  read  from  

the  current  replica  set  primary.  

primaryPreferred   In  most  situa@ons,  opera@ons  read  from   the  primary  but  if  it  is  unavailable,  

opera@ons  read  from  secondary  members.   secondary   All  opera@ons  read  from  the  secondary  

members  of  the  replica  set.  

secondaryPreferred   In  most  situa@ons,  opera@ons  read  from   secondary  members  but  if  nosecondary   members  are  available,  opera@ons  read   from  the  primary.  

nearest   Opera@ons  read  from  the  nearest  member  

of  the  replica  set,  irrespec@ve  of  the   member’s  type.  

(36)

Auto-­‐Sharding  

36   mongod   mongod   mongod   mongod   mongod   mongod   ...   Shards   mongod   mongod   mongod   Configura@on   Servers   mongos   mongos   ...   client  

Stores  the  data  

Interfaces  with  client   applica@ons  and  directs  

opera@ons  to  the  appropriate   shard  or  shards  

Stores  the  cluster’s  metadata;   contains  a  mapping  of  the  

cluster’s  data  set  to  the  shards.   Produc@on  –  must  have  exactly   3  config  servers.  

Query  Routers  

h]p://docs.mongodb.org/manual/sharding/   Supports  range-­‐ based  and  hash-­‐ based  sharding   based  on  a   “shard  key”  

(37)

Other  Features  

Aggrega@on  and  map/reduce  

–  complex  aggrega@on  func@onality  

–  ETL/offline  aggrega@on  tasks  

Capped  collec@ons  

–  fixed-­‐sized  collec@on  with  wrap  around  (overwrite)  

–  e.g.,  for  website  logging  

Unique  indexes  

Mongo  shell  

GridFS  

–  file  system  API  built  on  top  of  MongoDB  

–  stores  large  binary  data  by  chunking  it  up  into  smaller  documents  

Two-­‐dimensional  geo-­‐spa@al  indexing  

–  e.g.,  foursquare  

(38)

References  

M.  Dirolf:  

Introduc=on  to  MongoDB

,  O’Reilly  Webcast,  2010,  

available  at:  

h]p://www.youtube.com/watch?v=w5qr4sx5Vt0

.  

h]p://docs.mongodb.org/manual/

 

Chapter  9  Sandage  &  Fowler  

(39)

TO  BE  CONTINUED…  

Data  Management  in  the  Cloud  

References

Related documents

The multi-stage barrel compressor is normally used for low flow and &#34;high&#34; head conditions while the pipeline.. compressor is normally used in high flow and low &#34;head&#34;

The multi-tenant nature of the cloud and questions about the physical location of cloud data are security risks that organizations looking at using cloud services need to be

As a complement of planimetry obtained by procedures as photogrammetry or terrestrial laser scanning (TLS), complete architectural surveys using non- destructive techniques are

En efecto, así como los libertarianos ven en cual- quier forma de intervención del Estado una fuente inevitable de interferencias arbitrarias –con la excepción de aquella acción

CDMI – The Cloud Storage Standard Cloud Data Management Interface. Enables interoperable cloud storage and data management Facilitates data portability, compliance

Write: Cloud Management System base functions Qualification plan Apply FCAPS/X-805 Write: Work instruction Write: Installation qualification Write: Operational Qualification

• Senses if the user is sitting, standing or walking through the use of an accelerometer • Lights up the appropriate LED on the interface control board for the current state •

Executive Brief In This Paper • In cloud environments, using multiple point products for data management often results in diminishing returns • Single-vendor solutions