
Archiving and Sharing Big Data Digital Repositories, Libraries, Cloud Storage


(1)

1  

Cyrus Shahabi, Ph.D.

Professor of Computer Science & Electrical Engineering

Director, Integrated Media Systems Center (IMSC)

Director, VSoE Informatics

Viterbi School of Engineering

University of Southern California

Los Angeles, CA 90089-0781

[email protected]

 

Archiving and Sharing Big Data

(2)

2  

OUTLINE

• Some Background
    Cloud Computing
• An Example
    IMSC's TransDec
• A Proposal
    USC DataLab


(4)

4  

Cloud Computing

• Cloud computing is the delivery of computing and storage resources as a service across the Internet to multiple external customers through massive-scale data centers.
• Some statistics
    51% of all global workloads in 2014 were processed in the cloud versus traditional IT space [1].
    IBM Big Blue cloud project generated $7 billion in revenue in 2014, up 75% from the previous year [2].
    By 2020, it is estimated that 80% of small businesses in the US will use cloud computing, up from 37% in 2014.

[1] Cisco, http://newsroom.cisco.com/release/1274405
[2] http://talkincloud.com/cloud-computing-funding-and-finance/01202015/ibm-q4-earnings-cloud-revenues-hit-7b-2014

 

(5)

5  

Cloud Computing

Advantages

✓ Reduced Cost
    elimination of the investment in stand-alone software or servers
✓ Scalability and Elasticity
    providing on-demand resources instantaneously
✓ Availability
    downtime is very small throughout the year
✓ Quick deployment
    minimum effort in integrating applications
✓ Environment friendly

(6)

6  

Cloud Computing

Disadvantages

✓ Security and Privacy
    by leveraging a remote cloud-based infrastructure, a company essentially gives away private data and information
✓ Dependency and Vendor lock-in
    implicit dependency on the provider
✓ Limited Flexibility
    since the applications and services run on remote, third-party virtual environments, users have limited control over the hardware and software
✓ Increased Vulnerability
    cloud-based solutions are exposed on the public internet and are thus a more vulnerable target for malicious users and hackers

(7)

7  

Cloud Computing

(8)

8  

Cloud Computing - Pricing

Virtual Machines (Servers)

Servers are grouped into categories such as disk-optimized, memory-optimized, CPU-optimized, and GPU. Each server group consists of multiple servers.

Note: "smallest" means the server with the lowest configuration in that group.

Group               Amazon ($/hour)        Microsoft ($/hour)     Google ($/hour)
                    smallest   largest     smallest   largest     smallest   largest
General purpose     0.07       0.56        0.02       0.72        0.077      1.232
Compute optimized   0.105      1.68        2.45       4.9         0.096      0.768
Memory optimized    0.175      2.8         0.33       1.32        0.18       1.44
Disk optimized      0.853      6.82        -          -           -          -
Micro               0.02       0.044       -          -           0.014      0.0385
GPU                 0.65       0.65        -          -           -          -
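To give a feel for what the hourly prices mean as a yearly budget, the following minimal sketch converts an hourly rate into the cost of keeping a server running all year. The function name `annual_cost` and the always-on assumption (24 hours a day, 365 days a year) are illustrative assumptions, not part of the original slides; the rates are the smallest general-purpose prices from the pricing table.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the server runs non-stop for a non-leap year


def annual_cost(price_per_hour: float, servers: int = 1) -> float:
    """Dollar cost of running `servers` machines continuously for one year."""
    return price_per_hour * HOURS_PER_YEAR * servers


# Smallest general-purpose server prices ($/hour) from the pricing table.
smallest_general_purpose = {"Amazon": 0.07, "Microsoft": 0.02, "Google": 0.077}

for provider, price in smallest_general_purpose.items():
    print(f"{provider}: ${annual_cost(price):,.2f} per year")
```

Real bills would also depend on storage, network traffic, and any reserved or discounted pricing, none of which this sketch models.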

(9)

9  

OUTLINE

• Some Background
    Cloud Computing
• An Example
    IMSC's TransDec
• A Proposal
    USC DataLab

(10)

10  

Data Type                Sample XML File   Cycle Duration   Minute      Hourly        Daily           Annual             3 Years
                         Size (in KB)      (in seconds)     (in KB)     (in KB)       (in KB)         (in KB)            (in KB)
bus_mta_inv2.xml                    23            86400          0.96          0.96           23.00           8,395.00           25,185.00
bus_mta_rt2.xml                   1065              120        532.50     31,950.00      766,800.00     279,882,000.00      839,646,000.00
cctv_inv.xml                        57            86400          0.04          2.38           57.00          20,805.00           62,415.00
cms_inv.xml                         52            86400          0.04          2.17           52.00          18,980.00           56,940.00
cms_rt.xml                          48               75         38.40      2,304.00       55,296.00      20,183,040.00       60,549,120.00
event_d7.xml                        11               75          8.80        528.00       12,672.00       4,625,280.00       13,875,840.00
rail_mta_inv.xml                     1            86400          0.00          0.04            1.00             365.00            1,095.00
rail_rt.xml                          8               60          8.00        480.00       11,520.00       4,204,800.00       12,614,400.00
rms_inv.xml                        865            86400          0.60         36.04          865.00         315,725.00          947,175.00
rms_rt.xml                        1236               75        988.80     59,328.00    1,423,872.00     519,713,280.00    1,559,139,840.00
signal_inv.xml                    2095            86400          1.45         87.29        2,095.00         764,675.00        2,294,025.00
signal_rt.xml                     2636               45      3,514.67    210,880.00    5,061,120.00   1,847,308,800.00    5,541,926,400.00
tt_d7_inv.xml                      746            86400          0.52         31.08          746.00         272,290.00          816,870.00
tt_d7_rt.xml                       152               60        152.00      9,120.00      218,880.00      79,891,200.00      239,673,600.00
vds_art_d7_inv.xml                 115            86400          0.08          4.79          115.00          41,975.00          125,925.00
vds_art_d7_rt.xml                   45               60         45.00      2,700.00       64,800.00      23,652,000.00       70,956,000.00
vds_art_ladot_inv.xml             2538            86400          1.76        105.75        2,538.00         926,370.00        2,779,110.00
vds_art_ladot_rt.xml               969               60        969.00     58,140.00    1,395,360.00     509,306,400.00    1,527,919,200.00
vds_fr_d7_inv.xml                  957            86400          0.66         39.88          957.00         349,305.00        1,047,915.00
vds_fr_d7_rt.xml                   361               30        722.00     43,320.00    1,039,680.00     379,483,200.00    1,138,449,600.00
Total KB from XML data           13980           864660      6,985.28    419,060.38   10,057,449.00   3,670,968,885.00   11,012,906,655.00

 

Variety (GPS, video, loop sensor, events)

Velocity

Volume

Traffic Data Lifecycle: Data Aggregator

An Exclusive Contract w LA-Metro
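The volume columns in the table above appear to follow from two inputs per feed: the sample file size and the cycle duration (how often a new file arrives). A minimal sketch of that arithmetic, assuming one file of the sample size is received every cycle and a 365-day year; the function name `volume_kb` and the dictionary keys are hypothetical:

```python
# Seconds in each reporting period used by the table.
SECONDS_PER_PERIOD = {"minute": 60, "hourly": 3600, "daily": 86400}


def volume_kb(file_size_kb: float, cycle_seconds: float) -> dict:
    """Estimated data volume in KB, assuming one file arrives every cycle."""
    volumes = {
        period: file_size_kb * seconds / cycle_seconds
        for period, seconds in SECONDS_PER_PERIOD.items()
    }
    volumes["annual"] = volumes["daily"] * 365      # assumption: 365-day year
    volumes["three_years"] = volumes["annual"] * 3
    return volumes


# signal_rt.xml from the table: a 2636 KB file every 45 seconds.
for period, kb in volume_kb(2636, 45).items():
    print(f"{period:>12}: {kb:,.2f} KB")
```

Summing the result over all feeds gives the kind of total shown in the last row of the table.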

(11)

11  

TransDec: Big data acquisition, storage & access

Pipeline stages: Input Traffic Data, Data Processing, Data Storage, Retrieval, Analysis & Visualization

Input traffic data:
• Highway (4313)
• Arterial (4780)
• Bus & Rail (2000)
• Ramp meter
• Events & CMS (800/day)

Diagram labels:
• Real-time Queries & Data Cleansing
• 46 MB/min
• 15 TB/Year
• 26 MB/min
• Spatiotemporal Indexing (Oracle Award, IEEE CloudCom Best paper)
• E.g., Accident impact analysis & prediction (ICDM'12 & 13)

[Figure labels: Event Location, Sensor 1, Sensor 2, Sensor 3, Sensor 4]

(12)

12  

OUTLINE

• Some Background
    Cloud Computing
• An Example
    IMSC's TransDec
• A Proposal
    USC DataLab

(13)

13  

Berkeley Data Analytics Stack -BDAS-

BDAS is an open source software stack that integrates open-source software components to make sense of Big Data.

A high-level overview of BDAS components:
• Data Processing
• Data Management
• Resource Management

(14)

14  

Berkeley Data Analytics Stack -BDAS-

BDAS More in Depth

• Numerous available open source packages for:
    - Machine Learning (MLlib)
    - Graph analysis (GraphX)
    - Real-time Analysis (Streaming)
• Big Data applications for various domains
• Flexible intercommunication between layers
• Unlimited expansion… Many more projects to come

 

 

(15)

15  

USC DataLab

Create a shared repository of USC data & code for research (on BDAS)

Example: Security-related Datasets
• CCTV videos from DPS
• Mobile videos from individuals
• Sensor readings from buildings, from Facility Management
• Crime reports from DPS
• Shuttle bus routes/locations from USC Transportation
• Security patrol car/ambassador locations from DPS
• Events from various sources
• Crowdsourced data from the USC community
• Shared software (for data analysis such as object recognition) from the USC community

(16)

16  

Backup  Slides  

(17)

17  

Cloud Computing - Amazon

• Virtual Machines
    28 different types of servers
• Big Data analysis for both offline and stream data
    Services: Elastic MapReduce, Kinesis, Redshift
• Scalable NoSQL databases
    Service: DynamoDB
• Traditional relational databases
    Service: RDS
• File and Object Storage

(18)

18  

Cloud Computing - Microsoft

• Virtual Machines
    18 different types of servers
• Big Data analysis for both offline and stream data
    Services: HDInsight
• Scalable NoSQL databases
    Service: Windows Azure Table
• Traditional relational databases
    Service: SQL Server
• File and Object Storage

(19)

19  

Cloud Computing - Google

• Virtual Machines
    15 different types of servers
• Big Data analysis for both offline and stream data
    Services: BigQuery, Hadoop
• Scalable NoSQL databases
    Service: Cloud Datastore
• Traditional relational databases
    Service: Google Cloud SQL
• File and Object Storage

(20)

20  

Berkeley Data Analytics Stack -BDAS-

BDAS Important Components

Mesos
• A cluster management layer
• Resource management and scheduling across entire datacenters and cloud environments

Spark
• An in-memory, distributed, fault-tolerant processing framework
• Enables data sharing, in contrast to MapReduce
• In-memory solution, much faster for tasks that bottleneck on disk I/O in MapReduce
• Multiple packages run on top of Spark Core (Spark SQL, Spark MLlib, Spark Streaming)

Tachyon
• Fault-tolerant, memory-centric distributed file system
• Caches working-set files in memory
• Avoids going to disk to load datasets that are frequently read
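The idea of keeping the working set in memory so that frequently read datasets do not go back to disk can be illustrated with a toy read-through cache. This is only a single-process sketch of the caching idea, not Tachyon's API or design; the class `WorkingSetCache`, its least-recently-used eviction policy, and the sample file name are all assumptions made for illustration.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path


class WorkingSetCache:
    """Toy read-through cache that keeps recently read files in memory."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.disk_reads = 0           # counts how often we actually hit the disk
        self._files = OrderedDict()   # path -> bytes, least recently used first

    def read(self, path: str) -> bytes:
        if path in self._files:
            self._files.move_to_end(path)   # mark as most recently used
            return self._files[path]
        data = Path(path).read_bytes()      # cache miss: load from disk
        self.disk_reads += 1
        self._store(path, data)
        return data

    def _store(self, path: str, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            return                          # too large to keep in memory
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._files.popitem(last=False)  # evict the oldest entry
            self.used_bytes -= len(evicted)
        self._files[path] = data
        self.used_bytes += len(data)


# Usage: the second read of the same (hypothetical) file is served from memory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sensor_readings.xml"
    sample.write_bytes(b"<readings/>")
    cache = WorkingSetCache(capacity_bytes=1024)
    first = cache.read(str(sample))
    second = cache.read(str(sample))

print("disk reads:", cache.disk_reads)
```

A distributed, fault-tolerant file system has to handle far more than this (replication or recomputation of lost data, sharing across machines); the sketch only shows why repeated reads become cheap once the data is held in memory.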

(21)

21  

Berkeley Data Analytics Stack -BDAS-

Installation Dependencies

BDAS can be installed on any cloud provider with
• Amazon Cloud
• Google Cloud
• Microsoft Azure Cloud
• Private Cloud
• HPC

Software Requirements
• Runs on both Windows and Unix-like systems (CentOS, RHEL, Mac OS)

Production Requirements
• Memory per machine – good behavior documented from 8GB to hundred
