• No results found

E6895 Advanced Big Data Analytics Lecture 4:! Data Store

N/A
N/A
Protected

Academic year: 2021

Share "E6895 Advanced Big Data Analytics Lecture 4:! Data Store"

Copied!
34
0
0

Loading.... (view fulltext now)

Full text

(1)

© CY Lin, 2015 Columbia University E6895 Advanced Big Data Analytics — Lecture 4

E6895 Advanced Big Data Analytics Lecture 4:

! Data Store

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center

(2)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

2

Reference

(3)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

3

Spark SQL

(4)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

4

Spark SQL

(5)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

5

Apache Hive

(6)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

6

Using Hive to Create a Table

(7)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

7

Creating, Dropping, and Altering DBs in Apache Hive

(8)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

8

Another Hive Example

(9)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

9

Hive’s operation modes

(10)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

10

Using HiveQL for Spark SQL

(11)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

11

Hive Language Manual

(12)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

12

Using Spark SQL — Steps and Example

(13)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

13

Query testtweet.json

Get it from Learning Spark Github ==> https://github.com/databricks/learning-spark/tree/master/files

(14)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

14

SchemaRDD

(15)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

15

Row Objects

Row objects represent records inside SchemaRDDs, and are simply fixed-length arrays of fields.

(16)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

16

Types stored by Schema RDDs

(17)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

17

Look at the Schema

(not a complete screen shot)

(18)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

18

Another way to create SchemaRDD

(19)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

19

JDBC Server

Spark SQL provides JDBC connectivity, which is useful for connecting business intelligence

tools to a Spark cluster and for sharing a cluster across multiple users.

(20)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

20

User-Defined Functions (UDF)

UDFs allow you to register custom functions in Python, Java, and Scala to call within SQL.

!

This is a very popular way to expose advanced functionality to SQL users in an organization,

so that these users can call into it without writing code.

(21)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

21

RDF and SPARQL

(22)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

22

Spark Streaming

In Spark 1.1, Spark Streaming is available only in Java and Scala.

Spark 1.2 has limited Python support.

(23)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

23

Spark Streaming architecture

(24)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

24

Spark Streaming with Spark’s components

(25)

© 2015 CY Lin, Columbia University E6895 Advanced Big Data Analytics – Lecture 4: Data Store

25

Try these examples

(26)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

26

RDF and SPARQL

(27)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

27

RDF and SPARQL

(28)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

28

Resource Description Format (RDF)

A W3C standard sicne 1999

Triples

Example: A company has nince of part p1234 in stock, then a simplified triple rpresenting this might be {p1234 inStock 9}.

Instance Identifier, Property Name, Property Value.

In a proper RDF version of this triple, the representation will be more formal. They

require uniform resource identifiers (URIs).

(29)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

29

An example complete description

(30)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

30

Advantages of RDF

Virtually any RDF software can parse the lines shown above as self-contained, working data file.

You can declare properties if you want.

The RDF Schema standard lets you declare classes and relationships between properties and classes.

The flexibility that the lack of dependence on schemas is the first key to RDF's value.

!

Split trips into several lines that won't affect their collective meaning, which makes sharding of data collections easy.

Multiple datasets can be combined into a usable whole with simple concatenation.

!

For the inventory dataset's property name URIs, sharing of vocabulary makes easy to

aggregate.

(31)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

31

SPARQL – Query Langauge for RDF

The following SPQRL query asks for all property names and values associated with the

fbd:s9483 resource:

(32)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

32

The SPAQRL Query Result from the previous example

(33)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

33

Another SPARQL Example

What is this query for?

Data

(34)

© 2014 CY Lin, Columbia University E6893 Big Data Analytics – Lecture 9: Linked Big Data: Graph Computing

34

Open Source Software – Apache Jena

References

Related documents

- Predictive Analytics 101 Operationalizing Big Data - Reducing Client Incidents through Big Data Predictive - Welcome to the Era of Big Data and Predictive Analytics in - Unleash

- Predictive Analytics 101 Operationalizing Big Data - Reducing Client Incidents through Big Data Predictive - Welcome to the Era of Big Data and Predictive Analytics in - Unleash

Source: IDC MaturityScape Benchmark – Big Data and Analytics in Government in North America, IDC GI 247362, March 2014... Big Data and Analytics capabilities represent a mix of talent,

The site is all about Big Data « 7 Tips to Succeed with Big Data in 2014 Java and Analytics the next frontier » Univa, MapR Partner Over Enterprise- Grade Hadoop Management 1

IBM Business Analytics software provides organizations with new ways to combine the analysis of big data with traditional analytics, particularly in three areas: accessing big data,

Oracle Big Data Connectors Oracle Data Integrator In -Data bas e A nal y tics Data Warehouse Oracle Advanced Analytics Oracle Database... Copyright © 2012, Oracle and/or

Oracle Big Data Management System SOUR CE S Oracle Database Oracle Industry Models Oracle Advanced Analytics Oracle Spatial

value from health information used in health care centers using a new information management approach called as big data analytics .Including big data analytics