
(1)

Introduction to Big Data Science

14th Period

Retrieving, Storing, and Querying Big Data

(2)

Contents

Retrieving Data from SNS

Introduction to Facebook APIs and Data Format

K-V Data Scheme on Hadoop

Storing and Querying Data on Hive

(3)

Distributed Objects

Objects that can communicate with objects in heterogeneous run-time environments

Distributed Objects Standard Protocol (e.g., JRMP)

Robust

Reliable

Transparent

Distributed Objects Technology

Multi-Platform

Transparent access to distributed objects

Language Neutral: RMI, CORBA, DCOM
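
The idea of transparent access can be illustrated with a small client-side proxy (a "stub"). This is only a sketch in plain Python, not RMI, CORBA, or DCOM: the `Stub` class, the message format, and the loopback transport are all hypothetical stand-ins for what a real middleware would provide.

```python
class Stub:
    """Client-side proxy: turns method calls into (method, args) messages."""

    def __init__(self, send):
        # `send` is a hypothetical transport: it takes a message and returns a reply
        self._send = send

    def __getattr__(self, name):
        # Called only for attributes the stub itself does not define
        def remote_call(*args):
            return self._send({"method": name, "args": list(args)})
        return remote_call


class Calculator:
    """The 'remote' object; in a real system it would live in another process."""

    def add(self, a, b):
        return a + b


def make_loopback(target):
    """Stand-in for the network: dispatches each message to a local object."""
    def send(message):
        return getattr(target, message["method"])(*message["args"])
    return send


calc = Stub(make_loopback(Calculator()))
result = calc.add(2, 3)  # looks like a local call, travels as a message
print(result)
```

The caller writes `calc.add(2, 3)` exactly as it would for a local object; only the transport would change if the target were on another machine.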

(4)

Java Remote Method Invocation

(RMI)

Lets a program use objects in remote, different run-time environments as if they were objects in the local run-time environment

Abstracts the low-level network code of a distributed network to give developers an environment where they focus on their

(5)

CORBA Contributions

CORBA addresses two challenges of developing distributed systems:

Making distributed application development no more difficult than developing centralized programs.

Easier said than done due to:

Partial failures

Impact of latency

Load balancing

Event ordering

Providing an infrastructure to integrate application components into a distributed system

(6)

Big Data Science 6

APIs on the Web

Web Service Standard: Recommended by W3C, Robust and Fast, but Not Easy to Use

Simple Object Access Protocol (SOAP)

Simple XML Message

Remote Procedure Call

Web Services Description Language (WSDL)

Specification of Web Service Function

Universal Description, Discovery, and Integration (UDDI)

(7)

APIs on the Web

RESTful Web API: Not Standardized by Any Authority, but Easy to Use

Representational state transfer (REST) is an architectural style consisting of a coordinated set of constraints applied to components, connectors, and data elements within a distributed hypermedia system.

REST ignores the details of component implementation and protocol syntax in order to focus on the roles of components, the constraints upon their interaction with other components, and their interpretation of significant data elements.

REST has been applied to describe desired web architecture, to identify existing problems, to compare alternative solutions, and to ensure that protocol extensions would not violate the core constraints that make the Web successful. Fielding used REST to design HTTP 1.1 and Uniform Resource Identifiers (URI).

The REST architectural style is also applied to the development of Web services as an alternative to other distributed-computing specifications such as SOAP.
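
To make the contrast with SOAP concrete, the sketch below expresses one lookup both ways. It sends nothing over the network. The `/users/{id}` path, the `/UserService` endpoint, and the `GetUser`/`UserId` element names are hypothetical; only the SOAP 1.1 envelope namespace is taken from the SOAP specification.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def rest_request(user_id):
    """REST style: the URI names the resource, the HTTP verb names the operation."""
    return ("GET", f"/users/{user_id}")


def soap_request(user_id):
    """SOAP style: the operation is described inside an XML envelope that is
    POSTed to a single service endpoint."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, "GetUser")          # hypothetical operation name
    ET.SubElement(call, "UserId").text = str(user_id)
    return ("POST", "/UserService", ET.tostring(envelope, encoding="unicode"))


print(rest_request(42))
print(soap_request(42)[2])
```

The REST request is fully described by its verb and URI, while the SOAP request needs an XML document to say the same thing, which is one reason the slide calls REST easier to use.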

(8)

Retrieving Data from SNS

Social Network Services (SNS) provide useful APIs for accessing their data.

Usually, they provide them in the form of Web APIs, Web programming interfaces, and smartphone SDKs.

It is almost impossible for us to retrieve all the data, but we can save what we need for a specific purpose to long-term big data storage.
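
The "save only what we need" idea might look like the following sketch. The record layout, the field names, and the JSON Lines file are assumptions made for illustration; a real pipeline would write to HDFS or a similar store rather than a temporary local file.

```python
import json
import os
import tempfile


def save_needed(records, wanted_fields, path):
    """Append only the wanted fields of each record to a JSON Lines file.

    Returns the number of records written."""
    kept = 0
    with open(path, "a", encoding="utf-8") as out:
        for rec in records:
            slim = {k: rec[k] for k in wanted_fields if k in rec}
            if slim:  # skip records that carry none of the wanted fields
                out.write(json.dumps(slim, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Hypothetical API response: posts carrying more fields than we want to keep
posts = [
    {"id": "1", "message": "hello", "likes": 3, "debug": "x"},
    {"id": "2", "message": "big data", "likes": 7},
    {"unrelated": True},
]
path = os.path.join(tempfile.mkdtemp(), "posts.jsonl")
kept = save_needed(posts, ["id", "message"], path)
print(kept, "records saved to", path)
```

One record per line keeps the file splittable, which suits line-oriented tools such as the Map-Reduce and Hive examples later in this lecture.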

(9)

Web APIs for Web and Several SNS

Twitter API

Google API

Facebook API

Graph API

Open Graph

Dialogs

Chat

Ads API

FQL

Localization and translation

Atlas API

Public Feed API

Keyword Insights API

(10)

Twitter API

REST API v1.1 Resources

Timelines

Collections of Tweets, ordered with the most recent first.

Tweets

The atomic building blocks of Twitter, 140-character status updates with additional associated metadata. People tweet for a variety of reasons about a multitude of topics.

Search

Find relevant Tweets based on queries performed by your users.

Streaming

Direct Messages

Short, non-public messages sent between two users. Access to Direct Messages is governed by the Application Permission Model.

(11)

Twitter API

Friends & Followers

Users follow their interests on Twitter through both one-way and mutual following relationships.

Users

Users are at the center of everything Twitter: they follow, they favorite, and tweet & retweet.

Suggested Users

Categorical organization of users that others may be interested to follow.

Favorites

Users favorite tweets to give recognition to awesome tweets, to curate the best of Twitter, to save for reading later, and a variety of other reasons. Likewise, developers make use of "favs" in many different ways.

(12)

Twitter API

Lists

Collections of tweets, culled from a curated list of Twitter users.

List timeline methods include tweets by all members of a list.

Saved Searches

Allows users to save references to search criteria for reuse later.

Places & Geo

Users tweet from all over the world. These methods allow you to attach location data to tweets and discover tweets & locations.

Trends

With so many tweets from so many users, themes are bound to arise from the zeitgeist. The Trends methods allow you to explore what's trending on Twitter.

Spam Reporting

These methods are used to report user accounts as spam.

(13)

Facebook APIs

Graph API

The Graph API is a simple HTTP-based API that gives access to the Facebook social graph, uniformly representing objects in the graph and the connections between them. Most other APIs at Facebook are based on the Graph API.

Open Graph

The Open Graph API allows apps to tell stories on Facebook through a structured, strongly typed API.

Dialogs

Facebook offers a number of dialogs for Facebook Login, posting to a person's timeline or sending requests.

Chat

You can integrate Facebook Chat into your Web-based, desktop, or mobile instant messaging products. Your instant messaging client connects to Facebook Chat via the Jabber XMPP service.

(14)

Facebook APIs

Ads API

The Ads API allows you to build your own app as a customized alternative to the Facebook Ads Manager and Power Editor tools.

FQL

Facebook Query Language, or FQL, enables you to use a SQL-style interface to query the data exposed by the Graph API. It provides for some advanced features not available in the Graph API such as using the results of one query in another.

Localization and translation

Facebook supports localization of apps. Read about the tools we provide.

Atlas API

The Atlas API provides you with programmatic access to the

(15)

Facebook APIs

Public Feed API

The Public Feed API lets you read the stream of public comments as they are posted to Facebook.

Keyword Insights API

The Keyword Insights API exposes an analysis layer on top of all Facebook posts that enables you to query aggregate, anonymous insights about people mentioning a certain term.

(16)

Facebook Query APIs: FQL

Facebook

(17)

Facebook Query APIs: FQL

Fields of

comment

(18)

Facebook APIs Running Example

Example

Runs the query "SELECT uid2 FROM friend WHERE uid1=me()"

https://developers.facebook.com/tools/explorer?method=GET&path=fql%3Fq%3DSELECT+uid2+FROM+friend+WHERE+uid1%3Dme%28%29

Read

You can issue an HTTP GET request to /fql?q=query, where query can be a single FQL query or a JSON-encoded dictionary of queries.

Query

Queries are of the form SELECT [fields] FROM [table] WHERE [conditions]. Unlike SQL, the FQL FROM clause can contain only a single table. You can use the IN keyword in SELECT or WHERE clauses to do subqueries, but the subqueries cannot reference variables in the outer query's scope. Your query must also be indexable, meaning that it queries properties that are marked as indexable in the documentation below.
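
The two request shapes described under "Read" can be sketched as URL construction. The code below only builds strings and sends no request. The `fql_url` helper and the `"TOKEN"` placeholder are hypothetical, and the `graph.facebook.com` host is assumed from the PHP example in this lecture.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

GRAPH_BASE = "https://graph.facebook.com"  # assumed host


def fql_url(query, access_token):
    """Build a GET URL for /fql?q=...

    `query` is either one FQL string or a dict of named queries, which is
    sent as a JSON-encoded dictionary (a multiquery)."""
    q = query if isinstance(query, str) else json.dumps(query)
    return GRAPH_BASE + "/fql?" + urlencode({"q": q, "access_token": access_token})


single = fql_url("SELECT uid2 FROM friend WHERE uid1=me()", "TOKEN")
multi = fql_url(
    {
        "all friends": "SELECT uid2 FROM friend WHERE uid1=me()",
        "my name": "SELECT name FROM user WHERE uid=me()",
    },
    "TOKEN",
)
print(single)
print(multi)
```

`urlencode` percent-encodes the query so that spaces become `+` and parentheses become `%28`/`%29`, the same encoding visible in the explorer URL above.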

(19)

FQL Example

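<?php
$app_id = 'YOUR_APP_ID';
$app_secret = 'YOUR_APP_SECRET';
$my_url = 'POST_AUTH_URL';

// isset() avoids an undefined-index notice on the first visit
$code = isset($_REQUEST["code"]) ? $_REQUEST["code"] : '';

// auth user: without a code, redirect to the OAuth dialog and stop
if (empty($code)) {
  $dialog_url = 'https://www.facebook.com/dialog/oauth?client_id='
    . $app_id . '&redirect_uri=' . urlencode($my_url);
  echo("<script>top.location.href='" . $dialog_url . "'</script>");
  exit;
}

// get user access_token
$token_url = 'https://graph.facebook.com/oauth/access_token?client_id='
  . $app_id . '&redirect_uri=' . urlencode($my_url)
  . '&client_secret=' . $app_secret . '&code=' . $code;

// response is of the format "access_token=AAAC..."
// parsing step assumed from that format; it is not shown on the slide
parse_str(file_get_contents($token_url), $token_params);
$access_token = $token_params['access_token'];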

(20)

FQL Example

// run fql query
$fql_query_url = 'https://graph.facebook.com/'
  . 'fql?q=SELECT+uid2+FROM+friend+WHERE+uid1=me()'
  . '&access_token=' . $access_token;
$fql_query_result = file_get_contents($fql_query_url);
$fql_query_obj = json_decode($fql_query_result, true);

// display results of fql query
echo '<pre>';
print_r("query results:");
print_r($fql_query_obj);
echo '</pre>';

// run fql multiquery
$fql_multiquery_url = 'https://graph.facebook.com/'
  . 'fql?q={"all+friends":"SELECT+uid2+FROM+friend+WHERE+uid1=me()",'
  . '"my+name":"SELECT+name+FROM+user+WHERE+uid=me()"}'
  . '&access_token=' . $access_token;
$fql_multiquery_result = file_get_contents($fql_multiquery_url);
$fql_multiquery_obj = json_decode($fql_multiquery_result, true);

// display results of fql multiquery
echo '<pre>';
print_r("multi query results:");
print_r($fql_multiquery_obj);
echo '</pre>';

(21)

Map-Reduce for Multiple Outputs

Parallel Execution of Map-Reduce Program

To select among several control flows in the Map operation we can use GenericOptionsParser, but that approach can severely degrade performance on big data.

MultipleOutputs offers a way around this: a single Map-Reduce job writes multiple sets of output data in parallel.

org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

Provides the function of creating multiple sets of output data.

Creates multiple OutputCollectors and sets the output path, output format, key type, and value type for each of them.

The data it creates is separate from the data the existing Map-Reduce program outputs.

When a Map-Reduce job finishes, the output data “part-r-nnnnn” is created in the Reduce stage.

If a programmer creates data in a directory “myfile” using MultipleOutputs, “part-r-nnnnn” and “myfile-r-nnnnn” are created at the same time.

(22)

Mapper Implementation for MultipleOutputs

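import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DelayCountMapperWithMultipleOutputs extends Mapper<LongWritable, Text, Text, IntWritable> {

  // map output value
  private final static IntWritable outputValue = new IntWritable(1);

  // map output key
  private Text outputKey = new Text();

  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // key is the line's byte offset; skipping offset 0 drops the first line (presumably a header)
    if (key.get() > 0) {
      String[] colums = value.toString().split(",");
      if (colums != null && colums.length > 0) {
        try {
          // Departure delay data output
          if (!colums[15].equals("NA")) {
            int depDelayTime = Integer.parseInt(colums[15]);
            if (depDelayTime > 0) {
              // Output key setting
              outputKey.set("D," + colums[0] + "," + colums[1]);
              // Output data creation
              context.write(outputKey, outputValue);
            } else if (depDelayTime == 0) {
              context.getCounter(DelayCounters.scheduled_departure).increment(1);
            } else if (depDelayTime < 0) {
              // The slide breaks off here; early_departure is assumed by analogy with early_arrival
              context.getCounter(DelayCounters.early_departure).increment(1);
            }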

(23)

Mapper Implementation for MultipleOutputs

          } else {
            context.getCounter(DelayCounters.not_available_departure).increment(1);
          }

          // Arrival delay data output
          if (!colums[14].equals("NA")) {
            int arrDelayTime = Integer.parseInt(colums[14]);
            if (arrDelayTime > 0) {
              // Output key setting
              outputKey.set("A," + colums[0] + "," + colums[1]);
              // Output data creation
              context.write(outputKey, outputValue);
            } else if (arrDelayTime == 0) {
              context.getCounter(DelayCounters.scheduled_arrival).increment(1);
            } else if (arrDelayTime < 0) {
              context.getCounter(DelayCounters.early_arrival).increment(1);
            }
          } else {
            context.getCounter(DelayCounters.not_available_arrival).increment(1);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  }
}

(24)

Reducer Implementation for MultipleOutputs

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DelayCountReducerWithMultipleOutputs extends Reducer<Text, IntWritable, Text, IntWritable> {

  private MultipleOutputs<Text, IntWritable> mos;

  // reduce output key
  private Text outputKey = new Text();

  // reduce output value
  private IntWritable result = new IntWritable();

  @Override
  public void setup(Context context) throws IOException, InterruptedException {
    mos = new MultipleOutputs<Text, IntWritable>(context);
  }

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Split by comma
    String[] colums = key.toString().split(",");
    // Output key setting
    outputKey.set(colums[1] + "," + colums[2]);
    // Departure delay
    if (colums[0].equals("D")) {
      // Delay count sum
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      // Output value setting
      result.set(sum);
      // Output data creation (the slide breaks off here; the "departure"
      // output name is assumed by analogy with "arrival")
      mos.write("departure", outputKey, result);

(25)

Reducer Implementation for MultipleOutputs

    // Arrival delay
    } else {
      // Delay count sum
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      // Output value setting
      result.set(sum);
      // Output data creation
      mos.write("arrival", outputKey, result);
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}
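
The data flow of the mapper and reducer can be traced without a cluster. The following is a plain-Python simulation, not Hadoop code: `map_record`, `reduce_all`, `make_row`, and the sample rows are hypothetical, and the counter names simply mirror the `DelayCounters` values used above (with `early_departure` assumed).

```python
from collections import Counter, defaultdict


def map_record(line, counters):
    """Mimic the mapper: emit ("D,year,month", 1) and ("A,year,month", 1)
    for delayed flights, and count the other cases."""
    cols = line.split(",")
    out = []
    # Column 15 holds the departure delay, column 14 the arrival delay
    for tag, idx, name in (("D", 15, "departure"), ("A", 14, "arrival")):
        if cols[idx] == "NA":
            counters["not_available_" + name] += 1
            continue
        delay = int(cols[idx])
        if delay > 0:
            out.append((f"{tag},{cols[0]},{cols[1]}", 1))
        elif delay == 0:
            counters["scheduled_" + name] += 1
        else:
            counters["early_" + name] += 1
    return out


def reduce_all(pairs):
    """Mimic the reducer: sum per key, then route each sum to a named
    output according to the key's leading tag."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    outputs = {"departure": {}, "arrival": {}}
    for key, total in sorted(sums.items()):
        tag, year, month = key.split(",")
        outputs["departure" if tag == "D" else "arrival"][f"{year},{month}"] = total
    return outputs


def make_row(year, month, arr_delay, dep_delay):
    """Build a 16-column CSV row with only the columns the mapper reads."""
    cols = ["0"] * 16
    cols[0], cols[1] = str(year), str(month)
    cols[14], cols[15] = str(arr_delay), str(dep_delay)
    return ",".join(cols)


rows = [make_row(2008, 1, 5, 10), make_row(2008, 1, -3, 0), make_row(2008, 2, "NA", 7)]
counters = Counter()
pairs = [pair for row in rows for pair in map_record(row, counters)]
outputs = reduce_all(pairs)
print(outputs)
print(dict(counters))
```

Both delay types are produced in one pass over the input, which is the point of using multiple outputs instead of running the job once per delay type.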

(26)

Hive Programming

Hive

Hive provides a means of running MapReduce jobs through a SQL-like scripting language, called HiveQL, that can be applied to summarization, querying, and analysis of large volumes of data.

Important differences from SQL

Table-generating function

Lateral view

Useful URLs

Hive

https://cwiki.apache.org/confluence/display/Hive/Home

Language Reference

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

(27)

Hive Programming

Workflow of Hive

Create Table

Load Data into HDFS/Hive

Query Data: Use HiveQL to query data

Table-generating functions

User-defined operations via external programs (TRANSFORM)

(28)

HiveQL

DDL Operation

Creating Hive Tables

hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Browsing through Tables

hive> SHOW TABLES;

hive> SHOW TABLES '.*s';

hive> DESCRIBE invites;

Altering and Dropping Tables

hive> ALTER TABLE events RENAME TO 3koobecaf;

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);

hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');

hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');

(29)

HiveQL

DML Operation

Loading data from flat files into Hive

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

SQL Operation

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;

hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;

hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';

(30)

HiveQL

GROUP BY, JOIN, STREAMING

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;

hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

Table example for Apache Weblog data

CREATE TABLE apachelog (
host STRING, identity STRING, user STRING, time STRING,
request STRING, status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
);
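
How the regular expression maps a log line onto the nine columns can be checked outside Hive. The pattern below is this sketch's reading of the `input.regex` value after string unescaping, the sample log line is made up, and whole-line matching is assumed to be what the SerDe does.

```python
import re

# input.regex after unescaping, written as a Python raw string (an assumption
# of this sketch, including the bracketed time group)
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

# Made-up line in Apache combined log format
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

match = LOG_PATTERN.fullmatch(line)  # the whole line must match
row = dict(zip(COLUMNS, match.groups()))
for name in COLUMNS:
    print(name, "=", row[name])
```

Each capture group becomes one column in declaration order, which is why every column in the table is declared as STRING: the quoted fields keep their surrounding quotes and the time keeps its brackets.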

(31)

Table Generating Functions

Functions that generate multiple rows from one

They allow a single row to expand into multiple rows

explode is one such example; it takes an array and generates a row for each item in the array (split is a function that splits a string into an array)

SELECT explode(split(line, ' ')) AS word FROM a_file;

TRANSFORM is a table-generating function that applies an external program (just like streaming)

SELECT TRANSFORM(column, …) USING 'command' AS column-alias, …;

explode(split(.)) equivalent using TRANSFORM

SELECT TRANSFORM(line) USING './ws.py' FROM a_file;

"ws.py"

import sys
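
The listing of ws.py stops after its first line. As a rough idea of what a script in that role could do, the sketch below reads lines and writes one word per line, which is the row-expanding behavior of explode(split(line, ' ')). The `split_words` function and the sample input are hypothetical and are not the original script.

```python
import io
import sys


def split_words(lines, out):
    """Write one output line per space-separated word of each input line."""
    for line in lines:
        # Split on single spaces to mirror split(line, ' ')
        for word in line.rstrip("\n").split(" "):
            out.write(word + "\n")


# Used as a streaming script, the call would be: split_words(sys.stdin, sys.stdout)
# Here the same function is run on in-memory text so the result can be inspected.
buffer = io.StringIO()
split_words(["retrieving big data\n", "hive\n"], buffer)
words = buffer.getvalue().splitlines()
print(words)
```

A streaming script receives its input rows on standard input and returns rows on standard output, so each printed word becomes one row of the query result.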
