• No results found

Big Data & the LAMP Stack: How to Boost Performance

N/A
N/A
Protected

Academic year: 2021

Share "Big Data & the LAMP Stack: How to Boost Performance"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data & the LAMP Stack:

How to Boost Performance

Jeff Kibler, Infobright Community Manager

(2)

Agenda

The ‘Machine-Generated

Data’ Problem

Benchmark: Methodology

& Results

The Infobright Approach

Zend Technology

(3)
(4)

The Machine-Generated Data Problem

Rate of Growth

Machine-generated/hybrid

data

Weblogs

Computer, network events

CDRs

Financial trades

Sensors, RFID etc

Online game data

Humangenerated data

-input from most

conventional transactions

Purchase/sale

Inventory

Manufacturing

Employment status

change

“Machine-generated data is the future of data management.”

(5)

Current Database Technology: Hitting the Wall

Today’s database technology requires huge effort and massive hardware

7%

21%

30%

33%

54%

66%

4%

32%

44%

60%

70%

75%

0%

20%

40%

60%

80%

100%

Don't Know / Unsure

Upgrade networking infrastructure

Archive older data on other systems

Upgrade/expand storage systems

Upgrade server hardware/processors

Tune or upgrade existing databases

High Growth

Low Growth

Source: KEEPING UP WITH EVER-EXPANDING ENTERPRISE DATA

By Joseph McKendrick, Research Analyst; Unisphere Research October 2010

How Analytic Performance Issues are Typically Addressed –

by Pace of Data Growth

(6)

What if you could achieve much

higher performance analytics with no

database tuning and much less

(7)
(8)

Benchmark Objective

Evaluate Analytic Performance of

Infobright/Zend versus MySQL/PHP

Software Evaluated

1.

Infobright Enterprise

Edition 3.5.2 + Zend

Server 5.0.4

2.

MySQL 5.0.77 + PHP 5

Parameters Evaluated

Loading Time

Compression

Monitoring Ease

(9)

Configurations

Hardware

CentOS5 64bit w/ 16GB Ram and 16 cores

MySQL

5.0.77 –

Indexed with Star Schema

PHP

5

Infobright EE

3.5.2 –

Single, flat table; no indexing

(10)

National Climatic Data Center (NCDC)

Sensor Data – Global Daily Weather Reports – 1929 to 2011

Publicly available -

ftp://ftp.ncdc.noaa.gov/pub/data/gsod/

World’s Largest Archive of Climate Data

Machine-generated Data Used

(11)

Data Load Time and Compression

Bulk Loaders: LOAD DATA INFILE

Load Time

Raw Size

Compression

DB Size

Infobright

35 Min

20.71 GB

20.5 : 1

1.01 GB

MySQL

100 Min

12.37 GB

0.92 : 1

13.47 GB

(12)

Monitoring

During Test, Extensive

Use of:

Query Execution Warnings

- Severe Slow Request

Execution

Error/Server/Access Logs

- Fatal PHP Errors

(Development)

Code Tracing

I found ZendServer to be extremely easy to use and to configure. Its

functionality allows for easy debugging and monitoring of some tough queries

we threw at the db. Plus, Zend Framework helped me get the code up and

running very quickly.”

(13)

Baseline: Nine Non-Analytic, Simplistic Queries

Queries - Baseline

Baseline Queries

MyISAM (MySQL) Brighthouse (Infobright)

1 SELECT COUNT(*) FROM reports; SELECT COUNT(*) FROM reports;

2 SELECT station, COUNT(*) FROM reports GROUP BY station; SELECT Station, COUNT(*) FROM reports GROUP BY station;

3 SELECT COUNT(DISTINCT station) FROM stations; SELECT COUNT(DISTINCT station) FROM reports;

4 SELECT COUNT(DISTINCT fips_id) FROM stations; SELECT COUNT(DISTINCT fips_id) FROM reports;

5 SELECT COUNT(DISTINCT country_name) FROM countries; SELECT COUNT(DISTINCT country_name) FROM reports;

6 SELECT DISTINCT full_date FROM reports; SELECT DISTINCT full_date FROM reports;

7 SELECT DISTINCT full_date, COUNT(*) FROM reports GROUP BY full_date ORDER BY full_date DESC;

SELECT DISTINCT full_date, COUNT(*) FROM reports GROUP BY full_date ORDER BY full_date DESC

8 SELECT MAX(max_temp) FROM reports; SELECT MAX(max_temp) FROM reports;

(14)

Simple #1: Reports a descending list of snow depths from all stations who reported a snow depth amount

on a given date.

Medium #1: Reports a list of station names (and the station's country) with the average precipitation and

total precipitation calculated from all days that fall within a given range of dates as well as having

reported snow on the same day. Only stations located above the 5000ft elevation mark are considered

in this query.

Medium #2: Reports the maximum max temperature, maximum precipitation, maximum wind speed,

maximum wind gust, maximum standard deviation of the maximum temperature, and the total number of

tornadoes reported for the state of Illinois for a given date range.

Complex #1:Reports the country or countries with the maximum standard deviation in temperature for

counties between the -20000 and 20000 latitude between a given range of dates.

Complex #2: Reports the station, country name, precipitation total, and a description of the precipitation

report for all reports between the given date where the country name begins with “United'.

Complex #3: Reports the station, country name, precipitation total, and a description of the precipitation

report for all reports between the given date where the country name contains “island”. Similar to Q5,

except the pattern matching on the country name is done on the end of the string versus the front.

(15)

Query Performance

Base #1 Base #2 Base #3 Base #4 Base #5 Base #6 Base #7 Base #8 Base #9 Simple #1 Med. #1 Med. #2 Comp. #1 Comp. #2 Comp. #3 Infobright/Zend 0.002 17.33 9.622 22.99 64.70 14.97 18.26 0.026 0.662 0.050 0.120 11.11 0.487 5.597 4.838 MySQL/PHP 0.0001 41.145 0.0294 0.0148 0.0004 139.3 106.58 53.697 53.281 53.573 53.296 88.598 0 2678.9 87.582 0.00 20.00 40.00 60.00 80.00 100.00 120.00 140.00 T im e (seco n d s)

IB/Zend and MySQL/PHP Performance

Baseline 3-5: MySQL/PHP pulling from dimension table; Infobright/Zend pulling from fact. Baseline speeds can be drastically improved with addition of simple

(16)

Expedite Development Capabilities

Monitor LAMP Performance

20X Stronger Data Compression*

280% Faster Data Loads*

Superior Analytic Query Performance

*

Results from this benchmark

Conclusions

(17)
(18)

What Infobright Delivers

High performance with much less work and lower cost

Faster queries

without the

work

• No indexes

• No projections

or cubes

• No data

partitioning

• Faster ad-hoc

analytics

Fast load / High

compression

• Multi-machine

Distributed

Load Processor

• 10:1 to 40:1+

compression

Low cost

• Less storage

and servers

• Low-cost

subscriptions

• 90% less

administration

Fast time to

production

• Download in

minutes

• Minimal

configuration

• Implement in

days

(19)

Column vs. Row Orientation - Use Cases

Row Oriented works if…

All the columns are needed

Transactional processing is required

Column Oriented works if…

Only relevant columns are needed

Reports are aggregates (sum, count, average, etc.)

Benefits

Very efficient compression

Faster results for analytical queries

Reading column takes similar CPU resources as

reading a row

Column-Based Storage

ID Job Dept City

# # # # # #

1 Shipping Operations Toronto 2 Receiving Operations Toronto 3 Accounting Finance Boston

1 Shipping Operations Toronto

2 Receiving Operations Toronto 3 Accounting Finance Boston

(20)

Data Packs and Compression

64K

64K

64K

64K

Data Packs

Each data pack contains 65,536 data values

Compression is applied to each individual data pack

The compression algorithm varies depending on data

type and distribution

Compression

Results vary depending on the

distribution of data among data packs

A typical overall compression ratio

seen in the field is 10:1

Some customers have seen results of

40:1 and higher

For example, 1TB of raw data

compressed 10 to 1 would only require

100GB of disk capacity

Patent-Pending

Compression

(21)

Intelligence Not Hardware

• Stores it in the Knowledge Grid (KG)

• KG is loaded into memory

• Less than 1% of compressed data size

Creates information

(metadata) about the

data upon load,

automatically

• The less data that needs to be accessed, the

faster the response

• Sub-second responses when answered by the KG

Uses the metadata when

processing a query to

eliminate / reduce need

to access data

• No need to partition data, create/maintain

indexes, projections or tune for performance

• Ad-hoc queries are as fast as static queries, so

users have total flexibility

(22)

The Knowledge Grid

Knowledge Grid

applies to the whole table

Column A

Column B

DP1 DP2 DP3 DP4 DP5 DP6

Information about the data

Knowledge Nodes

built for each Data Pack

Dynamic knowledge

Global knowledge

String and character data

Numeric data

Distributions

Built during

LOAD

Built per query

E.g. for

aggregates, joins

DP1

Column A

Knowledge Nodes answer the query directly, or

Identify only required Data Packs, minimizing decompression, and

(23)

Granular Engine

Q: How are my

sales doing this

year?

Query

Knowledge Grid

Results

Compressed Data

1%

1. Query received

2. Engine iterates on Knowledge Grid

3. Each pass eliminates Data Packs

(24)

1 Month Report

(15MM events)

43 min with SQL

Server

23 seconds

Alternative

Oracle Query Set

10 seconds – 15

minutes

0.43 – 22

seconds

Alternative

BI Report

7 hours in

Informix

17 seconds

Alternative

Data Load

11 hours in

MySQL ISAM

11 minutes

Alternative

Fast query response with no tuning or indexes

Infobright Customer Performance Statistics

Analytic Queries

2+ hours with

MySQL

<10 seconds

Alternative

(25)

Infobright and MySQL

Complementary Technologies

Provides a simple

scalability path for MySQL

users and OEMs

No new management

interface to learn

MySQL integration

enables seamless

connectivity to BI tools and

MySQL drivers for ODBC,

JDBC, C/C++, .NET, Perl,

Python, PHP, Ruby, Tcl, etc.

Infobright is architected on MySQL,

“the world’s most

(26)

Analytic Use Cases

Marketing/

Advertising

Mobile analytics

Behavioral analysis

Web analytics

Telecom:

Network

performance /

management

CDR analytics

Capital markets

Stock analytics

Enterprise network

and systems

analysis

Security (SIEM)

High performance

queries/reporting

within ISV / SaaS

application

Online

Analytics

Vertical

Log / Events

OEM/

(27)
(28)

The Zend Solution

OPERATING SYSTEM

Zend

Framework

Integration

Zend Server

Integration

Eclipse-based

CLOUD SERVICES LAYER

Cloud Services

SimpleCloud

Web Services

FRAMEWORK SERVICES LAYER

MVC

Security

Mobile

RIA

Auth

Data Access

RUNTIME SERVICES LAYER

Session

Clustering

Application

Monitoring

Cluster

Management

Configuration

Management

Enterprise

PHP

Application

Deployment

(29)
(30)

Zend Technologies

Zend.com

Download Zend Server and Zend Studio

Watch recorded webinars

Get training

Infobright

infobright.org

Infobright.org

Download ICE (Infobright Community Edition)

Join the forums and learn from the experts!

Infobright.com

Download a free trial of Infobright Enterprise

Edition, IEE

Get Started

Questions

• Jeff.kibler@infobright.com

• Kevin@zend.com

Social Media

twitter.com/zend

twitter.com/infobright

www.youtube.com/infobrightdb

www.facebook.com/ZendTechnologies

References

Related documents

Alternatively, CLNs may be issued by an SPV that holds collateral securities (usually government securities, repurchase agreements on government securities, or high quality

The recently published ISO/IEC 29110 standard Lifecycle profiles for Very Small Entities has at its core a Management and Engineering Guide [1] which are

In the following we first take a closer look at the cost models using the experience curve effect for both single development and SPLE, and then extend these cost models to provide

a) SELECT * FROM Persons SORT BY ‘FirstName’ DESC b) SELECT * FROM Persons ORDER FirstName DESC c) SELECT * FROM Persons SORT ‘FirstName’ DESC d) SELECT * FROM Persons ORDER

116 Potomac Valley Audubon Society Effort. Year Count

It is evident from the above findings that hair during pregnancy remain in anagen phase where hair follicles have profuse blood supply (abundant supply of required

Prime Minister Gordon Brown, Launching the social enterprise action plan (as Chancellor of. the Exchequer)

Note: if you would like to count the number of unique values in a column, as opposed to the number of times a value appears in the data, be sure to select Distinct Count as