
Big Data, Small Testing?


Academic year: 2021


(1)

Big Data, Small Testing?

Jayant Haritsa, Database Systems Lab, Indian Institute of Science

(2)

NYT Op-ed Article

[April 2014]

Eight (No, Nine!) Problems With Big Data

• Gary Marcus, Ernest Davis (NYU faculty)

“big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions”

Who’s Bigger? Where Historical Figures Really Rank (Book by MIT/Google: Hitler ranks higher than Aristotle!)

We need to ensure that Big Data does not wind up

(3)

Research Landscape

Current Focus:

Architecting the “plumbing” infrastructure for Big Data environments

• programming models, stream processing and summarization, sketching and approximation algorithms, storage architectures, cloud hosting, analytics, security …

These techniques are unlikely to work in practice

The elephant in the room is the lack of

(4)

Quotes

50% of our cost is on testing (QA)

(Bill Gates @ Opening of Gates Building)

Testing alone takes up six months of the 18-month product release cycle

(SAP Executive)

Estimated damage of 60 billion dollars per year in the USA caused by software bugs

(5)
(6)

1. UK Immigration [2013]

A Home Office text message campaign accusing people of being illegal immigrants has received numerous complaints after several people were contacted in error. Officials have sent messages to almost 40,000 people they suspect of not having a right to be in the UK, instructing them to contact border officials to discuss their immigration status.

Government commissioned Capita, the outsourcing company, to trace people believed to have outstayed their visas.

(7)

UK Immigration (contd)

• In a few months, Capita was accused of mishandling cases and getting just as mixed up as the bureaucrats it was supposed to be replacing!

• In November, Capita admitted a backlog of 150,000 notifications to foreign students that it hadn't been able to process, and so could not determine whether they should still be in the country.

In IT terms, it's been at the center of a billion dollar botched "e-borders" system, which has been missing deadlines and delivery dates since the middle of the last decade and which may not even be legal under European Union legislation!

(8)

2. Obama HealthCare.gov [2013]

• Severe problems were caused by unexpectedly high volume when the site drew 250,000 simultaneous users instead of the 50,000-60,000 expected. More than 8 million people visited the site from October 1 to 4. White House officials subsequently conceded that it was not just an issue of volume, but involved software and systems design issues. Also, stress tests done by the contractors one day before the launch date revealed that the site became too slow with only 1,100 simultaneous users!

• HealthCare.gov problems persisted even weeks after the launch. For example, a networking error at the related data services hub killed the website's functionality. This occurred the day after Health & Human Services head Kathleen Sebelius had highlighted the design of that data hub as a government success.

(9)

3. Flipkart → Flopkart

[Oct 6, 2014]

Deccan Herald: Big Apology Day follows Flipkart's Big Billion Day

– After its Big Billion Day on Monday, which fetched Flipkart.com $100 million by way of sales and the ire of hordes of angry customers who complained of technical glitches and false promises on discounts, the Bangalore-based online giant was quick to apologise for its drawbacks on Tuesday.

– “Though we saw unprecedented interest in our products and traffic like never before, we also realised that we were not adequately prepared for the sheer scale of the event. We didn't source enough products and deals in advance to cater to your requirements. To add to this, the load on our server led to intermittent outages, further impacting your shopping experience on our site,” the Bansals said.

– Noting that it took enormous effort from everyone at Flipkart, many months of preparation and pushing its “capabilities and systems to the limit” for the big day, the Bansals said that they were looking at deals and offers painstakingly put together for months.

(10)

Flipkart → Flopkart

[Oct 6, 2014]

• Price Changes

– Even as Flipkart prepares various deals and promotional pricing in the lead-up to the sale, the pricing of several products gets changed to non-discounted rates for a few hours.

• Out of stock

– The website ran out of stock for many products within a few minutes (and in some cases, seconds) of the sale going live. Most special deals were sold out as soon as they went live.

• Cancellations

– A large number of people bought specific products simultaneously. This led to some instances of orders getting overbooked for a product sold out just a few seconds ago.

• Website Issues

– Nearly 5000 servers were deployed and had prepared for 20 times the traffic growth. But the volume of traffic at different times of the day was much higher than this.

(11)
(12)

Software Mindset

Everybody loves writing code

Everybody hates testing it

More emphasis on developing new models than on evaluating current setups

Solution: automate the testing

(13)

Basic Question

How do you know the output delivered for the user objective is correct?

Checking is hard because of the magnitude of data involved and the complexity of the queries

(14)

Types of Errors

English-to-SQL translation errors

“Public demands change”

• Public is demanding change in society
• Public demands are changing over time
• Public is demanding loose change (coins)

Big problem (only about 40% are correct!)

Further, more than 80% are written correctly only after two to four attempts!

(15)

Types of Errors (contd)

Syntactic errors

– easy to check with automatic parser generators

Semantic errors

– Schema/type errors (easy to check from catalogs)

– Arithmetic errors (easy to check at runtime)

– Optimizer rewriting errors

   • e.g. infamous Count Bug [1986]

– Operator implementation errors

– Index maintenance errors

– Transaction management errors

   • e.g. ARIES checkpoint error
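To make the "optimizer rewriting error" category concrete, the following is a minimal sketch of the Count Bug using Python's built-in sqlite3 module. The tables (parts, supply), their contents, and the function name are hypothetical, and the "unnested" query is a hand-written stand-in for what a faulty rewrite would produce, not the output of any particular optimizer.

```python
import sqlite3

def count_bug_demo():
    """Run a nested COUNT query and a naive join-based rewrite of it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE parts  (pnum INTEGER, qoh INTEGER);
        CREATE TABLE supply (pnum INTEGER, quan INTEGER);
        INSERT INTO parts  VALUES (1, 2), (2, 0);   -- part 2 has no supply rows
        INSERT INTO supply VALUES (1, 10), (1, 20);
    """)
    # Correlated subquery: COUNT(*) over an empty set is 0, so part 2 qualifies.
    nested = """
        SELECT p.pnum FROM parts p
        WHERE p.qoh = (SELECT COUNT(*) FROM supply s WHERE s.pnum = p.pnum)
        ORDER BY p.pnum
    """
    # Naive rewrite: GROUP BY produces no group for part 2, so the join drops it.
    unnested = """
        SELECT p.pnum FROM parts p
        JOIN (SELECT pnum, COUNT(*) AS cnt FROM supply GROUP BY pnum) t
          ON t.pnum = p.pnum
        WHERE p.qoh = t.cnt
        ORDER BY p.pnum
    """
    nested_rows = [row[0] for row in conn.execute(nested)]
    unnested_rows = [row[0] for row in conn.execute(unnested)]
    conn.close()
    return nested_rows, unnested_rows

nested_rows, unnested_rows = count_bug_demo()
print("nested:", nested_rows, "rewritten:", unnested_rows)
```

The two queries look equivalent but disagree on the part that has no matching supply rows, which is the kind of semantic error that neither a parser nor a catalog check will catch.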

(16)

Library Approach

SQL test libraries designed by the engine developers or application specialists

Run regression tests on this workload
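The library approach can be sketched as a list of queries paired with their expected results, replayed against the engine after every change. This is only an illustration under stated assumptions: the table t, the test names, and the run_regression helper are hypothetical, and SQLite stands in for the engine under test.

```python
import sqlite3

# Hypothetical test library: (test name, SQL text, expected result rows).
TEST_LIBRARY = [
    ("count_rows", "SELECT COUNT(*) FROM t", [(3,)]),
    ("sum_values", "SELECT SUM(x) FROM t", [(6,)]),
    ("max_value", "SELECT MAX(x) FROM t", [(3,)]),
]

def run_regression(conn, library):
    """Replay every query and collect the ones whose output has changed."""
    failures = []
    for name, sql, expected in library:
        actual = conn.execute(sql).fetchall()
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")
failures = run_regression(conn, TEST_LIBRARY)
print("failures:", failures)
```

Such a harness only catches errors on the queries and data sizes that someone thought to include in the library.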

(17)
(18)

Test Environment

Underlying infrastructure is a hybrid of ETL/IR/KM/DB components

• e.g. IBM InfoSphere (DataStage, QualityStage, MDM, DB2, BigInsights, Metadata repository, …)

Need to test

• “functionality” (programs/data)

• “compilation” (query/model planning)

(19)

Sample Scenario

Wish to test “yottabyte” (10^24 byte) scale Big Data environment for InfoSphere

Metrics: Functionality, Correctness, Performance, Scalability

Impractical (time) or infeasible (space) to explicitly create and process test data
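A back-of-the-envelope calculation shows why materializing the test data is out of reach. The write rate (1 GB/s) and drive capacity (10 TB) below are assumptions chosen purely for illustration; they do not come from the talk.

```python
YOTTABYTE = 10**24                  # bytes
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

# Assumed figures, for illustration only.
ASSUMED_WRITE_RATE = 10**9          # 1 GB/s sustained by a single generator
ASSUMED_DRIVE_SIZE = 10**13         # 10 TB per drive

def years_to_write(total_bytes, bytes_per_second):
    """Time to generate the data sequentially, in years."""
    return total_bytes / bytes_per_second / SECONDS_PER_YEAR

def drives_needed(total_bytes, bytes_per_drive):
    """Number of drives required to hold the data (ceiling division)."""
    return -(-total_bytes // bytes_per_drive)

years = years_to_write(YOTTABYTE, ASSUMED_WRITE_RATE)
drives = drives_needed(YOTTABYTE, ASSUMED_DRIVE_SIZE)
print(f"about {years:.3g} years to write, {drives:.3g} drives to store")
```

Under these assumptions the generation time runs to tens of millions of years and the storage to a hundred billion drives, so even large parallelism does not make explicit creation practical.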

(20)

Pie-in-the-sky

A complete testing environment for Big Data management systems, wherein the entire data and meta-data is virtual or transient, supporting efficient evaluation

(21)
(22)

Our Approach

Build metadata construction tools that “fool” the underlying information systems into thinking that the data is actually present even though it has never been created or stored

Developed tool called CODD (Constructing Dataless Databases) for this purpose

• Edgar Codd, IBM, father of RDBMS / Turing awardee

(23)

CODD Metadata Processor

Easy-to-use graphical tool for the automated creation, verification, retention, scaling and porting of database meta-data configurations

Entirely written in Java (~50K LOC) and operational on industrial-strength db engines (DB2, Oracle, SQL Server, SQL-MX)

Released as free software after receiving copyright from the Indian government

In use at several industrial and academic research labs

(24)

Metadata Construction

Users can directly input statistics on:

• Relational Tables (row cardinality, row length, disk blocks)

• Attribute Columns (column width, number of distinct values, value distribution histograms)

• Attribute Indexes (number of leaf blocks, clustering factor)
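The three groups of statistics listed above could be held in a structure along the following lines. This is a sketch only: the class and field names are hypothetical and do not reflect CODD's internal representation or any engine's catalog schema.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnStats:
    name: str
    width_bytes: int
    distinct_values: int
    # Value distribution as (bucket upper bound, row count) pairs.
    histogram: list = field(default_factory=list)

@dataclass
class IndexStats:
    name: str
    leaf_blocks: int
    clustering_factor: float

@dataclass
class TableStats:
    name: str
    row_cardinality: int
    row_length_bytes: int
    disk_blocks: int
    columns: list = field(default_factory=list)
    indexes: list = field(default_factory=list)

    def declared_size_bytes(self):
        """Size the engine would infer from the statistics alone."""
        return self.row_cardinality * self.row_length_bytes

# A "dataless" table: the statistics describe 10^24 bytes, yet no rows exist.
orders = TableStats(
    name="orders",
    row_cardinality=10**21,
    row_length_bytes=1000,
    disk_blocks=10**24 // 8192,
    columns=[ColumnStats("order_id", width_bytes=8, distinct_values=10**21)],
    indexes=[IndexStats("orders_pk", leaf_blocks=10**18, clustering_factor=1.0)],
)
print(orders.declared_size_bytes())
```

The record itself occupies a few hundred bytes, which is the point of working at the metadata level.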

(25)

(26)

Metadata Validation

Need to ensure that the input information is

• Legal (valid type and range)

• Consistent (compatible with other metadata values)

Validation Approach

– Construct a directed acyclic constraint graph CG(V,E)

– V is the set of individual metadata entities while E is the set of statistical value dependencies

– Super Nodes: used to represent collapsed chain of nodes for compactness

– Run topological sort on CG to obtain CG_linear
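The validation approach above can be sketched with Python's standard graphlib module (Python 3.9+). The node names, legality rules, and dependency predicates are illustrative assumptions, not CODD's actual constraint graph, and super nodes are not modeled.

```python
from graphlib import TopologicalSorter

# Hypothetical legality constraints: one type/range check per metadata entity.
LEGALITY = {
    "table.row_cardinality": lambda v: isinstance(v, int) and v >= 0,
    "column.distinct_values": lambda v: isinstance(v, int) and v >= 0,
    "column.null_count": lambda v: isinstance(v, int) and v >= 0,
}

# Hypothetical dependency edges: (parent, child, consistency predicate).
DEPENDENCIES = [
    ("table.row_cardinality", "column.distinct_values", lambda card, ndv: ndv <= card),
    ("table.row_cardinality", "column.null_count", lambda card, nulls: nulls <= card),
]

def linearize():
    """Topologically sort the constraint graph so parents precede children."""
    predecessors = {node: set() for node in LEGALITY}
    for parent, child, _ in DEPENDENCIES:
        predecessors[child].add(parent)
    return list(TopologicalSorter(predecessors).static_order())

def validate(metadata):
    """Check legality first, then consistency against already-validated parents."""
    errors, legal = [], set()
    for node in linearize():
        if node not in metadata:
            continue
        value = metadata[node]
        if not LEGALITY[node](value):
            errors.append(f"{node}: illegal value {value!r}")
            continue
        legal.add(node)
        for parent, child, consistent in DEPENDENCIES:
            if child == node and parent in legal and not consistent(metadata[parent], value):
                errors.append(f"{node}: inconsistent with {parent}")
    return errors

print(validate({"table.row_cardinality": 100, "column.distinct_values": 500}))
```

Walking the nodes in linearized order means each value is compared only against parents that have already passed their own checks, so one bad input does not cascade into misleading errors further down.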

(27)

Constraint Graph

[DB2]

[Figure: constraint graph for DB2. Legend: legality constraints; statistical dependencies (direction chosen as per abstraction hierarchy); super nodes; dashed edges represent missing constraints.]

(28)

Unique features of CODD

Supports creation of arbitrary “what-if” scenarios

Carries out automatic validation of user input

Supports both space-based scaling and time-based scaling

Provides graphical histogram operations

Supports inter-engine metadata transfer

Successfully simulated yottabyte environment on a laptop

• Demonstrated deep bug in a popular commercial DBMS that only surfaces at Big Data scale
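As a rough illustration of what space-based scaling of metadata might involve, the sketch below multiplies the size-related table statistics by a scale factor while leaving the row layout untouched. This is one plausible reading, not CODD's actual algorithm; the dictionary keys and the scale_space function are hypothetical, and time-based scaling is not sketched.

```python
def scale_space(table_stats, factor):
    """Return a copy of table-level statistics scaled up by an integer factor."""
    if not isinstance(factor, int) or factor <= 0:
        raise ValueError("factor must be a positive integer")
    scaled = dict(table_stats)
    # Size-related statistics grow with the factor (an assumption of this sketch).
    scaled["row_cardinality"] = table_stats["row_cardinality"] * factor
    scaled["disk_blocks"] = table_stats["disk_blocks"] * factor
    # The row layout does not change, so row_length_bytes is left as-is.
    return scaled

# Hypothetical baseline statistics for a small table.
base = {"row_cardinality": 1_000_000, "row_length_bytes": 100, "disk_blocks": 12_500}
huge = scale_space(base, 10**16)
print(huge["row_cardinality"] * huge["row_length_bytes"])
```

With these made-up baseline numbers, a factor of 10^16 takes a 100 MB table to the yottabyte range without creating a single row.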

(29)

Take Away

Research on Automating Big Data Testing is great technical fun with immediate practical relevance ...
