An analysis of Big Data ecosystem from an HCI perspective.

Jay Sanghvi
Rensselaer Polytechnic Institute
For: Theory and Research in Technical Communication and HCI, Rensselaer Polytechnic Institute
Wednesday, December 5th 2012

Abstract

The potential benefits of Big Data are practical and significant, and some initial benefits have already been achieved. However, many technical and people-related challenges must still be addressed to fully exploit its potential. The size of the data is the most obvious challenge, but there are others: heterogeneity in data types, in representation, and in semantic interpretation, and the rate at which the data needs to be processed. Privacy and usability are further important aspects. This paper presents these challenges from an HCI perspective.

Table of Contents

1. What is Big Data
2. Applications and Benefits
3. Data Analysis Pipeline
4. Challenges
   4.1 Fundamental challenges
       Volume
       Velocity
       Variety
   4.2 Technology related challenges
       Technology usability
       Application acumen
       Provenance
       Annotations
       Cloud
       Visualization
   4.3 People related challenges
       Data ownership
       Ethics
       Privacy
5. Conclusion

1. What is Big Data

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big data is data that exceeds the processing capacity of conventional information systems. The data is too big, moves too fast, or doesn’t fit the strictures of conventional information architectures. To gain value from this data, we must choose an alternative way to process it.

2. Applications and Benefits

Scientific research has been revolutionized by Big Data. The field of astronomy is being transformed from one where taking pictures of the sky was a large part of an astronomer’s job to one where the pictures are all in a database already and the astronomer’s task is to find interesting objects and phenomena in the database.

Big Data has the potential to revolutionize not just research, but also education. A recent detailed quantitative comparison of different approaches taken by 35 charter schools in NYC found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction.

It is widely believed that the use of information technology can reduce the cost of healthcare while improving its quality, by making care more preventive and personalized and basing it on more extensive (home-based) continuous monitoring. McKinsey estimates savings of 300 billion dollars every year in the US alone.

Similarly, strong cases have been made for the value of Big Data in urban planning, intelligent transportation, environmental modeling, energy saving, smart materials, computational social sciences, financial systemic risk analysis, homeland security, computer security, and so on.

In 2010, enterprises and users stored more than 13 exabytes of new data; this is over 50,000 times the data in the Library of Congress. According to a recent McKinsey report, the potential value of global personal location data is estimated to be $700 billion to end users, and it can result in a decrease of up to 50% in product development and assembly costs. McKinsey predicts an equally great effect of Big Data on employment: 140,000 to 190,000 workers with “deep analytical” experience will be needed in the US, and a further 1.5 million managers will need to become data-literate. Not surprisingly, the recent PCAST report on Networking and IT R&D identified Big Data as a “research frontier” that can “accelerate progress across a broad range of priorities.” Even popular news media now appreciate the value of Big Data, as evidenced by coverage in the Economist, the New York Times, and National Public Radio.
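As a rough check on the storage comparison above, the two figures together imply an upper bound on the size of the Library of Congress collection. The short calculation below assumes decimal units (1 exabyte = 10^18 bytes); the variable names are illustrative only.

```python
# Back-of-the-envelope check of "13 exabytes is over 50,000 times the Library of Congress".
# Assumption: decimal units (1 EB = 10**18 bytes, 1 TB = 10**12 bytes).
EXABYTE = 10**18
TERABYTE = 10**12

new_data_bytes = 13 * EXABYTE   # new data stored in 2010, per the text
ratio = 50_000                  # "over 50,000 times" the Library of Congress

implied_loc_bytes = new_data_bytes / ratio
implied_loc_terabytes = implied_loc_bytes / TERABYTE

print(f"Implied Library of Congress size: at most about {implied_loc_terabytes:.0f} TB")
```

Under these assumptions the comparison implies a collection of at most roughly 260 terabytes.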

3. Data Analysis Pipeline

3.1 Data Acquisition and Recording

Data is recorded from a data-generating source such as human interaction, human behaviour, business transactions, nature, or scientific experiments and simulations. Such sources can easily and continuously produce petabytes of data today.

3.2 Information Extraction and Cleaning

Almost always, the information collected is not in a format ready for analysis. Consider, for example, pictures that capture deep outer space: we cannot leave the data in image form and still analyze it effectively. Rather, we require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Doing this correctly and completely is a continuing technical challenge. Note that this data also includes images and will in the future include video; such extraction is often highly application dependent.

Often the instruments used for data capture are biased under certain conditions. For example, pictures of deep space taken by a space telescope while it was in the radiation field of a meteor or another star may have been affected in a particular way. This makes it imperative to clean the data.
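The extraction and cleaning steps can be sketched in a few lines. The example below is a minimal illustration, not a description of any real instrument pipeline: the raw line format, the `Observation` record, and the radiation threshold are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical raw log format: "<object id> mag=<brightness> radiation=<level>"
LINE_PATTERN = re.compile(
    r"^(?P<obj>\S+)\s+mag=(?P<mag>-?\d+(?:\.\d+)?)\s+radiation=(?P<rad>\d+(?:\.\d+)?)$"
)

# Assumed threshold above which the instrument is considered biased.
MAX_TRUSTED_RADIATION = 5.0


@dataclass
class Observation:
    object_id: str
    magnitude: float
    radiation: float


def extract(line: str) -> Optional[Observation]:
    """Turn one raw line into a structured record, or None if it is malformed."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return Observation(match["obj"], float(match["mag"]), float(match["rad"]))


def clean(observations):
    """Keep only observations recorded under trusted instrument conditions."""
    return [o for o in observations if o.radiation <= MAX_TRUSTED_RADIATION]


raw_lines = [
    "NGC1300 mag=10.4 radiation=0.8",
    "M31 mag=3.4 radiation=9.7",   # recorded in a strong radiation field
    "corrupted ### line",          # cannot be parsed
]

extracted = [obs for obs in map(extract, raw_lines) if obs is not None]
cleaned = clean(extracted)
print([o.object_id for o in cleaned])
```

Extraction turns free-form input into typed records; cleaning then discards the records captured under conditions where the instrument is assumed to be unreliable.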

3.3 Data Integration, Aggregation, and Representation

Given the heterogeneity of big data (in terms of what they represent, format, granularity, semantic interpretation and intent), it is not enough merely to record it and store it in a repository. We need to make sure that it is discoverable and make an effort to get the larger community to use it. Adequate annotations do help, but integration and aggregation remain challenging due to differences in experimental details and in the record structure of two or more data sets.

Data analysis is significantly more challenging than just locating, identifying, understanding, and citing data. For effective large-scale analysis, all these steps need to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable and then “robotically” resolvable, that is, resolved by an error-free method that does not depend on any particular data structure.
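One way to make structural and semantic differences "computer understandable" is to describe each source declaratively and let a generic routine apply the description. The sketch below assumes two hypothetical laboratory data sets with different field names and temperature units; the `MAPPINGS` table and function names are illustrative.

```python
# Two hypothetical data sets describing the same kind of measurement,
# recorded with different field names and units.
lab_a = [{"sample": "s1", "temp_c": 21.5}, {"sample": "s2", "temp_c": 19.0}]
lab_b = [{"id": "s3", "temperature_f": 68.0}]

# Machine-readable description of how each source maps onto a common schema:
# target field -> (source field, conversion function).
MAPPINGS = {
    "lab_a": {
        "sample_id": ("sample", str),
        "temperature_c": ("temp_c", float),
    },
    "lab_b": {
        "sample_id": ("id", str),
        "temperature_c": ("temperature_f", lambda f: (f - 32.0) * 5.0 / 9.0),
    },
}


def to_common_schema(record, mapping):
    """Apply a declarative mapping to one record."""
    return {
        target: convert(record[source])
        for target, (source, convert) in mapping.items()
    }


def integrate(sources):
    """Merge all sources into one list of records in the common schema."""
    merged = []
    for name, records in sources.items():
        for record in records:
            row = to_common_schema(record, MAPPINGS[name])
            row["source"] = name  # remember where each row came from
            merged.append(row)
    return merged


integrated = integrate({"lab_a": lab_a, "lab_b": lab_b})
for row in integrated:
    print(row)
```

Adding a new source then means adding one entry to the mapping table rather than writing new integration code, which is the kind of automation the paragraph above calls for.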

Analysis is not simple even when there is only one data set involved. There are many alternative ways to store the same information. Certain database designs have advantages over others for certain purposes, and possibly drawbacks for other purposes. Database design expertise is limited to a few qualified professionals. There are no tools or frameworks that enable other professionals, such as domain scientists, to create effective database designs.

3.4 Query Processing, Data Modeling, and Analysis

This phase involves retrieving target data from heterogeneous, interrelated, redundant data sources, mining big data, cross-checking conflicting cases, validating trustworthy relationships, identifying inherent clusters, and uncovering hidden relationships and models.
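Cross-checking conflicting cases across redundant sources can be illustrated with a simple majority vote. The vote is an assumption made for this sketch, not a method this paper prescribes, and the source names and values are hypothetical.

```python
from collections import Counter

# Hypothetical redundant sources reporting the same attribute for the same entities.
reports = {
    "source_1": {"alice": "NY", "bob": "CA"},
    "source_2": {"alice": "NY", "bob": "TX"},
    "source_3": {"alice": "NJ", "bob": "TX"},
}


def cross_check(reports):
    """Return (resolved values, conflicts) using a majority vote per entity."""
    votes = {}
    for source_values in reports.values():
        for entity, value in source_values.items():
            votes.setdefault(entity, Counter())[value] += 1

    resolved, conflicts = {}, {}
    for entity, counter in votes.items():
        value, _count = counter.most_common(1)[0]
        resolved[entity] = value
        if len(counter) > 1:
            conflicts[entity] = dict(counter)  # keep the disagreement for review
    return resolved, conflicts


resolved, conflicts = cross_check(reports)
print(resolved)
print(conflicts)
```

Keeping the conflicting votes alongside the resolved value lets a human analyst review the cases where the sources disagree.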

Big Data enables the next generation of interactive data analysis with real-time answers. Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today.

3.5 Interpretation

Analyzing Big Data is of limited value if users cannot understand the result. An expert decision-maker, provided with the result of analysis, has to interpret it. This interpretation involves examining all the assumptions made and retracing the analysis. There are also many possible sources of error: bugs in computer systems, assumptions built into the data models, and results based on erroneous data. For all of these reasons, no responsible user will cede authority to the computer system. The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence: rather than accept the stated solvency of a financial institution at face value, a decision-maker has to examine critically the many assumptions at multiple stages of analysis.

Hence, it is not enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived and based upon precisely what inputs, i.e. the provenance of the (result) data.

Systems with a rich variety of visualizations are important for conveying the results of queries in the way that is best understood by a particular set of people. Results need to be presented using powerful visualizations that assist interpretation and support user collaboration.

4. Challenges

Almost all the challenges for Big Data development and adoption are due to its three fundamental dimensions: Volume, Velocity and Variety.

4.1 Fundamental challenges

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information per day.

● Turn 12 terabytes of Tweets created each day into improved product sentiment analysis

● Convert 350 billion annual meter readings to better predict power consumption
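To give a feel for what these volumes mean as sustained load, the two figures above can be converted into per-second rates. The calculation assumes decimal units (1 terabyte = 10^12 bytes) and a 365-day year; the variable names are illustrative.

```python
# Sustained rates implied by the two Volume examples.
# Assumptions: decimal units (1 TB = 10**12 bytes) and a 365-day year.
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

tweet_bytes_per_day = 12 * 10**12
meter_readings_per_year = 350 * 10**9

megabytes_per_second = tweet_bytes_per_day / SECONDS_PER_DAY / 10**6
readings_per_second = meter_readings_per_year / SECONDS_PER_YEAR

print(f"Tweets: about {megabytes_per_second:.0f} MB/s, sustained")
print(f"Meter readings: about {readings_per_second:.0f} per second, sustained")
```

Under these assumptions the Tweet stream averages roughly 139 MB every second, and the meter readings arrive at about 11,100 per second, around the clock.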

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

● Scrutinize 5 million trade events created each day to identify potential fraud

● Analyze 500 million daily call detail records in real-time to predict customer churn faster
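Using data "as it streams in" means examining each event once, on arrival, instead of storing everything and querying later. The sketch below illustrates the idea with a sliding time window per account. The window length, the threshold, and the rule that many events in a short time are suspicious are assumptions made for illustration, not a real fraud model.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # assumed window length
MAX_EVENTS_IN_WINDOW = 3   # assumed threshold for "suspicious"


def flag_suspicious(events):
    """Yield (timestamp, account) whenever an account exceeds the threshold.

    `events` is an iterable of (timestamp_seconds, account) pairs in time order,
    so each event is examined exactly once, as it arrives.
    """
    recent = defaultdict(deque)  # account -> timestamps inside the current window
    for timestamp, account in events:
        window = recent[account]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while window and window[0] <= timestamp - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_EVENTS_IN_WINDOW:
            yield timestamp, account


stream = [(0, "A"), (5, "B"), (10, "A"), (20, "A"), (30, "A"), (200, "A")]
alerts = list(flag_suspicious(stream))
print(alerts)
```

Because only the timestamps inside the current window are kept, memory use depends on the window size and not on the total length of the stream.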

Variety: Big data is any type of data: structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

● Monitor hundreds of live video feeds from surveillance cameras to target points of interest

● Exploit the 80% data growth in images, video and documents to improve customer satisfaction

4.2 Technology related challenges

Technology usability: The big data field has made tremendous progress in developing technologies and tools that make the benefits of big data accessible to even the smallest organizations. As almost 100% of this development is open source and relatively young, there is huge scope for consolidation and standardization of technologies and tools.

Apache Hadoop is one of the open source projects that enable big data and has been the driving force behind the growth of big data's reach. Programming Hadoop is a case of working with the Java APIs, many of which are known for their ‘horrific’ usability.
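For readers unfamiliar with what programming Hadoop involves, its core processing model is MapReduce. The sketch below imitates the map, shuffle, and reduce phases in plain Python on a word-count task. It does not use any Hadoop API, the function names are illustrative, and a real job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit (word, 1) for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data moves fast"]

mapped = (pair for doc in documents for pair in map_phase(doc))
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

The developer writes only the map and reduce functions; the usability complaints above concern the amount of API ceremony needed to express even a job this small.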

As the Apache Hadoop project is relatively young and constantly evolving, there isn’t much focus on ease of learning, which makes the learning curve steep. This is one major hurdle for people willing to adopt these technologies. In fact, many promising startups have sprung up just to make these technologies simpler to understand and use.

Another hurdle is the availability of alternative sub-technologies under Hadoop that overlap or are mutually exclusive in the features they offer for implementing a given piece of functionality. No single sub-technology is complete in itself, so an implementation requires the use of several.

Application acumen: As big data finds increasingly varied applications in more and more disciplines, the chances that an existing data set will be used for an unintended application are increasing. Also, it is difficult, if not downright impossible, to assess how a particular set of data collected today will be used, even if the application is in the same intended discipline. This inability results in unhelpful or inadequate annotations, provenance and metadata. In fact, the definition of noise itself may change depending on the application.

Joining two or more data sets, or joining data within the same data set, requires a thorough understanding of the intent of the various data manipulations. The person making these decisions also needs to be proficient in understanding and manipulating the independent variables and must understand how they relate to and affect the dependent variables. The analysis, modelling and interpretation of results are all functions of his or her proficiency with the tools and domain knowledge.

Provenance: Storing information about the data at its source is not useful unless this information can be interpreted and carried along the phases of the data analysis pipeline. For example, an error at one step can make all following analysis useless. Only with suitable provenance can one easily identify all following processing that depends on this step. We need research into generating suitable provenance and into data systems that transmit the provenance through the data analysis phases.
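If provenance is recorded as "which steps did this step consume", finding everything affected by a faulty step becomes a graph traversal. The sketch below assumes a hypothetical pipeline and a simple dictionary representation; the step names are illustrative.

```python
from collections import deque

# Hypothetical provenance record: each step lists the steps whose output it consumed.
provenance = {
    "acquire": [],
    "extract": ["acquire"],
    "clean": ["extract"],
    "integrate": ["clean"],
    "reference_catalog": [],
    "analyze": ["integrate", "reference_catalog"],
    "visualize": ["analyze"],
}


def downstream_of(step, provenance):
    """Return every step that directly or indirectly depends on `step`."""
    # Invert the "depends on" edges into "is used by" edges.
    used_by = {name: [] for name in provenance}
    for name, inputs in provenance.items():
        for source in inputs:
            used_by[source].append(name)

    affected, queue = set(), deque([step])
    while queue:
        current = queue.popleft()
        for consumer in used_by[current]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


print(sorted(downstream_of("clean", provenance)))
```

With this record, an error discovered in the cleaning step immediately identifies the integration, analysis, and visualization results as needing to be recomputed, while the acquisition and extraction outputs remain valid.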

Annotations: Automatically generating the right metadata to describe what data is recorded and how it is measured is difficult. For example, in scientific experiments, considerable detail on the specific experimental conditions and procedures is required to interpret the results effectively, and it is important that the metadata be recorded with the observational data. We need research into generating suitable metadata and into data systems that carry the metadata through the data analysis phases.
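A minimal sketch of carrying metadata through a processing step is to keep the annotations in the same record as the value and have every transformation return a new record that preserves them. The `Measurement` type, its fields, and the example values below are hypothetical.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str
    instrument: str                                  # how it was measured
    conditions: dict = field(default_factory=dict)   # experimental conditions


def celsius_to_kelvin(m: Measurement) -> Measurement:
    """Convert the value while carrying every annotation along unchanged."""
    if m.unit != "C":
        raise ValueError(f"expected Celsius, got {m.unit}")
    return replace(m, value=m.value + 273.15, unit="K")


reading = Measurement(25.0, "C", "thermocouple-7", {"pressure_kpa": 101.3})
converted = celsius_to_kelvin(reading)
print(converted)
```

Because the metadata travels with the value, a later phase can still see which instrument produced the reading and under what conditions, even after the value itself has been transformed.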

Cloud: Big data and cloud technology go hand-in-hand. Big data needs clusters of servers for processing and huge storage space, which clouds can readily provide. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next couple of years.

The cloud services provided by the three major players today (Amazon, Google and Microsoft) differ in many aspects and have different capabilities. A big data implementation on one of them may not be portable to another platform with exactly the same capabilities. In other words, such large implementations are locked in to a cloud service provider, and there is a huge switching cost.

Visualization: A Picture Is Worth 10,000 Rows! The best data visualizations are ones that expose something new about the underlying patterns and relationships contained within the data. Understanding those relationships, and so being able to observe them, is key to good decision-making. The Periodic Table is a classic testament to the potential of visualization to reveal hidden relationships in even small data sets. Visualizations are a new set of languages that can be used to communicate. As big data applications mature, more complex forms of visualization will be invented to capture the complex relationships between the various dimensions, and equally well-trained minds will be required to decode and interpret them.

4.3 People related challenges

Data ownership: In this age when each one of us is constantly interacting with sensors, there is confusion over who owns data that is about you but has been recorded or sensed by someone else's sensor. Example: Someone has a road accident and is carried to hospital for treatment, and in the process the hospital records a host of information on that person's behaviour, body characteristics, the activity they were involved in just before the accident, etc. Does the owner of the sensors own this data, or does the person it describes?

Ethics: With very few people having access to big data technologies, is it justified for only a select few to reap the benefits of the rich information that is derived from innocent-looking datasets? Example: Is it justified when an insurance company, with the help of big data technologies and weather data sets (generated using public money), calculates the chances of drought or floods and correspondingly changes the insurance premiums and the fine print in the offer document?

As more and more predictions are made using sophisticated techniques, and bigger and bigger decisions are based on these predictions, should the data miners and scientists be held responsible for losses arising from wrong predictions? Recently, in Italy, seven experts were sentenced to jail and ordered to pay heavy fines for ‘false assurances’ given before the L'Aquila earthquake that killed about 300 people.

Privacy: With the varied innovations that Big Data is enabling, there is a fine line between being innovative and breaching someone's privacy. Is privacy breached only when there is a name attached to a set of disclosed attributes? Do we have to change our core value system to be able to fully benefit from big data?
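The question of whether a name must be attached can be made concrete: even with names removed, a combination of ordinary attributes may single out one record. The sketch below counts such unique combinations in a small, entirely hypothetical data set; the attribute names and values are illustrative.

```python
from collections import Counter

# Hypothetical "anonymized" records: no names, only a few disclosed attributes.
records = [
    {"zip": "12180", "birth_year": 1985, "gender": "F"},
    {"zip": "12180", "birth_year": 1985, "gender": "F"},
    {"zip": "12180", "birth_year": 1990, "gender": "M"},
    {"zip": "12203", "birth_year": 1972, "gender": "F"},
]

ATTRIBUTES = ("zip", "birth_year", "gender")


def unique_combinations(records, attributes):
    """Return the attribute combinations that occur exactly once in the data."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return [combo for combo, n in counts.items() if n == 1]


exposed = unique_combinations(records, ATTRIBUTES)
print(f"{len(exposed)} of {len(records)} records are unique on {ATTRIBUTES}")
```

A record that is unique on such attributes could in principle be matched against another data set that does contain names, which is why the absence of a name alone does not settle the privacy question.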

5. Conclusion

“Like it or not, we live in interesting times.” Big data is powerful and disruptive. Like most other technologies, it is neutral. It is the applications that raise questions. There has been considerable progress on the technology front to enable big data. On the other hand, we have only just started to understand and resolve its implications for our lives and core values. The potential benefits of Big Data are practical and significant, and some initial benefits have already been achieved, but there are still many technical and people-related challenges that need to be addressed to fully exploit its potential.

References and Citations:

http://oreillynet.com/pub/e/2180
http://www.forbes.com/sites/oreillymedia/2012/06/21/the-ethics-of-big-data/3/
http://weblogs.java.net/blog/timboudreau/archive/2009/07/api_design_vs_a.html
http://www.guardian.co.uk/world/2012/oct/22/italian-scientists-jailed-earthquake-aquila
http://www.purdue.edu/discoverypark/cyber/assets/pdfs/BigDataWhitePaper.pdf
Book: Privacy and Big Data by Terence Craig and Mary E. Ludloff
Book: Planning for Big Data by O’Reilly Radar Team
http://research.google.com/pubs/Human-ComputerInteractionandVisualization.html
http://www-01.ibm.com/software/data/bigdata/
http://en.wikipedia.org/wiki/Big_data
