An analysis of the Big Data ecosystem from an HCI perspective
Jay Sanghvi
Rensselaer Polytechnic Institute
For: Theory and Research in Technical Communication and HCI, Rensselaer Polytechnic Institute
Wednesday, December 5th 2012

Abstract
The potential benefits of Big Data are practical and significant, and some initial benefits have already been achieved. However, many technical and people-related challenges must still be addressed to fully exploit its potential. The sheer size of the data is the most obvious challenge, but there are others: heterogeneity in data type, representation and semantic interpretation, and the rate at which the data needs to be processed. Beyond these, privacy and usability are also important. This paper presents these challenges from an HCI perspective.
Table of Contents
1. What is Big Data
2. Applications and Benefits
3. Data Analysis Pipeline
4. Challenges
   4.1 Fundamental challenges
       Volume
       Velocity
       Variety
   4.2 Technology related challenges
       Technology usability
       Application acumen
       Provenance
       Annotations
       Cloud
       Visualization
   4.3 People related challenges
       Data ownership
       Ethics
       Privacy
5. Conclusion
1. What is Big Data
Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Big data is data that exceeds the processing capacity of conventional information systems. The data is too big, moves too fast, or doesn’t fit the strictures of conventional information architectures. To gain value from this data, we must choose an alternative way to process it.
2. Applications and Benefits
Scientific research has been revolutionized by Big Data. The field of Astronomy is being transformed from one where taking pictures of the sky was a large part of an astronomer’s job to one where the pictures are all in a database already and the astronomer’s task is to find interesting objects and phenomena in the database.
Big Data has the potential to revolutionize not just research, but also education. A recent detailed quantitative comparison of different approaches taken by 35 charter schools in NYC has found that one of the top five policies correlated with measurable academic effectiveness was the use of data to guide instruction.
It is widely believed that the use of information technology can reduce the cost of healthcare while improving its quality, by making care more preventive and personalized and basing it on more extensive (home-based) continuous monitoring. McKinsey estimates a savings of 300 billion dollars every year in the US alone.
Similarly, strong cases have been made for the value of Big Data in urban planning, intelligent transportation, environmental modeling, energy saving, smart materials, computational social sciences, financial systemic risk analysis, homeland security, computer security, and so on.
In 2010, enterprises and users stored more than 13 exabytes of new data; this is over 50,000 times the data in the Library of Congress. According to a recent McKinsey report, the potential value of global personal location data is estimated to be $700 billion to end users, and it can result in a decrease of up to 50% in product development and assembly costs. McKinsey predicts an equally great effect of Big Data on employment, where 140,000 to 190,000 workers with “deep analytical” experience will be needed in the US; furthermore, 1.5 million managers will need to become data-literate. Not surprisingly, the recent PCAST report on Networking and IT R&D identified Big Data as a “research frontier” that can “accelerate progress across a broad range of priorities.” Even popular news media now appreciate the value of Big Data, as evidenced by coverage in the Economist, the New York Times, and National Public Radio.
3. Data Analysis Pipeline
3.1 Data Acquisition and Recording
Data is recorded from a data-generating source such as human interaction, human behaviour, business transactions, nature, or scientific experiments and simulations, which can easily and continuously produce petabytes of data today.
3.2 Information Extraction and Cleaning
The information collected is almost never in a format ready for analysis. Consider, for example, pictures that capture deep outer space: we cannot leave the data in image form and still analyze it effectively. Rather, we require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Doing this correctly and completely is a continuing technical challenge. Note that this data also includes images and will in the future include video; such extraction is often highly application dependent.
Often the instruments used for data capture are biased under certain conditions, which makes it imperative to clean the data. For example, pictures of deep space taken by a space telescope while it was in the radiation field of a meteor or another star may have been affected in a particular way.
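The extraction and cleaning steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw line format, the `status` flag and the station names are hypothetical and do not come from any real instrument.

```python
import re
from typing import NamedTuple, Optional


class Reading(NamedTuple):
    station: str
    temp_c: float


# Hypothetical raw format: "station=<id> temp=<value>C status=<flag>"
LINE_PATTERN = re.compile(
    r"station=(\w+)\s+temp=(-?\d+(?:\.\d+)?)C\s+status=(\w+)"
)


def extract(line: str) -> Optional[Reading]:
    """Return a structured record, or None if the line is unusable."""
    match = LINE_PATTERN.search(line)
    if match is None:
        return None  # extraction failed: the line is malformed
    station, temp, status = match.groups()
    if status != "ok":
        return None  # cleaning: captured under a condition that may bias it
    return Reading(station, float(temp))


raw_lines = [
    "station=A7 temp=21.5C status=ok",
    "station=A7 temp=93.0C status=radiation",
    "garbled ###",
    "station=B2 temp=-3.0C status=ok",
]

# Keep only the lines that were both parseable and trustworthy.
clean = [r for r in map(extract, raw_lines) if r is not None]
```

Here extraction and cleaning are deliberately kept in one function for brevity; in a real pipeline the rejected lines would normally be logged rather than silently dropped, so that the decision can be reviewed later.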
3.3 Data Integration, Aggregation, and Representation
Given the heterogeneity of big data (in terms of what it represents, its format, granularity, semantic interpretation and intent), it is not enough merely to record it and store it in a repository. We need to make sure that it is discoverable and usable by the larger community. Adequate annotations do help, but integration and aggregation remain challenging due to differences in experimental details and in the record structure of two or more data sets.
Data analysis is significantly more challenging than just locating, identifying, understanding, and citing data. For effective large-scale analysis, all these steps need to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable and then “robotically” resolvable, i.e. through an error-free difference resolution method that is independent of the data structure.
Analysis is not simple even when only one data set is involved. There are many alternative ways to store the same information. Certain database designs have advantages over others for certain purposes, and possibly drawbacks for other purposes. Database design expertise is limited to a few qualified professionals. There are no tools or frameworks that enable other professionals, such as domain scientists, to create effective database designs.
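To make the idea of expressing structural and semantic differences in a computer-understandable form more concrete, the following minimal sketch maps two data sets with different field names and units onto one shared schema. The data sets, field names and mapping functions are all hypothetical; real integration also has to deal with conflicting or missing values, which this sketch ignores.

```python
def from_set_a(record):
    """Map a record from hypothetical data set A onto the shared schema.

    Data set A identifies stations by "id" and reports Fahrenheit.
    """
    return {
        "station": record["id"],
        "temp_c": (record["temp_f"] - 32) * 5 / 9,  # Fahrenheit to Celsius
    }


def from_set_b(record):
    """Map a record from hypothetical data set B onto the shared schema.

    Data set B uses "station_id" and already reports Celsius.
    """
    return {
        "station": record["station_id"],
        "temp_c": record["temperature_c"],
    }


set_a = [{"id": "A7", "temp_f": 50.0}]
set_b = [{"station_id": "B2", "temperature_c": 12.5}]

# Once every source has an explicit mapping, the merge itself is mechanical.
combined = [from_set_a(r) for r in set_a] + [from_set_b(r) for r in set_b]
```

The point of the sketch is that the differences between the sources live in small, explicit mapping functions, so the aggregation step no longer needs to know where a record came from.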
3.4 Query Processing, Data Modeling, and Analysis
This phase involves retrieving target data from heterogeneous, interrelated and redundant data sources, mining big data, cross-checking conflicting cases, validating trustworthy relationships, identifying inherent clusters, and uncovering hidden relationships and models.
Big Data enables the next generation of interactive data analysis with real-time answers. Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today.
3.5 Interpretation
Analyzing Big Data is of limited value if users cannot understand the result. An expert decision-maker, provided with the result of an analysis, has to interpret it. This interpretation involves examining all the assumptions made and retracing the analysis. There are also many possible sources of error: bugs in computer systems, assumptions made by the data models, and results based on erroneous data. For all of these reasons, no responsible user will cede authority to the computer system. The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence: rather than accept the stated solvency of a financial institution at face value, a decision-maker has to examine critically the many assumptions at multiple stages of analysis.
Hence, it is not enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived, and based upon precisely what inputs, i.e. the provenance of the (result) data.
Systems with a rich variety of visualizations are important in conveying the results of queries in the way best understood by a particular set of people. Results need to be presented using powerful visualizations that assist interpretation and support user collaboration.
4. Challenges
Almost all the challenges for Big Data development and adoption are due to its three fundamental dimensions: Volume, Velocity and Variety.
4.1 Fundamental challenges
Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information per day.
● Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
● Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
● Scrutinize 5 million trade events created each day to identify potential fraud
● Analyze 500 million daily call detail records in real time to predict customer churn faster
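Processing data "as it streams in" usually means keeping only a small window of recent events in memory instead of storing everything first and querying later. The following is a minimal sketch of that idea, assuming a hypothetical stream of (timestamp, account) events and an arbitrary threshold; it is not a real fraud detection method.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 120  # the "2 minutes" horizon, used here for illustration
MAX_EVENTS = 3        # hypothetical threshold for a suspicious burst


def flag_bursts(events):
    """Yield (timestamp, account) when an account exceeds MAX_EVENTS
    inside the trailing window.

    `events` must be an iterable of (timestamp_seconds, account) pairs
    in non-decreasing time order.
    """
    recent = defaultdict(deque)  # account -> timestamps still in the window
    for ts, account in events:
        window = recent[account]
        window.append(ts)
        # Discard timestamps that have fallen out of the window.
        while ts - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_EVENTS:
            yield ts, account


stream = [
    (0, "acct1"), (10, "acct1"), (20, "acct2"),
    (30, "acct1"), (40, "acct1"), (200, "acct1"),
]
alerts = list(flag_bursts(stream))
```

Because each event is examined once and old timestamps are discarded immediately, the memory needed depends on the window size rather than on the total volume of the stream.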
Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
● Monitor hundreds of live video feeds from surveillance cameras to target points of interest
● Exploit the 80% data growth in images, video and documents to improve customer satisfaction
4.2 Technology related challenges
Technology usability: Tremendous progress has been made in developing technologies and tools that make the benefits of big data accessible to even the smallest organizations. As nearly all of this development is open source and relatively young, there is considerable scope for consolidation and standardization of technologies and tools.
Apache Hadoop is one of the open source projects enabling big data and has been a driving force behind its growing reach. Programming Hadoop is a case of working with the Java APIs, many of which are known for their ‘horrific’ usability.
As the Apache Hadoop project is relatively young and constantly evolving, there isn’t much focus on ease of learning, which makes the learning curve steep. This is one major hurdle for people willing to adopt these technologies. In fact, many promising startups have sprung up just to make these technologies simpler to understand and use.
Another hurdle is the availability of alternative sub-technologies under Hadoop whose features overlap or are mutually exclusive. No one sub-technology is complete in itself, so implementing a given functionality requires the use of multiple technologies.
Application acumen: As big data finds increasingly varied applications in more and more disciplines, the chances that an existing data set will be used for an unintended application are increasing. It is also difficult, if not downright impossible, to assess how a particular set of data collected today will be used, even if the application is in the same intended discipline. This inability results in unhelpful or inadequate annotations, provenance and metadata. In fact, the definition of noise itself may change depending on the application.
Joining two or more data sets, or joining data within the same data set, requires a thorough understanding of the intent of the various data manipulations. The personnel making these decisions also need to be proficient in understanding and manipulating the independent variables and in understanding how they relate to and affect the dependent variables. The analysis, modelling and interpretation of results are all functions of their proficiency with the tools and their domain knowledge.
Provenance: Storing information about the data at its source is not useful unless this information can be interpreted and carried along the phases of the data analysis pipeline. For example, an error at one step can make the following analysis useless. Only with suitable provenance can one easily identify all subsequent processing that depends on this step. We need research into generating suitable provenance and into data systems that transmit the provenance through data analysis phases.
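One simple way to picture provenance is as a record of which step consumed the output of which other step. Given such a record, finding everything affected by an error becomes a graph traversal. The sketch below assumes a hypothetical six-step pipeline; the step names and the `inputs_of` structure are illustrative only.

```python
from collections import deque

# Hypothetical provenance record: each step lists the steps it reads from.
inputs_of = {
    "raw_images": [],
    "extracted_table": ["raw_images"],
    "cleaned_table": ["extracted_table"],
    "reference_catalog": [],
    "joined_table": ["cleaned_table", "reference_catalog"],
    "model": ["joined_table"],
}


def affected_by(step, inputs_of):
    """Return every step that depends, directly or indirectly, on `step`."""
    # Invert the record so each step knows which steps consume its output.
    consumers = {name: [] for name in inputs_of}
    for name, sources in inputs_of.items():
        for source in sources:
            consumers[source].append(name)

    # Breadth-first traversal over the consumer relation.
    affected = set()
    queue = deque([step])
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


tainted = affected_by("extracted_table", inputs_of)
```

With this record, an error found in `extracted_table` marks three later steps for recomputation, while `reference_catalog` is known to be unaffected. Without the record, the only safe option would be to rerun the whole pipeline.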
Annotations: Automatically generating the right metadata to describe what data is recorded and how it is measured is difficult. For example, in scientific experiments, considerable detail on the specific experimental conditions and procedures is required to be able to interpret the results effectively, and it is important that the metadata be recorded with observational data. We need research into generating suitable metadata and into data systems that carry the metadata through data analysis phases.
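As a rough illustration of what "carrying the metadata through data analysis phases" could look like, the sketch below bundles values with their metadata and makes every transformation return a new bundle with the metadata copied, updated and extended with a processing note. The `Annotated` class, the `transform` helper and the instrument name are hypothetical, not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotated:
    """Observed values bundled with the metadata needed to interpret them."""
    values: tuple
    metadata: dict = field(default_factory=dict)


def transform(data, func, note, **updates):
    """Apply `func` to every value and return a new Annotated object.

    The input metadata is copied, `updates` override individual entries
    (for example the unit), and `note` is appended to the history.
    """
    history = data.metadata.get("history", ()) + (note,)
    metadata = {**data.metadata, **updates, "history": history}
    return Annotated(tuple(func(v) for v in data.values), metadata)


rainfall_mm = Annotated(
    (250.0, 125.0),
    {"unit": "mm", "instrument": "hypothetical-gauge-7"},
)

# The unit is updated in the same call that changes the values,
# so the metadata cannot silently go stale.
rainfall_cm = transform(rainfall_mm, lambda mm: mm / 10, "mm to cm", unit="cm")
```

The design choice worth noting is that values can only change through `transform`, which forces each analysis phase to say what it did and to keep the descriptive metadata consistent with the data.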
Cloud: Big data and cloud technology go hand in hand. Big data needs clusters of servers for processing and huge storage space, which clouds can readily provide. Cloud services themselves are at an early stage, and we will see both increasing standardization and innovation over the next couple of years.
The cloud services provided by the three major players today, Amazon, Google and Microsoft, differ in many aspects and have different capabilities. A big data implementation on one of them may not be portable to another platform while retaining exactly the same capabilities. In other words, such large implementations are locked in to a cloud service provider, and there is a huge switching cost.
Visualization: A picture is worth 10,000 rows! The best data visualizations are ones that expose something new about the underlying patterns and relationships contained within the data. Understanding those relationships, and so being able to observe them, is key to good decision-making. The Periodic Table is a classic testament to the potential of visualization to reveal hidden relationships in even small data sets. Visualizations are a new set of languages that can be used to communicate. As big data applications mature, more complex visualization forms will be invented to convey the complex relationships between the various dimensions, and equally trained minds will be required to decode and interpret them.
4.3 People related challenges
Data ownership: In an age when each of us is constantly interacting with sensors, there is confusion over who owns data that is about you but has been recorded or sensed by someone else’s sensor. For example, someone is injured in a road accident and is carried to a hospital for treatment, and in the process sensors record a host of information on that person’s behaviour, body characteristics, the activity they were involved in just before the accident, and so on. Does the owner of the sensors own this data, or does the person it describes?
Ethics: With very few people having access to big data technologies, is it justified that only a select few reap the benefits of rich information derived from innocent-looking data sets? For example, is it justified when an insurance company, with the help of big data technologies and weather data sets (generated using public money), calculates the chances of drought or floods and correspondingly changes the insurance premiums and the fine print in the offer document?
As more and more predictions are made using sophisticated techniques, and bigger and bigger decisions are based on these predictions, should the data miners and scientists be held responsible for losses arising from wrong predictions? Recently, in Italy, seven scientists and officials were sentenced to prison and ordered to pay heavy damages for giving ‘false assurances’ before an earthquake that killed more than 300 people.
Privacy: With the varied innovations that Big Data is enabling, there is a fine line between being innovative and breaching someone’s privacy. Is privacy breached only when there is a name attached to a set of disclosed attributes? Do we have to change our core value system to be able to fully benefit from big data?
5. Conclusion
“Like it or not, we live in interesting times.” Big data is powerful and disruptive. Like most other technologies, it is neutral; it is the applications that raise questions. There has been considerable progress on the technology front to enable big data. On the other hand, we have just started to understand and resolve its implications for our lives and core values. The potential benefits of Big Data are practical and significant, and some initial benefits have already been achieved, but many technical and people-related challenges still need to be addressed to fully exploit its potential.

References and Citations:
http://oreillynet.com/pub/e/2180
http://www.forbes.com/sites/oreillymedia/2012/06/21/theethicsofbigdata/3/
http://weblogs.java.net/blog/timboudreau/archive/2009/07/api_design_vs_a.html
http://www.guardian.co.uk/world/2012/oct/22/italianscientistsjailedearthquakeaquila
http://www.purdue.edu/discoverypark/cyber/assets/pdfs/BigDataWhitePaper.pdf
Book: Privacy and Big Data by Terence Craig and Mary E. Ludloff
Book: Planning for Big Data by O’Reilly Radar Team
http://research.google.com/pubs/HumanComputerInteractionandVisualization.html
http://www01.ibm.com/software/data/bigdata/
http://en.wikipedia.org/wiki/Big_data