Shining Light on
Dark Data
November, 2013
White Paper
Sponsored by Hitachi Data Systems
Benjamin S. Woo
Overview
Dark Data is often misunderstood. It is considered wasteful, costly and valueless. However, in reality, it is often quite the opposite. Dark Data represents untapped value and opportunity. In order to extract the value from Dark Data, we need to be able to understand better what is captured in Dark Data, the type of data it is, and categorize it.
In this White Paper, Neuralytix looks at what Dark Data is, and how enterprises can generate value and create competitive advantage from Dark Data with the assistance of HDS’s Hitachi Data Discovery Suite (HDDS) and Hitachi Content Platform (HCP).
HDDS is storage and data platform independent – it has the ability to index nearly 500 different data types, in over 50 different languages. This capability exceeds any that the human mind can comprehend or process. HDDS also makes it possible to search for data that may have been buried in archives. It can also enable discovery: surfacing information about how data relates to other data in order to generate new awareness.
The Hitachi Content Platform (HCP) stores, shares, synchronizes, protects, preserves, analyzes and retrieves file data from a single system, and extends the capabilities of traditional file storage solutions. It has a policy engine that automates day-to-day IT operations such as data protection, and it readily adapts to changes in scale, scope, applications, and storage and server technologies over the life of the data.
With the combination of the output that HDDS provides along with HCP, industry knowledge and business experience, enterprises of all sizes should expect significant return on the investment in a data discovery solution.
Table of Contents
Overview
Table of Contents
Current Overview
What is Dark Data
How does Dark Data get generated
Dark Data comes in many forms
Search & Discovery
Search
Discovery
Search & Discovery and Dark Data
Neuraspective™
Generating Value from Dark Data
Data retention is costly, data reuse is valuable
Reusing data requires coordination and consolidation
Discover, Search and Profit
One Search for All Data
Minimizing Time to Insight
Conclusion
About the Author
About Neuralytix™
Current Overview
Enterprises today collect tremendous amounts of data. That data is preserved for a number of standardized uses including, but not limited to, reporting, business intelligence, and regulatory compliance.
There are also other reasons for data preservation. These include sentimental and historical reasons. Whatever the purpose of preserving data is, enterprises soon realize that they have more data than their regular processes require or use.
The data that is not used on a regular basis is considered “Dark Data”. This data seems extraneous, and is often under- or never utilized. It is often collected as part of a transaction, and is most likely innocuous or referential data.
The problem is that, particularly as a result of regulatory compliance, enterprises are being required to keep more and more data that would fall under the definition of Dark Data. Preserving this data is expensive. Dark Data is no less costly to preserve than non-dark data. Apart from the cost of keeping it, Dark Data can also be a long-term liability: for example, if the enterprise is sued and potentially incriminating data is subsequently found within its Dark Data, the enterprise could be exposed to tremendous losses.
Rather than taking the “ostrich” approach of ignoring the existence of Dark Data, enterprises need to take a much more proactive attitude towards it. Dark Data can be very valuable. In keeping with the example above, the same enterprise could mine its own Dark Data for anything that may cause a significant financial liability, expose that potential liability internally, and proactively remedy the situation with its customers.
Not only can such measures reduce punitive damages, they can also improve customer loyalty and build respect for and confidence in the company among its customer base. Beyond these cost-mitigating benefits, through the use of Big Data technologies, Dark Data can actually help enterprises to proactively generate revenue and create competitive advantage.
This White Paper looks at how Hitachi Data Systems addresses these challenges and helps to shed light (i.e. generating positive enterprise value) on Dark Data.
What is Dark Data
The IT industry has not formally defined Dark Data. For most, the concept of Dark Data represents data that is kept because an organization feels that the destruction of such data may prove detrimental in some way. In other words, we could define Dark Data as data that we keep but do not really know why we keep it!
How does Dark Data get generated
Dark Data is generated in an infinite number of ways. It could be data that is collected by transactional systems using off-the-shelf software, but has no direct relationship to the type of business being run. It could be log files that are archived for many years, simply because that is how it has always been done. It could also be derivative data left over from other processes.
One potential example of Dark Data is raw survey data from many years ago, for products that an enterprise no longer manufactures or distributes.
One of the most common ways that Dark Data gets created is through versioning: old versions of documents and other unstructured data that are kept for recovery or referential purposes.
Dark Data comes in many forms
As the examples above illustrate, Dark Data can be structured, semi-structured, or completely unstructured. In fact, it is mostly the semi- and fully-unstructured data that present the biggest problems.
Unstructured data, as the name suggests, have no prescribed relationships or correlation to any other data. They are not supposed to; hence the concept of “unstructured”. However, the lack of associations and connections is what causes the Dark Data classification in the first place. These data are “orphaned.” Without specific linkages back to other data, their preservation or destruction can generate similar liabilities for an enterprise.
Examples of unstructured data include notes made by doctors on a patient’s electronic medical record (EMR). The notes are often captured in a text field; but it is very difficult to mine through this data, as it often does not have a prescribed relationship to any other field other than the patient’s identity. This same kind of example can extend to any task that involves some form of human intuition or note-taking (e.g. notes from a repairman).
Search & Discovery
Accepting the existence of Dark Data, the question must be asked: what can one do with Dark Data? To address this question, two technologies have found great prominence in the last 20 years: search and discovery.
Search and discovery are distinct and complementary technologies. With search, a user’s goal or expectation is to find an answer to a question. With discovery, the user is more likely to seek awareness, rather than specific answers.
A good example of the difference would be in litigation. One party may conduct legal discovery and suggest to the other party that, through discovery, it has found incriminating evidence against the second party. Unlike search, in which the outcome is precise, the outcome of the discovery could simply be that emails exist that may prove the second party liable for damages.
At this point, the second party may feel sufficiently threatened, and settle.
Alternatively, the second party may feel that the first party is bluffing, and that the outcome of their discovery (or the first party’s awareness of the existence of the emails) is insufficient. In this situation, the first party would need to search the discovered emails for specific evidence that would lead them to answer the question of whether the second party has wronged the first.
Search
As noted in the example above, a search seeks to find specific answers to a question. Public search engines such as those provided by Google, Yahoo! and Bing are prime examples of search technology at work. These search engines enable a user to ask the search providers to find probable answers to the question posed.
The search engines then go through a ranking process to list out what they believe to be the most appropriate response, but nonetheless, the outcome is to seek a satisfactory direct answer to a posited question.
Discovery
On the other hand, discovery is not as specific. The user’s goal is generally one of awareness. One might seek to discover why a computer system’s processors are taxed at 9:01am on a Monday morning. An obvious “answer” may be a very high number of people booting up their computers to start a work week. But in discovery, we are not looking for specific answers, only awareness.
“Data mining”, as this process is sometimes called, can be augmented with Big Data technologies to proactively generate value and create competitive advantage.
Big Data can process hundreds of machine-generated logs and find patterns or correlations between the increase in system processor usage and the number of users logging on. The outcome of the discovery could be that a very high number of users perform boot sequences at 9:01am each Monday compared to any other time of the day or day of the week.
The difference is very fine. Nevertheless, the outcome of the discovery does not specifically identify workers starting their computers at the head of the workweek. It simply suggests a very high number of boot sequences taking place at 9:01am on Mondays compared to other times and days.
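The Monday-morning example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log format, the host names and the `boot_counts` helper are hypothetical and do not describe how HDDS or any particular Big Data tool works. The point is that the output is a pattern (a count per day and minute), not an explanation of why the pattern exists.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log format (an assumption): "<ISO timestamp> <host> <event>"
SAMPLE_LOG = """\
2013-11-04T09:01:12 ws-001 BOOT
2013-11-04T09:01:40 ws-002 BOOT
2013-11-04T09:01:55 ws-003 BOOT
2013-11-04T14:30:02 ws-004 BOOT
2013-11-05T09:01:10 ws-001 BOOT
2013-11-05T11:15:00 ws-002 LOGIN
"""

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def boot_counts(log_text):
    """Count BOOT events per (weekday name, HH:MM) bucket."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        # Skip malformed lines and events other than BOOT.
        if len(parts) != 3 or parts[2] != "BOOT":
            continue
        ts = datetime.fromisoformat(parts[0])
        counts[(WEEKDAYS[ts.weekday()], ts.strftime("%H:%M"))] += 1
    return counts


counts = boot_counts(SAMPLE_LOG)
busiest_bucket, busiest_count = counts.most_common(1)[0]
print(busiest_bucket, busiest_count)
```

With the sample data, the busiest bucket is Monday at 09:01. Deciding what that means (workers starting their week, a scheduled job, something else) is left to the person interpreting the result.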
It is up to an interpreter of the data to make sense of the outcome of the discovery. As such, discovery is all about an awareness of a situation in order to allow intelligent
Since computers can process much more data than the human mind can handle, computers can take vast amounts of seemingly disparate data and create linkages, patterns, or relationships that are beyond the capabilities of the human mind. These types of transactions are prime examples of Big Data at work.
These discoveries help individuals to take an immense amount of data and produce information in a human-consumable quantity. This information then helps humans to make better decisions more quickly, and most importantly in a more informed manner.
Search & Discovery and Dark Data
By using varying combinations and permutations of search and discovery technologies, enterprises are able to generate value from Dark Data. Computers can do so with an efficiency that goes beyond any group of people. They can also deal with data that would otherwise have been forgotten or ignored.
Today, the combination of all data (dark or otherwise) and the use of search and discovery technologies form the basis of what we call Big Data. Already, Big Data has been demonstrated to help enterprises:
• Improve output efficiency;
• Improve customer targeting;
• Personalize offerings for individual customers;
• Improve customer service; and
• Improve revenue and profits.
The benefits of Big Data can range from saving a few thousand dollars; to improving profit by millions or billions of dollars; to stopping medical pandemics from occurring; to saving lives and families in underprivileged nations.
Neuraspective™
Generating Value from Dark Data
Data overall is like currency. Not money, but currency. Think of each piece of data you have as a $1 bill. There are many ways of protecting it, the most popular of which is to put it into a bank.
But what about the change you receive, that you pile up in a jar? What about the lost coins in your couch? For many, after a while, these innocuous and relatively immaterial coins can add up. After several months, many may find hundreds or even thousands of dollars collected over time. Now, include the odd $1 or $5 bill that may have been left lying around as a result of change that you received and unconsciously or hurriedly shoved into your pants pocket. These, like the coins, often get quickly pulled out of pockets before the pants get put into the laundry. Again, like the coins, these bills can collect over time.
Imagine what this could buy. In fact, if invested wisely, what could have become of the previously “immaterial” amount? Consider the value that is lost because what $1 could have purchased one year is, through inflation, worth less in subsequent years.
Data is just like currency. It too can get left behind. It too can lose value over time. Except, unlike coins stuck in a sofa, or change on a dresser, enterprises have a lot of “loose” or dark data lying around.
Regulatory compliance compounds this further. Now, there is data that the enterprise may not want, yet is forced to retain.
Data retention is costly, data reuse is valuable
Data that is stored with no immediate use will lose value over time. This is costly to an enterprise. Data that is preserved or archived simply for the purpose of being available is also costly. There is the cost of the storage media, the maintenance, power and cooling, the real estate cost of the storage system and many other considerations that add up to a substantial amount of investment and support required to retain data.
So instead of simply retaining data, enterprises need to look at reusing data. When data is able to be reused, its value can be sustained.
Reusing data requires coordination and consolidation
A big challenge for most enterprises is that the data that can be reused, or should be reused, is often found in disparate locations, or silos. The applications that generate and maintain the data may even be owned by separate organizations inside the enterprise. For example, call center data may be owned by the support organization through its function-specific software. Similarly, sales and marketing data that reside in a customer relationship management (CRM) system may not be available (or considered useful) to the research and development (R&D) department, and as such, the R&D department has no visibility of or access to the CRM data.
Apart from the internal politics associated with data ownership, there are issues such as data coherency and duplication across these organizations. In some cases, similar data may be conflicting in nature because each organization has collected the data from a different viewpoint or perspective.
Adding to all of this are issues related to corporate and regulatory compliance and governance, the equitable application of security, retention and other corporate policies across each and every silo of data. The “quality” of the data plays a major role in an enterprise’s ability to reuse data and obtain the necessary insight and intelligence that will generate strategic value to the enterprise.
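To make the coherency problem concrete, the following Python sketch compares records for the same customer held in two silos and reports the fields on which they disagree. Everything here is an assumption for illustration: the silo contents, the customer keys and the `find_conflicts` helper are hypothetical, and real reconciliation would also have to deal with differing schemas and identifiers.

```python
def find_conflicts(silo_a, silo_b):
    """Return {key: {field: (value_a, value_b)}} for shared keys whose fields disagree."""
    conflicts = {}
    # Only keys present in both silos can conflict.
    for key in silo_a.keys() & silo_b.keys():
        record_a, record_b = silo_a[key], silo_b[key]
        differing = {
            name: (record_a[name], record_b[name])
            for name in record_a.keys() & record_b.keys()
            if record_a[name] != record_b[name]
        }
        if differing:
            conflicts[key] = differing
    return conflicts


# Illustrative data only: two organizations describing the same customers.
CRM = {
    "C-1": {"email": "a@example.com", "region": "EMEA"},
    "C-2": {"email": "b@example.com", "region": "APAC"},
}
SUPPORT = {
    "C-1": {"email": "a@example.com", "region": "APAC"},
    "C-3": {"email": "c@example.com", "region": "EMEA"},
}

conflicts = find_conflicts(CRM, SUPPORT)
print(conflicts)
```

In this sample, customer C-1 appears in both silos with a different region in each, which is exactly the kind of conflict that has to be resolved before the data can be reused with confidence.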
HDDS helps to coordinate and consolidate access to multiple data sources. The use of the word access is deliberate. HDDS avoids the costly replication of data that would only lead to the extra costs associated with the additional storage capacity and management of that capacity.
Instead, HDDS deploys a scale-out indexing technology that securely processes petabytes (PBs) of data to facilitate search and discovery across disparate storage systems and data types.
For today’s global economy, HDDS has been enhanced with the ability to index 454 different data types in 56 different languages, including character-based languages (like those found in Asia).
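The idea of indexing data where it lives, rather than copying it, can be illustrated with a small inverted index. The sketch below is a generic technique, not HDDS's implementation: the silo names, the location strings and the `build_index` and `search` helpers are all hypothetical. The index stores only terms and references to locations, so the documents themselves stay in their original systems.

```python
import re
from collections import defaultdict

# Hypothetical silos (an assumption): each maps a document location to its text.
SILOS = {
    "crm": {
        "crm://accounts/17": "Customer reported battery overheating",
    },
    "support": {
        "support://tickets/902": "Battery replaced after overheating complaint",
        "support://tickets/903": "Password reset requested",
    },
}


def build_index(silos):
    """Map each lowercase term to the set of locations that contain it."""
    index = defaultdict(set)
    for documents in silos.values():
        for location, text in documents.items():
            for term in re.findall(r"\w+", text.lower()):
                index[term].add(location)  # store a reference, not a copy
    return index


def search(index, query):
    """Return the locations that contain every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


index = build_index(SILOS)
print(sorted(search(index, "battery overheating")))
```

A single query here returns matching locations from both the CRM and the support silo, even though neither system's data was moved or duplicated.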
Discover, Search and Profit
The result of being able to discover and search across so many formats and languages is that queries can quickly feed informed decision support systems.
From a discovery perspective, HDDS can help not only to prove a hypothesis, but also to disprove one quickly. This is critical, as it allows enterprises to dispense with ideas that may not yield a reasonable return. Being able to determine these situations quickly is one way enterprises can be more agile, while avoiding unnecessary costs at the same time.
In the same vein as Hitachi Data Systems’ (HDS’s) approach to data storage, where the philosophy is “one platform for all data”, HDDS allows enterprises to have “one search for all data”.
By using a consolidated and universal interface, business users across organizational structures can collaborate and generate a true, enterprise-wide, 360° view that allows decisions to be made for the greater good of the enterprise rather than simply for the benefit of one organization, and perhaps at the cost of another.
One Search for All Data
HDDS can provide “one search for all data”. HDDS can capture metadata from known formats including X-Rays, EXIF data from digital images, the context of an email, as well as metadata from popular office productivity suites (such as Microsoft® Word, Microsoft Excel® and Microsoft PowerPoint®), etc. It can capture all types of metadata from almost any data object or file type.
For some users, data needs organization and context from the beginning. Often, the type of data that requires this is rich media data – such as digital images and videos. For example, in the healthcare vertical, this might include MRI or X-Ray images; in the media and
to be unstructured. But in order to gain perspective and context, these data need to be organized in a way that goes beyond a simple filename. Metadata (data that describes other data) is necessary.
For an X-Ray image, the metadata collected might include the body part being X-rayed, the patient’s identifier, gender, age, and perhaps codes that identify why the X-Ray was required in the first place. But the question is how to associate that type of information with the image. Most file systems do not provide the ability to have user-defined metadata. Most file system metadata is restricted to user, group, read-only, creation and modification dates.
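As a rough illustration of what user-defined metadata makes possible, the Python sketch below attaches free-form attributes to stored objects and then queries on them. It is a minimal model under stated assumptions: the `StoredObject` class, the `find` helper, the field names and the sample values are hypothetical, and it does not represent the HCP or HDDS interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A stored file plus free-form, user-defined metadata (hypothetical model)."""
    name: str
    metadata: dict = field(default_factory=dict)


def find(objects, **criteria):
    """Return names of objects whose metadata matches every given key/value."""
    return [
        obj.name
        for obj in objects
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative records only; field names and values are made up.
OBJECTS = [
    StoredObject("xray_0001.img", {"body_part": "chest", "patient_id": "P-1001",
                                   "gender": "F", "age": 54}),
    StoredObject("xray_0002.img", {"body_part": "knee", "patient_id": "P-1002",
                                   "gender": "M", "age": 31}),
    StoredObject("xray_0003.img", {"body_part": "chest", "patient_id": "P-1003",
                                   "gender": "M", "age": 67}),
]

print(find(OBJECTS, body_part="chest"))
print(find(OBJECTS, body_part="chest", gender="M"))
```

A filename alone could not answer either query; the descriptive attributes carried alongside each image are what make the images searchable by body part or patient characteristics.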
Minimizing Time to Insight
When it comes to search and discovery, time to insight is one of the key performance metrics. Time to insight, along with accuracy, allows business users to make well-informed, empirically supported decisions quickly.
Object stores like Hitachi Content Platform (HCP) store, share, synchronize, protect, preserve, analyze and retrieve file data from a single system, and extend the capabilities of traditional file storage solutions. HCP has a policy engine that automates day-to-day IT operations such as data protection, and it readily adapts to changes in scale, scope, applications, and storage and server technologies over the life of the data.
HCP’s multi-tenancy capabilities allow it to support a variety of simultaneous workloads, and its advanced metadata capture, update and search capabilities provide intelligence tools and bring structure to unstructured file data. This intelligence automates and facilitates deeper analysis of data. With the HCP Anywhere secure file synchronization and sharing option and Hitachi Data Ingestor (HDI), data can be collected and accessed from mobile devices and remote offices while being centrally managed.
Conclusion
Dark Data has a lot of potential value. It is just waiting for users to take advantage of this potential through search, discovery and Big Data. Users need systems that can help index and manage metadata in order to make Dark Data valuable.
With the multitude of different data objects and data types, HDDS is a leading solution in helping enterprises understand what data they are actually storing, and in assisting business analysts to leverage the untapped value within Dark Data.
Extending HDDS capabilities with the policy-driven HCP platform creates a “perfect storm” of scalability, manageability, “searchability”, and “discoverability” of data located anywhere in an organization.
With just a little effort, enterprises of all sizes will find Dark Data lurking around their storage systems – from USB thumb drives to monolithic storage systems. This data, for the most part, is preserved innocuously, for sentimental, historic, referential or regulatory reasons. Yet, this data holds the potential to generate new enterprise value, and create competitive advantage.
However, unless this data is properly indexed, so that intelligence can be applied to it, and value extracted from it, through search and discovery, Dark Data is simply costly.
Neuralytix believes that enterprises of all sizes will need to integrate solutions that enable data discovery into their datacenter, or risk becoming uncompetitive. Even in the negative case, where Dark Data represents a liability, enterprises can leverage HDDS for loss mitigation. When used properly, an investment in HDDS will yield significant return to the enterprise.
About the Author
Benjamin Woo is the Managing Director of New York-based Neuralytix, Inc.
During the course of his career, Mr. Woo has advised clients whose collective market capitalization is over US$1 Trillion. Prior to founding Neuralytix, Mr. Woo was Program Vice President of IDC's Worldwide Storage Systems Research, where he led a team of analysts responsible for advising clients on the evolution and trends related to data storage systems; the impact storage systems have on adjacent technologies including servers, software, cloud and virtualization; and best practices in go-to-market strategies related to the storage industry.
In addition to authoring thoughtful and provocative insight on the storage industry, Mr. Woo is a frequently sought speaker at industry and customer events worldwide and is frequently quoted in the leading business and technology presses. While at IDC, Mr. Woo also initiated the global Big Data research. Mr. Woo has keynoted Big Data conferences worldwide, advising leading and emerging vendors across the technology spectrum on how they can exploit the Big Data opportunity.
About Neuralytix™
Neuralytix is the global leader in IT research and consulting focused on advising vendors and business users on strategies that maximize competitive advantage and generate enterprise value.
Visit http://www.neuralytix.com to learn more.
Copyright
© Copyright 2013, Neuralytix, Inc. All rights reserved. Reproduction is forbidden unless authorized. For reprints, web rights, and consulting services please contact Neuralytix via email at [email protected]. HITACHI is a trademark or registered trademark of Hitachi, Ltd.