Applying a data-centric approach to eDiscovery, investigation,
cybersecurity, privacy and other enterprise information challenges
IT’S ALL ABOUT THE DATA
by Angela Bunting, Director of User Experience, Nuix
Executive summary
Asking questions of unstructured information
Litigation: What is our risk exposure?
Regulatory and audit requests: Who knew what, when?
Investigation: Who did what, when, and can we prove it?
Mergers, acquisitions and divestitures: Where is our intellectual property?
Records management: Are these documents company records?
Privacy and intellectual property protection: Where is high-risk and high-value data stored?
Cybersecurity: How did the bad guys get in and what did they steal?
Storage optimization: Do we need to keep this data?
With unstructured data, bigger isn’t better
Hard-to-understand data formats
Too much data
Important data stored inappropriately
Multiple tools and point solutions
“Frequent flyer” custodians
Delivering the answers to difficult questions
A unique technology advantage
Litigation
Investigation
Regulatory and audit information requests
Storage optimization and records management
Privacy and intellectual property protection
Cybersecurity
A data-centric approach and ways to use it
Workflow automation
Sharing the workload
Text and visual analytics
Living index
Information governance
Where to start?
About the author
About Nuix
EXECUTIVE SUMMARY
When it comes to understanding unstructured, human-generated data, most enterprises use different kinds of software to search and manage this data for various parts of the business. The legal department may have eDiscovery or contract management software; risk and compliance might have forensic investigation tools; records managers may have dedicated records management applications.
These tools are all trying to achieve the same thing in slightly different ways: They are asking difficult questions about the content of unstructured information. These questions include:
• What is our risk exposure?
• Who did what, when, and can we prove it?
• How did the bad guys get in and what did they steal?
• Are these documents or emails company records?
• Where is our intellectual property?
• Does this data we’re storing have business value or does it pose a risk?
• Where do we keep high-risk or high-value data and how can we find out if it escapes?
Providing the answers to these questions, quickly enough to be valuable to the business, is hard work. Unstructured data formats are much harder to search and analyze than databases or simple text. In this context, “big data” is a challenge, not an advantage. This is due to the massive volumes of unstructured data organizations create and store, and the large proportion of this information that is not relevant to the task.
This paper will demonstrate how Nuix’s patented parallel processing technology is unique in its ability to access unstructured data formats and provide the answers your organization seeks. You can apply a data-centric approach across diverse business areas including litigation, investigation, regulatory and audit information requests, mergers and acquisitions, storage optimization, records management, privacy, cybersecurity, and intellectual property protection.
We will also show how, having cracked open the content of unstructured data, your organization can develop processes and competencies that will reduce costs, improve efficiency and deliver new sources of business value.
ASKING QUESTIONS OF UNSTRUCTURED INFORMATION
The much overused and over-hyped phrase “big data” has at its core a promise: With enough data, processing power and intelligence, organizations can create insight, predict the future, make better decisions and gain a competitive edge.
This may be the case with structured numerical data and semi-structured simple text. Transactions, databases, sensors, social media and other data streams lend themselves to number crunching and predictive analytics. You can ask questions such as “How many air conditioners are we likely to sell if there is another week-long heatwave?” or even, with enough intelligence, “What proportion of people tweeted sarcastic responses to our hashtag campaign?”
However, when it comes to unstructured data (human-generated information found in emails, documents, photos and other formats), the questions become more complex and more challenging to answer. The questions are typically not about predicting the future but reacting to an event such as litigation, regulatory compliance or a cybersecurity breach. Often the biggest struggle is not getting enough data but having too much.
The types of questions organizations ask about their unstructured data include:
Litigation:
What is our risk exposure?
For legal counsel, the essential question in any litigation matter is: What is our risk exposure? Making informed decisions about this requires rapid access to the key facts. For technology or legal services staff, the more difficult question is: How do we find those facts?
For cases that involve more than a handful of custodians, the facts may be hidden among hundreds of gigabytes of emails, attachments, documents and other files. They may be in data from legacy systems your organization no longer uses, in formats you cannot easily search or in employees’ personal devices.
Most eDiscovery software cannot process large volumes of complex data within a reasonable time. This in turn makes it difficult for organizations to rapidly assess their situation and to respond to discovery requests by the deadline. Even if they have made considerable investments in eDiscovery software, they are likely to require external help for all but the smallest cases.
Regulatory and audit requests:
Who knew what, when?
It can be very difficult for organizations to answer, quickly and cost-effectively, what their staff knew about a particular issue raised by a regulator or auditor. The essential problem is the same as with litigation: An inability to find and extract all relevant information from masses of unstructured data. However, the stakes are often higher, with broader requests for information, shorter deadlines and more expensive consequences. Unlike litigation, there are limited avenues for settling the claim to avoid providing the requested information.
Records management:
Are these documents company records?
For records managers, the ease with which employees create, copy and share documents is a major impediment to effectively managing corporate records. The pace and volume of content creation are so great that records managers cannot apply classifications to everyone else’s documents, and employees don’t take the time to classify their own documents correctly.
Thus technology systems are required to search through unstructured content, identify likely company records, apply classifications based on the content and allow records managers to review the results. The huge volume of documents involved is an obvious challenge. So is the diversity of formats and locations where these documents are stored, including email, file shares, archives and content management systems such as Microsoft SharePoint.
Investigation:
Who did what, when, and can we prove it?
Internal investigations into matters such as fraud, employee misconduct and data breaches also require rapid access to key facts. Speed is essential to ensure you contain the problem. So is a broad insight into the context of the matter, enabling you to detect patterns of behavior.
Investigations usually draw from many of the same data sources as litigation and regulatory matters. Perhaps surprisingly, many organizations use an entirely different set of tools for this task, relying on processes developed by law enforcement organizations and based around data forensic tools.
These tools are important if your organization is required to prove in court that a certain person used a computer in a certain way at a certain time. However, they have significant shortcomings. They are designed to interrogate a single evidence source such as a computer hard drive or USB flash memory stick. They do this extremely thoroughly and very slowly.
Using this process, investigators must extract information from each evidence source in turn and then piece together the connections between them. With this counterintuitive approach, you need to have a very good idea of what you’re looking for before you start and hope you can manually match up the pieces at the end.
Mergers, acquisitions and divestitures:
Where is our intellectual property?
Information assets are a core component of the value equation in many corporate transactions. The seller must understand in detail where these assets are stored, while the buyer is concerned with what potential business risks they pose. After the transaction, the parties must be confident they have either acquired or divested all the relevant information assets.
Once again, this can be a struggle when these assets are stored in complex repositories. It is also made more difficult by the records management problems we discussed, in that high-value intellectual property may be stored “in the wild” and not appropriately classified.
Privacy and intellectual property protection:
Where is high-risk and high-value data stored?
Organizations in most jurisdictions around the world are subject to privacy legislation. Private data relating to customers and employees is the number-one target for online criminals, because of its high value on the black market. Intellectual property is another highly sought-after commodity.
Most organizations do their best to store this very important information in locations with strict encryption and access controls. However, this data can find its way outside the organization through several pathways, including:
• Network infiltration by malware or hacking
• Insider leaks
• Inadvertent exposure through misconfiguration of network or access controls
• Physical loss or theft of devices such as laptops.
The opportunities for a breach are multiplied when organizations are unsure if high-risk data is only stored in appropriate places, or if employees have made unauthorized copies or emailed it outside the organization.
Cybersecurity:
How did the bad guys get in and what did they steal?
Traditional perimeter security solutions alone can’t keep the bad guys out. Gartner analysts Peter Firstbrook and Neil MacDonald, in their report Malware Is Already Inside Your Organization; Deal With It, say “determined attackers can get malware into organizations at will.” They argue that organizations must assume they are already compromised and focus more of their efforts on investigation, incident response and monitoring capabilities.
Rapid, thorough and effective post-breach investigation and remediation requires insights into semi-structured log files and unstructured email, hard drives and file shares, as well as structured data sources such as databases and the Windows Registry. Only by combining data from all these sources, including deleted or obfuscated information, can organizations understand how the breach happened and the extent of the damage.
Storage optimization:
Do we need to keep this data?
Many organizations seek to reduce their storage costs or delay a major systems upgrade by migrating email or other data to the cloud. Often they are seeking to atone for the sins of the past, such as failing to implement or properly apply data retention rules or allowing large volumes of data to remain in the wild outside the control of content management systems. As a result, they have retained large volumes of data that may not have business value but might contain hidden risks.
Just as records managers struggle to apply classifications to data retroactively, it is equally difficult if not impossible for storage managers to decide which data has value, which must be retained according to regulatory requirements, and which is redundant, obsolete or trivial. As a result, many organizations end up lifting and shifting all their data to the cloud or new storage systems. In essence, they are not solving their storage management shortcomings, just kicking the can down the road a year or two.
WITH UNSTRUCTURED DATA, BIGGER ISN’T BETTER
To solve these disparate problems, most organizations invest in point solutions such as eDiscovery, forensic investigation, records management and information security applications. IT departments must provide technical and logistical support for many of these tools at
Even though the questions being asked of the data are quite similar, the answers are required by different parts of the business in language they are comfortable with. It’s difficult under these circumstances to expect a coordinated response. However, the end result is organizations using several different and very expensive hammers to drive the same nail.
Although the people in different parts of the business may not realize it, they all face common challenges. These include:
Hard-to-understand data formats
A lack of visibility into the content of unstructured data makes it difficult to answer questions, assign value or apply appropriate retention or data protection rules. This is especially the case for data stored in legacy systems and complex formats which do not easily lend themselves to searching, extracting and managing data. The trend for “bring your own” devices adds a new range of complex data formats from smartphones and tablets.
Too much data
The massive and growing scale of data makes it very hard to gain timely insights and sort the valuable from the irrelevant. This explosion of data often leads IT teams to seek archiving and lower-cost storage solutions. While these may solve short-term storage problems, they can also add to the challenge of searching and retrieving data rapidly when required.
Important data stored inappropriately
High-value and high-risk data is often stored in the wild as a result of employees failing to categorize the value of the data or making “convenience copies” and keeping them in inappropriate locations such as open file shares or their personal computers. This important data is intermingled with low-value and irrelevant information and there is no easy way to distinguish between the two types. In addition, having limited visibility into the data that is stored in the wild makes it hard to address behavioral or technical issues that lead to employees’ lack of compliance.
Multiple tools and point solutions
Organizations lack a single window through which they can search and analyze data across multiple sources or silos. In many cases, they use multiple tools to complete different processes or even different parts of the same process, making it necessary to copy and convert data multiple times. This adds to the storage burden by requiring more space for working copies of the data.
“Frequent flyer” custodians
In many organizations, a few individuals face repeated scrutiny relating to litigation, investigations, regulatory inquiries or audits. With most eDiscovery and investigation tools, the organization must treat each incident as an entirely separate case. This often involves re-indexing and searching the same data for very similar information.
DELIVERING THE ANSWERS TO DIFFICULT QUESTIONS
A common thread connects litigation, investigation, records management, privacy, cybersecurity and storage optimization: All require organizations to ask difficult questions of their unstructured data and receive comprehensive and timely answers.
Only one technology can solve all these issues.
Organizations including the United States Securities and Exchange Commission, the US Department of Health and Human Services and the European Union Directorate General for Competition have stated publicly that of all the technologies they evaluated, Nuix was the only one that could provide the insights they required, in the time frame they needed, across terabytes of data per day.
A unique technology advantage
At the heart of Nuix’s enterprise software is the patented Nuix Engine. It has a unique combination of load balancing, fault tolerance and intelligent processing technologies, which we have developed and refined over the past 15 years. These make it possible to search, analyze, categorize and manage massive volumes of unstructured data quickly, thoroughly and reliably.
Nuix has combined these capabilities with a deep understanding of enterprise information stores such as Microsoft Exchange, Lotus Notes, Microsoft SharePoint, email archives and compliance storage systems, cloud repositories including Microsoft Office 365 and Amazon S3, and common mobile device and social media formats. Most other eDiscovery and digital forensic investigation technologies ignore these data sources because they are technically difficult to access or burdensome to maintain. Nuix also works with an extensive range of forensic artifacts including deleted files, slack space, the Windows Registry, browser caches, link files and network capture files.
Nuix’s unique abilities make it suitable for all the business challenges we have discussed. Here’s how it works.
Litigation
Many enterprises use Nuix’s eDiscovery software to preserve, collect, process and analyze large volumes of unstructured data. This gives you fast access to the key facts and risk exposure of the case. You can apply multiple approaches including deduplication, clustering, topic modeling, text summarization and auto-classification to rapidly identify the most relevant documents.
Investigation
Internal investigators gain a single window into multiple evidence sources, including difficult formats such as archives and legacy applications. You can trace connections and interactions between multiple people and information sources. Rather than manually piecing together intelligence from each individual evidence source, you can automatically extract and correlate names, IP and email addresses, credit card and identity numbers, and sums of money.
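To make the idea of extracting and correlating identifiers more concrete, here is a minimal Python sketch. It is not a description of how Nuix software works internally. The regular expressions, the function names `extract_entities` and `correlate`, and the sample evidence sources are all hypothetical and deliberately simplified.

```python
import re
from collections import defaultdict

# Simplified patterns for illustration only; production-grade extraction
# needs far more robust rules and validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_entities(text):
    """Return the set of email addresses and IPv4-like strings in text."""
    emails = {m.lower() for m in EMAIL_RE.findall(text)}
    ips = set(IPV4_RE.findall(text))
    return emails | ips


def correlate(sources):
    """Map each entity to the sorted names of the sources it appears in,
    keeping only entities seen in more than one source."""
    seen = defaultdict(set)
    for name, text in sources.items():
        for entity in extract_entities(text):
            seen[entity].add(name)
    return {e: sorted(names) for e, names in seen.items() if len(names) > 1}


# Hypothetical evidence sources, already converted to plain text.
sources = {
    "laptop.txt": "Mail from alice@example.com, login from 10.0.0.5",
    "server.log": "10.0.0.5 connected; user bob@example.com",
    "phone.txt": "Text forwarded to Alice@Example.com",
}
links = correlate(sources)
```

In this sketch `links` connects the laptop to the phone through a shared email address and the laptop to the server through a shared IP address, which is the kind of cross-source connection an investigator would otherwise have to match up by hand.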
Regulatory and audit information requests
As with litigation and investigation matters, you can find the answers you seek quickly and comprehensively. Nuix customers tell us their ability to provide timely and specific responses to regulators and auditors leads to a much more positive relationship. As one investigator with HM Revenue & Customs in the United Kingdom put it, “If someone replies to a request the next day, you know they’ve not had time to tinker with the results. If they take six months, you wonder what they’ve been up to.” Once they see a pattern of receiving quick and thorough replies, regulators tend to ask more specific questions, requiring smaller and less onerous responses.
Storage optimization and records management
These tasks revolve around assigning value to information assets. With the ability to understand the content and context of each item in your unstructured data, it is much easier to apply rules and classifications—consistently and much more efficiently than with manual processes.
Privacy and intellectual property protection
Rapid scanning and pattern-matching technology makes it possible to identify high-risk private data such as credit card and personal identity numbers, and high-value intellectual property, across vast amounts of unstructured data. Conducting regular sweeps helps to ensure this data is only stored where it should be, thus minimizing the opportunities for data breaches or leaks. In the event of a breach, you can quickly identify the systems that were targeted and close security holes faster.
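As a rough illustration of pattern matching for high-risk data, the Python sketch below looks for card-like digit sequences and keeps only those that pass the Luhn checksum, a standard check used by payment card numbers. The candidate pattern, the function names and the sample text are assumptions made for this example; the sketch does not represent Nuix's implementation.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens. This is a simplification for illustration.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return Luhn-valid card-like numbers found in text, digits only."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# 4111 1111 1111 1111 is a widely used test card number; the order
# reference is a 16-digit number that fails the checksum.
sample = "Card 4111 1111 1111 1111 on file; order ref 1234567890123456."
found = find_card_numbers(sample)
```

The checksum step matters because long digit strings such as order references are common in business documents; filtering on the pattern alone would produce many false positives.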
Cybersecurity
Many organizations already use Nuix’s Investigator software in cybersecurity investigations and incident response activities. Nuix’s rapid scanning and pattern-matching technologies come to the fore in this context as well.
By collecting and processing log, event and time-sensitive data from structured, semi-structured and unstructured sources, you can quickly identify the source and extent of the breach, find out what happened next and start the remediation process. Nuix also uses techniques such as auto-classification and clustering to filter out large volumes of irrelevant data and highlight, summarize and visualize the key facts in a timeline of events surrounding the breach.
A DATA-CENTRIC APPROACH—AND WAYS TO USE IT
Nuix has worked with customers and thought leaders in the fields of litigation,
investigation, information governance and data management to develop processes
that help solve problems intelligently. Below are some ways our customers have used
Nuix’s versatile toolkit to improve their efficiency and cost-effectiveness.
Workflow automation
eDiscovery and investigation tools are highly complex, with large numbers of options for processing data in different ways. Inconsistent handling of evidence sources is risky, especially for matters that could end up in court. Using Nuix Director, organizations formalize processing workflows and settings into a template or series of templates. This means staff members with limited expertise can process data consistently and defensibly.
A national construction company was struggling to meet deadlines for discovery in a major litigation matter. Applying Nuix’s eDiscovery Workstation, Director and Web Review & Analytics technologies enabled the firm to create and streamline processes for data processing, culling, analysis and review, helping it to meet deadlines. The firm is now looking to use Nuix technology to locate and migrate intellectual property that has changed ownership after acquisitions.
Nuix’s Investigator Lab and Web Review & Analytics technologies make it possible to provide compartmentalized and secure access to case data for external parties such as lawyers and subject matter experts. It is easy to divide up tasks along whatever lines make sense, including date ranges, custodians, locations, languages or content types. For example, you could pass financial records to a forensic accountant or internet activity records to a technical specialist.
The Guernsey Border Agency and Guernsey Police, while investigating financial crimes such as corruption, money laundering and fraud, must often share case data with external subject matter experts such as lawyers. They also handle requests from overseas authorities under mutual legal assistance treaties. Using Nuix Investigator Lab and Nuix Web Review & Analytics, these agencies can easily provide secure access to case data for people who have the greatest expertise and case knowledge.
Sharing the workload
Nuix’s desktop, server and web-based applications all work from the same case file format. This means enterprises and service providers can hand over work at any stage of the discovery or investigation process as simply as transferring a Nuix case file.
For example, the New York office of a multinational bank needs to conduct data collection, culling and basic searching in-house because of the sensitive nature of the information involved. The bank uses Nuix collection and eDiscovery technologies to automatically collect highly specific data. Where required, it hands over the Nuix case files to its litigation support vendor for further processing.
Text and visual analytics
Customers use Nuix’s built-in text analytics such as auto-classification, clustering, topic modeling, text summarization, deduplication and near-duplicate management to search, understand, classify and minimize data sets.
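Of the techniques listed, exact deduplication is the simplest to illustrate. The Python sketch below groups items by a SHA-256 hash of their content and keeps the first copy of each. It is a generic illustration under assumed names (`content_hash`, `deduplicate`) and made-up sample items, not Nuix's method; near-duplicate detection is a different problem that needs similarity measures rather than exact hashes.

```python
import hashlib


def content_hash(data):
    """Return the SHA-256 hex digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(items):
    """Split (name, bytes) items into unique items and exact duplicates.

    Returns a tuple (unique, duplicates): unique is the list of names
    kept, and duplicates maps each dropped name to the name it copies.
    """
    first_seen = {}
    unique, duplicates = [], {}
    for name, data in items:
        digest = content_hash(data)
        if digest in first_seen:
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
            unique.append(name)
    return unique, duplicates


# Hypothetical items: two copies of one document and one unrelated note.
items = [
    ("inbox/report.docx", b"Q3 results draft"),
    ("share/report-copy.docx", b"Q3 results draft"),
    ("inbox/notes.txt", b"Call counsel on Monday"),
]
unique, duplicates = deduplicate(items)
```

Recording which item each duplicate copies, rather than simply discarding it, keeps a trail showing where every copy was stored.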
Nuix also provides interactive graphical tools including timelines, communication network diagrams, commonality network diagrams and trend, pivot and intersection charts. These make it easy to slice, dice and visualize data so you can quickly identify trends, locate information of interest and drill down to specifics.
A major global insurance firm was brought before a national industrial relations tribunal after dismissing an employee whom it accused of using company resources to start up his own business. Using Nuix’s timeline and trend charts, the insurer clearly demonstrated to the tribunal that the employee had a pattern of spending each afternoon sending multiple emails related to his new venture from his company email account during work hours. This helped convince the tribunal that the dismissal was fair.
WHERE TO START?
As we have seen, many crucial business processes rely on asking difficult questions of unstructured data and receiving detailed, timely answers.
In litigation, investigation and regulatory matters, you want to know who said or did what, and when. For records management, mergers and storage optimization you must ask about the age, ownership and value of millions or billions of documents and emails. To protect customers’ private data and your organization’s intellectual property, you must know where this critical information is stored, who has access to it and how it could get outside.
Organizations worldwide across many different industries use Nuix technology to provide these answers, because of its unique ability to make unstructured data accessible for searching, analysis and management.
At Nuix, we are confident our technology can help your organization solve all these issues—and more. But we also recognize that you likely have existing solutions in most or all of these areas. For example, we know Nuix is not a replacement for existing information security solutions such as firewalls and antivirus software. Rather, it provides a new way to minimize the data that is exposed to potential breaches and rapidly discover, investigate and patch cybersecurity gaps.
That’s why we recommend an incremental approach: Start with a single problem and compare Nuix technology to your existing software—or look to integrate it into your current solution set. If you succeed, look to apply the technology to solving other problems and helping other parts of the business.
Living index
Frequently litigated or investigated organizations use Nuix’s collection and discovery technologies to maintain a regularly updated index of all files and emails related to their frequent flyer custodians. The automated collection technologies conduct scheduled updates, adding only the most recent data to the index. This index is instantly searchable, eliminating the lag between when someone asks a question and when the organization can start finding the answers.
A global investment management firm uses Nuix technology to proactively monitor its traders’ email for indications of insider trading or other risks. It maintains a live index of metadata, but not content, for all staff members’ emails. Its compliance staff periodically extract the full text of recent emails for selected custodians and run a series of searches for key phrases that could indicate prohibited behavior. The same index also serves as a baseline for eDiscovery and investigation requirements, making it easy to track content owned by relevant custodians and rapidly extract the full text of their emails.
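The idea of a scheduled update that adds only the most recent data can be sketched with a toy inverted index that remembers the newest timestamp it has processed. The class `LivingIndex`, its methods and the sample messages are hypothetical, and this is only one simple way to approach incremental indexing, not a description of Nuix's design.

```python
import re
from collections import defaultdict


class LivingIndex:
    """Toy inverted index that skips items it has already processed."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of item ids
        self.last_indexed = 0             # newest timestamp indexed so far

    def update(self, items):
        """Index items given as (item_id, timestamp, text) tuples.

        Items with a timestamp at or before the previous high-water mark
        are skipped. Returns the number of items newly indexed.
        """
        added = 0
        newest = self.last_indexed
        for item_id, timestamp, text in items:
            if timestamp <= self.last_indexed:
                continue
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                self.postings[term].add(item_id)
            newest = max(newest, timestamp)
            added += 1
        self.last_indexed = newest
        return added

    def search(self, term):
        """Return the sorted ids of items containing the term."""
        return sorted(self.postings.get(term.lower(), set()))


index = LivingIndex()
mailbox = [
    ("msg-1", 100, "Quarterly forecast attached"),
    ("msg-2", 200, "Forecast revised"),
]
first_run = index.update(mailbox)       # indexes both messages
mailbox.append(("msg-3", 300, "Lunch on Friday?"))
second_run = index.update(mailbox)      # indexes only the new message
```

A single high-water mark is the simplest possible design; it would miss items that arrive late with older timestamps, so a real system would need a more careful way of tracking what has already been collected.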
Information governance
The proliferation of data is a major driver of costs in litigation, investigation and many other information-gathering activities. Some organizations are seeking to minimize their storage volumes by eliminating data that is duplicated, trivial, obsolete, past its retention period or even potentially harmful. Rather than waiting for a trigger event such as a lawsuit or an email migration, some organizations are initiating information governance processes to do this as a matter of course. Such information governance projects very quickly become self-funding through smaller litigation budgets, reduced storage spending and improved risk management. They can also become a source of business value as employees become more effective and organizations leverage the knowledge they have gained from understanding their own data.
After a divestiture, an international pharmaceutical company held around 300 terabytes of litigation data which it had collected and preserved over a decade. This data was held in various data sources, from file shares to CD-ROMs, for a large number of custodians still under litigation hold by both companies. Using Nuix software, the company organized the data by custodian and assigned it to the appropriate company so that both could continue to meet their preservation obligations. The Nuix software also applied deduplication
ABOUT NUIX
Nuix enables people to make fact-based decisions from the content of unstructured data. The patented Nuix Engine is the world’s most advanced technology for accessing, understanding and acting on human-generated information. Our unique parallel processing and analytics capabilities make small work of big data volumes and complex file formats. Organizations around the world turn to Nuix software when they need fast, accurate answers for digital investigation, cybersecurity, eDiscovery, information governance, email migration and privacy. And the list of uses is always growing because our customers expect us to make the impossible possible.
To find out more about Nuix’s data-centric approach
to enterprise information challenges visit
ABOUT THE AUTHOR
Angela Bunting, Director of User Experience, Nuix
Angela Bunting has worked with unstructured data technologies for more than 15 years. As Director of User Experience at Nuix, she is responsible for the company’s eDiscovery solution development, technical support, documentation and quality assurance. Angela is one of Australia’s electronic discovery pioneers, having worked in managerial and technical roles at law firm Mallesons Stephen Jaques and as technical lead at Law in Order, the country’s leading litigation support bureau.