Building Successful
Big Data Solutions
Executive Summary
Whether you are a large multinational corporation or the smallest sole proprietorship, investing in and leveraging the widespread “Big Data”¹ revolution is no longer optional: data growth has outstripped the ability of people and 20th Century technology to make sense of it all. Differentiation and successful execution require a 21st Century approach to intelligent analytics, one that goes beyond counting and sorting methodologies and approaches all data automatically, whether structured or unstructured. The successful business requires tools which continuously learn and reveal actionable and unforeseen connections, while also being able to move flexibly between legacy data (which may reside in highly organized silos) and unstructured data generated in real time. Atigeo’s xPatterns™ intelligent Big Data platform is capable of providing the required level of analytics visibility into data, both structured and unstructured, against any application now and into the future.
According to McKinsey & Company², there is a growing shortage of both the data managers and the skilled data analysts needed to handle the continued exponential growth of unstructured data. Current technologies require multiple analyst touch points, volume limits and strict data policies, while today’s solutions must lead with complex and constantly evolving sets of open source technologies. However, solutions must go beyond the ability to store, manage, and retrieve copious amounts of data (that is simply the point of entry) and provide advanced analytics, which can lead to quantifiably more effective marketing and optimized operations. The question to ask is, “Does my solution just enable search, recommendations and classifications over large volumes of data, or does it also achieve the unprecedented relevance necessary for robust ROI?”
Additionally, the privacy, compliance and security of one’s data are paramount. As data explodes, these concerns explode with it; xPatterns was designed from the ground up to address them in a Big Data world. Merchants, governments, and hackers are all looking for ways to leverage personal data, and consumers are right to be wary about the shifting boundary between more services and less privacy. The question to ask is, “Does my Big Data platform secure my data out of the box?”
As noted above, the shortage of personnel is magnified by the investment of time required to retrain current IT staff, as the Big Data learning curve is significantly higher due to the number of technologies and components involved. Companies must decide: Do I train my personnel or do I partner with 21st Century technology?
While all of the above is challenging, Atigeo’s seven-year head start in addressing Big Data analytics has ensured that xPatterns is the appropriate application framework for building enterprise-grade, intelligent Big Data applications, which can be deployed on-premise or in the cloud with minimal IT support. The robust Software Development Kit (SDK) allows data scientists to easily plug in and experiment with best-in-class tools.
In summary, Atigeo’s xPatterns platform makes it easy to combine different kinds of intelligent components in Big Data applications, including those built by our partners and available in open source.
1 Big Data definition: the unprecedented growth in the volume, velocity and variety of data in our world.
2 The McKinsey Global Institute: “Big Data: The next frontier for innovation, competition and productivity” June 2011
The Big Data Opportunity
“Big Data” has been defined in terms of the unprecedented growth in the volume, velocity and variety of data; in addition, Atigeo believes it is necessary to address both the visualization of data in our world and its accessibility for all. Unstructured and semi-structured data is expected to account for 90% of newly created data going forward³. This opens up significant business opportunities to leverage Big Data through advanced analytics that tie directly into business processes and applications. As the following diagram presented by McKinsey & Company in September 2011⁴ shows, early adopters in this space have significantly outperformed their respective markets.
The question today is no longer about whether or not a company should invest in Big Data Analytics to stay competitive, but how to gain insight hidden beneath the surface of the data as well as lower the total cost of ownership and the time to market in order to increase their chance of success and return.
The difference between Big Data and traditional, smaller transactional data sets is that Big Data, being a far larger sample, yields more insightful patterns when advanced analytics such as statistical analysis, machine learning, data mining, natural language processing, information retrieval and predictive analytics are applied in automated ways. Without that automation, your ability to unlock the value of your data (relevance) will fall off dramatically, because there aren’t enough people on the planet to analyze and structure data at the pace of global data growth.
3 J.P. Morgan – Big Data Primer June 2012
However, a 2011 analysis by McKinsey & Company projected that the United States alone could face a shortage of 1.5 million data-savvy managers and analysts by 2018, leaving too few people to tackle the unstructured data relevant to enterprises. The diagram on the right summarizes the general trend we see, where the combination of massive amounts of data (volume), coming from multiple sources (variety) in real time (velocity), causes traditional approaches (in particular those that rely on human tagging, prioritization and analysis) to become ineffective or impractical. Atigeo believes that we are at an inflection, or decision, point where the growth of unstructured data overwhelms the growth in the number of analysts who identify structure within unstructured data. Thus, relevance falls off if no adjustment is made to handle unstructured data. This gap will grow exponentially for the foreseeable future. Hence, the collection, analysis and integration of Big Data into business operations must be automated, which requires an expanded portfolio of technologies.
For enterprises to capture the Big Data opportunity effectively, “accessibility” in their Big Data solution is paramount. Access is the means to democratize Big Data tools, empowering every employee throughout the organization to maximize the value of the data for the company instead of leaving the job to a small group of specialists.
Therefore, building successful Big Data solutions is about taking advantage of volume, velocity, variety and visualization through analytics and making it accessible to all.
Implementing Big Data Solutions
Our framework for Big Data analytics implementation, successfully applied across multiple verticals to date, confirms that the best solution requires a different technology mix per customer, substantial domain-specific knowledge and data, and multiple iterations of data-driven continuous improvement.
Until now, there has been no end-to-end solution that fits all these criteria; we expect to see tremendous advancements in technology in the next few years from both incumbents and new entrants. Companies must therefore consider a platform that is flexible in quickly adopting new technologies, for both distributed data processing and advanced analytics, as they become available and enterprise-ready. In addition, such a platform must also have the ability to comply with each company’s unique requirements while leveraging existing data and infrastructure.
Traditional database and analytics technologies served industry well until recently. In the 21st Century, we can take advantage of real-time advances available in Open Source, high-speed networks, breakthroughs in compute power and systems, and game-changing advances in intelligence technologies like xPatterns.
Introducing xPatterns
xPatterns is an application framework for building enterprise-grade, intelligent Big Data applications and an abstraction platform which can leverage all these advances from ISVs, the Open Source community, academia and fields such as NLP, machine learning and semantics. Our roadmap is guided by our belief that the opportunity to capture the value of Big Data lies in access, analytics and visualization. xPatterns democratizes current technologies by abstracting away the complexity of using, e.g., the open source Hadoop framework (Access) and by adding an ever-increasing set of proprietary and open source Analytics and Visualization tools to enable automated and easy manipulation of data to fit all business needs.
5 The McKinsey Global Institute: see footnote 2
xPatterns can be deployed either on-premise or in the cloud. It provides an SDK for data scientists to easily configure plug-and-play components and to experiment with best-in-class tools, reusing and integrating with the company’s existing assets. Data scientists can then directly deploy apps as web services or analytical jobs, providing a seamless transition from analysis to production. The runtime environment (Hadoop, NoSQL, search, etc.) is completely abstracted away, allowing for faster time to market, no need for in-house expertise and easy transition between underlying technologies.
xPatterns – what’s included out of the box?
Distributed Processing
• Scalable, reliable processing
• Scalable, reliable storage
• NoSQL (key/value) storage
• Pig & Hive queries
• Workflows & Scheduling
• High availability
• Backups
• Auto-scaling
• Search, filtering, faceting
• Real-time dataset updates
• Shared schema mgmt.
Advanced Analytics
• Natural language toolkit
• Supervised learning
• Unsupervised learning
• Concept extraction
• Ontologies
• Plotting & Visualization
• Information Retrieval
• Data Mining
• Scientific computing
• Predictive analytics
• Inference
Framework Features
• Create & deploy apps
• Scheduled workflows
• Data ingestion, push or pull
• Normalize, filter and de-dup incoming data
• Plug & play analytical tools
• Continuous measurement
• Automated feedback loop
• Data lifecycle management
• Logging & monitoring
• Personalization
On the next page is the xPatterns architectural diagram, consisting of the infrastructure layer, the horizontal and domain-specific intelligence layer, and the development and administrative environment layer. The framework is designed to give customers the flexibility to choose the right intelligence to solve their specific Big Data problem using a simple high-level programming language. While customers focus on business solutions, xPatterns takes care of the Big Data environment. Thus, xPatterns can lower the barrier to entry for any enterprise or application to take advantage of Big Data opportunities.
Intelligence Components
xPatterns makes it easy to combine different kinds of intelligence components in Big Data applications. Some of these components are open source, including popular Python libraries such as nltk⁶ for natural language processing, scikit-learn⁷ for machine learning and matplotlib⁸ for visualization. Additional intelligence components are those built by our partners, such as IBM’s SystemT⁹ and SystemML¹⁰.
6 http://nltk.org/
7 http://scikit-learn.org/stable/
8 http://matplotlib.sourceforge.net/
9 http://www.almaden.ibm.com/cs/projects/systemt/
10 http://www.almaden.ibm.com/cs/projects/systemml/
The third category of components comprises patented innovations that enable xPatterns to deliver better results, using algorithms developed by Atigeo and exposed through a rich set of APIs. Examples of these are:
Relevance: xPatterns Relevance takes a "relevance discovery" approach that delivers on the promise of deriving actionable intelligence from an enterprise's disparate sources of structured and unstructured data. xPatterns automatically creates and dynamically maintains semantic ontologies known as domain experts (DEs). At the core of the relevance technology is the creation of high-quality DEs in near real time.
The DE is built as a Relevance Neural Network (RNN) that maps relationships between a set of terms (i.e., semantic concepts) and related terms (output layer), intermediated by context (i.e., documents or articles). The network weights are initialized (or bootstrapped) with statistically optimal values based on frequency statistics. Thereafter, the weights are strengthened or weakened through training by live interaction with users, as well as with new data. This learning capability enables better relevance by leveraging the wisdom of crowds. The figure on the left below shows a depiction of a DE RNN; the figure on the right is an xPatterns visualization of network relationships, showing relevant documents for a concept and relevant concepts for a document:
A DE captures and represents relationships between concepts within a given domain. DEs are created automatically by analyzing and processing large bodies of unstructured text information about the domain. They can be leveraged to determine indirect semantic relationships between queried concepts and related concepts, and to facilitate understanding of the relevance of a specific document to a specific concept. DEs represent "IsAssociatedWith" relationships for domains, derived simply from reading and reviewing large bodies of unstructured text information about a given area of interest.
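The term-to-context-to-term structure described above can be illustrated with a small sketch. This is not Atigeo's implementation: it assumes TF-IDF as the "frequency statistics" used for bootstrapping, a plain multiplicative update for feedback, and hypothetical names (`build_network`, `related_terms`, `reinforce`) and toy documents chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def build_network(docs):
    """Bootstrap term -> document edge weights from TF-IDF (an assumed statistic)."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    weights = defaultdict(dict)  # weights[term][doc_id] -> edge weight
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            weights[term][doc_id] = (count / total) * idf
    return weights


def related_terms(weights, query, top_k=5):
    """Rank terms that share context documents with the query term."""
    scores = Counter()
    for doc_id, w_query in weights.get(query, {}).items():
        for term, edges in weights.items():
            if term != query and doc_id in edges:
                scores[term] += w_query * edges[doc_id]
    return [term for term, _ in scores.most_common(top_k)]


def reinforce(weights, term, doc_id, rate=0.1):
    """Strengthen (rate > 0) or weaken (rate < 0) one edge after user feedback."""
    weights[term][doc_id] *= 1.0 + rate


# Toy corpus (hypothetical), standing in for a large body of domain text.
docs = {
    "d1": "aspirin treats headache",
    "d2": "aspirin thins blood",
    "d3": "coffee causes headache",
}
network = build_network(docs)
print(related_terms(network, "aspirin"))
```

Terms that never share a document with the query receive no score, so only indirectly associated concepts surface; calling `reinforce` after a user confirms a result nudges the corresponding edge, which is the simplest possible form of the feedback-driven learning described above.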
Inference: xPatterns Inference delivers complex predictions from evidence. Combined with a Bayesian Model Averaging (BMA) approach to integrate user preferences embodied in a Bayesian network (BN), xPatterns Inference can provide higher accuracy even when collective preferences are sparse. The power of Inference is attributable to its ability to integrate evidence from different domains at various levels of scope in a scalable way.
Inference incorporates ontological information in the task of prediction. This information can be captured through a representation of DEs, thereby allowing the incorporation of unstructured information, which makes Inference particularly well suited to cold-start prediction scenarios.
Cold-start prediction describes a situation where the data sample is still small and forming, and there is not yet enough data to make predictions using traditional statistical models. An example is shown in the diagram below, where user A provided a small set of cuisine preferences and the task is to infer user A’s preferences for cuisines not listed. The algorithm takes into account the preferences of all users and the additional relationship weightings represented by Domain Experts to infer the likelihood of user A’s other cuisine preferences. This allows us to calculate with high confidence the probability that A likes Chinese food even if the preferences collected from the population are too small a sample, especially at the beginning of the sample collection process.
As the scenario evolves, personal or local evidence grows in tandem with population level behavior. This evidence may be structured or unstructured. The three main components of ontology, personal/local behavior and population level behavior combine to render optimally informed inference.
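To make the cold-start idea concrete, the sketch below combines the three kinds of evidence in the simplest way we can think of. It is deliberately not the BMA/Bayesian-network method described above: it swaps in a Beta-Bernoulli shrinkage estimate, and the function names, the `similarity` weights (standing in for DE relationship weightings) and the `prior_strength` parameter are all assumptions made for illustration.

```python
def ontology_prior(user_prefs, similarity, target):
    """Similarity-weighted average of the user's known preferences (0.0 to 1.0).

    `similarity[(known, target)]` plays the role of a DE relationship weight.
    """
    weighted = sum(similarity.get((item, target), 0.0) * liked
                   for item, liked in user_prefs.items())
    total = sum(similarity.get((item, target), 0.0) for item in user_prefs)
    return weighted / total if total else 0.5  # no related evidence: stay neutral


def cold_start_probability(prior, likes, trials, prior_strength=10.0):
    """Posterior mean of a Beta-Bernoulli model.

    While `trials` is small the ontology-based prior dominates; as population
    evidence accumulates, the observed like-rate takes over.
    """
    return (likes + prior_strength * prior) / (trials + prior_strength)


# User A has stated two cuisine preferences (1.0 = likes, 0.0 = dislikes).
user_a = {"Thai": 1.0, "Italian": 0.0}
# Hypothetical relationship weights between cuisines.
similarity = {("Thai", "Chinese"): 0.8, ("Italian", "Chinese"): 0.2}

prior = ontology_prior(user_a, similarity, "Chinese")
early = cold_start_probability(prior, likes=1, trials=2)       # sparse population data
late = cold_start_probability(prior, likes=500, trials=1000)   # plentiful population data
print(prior, early, late)
```

With only two population observations the estimate stays close to the ontology-derived prior; with a thousand it moves to the observed rate, mirroring how personal, ontological and population-level evidence trade off as the scenario evolves.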
Classification: xPatterns Classification infers type or class from complex information. Classification integrates structured and unstructured data into classification scenarios, which may have large scales in the volume of data, the size of the input space and the number of possible classes that may be inferred. Classification develops deeper understanding of unstructured data through processing natural language to decipher complex relationships. The deeper understanding enables qualities of sentiment, time and reference, which are applied to distinguish among subtly distinct classes.
For example, a domain-specific classification tool is incorporated for healthcare professionals, leveraging the Unified Medical Language System® (UMLS®) from the US National Library of Medicine as well as International Classification of Diseases (ICD-9 and ICD-10) data to return precise classifications for unstructured medical text.
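As a rough illustration of inferring a class from unstructured text, the following sketch trains a tiny multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is our stand-in technique, not the xPatterns classifier; the class name `NaiveBayesText`, the example notes and the labels ("cardiac", "orthopedic") are hypothetical and are not real ICD or UMLS codes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial naive Bayes over whitespace tokens, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per class
        self.term_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


notes = [
    "chest pain and shortness of breath",
    "crushing chest pain radiating to arm",
    "fractured wrist after fall",
    "ankle fracture with swelling",
]
labels = ["cardiac", "cardiac", "orthopedic", "orthopedic"]
model = NaiveBayesText().fit(notes, labels)
print(model.predict("patient reports chest pain"))
```

A production system for medical text would draw its vocabulary and classes from resources such as UMLS and ICD rather than from four toy sentences; the sketch only shows the shape of the text-to-class step.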
Cooperative Distributed Inferencing (CDI): xPatterns CDI is a new paradigm for Inferencing and Optimal Control in real time. It is a distributed optimization approach with built-in synchronization that continuously optimizes over all types of rules, both soft and hard. The paradigm converts inferencing over multiple knowledge bases from exponential complexity to polynomial complexity. Constraints are then built with a Pareto strategy that synchronizes the different rules to form a converging optimal result.
The applications of this inferencing model are vast. One example is optimizing the power grid, which has multiple knowledge bases and rules that are not all taken into account by outdated algorithms. This leads to local ad hoc adjustments and empirical corrections, which are suboptimal. The figure below shows the current model and an xPatterns CDI model.
In summary, the three main differentiating points for CDI are:
1. Deals with large rule sets in real time
2. Expresses a variety of rules and constraints with optimization
3. Distributed cooperation between independent nodes, without needing trust among nodes
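The distinction between hard and soft rules, and the role of a Pareto strategy, can be sketched on a toy power-dispatch problem. This is a single-process illustration under our own assumptions: it does not reproduce CDI's distribution across nodes or its complexity reduction, and the rule functions, plant limits and cost figures are invented for the example.

```python
def satisfies_hard_rules(candidate, hard_rules):
    """Hard rules are constraints that must all hold."""
    return all(rule(candidate) for rule in hard_rules)


def dominates(a, b):
    """Cost vector a dominates b: no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates, hard_rules, soft_rules):
    """Keep feasible candidates whose soft-rule costs no other candidate dominates."""
    feasible = [c for c in candidates if satisfies_hard_rules(c, hard_rules)]
    costs = [tuple(rule(c) for rule in soft_rules) for c in feasible]
    return [c for i, c in enumerate(feasible)
            if not any(dominates(costs[j], costs[i])
                       for j in range(len(feasible)) if j != i)]


# Candidate dispatches: (MW from plant 1, MW from plant 2). All values hypothetical.
candidates = [(80, 20), (60, 40), (50, 50), (90, 10), (80, 40), (40, 50)]
hard_rules = [
    lambda c: c[0] + c[1] >= 100,  # must meet a demand of 100 MW
    lambda c: c[0] <= 80,          # plant 1 capacity limit
]
soft_rules = [
    lambda c: 2 * c[0] + 3 * c[1],  # fuel cost to minimize
    lambda c: 5 * c[0] + 1 * c[1],  # emissions to minimize
]
front = pareto_front(candidates, hard_rules, soft_rules)
print(front)
```

Hard rules remove infeasible dispatches outright, while the soft rules are traded off against each other: the surviving candidates are those for which no alternative is cheaper and cleaner at the same time.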
Personas: The unique xPatterns privacy model makes it possible for individual users to create, build and control their own digital “personas.” These anonymous, secure profiles keep users’ identities completely private while accurately reflecting their interests and behaviors in the digital landscape. In this way, it becomes possible to deliver highly relevant, personalized content and experiences to individuals without learning those individuals’ actual identities; instead, only their relevance scores are visible.
Here is how the xPatterns persona module works:
• All content types are given a relevance score based on the personalized attributes of the user
• The user profile can be initialized from existing enterprise data sources
• Profile attributes can be dynamically updated from real-time inferred or explicit behavioral data
• Applications can be designed to give consumers full management of their personas
• Persona attributes are unstructured, meaning they don’t have to be selected from static lists
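The persona steps listed above can be sketched as follows. The scoring rule (cosine similarity) and the update rule (an exponential moving average) are our assumptions, as are the function names and the example attributes; the source does not say how xPatterns computes relevance scores.

```python
import math


def relevance(persona, content):
    """Cosine similarity between free-form persona attributes and content terms."""
    shared = set(persona) & set(content)
    dot = sum(persona[key] * content[key] for key in shared)
    norm = (math.sqrt(sum(v * v for v in persona.values()))
            * math.sqrt(sum(v * v for v in content.values())))
    return dot / norm if norm else 0.0


def update_persona(persona, observed, rate=0.2):
    """Blend new behavioral signals into the persona without mutating the original."""
    updated = dict(persona)
    for attribute, signal in observed.items():
        updated[attribute] = (1 - rate) * updated.get(attribute, 0.0) + rate * signal
    return updated


# Attributes are arbitrary strings, not picks from a static list.
persona = {"trail running": 1.0, "jazz": 0.5}
running_article = {"trail running": 1.0}
opera_review = {"opera": 1.0}

print(relevance(persona, running_article), relevance(persona, opera_review))

# An inferred or explicit signal (e.g. the user reads opera reviews) shifts the persona.
persona_after = update_persona(persona, {"opera": 1.0})
print(relevance(persona_after, opera_review))
```

Note that the application only ever sees attribute weights and the resulting scores; nothing in the persona dictionary identifies the person behind it.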
NLP-P (Natural Language Pre-Processing): Atigeo has a set of healthcare domain-specific natural language processing tools built on top of existing open source projects and multiple reference sources. The pre-processing, which can be applied to any domain, consists of body and sentence extraction, negation tagging, normalization, lemmatization and removal of stop words. It is used to improve the overall relevance of xPatterns both when corpora are generated and at query time.
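A minimal sketch of such a pre-processing pipeline is shown below, using only the Python standard library. The cue and stop-word lists are tiny hypothetical samples, negation scope is naively extended to the end of the sentence, and lemmatization is left out because it needs a lexical resource (for instance a lemmatizer from a library such as nltk).

```python
import re

# Small illustrative lists; real pipelines use much larger, domain-specific ones.
NEGATION_CUES = {"no", "not", "denies", "without"}
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "is", "has", "any"}


def split_sentences(text):
    """Sentence extraction: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def preprocess_sentence(sentence):
    """Normalize, tag negated tokens with a NOT_ prefix, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())  # lowercase, strip punctuation
    output, negated = [], False
    for token in tokens:
        if token in NEGATION_CUES:
            negated = True  # everything after the cue in this sentence is negated
            continue
        if token in STOP_WORDS:
            continue
        output.append("NOT_" + token if negated else token)
    return output


def preprocess(text):
    return [preprocess_sentence(sentence) for sentence in split_sentences(text)]


result = preprocess("Patient denies chest pain. She has a fever.")
print(result)
```

Negation tagging matters in clinical text because "denies chest pain" and "chest pain" should not score as equally relevant to a chest-pain query; prefixing the negated tokens keeps them distinct in any downstream index or model.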
Applications
Atigeo has been working with several partners to solve their important real-life Big Data analytics challenges. The following are some examples:
xPatterns Clinical Auto-Coding: Hospitals often have an under-coding problem: they do not bill insurance companies correctly and so are not paid an accurate amount. Hospitals are facing a shortage of trained staff to translate Electronic Medical Records (EMRs) into the required ICD-9 and ICD-10, CPT, HCPCS, APC Grouper, Charge Master, DRG codes and more. Atigeo has developed an intelligence system to automatically suggest correct codes for any number of EMRs. We are also able to take big data sets of past EMRs and run them through our intelligence system to perform an audit or add more accurate codes, creating a complete view of actual clinical services for compliance or research purposes. In addition to NLP, the product assembles multiple intelligent methodologies, including inference, classification, ontology and machine learning, that differentiate Atigeo from its competitors.
Research - Document Discovery: As our analytics algorithms are specially designed to solve unstructured data relevance problems, we have applied them to a large set of unstructured text documents as our first Big Data usage scenario. We processed gigabytes of medical research documents (PubMed) and patents (USPTO) by assigning relevance scores and generating domain concepts. Users can submit search queries to find relevant documents organized in clusters. The platform continues to improve through applying machine learning to users’ interactions with the documents.
Through xPatterns Relevance, document discovery is no longer a linear search problem. We have developed a visualization tool that allows users to easily navigate among clusters of many relevant documents and sometimes even discover relevant concepts and documents that are not obvious from the original search query.
Clinical Analytics/Intelligence as a Service: Atigeo developed a clinical intelligence layer on the xPatterns framework. With easy access to pre-loaded medical domain toolboxes in the cloud, users can run analytics against their own large data sets, such as EMRs. xPatterns’ analytical toolset allows users to do natural language data mining, correlations, etc. to find insightful patterns on a given research topic. For this specific use case, there are tremendous benefits to leveraging a cloud service, which xPatterns supports. Benefits include:
1. Scalability and agility: Initially, processing Big Data requires a large number of servers, which will then not be required once the data is processed and the results stored. Cloud services provide the flexibility for scaling up and down as needed. Leveraging the cloud, an enterprise can optimize their processing power without waste.
2. Deployment and maintenance cost: Upfront investment is high for infrastructure deployments and skilled staffing. The cost of keeping up with the latest software is expensive when advancement is happening very rapidly in this space.
3. Time: Cloud flexibility takes deployment time out of the equation, and it also gives an enterprise the ability to control turnaround time for output.
Conclusion
The question today is no longer about whether or not a company should invest in Big Data Analytics to stay competitive, but how to gain insight hidden beneath the surface of the data as well as how to lower the total cost of ownership and improve time to market in the face of these challenges. Big Data brings both opportunities and challenges; xPatterns meets them by lowering barriers to entry in the Big Data space, taking away the complexity and advancing the insight. Infrastructure and talent acquisition should not be any enterprise’s major concern. The focus should be on the solution, which means Atigeo’s xPatterns is the enabling “Big Data Intelligence platform” for the 21st Century.