Next-generation Data Management Platforms for Predictive Analytics
Dr. Michael Zeller of Zementis interviews Dr. Raghu Ramakrishnan of Microsoft
Raghu Ramakrishnan, Ph.D.
Technical Fellow; Head, Big Data Platforms and Cloud Information Services Lab
Microsoft
Dr. Ramakrishnan shares his perspectives on the future of machine learning and artificial intelligence, and describes how cloud architectures can provide advantages when working with big data and predictive analytics. He also highlights some of the work that he and his team at Microsoft are doing in predictive analytics using Microsoft's Azure cloud.
MZ: Hello, and welcome to this Zementis video chat on the topic of real-world use cases for predictive analytics, a very exciting topic. Today it is my pleasure to welcome Dr. Raghu Ramakrishnan to share his perspectives about predictive analytics, data mining and machine learning, and today with a particular focus on the next generation of data management platforms. Not only will we cover the data science and the technology behind the next-generation platforms, but we’ll also explore the business implications of these technologies.
But, before we begin, please allow me to introduce Dr. Ramakrishnan, who has a rich professional background that straddles (the) corporate as well as academic worlds. Since 2012, he has been with Microsoft, where he leads their Cloud and Information Services Lab. He also leads the engineering for Big Data Platforms and serves as a Technical Fellow. He came to Microsoft from Yahoo!, where he served as a Yahoo! Fellow for 6 years. Prior to Yahoo!, he founded QUIQ, a company that developed innovative collaborative customer support and knowledge management solutions.
Previously, Raghu was a Professor of Computer Sciences at The University of Wisconsin, in Madison, and his teaching and research focused on database systems, emphasizing data retrieval and integration, integration analysis, and mining. And he often collaborated with researchers in industry.
He and his group developed scalable algorithms for clustering, decision-tree construction, and itemset counting, and they were among the first to investigate mining of continuously evolving and streaming data, a very hot topic today. His work on query optimization found its way into several commercial database systems, and his work on extending SQL to deal with queries over sequences influenced the design of window functions in SQL 99.
As if this is not enough, he also serves in ACM SIGKDD (the ACM Special Interest Group on Knowledge Discovery and Data Mining [2]), where he has received numerous awards, and he co-authored one of the most popular textbooks on Database Management Systems, also known as the “Cow Book”, and it's currently in its third edition.
So, Welcome, Raghu.
RR: Thank you, Michael.
MZ: A pleasure to have you here!
RR: Likewise. A pleasure to be here!
MZ: Let me start by citing a recent article in the Harvard Business Review [3]. It really caught my attention, since it seems to offer a very interesting statistic that seems to frame the debate… the current debate about prospects for machine learning and AI. So let me share with you: the authors surveyed a large group of CTOs and CIOs, as well as other senior executives, and asked what percentage of them believed that technology will be able to capture and meaningfully utilize critical, experience-based knowledge. (We call that “deep smarts”, maybe in reference to deep learning technologies that we look at as scientists.) But, surprisingly, 71%, 71%! of the executives gave the answer “Partially”, while only 4%, only 4%! said it's “Almost Completely”. So, this doesn’t really seem like a very strong vote of confidence for machine learning and AI!
And, if I may contrast that with Stephen Hawking’s perspective (of course that's the other extreme here), he said “The development of full artificial intelligence could spell the end of the human race” [4]. Of course that's a little controversial, but I wonder if in the business world our technology executives – CTOs and CIOs – may not really grasp the full impact of all the technology in deep learning, machine learning, predictive analytics that we see today. So, what is your perspective on that?
[2] http://www.sigkdd.org/
[3] https://hbr.org/2014/11/artificial-intelligence-cant-replace-hard-earned-knowledge-yet
[4] http://www.bbc.com/news/technology-30290540
RR: So, I think there are two distinct phenomena at play here, Mike. First off, if you look at companies like the web companies, their basic task is to either present content the users are looking for, or to monetize that user’s attention by showing them an ad. And both of these involve selecting from many choices. In search, you are selecting from many different URLs you could display as a search result. In any kind of portal scenario, you are trying to choose from a range of articles you could show. Or ads, of course, you are selecting which to show. These are, fundamentally, tasks where we are predicting what the user will respond to.
So, they needed to be good at prediction as a condition for being in business, right? And they also had access to enormous amounts of data, the logs of what people did when you showed them an article or when they saw an ad. And you could learn from this and it could get better and the economic incentive was so high, they invested heavily and they made it work. This has been a journey 10 years in the making.
Today, there is a certain part of the world that totally understands the value of data and data-driven action. If you take the typical enterprise, they understand the value of data as well, but from a slightly different standpoint. For one thing, they all understand that transactional systems are the backbone, the bookkeeping. But when you consider using that data to guide future actions, future decisions, they are still, for the most part, leveraging reporting and retrospective analysis.
They are now in the early days of using data to look ahead, and some of what we hear is this early phase in the journey. Some of what we hear is just the plain difficulty of operating on large amounts of data to gain insight. If you take the web companies again, their stakes are so high, a tiny percentage increase in click rates means hundreds of millions of dollars. So they were able to invest.
Where the gain is not quite that sharp, it takes a lower barrier to success to get more people to be engaged, and right now frankly, the burden of setting up systems that bring together a lot of data, maintain them, allow you to cleanse them, allow you to bring business logic to bear on them and make meaningful decisions, is simply high and there aren't enough qualified people to go around. It's not that enterprises are not desirous of doing this. There is a practical gap in the number of people who are available… who can help in all phases of this task.
MZ: Excellent, excellent point here. I absolutely agree. You look at the companies that have embraced the technology, you look at maybe manufacturing companies, which are just starting to embrace it, but the Internet of Things is (I think) one of the hottest topics today. So, you brought up exactly a very good point. Let’s turn a little bit to the technology and the architecture for a moment. You have, over the years, worked with many different platform technologies in your career, and in recent years we really have seen the emergence of cloud as a viable architecture. So, in that context, what does an enterprise-scale global cloud architecture bring to the table for data analytics, for lowering the barrier of entry for predictive models, for machine learning and for big data in general?
RR: It's a great question. I think the cloud has the potential to greatly simplify some of the challenges. So, as opposed to buying a collection of machines, installing software and then administering them, you have an option of renting that software preinstalled, operational with effectively a bunch of administrators who operate the servers for you. This takes away some of the mechanical burdens. It could also in principle take away some of the more challenging operational concerns, like security (and) compliance certifications. At the end of the day you still have the opportunity and the responsibility of understanding your business, understanding your data, bringing your tools to bear to make sense of it and to get some insight. But, what the cloud can do, I think, is reduce some of the avoidable friction. However, there is another thing to keep in mind.
Historically, people think in terms of data warehousing as part of the journey towards insights. And data warehousing for the past 20 years has meant relational data warehousing. Today that's no longer the case. While relational warehouses continue to thrive, increasingly people want to look at any and all forms of data: weblogs, documents, imagery, along with tabular data. Sometimes tabular data that is from transactional systems; in other cases, transactional tabular data that is distilled from some of these other sources of data. But over and above the tabular data, they want to deal with unstructured, partially structured data. In addition to the cloud helping to avoid some of the friction with analysis, part of the challenge people face, is… there is another revolution going on.
Historically, people create so-called data warehouses as a staging area for data analysis, and this has typically meant relational data warehouses. However, increasingly, while relations continue to be very important, both relations that come from transactional stores and relations or tables that are distilled from all these diverse types of data that we talked about, people want to directly intermingle analysis of relational and unstructured, loosely structured, multimedia data. So this is fundamentally a departure from how they have framed the goal of analysis. Along with the diversity of data and the enormously greater scales of data, you are seeing diversity in the types of analysis they need to be able to carry out. And this has led to interest in a new class of systems called distributed systems (such as Hadoop), so people are now simultaneously grappling with how to wrap their heads around this new class of systems, new class of heterogeneous data warehouses and heterogeneous analysis at the same time that they are trying to move to the cloud.
MZ: Yes, sometimes I have heard that called a "data lake" in the cloud.
RR: Yes, and in fact we riffed off of that in naming our new store product on Azure. It is a service called Azure Data Lake.
MZ: Are we where we should be with cloud technology, then? Is it abstracting away enough of the nitty-gritty details of tweaking machines and looking at virtual boxes?
RR: I think we are on a path that's the right trajectory. We’re not there yet. I would say that today, by far the majority use of cloud is as a means to rent infrastructure, whether it's compute cycles or storage. What we need is a transition to renting higher-level services and eventually vertically integrated solutions. And, we are seeing that happening. All the major cloud vendors, they are now offering services such as database services. You can have transactional databases or even analytic stores now that you can rent, which is a far cry from simply renting blobs and virtual machines.
MZ: And then talking about data and the cloud, I think no discussion of a cloud environment is complete until you touch on the topic of security. We have seen very high profile data breaches in the last few years. They are happening more often and they are getting more severe. So, is the cloud kind of an additional risk, or maybe is it actually an opportunity for us?
RR: It's a fantastic question. I would have said that, say, 3 years ago, the answer would have been that most people viewed the cloud as a risk… that they could secure their data more effectively themselves. After the rash of high-profile breaches, I think that is shifting. If you look at the large cloud providers, they are investing enormously in security, because they must, and that's their core competence. And as this becomes recognized, I think people are increasingly more comfortable placing their data in the cloud. But it's not just security. As you well know, compliance is a big, big issue, and in many verticals, there are critical compliance certifications that must be achieved before you can use systems. And as cloud providers get certified, this provides a very attractive alternative to doing it yourself and going through all of that stress. So, I think, 3 years from now, most people will view compliance certifications and security as reasons to move to the cloud.
MZ: That's actually a fantastic point. So, I guess with that prediction, maybe in a few years we will see an inflection point where it becomes better to move to the cloud to shed the responsibility, to a certain extent to rely on someone who has more capacity, more expertise in security, in compliance to really take on these extremely important tasks to secure your enterprise. Because it’s, as we know, a high risk. Reputational risk and losing the data is never, never a good idea.
I think in closing I would say thank you very much for your time today. This conversation was extremely insightful, and I hope it also helped our viewers to deepen their knowledge about analytics, advanced analytics (and) how cloud technology and similar technologies really can help them in their business endeavors.
And to you, our viewers, let me close with a quote from another person who knew a thing or two about practical applications of science, it is Albert Einstein.
Albert Einstein once said: “Information is not knowledge”. Let that sink in a little bit: "Information is not knowledge". So, as we sometimes really get caught up in the hype of big data, let's not forget that the goal here is not to collect data but really to derive knowledge from it, which ultimately will lead to better business decisions.
So thank you, thank you again for joining us today! To learn more about predictive analytics and various practical use cases, please visit our website at www.zementis.com
About Zementis
Zementis, Inc. provides software solutions for predictive analytics. The company was founded on the principle that data science teams and IT departments can collaborate seamlessly and efficiently, allowing predictive models to rapidly move from development to deployment, so that businesses and other data-centric organizations can easily incorporate predictive analytics into their routine operations. Agile deployment of predictive solutions is the cornerstone of the Zementis philosophy.
CIO Review recognized Zementis as one of the “Top 20 most promising Big Data companies in 2013”, and Gartner named Zementis a “Cool Vendor in Data Science” in 2014. Its ADAPA® and Universal PMML Plug-in (UPPI) scoring engines are designed from the ground up to benefit from open standards and to significantly shorten the time-to-market for predictive analytics in any industry. Customers such as Bosch, FICO, Equifax and Western Union have used Zementis solutions successfully to enhance their predictive analytics capacity and capabilities.
Zementis partners with leading analytics and data warehouse solution providers to enrich and extend customer capabilities. Supported partner solutions and platforms include: Amazon Web Services, Apache Software Foundation (Hadoop, Hive, Spark, Storm, Tomcat), Cloudera, Datameer, FICO, Hortonworks, IBM (BigInsights, PureData / Netezza, WebSphere), MapR, Microsoft Azure, Oracle WebLogic, Pivotal Greenplum, RedHat JBoss, SAP (HANA, Sybase IQ), Teradata and Teradata Aster.
For more information, please visit www.zementis.com.