Best Practices Series
The Future of Data Warehousing
IBM
PAGE 42 APRIL/MAY 2016 | DBTA
10 Ways to Embrace the Future of Data Warehousing
Data warehouses aren’t what they used to be. No longer do they arrive at the front door of data centers as sophisticated software loaded on large servers, tied to vast storage arrays. Today, a data warehouse, as with many other technology solutions, may come as a simple, browser-based interface connected to a third-party cloud service. Plus, the type of value the data warehouse delivers is evolving as it moves from a repository of historical information to a supplier of real-time analytics for high-volume, high-variety datasets. The following presents 10 ways to help begin or encourage the evolution of data warehousing in the enterprise.
1. STOP THINKING OF ‘DATA WAREHOUSING’ AS DATA WAREHOUSING
The data warehouse once was true to its name: a large facility to which large volumes of data were shipped and stored for later retrieval. That concept is rapidly becoming outdated. The idea of a large, centralized, physical storage facility has been fading in recent years, as the industry has recognized that a “data warehouse” may also be more virtualized or logical, a shared resource that resides closer to original data sources that are spread across the enterprise. The data warehouse will still serve its primary mission of providing for integration and transformation of key enterprise data targeted for use by business analysts. However, the location of the data, or even of the data warehouse solution itself, will increasingly become irrelevant.
2. MAKE YOUR DATA WAREHOUSE STRATEGIC
More than ever before, organizations recognize data as a strategic asset that provides competitive differentiation in their markets. Data warehouses, as big data analytics platforms, need to serve as purveyors of these strategic resources as well. The data warehouse environment can be the focal point for data integration, provide for fluidity of analysis, and facilitate data lifecycle management to help determine which data is most essential to enterprise decision making.
3. EMBRACE THE CLOUD
Cloud, specifically public cloud services, is valued by data managers as an approach to back up and recover data. Close to two in five managers and professionals say they are weighing the use of public cloud services as backup resources for their databases. One in four is interested in setting up their data warehousing with public cloud providers, as well as dev/test environments (“Cloud Steps Up as a Data Storage Platform: 2016 IOUG Survey on Enterprise Data Storage Trends,” March 2016). Due to the growing convergence of the data warehouse with the cloud, it’s feasible that data warehousing itself may become as simple to maintain and use as any other online service, reducing the manual data scrubbing, transformation, and integration that were previously required. In the process, data warehouses will be able to grow with the business. Cloud services will provide the functionality and scale required to serve any and all requirements. While cloud-based data analytics has only begun to gain a foothold in the enterprise in recent months and years, it may be inevitable that cloud will become the default platform for analytics.
4. WADE INTO THE DATA LAKE, BUT CAREFULLY
Does the data lake present a form of competition to the traditional enterprise data warehouse approach? Perhaps, but enterprises will need to proceed cautiously with data lakes, ensuring that proper governance and management strategies are in place to maintain security and compliance. The rise of adjacent technologies such as Hadoop as a repository for data in all its formats and file sizes facilitates the creation of the data lake, in which data can be ingested and maintained for current and future enterprise requirements. Data lakes serve data analysis needs here and now, not just future, yet-unknown data requirements, though they are ideally suited for those as well. Indeed, economics may favor data lakes. Data maintenance and storage costs may be minimal in these more open environments, compared to the higher costs involved in data warehousing, which supports well-integrated and transformed assets. However, data warehouses maintain years of organizational learning, intellectual property, and compliance standards in organized repositories that cannot be captured on the fly with data lakes.
5. POSITION DATA WAREHOUSES FOR BOTH REAL-TIME AND HISTORICAL ANALYSIS
Data warehouses will still serve their original role as historical repositories but are also increasingly incorporating real-time data. The operational data warehouse that supports real-time or near-real-time data as it moves through production has been around for some time now, but its role will expand rapidly as adoption of real-time analytics takes hold. For example, a customer service representative on a phone call may be able to extend offers to upsell or cross-sell based on a hybrid blend of historical data and the customer’s current transaction information.
6. THINK INTEGRATION
The key challenge to moving to a data-driven business isn’t the technology or data underneath, but decision makers’ confidence in the data with which they are working. The data warehouse isn’t just for structured or relational data moving through enterprises but also serves a role enabling unstructured or semi-structured data, the types that flow from NoSQL environments, XML stores, and even Hadoop environments. The role of the new data warehouse will continue to be that of the integration engine for the enterprise, especially as organizations rely on diverse types of data for analytics requirements. The data warehouse can serve as an assured integration point, bringing together various sources within a trusted, well-designed, and standardized presentation for decision makers. This new, wider purpose requires re-evaluation of how information is being brought together from various parts of an enterprise. Increased reliance on data to form business decisions requires a high degree of confidence in data, which the data warehouse can deliver.
7. MOVE AWAY FROM ETL GRADUALLY
Extract, transform, and load (ETL) approaches have long been the transitional tiers between original data sources and data warehouses. These technologies tend to be expensive and also slow down the availability of data to business users. Today’s business end user is accustomed to the speed of queries through Google and may perceive the time it takes to engage with data that is subject to ETL processes as cumbersome. Some sites are moving the transformation step of ETL closer to the actual data, or positioning it later in the process.
8. EMBRACE VIRTUALIZATION
Virtualized, or logical, data warehouses have been limited in the past in terms of capacity and transfer speeds from disk to memory. Emerging technologies such as in-memory processing and solid-state disk remove many of the constraints that hold back virtual data warehousing. In the process, data warehouses can continue their evolution from server boxes and storage racks to services delivered through private, public, or hybrid clouds.
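To make the idea of a logical data warehouse concrete, here is a minimal sketch in which data stays in two separate source systems and is combined only when a question is asked. The two in-memory SQLite databases, the table and column names, and the `revenue_by_customer` function are all hypothetical stand-ins for illustration, not part of any product discussed here.

```python
import sqlite3

# Two independent "source systems", each an in-memory SQLite database.
# All table and column names are hypothetical.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 75.0)]
)


def revenue_by_customer():
    """Answer a warehouse-style question by querying each source in place."""
    # Aggregate inside the billing system, close to where the data lives.
    totals = dict(
        billing.execute(
            "SELECT customer_id, SUM(amount) FROM invoices GROUP BY customer_id"
        )
    )
    # Look up names in the CRM system and join the two results at query time.
    names = crm.execute("SELECT id, name FROM customers ORDER BY id")
    return [(name, totals.get(cid, 0.0)) for cid, name in names]


result = revenue_by_customer()
print(result)
```

Nothing is copied into a central store ahead of time; the trade-off is that every query pays the cost of reaching the sources, which is where faster memory and storage help.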
9. MOVE BEYOND DASHBOARDS AND STATIC INTERFACES
Dashboards and standard reporting interfaces have been available for a number of years through most BI and analytics solutions. Users dealing with the volume and complexity of data moving through their organizations require ease of access, as well as the ability to quickly grasp patterns and implications. Data scientists and analysts are assigned the role of storyteller to provide perspective, but such analysis may take time. Decision makers will need tools that offer highly visual and in-depth examinations of information, such as 3D visualization. These new capabilities require consistent and rich data platforms that enable insightful analysis, along with the means to ask any question pertaining to any aspect of the business. Data warehouses need to be designed for this new type of workload, as data flows change to meet the needs brought on by real-time analytics.
10. DEVELOP AND NOURISH KEY DATA SKILL SETS
Many data warehouse vendors have promoted the notion that their solutions could be run with a minimum of administration. With cloud-based data warehouses, administration requirements may drop to zero. This is not the end of the story, however. With the rise of highly diverse data environments, and with a variety of resources that include data warehouses themselves, in addition to data lakes, Hadoop, and NoSQL databases, there is an ever-growing need for critical skills that span the gamut of technologies. The data scientist shortage is well-documented, but these diverse environments also call for key data management and analysis skill sets. As analytics grows, in scale as well as in importance to the enterprise, the challenge will be in finding the skills to enable this growth. Enterprises will need to encourage greater training from within and also develop new strategies to recruit the best talent.
The evolution of data warehousing is well underway, but it won’t happen overnight. Cloud services, virtualization, the emergence of the data lake, new options for integration, and other emerging technologies and approaches are key to enabling faster access to the increasing variety and volume of data pulsing through organizations today.
Sponsored Content
Why Cloud is the Future of Data Warehousing
Enterprise data warehouses remain as relevant as ever in today’s business environment. However, the traditional data warehouse is not up to the task of handling the flood of new data pouring in at an increasingly rapid pace. To maintain their competitive advantage, organizations must take action now to modernize the traditional data warehouse.
The key to modernizing is flexibility, as business requirements are changing at a faster pace than technologies. Historically, building a data warehouse has been a painstaking endeavor. You had to decide on specific data warehousing software and then determine and secure the proper balance of hardware and storage to allocate for it. When you needed to expand the data warehouse, you had to purchase new allocations of processing power, storage and software. At any point in the growth process, you were exposed to a host of potential issues. On top of this, you had to pay for capacity even if you were not yet using it.
ENTER THE DATA WAREHOUSE APPLIANCE
Data warehouse appliances helped alleviate much of the pain of building your own data warehouse. These systems came pre-configured and integrated the data warehouse for analytical performance. All you really had to do was pick out your model and size, and once plugged in and turned on, you loaded your data into the warehouse.
The allocations of hardware and software were already done for you, yet the appliance still required you to buy a new system when you hit your resource limits. Additionally, with the growth of new applications that capture information about customers, employees, products, and suppliers, maintaining the data warehouse appliance could get expensive as you dynamically scale and build the integration needed to support specific data types, data-processing and analytic needs.
THE FUTURE IS CLOUD DATA WAREHOUSING
Why not combine the power of a data warehouse appliance with the flexibility of the cloud? You can meet the demands of growing data volumes and unpredictable analytic workloads with ease through a pay-as-you-go, on-demand, and elastic scalability model that provides significant benefits for both the business and IT.
As a fully managed cloud service or software-defined edition (currently in early access preview), IBM dashDB™ transforms the traditional data warehouse with a hybrid architecture that is
A CLOUD DATA WAREHOUSE THAT IS ALSO FAST AND OPEN
The speed of gaining business outcomes from analytics is critical. What makes dashDB different from other cloud-based data warehouses in enabling the fastest time to value? A lot. First and foremost, it includes IBM technologies that have been proven in enterprises to meet high-performance analytic demands across petabytes of data, so users get lightning-fast answers. Second, dashDB includes support for a range of tools and technologies that enable users to quickly gain insight for their unique analytical needs.
UNIQUE IN-MEMORY DATA PROCESSING
At the core of dashDB is BLU Acceleration®, IBM’s in-memory processing technology. BLU Acceleration provides cutting-edge data processing features, removing the typical constraints of in-memory solutions, with:
• Advanced processing that does not require the entire dataset to fit in-memory
• Prefetching of data to keep necessary data in or close to the CPU
• No decompression required, because it preserves the order of data
• Data skipping, which intelligently determines which data would not qualify for analysis within a particular query and skips it for efficiency
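The data skipping idea in the last bullet can be illustrated with a generic sketch. This is not IBM's implementation; it assumes a simple scheme in which each block of column values carries a small minimum/maximum summary, and the `Block`, `build_blocks`, and `scan_greater_than` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A chunk of column values plus a small min/max summary of its contents."""
    values: list
    lo: int
    hi: int


def build_blocks(values, block_size):
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks, threshold):
    """Return the values above threshold and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.hi <= threshold:
            continue  # the summary proves no value qualifies, so skip the block
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


sales = list(range(1, 101))  # illustrative column holding the values 1..100
blocks = build_blocks(sales, block_size=10)
matches, blocks_read = scan_greater_than(blocks, threshold=85)
print(len(matches), blocks_read)
```

With these illustrative values, eight of the ten blocks are ruled out by their summaries alone, so only two blocks are scanned to find the qualifying rows.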
WIDE RANGE OF IN-DATABASE ANALYTICS FOR FAST MODEL DEVELOPMENT AND DEPLOYMENT
Building upon the performance of BLU Acceleration, dashDB also integrates IBM Netezza® analytics for fully integrated advanced in-database analytics. The same technology has its roots in IBM PureData™ for Analytics systems. What this means is that with dashDB, you get a myriad of predictive modeling algorithms built directly into the database. These algorithms are available whenever you want to use them.
Examples of the algorithms included with dashDB are:
• Linear regression
• Decision tree clustering
• K-means clustering
• Esri-compatible geospatial extensions
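As a rough illustration of what in-database analytics means, the sketch below fits a simple linear regression by letting the database compute the required sums, so that only five numbers leave it. It uses SQLite from Python's standard library as a stand-in; dashDB's own built-in procedures are not shown, and the `sales` table, its columns, and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
# Illustrative data that follows revenue = 2 * ad_spend + 1 exactly.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)", [(1, 3), (2, 5), (3, 7), (4, 9)]
)

# The database scans the rows and returns only the aggregates needed
# for ordinary least squares with one predictor.
n, sx, sy, sxy, sxx = conn.execute(
    """
    SELECT COUNT(*), SUM(ad_spend), SUM(revenue),
           SUM(ad_spend * revenue), SUM(ad_spend * ad_spend)
    FROM sales
    """
).fetchone()

# Closed-form least-squares estimates computed from the aggregates.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

In this sketch the final arithmetic still happens in the client; a database with built-in modeling functions would run that step next to the data as well.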
UTILIZE FAMILIAR BUSINESS INTELLIGENCE AND OPEN SOURCE TOOLS TO SPEED UP THE ANALYTICAL PROCESS
dashDB was developed with the open source and business intelligence ecosystem in mind. The dashDB service works natively with core IBM technologies like IBM Watson™ Analytics, IBM Cognos®, IBM DataWorks and others, yet it definitely does not stop there. dashDB was built to work with IBM’s myriad of business partners and BI tool sets including Looker, Aginity Workbench, Tableau and many others.
The open source community has enabled mass adoption of analytical tools such as R for advanced data analysis and graphical visualization. R can be used to analyze data from many different data sources, including external files or databases. dashDB integrates R for predictive modeling through an R runtime alongside the data to optimize performance. A web console can be used to load data and perform analytics within minutes. Data analysis could include SQL, BI tools, or R scripts and models.
To support the analysis of unstructured data sources, dashDB is integrated with the IBM Cloudant NoSQL database to automatically transform and push semi-structured data into dashDB with the click of a button. dashDB infers the right schema, manages the cloud-based integration service, and even synchronizes JSON documents online.
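Schema inference from JSON documents can be sketched generically as follows. This is an assumption-laden illustration, not the method dashDB or Cloudant uses: the `infer_schema` function, the type names, and the widening rules are all hypothetical, and only top-level fields are considered.

```python
import json


def infer_schema(documents):
    """Map each top-level field to a SQL-like type seen across all documents."""
    sql_types = {bool: "BOOLEAN", int: "INTEGER", float: "DOUBLE", str: "VARCHAR"}
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            # Nested objects, arrays, and nulls fall back to text in this sketch.
            inferred = sql_types.get(type(value), "VARCHAR")
            previous = schema.get(field)
            if previous is None or previous == inferred:
                schema[field] = inferred
            elif {previous, inferred} == {"INTEGER", "DOUBLE"}:
                schema[field] = "DOUBLE"  # widen mixed numeric fields
            else:
                schema[field] = "VARCHAR"  # conflicting types become text
    return schema


raw = [
    '{"id": 1, "name": "Acme", "score": 4}',
    '{"id": 2, "name": "Globex", "score": 4.5, "active": true}',
]
schema = infer_schema(json.loads(line) for line in raw)
print(schema)
```

Fields that appear in only some documents, such as `active` here, still get a column; a real service would also need to decide how to handle nesting and missing values.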
For more information, visit dashdb.com
IBM www.ibm.com
Sponsored Content
A Thoughtful Approach to Optimizing the Data Warehouse With Hadoop
While the enterprise data warehouse (EDW) continues to play a critical role in leveraging big data to drive business value, many EDWs are being pushed to their limits. This overload stems, in part, from the many processes, such as ELT, that have been moved into the EDW over the years. In fact, according to Gartner, data integration and transformation workloads consume as much as 80% of EDW capacity, and 70% of EDWs are capacity constrained.
HADOOP: THE NEW CENTERPIECE OF THE MODERN DATA ARCHITECTURE
In response to the challenges Big Data puts on traditional architectures, many organizations are shifting data and ELT workloads out of the EDW and into Hadoop.
Hadoop delivers great benefits to organizations seeking bigger insights and lower costs—and the ecosystem of related tools and solutions is constantly expanding to keep up as demands (real-time, self-service, etc.) and data growth (more sources, larger volumes) increase exponentially.
HADOOP COMES WITH CHALLENGES, TOO
A rapidly-evolving ecosystem results in more choice and innovation, but added complexity, uncertainty and risk can get in the way of realizing the full potential of Big Data.
So, how do you pick the framework that is here to stay? As the saying goes, the only thing constant in life is change, and even more so with Big Data.
Organizations want to take advantage of the powerful performance and scalability offered by the rapidly evolving compute frameworks, but they also want to be confident that the applications they invest in today won’t need to be rewritten, or require new skills, in 12 or 18 months.
KEEPING UP WITHOUT STARTING OVER
As a leader in enterprise software for more than 40 years, from Big Iron to Big Data, Syncsort knows a thing or two about adapting to change. That’s why we designed Syncsort DMX-h, our industry-leading Big Data integration software, specifically with these challenges in mind. The foundation of the design is intelligent execution, which allows users to visually design data transformations once and deploy them anywhere (across Hadoop MapReduce, Apache Spark, Linux, Unix, or Windows, on premises or in the cloud) while achieving high levels of security, performance and scalability by taking advantage of native integration. As new frameworks inevitably emerge, DMX-h users will be able to deploy those same transformations there, too.
Because there are no required code changes when deploying on different frameworks, users can design sophisticated data transformations focused solely on business rules, without worrying about the underlying platform or execution framework—and without needing a new set of skills. The user enjoys a consistent experience while still taking advantage of the powerful native performance of the evolving compute frameworks.
DON’T FORGET REAL-TIME USE CASES
The proliferation of tools and execution frameworks isn’t the only change to manage. Data sources and types have exploded as well. For example, we see increased demand for real-time analytics in industries such as financial services, healthcare, telecommunications and retail. The opportunities for these companies are significant. However, streaming telemetry data, real-time data from sensors and Internet of Things (IoT) use cases currently require managing different components in the technology stack. Businesses can also combine these real-time data sources with batch data to drive additional insights. But this, too, adds complexity.
For organizations to adopt new technologies and realize faster time to value, they need an easier way to manage all their processes with one tool. Many organizations are interested in using a single software environment for streaming and batch processing, while taking advantage of the emerging compute frameworks like Apache Spark for analytics and the speed and resiliency of Apache Kafka for low-latency, fault-tolerant services.
Syncsort DMX-h integrates with messaging frameworks including Kafka to bring streaming and batch together through a single user interface. By removing obstacles to building streaming analytics applications on Hadoop, our solutions help close the gap between batch and real-time workloads. Organizations will be able to move data at scale in real time across the enterprise to keep business systems in sync and speed up slow batch processes with streaming ETL.
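The general principle of writing a transformation once and applying it to both batch and streaming data can be sketched in a few lines. This is a generic illustration, not DMX-h or Kafka code: the `enrich` rule, the `run` function, and the `stream` generator standing in for a message consumer are all hypothetical.

```python
from typing import Iterable, Iterator


def enrich(record: dict) -> dict:
    """Business rule written once: flag large transactions."""
    return {**record, "large": record["amount"] >= 1000}


def run(records: Iterable[dict]) -> Iterator[dict]:
    """Apply the rule lazily, one record at a time.

    Because it only iterates, the same function works for a finite batch
    and for an unbounded stream.
    """
    for record in records:
        yield enrich(record)


# Batch input: a list that is fully available up front.
batch = [{"id": 1, "amount": 250}, {"id": 2, "amount": 4000}]


def stream():
    """Stand-in for a message consumer that yields events as they arrive."""
    yield {"id": 3, "amount": 1000}
    yield {"id": 4, "amount": 10}


batch_out = list(run(batch))
stream_out = list(run(stream()))
print(batch_out)
print(stream_out)
```

The business rule lives in one place, so swapping the input from a file to a message queue does not require rewriting it.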
SIMPLICITY IS KEY TO ROI
While much of the Big Data innovation is focused on performance and scalability, flexibility and simplicity might prove to be just as important for long-term success. A single software environment that handles a broad spectrum of use cases, from operational efficiency to transformative applications, and insulates users from change will be the key to maximizing ROI and achieving those Big Data dreams.
SYNCSORT