
RESEARCH PAPER

A total data management approach to big data

How companies are developing a comprehensive data management strategy to encompass big data along


This document is property of Incisive Media. Reproduction and distribution of this publication in any form without prior written permission is forbidden.

Contents

Executive summary
Introduction
Types of data
Understanding the data
The big data platform
Benefits of the big data approach to TDM
Survey perspectives on big data benefits and challenges
Sought-after features in big data tools
Implementation and future trends
Conclusion


Executive summary

In recent years attention has increasingly focused on the corporate data that lies outside conventional structured enterprise databases. As the realisation has grown that this data is much larger in volume than previously thought, and potentially just as valuable as enterprise data, corporate users and vendors have both focused on developing technologies and approaches for capturing and integrating this big data into mainstream enterprise data infrastructures.

The development of big data technologies, in particular, presents new ways that such data can be brought onto centralised, high performance platforms to provide Total Data Management (TDM) across the entire enterprise.

Benefits of the TDM approach include the ability to achieve new levels of business intelligence (BI), data integration between big data, unstructured data and enterprise relational databases, customer intelligence, and the ability to achieve major performance improvements while reducing the amount of storage and hardware necessary throughout the organisation. This Computing report looks at how big data technology has developed into a means for organisations not only to manage big data itself, but also to achieve TDM through importing corporate unstructured and semi-structured data onto big data platforms. It examines the benefits to organisations of achieving this new level of TDM, and the challenges – both technical and organisational – that need to be overcome to achieve the benefits of big data.

Introduction

Very large volumes of variable data are being generated at dizzying speeds by customers interacting with corporate websites, monitoring and logging software and devices, trading platforms and many other pillars of the modern business environment.

The term “big data” has become a potent brand, with many vendors avidly beating the big data drum as a panacea to the problem of managing all non-structured corporate data, or even, in some cases, as an all-encompassing new paradigm for enterprise data, structured and non-structured. Big data technologies that have been developed to handle very large volumes of variable, fast-changing and non-schematic data also provide large organisations with an ideal platform for centralising and storing their unstructured and semi-structured data, yielding major advantages in scope of analytics, knowledge discovery, new opportunity identification, as well as providing high speed, high performance storage and retrieval which can help slash hardware costs.


Types of data

The “big” part of big data may be defined in terms of the Three Vs: volume, velocity and variety.

Volume – the quantity of data relative to the ability to store and manage it

Velocity – the speed of calculation needed to query the data relative to the rate of change of the data

Variety – a measure of the number of different formats the data exists in (eg text, audio, video, logs etc)

If any of these Vs is low, then it may be more efficient to analyse the data using traditional BI methods. However, if volume and required velocity are high, then big data techniques and technologies become more efficient and economical.

The data part of the big data equation can comprise almost any content: server logs; content captured from the web; mp3s; document scans; case notes; financial market data; banking transactions; audio and video streams; manufacturing process information; geophysical data; lab data; satellite telemetry; and so on. Again, the wider the range of formats, the more favourable big data techniques become when compared with relational database approaches.

For many people big data is synonymous with unstructured data, but as the above definition makes clear, this is an over-simplification.

For the sake of argument let us separate big data (rapidly changing, high in volume, wide in variety of formats) and corporate unstructured/semi-structured data, which may not be changing rapidly, but which also tends to exist outside of the enterprise database infrastructure.

Neither lends itself naturally to the relational database model, but big data technologies, with very fast and efficient storage and retrieval of very variable data types, together with tools that allow sophisticated high performance analysis of unstructured and semi-structured data held in big data databases, provide a platform for consolidating corporate unstructured and semi-structured data as well as big data.

With all manner of data integrated into a single platform, these technologies can provide organisations with unparalleled BI, competitive information and customer insights, including new knowledge about the markets that organisations operate in, emerging trends, and potential new business opportunities.

They can also yield valuable information about the state and long-term health of customer-facing business systems, and how vulnerable these may be to failure: an invaluable source of business continuity and risk management data. Finely grained organisational knowledge, as well as providing insight into the business, also has the potential for exposing and correcting any number of false assumptions, errors and miscalculations that may have found their way into the corporate strategy.

However, considerable challenges exist in integrating big data file systems. These obstacles include: a lack of big data integration skills and expertise, as it presents a very different paradigm to conventional enterprise database and programming systems; discovery and selection of appropriate file sets; and how to perform integration into big data file systems.

There can even be issues around pilot studies and testing, which can be very different with big data when compared with conventional enterprise computing.


Understanding the data

Commentators agree that the majority of corporate data exists not in structured formats (i.e. within corporate enterprise databases) but rather as non-structured files: either unstructured or semi-structured.

As early as 1998, financial business consultancy Merrill Lynch put the proportion of non-structured data within large enterprises and organisations at around 80 percent. Some put the figure even higher: Anant Jhingran of IBM Research said in 2011 that the figure was nearer to 85 percent. And according to IDC, the source of Merrill Lynch’s 80 percent figure was also an earlier study by IBM.

A 2006 study by the Data Warehousing Institute looked at the composition of data held within corporate and enterprise applications. It found that 47 percent of data within centralised corporate computing was structured, while 31 percent was unstructured and 27 percent was semi-structured. Given that corporate computing applications are largely enterprise database focused, while personal and work-group computing is far more likely to be dependent on unstructured data, such as documents, presentations, emails and spreadsheets, these figures in fact accord largely with the Merrill Lynch / IBM consensus. Semi-structured data, such as XML, which the DWI study places at 27 percent, is likely to trail off sharply once one moves outside the corporate computing environment and into workplace and business mobile computing – since XML and similar semi-structured data formats, including JSON and SQL files, are far more likely to be used by data processing professionals, for transferring data between database tables, or into enterprise content management systems, than by management or administrative workers.

Examples of unstructured data include word processing documents, PDFs, spreadsheets, emails, reports, presentations, images, videos, and plain text files. Unlike big data, these tend to be small in size, but highly dispersed throughout organisations, and amount to very large volumes of data in aggregate.

Unstructured data: word processing documents, PDFs, spreadsheets, emails, reports, presentations, images, videos, plain text files

Semi-structured data: JSON, XML

Structured data: SQL database records
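The three categories above can be made concrete with a small sketch. This is illustrative only: the record, field names and values are hypothetical, and the structured case uses an in-memory SQLite table to stand in for an enterprise database.

```python
import json
import sqlite3

# Unstructured: free text -- the meaning must be extracted by parsing or NLP.
unstructured = "Jane Doe emailed on 3 May to complain about invoice 1042."

# Semi-structured: JSON carries its own field names, but no fixed schema.
semi_structured = json.loads(
    '{"name": "Jane Doe", "contact": "email", "invoice": 1042}'
)

# Structured: a fixed relational schema enforced by the database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE complaints (name TEXT, contact TEXT, invoice INTEGER)")
db.execute(
    "INSERT INTO complaints VALUES (?, ?, ?)",
    (semi_structured["name"], semi_structured["contact"],
     semi_structured["invoice"]),
)
row = db.execute(
    "SELECT invoice FROM complaints WHERE name = ?", ("Jane Doe",)
).fetchone()
print(row[0])  # 1042
```

Note the trade-off: only the structured form can be queried directly with SQL, but it is also the form that discards the most context from the original text.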

Unstructured data tends to be heavily represented in the tools by which companies are managed and decisions taken. Spreadsheets and presentations are still the key management tools for corporate decision-making at all levels. In terms of BI, unstructured enterprise data is a goldmine waiting to be tapped. The challenge is capturing that data from its widely dispersed locations, and secondly rendering or processing it in a way that yields up organisational knowledge and BI.


It is worth adding that for many organisations, the unstructured data challenge does not just lie outside the realm of enterprise databases. Many large organisations hold vast amounts of unstructured data within enterprise database systems themselves, as Variable Character or Large Text fields, or as Binary Large Objects (BLOBs). Some enterprise databases can be optimised for full text search, while others lack advanced full text functions. But implementing sophisticated full text analytics on large enterprise databases can be both resource intensive and complex.

When asked to identify major users of unstructured data, the Computing survey respondents placed corporate, finance and HR first (Fig. 1). On this view, what is remarkable is how much core business data is unstructured. While much of the focus of big data often revolves around sales and IT, core business corporate functions – HR, finance and corporate – are major users of unstructured or semi-structured data.

Fig. 1: Which business functions would you identify as major users of unstructured data?

Corporate: 47%
Finance: 42%
HR: 40%
Marketing: 31%
Sales: 27%
IT: 22%
Other: 10%

*Respondents could select more than one answer.

Structured non-RDBMS data formats such as XML and JSON have become well established in the data centre and IT department, with nearly 60 percent of IT departments reporting their use. Outside IT, their use falls significantly to a third or below of most business functions (Fig. 2).


Fig. 2: Which business functions would you identify as major users of structured flat files (eg JSON, XML)?

IT: 57%
Finance: 33%
Corporate: 28%
Sales: 23%
Marketing: 22%
HR: 19%
Other: 7%

*Respondents could select more than one answer.

An interesting feature of the Computing survey respondents' estimates for unstructured and semi-structured data is how much closer they are to the 2006 Data Warehousing Institute study than to the more established figures publicised by Merrill Lynch, Aberdeen and other consultancy groups (see the Understanding the data section above).

The big data platform

If corporate unstructured and semi-structured data is generated largely within organisations, big data is driven by the increasing convergence of organisations with Web technologies. However, although the Web is responsible for generating much of the data that most organisations would consider big data, as well as developing most of the technology to deal with it, not all big data is Web generated. Many organisations, including those in finance, insurance and Government, face significant big data processing challenges that have little or nothing to do with Web use.

The combination of massive volumes and large variety of formats that characterises big data means that conventional databases can struggle to process information in a timely fashion. This has the crippling effect of greatly reducing the types of analysis that organisations can effectively run on very large bodies of data. Big data technologies such as Hadoop meet this challenge by supplementing conventional databases – which store the data – with technologies that distribute, or map, the processing across multiple servers, and then recombine, or reduce, the result set, and return it to the requesting application.
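The map/reduce pattern described above can be sketched in a few lines. This is a minimal single-machine illustration only: real Hadoop distributes the map step across many servers, and the log shards here are invented for the example.

```python
from collections import Counter
from functools import reduce

# "Map": each server counts words in its own shard of the log data.
shards = [
    "error timeout error",
    "login error timeout",
]
mapped = [Counter(shard.split()) for shard in shards]

# "Reduce": the partial counts are merged into a single result set,
# which is what gets returned to the requesting application.
totals = reduce(lambda a, b: a + b, mapped)
print(totals["error"])  # 3
```

The key property is that the map step needs no coordination between shards, which is what lets the work scale across servers; only the much smaller partial results need to be brought back together.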


Another approach does away with underlying SQL databases altogether, replacing them with databases that are capable of very rapid record storage and retrieval, and greater flexibility and efficiency in handling very different types of data. Both Google and Amazon have pursued this course in their web infrastructures, while Facebook and Twitter have infrastructures based more on mapping and reducing data derived from more conventional database designs.

This addresses both the second and the third of the ‘V’s: velocity and variety. Big data often accumulates at such a fast pace that the locking and validation procedures of conventional databases struggle to keep up. Dedicated NoSQL databases do much better at keeping pace with very fast storage and retrieval, at the price of less functionality. Instead, NoSQL databases replace conventional SQL querying and analysis with specialist analytics tools that retrieve data from the NoSQL database and then analyse it separately.
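The key-value trade-off described above can be sketched as follows. This is a toy in-memory stand-in for a NoSQL store, not any real product's API: writes are schema-free constant-time operations with no locking or validation, and any "analytics" has to be done by separate code scanning the values afterwards.

```python
store = {}  # hypothetical in-memory key-value store

def put(key, value):
    # No locking, schema validation or type checks -- this is the
    # source of both the speed and the reduced functionality.
    store[key] = value

def get(key):
    return store.get(key)

# Fast, schema-free writes of very differently shaped records:
put("user:1", {"name": "Ana", "logins": 42})
put("event:17", ["click", "2012-03-01T10:00:00Z"])
put("doc:9", "raw text blob")

# There is no query engine: analysis happens outside the store,
# by a separate pass over the stored values.
total_logins = sum(v.get("logins", 0) for v in store.values()
                   if isinstance(v, dict))
print(total_logins)  # 42
```

This mirrors the point in the text: SQL-style querying is traded away for raw storage and retrieval speed, and pushed out into specialist analytics tools.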

Such velocity implications attach not only to web data, but also to financial trading systems, banking transaction systems, scientific observational data, and military systems.

Big data systems also set out to preserve the variety and scope of source data. Where possible, according to big data advocates, keep everything. Converting data to be stored in a relational database often strips away vital information.

Dedicated NoSQL databases can also be designed for specific data types, such as XML, graphs, or documents. Their storage and retrieval can be orders of magnitude faster and more efficient, and they can retain an object's full information rather than stripping it down to fit an SQL schema. They can also be far more agile in extracting metadata from variable or differently sourced semi-organised data such as XML or JSON than relationally structured databases, which require a far more static and fixed data model for data extraction.

Benefits of the big data approach to TDM

Capturing and integrating corporate unstructured and semi-structured data into big data systems presents a number of significant organisational benefits. These include the ability to perform far more complex analytics; far faster data analysis, which in turn allows a considerable degree of dynamic BI to be extended to online and near real-time processes; faster data storage and retrieval; the ability to capture whole data; and efficiencies in storage caching – all of which can translate into considerable cost savings.

It becomes possible to integrate a large volume of highly diverse and widely distributed unstructured data files such as spreadsheets, documents, and presentations, together with corporate data into single enterprise file stores.

The benefits are not limited to better and more comprehensive corporate data analysis by taking in the totality of enterprise data, not just a fraction of it. Timeliness of querying could also be radically improved. Complex queries that in the past took hours to schedule and execute could in future be set up and run in seconds.

There is also a veracity pay-off that can extend right across the enterprise. Just as big data can enhance decision making at the corporate level by integrating crucial unstructured data into BI and other enterprise data systems, it also promises to improve decision making, planning and execution at the departmental and workgroup levels.

At the local level, most project management, planning and delivery is based on unstructured data files. And these files contain errors, questionable assumptions and misinformation. Big data systems can highlight and spot-check such errors and misinformation on the fly, greatly improving the timeliness and effectiveness of local programmes.


There are potentially significant benefits for corporate IT departments too. Storage and licensing costs can be sharply reduced, integrity of corporate data can be improved, and disaster recovery and data reliability can be simplified and made more cost-effective.

There is significant evidence that big data systems can even reduce the level of processing and memory requirements of server systems by a significant amount, which could have the result of lowering long-term organisational IT CAPEX costs.

Survey perspectives on big data benefits

and challenges

Efficiency savings are seen as the main benefit of integrating dispersed data into big data systems. However, BI, customer intelligence and improved corporate decision making are also viewed as significant benefits by most respondents, while improved data quality – also a significant corporate business benefit – is the other main benefit for large businesses (Fig. 3).

Fig. 3: What would you identify as potentially the main benefits of integrating disparate data via big data file systems?

Organisational benefits

Efficiency savings: 47%
Business intelligence: 37%
Data integrity: 30%
Customer intelligence: 28%
Improved corporate decision making: 26%
Better risk management: 23%
Greater agility: 19%
Opportunity identification: 12%
Compliance: 10%

Operational or technical benefits

Better data quality in general: 65%
Improved performance: 49%
Faster queries: 43%
Reduced storage: 35%
Reduced hardware requirement: 14%
None: 4%
Other: 4%

*Respondents could select more than one answer.

Looking at the main technical benefits of big data file systems, 65 percent identified better data quality as the principal benefit – a finding that is largely accordant with the findings of the previous question. A large number of respondents also looked to faster queries and improved performance, which are business user benefits, and are service-focused. Cost and reducing footprint, largely internal IT issues, are seen as significantly less relevant to big data. Above all, respondents are looking at big data as a way to improve service and data functionality to business end-users.

Potential hurdles and significant questions remain.

Setting up and running big data systems can present significant skills and knowledge challenges, as big data presents a very different paradigm to conventional enterprise relational database systems. Programming skills required for big data are also often quite different, centring on languages such as Perl and statistical analytical programming languages such as R and ECL. Expertise in areas such as multi-tiered SQL constructions is replaced by a need to understand efficient regex constructions and Levenshtein algorithms.
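To make the skills shift concrete, the Levenshtein edit distance mentioned above is the classic fuzzy-matching primitive for messy, variably spelled text data, where SQL work would lean on exact-match joins instead. A minimal pure-Python sketch of the standard dynamic-programming algorithm:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev holds the distances from a's processed prefix to every prefix of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion from a
                curr[j - 1] + 1,             # insertion into a
                prev[j - 1] + (ca != cb),    # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

In big data cleansing and record-linkage work, a function like this (usually a tuned C implementation rather than pure Python) is what lets two differently keyed records for the same customer be recognised as one.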

Some of these new languages can demand in-depth knowledge of data science and advanced maths. ECL, for example, is an extremely terse language that looks more like artificial intelligence code than the procedural programming languages most organisational IT staff are familiar with. On the other hand, it can be extremely efficient. One line of ECL is claimed to be equivalent programmatically to 120 lines of C++ code.

There can also be significant issues around selecting which unstructured file sets should be incorporated into enterprise big data systems, and how to prioritise the integration of unstructured data.

In addition to the ever present problem of budget constraints – likely to feature at the top of any list of barriers to IT project implementation – survey respondents see the main barriers to implementing big data as the variety of data structures and volume of files to be incorporated. Hence, it is the perceived technical difficulty of implementing big data that is seen as being the main challenge in big data projects (Fig. 4).


Fig. 4: What do you see as the challenges to implementing big data in your organisation?

Organisational challenges

Making the business case: 48%
Uncertainty about return on investment (ROI): 40%
Lack of or cost of staff resources required: 32%
No compelling need: 25%
Insufficient buy-in by the board: 20%
Departmental opposition: 18%
Cost of buying software: 16%
Cost of buying hardware: 11%
Other: 8%

Technical or operational challenges

Budget constraints: 59%
Variety of data structures: 53%
Volume of files to be incorporated: 42%
Skill levels: 28%
Rate of change of the data you need to process: 21%
Other: 8%


The main organisational challenge to implementing big data appears to be uncertainty over identifying and quantifying its benefits. Forty-eight percent identified difficulties in making the business case, while 40 percent found difficulty in quantifying the return on investment (ROI). In principle, these are closely related issues. They are almost certainly related to the way that respondents perceive big data as principally a business benefit to the organisation in a much wider sense, and to end-user departments. Since they see it as a top-line business benefit rather than a bottom-line one, and because the technology is relatively new, they find it much more difficult to quantify the likely measurable difference big data could make to their organisations.

Sought-after features in big data tools

When survey respondents were asked what benefits they would most wish for from a big data implementation, top of the list were:

Better data quality

‘A single version of the truth’

Eliminating confusion caused by multiple spreadsheet versions

Improved opportunity identification and decision support

Ease of use and the ability to integrate multiple data formats rapidly are the most looked-for features in big data systems, underlining how they are seen as a business and organisational benefit, and that there is some urgency to their implementation. Second to these are high performance and scalability (Fig. 5).

Technical issues and concerns about programming languages, APIs and other technical facets, are a much lesser concern.

Implementation and future trends

Big data is currently at an early stage of development as an enterprise technology. The technology originated within a number of the largest Web content providers, such as Amazon, Google and Yahoo, as a way of handling very large volumes of data. Cross-over to the enterprise computing world has been gradual to date.

Some notable figures within the big data world have bemoaned the slow uptake of big data within corporates and enterprise organisations. Amazon’s Werner Vogels, Chief Technology Officer of Amazon Web Services, told a conference in London in March 2012 that large enterprise organisations were lagging far behind small tech start-ups in their adoption of big data technologies.

“What we would consider a level playing field is actually tilting towards younger businesses,” said Vogels. “They are using big data much more rapidly at the core of their businesses to continuously improve their services, and enterprises are the ones who are having to catch up.”

The Computing survey shows that just 2 percent of large organisations are currently engaged in large scale roll-out of big data projects, and 61 percent of large organisations have not engaged with the issue at all yet (Fig. 6).


Fig. 5: What features would you look for in a big data platform?

Ability to integrate different data formats rapidly: 69%
Ease of use: 66%
High performance: 51%
Scalability: 49%
Low learning curve: 39%
Familiar or easy to learn querying language: 26%
Wide range of prog. language interfaces: 16%
Similarity to existing systems: 14%
Powerful programming functions: 14%
Other: 7%

*Respondents could select more than one answer.

Fig. 6: How far has your organisation engaged with big data?

No interest shown: 61%
Discussions about using it: 24%
Discussed and rejected its use: 1%
Planning and appraisal stage: 8%
Pilot projects: 4%
Large-scale roll-out: 2%


However, our survey also shows the likelihood of very rapid growth of live production big data implementations in the near future, with large scale roll-outs likely to double, if not triple.

Furthermore, there is a considerable funnel building up behind enterprise big data, with a doubling at almost every stage up the funnel, apart from at the mouth of the funnel, which shows a tripling of interest.

Of the 15 percent of large organisations in our survey who have moved beyond discussions to take active decisions about the use of big data in their organisations, only 1 percent of enterprises had rejected the use of big data.

In contrast, 38 percent of large organisations are either considering the use of big data, or are actively planning, piloting or implementing big data implementations.

Conclusion

Big data technology platforms offer enterprises considerable opportunities to integrate large volumes of unstructured and semi-structured data that presently lie outside the ambit of the enterprise IT infrastructure, towards the goal of Total Data Management. Integrating these dispersed sets of data onto big data platforms provides organisations with opportunities to gain new levels of BI, analytics, corporate risk management, and customer intelligence. It also provides scope to gain significant savings in dispersed and work-group computing hardware expenditure. Although big data is at an early stage in acceptance within enterprise computing, our survey shows that the number of installations is set to grow very rapidly, with the installed base likely to double, and then double again within the near future.

However, potential enterprise users see a number of issues with current big data technology, of which the chief are the difficulty in integrating different data formats rapidly onto a big data platform, and the general ease of use of big data technologies. In our survey, concerns over the ability to integrate different data formats rapidly and easily into big data systems, together with ease of use of these new technologies were the key concerns of enterprise IT staff by a clear margin.


About the sponsor, talend

Talend is one of the largest pure play vendors of open source software, offering a breadth of middleware solutions that address both data management and application integration needs. Companies are faced with exponential growth in the volume and heterogeneity of the data and applications they need to manage and control. A key challenge that IT departments face today is ensuring the consistency of their data and processes by using modelling tools, workflow management and storage, the foundations of data governance in any company today. This challenge is actually faced by organisations of all sizes – not only the largest corporations. In just a few years, Talend has become the recognized market leader in open source data management. Many large organisations around the globe use Talend’s products and services to optimise the costs of data integration, data quality, MDM and application integration. With an ever growing number of product downloads and paying customers, Talend offers the most widely used and deployed data management solutions in the world.

Contact Talend: www.talend.com
