THE BIG DATA DILEMMA
Memorandum from RCUK in response to the House of Commons Science and Technology Committee inquiry examining the opportunities and risks of 'big data'.
1. Research Councils UK (RCUK) is the strategic partnership of the UK’s seven Research Councils. Our collective ambition is to ensure the UK remains the best place in the world to do research, innovate and grow business. The Research Councils are central to delivering
research and innovation for economic growth and societal impact. Together, we invest £3 billion in research each year, covering all disciplines and sectors, to meet tomorrow’s challenges today. Our investments create new knowledge through: funding research excellence; responding to society’s challenges; developing skills, leadership and
infrastructure; and leading the UK’s research direction. We drive innovation through: creating environments and brokering partnerships; co-delivering research and innovation with over 2,500 businesses, 1,000 of which are SMEs; and providing intelligence for policy making. Find out more about our work at www.rcuk.ac.uk
2. This evidence is submitted by RCUK and represents its independent views. It does not include, nor necessarily reflect, the views of the Knowledge and Innovation Group in the Department for Business, Innovation and Skills (BIS). The submission is made on behalf of the following Research Councils:
Arts and Humanities Research Council (AHRC)
Biotechnology and Biological Sciences Research Council (BBSRC)
Engineering and Physical Sciences Research Council (EPSRC)
Economic and Social Research Council (ESRC)
Medical Research Council (MRC)
Natural Environment Research Council (NERC)
Science and Technology Facilities Council (STFC)
Executive Summary
3. New and pervasive computational and internet technology, sensors and mobile devices have created a data deluge. The Big Data challenge is about gaining value and insights from data, across a range of sectors, in innovative and beneficial ways. The data might be massive, varied, complex, combined, or fast-moving, for example from sensors connected to the Internet of Things (IoT) or unstructured media data. There has been a major change in many disciplines, ranging from history and physics to biology, due to the availability of huge
amounts of research data and the development of internet-enabled data sharing. New digital technologies revolutionise our ability to analyse, compare, cross-reference and draw
conclusions from data. This data-driven transformation in research now requires future
researchers across all disciplines to be part of a research system that creates and rewards the skills in capturing, managing and analysing data that are now needed.
4. Big Data approaches allow new questions to be asked, provide opportunities for data to be exploited across many sectors, enable the improvement of services, can increase productivity and allow the commercialisation of very large datasets, such as access to and manipulation of multimedia collections in the creative economy. Four elements put the UK in a globally strong position to help entrepreneurs, business and society:
• Our open research approach
• Our well-connected e-Infrastructure, supporting research across disciplines, as delivered through recent sizeable investments
• Our strong position in data analytics, bolstered by the establishment of the Alan Turing Institute
5. Despite these strengths, the UK will fail to capitalise on its globally strong position unless current data, analytics and expertise silos are broken down and a cohesive approach to data-driven research is established. This includes addressing significant research challenges limiting the exploitation of data, concerning privacy, security, public trust, informed consent for data reuse and legislation. As data volumes and variety continue to grow, new research issues
associated with simply managing the data being produced are also arising. The UK risks being left behind if we do not invest in more research and infrastructure in these areas, and in the exploitation of data analytics and infrastructure more generally, as well as ensuring there are sufficient high quality skilled people so the UK can make the next big breakthrough.
6. Our joint response covers the opportunities that have arisen as all disciplines and sectors become data rich and the challenges this raises, both in terms of addressing outstanding questions and providing the researchers and practitioners of data science that are needed.
Q1. The opportunities for big data, and the risks:
7. What constitutes ‘Big Data’ varies between disciplines and sectors. It goes beyond the extremely large and complex datasets generated by, for example, the Large Hadron Collider, DNA sequencing, Earth observation, government records and transactions, and commercial or online interactions to include data from new technologies. Smaller-scale data of high complexity and variability, for example from environmental monitoring and the Internet of Things (IoT), where sensors capture and process large amounts of fast-moving (and often personal) data, are one such case. Regardless of the source, Big Data is about gaining value and insights from extremely large, complex, fast-moving or combined data, across a range of sectors in innovative and beneficial ways. New digital technology revolutionises our ability to analyse, compare, cross-reference and draw useful conclusions from data.
8. UK innovators across all sectors (including the creative and cultural economy) and all types of organisation can benefit from techniques to analyse, cross-reference and combine multiple, disparate data sets from a variety of sources, which will be developed through Big Data-related research across all disciplines. These all draw on advanced mathematical and statistical methods. Using artificial intelligence, visualisations, and semantic technologies, we can gain new insights, make new predictions, and add statistically valid evidence to existing hypotheses. Realising these opportunities requires new technology, new methodologies, and new thinking around issues of data privacy, security and informed consent for data use.
9. The UK has a world-leading open research agenda and RCUK has clear principles for open data and data sharing. The combination of computational power, digitisation of research data, data sharing and powerful data analytics is opening up new research paradigms across all disciplines. The vision of the Square Kilometre Array (SKA) and the Farr Institute are examples of this revolution in research. The Research Councils are cognisant that this data-driven transformation in research now requires future researchers across all disciplines to be part of a research system that creates and rewards the skills in capturing, managing and analysing data that are now needed.
10. Sophisticated data analytics can help UK entrepreneurs and business to: identify areas of opportunity for innovation in new products, processes and services; improve customer engagement; identify inefficiencies; improve productivity; identify market trends; and use the UK Government’s Open Data (data.gov.uk) to innovate and to create new companies.
11. The insights that Big Data research can provide offer opportunities to improve government and services across security and defence, the NHS, the Met Office, the National Archives, the British Library and HMRC. Big Data can provide evidence and insight to fuel high quality service design by combining and analysing a multitude of data streams.
Government can gain better information and insights that can help policymakers solve large economic, social and environmental challenges, address inefficiencies in public services, and enable new ways to deliver solutions that were previously impractical or even impossible.
12. The third sector can gain better understanding of their volunteers, users and customers, as well as improve their effectiveness and efficiency, and strengthen their competitiveness by informing public choice and driving innovation.
13. Researchers can gain new research insights, collaborate more effectively in important areas and advance new technologies and methodologies that rely on the outputs of Big Data Research, for example, in advanced materials, biotechnology, drug discovery, robotics and autonomous systems, imaging, design, linguistics, impacts of climate change on the urban environment, and human health.
14. While there is considerable potential for new discoveries and increased business and policy impact through the analysis of multiple, unstructured datasets, there are also considerable challenges around managing the ever-increasing volume of research data. Dealing with large amounts of data on the petabyte+ scale requires a different set of data management tools and analysis techniques and new methods of engagement and mediation for that data. Design has a key role to play and UK expertise offers a major opportunity to help inform thinking.
15. Progress in many key areas of research with huge potential impact is increasingly limited by the capabilities of both the large national facilities and academic communities to effectively exploit big and complex data. For example, potential step changes in molecular understanding of cancer through STFC CLF's OCTOPUS facility are limited by our current ability to integrate and analyse the large and complex biological image data obtained.
16. One of the main issues for the wider social and community engagement with Big Data is that it seems remote, threatening and dehumanizing. Predictive policing or the use of predictive analytics for insurance purposes can provoke a strong reaction from the public. Community projects that use big data methods to explore local or family history, or support care of the elderly, or relate big data to local and community facilities, can build a wider understanding of the potential and issues surrounding big data in society. Such projects can demonstrate the structure and issues of data and will encourage wider understanding about issues concerning the collection, use and sale of data.
17. It will be important to understand the level of risk, particularly concerning privacy, security and trust, and to develop the appropriate responses such as proportionate legal and ethical
frameworks. It is imperative that we address questions of building trust into our systems and processes, ensuring that our digital technologies and services are reliable, robust and secure, including in a post-quantum world, and making access and use of the data generated by these technologies and services ethical and secure.
18. The risks to personal privacy and security need to be carefully considered. One potential consequence of data analysis is that an individual’s identity may be deduced by
combining independently anonymous data sets. Much of the value of these datasets lies in the potential to link, amalgamate and calibrate them with other datasets. However, gaining
genuinely informed consent for each individual use is likely to be difficult or even impossible in some cases. We need to explore how the law deals with issues of ownership and consent around data. Mitigating these risks requires, among other things:
• an emphasis on research, training and appropriate accreditation for investments and experts;
• ensuring infrastructure for safe and secure data access and use;
• continuous questioning and challenging of power, ownership, focus, use and governance;
• public engagement to outline the benefits of these data to research and the safeguards in place to avoid inappropriate access and use.
19. These are fundamental questions across the entire digital economy. These systems must ensure that data or results (which may be personal or sensitive in nature or which in its analysed or aggregated form might be easily exploitable for nefarious purposes) are able to be accessed, handled and utilised in a secure, reliable, legal, ethical and responsible manner. There are many technical, cultural, social and ethical areas which need addressing to allow the full exploitation of Big Data (whatever its source), such as secure programming, network security, and embedded system security. Two multidisciplinary cross-Research Council
activities involving AHRC, ESRC and EPSRC (the Partnership for Conflict, Crime and Security Research and the Digital Economy Theme), and a cross-EPSRC Advisory Group workshop, have identified these areas as priorities going forward. The DE Theme, PaCCS and other funded research are also exploring issues of legislation and governance, ethics and rights, identity management, ideologies and beliefs, conflict and behaviour etc. Much work has already been done in partnership with the security agencies, but more needs to be done. The Strategic Defence and Spending Review will play an important part in mapping out the relevant landscape and the role of agencies such as GCHQ, CPNI, Dstl etc.
20. The data requirements here are such that this cannot be left to private sector companies alone, noting that citizens and organisations need to be confident that their personal and confidential data is handled responsibly, and that they trust their interactions with others to be genuine.
21. There is a risk of the UK being left behind by international competitors and of insufficient absorptive capacity in UK businesses (there is already evidence of skills shortages which are set to increase). Upskilling industrial sectors and research disciplines, addressing key areas where data science is vital, and exploring new opportunities to exploit Big Data, are all at risk if there is not continued and significant investment in research and research training in the domains which contribute to new tools and techniques in Big Data.
22. There is the risk that intellectual property rights issues and legislation do not keep pace with the fast-moving nature of technology. These issues affect both ‘born digital’ data and relevant digitized historic data, which may have different legislative requirements. Finally, data
preservation, sustainability and access face major challenges that may vary widely depending on the sources and nature of the data.
Q2. Whether the Government has set out an appropriate and up-to-date path for the
continued evolution of big data and the technologies required to support it
23. Government investment aligned with the Big Data “Great Technology” has played a large role, adding to the infrastructure funding of data-related activities. This includes funding to establish the ESRC’s Big Data Network, and to the AHRC to support the Digital
Transformations theme and cross-Research Council Connected Communities programme. The MRC and partners have invested over £100m in the past two years to establish the necessary infrastructure and capabilities to underpin UK health and medical bioinformatics research, including the establishment of the Farr Institute of Health Informatics Research and supporting six major Medical Bioinformatics awards. BBSRC supported other data
24. The Science & Innovation Strategy and Capital Funding Roadmap, published in December 2014, set out plans for Big Data investment in National E-Infrastructure, Longitudinal Studies, and an International Centre for New Forms of Data. These investments have yet to come to fruition, but they are widely considered to be crucial to driving forward the UK’s big data capacity and capability. RCUK is keen to maintain momentum in these areas with a focus on effecting a culture change in the research data value chain.
25. The UK is leading internationally on opening access to research output. The G8 science ministers’ statement in London in June 2013 marked a commitment to openness in research as a means to speed up discovery and create innovation that is being pursued through UK engagement with international bodies such as CODATA, Science Europe and the Research Data Alliance where the UK is seen as a prime mover of the agenda.
26. The establishment of the Alan Turing Institute, a £67M joint venture between EPSRC and five leading UK Universities (Cambridge, Edinburgh, Oxford, UCL and Warwick), was announced in the March 2014 budget. This will be a centre of excellence for research, discovery,
development, training and translation of next generation data science methods. It is critical that there continues to be substantial investment in the range of research and research training that expands the provision to deliver progress in this area and the trained people needed.
27. The Government announced £75M of funding from the Large Facilities Capital Fund (LFCF) in December 2011 to build on UK leadership in bioinformatics and Big Data, and underpin the UK’s Data Capability Strategy. Adding to this LFCF support, BBSRC, MRC and NERC have invested in ELIXIR, a major EU Research Infrastructure project providing a sustainable European Infrastructure for biological information, supporting life science research and its translation to society, bioindustries, environment and medicine. ELIXIR will manage and safeguard the data generated by publicly funded research. NERC has invested in JASMIN, which provides petascale storage and cloud computing for environmental science, building on long-term investment in managing environmental data to enable the exploitation of exploding volumes of earth observation and climate data.
28. Large-scale infrastructure to support research communities which collect, manage and distribute large volumes of data is established within fields such as Particle Physics,
Environmental Sciences, and Bioinformatics. Initiatives such as UK-T0 are seeking to federate and develop infrastructure across the research sector. Emerging scientific programmes such as the SKA, the European XFEL and the ITER Fusion experiment will be drivers for the development of the next generation of Big Data science infrastructure and the analysis to exploit that data.
29. However, the full potential of Big Data will not be realised without further advancements in the state of the art in tools and techniques for Big Data analysis. For instance, advances in
computer science, statistics and analytical mathematical techniques alongside close interactions with other disciplines are essential to fully utilise the rich mine of information in data sets as diverse as GP notes and the outputs from Genomics England, through to image and film archives, and parliamentary records data.
Q3. Where gaps persist in the skills needed to take advantage of the opportunities,
and be protected from the risks, and how these gaps can be filled?
30. A major and critical gap exists in the availability of skilled people, both at the PhD and graduate level, needed to better exploit data. The Research Councils recognise the need to transform the UK’s training capacity and take action. It is vital to improve analytical and
quantitative skills across all disciplines, and to build on existing initiatives designed to boost mathematical and statistical skills so that the research community is upskilled to be competent data researchers. The UK also needs the skills and techniques to build and manage large-scale infrastructure to handle and distribute the volumes of data. Two areas, below, are relevant across the entire space:
• Big Data Research: the tools, techniques and understanding to generate actionable information from large or diverse data sets.
• Data Literacy: to ensure secure, reliable, legal, ethical and responsible handling and exploitation of data, especially where data are personal or sensitive in nature, or the results of its analysis and linkage might be easily exploitable.
31. The Research Councils are making efforts to boost the supply of highly skilled doctoral level people through support of activities such as Centres for Doctoral Training (CDT). As
examples, EPSRC supports 8 CDTs which address some of the key themes in this domain, NERC has a CDT in Risk and Mitigation using Big Data, and ESRC will commission a new CDT aimed at developing capacity in new and novel forms of data, and the methodological challenges that they present.
32. The Research Councils also support doctoral training through Doctoral Training Partnerships (DTP). In recognition of the growing importance of Big Data and data science research,
BBSRC has mandated that each of its DTPs include an element of statistical, mathematical or computational training. Furthermore, to increase understanding of the issues of skills and training requirements in these key areas, BBSRC convened an Expert Working Group to look at People and Skills issues, with an emphasis on data intensive bioscience. The group made a number of recommendations which were published in their report1.
33. The ESRC, partnering with the Nuffield Foundation and HEFCE, funds the Q-STEP centres, a network of 15 centres supporting the development and delivery of specialist undergraduate programmes in quantitative social science areas. The ESRC is in the process of
commissioning a new round of Methodological Research Projects under the National Centre for Research Methods (NCRM), one of which will be focused on analysis of online, digital and big data. The Hartree Centre, a joint investment of £313M at STFC Daresbury Laboratory, is a centre of excellence in applying High Performance Computing, Big Data Analysis and Visualisation to solve real-world problems and support businesses in this area; it runs a training programme which includes a Big Data Summer School.
34. However, demand is still outstripping supply. We need to both up-skill our existing cohort of researchers and develop the next generation of researchers. Similarly, outside of academic research there is a pressing shortage of skilled practitioners of data science. Addressing this is critical to give the UK traction on problems across a range of sectors and application domains. The current state of the art, while powerful, struggles to exploit data sets fully because of their scale, diversity, unstructured nature, complexity, incompleteness, and perhaps most
important, the dynamic nature of the data sets concerned (where data sets can be rapidly changing or a decision is needed rapidly). There is a real need for the UK to build capacity and address methodological development in data science. The UK’s capability in this area is at risk if there is not continued and significant investment in research and research training, both in the domains upon which these techniques are built (such as natural language processing, machine learning, digital signal processing, human computer interaction, formal verification, design, advanced quantitative, interdisciplinary and mixed method approaches etc.) and in the intersection of big data research with other disciplines.
35. There are insufficient skilled people working in interdisciplinary teams able to tackle some of the big problems in this area. The skills required include awareness of the need for: new algorithms; an improvement in predictive power; new forms of visualisation and
communication of information; new forms of data harvesting and ingestion; new approaches to sentiment analysis; exploiting multi-media and multilingual unstructured datasets; the legal and ethical issues surrounding big data; and new business models.
Q4. How public understanding of the opportunities, implications and the skills
required can be improved, and ‘informed consent’ secured
36. There is a need for more research to answer big questions around how citizens understand what ubiquitous technologies are being operated around them and provide appropriately informed consent. The December 2014 Blackett Review “The Internet of Things: making the most of the Second Digital Revolution” noted a fundamental and underpinning issue was the need for research to ensure that appropriate security, trust, ethics and privacy are designed and implemented from the beginning. Likewise, the Health and Biomedical Informatics Research Strategy for Scotland (produced by the Scottish government with input from Farr Scotland and Andrew Morris, Chief Scientist at the Scottish Government Health Directorate) recommended more training in the health Big Data area and work on trust and acceptability in the use of patient data. This big issue is broader than just health data (social care data, pensions data, insurance data etc.) and encompasses a multitude of research areas.
37. More interdisciplinary research, legislation and public engagement are needed in five broad areas.
(1) Privacy preserving design and consenting to ubiquitous computing – there is a need
for algorithms and techniques so that privacy preserving techniques cannot be undone. How can we help people understand how their privacy might be affected/secured?
How can design be used to allow individuals to understand what data are being shared and when their data are being collected to allow for a more informed system of consent?
(2) Systems and their ecosystems – Human Data Interaction (HDI) frameworks; citizens
need to understand, and to feel safe and secure with, how their data is being used (design principles can be used to make data gathering and reuse clearer and the benefits of big data approaches more relatable, thus increasing participation).
(3) Regulation and internationalisation – How are trust in, and the economic impact of,
cyber-systems affected by regulation and governance, ranging from consumer protection measures to those enabling surveillance by states?
(4) Legal and ethical frameworks – How can public perceptions of privacy be built upon in
the development of future legal, corporate and governmental frameworks?
(5) Establishing a social contract – Clear and transparent public dialogue, which
illustrates not only the risks but also the societal benefits of data-led research innovation, will be crucial in gaining the public’s ongoing trust and support.
38. The Research Councils are starting to address some of these issues in a more holistic way following the March 2015 budget announcement of £40M for an integrated IoT research, demonstrator and incubator activity spanning the Research Councils, Innovate UK, Digital and Future Cities Catapults and NHS England. A call for proposals issued by the EPSRC through the RCUK Digital Economy Theme to establish the £9.8M Research Hub covered several aspects of IoT and data with a strong people focus.
39. EPSRC has also suggested that adoption of health sensor technologies and public acceptability could be demonstrated through everyday life scenarios to enable a more comprehensive understanding of the key issues around how the public perceive this sort of technology and its risks once it is deployed. Research is needed to address the ability to access and exercise control over one’s own data, build upon public perceptions of privacy, and to look at the future legal, corporate and governmental frameworks needed to ensure
individuals continue to have access to and control over their own data. One way in which to explore public acceptance is through community engagement with the research process. For example, the AHRC-funded Secret Life of a Weather Datum2 project has identified how there are different spaces of openness and engagement within the processing of big data, and has emphasized the importance of community involvement in collecting and sharing weather data, in turn offering a greater awareness of big data issues to members of the public.
40. Internationally, many discussions are underway around how to better understand, govern and communicate ethical issues, tools, principles and best practice. Examples include Science Europe, the OECD Global Science Forum expert group on ‘Ethics of New Forms of Data’ and the Expert Advisory Group on Data Access (EAGDA). The Research Councils (in particular the ESRC and MRC) are heavily engaged in these activities at both a strategic level and through their investments.
41. Informed consent has been the cornerstone of ethical research involving people; however, as already noted, gaining genuinely informed consent from individuals relating to each individual data point is difficult if not impossible. Is it logistically possible for the individuals to whom the data pertains to be contacted each time it is used? What would the cost and time implications be? What happens to informed consent if/when that data is linked with other data to create new datasets which nobody agreed to being shared? How can researchers control whether (and how) informed consent is obtained if they aren’t involved in the process of primary data collection? While it is important that individuals are comfortable with their data being used for research, there is an argument to be made that traditional models of consent are not fit for purpose in an age where we are trying to maximise the research potential of data.
42. Buy-in from the public is key to ensuring the benefits (individual or wider) of using their data for research purposes are communicated effectively, and this starts with building a culture of trust, understanding and openness between participants (or potential participants) and researchers. Proper accreditation, kite marking and assured training can begin to give the public confidence, and RCUK investments such as the ADRN and Business and Local Government Data Research Centres are well-placed to lead, or to demonstrate best practice.
Q5. Any further support needed from Government to facilitate R&D on big data,
including to secure the required capital investment in big data research facilities
and for their ongoing operation
43. Big Data presents a major challenge and opportunity for the UK, but more Government support is needed for the provision of skilled people and for research both on Big Data analytics and on creating secure, reliable, ethical and responsible methods of handling data. This is important to ensure that the UK can remain world-leading in this area, in terms of research, delivery of public services and business. There also must be appropriate and commensurate resource investment which enables the people we are developing to conduct research with impact using big data in order to ensure that the infrastructures we are investing in are exploited to their greatest benefit (e.g. ESRC Secondary Data Analysis Initiative). Without appropriate resource funding, investment in data infrastructures will not be properly exploited.
44. There are many initiatives already underway which will help maintain UK leadership in this area. To maximise impact there is a need to co-ordinate between these investments,
particularly following the £189M infrastructure/capital funding started in 2012, together with the Alan Turing Institute and other relevant investments, such as the current £40M call for an integrated IoT Research Hub, Demonstrators and Incubator, the ESRC’s Big Data Network,
the Open Data Institute, the Farr Institute and Data Science activities within universities (e.g. Imperial College’s Data Science Centre).
45. A policy and infrastructure for archiving digital data is essential to the effective use of big data. UK leadership spans several domains, such as NERC’s network of environmental data
centres and in social sciences through the UK Data Service. There is a need for consistent standards for data discovery and, given the likely future higher cost of storage and the
technical demands it will impose, a co-ordinated infrastructure of national services that provide access to research data should be developed. Government infrastructure investment should take account of the role of libraries and archives in preserving data of all types. Government needs to be more actively engaged with European initiatives and networks in this area. The fundamental infrastructure need remains network connectivity and this requires continual review and upgrading so that it can meet the growing demand on it.
46. Government support provided to the research communities for establishing their computing infrastructure has placed the UK as a leader in this field. The UK has utilised this infrastructure to link, through the Hartree Centre, industry to significant High Performance Computing power. However, the global trend is a move to larger datasets and more computationally intensive processing of data. Projects such as the SKA, LHC, European XFEL, ITER and new
opportunities for high resolution brain imaging exemplify this trend. Today there is not enough resource for one of these projects in the UK. There has been a clear acceleration in Big Data infrastructure requirements as a result of strong innovation in technology for the public and private sectors. The upgrade path for national facilities needs to be revisited based on this new information if the UK is to remain a global leader: the hardware is the engine of Big Data.
47. Focusing too strongly on capital investments alone risks neglecting the need for recurrent investments in power, networks, software, operations and people. The real challenges lie in ensuring the pipeline of high quality skilled people trained in all aspects of data science is maximised. Ideally, there need to be sufficient numbers of people with mathematical and computational grounding within all disciplines. These people must be of sufficiently high quality to be utilised by businesses, government, the third sector, users and in academia in this rapidly developing field.