The Big Data challenge - Widescale analysis of transcriptomics data using cloud computing metho

systems. It can be used as an interactive command-line system, where the user types one line of instructions at a time and waits for the resulting output, or it can be run by submitting a program or prepared sequence of commands, which are all executed with the output delivered as directed. RStudio is an integrated development environment for R that shows multiple windows simultaneously, for example an R script, an interactive console, output such as plots, help information about R commands, and a history of previous R commands. RStudio is therefore very useful for developing R scripts, for running them on different data, and for visualizing the data and computations being studied. In this work R has been used extensively to analyse microarrays and to visualize the collected results.

2.8.3 Bioconductor

Bioconductor is an open source and open development software project for the analysis of genome data (for example sequence, microarray, annotation and other data types). Many packages have been developed by a large number of researchers and shared through the Bioconductor website. affy, affycomp, and affyPLM are examples of Bioconductor packages available for use with Affymetrix microarray data. affy has been used in this work.

2.9 The Big Data challenge

In recent years the term big data has come to mean any data produced in such large quantities that processing it and extracting meaningful conclusions for action poses a significant challenge. Big data is sometimes described as having a number of V properties, for example:

• Volume: the vast quantity of new data being generated and stored each day.

• Velocity: the increasing speed at which data is generated and at which it can be processed.

• Variety: data can be structured or unstructured and can vary from text to geo-spatial form, or from tweets to photos and videos.

• Variability: the meaning of some data can vary depending on who collects it. Data can also vary depending on when it was collected, in the same sense that the meaning of words in a language can change over time.

• Veracity: data is only useful if it is accurate. Often data is found to be messy because of errors and inconsistencies within it.

• Visualization: after data has been processed, it needs to be presented in a readable and accessible way. New visualization packages are being developed every year.

• Value: data is only as valuable as the accurate insights and information that it provides.

Big data can be hugely valuable, but only with the analysis tools that unlock its information.

The European Bioinformatics Institute (EBI) in Hinxton, UK, stores over 20 petabytes (1 petabyte is $10^{15}$ bytes) of data and backups about genes, proteins and small molecules, according to an article by Vivien Marx in the June 2013 edition of Nature [13]. The data are used heavily by scientists around the world who are working in both academia and in industry. The nucleotide sequence databases “have a doubling time of less than one year” according to the EMBL-EBI annual report of 2012-2013 [14].
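To give a sense of what a doubling time of one year implies, the following minimal Python sketch projects the size of an archive under exponential growth. The function name `projected_size`, the 1 petabyte starting size and the fixed one-year doubling time are illustrative assumptions, not figures from the EBI reports; only the definition of a petabyte as $10^{15}$ bytes is taken from the text.

```python
PETABYTE = 10 ** 15  # bytes, as defined in the text


def projected_size(initial_size, doubling_time_years, elapsed_years):
    """Size after `elapsed_years` of growth with a constant doubling time.

    Implements size(t) = initial_size * 2 ** (t / doubling_time).
    """
    return initial_size * 2 ** (elapsed_years / doubling_time_years)


if __name__ == "__main__":
    # Hypothetical archive: starts at 1 PB and doubles every year.
    start = 1 * PETABYTE
    for years in range(6):
        size = projected_size(start, 1.0, years)
        print(f"after {years} years: {size / PETABYTE:.0f} PB")
```

Under these assumed values the archive grows thirty-two-fold in five years, which is why storage and analysis capacity have to be planned for growth rather than for the current volume alone.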

Breakthroughs are being made in many areas because of the large amount of research data that is publicly available at data centres such as the EBI. In the area of computational biology and bioinformatics there have been further major successes in the ENCODE project (ENCyclopaedia Of DNA Elements), which was planned as a follow-up to the Human Genome Project. One example is the production of a detailed map of human genome function [15]. A major analysis of the gut metagenome was performed by the Structural and Computational Biology Unit at EMBL Heidelberg, in which more than ten million mutations in the bacterial strains in the gut of 207 individuals were identified [16]. Another group devised a method to store information in synthetic DNA, which might provide the technology for long-term storage of infrequently accessed or archive data [17]. The BGI (formerly the Beijing Genomics Institute) in Shenzhen, China, is the world’s largest genetic research centre. It generates at least a quarter of the world’s genomic data, having 178 machines [18] (in January 2014) to sequence the genomes of samples from people, plants, animals and microbes.

With so much data being generated on the planet, there is much scope for computational biologists to make discoveries using other people’s data. Much data sits “under-analysed in databases all over the world” says Marcie McClure, a computational biologist at Montana State University in Bozeman [13]. McClure and her team have discovered eleven new fish retroviruses by analysing genomes computationally. Some of the approaches and tools of bioinformatics will be outlined in the next section.

The challenge that big data presents is how to analyse it efficiently in the different fields of research that it can benefit. Historically, most researchers have downloaded data to their local machines for analysis. But this method is “backward” according to Andreas Sundquist, chief technology officer of DNAnexus [13], because “the data are so much larger than the tools, it makes no sense to be doing that.” The alternative is to use a grid or a cloud for both the storage and the analysis of the data. The time and cost of accessing large amounts of data for computation should be reduced when the data resides ’near’ the compute facility. The proximity of data to the CPUs of a cloud will depend on how the cloud provider has organized its data centres and back-up resources. Cloud prices for data storage, data access and computation tend to reflect a provider’s methodology or business model. Grid computing and cloud computing are discussed further in chapters 4 and 5, and a brief introduction to the concepts of both is given in section 2.10.
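The remark that the data are much larger than the tools can be made concrete with a rough transfer-time estimate. The Python sketch below compares moving a dataset to the analysis software with moving the software to the data. The 1 terabyte dataset, the 100 megabyte tool and the 100 Mbit/s link are hypothetical round numbers chosen for illustration only, and the calculation ignores protocol overhead, latency and contention.

```python
def transfer_time_hours(size_bytes, bandwidth_bits_per_second):
    """Idealized time in hours to move `size_bytes` over a link.

    Assumes the full nominal bandwidth is available for the whole transfer.
    """
    seconds = size_bytes * 8 / bandwidth_bits_per_second
    return seconds / 3600


if __name__ == "__main__":
    # All three values are assumptions, not measurements from this work.
    dataset_bytes = 10 ** 12      # hypothetical 1 TB dataset
    tool_bytes = 100 * 10 ** 6    # hypothetical 100 MB analysis tool
    link_bps = 100 * 10 ** 6      # hypothetical 100 Mbit/s connection

    data_hours = transfer_time_hours(dataset_bytes, link_bps)
    tool_hours = transfer_time_hours(tool_bytes, link_bps)
    print(f"moving the data: {data_hours:.1f} hours")
    print(f"moving the tool: {tool_hours * 3600:.0f} seconds")
```

With these assumed figures the dataset takes most of a day to download while the tool takes seconds to upload, which is the asymmetry behind running the analysis where the data already resides.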

In document Widescale analysis of transcriptomics data using cloud computing methods (Page 44-47)