Development of an integrated omics in silico workflow and its application for studying bacteria-phage interactions in a model microbial community


PhD-FSTC-2017-06

The Faculty of Sciences, Technology and Communication

DISSERTATION

Defence held on 20/01/2017 in Luxembourg

to obtain the degree of

DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG

EN BIOLOGIE

by

Shaman NARAYANASAMY

Born on 13 January 1985 in Klang (Malaysia)

DEVELOPMENT OF AN INTEGRATED OMICS IN SILICO WORKFLOW AND ITS APPLICATION FOR STUDYING BACTERIA-PHAGE INTERACTIONS IN A MODEL MICROBIAL COMMUNITY

Dissertation defence committee

Dr. Paul Wilmes, dissertation supervisor
Assistant Professor, Université du Luxembourg

Dr. Reinhard Schneider, Chairman
Université du Luxembourg

Dr. Jorge Gonçalves, Vice Chairman
Professor, Université du Luxembourg

Dr. Anders Andersson
Associate Professor, KTH Royal Institute of Technology

Dr. Rohan Williams
National University of Singapore


Development of an integrated omics in silico workflow and its application for studying bacteria-phage interactions in a model microbial community

A dissertation by

Shaman Narayanasamy

Completed in the

Eco-Systems Biology Group, Luxembourg Centre for Systems Biomedicine

To obtain the degree of

DOCTEUR DE L’UNIVERSITÉ DU LUXEMBOURG

EN BIOLOGIE

Dissertation Defence Committee:

Supervisor: Asst. Prof. Dr. Paul Wilmes

Chair of committee: Dr. Reinhard Schneider

Vice chair of committee: Prof. Dr. Jorge Gonçalves

Committee members: Asst. Prof. Dr. Anders Andersson, Dr. Rohan Williams


For my parents. For my sisters.


Declaration

I hereby declare that this dissertation has been written only by the undersigned and without any assistance from third parties. Furthermore, I confirm that no sources have been used in the preparation of this thesis other than those indicated herein.

Shaman Narayanasamy
Belval, Luxembourg
February 10, 2017


ACKNOWLEDGEMENTS

I would first and foremost like to thank my direct supervisor Asst. Prof. Dr. Paul Wilmes for providing me with the opportunity to work under his supervision and in collaboration with his esteemed Eco-Systems Biology group. This work would not have been possible without your wholehearted guidance and advice. I’ve learned a great deal about my topic throughout my time with you. We’ve managed to push the limits of our work to a level beyond my expectations. Thanks for allowing not only me, but every single member of the group, to formulate and expand upon our own ideas. Also, thanks for providing us with the best possible working environment and infrastructure to conduct our work! In my humble opinion, your open-mindedness was the foundation that made this group one of the top groups in this field of study. This did not come easily, though. I still remember the days when we as a group had difficulties getting our papers accepted. Now we realize that it was simply because we were ahead of the rest. And this is thanks to your vision, which has been cultivated within me and, I’m sure, within the rest of the group. One thing I have always had the utmost confidence in is that if I put in the effort, you will always be there to back me up. Thanks for always having my back! I sincerely hope that my contribution represents all that you and the group stand for, and that it will accelerate our work to further push the limits of what is possible, or rather achieve the IMPossible.

Asst. Prof. Dr. Anders Andersson and Dr. Rohan Williams, thanks for taking time out of your schedules to take part in my PhD defence. It is a great honour to have you both on my jury. I would like to thank my CET committee members Prof. Dr. Jorge Gonçalves and Dr. Reinhard Schneider, who were always supportive of my work, for their guidance throughout my PhD studies. I am also honoured to have you both as chairpersons of my PhD defence. Thanks to Prof. Dr. Serge Haan and his team in the doctoral school for always doing their best in supporting the PhD students. Thank you to both the LCSB and the University of Luxembourg support staff for their help. Special thanks to the institute director, Prof. Dr. Rudi Balling, for being the pillar of this institute and making it grow into what it is today. Your enthusiasm has always been and will always be inspiring and infectious!

My fellow ESBers, past and present, I would like to first thank all of you for creating an awesome working environment. Dr. Emilie Muller, there are no words that could express my gratitude towards you. This work would not have been possible without your guidance and support. It was the utmost joy working with you. You’re the best! Dr. Anna Heintz-Buschart, thanks very much for always supporting us PhD students. Your problem-solving abilities are second to none! Laura Lebrun, thank you so much for your tireless and consistent effort in obtaining our samples and providing support in the lab. This work would definitely not have been possible without you. Malte Herold, thanks for supporting me in my work. Please carry the torch for the LAOers! Anne Kaysen, thanks for helping me along in my work and for being the number one user of IMP! Linda Wampach, thanks for all the graphical skills you brought to the group. Your puns make a hard day at the office more bearable! Joelle, Joanna, Kacy, Audrey, Janine, Luise and Mark, thanks for making the group a great one to work in. Even if we don’t necessarily coincide in terms of work, I am grateful for all your support! Former members Cedric, Abdul, Bimal, Dil, Pranjul and Mahesh, thanks for all the good times! Finally, I would like to thank the coffee machine and my laptop for not breaking throughout my PhD, despite heavy usage.

Yohan Jarosz, thanks for your enthusiasm and your positive attitude in times of difficulty. You brought the work within this thesis to another level. You the man! To our number one collaborator, Dr. Patrick May, working with you has truly been a valuable learning experience! Thanks for all your support and guidance. Dr. Nicolas Pinel, thanks for setting up the foundation of this work. Also, thanks to the rest of the R3 team and the HPC team for all their support during this work.

I thank the Fonds National de la Recherche for funding my PhD. Thanks to the University of Luxembourg, the Luxembourg Centre for Systems Biomedicine and the Doctoral School in Systems and Molecular Biomedicine for their support throughout my PhD.

To my former supervisor from the University of Helsinki, Dr. Jarkko Salojärvi, you’ve always been and still are an inspiration to me. Prof. Kaarina Sivonen, thanks for introducing me to the world of microbiology, which sparked the interest in my current work. Dr. Hao Wang, thanks for your supervision during my master’s thesis. The University of Helsinki provided the foundation necessary to carry out this work.

To all my good friends in Luxembourg, Susana, Shankari, Aishwarya, Raman, Dheeraj, Zuogong, Anestis, Hector and Sevgin. You are my family over here. Thanks for always being around. Special thanks to my fellow UNIjammers for making my experience in Luxembourg something I would never have imagined! My friends in Helsinki, my home away from home, thank you. And last but not least, to my friends back in Malaysia, your encouragement is greatly valued. I’ll never forget where I come from and I will always be a Klang boy. Satu hati sampai mati!

To my extended family back in Malaysia: Chinnavva, Susi pinni, Prema pinni, Vengkat chinnaya, Uncle Rajan, Mani mava, Ananthi At the, Karishma, Neelam, Snegha, Tharisan, Ram, Raj, Hamsa, Sarvin, Suresh mava, Aunty Evis, Nastenca, Yascinta, Sabrina, Berenth. Thanks for keeping Amma company when all of us are not around and when times were rough for her. I am truly indebted to all of you for all that you have done! Finally, to my dearest sisters Kaarjel and Sai; thanks for being irritatingly over achieving siblings, and making me feel the need to push myself to fit in with you both! You are the best sisters anyone could ask for. Thanks very much for always being there. Love you both! Amma and Papa, thanks for always believing in me. Thanks for being an inspiration and a motivation. I am the person I am today because of your love and the values you instilled within me. And for that, I am eternally grateful to you both.


ABSTRACT

Microbial communities are ubiquitous and dynamic systems that inhabit a multitude of environments. They underpin natural as well as biotechnological processes, and are also implicated in human health. The elucidation and understanding of these structurally and functionally complex microbial systems using a broad spectrum of toolkits ranging from in situ sampling, high-throughput data generation ("omics"), bioinformatic analyses, computational modelling and laboratory experiments is the aim of the emerging discipline of Eco-Systems Biology. Integrated workflows which allow the systematic investigation of microbial consortia are being developed. However, in silico methods for analysing multi-omic data sets are so far typically lab-specific, applied ad hoc, limited in terms of their reproducibility by different research groups and sub-optimal in the amount of data actually being exploited. To address these limitations, the present work initially focused on the development of the Integrated Meta-omic Pipeline (IMP), a large-scale reference-independent bioinformatic analysis pipeline for the integrated analysis of coupled metagenomic and metatranscriptomic data. IMP is an elaborate pipeline that incorporates robust read preprocessing, iterative co-assembly, analyses of microbial community structure and function, automated binning as well as genomic signature-based visualizations. The IMP-based data integration strategy greatly enhances overall data usage, output volume and quality, as demonstrated using relevant use-cases. Finally, IMP is encapsulated within a user-friendly implementation using Python while relying on Docker for reproducibility. The IMP pipeline was then applied to a longitudinal multi-omic dataset derived from a model microbial community from an activated sludge biological wastewater treatment plant with the explicit aim of following bacteria-phage interaction dynamics using information from the CRISPR-Cas system.
This work provides a multi-omic perspective of community-level CRISPR dynamics, namely changes in CRISPR repeat and spacer complements over time, demonstrating that these are heterogeneous, dynamic and transcribed genomic regions. Population-level analysis of two lipid-accumulating bacterial species associated with 158 putative bacteriophage sequences enabled the observation of phage-host population dynamics. Several putatively identified bacteriophages were found to occur at much higher abundances compared to other phages, and these specific peaks usually do not overlap with those of other putative phages. In addition, several RNA-based CRISPR targets were found to occur in high abundances. In summary, the present work describes the development of a new bioinformatic pipeline for the analysis of coupled metagenomic and metatranscriptomic datasets derived from microbial communities and its application to a study focused on the dynamics of bacteria-virus interactions. Finally, this work demonstrates the power of integrated multi-omic investigation of microbial consortia towards the conversion of high-throughput next-generation sequencing data into new insights.


SCIENTIFIC OUTPUT

Major parts of this thesis are based upon work that has been published, is currently under peer review and/or is ready for submission with the candidate as the first author. In addition, the candidate has co-authored several publications, minor parts of which are incorporated in the thesis. The full list of scientific outputs is given in the sections below and the original manuscripts are provided in Appendix A.

Publications in peer-review journals

• Shaman Narayanasamy, Emilie E.L. Muller, Abdul R. Sheik, Paul Wilmes (2015). Integrated omics for the identification of key functionalities in biological wastewater treatment microbial communities. Microbial Biotechnology 8: 363-368. [Appendix A.1]

• Shaman Narayanasamy†, Yohan Jarosz†, Emilie E.L. Muller, Anna Heintz-Buschart, Malte Herold, Anne Kaysen, Cédric C. Laczny, Nicolás Pinel, Patrick May, Paul Wilmes (2016). IMP: a pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses. Genome Biology 17: 260. [Appendix A.2]

• Emilie E.L. Muller, Nicolás Pinel, Cédric C. Laczny, Michael R. Hoopmann, Shaman Narayanasamy, Laura A. Lebrun, Hugo Roume, Jake Lin, Patrick May, Nathan D. Hicks, Anna Heintz-Buschart, Linda Wampach, Cindy M. Liu, Lance B. Price, John D. Gillece, Cédric Guignard, Jim M. Schupp, Nikos Vlassis, Nitin S. Baliga, Robert L. Moritz, Paul S. Keim, Paul Wilmes (2014). Community-integrated omics links dominance of a microbial generalist to fine-tuned resource usage. Nature Communications 5: 5603. [Appendix A.3]

• Hugo Roume, Anna Heintz-Buschart, Emilie E.L. Muller, Patrick May, Venkata P. Satagopam, Cédric C. Laczny, Shaman Narayanasamy, Laura A. Lebrun, Michael R. Hoopmann, Jim M. Schupp, John D. Gillece, Nathan D. Hicks, David M. Engelthaler, Thomas Sauter, Paul S. Keim, Robert L. Moritz, Paul Wilmes (2015). Comparative integrated omics: identification of key functionalities in microbial community-wide metabolic networks. NPJ Biofilms and Microbiomes 1: 15007. [Appendix A.4]

† Co-first author


Submissions in peer-review journals

• Linda Wampach, Anna Heintz-Buschart, Angela Hogan, Emilie E.L. Muller, Shaman Narayanasamy, Cédric C. Laczny, Luisa W. Hugerth, Lutz Bindl, Jean Bottu, Anders F. Andersson, Carine de Beaufort, Paul Wilmes (submitted). Colonization and succession within the human gut microbiome by archaea, bacteria and microeukaryotes during the first year of life. Frontiers in Microbiology. [Appendix A.5]

• Anne Kaysen, Anna Heintz-Buschart, Emilie E.L. Muller, Shaman Narayanasamy, Linda Wampach, Cédric C. Laczny, Katharina Franke, Jörg Bittenbring, Jochen G. Schneider, Paul Wilmes (submitted). Integrated meta-omic analyses of the gastrointestinal tract microbiome in patients undergoing allogeneic stem cell transplantation. Journal of Experimental & Clinical Cancer Research. [Appendix A.6]

Publication in non-peer review platform

• Shaman Narayanasamy†, Yohan Jarosz†, Emilie E.L. Muller, Cédric C. Laczny, Malte Herold, Anne Kaysen, Anna Heintz-Buschart, Nicolás Pinel, Patrick May, Paul Wilmes (2016). IMP: a pipeline for reproducible metagenomic and metatranscriptomic analyses. BioRxiv. [Appendix A.8]

Manuscripts in preparation

• Emilie E.L. Muller†, Shaman Narayanasamy†, Myriam Zeimes, Laura A. Lebrun, Nathan D. Hicks, John D. Gillece, James M. Schupp, Paul Keim, Paul Wilmes (in preparation). First draft genome sequence of a strain belonging to the Zoogloea genus and its gene expression in situ. [Appendix A.7]

• Shaman Narayanasamy, Emilie E. L. Muller, Laura A. Lebrun, Nathan D. Hicks, John D. Gillece, James M. Schupp, Paul S. Keim, Paul Wilmes (in preparation). The dynamics of bacteriophages and bacterial host populations within oleaginous microbial consortia within wastewater treatment plants. [Chapter 3]

Oral presentations in scientific conferences, symposia and workshops

• Dynamic changes in the CRISPR-spacer complement and targeted bacteriophages within a natural microbial community resolved using time-resolved metagenomics and metatranscriptomics (2014). Opening the Microbial World With Metagenomics. Helsinki, Finland.

• Metagenomic and metatranscriptomic analyses of CRISPR-Cas dynamics within a microbial community (2015). Life Sciences PhD days. Belval, Luxembourg.

• Integrated omics provides unprecedented insights into microbial community structure and function (2016). European Space Agency workshop for Micro-Ecological Life Support Systems Alternative (MELiSSA). Lausanne, Switzerland.


• Metagenomic and metatranscriptomic analyses of CRISPR-Cas dynamics within a mixed microbial community (2016). 16th Conference of the International Society of Microbial Ecology. Montreal, Canada.

• IMP: A reproducible pipeline for reference-independent integrated metagenomic and metatranscriptomic analyses (2016). LCSB Minisymposium on Lab Automation. Belval, Luxembourg.

• IMP: The Integrated Meta-omic Pipeline: a tale of analysis and automation towards reproducible research results (2016). 1st RSG Luxembourg Congress & 2nd BeNeLuxFr Symposium. Belval, Luxembourg.

Poster presentations in scientific conferences, symposia and workshops

• A dynamic population-level model of antiviral defense mechanisms (2013). Life Science PhD Days. Luxembourg, Luxembourg.

• Eco-systems biology of microbial communities: Integration of biomolecular information from unique samples (2013). EMBL Symposium: New Approaches and Concepts in Microbiology. Heidelberg, Germany.

• Eco-systems biology of microbial communities: Integration of biomolecular information from unique samples (2013). 2nd Symposium of Systems Biomedicine. Belval, Luxembourg.


CONTENTS

Abstract
Scientific output
Contents
List of Figures
List of Tables

1 Integrated omics for the characterization of microbial community structure, function and dynamics
1.1 Microbial communities
1.2 Model microbial community
1.3 Bacteriophage - bacterial host interactions
1.3.1 Phage infection mechanisms
1.3.2 The CRISPR-Cas mechanism
1.4 Eco-Systems Biology
1.4.1 Eco-Systems Biology for the study of phage-host interactions
1.4.2 Biomolecular extraction
1.4.3 Multi-omic measurements
1.4.4 Meta-omic NGS data analysis
1.4.5 Multi-omic analyses of meta-omic data
1.5 Objectives of this work

2 A pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses
2.1 Abstract
2.2 Background
2.3 Methods
2.3.1 Details of the IMP implementation and workflow
2.3.2 Iterative single-omic assemblies
2.3.3 Execution of pipelines
2.3.4 Data usage assessment
2.3.5 Assembly assessment and comparison
2.3.6 Analysis of contigs assembled from MT data
2.3.7 Analysis of subsets of contigs
2.3.8 Computational platforms
2.3.9 Availability of data and material
2.4 Results
2.4.1 Overview of the IMP implementation and workflow
2.4.2 Assessment and benchmarking
2.4.3 Use-cases of integrated metagenomic and metatranscriptomic analyses in IMP
2.5 Discussion
2.6 Conclusion

3 The dynamics of bacteriophages and bacterial host populations within the model system
3.1 Abstract
3.2 Background
3.3 Methods and material
3.3.1 Sampling and strain collection
3.3.2 Extraction of biomolecules
3.3.3 Metagenome and metatranscriptome sequencing
3.3.4 Bioinformatic analyses
3.4 Results
3.4.1 Large-scale analyses using IMP
3.4.2 Community-level analysis of CRISPR elements and protospacers
3.4.3 Population-level analysis of CRISPR elements
3.4.4 Putative bacteriophage sequences
3.4.5 Bacteriophage and host dynamics
3.5 Discussion
3.6 Conclusion

4 General conclusions and outlook
4.1 Integrated omics: From data to associations
4.2 Extending the functionality of IMP
4.2.1 Updates with state-of-the-art tools
4.2.2 Integration of reference-based analysis
4.2.3 Extension to multi-sample analyses
4.2.4 Metaproteomic analyses engine
4.2.5 Keeping up with technological advancements
4.2.6 Standardized benchmarking for integrated omics
4.3 Bacterial-phage interactions
4.3.1 Moving beyond associations and hypotheses
4.3.2 From Eco-Systems Biology to applications

References
Glossary
Appendices

Appendix A Article manuscripts
A.1 Integrated omics for the identification of key functionalities in biological wastewater treatment microbial communities
A.2 IMP: a pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses
A.3 Community-integrated omics links dominance of a microbial generalist to fine-tuned resource usage
A.4 Comparative integrated omics: identification of key functionalities in microbial community-wide metabolic networks
A.5 Colonization and succession within the human gut microbiome by archaea, bacteria and microeukaryotes during the first year of life
A.6 Integrated meta-omic analyses of the gastrointestinal tract microbiome in patients undergoing allogeneic stem cell transplantation
A.7 First draft genome sequence of a strain belonging to the Zoogloea genus and its gene expression in situ
A.8 IMP: a pipeline for reproducible metagenomic and metatranscriptomic analyses

Appendix B Additional figures
B.1 Supplementary figures for Chapter 3

Appendix C Additional tables
C.1 Supplementary tables for Chapter 3

Appendix D Additional files
D.1 Additional file 2.1
D.2 Additional file 2.2
D.3 Additional file 2.3


LIST OF FIGURES

1.1 Schematic representation of an activated sludge based biological wastewater treatment plant process
1.2 The path from large-scale integrated omics to hypothesis testing and biotechnological application in the context of biological wastewater treatment
1.3 Biofuel production from wastewater sludge
1.4 Structure, life cycle and dynamics of bacteriophages
1.5 Mode of action of type II CRISPR-Cas systems
1.6 Concomitant extraction of biomolecules from a single unique microbial community sample and their downstream high-throughput measurement techniques
1.7 Next generation sequencing protocols and chemistries
1.8 Simplified workflow for reference-independent metagenomic and/or metatranscriptomic analyses
2.1 Schematic overview of the IMP pipeline
2.2 Example output from the IMP analysis of a human microbiome dataset (HF1)
2.3 Assessment of data usage and output generated from co-assemblies compared to single-omic assemblies
2.4 Assessment of the IMP-based iterative co-assemblies in comparison to MOCAT- and MetAMOS-based co-assemblies
2.5 Metagenomic and metatranscriptomic data integration of a human fecal microbiome
3.1 Simplified schema for the study of phage-host interaction
3.2 Summary of input and output of IMP analyses of the LAMPs time-series datasets
3.3 Summary of CRISPR elements detected using different methods
3.4 Community-wide dynamics of CRISPR elements
3.5 Summary of CRISPR element information
3.6 M. parvicella population-level dynamics of CRISPR elements
3.7 LCSB005 population-level dynamics of CRISPR elements
3.8 Dynamics of LCSB005 host and associated bacteriophages
3.9 Dynamics of M. parvicella host and associated bacteriophages
3.10 Dynamics of M. parvicella host and associated RIGes
B.1 Microscopy photo of bacterial strain LCSB005


LIST OF TABLES

1.1 Multi-omic studies of microbial communities
2.1 Statistics of iterative assemblies performed on MG and MT datasets
2.2 Mapping statistics for human microbiome samples
2.3 Contigs with a likely viral/bacteriophage origin/function reconstructed from the metatranscriptomic data
3.1 Summary of CRISPR elements detected using different methods
3.2 Summary statistics of CRISPR elements
3.3 Number of CRISPR repeats and flanks (from metagenomic and metatranscriptomic data) associated to draft genomes of lipid accumulating bacterial species
3.4 Summary of putative phages of the M. parvicella and LCSB005 populations
4.1 The development of IMP
C.1 IMP time series analyses summary
C.2 Summary statistics of LAMP community CRISPR elements based on different time points
C.3 Summary statistics of Candidatus Microthrix parvicella Bio17-1 population CRISPR elements based on different time points
C.4 Summary statistics of LCSB005 population CRISPR elements based on different time points


CHAPTER 1

INTEGRATED OMICS FOR THE CHARACTERIZATION OF MICROBIAL COMMUNITY STRUCTURE, FUNCTION AND DYNAMICS

A major part of this chapter was adapted and modified from the following first-author peer-reviewed publications:

Shaman Narayanasamy, Emilie E.L. Muller, Abdul R. Sheik, Paul Wilmes (2015). Integrated omics for the identification of key functionalities in biological wastewater treatment microbial communities. Microbial Biotechnology 8: 363-368. [Appendix A.1]

Shaman Narayanasamy†, Yohan Jarosz†, Emilie E.L. Muller, Anna Heintz-Buschart, Malte Herold, Anne Kaysen, Cédric C. Laczny, Nicolás Pinel, Patrick May, Paul Wilmes (2016). IMP: a pipeline for reproducible reference-independent integrated metagenomic and metatranscriptomic analyses. Genome Biology 17: 260. [Appendix A.2]


1.1 Microbial communities

Naturally occurring microbial communities (or consortia) are ubiquitous in the environment and underpin important biomedical, biotechnological and natural processes. For instance, the human microbiome (microbial communities in and on the human body) plays an important role in human health [Turnbaugh et al., 2007; Greenhalgh et al., 2016]; activated sludge microbial communities within biological wastewater treatment plants are important for the remediation of communal wastewater prior to its release into the environment [Daims et al., 2006]; and marine microbial communities are believed to be the main photosynthetic oxygen producers [Arrigo, 2005]. Given the importance of microbial communities, it is essential for the scientific community to better understand these components of nature in their natural environments.

This work uses the terms microorganisms and microbes interchangeably. While these terms carry a general definition and are used in various contexts, within this work they encompass a broad range of microbial taxa including, but not limited to, bacteria, archaea, protozoa, micro-eukaryotes and viruses. A collection of microbial cells of the same species/subtype present in the same place and at the same time is referred to as a population. In general, microbes rarely exist naturally as isolated populations, but rather as mixtures of different microbial populations. These mixtures of microbial populations are referred to as microbial communities (or mixed microbial communities), which may have emergent properties, i.e. the properties of the entire community are not simply the sum of the properties of its constituent populations, and thus cannot be predicted by studying individual populations separately [Odum and Barrett, 1971]. The complexity of microbial communities varies considerably from one system to another. For example, acid mine drainage biofilms represent relatively simple communities, with low diversity and dominance by specific taxa [Denef et al., 2010]. On the other end of the spectrum, soil microbial communities exhibit far more complex structures, with up to thousands of different microbial populations and rapid changes in community composition driven by rapid environmental fluctuations [Mocali and Benedetti, 2010]. In between these two extremes, there are microbial communities such as those present within biological wastewater treatment plants, which exhibit important characteristics of both low and high complexity microbial communities [Sheik et al., 2014; Narayanasamy et al., 2015; Muller et al., 2014a]. Such communities therefore represent good model systems for microbial ecology [Daims et al., 2006].

Co-existing microbial populations within natural microbial communities are usually present in differing abundances (i.e. differing community structures) and undergo constant change over time (i.e. community dynamics). These complex structures and dynamics result from constant adaptation of the community to environmental fluctuations, which include physical (temperature, pH) and chemical (substrate availability) changes [Muller et al., 2013; Narayanasamy et al., 2015]. Furthermore, populations within a microbial community are constantly interacting with each other (e.g. predation, competition, mutualism, antagonism), further affecting the overall dynamics of the community.

Understanding microbial community structure (i.e. which members make up the community) is of general interest. More recently, however, interest has also focused on deciphering the function/phenotype of different microbial populations within communities to elucidate what the different members of the community are doing. This rests on the assumption that different populations within a given microbial community carry out specific functions or roles, thus contributing to the collective phenotype of the community [Muller et al., 2013; Narayanasamy et al., 2015]. Given the aforementioned characteristics (i.e. the complexity and dynamics) of microbial communities, they may be viewed as omnipresent and highly complex, yet elusive, components of the environment.

The fields of microbiology and molecular biology have advanced greatly over the past years due to the emergence of cutting-edge technologies that enable high-resolution and high-throughput molecular measurements [Muller et al., 2013; Segata et al., 2013]. Thus, to complement the classical tools, techniques and strategies of microbiology based on strain cultivation under controlled laboratory conditions [Stewart, 2012], the scientific community has moved towards the direct study of microbial communities within their natural environments. Studying microbial consortia by application of high-throughput, high-resolution molecular measurements (“meta-omics”) provides the opportunity to discover novel organisms and functionalities (genes), which may not be possible with classical microbiological methods, due to the unculturability of most naturally occurring microbial taxa under standard laboratory conditions [Staley and Konopka, 1985; Amann et al., 1995; Stewart, 2012]. However, it is important to define microbial communities that will serve as models for a fundamental understanding of microbial communities (i.e. complexity, interactions and dynamics), as well as communities that play an important role in nature, biotechnological processes and/or human health.

1.2 Model microbial community

The present work leveraged a model microbial community found within biological wastewater treatment (BWWT) plants for extensive study. These communities are biotechnologically relevant due to their influence on the wastewater treatment process, which is in turn important for the environment. In particular, this work focuses on lipid-accumulating microbial populations present in floating sludge islets that occur at the air-water interface of the anoxic tanks. The lipid-accumulating phenotype of these organisms represents a potential resource for renewable energy production from wastewater. More importantly for the present work, though, this system is well suited for gaining a fundamental understanding of the characteristics and dynamics of naturally occurring microbial communities.

Direct discharge of organic (e.g. carbohydrates, fats, proteins, organic solvents) and inorganic (e.g. phosphate, nitrate, metallic ions) compounds into natural water bodies may lead to severe perturbations of ecosystems, as these compounds can either serve as nutrients and stimulate the growth of heterotrophic organisms, leading to a reduction in dissolved oxygen, or be toxic towards the native organisms [Conley et al., 2009; Roume et al., 2013b]. Therefore, BWWT relies on naturally occurring microbial community-driven remediation of municipal and/or industrial wastewater before its release into the environment. Since its conception about a century ago by E. Ardern and W.T. Lockett, BWWT, including the standard activated sludge process and other ancillary processes, has become a widespread process that is present in most of the developed world. For instance, in 2013 Luxembourg had 109 BWWT plants that handled approximately 95 % of the total wastewater [Roume et al., 2013b]. While the overall procedure seems rather simple, BWWT is a complex process at the interface of engineering, biology and biochemistry, which is not completely understood to date [Wang and Pereira, 1987]. Conventional BWWT plants are made up of a combination of physical, chemical and biological stages that remove solids, organic matter and nutrients from wastewater (Figure 1.1). In summary, the objectives of wastewater treatment include: i) minimizing the release of organic compounds into natural water bodies to reduce the bloom of heterotrophic organisms, thereby reducing overall oxygen demand, ii) oxidation of ammonia to reduce its toxicity and deoxygenation effects and iii) reduction of eutrophic substances, such as phosphate [Mara and Horan, 2003; Conley et al., 2009; Roume et al., 2013b]. Traditional BWWT plants consist of three stages: i) physical treatment, which removes suspended solids from the wastewater, ii) primary treatment, which removes settleable organic and inorganic solids via sedimentation as well as grease and oil by skimming and iii) secondary treatment, which removes or reduces organic matter in the wastewater using an aerobic biological treatment process, i.e. the activated sludge process (Figure 1.1). This process relies on naturally occurring microbial communities to reduce organic compound availability in the wastewater [Wagner and Loy, 2002]. These organic compounds are mainly assimilated into microbial biomass (carbon sources) or are oxidized and released as carbon dioxide. In essence, wastewater treatment relies on the digestion of energy-rich C-C bonds by microorganisms and the transformation of the corresponding compounds into microbial biomass as a means of removing them from the wastewater.

Although the activated sludge process is one of the most widely used biotechnological processes in the world, it is known to be highly energy- and resource-consuming (e.g. water pumping, air bubbling). Yet, BWWT processes hold great potential for the future sustainable production of various commodities, including energy, from wastewater as well as from other mixed substrates, further expanding on their original function of wastewater treatment [Sheik et al., 2014; Muller et al., 2014a]. Indeed, BWWT plants host diverse and dynamic microbial communities, which in turn contain microbial species that possess varied metabolic capabilities over changing environmental conditions, e.g. microorganisms accumulating various storage compounds of biotechnological importance, thus making these plants a reservoir of potentially useful novel microbial species [Sheik et al., 2014; Muller et al., 2014a; Narayanasamy et al., 2015]. Consequently, BWWT plants represent a readily available resource (and facility) for the production of biofuels, with a relatively low cost of modification to already existing structures. Approximately 226 prokaryotic organisms have been identified within various BWWT microbial communities. However, information on and detailed study of these potentially useful microorganisms remain limited, with draft genomes reported so far for only 72 of the 226 identified organisms [McIlroy et al., 2015].

The model microbial system studied in the present work is represented by microbial communities occurring within floating sludge islets of an anoxic tank of a BWWT plant (Figures 1.1 to 1.3) [Sheik et al., 2014; Muller et al., 2014b]. The anoxic tank (Figure 1.1) is part of the activated sludge process, more specifically of the secondary treatment stage of the BWWT process, and promotes denitrification, i.e. the reduction of nitrate (NO₃⁻) to nitrogen gas (N₂) by heterotrophic bacteria (i.e. bacteria that require organic carbon for growth). The removal of nitrogen from wastewater is achieved by limiting dissolved oxygen (O₂) levels, such that heterotrophic bacteria are forced to use nitrate instead of oxygen for energy generation, with the nitrogen then being released as gaseous dinitrogen. The water surface of these anoxic tanks tends to accumulate foamy sludge islets (Figure 1.3), which contain lipid accumulating microbial populations (LAMPs). The most notable of these is Candidatus Microthrix parvicella (also referred to as M. parvicella), a filamentous lipid accumulating organism that is highly dominant (up to 30 % relative abundance) in the system [Blackall et al., 1996; Muller et al., 2012; McIlroy et al., 2013]. Its lipid accumulating properties are of pronounced interest from a biotechnological perspective, most specifically for lipid-based biofuel production from wastewater [Muller et al., 2014a; Sheik et al., 2016]. Compared to other by-products of BWWT plants, the floating sludge islets can be easily collected through surface skimming. Therefore, it is of great interest to maximize the abundance of LAMPs, such as M. parvicella, through systematic manipulation of this specific microbial community for consistent and optimal production of biofuels from BWWT plants [Sheik et al., 2014].

In addition to being a resource for the production of high added value compounds, LAMPs are also well suited for fundamental studies aimed at obtaining generalizable understanding and knowledge with regard to the ecology of microbial consortia. LAMPs exist within a fluctuating environment (water/air temperature, pH, oxygen and nutrient concentrations). However, these fluctuations almost always remain within well-defined and well-controlled physical and chemical boundaries [Daims et al., 2006; Sheik et al., 2014; Muller et al., 2014a]. Overall, LAMPs represent a unique combination of a highly fluctuating, yet relatively well-controlled environment, which is rare among natural ecosystems. More importantly, physico-chemical parameters, such as temperature, pH, oxygen and nutrient concentrations, are routinely monitored and recorded. When coupled with temporal sampling, such detailed monitoring allows causal links to be established between environmental factors and microbial community structure and/or function. As such, this system also represents a convenient and virtually unlimited (high reproducibility) source of spatially and temporally resolved samples (Figures 1.2 and 1.3). Obtaining temporal and/or spatial samples from other microbial habitats, e.g. the marine environment, acid mine drainage biofilms or the human gastrointestinal tract, would be rather challenging or, in some cases, nearly impossible.

While being highly dynamic, LAMPs maintain a medium to high range of diversity/complexity with an alpha (α)-diversity of approximately 600, representing an important intermediary step/model between communities of lower diversity, e.g. acid mine drainage biofilms [Denef et al., 2010], and complex communities, such as those from soil environments [Mocali and Benedetti, 2010]. In addition, LAMPs also exhibit a baseline stability over time, such that there is a recurring temporal succession of a few quantitatively dominant (up to 30 % relative abundance) populations [Muller et al., 2014b,a; Roume et al., 2015]. Overall, the model community is highly dynamic, while retaining important and interesting hallmarks of other microbial communities including, for example, quantitative dominance of specific taxa (a characteristic of acid mine drainage biofilm communities) and rapid stochastic environmental fluctuations (a characteristic of soil environments). Microbial consortia from BWWT plants, including LAMPs, are very amenable to experimental validation at differing scales, ranging from laboratory-scale bioreactors to full-scale plants, thus allowing controlled experiments to be conducted (Figure 1.2). Overall, LAMPs exhibit important characteristics and properties that render them an ideal model for microbial ecology [Daims et al., 2006], and more specifically for eco-systematic omic studies in line with a discovery-driven planning approach [Muller et al., 2013], facilitating hypothesis formulation and verification in rapid succession (Figure 1.2).
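The diversity and dominance figures quoted above can be made concrete with a small calculation. The text does not state which α-diversity measure underlies the value of approximately 600, so the following Python sketch assumes two common choices, observed richness and the Shannon index, alongside relative abundance. The taxon names and counts are hypothetical and serve only to illustrate a community with one population at 30 % relative abundance.

```python
import math


def relative_abundance(counts):
    """Convert raw per-taxon counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    return {taxon: n / total for taxon, n in counts.items()}


def richness(counts):
    """Number of taxa observed at least once (a simple alpha-diversity measure)."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over the observed taxa."""
    rel = relative_abundance(counts)
    return -sum(p * math.log(p) for p in rel.values() if p > 0)


# Hypothetical counts (assumption): one dominant taxon and one unobserved taxon.
example_counts = {
    "taxon_A": 30,
    "taxon_B": 25,
    "taxon_C": 25,
    "taxon_D": 20,
    "taxon_E": 0,
}

if __name__ == "__main__":
    print("richness:", richness(example_counts))
    print("relative abundance:", relative_abundance(example_counts))
    print("Shannon index:", shannon_index(example_counts))
```

Richness only counts which taxa are present, whereas the Shannon index also reflects how evenly they are distributed, which is why a community with a few strongly dominant populations can combine a high richness with a moderate Shannon index.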

In conclusion, the present work leverages a model community that is interesting from a biotechnological perspective of renewable biofuel production while being a representative system for studying microbial communities in general.


Figure 1.1: Schematic representation of an activated sludge-based biological wastewater treatment plant process. Primary treatment consists of screening and grit removal in order to remove large-sized floating solids, while the primary clarifier is used to remove settling solids. The pre-treated wastewater is then mixed with microbial biomass present in the activated sludge and iteratively pumped into the aerobic tank, where aerators enable its agitation and oxygenation, and then into the anoxic tank. The activated sludge flocs settle in the secondary clarifier: the majority of this biomass is recycled to the beginning of the activated sludge process and the rest is either disposed of or further used for methane production through anaerobic digestion. The treated wastewater effluent is then released into the environment (Adapted from Zeimes [2015]).


Figure 1.2: The path from large-scale integrated omics to hypothesis testing and biotechnological application in the context of biological wastewater treatment. Step 1; spatially and temporally resolved samples from BWWT plants. Step 2; sequential isolation of high-quality genomic deoxyribonucleic acid (DNA), ribonucleic acid (RNA), small RNA, proteins and metabolites from a single, undivided sample for subsequent systematic multi-omic measurements, complemented by physico-chemical records. Step 3; multi-omic data integration and analysis for multi-level snapshots of microbial community structure and function in situ. Step 4; statistical and mathematical modelling. Step 5; testing through targeted laboratory and/or in situ perturbation experiments followed by additional omic measurements. Step 6; control of microbial community structure and/or function (Adapted from Narayanasamy et al. [2015]).


Figure 1.3: Biofuel production from wastewater sludge. Aerial photograph of the Schifflange biological wastewater treatment plant, Esch-sur-Alzette, Luxembourg (49°30′48.29″ N; 6°1′4.53″ E) operated by Syndicat Intercommunal à Vocation Ecologique. The “anoxic tank number 1” is highlighted by the blue circle and the corresponding photos from that tank in autumn and winter show variable content of sludge in the different seasons. The sludge islets (e.g. highlighted in yellow) contain lipid accumulating microbial populations (LAMPs), which are potential biofuel producers (Courtesy of E.E.L. Muller).


1.3 Bacteriophage - bacterial host interactions

Viruses are known to be the most abundant and diverse biological entities on the planet, inhabiting almost every environment, with an estimated total of 10³⁰ to 10³² viral particles on Earth [Marcó et al., 2012]. They are believed to be responsible for the lysis of up to 50 % of prokaryotic cells, thereby increasing the bioavailability of carbon and playing an important overall role in the carbon cycle [Breitbart and Rohwer, 2005]. It is important to note that viruses (and all relevant subclasses of viruses) are referred to as biological entities/components within the scope of this work [Rybicki, 1990; Raoult and Forterre, 2008; Koonin and Starokadomskyy, 2016].

Bacteriophages (also referred to as phages throughout this work) are a subclass of viruses that infect and replicate specifically within bacterial cells. They are believed to play an essential role in microbial communities by shaping their structure and influencing their dynamics (Figure 1.4) [Samson et al., 2013]. Accordingly, studies have shown the involvement of phages within simple communities, such as acid mine drainage biofilms [Andersson and Banfield, 2008], and within more complex microbial communities, such as: i) the marine microbiome [Wommack and Colwell, 2000; Suttle, 2007; Sheik et al., 2014], ii) the human gastrointestinal tract microbiome [Stern et al., 2012], iii) laboratory-scale sludge bioreactors [Kunin et al., 2008] and iv) full-scale BWWT plants [Yasunori et al., 2002].

Given the capability of phages to lyse bacterial cells (Figure 1.4), they have been suggested as a viable microbial community control strategy in various biomedical and biotechnological processes that rely on microbial communities [Withey et al., 2005; Jassim et al., 2016]. The idea itself can be traced back to as early as 1962 [Claeys, 1962], and there are also documented cases of the application of phage therapy [Withey et al., 2005; Jassim et al., 2016]. However, the inconsistent results of phage treatment, coupled with the emergence of antibiotics, brought about the decline of phage therapy [Withey et al., 2005; Jassim et al., 2016]. More recently, interest in phage-based treatments has resurfaced, including their possible application as a control strategy for BWWT processes [Withey et al., 2005; Jassim et al., 2016].

In principle, phage treatment could be used to address common issues that plague BWWT plants by: i) controlling the foaming of activated sludge (i.e. anoxic tank floating sludge islets; Section 1.2), ii) improving sludge dewaterability and digestibility, iii) removing pathogenic bacterial strains or iv) reducing strains that compete with functionally important/useful bacterial populations [Withey et al., 2005; Jassim et al., 2016]. In order to apply such strategies, it is essential to understand the role of bacteriophages in shaping BWWT plant communities, such as LAMPs [Withey et al., 2005; Jassim et al., 2016].

Despite the abundance and diversity of bacteriophages, information with regard to these biological entities is relatively sparse compared to that on their bacterial hosts, with approximately 2,200 viral genomes versus more than 45,000 bacterial genomes in publicly available databases [Reddy et al., 2015; Paez-Espino et al., 2016]. This gap in information is due to several reasons including, but not limited to: i) a large fraction of host populations cannot be cultivated, which prevents the culturing of the associated bacteriophages (Section 1.1), ii) the absence of universal marker genes for bacteriophages, analogous to the 16S rRNA gene for bacteria, makes it challenging to classify phage genomic material (Sections 1.1 and 1.4.3) [Roux et al., 2011] and iii) some phages integrate their genomes within bacterial host genomes, hindering conclusive identification of phage genomes. Finally, given that a majority of bacterial species within BWWT plants remain unclassified (Section 1.2), only a sparse number of associated bacteriophages have been identified within LAMPs [Kunin et al., 2008].

Fortunately, the advent of high-throughput omic datasets (Section 1.4.3) opens up new opportunities to study bacteriophages that were not available to previous efforts [Pride et al., 2012; Reyes et al., 2012; Wommack et al., 2012; Shirley et al., 2015; Paez-Espino et al., 2016]. For instance, metagenomics (Section 1.4.3) provides access to all genomic (DNA) components within a given microbial community, including those of bacteriophages [Edwards et al., 2015]. Specific techniques have also been developed, including the use of information from bacterial antiviral defence mechanisms to associate bacteriophages with their host populations [Edwards and Rohwer, 2005; Andersson and Banfield, 2008; Stern et al., 2012; Wommack et al., 2012; Emerson et al., 2013b,a; Edwards et al., 2015].

1.3.1 Phage infection mechanisms

Phages are known to exhibit two types of life cycles (Figure 1.4). The first type is the lytic life cycle (Figure 1.4), in which phages replicate immediately, leading to lysis of the host cells. In contrast, certain bacteriophages follow a lysogenic life cycle (Figure 1.4), remaining dormant within the host cells by integrating their genetic material into the host genomes as prophages. They then replicate as lytic phages when the conditions are suitable. Overall, phages are obligate parasites which require successful infection of a host in order to replicate [Edwards et al., 2015]. However, bacterial hosts are able to fend off phage infections through an arsenal of defence mechanisms [Labrie et al., 2010; Edwards et al., 2015]. Bacterial defences against phages include, but are not limited to: i) mutation of membrane receptors, ii) lipopolysaccharide coating of the bacterial membrane, iii) restriction-modification of DNA and iv) the CRISPR-Cas system [Labrie et al., 2010]. As a consequence of these defence mechanisms, phages are under selective pressure to counter their hosts’ defences [Samson et al., 2013; Edwards et al., 2015]. Consequently, bacterial hosts and phages are locked in a constant evolutionary arms race driven mainly by phage-bacteria interactions, which in turn drive microbial community dynamics [Samson et al., 2013].


Figure 1.4: Structure, life cycle and dynamics of bacteriophages. (A) Structure of a phage (Adapted from Gelbart and Knobler [2008]). (B) Lytic phage life cycle involves: attachment to the bacterial cell; injection/adsorption of genetic material; replication of phage genome and generation of phage components; assembly of new phages; bacterial cell lysis and phage release (Adapted from Feiner et al. [2015]). (C) Lysogenic phage life cycle involves: attachment of phage to bacteria; injection/adsorption of genetic material; integration/insertion of phage genome into host genome; dormant state of phage as a prophage. The prophage exits the dormant stage and replicates as a lytic phage when conditions are favourable (Adapted from Feiner et al. [2015]). (D) Bacteriophage and bacterial host dynamics (Adapted from Bull et al. [2014]).


1.3.2 The CRISPR-Cas mechanism

While there are multiple ways of linking bacteriophages with their host populations [Edwards et al., 2015], this work primarily focuses on the CRISPR-Cas system as a means of associating bacteriophages with their bacterial hosts. The “clustered regularly interspaced short palindromic repeats”, or CRISPRs, are a class of sequences present within prokaryotic genomes that include distinct short repeat sequences, interspaced by short unique sequences [Barrangou and van der Oost, 2013; Rath et al., 2015; Amitai and Sorek, 2016]. These sequences were first identified and described by Yoshizumi Ishino and colleagues, who came across these regions by chance within the E. coli K12 strain [Ishino et al., 1987]. Such regions were later also found in other prokaryotic species, such as Haloferax mediterranei, Streptococcus pyogenes, Anabaena sp. PCC 7120 and Mycobacterium tuberculosis [Jansen et al., 2002]. The term “CRISPR” was coined by Jansen and colleagues [Mojica et al., 2000; Jansen et al., 2002]. In addition, they identified CRISPR-associated (cas) genes (which encode Cas proteins/enzymes) located adjacent (or in close proximity) to CRISPR genomic regions, thereby suggesting a functional relationship between those genes and the CRISPR genomic regions. Since then, a large number of studies have followed suit and deciphered the mechanism of the system as a memory-based immune system against invasive foreign genetic elements, such as bacteriophages and plasmids [Pourcel et al., 2005; Kunin et al., 2008; Shah et al., 2013; Zhang et al., 2013, 2014]. This form of defence came to be known as the CRISPR-Cas system and is estimated to exist within ~40 % of bacteria and ~90 % of archaea [Godde and Bickerton, 2006; Kunin et al., 2007; Karginov and Hannon, 2010]. The simplified mechanism of the CRISPR-Cas system is represented in Figure 1.5.

The CRISPR genomic region within a prokaryotic genome is made up of multiple elements (Figure 1.5). The first element is a direct repeat sequence of approximately 24 to 47 bp, which occurs multiple times within a given CRISPR region [Ishino et al., 1987; Jansen et al., 2002; Datsenko et al., 2012]. CRISPR loci are generally almost palindromic in nature, implying that these regions could form hairpin structures upon transcription (and post-transcriptional processing), and they are well conserved within different prokaryotic clades [Kunin et al., 2007]. Direct repeats are separated (or interspaced) by unique sequences known as spacers (Figure 1.5) [Ishino et al., 1987; Jansen et al., 2002; Kunin et al., 2007; Datsenko et al., 2012]. Unlike the direct repeat sequences, spacer sequences were shown to be highly dynamic, with constant additions, deletions and replacements of spacers within the CRISPR regions [Pourcel et al., 2005]. This causes spacers to be highly heterogeneous within populations of a single prokaryotic species [Pourcel et al., 2005]. Last but not least, functional CRISPR regions were shown to include an AT-rich leader sequence upstream [Bult et al., 1996; Klenk et al., 1997; Jansen et al., 2002; Karginov and Hannon, 2010].
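The repeat-spacer organisation described above can be illustrated with a minimal Python sketch that recovers the spacers lying between consecutive copies of a known direct repeat. It assumes that the repeat sequence is already known and that all of its copies are identical, which real CRISPR arrays do not guarantee; the function name and all sequences are hypothetical and do not correspond to any specific organism or tool.

```python
def extract_spacers(array_seq, repeat):
    """Return the sequences lying between consecutive exact copies of `repeat`.

    Assumption: every repeat copy is identical. Degenerate repeats, as found in
    real arrays, would need approximate matching instead of str.find.
    """
    spacers = []
    pos = array_seq.find(repeat)
    while pos != -1:
        start = pos + len(repeat)             # first base after this repeat copy
        nxt = array_seq.find(repeat, start)   # next repeat copy, if any
        if nxt == -1:
            break                             # last repeat: no spacer follows
        if nxt > start:
            spacers.append(array_seq[start:nxt])
        pos = nxt
    return spacers


# Hypothetical toy array: AT-rich leader, then repeat-spacer-repeat-spacer-repeat.
toy_leader = "ATATTAATATAA"
toy_repeat = "GTTTCCGTCCCCTCTCGGGGATTT"  # 24 bp, within the 24 to 47 bp range
toy_spacer_1 = "ACGTACGGATCCATGCAAGCTTGACCTAGGAT"
toy_spacer_2 = "TGCAGTCCGATAGCTAGGCTAACGTTAGCCAT"
toy_array = (
    toy_leader + toy_repeat + toy_spacer_1 + toy_repeat + toy_spacer_2 + toy_repeat
)

if __name__ == "__main__":
    for i, spacer in enumerate(extract_spacers(toy_array, toy_repeat), start=1):
        print(f"spacer {i}: {spacer}")
```

Because spacers are added at the leader end, the order of the returned list reflects the order along the array, with the spacer closest to the leader being the most recently acquired in this toy layout.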

The simplified mechanism of the CRISPR-Cas system can be separated into three main stages: i) adaptation, ii) CRISPR-RNA (crRNA) biogenesis and iii) interference (Figure 1.5). The adaptation stage involves Cas proteins/enzymes that detect and sample short sequence fragments from foreign invasive elements. These short fragments are known as protospacers, i.e. the original elements of CRISPR spacers. Protospacers are recognized by these Cas proteins through specific sequence motifs, known as protospacer adjacent motifs (PAMs) [Marraffini and Sontheimer, 2010; Shah et al., 2013]. Upon detection, the Cas proteins cleave a protospacer at both ends and incorporate the cleaved fragment into the CRISPR array of the host genome, usually at the leader end (i.e. at the first repeat sequence adjacent to the AT-rich flanking leader sequence) [Marraffini and Sontheimer, 2010; Datsenko et al., 2012]. As such, the spacer information is conserved in the next generation of bacteria, storing the history of previous infections [Marraffini and Sontheimer, 2010]. The crRNA biogenesis stage involves transcription of the CRISPR genomic region to generate an unprocessed crRNA (pre-crRNA) [van Rij and Andino, 2006; Labrie et al., 2010; Marraffini and Sontheimer, 2010]. The pre-crRNA is then further processed by cleavage, such that each individual spacer sequence is accompanied by a single repeat sequence, forming the processed crRNA [Marraffini and Sontheimer, 2010]. In the interference stage, these crRNAs form a complex with Cas proteins to detect and interfere with foreign genetic elements via cleavage/inhibition [Marraffini and Sontheimer, 2010]. More specifically, the repeat region within the crRNA folds into a hairpin structure which is used to bind the Cas proteins, while the spacer is utilized as a guide to target invasive genetic elements via complementary base pairing [Nishimasu et al., 2014]. While the general mechanism of the CRISPR-Cas system is relatively simple, specific CRISPR-Cas mechanisms can be further classified into types I, II and III. This classification is based on the participating cas genes (or Cas enzymes), and each type can be further divided into various subtypes [Makarova et al., 2011]. Furthermore, the catalogue of cas genes is continually expanding based on updated knowledge and new data [Zhang et al., 2014]. Complementary to the discovery of new Cas enzymes, many recent studies are also uncovering novel CRISPR-Cas mechanisms within prokaryotes.
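As an illustration of target recognition during interference, the following Python sketch searches a target sequence for sites that match a spacer and are immediately followed by a PAM. The PAM pattern NGG, its position directly downstream of the protospacer, the requirement for an exact spacer match, the forward-strand-only search and all sequences are assumptions made for this example; PAM sequence and orientation differ between CRISPR-Cas types, and real systems tolerate some mismatches.

```python
def pam_matches(site, pam):
    """True if `site` fits the PAM pattern, where 'N' in `pam` means any base."""
    return len(site) == len(pam) and all(
        p == "N" or p == s for p, s in zip(pam, site)
    )


def find_targets(target, spacer, pam="NGG"):
    """Return start positions where `spacer` occurs directly followed by a PAM.

    Only the forward strand is searched and only exact matches are reported
    (both are simplifying assumptions of this sketch).
    """
    hits = []
    pos = target.find(spacer)
    while pos != -1:
        end = pos + len(spacer)
        if pam_matches(target[end:end + len(pam)], pam):
            hits.append(pos)
        pos = target.find(spacer, pos + 1)
    return hits


# Hypothetical phage fragment containing the protospacer twice:
# once followed by 'TGG' (fits NGG) and once followed by 'TTT' (does not).
toy_spacer = "ACGTACGGATCCATGCAAGC"
toy_phage = "TTTT" + toy_spacer + "TGG" + "AAAA" + toy_spacer + "TTT"

if __name__ == "__main__":
    print("targetable sites:", find_targets(toy_phage, toy_spacer))
```

In this toy example only the first occurrence would be reported, which mirrors the idea that a matching protospacer without the adjacent motif is not treated as a target.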

The CRISPR-Cas system was rapidly translated into biotechnological application through the development of a genome editing tool based on the type II CRISPR-Cas system, mediated by the Cas9 nuclease protein/enzyme [Wiedenheft et al., 2011; Jinek et al., 2012; Sashital et al., 2012; Selle and Barrangou, 2015]. Yin and collaborators applied the CRISPR-Cas9 based genome editing tool to successfully correct a mutation within a mouse model of a human liver disease [Yin et al., 2014].

Given that CRISPR loci, more specifically the CRISPR spacers, represent the history of previous infections, multiple studies have effectively leveraged this information to identify concomitant phages, making it an important method for tracking host-phage interactions within microbial consortia [Stern et al., 2012; Biswas et al., 2013; Zhang et al., 2013; Edwards et al., 2015; Paez-Espino et al., 2016]. In summary, the aforementioned studies utilize CRISPR information and leverage omic data to expand the available genomic resources with regard to phages, thereby advancing the field [Paez-Espino et al., 2016] (Section 1.3.1). Despite the wealth of information delivered by these studies, there is still a gap in elucidating the long- and/or short-term dynamics of bacteriophages and bacterial hosts within their natural environment, and thereby in understanding the overall influence of phages within microbial consortia [Labrie et al., 2010; Samson et al., 2013].
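The principle of using spacers to associate hosts with phages can be sketched in a few lines of Python. The sketch below is a simplified illustration and not the procedure of any of the cited studies: it assumes that spacers have already been extracted per host population, that candidate phage contigs are available as plain sequences, and that only exact matches on either strand count as a link. All identifiers and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string containing only A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def link_hosts_to_phages(host_spacers, phage_contigs):
    """Return the set of (host, contig) pairs supported by an exact spacer match.

    host_spacers:  dict mapping host name -> list of spacer sequences
    phage_contigs: dict mapping contig name -> contig sequence
    Both strands of each contig are considered (assumption: exact matches only).
    """
    links = set()
    for host, spacers in host_spacers.items():
        for spacer in spacers:
            rc = reverse_complement(spacer)
            for contig, seq in phage_contigs.items():
                if spacer in seq or rc in seq:
                    links.add((host, contig))
    return links


# Hypothetical toy data: one forward-strand match, one reverse-strand match
# and one contig without any matching spacer.
toy_host_spacers = {
    "host_1": ["ACGTACGGATCC"],
    "host_2": ["TTGACCGGTAAC"],
}
toy_phage_contigs = {
    "phage_X": "GGG" + "ACGTACGGATCC" + "AAA",
    "phage_Y": "CCC" + reverse_complement("TTGACCGGTAAC") + "TTT",
    "phage_Z": "ATATATATATATATAT",
}

if __name__ == "__main__":
    for host, contig in sorted(link_hosts_to_phages(toy_host_spacers, toy_phage_contigs)):
        print(host, "->", contig)
```

In practice spacers are short, so exact or near-exact matching against large sequence collections is usually delegated to a dedicated alignment tool; the nested loops here are kept only for readability.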


Figure 1.5: Mode of action of type II CRISPR-Cas systems. The direct repeats of the “clustered regularly interspaced short palindromic repeats” (CRISPR) locus are separated by short stretches of non-repetitive (unique) DNA called spacers, which are acquired from the invading DNA of viruses or plasmids in a process known as adaptation, during which an additional repeat is also duplicated. The CRISPR locus is transcribed as a long primary pre-CRISPR-RNA (pre-crRNA) transcript, which is processed to produce a collection of short crRNAs (a process referred to as biogenesis of crRNA). Each crRNA contains segments of a repeat and a full spacer and, in conjunction with a set of Cas proteins, forms the core of CRISPR-Cas complexes. These complexes act as a surveillance system and provide immunity against ensuing infections by phages or plasmids encoding DNA complementary to the crRNA. On recognition of a matching target sequence, the plasmid or viral DNA is cleaved in a sequence-specific manner (known as interference). The nucleotide sequence of the spacer must be highly similar to a region of the viral genome or plasmid (known as the protospacer) for the CRISPR-Cas complex to inhibit replication of these foreign genetic elements. In type I and II CRISPR-Cas systems, a conserved sequence motif adjacent to the protospacer, known as the protospacer-adjacent motif (PAM), is needed for spacer acquisition and interference (Adapted from Samson et al. [2013]).


1.4 Eco-Systems Biology

It is important to address the concept of Systems Biology as a prerequisite to Eco-Systems Biology [Zengler, 2009; Zengler and Palsson, 2012]. Systems Biology involves the study of multiple biological components, including but not limited to biomolecules, cells, tissues, organs and entire organisms within a biological system. The field of Systems Biology emerged due to the highly complex and dynamic nature of living systems, which cannot be predicted and/or elucidated by looking at individual components/parts of a given biological system [Shahzad and Loor, 2012]. Consequently, Systems Biology is a highly interdisciplinary field that combines various methodologies, spanning high-throughput molecular measurements, bioinformatic analyses, laboratory experiments and mathematical modelling. Two generalized study designs/approaches are defined under the umbrella of Systems Biology: the “top-down” and the “bottom-up” approach. Top-down systems biology characterizes the biological components of a particular system using large-scale omic datasets, followed by the generation of mathematical models of the system. The generated models may aid in uncovering new insights into the biological system in question [Zengler, 2009; Shahzad and Loor, 2012]. In contrast, bottom-up systems biology begins with detailed models of a specific biological system on the basis of its molecular properties, which are usually followed by targeted measurements. These measurements stem from isolation, cultivation and/or various single-cell techniques [Zengler, 2009]. Although often regarded as isolated, these approaches should not be viewed as separate, but should rather be applied in an integrated manner to elucidate biological systems [Zengler, 2009].

Molecular Eco-Systems Biology (hereafter referred to as Eco-Systems Biology) applies the principles and methodologies of Systems Biology to microbial systems, such as the complex microbial communities described in Section 1.1 [Raes and Bork, 2008; Zengler, 2009; Zengler and Palsson, 2012]. Systematically obtained in situ time- and space-resolved datasets allow the deconvolution of structure-function relationships by identifying key community members and key community functions [Raes and Bork, 2008; Zengler and Palsson, 2012; Muller et al., 2013; Narayanasamy et al., 2015]. Knowledge garnered from such studies offers the potential to discover novel microorganisms and biological functionalities within the framework of Eco-Systems Biology [Albertsen et al., 2013a; Muller et al., 2014a; Roume et al., 2015; Heintz-Buschart et al., 2016; Laczny et al., 2016]. In general, such insights may enable the control of microbial communities through interventions aimed at the improvement/optimization of biomedical treatments and/or biotechnological processes [Muller et al., 2013].

1.4.1 Eco-Systems Biology for the study of phage-host interactions

The application of Eco-Systems Biology can be extended to the study of bacteriophage and host interactions, especially given the possible application of bacteriophages in controlling microbial communities (Section 1.3). More specifically, data derived from microbial community samples in situ provide access to more information than classical culture/isolate-based methods. Accordingly, the current work combines the advantages of three separate avenues to effectively study phage and host interactions and dynamics. First, the ability to perform time-series sampling of microbial communities in situ, such as the LAMPs (Section 1.2), enables the study of host and phage interactions within a natural system on a longitudinal scale. Second, phage-related information can be mined from microbial community-derived data, as highlighted in Section 1.3. Finally, the CRISPR-Cas system serves as a valuable source of information on phage-host interactions (Section 1.3).

The characteristics of microbial communities (Section 1.1) render standard microbiology-based methods (i.e. those originally designed for pure isolate culture systems) ineffective [Muller et al., 2013; Roume et al., 2013b,a]. It is therefore essential to apply specialized, non-culture-based systematic approaches for the study of microbial systems. Eco-Systems Biology is an integrative framework that encompasses a wide array of specialized methods/techniques/analyses including: i) concomitant extraction of biomolecules, ii) systematic high-throughput omic data measurements, iii) integration and analysis of the omic data, iv) experimental validation and ultimately v) the control of microbial systems (Figure 1.2). Accordingly, the following sections describe these methods in detail, focusing primarily on the integration of the different omic data types. Briefly, the described methods enable the extraction of the information necessary for this work. This includes, but is not limited to: i) genome sequences of both host and phage populations, ii) their predicted genes and corresponding functional annotations, iii) associations between bacteriophages and their hosts (i.e. using CRISPR information) as well as iv) transcribed components, such as genes and CRISPR RNA.

1.4.2 Biomolecular extraction

The biomolecular extraction protocol designed by Roume and colleagues allows the sequential isolation of high-quality genomic deoxyribonucleic acid (DNA), ribonucleic acid (RNA), small RNA, proteins and metabolites from a single, undivided sample for subsequent systematic multi-omic measurements (Figure 1.2; step 2 and Figure 1.6) [Roume et al., 2013b,a]. Importantly, this eliminates the need for subsampling the heterogeneous biomass and therefore reduces the noise arising from incongruous omic data in the downstream integration and analysis steps (Figure 1.2; step 3 and Figure 1.6) [Muller et al., 2013; Roume et al., 2013b,a]. The biomolecular fractions obtained with this methodology are subjected to high-throughput measurements, resulting in omic data derived from a single unique sample, which fulfils the premise of downstream integrated omic analysis (Figures 1.2 and 1.6) [Muller et al., 2014a; Roume et al., 2015; Heintz-Buschart et al., 2016].


Figure 1.6: Concomitant extraction of biomolecules from a single unique microbial community sample and their downstream high-throughput measurement techniques (Courtesy of L. Wampach and A. Kaysen).


1.4.3

Multi-omic measurements

Studies within the context of Eco-Systems Biology were made possible mainly through the advent and availability of high-throughput, high-resolution molecular measurements (referred to as omic data), applied to microbial consortia sampled in situ [Raes and Bork, 2008; Zengler, 2009; Zengler and Palsson, 2012; Muller et al., 2013]. Omic data involve the collective analysis (characterization and quantification) of certain features of a family/class of biomolecules, i.e. DNA, RNA, proteins or metabolites (Figures 1.2 and 1.6). The application of omic measurements to microbial communities results in meta-omic data, whereby the prefix “meta” (in the scope of this work) implies measurements/data derived from mixed microbial communities [Muller et al., 2013; Segata et al., 2013]. More specifically, meta-omic datasets (metagenomic, metatranscriptomic, metaproteomic and (meta-)metabolomic) enable high-resolution molecular-level studies of such microbial systems on a much larger scale compared with previous efforts [Muller et al., 2013].

Metagenomics

The concept of DNA sequencing was introduced by Frederick Sanger and colleagues in 1975, when they proposed a chemistry that combines primer-directed DNA synthesis, inhibition/termination of DNA polymerase activity and labelled nucleotides [Sanger and Coulson, 1975; Sanger et al., 1977]. This chemistry was further developed into a more rapid and accurate sequencing method, which yielded the genome sequence of bacteriophage phi-X174, the first DNA genome to be sequenced [Sanger et al., 1977]. Sanger sequencing remained the predominant sequencing method until the emergence of next-generation sequencing (NGS) technologies, which enabled large-scale and deep sequencing of DNA fractions at relatively lower cost [Liu et al., 2012].

Currently, there are multiple NGS technologies/platforms available, each of which employs a specific chemistry to decipher DNA sequences (Figure 1.7) [Blow, 2008; Met; Quail et al., 2012]. NGS technologies can be further divided into “second-generation sequencing” technologies, also called massively parallel sequencing, such as Illumina [Bentley et al., 2008], Roche 454 [Margulies et al., 2005] and SOLiD [McKernan et al., 2009], and the more recent “third-generation sequencing” technologies, such as Pacific Biosciences [Eid et al., 2009] and Oxford Nanopore [Manrao et al., 2012]. The main difference between second- and third-generation sequencing methods is the clonal amplification of DNA molecules to produce DNA colonies (Figure 1.7). This step is absent in third-generation methods, which gives rise to the concept of single-molecule sequencing [Blow, 2008; Eid et al., 2009; Manrao et al., 2012]. It is important to note that the clonal amplification step is necessary for generating the large data volumes (throughput) of second-generation sequencing methods, which cannot be achieved with third-generation methods.

Current NGS platforms are unable to read genome-sized (or long) DNA molecules in one piece because of their limited read lengths. Therefore, the general protocol of genome sequencing first involves a preparation step for the DNA samples, such that they can be loaded onto NGS platforms (Figure 1.7). This is usually achieved by random fragmentation of multiple copies of a genome (for isolate genomic samples) to generate shorter DNA fragments that are readable by the NGS platforms. This overall preparatory procedure prior to sequencing is widely known as a whole genome shotgun (WGS) procedure and results in WGS libraries. More specifically, the term “shotgun” in WGS refers to the aforementioned fragmentation process, which is akin to the quasi-random firing pattern of a shotgun. It is important to note that WGS library preparation protocols vary for different sequencing technologies (Figure 1.7) [Met; Liu et al., 2012]. WGS libraries are processed by an NGS instrument/machine (i.e. sequencer) to yield in silico representations of the biological DNA molecules, known as sequencing reads [Blow, 2008; Met; Quail et al., 2012]. The Illumina sequencing platform is notably the most widely applied sequencing technology due to its ability to generate the highest throughput (i.e. largest number of bases/reads per sequencing run) at relatively low cost [Liu et al., 2012].
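The shotgun principle described above can be illustrated with a minimal sketch. It assumes a toy random "genome" and an idealized, error-free sequencer that reads a fixed number of bases from one end of each fragment. The function name `shotgun_reads` and all parameter values are hypothetical and chosen for illustration only. They do not correspond to any specific library preparation protocol or sequencing platform.

```python
import random


def shotgun_reads(genome, n_copies, mean_fragment_len, read_len, seed=0):
    """Simulate quasi-random fragmentation of several copies of a genome.

    Each copy is broken at independently chosen random positions, and the
    first `read_len` bases of every sufficiently long fragment are returned
    as an idealized (error-free) sequencing read.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_copies):
        # Number of break points per copy, capped by the available positions
        n_breaks = min(len(genome) - 1, max(1, len(genome) // mean_fragment_len))
        breaks = sorted(rng.sample(range(1, len(genome)), n_breaks))
        starts = [0] + breaks
        ends = breaks + [len(genome)]
        for start, end in zip(starts, ends):
            fragment = genome[start:end]
            # Fragments shorter than the read length are discarded
            if len(fragment) >= read_len:
                reads.append(fragment[:read_len])
    return reads


# Toy example: a random 2,000 bp "genome" (assumption, not real data)
toy_rng = random.Random(42)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(2000))
reads = shotgun_reads(toy_genome, n_copies=20, mean_fragment_len=200, read_len=50)

# Average sequencing depth = total sequenced bases / genome length
depth = len(reads) * 50 / len(toy_genome)
```

Because the break points differ between genome copies, the resulting reads start at different positions and overlap one another. This overlap is the property on which downstream assembly of reads relies.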

It is also important to highlight that DNA sequencing may be carried out using a targeted approach, which typically refers to the sequencing of a known DNA locus, selected either for the encoded function or, more often, for its phylogenetic/taxonomic information, using primer-based amplification. In the specific context of microbial communities, high-throughput ribosomal RNA (rRNA) gene amplicon sequencing (usually 16S rRNA gene sequencing) facilitates the preliminary characterization of microbial community composition and structure [Segata et al., 2013]. However, such targeted amplicon sequencing will not be classified as metagenomic (MG) data within the scope of this work. Rather, this work defines MG data to be the result of a WGS sequencing procedure applied to bulk microbial community-derived DNA samples. Beyond targeted sequencing datasets, MG data are arguably the most commonly generated high-throughput datasets for microbial community studies, with 37,239 datasets publicly available on the NCBI sequence read archive (SRA) [Leinonen et al., 2011] as of 24 October 2016. MG data provide information on the community structure (i.e. which microbial community members are present) as well as a prediction of gene functions (i.e. the functional potential) [Muller et al., 2013; Vanwonterghem et al., 2014]. Within the scope of this work, it is important to highlight that metagenomic sequencing entails the indiscriminate sequencing of all DNA molecules within a sample, which also includes viral (or phage) DNA genomes. Hence, MG data have previously been shown to be a large resource that can be mined for viral sequences, far surpassing what could be achieved with classical microbiology methods [Paez-Espino et al., 2016].

Metatranscriptomics

Given that the molecular structure of RNA is analogous to that of DNA, RNA can also be subjected to NGS with additional laboratory processing protocols. It is important to note that NGS platforms are only able to sequence DNA molecules. Therefore, the RNA samples serve as templates for the synthesis of reverse-transcribed DNA (i.e. complementary DNA, cDNA) before undergoing NGS. Similar to DNA samples, RNA samples can also undergo targeted sequencing (as described in Section 1.4.3) or whole transcriptome shotgun (WTS) sequencing, i.e. random shotgun sequencing of bulk RNA samples, equivalent to WGS.
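The relationship between an RNA template and its reverse-transcribed DNA can be made concrete with a minimal sketch. It assumes an RNA sequence written 5' to 3' that contains only the four standard bases, and it returns the first-strand cDNA, also written 5' to 3'. The function name `first_strand_cdna` is hypothetical and introduced here for illustration only.

```python
# Complement table mapping each RNA base to its pairing DNA base
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna):
    """Return the first-strand cDNA (5'->3') of an RNA template (5'->3').

    The cDNA strand is complementary and antiparallel to the template,
    so the template is traversed in reverse while each base is complemented.
    """
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna.upper()))


cdna = first_strand_cdna("AUGC")
```

The sequencer therefore reports DNA bases (T in place of U), and the original transcript sequence is recovered in silico by taking the reverse complement of such a read.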

In the context of this work, WTS data generated from bulk RNA samples derived from microbial communities are considered metatranscriptomic (MT) data. As of 24 October 2016, there are 397 MT datasets available on the NCBI SRA [Leinonen et al., 2011], which is relatively few compared to MG datasets (37,239). In addition, rRNA depletion (often partial depletion) of MT samples enables deeper sequencing of mRNA, thus providing better access to functional readouts from MT data and to other interesting RNA-based components, such as RNA viral genomes. Functional expression can be characterized using MT data, and by extension provides a snapshot of whi