In this paper, we propose a new algorithm for DNA sequence assembly that uses a different strategy from previous methods. Based on preliminary investigations, our method promises to be fast and practical for DNA sequence assembly. Our algorithm takes advantage of several key features of the sequence data. First, the redundancy, or depth of coverage c, is often much larger than 2. Thus, when only pairwise comparisons are considered, information about multiple overlaps of fragments is ignored, which complicates the reconstruction process. Second, the fragments are sequenced with errors but are usually about 98% accurate; therefore, long stretches are very likely to be error-free. In considering the basic data for shotgun sequencing, we have arrived at a very different strategy that takes advantage of these properties. Before we describe our strategy, we sketch the computer science associated with what at first seems to be an entirely different problem: sequencing by hybridization.
Eulerian-based de novo methods have long been widely used and were inspired by the sequencing-by-hybridization approach [16, 28]. These algorithms represent each read by its set of k-mers (shorter subsequences) and construct a de Bruijn graph: a directed graph whose vertices are k-mers, with an edge between two vertices if they share an overlapping subsequence of length (k − 1). Finding an Eulerian path or tour, in which each edge of the de Bruijn graph is visited exactly once, yields the sequence assembly solution. Before computing the Eulerian tour, these approaches use different heuristics to remove from the de Bruijn graph nodes and edges created by sequencing errors and by repeat regions within the genome. Myers presented another graph-oriented approach based on the notion of a bi-directed string graph. A bi-directed string graph associates a direction with both endpoints of an edge, modeling the forward and reverse orientations of sequence reads. The Eulerian tour of such a graph enforces additional constraints that lead to improved accuracy and length of the produced sequence contigs.
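The graph construction and Eulerian traversal described above can be sketched in a few lines. This is a minimal illustration under strong simplifying assumptions (error-free reads, each k-mer kept once without multiplicity, no reverse complements, no error- or repeat-removal heuristics), and the function names are chosen for illustration only:

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer.
    (Real assemblers track k-mer multiplicity; we dedupe for simplicity.)"""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    g = {v: list(ws) for v, ws in graph.items()}
    indeg = defaultdict(int)
    for v, ws in g.items():
        for w in ws:
            indeg[w] += 1
    # Start where out-degree exceeds in-degree, if such a node exists.
    start = next((v for v in g if len(g[v]) > indeg[v]), next(iter(g)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if g.get(v):
            stack.append(g[v].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the assembled sequence along the Eulerian path."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])
```

For example, `assemble(["ATGGC", "GGCGT"], 3)` reconstructs `"ATGGCGT"` from two overlapping reads. The heuristics mentioned above (tip clipping, bubble removal) would operate on `graph` before the traversal.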
Controversially, scientists have begun exploring the use of these short-read technologies for de novo genome sequencing of large eukaryotic organisms, where the genomic sequence is determined solely from the short reads without using a reference genome as a guide — instigating debate similar to when the Human Genome Project strove to use Sanger sequencing to de novo sequence the human genome. In particular, it is still unclear whether or not short reads can successfully and reliably enable sequence assembly algorithms to reconstruct the original genomic sequence of complex organisms from the set of reads alone, as the short read length inherently limits the specificity of the location in the genome the read was sampled from; longer reads are desired because eukaryotic genomes contain repeated sub-sequences that are often longer than a single short read, making disambiguation difficult or impossible. To mitigate this factor, however, pairs of reads separated by a statistically known distance are used to virtually extend the length of the read beyond that of most repeats.
EST sequence assembly is the process of assembling expressed sequence tags (ESTs) of an organism into contigs and then predicting gene functions from them. At the beginning of an EST project, the starting material for construction of the cDNA library is selected. This can be cells, tissues or even whole organisms. From this, the messenger RNAs are isolated. mRNAs are highly unstable, so they are reverse transcribed into a relatively more stable form called complementary DNA, or cDNA. The cDNA must then be amplified to form a cDNA library. This is accomplished by cloning the cDNA into plasmid vectors; the plasmids are amplified by transforming the bacterium E. coli to generate the cDNA library. The cDNA library forms the basis for generating EST sequences. Usually the cDNA is cloned directionally, that is, it is known at which end of the vector the 5 prime and 3 prime ends of the cDNA are located. The cloned sequence can thus be sequenced from both ends simultaneously. The identified nucleotide sequence can be exported to a computer, and the raw data are then processed.
Genomic sequencing is now a semi-industrial process that is being increasingly automated. The amount of finished sequence produced in large centers worldwide more than doubles each year. This effort has required a huge investment in bioinformatics, and new software is under continual development both within these centers and in the wider academic community. High-throughput sequence assembly is a complicated multistep pipeline using many pieces of software, and we as users want to be in a position to use the best set of software tools, even if this causes problems reconciling the various data formats they use. In addition, because more than one tool may be suitable for the same task (e.g., for manually editing sequence assemblies), we also want to offer alternatives within the same framework.
suggesting fewer chimeric sequences. However, the size distribution of the elements in the Morex V2 assembly indicates a large population of overly large full-length elements (Additional file 1: Table S6). In contrast, the size distribution of full-length elements in Morex V1 is narrower and shows two characteristic peaks corresponding to the autonomous and non-autonomous subfamilies (T. Wicker, unpublished results). Manual inspection of 50 randomly selected elements between 9900 and 10,000 bp in length showed that the large sizes of these elements are mainly due to large sequence gaps (i.e., long stretches of N's). In the 50 manually inspected copies, we found 70 sequence gaps in the internal domain and only 5 short gaps in the LTRs. The latter observation is not surprising, as our method to identify full-length copies relied on largely gap-free LTRs. In only three cases was the large size of the element caused by the genuine insertion of additional TEs. Overall, the Morex V2 assembly had more and larger gaps as TE length increased (Additional file 1: Table S6), a pattern that is absent from the Morex V1 assembly. In summary, the representation of repetitive sequence is similar in both assembly versions of the Morex genome. The longer read lengths and k-mer sizes used in the TRITEX pipeline may have resulted in a better representation of short tandem repeats in V2. However, the gap-free assembly of very recently inserted full-length TEs may benefit from prior complexity reduction such as BAC sequencing.
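Sequence gaps of the kind discussed above (runs of N's inside an element) can be located with a short helper. This is a hypothetical sketch, not part of the TRITEX pipeline or the inspection protocol used here:

```python
import re

def find_gaps(sequence, min_len=1):
    """Return (start, length) for each run of N's at least min_len long."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"N+", sequence.upper())
            if len(m.group()) >= min_len]
```

Applied to an element sequence, `find_gaps(seq, min_len=100)` would report only the long internal gaps while ignoring single-base ambiguities.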
DNA microarrays and high-throughput RNA sequencing (RNA-Seq). The latter technique directly reveals the sequence of transcripts and is becoming increasingly popular as a result of continuous improvements in both the sequencing technology and the data analysis software. This growth has been marked by the development of sequencing centers and large consortia focused on specific organisms (the Rice Genome Annotation Project, the 1001 Arabidopsis Genomes Project, and the Maize Genome Sequencing Consortium, to name just a few). These communities work on developing and standardizing protocols to facilitate the aggregation and comparison of various datasets. Current RNA-Seq applications include assembly of the transcriptome, with or without reference genome information, gene discovery and expression analysis, identification of unknown exon junctions and alternative transcripts, measurement of allele-specific expression, and many more [5–8]. In contrast, microarrays can only derive information on targets that are actually represented by the microarray probes, are sensitive to cross-hybridization, and display poor signal resolution and increased variation at low signal intensities [9, 10]. Despite these drawbacks, the results generated on microarray platforms are concordant with those obtained with RNA-Seq [11, 12]. Additionally, thousands of studies performed over the past decades have shown that microarrays reflect the transcriptome composition with high fidelity and that they are a rich source of biologically valuable information. Since their introduction, microarrays have been effectively used in searching for disease markers, alternative splicing, gene function prediction, and identification of transcriptionally active regions of the nuclear, mitochondrial and chloroplast genomes [16–18], among many other applications.
Microarray experiments are still much cheaper than RNA-Seq, not only in terms of the price of consumables and reagents but also in the computational and human resources required for data analysis and storage. The latter are often underestimated when calculating the real costs of high-throughput sequencing experiments. Notably, extracting biological information from RNA-Seq data requires combining computational skills with deep knowledge of the problem of interest, typically through close cooperation between experts in each of those fields. Therefore, sequencing-based experiments may pose a substantial challenge for individual laboratories. Given the small size of the resulting datasets and the relatively easy data analysis, DNA microarrays are still an attractive alternative to RNA-Seq for a variety of studies, e.g., those focused on differential analysis of known genes under the conditions of study, and time-course studies, where a large number of samples must be processed and compared in a repeatable manner. We surveyed the gene expression profiling
To reduce the large search space associated with products that have many components, several metaheuristic approaches have been extensively researched in the literature. Common methods include genetic algorithms (GA), ant colony optimisation (ACO), particle swarm optimisation (PSO), and simulated annealing (SA). These approaches do not guarantee the optimal solution but have been considered successful. In general, they transform information in the graph, combine it with objectives such as minimising assembly direction changes and tool changes, and add constraints such as precedence, to form a multi-criteria objective that is solved to find the optimum. Common challenges ascribed to soft-computing metaheuristic approaches are high computational time, tedious data entry and premature convergence. Many of the works present limited insight into the quality of the results and tend to discuss and conclude how a given approach makes headway in the aforementioned challenging areas.
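As an illustration of how such a metaheuristic is set up for assembly sequence planning, the sketch below applies simulated annealing to a hypothetical five-part product, folding direction changes, tool changes and precedence constraints into one penalised objective. All part data, weights and cooling parameters are invented for illustration and do not come from any of the surveyed works:

```python
import math
import random

# Hypothetical 5-part product: assembly direction and tool per part.
DIRECTIONS = {"A": "+z", "B": "+z", "C": "-x", "D": "+z", "E": "-x"}
TOOLS      = {"A": "t1", "B": "t1", "C": "t2", "D": "t1", "E": "t2"}
PRECEDENCE = [("A", "B"), ("A", "C")]  # A must precede B and C

def cost(seq):
    """Multi-criteria objective: direction changes + tool changes,
    with a heavy penalty for precedence violations."""
    changes = sum(DIRECTIONS[a] != DIRECTIONS[b] for a, b in zip(seq, seq[1:]))
    swaps   = sum(TOOLS[a] != TOOLS[b] for a, b in zip(seq, seq[1:]))
    viol    = sum(seq.index(x) > seq.index(y) for x, y in PRECEDENCE)
    return changes + swaps + 10 * viol

def anneal(parts, temp=5.0, cooling=0.995, steps=5000, seed=0):
    """Simulated annealing over permutations via random pairwise swaps."""
    rng = random.Random(seed)
    cur, best = parts[:], parts[:]
    for _ in range(steps):
        i, j = rng.sample(range(len(cur)), 2)
        cand = cur[:]
        cand[i], cand[j] = cand[j], cand[i]
        delta = cost(cand) - cost(cur)
        # Accept improvements always; worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur = cand
            if cost(cur) < cost(best):
                best = cur[:]
        temp *= cooling
    return best

seq = anneal(list("ABCDE"))
```

The premature-convergence problem mentioned above corresponds here to the cooling schedule freezing the search before a good grouping of same-direction, same-tool parts is found.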
In general, the above DFA approaches are applied at, and limited to, the detailed design phase. As a consequence, these approaches may require design changes after assembly analysis if certain design features are not suitable for assembly operations. This can lead to significant rework, including redesign, reanalysis through modelling and simulation, and even re-prototyping. Such rework clearly increases product development cost and lead time. To address these deficiencies, the objective of this work is to use assembly knowledge from the earliest stages of the development process and thus act on a product concept in the preliminary study phase. This allows the identification of potential issues and negative consequences at an early design stage, so that any design decision leading to significant negative assembly consequences can be eliminated early. Changes are easier to manage in an early stage of development than at the end of the detailed study. Barnes et al. propose an approach that generates assembly sequences in parallel with the design (before the end of a project), but on a product with a very high level of detail. They begin with the structure of parts, defining the assembly sequence from the types of connections between parts.
The costs of assembly processes are determined by assembly plans. Assembly sequence planning, an important part of assembly process planning, plays an essential role in the manufacturing industry. Given a product-assembly model, assembly sequence planning (ASP) determines the sequence of component installation that shortens assembly time or saves assembly costs. ASP is regarded as a large-scale, highly constrained combinatorial optimization problem because it is nearly impossible to generate and evaluate all assembly sequences to obtain the optimal sequence, either with human interaction or through computer programs.
If MLV Gag molecules truly interact only at the plasma membrane, then any MLV Gag protein with a mutant L domain, which would be targeted to the membrane but otherwise be budding defective, should be capable of being rescued into particles. Since the L domain of MLV has not been mapped, we decided to examine a derivative of BgM that contains a deletion of the RSV L domain. For this, we made use of the RSV Gag mutant T10C.PR2 (Fig. 5A), which lacks amino acids 122 to 336, including the entire L domain, but retains the M and I domains (34). As shown in Fig. 5D (lanes 3), this molecule is budding incompetent but is rescuable by full-length RSV Gag (Fig. 5D, lanes 5). This well-characterized RSV L-domain mutation was introduced into BgM to create chimera T10M.PR2 (lanes 4). When expressed by itself, this recombinant, like T10C.PR2, was unable to bud from the cell (Fig. 5D, lanes 4). The presence of the strong membrane-binding sequence of Src and the high proteolytic activity of the PR1 form (data not shown) suggest that T10M is targeted to the plasma membrane. When coexpressed with an assembly-competent molecule (M.M1.PR2), T10M.PR2 was readily rescued into particles (Fig. 5D, lanes 6), which further indicates that it is not severely disrupted by the T10 deletion. This evidence supports the idea that interactions among MLV Gag proteins occur after the molecules are targeted to, and concentrated on, the plasma membrane (see Discussion).
can be found in the literature, with each researcher choosing different areas to focus on and differing ontological structures to meet the requirements of their case. Lohse presented the ONTOMAS framework to reduce assembly system design effort, using domain ontologies and implementing a function-behavior-structure paradigm to capture the characteristics of modular assembly system equipment. A similar abstraction approach was proposed by Hui et al., who used semantic objects to retrieve information from documents of various formats, allowing domain-specific tools to become better integrated by inference. Lanz used feature-based modelling to capture detailed product knowledge, categorizing features into geometric and non-geometric, to provide knowledge for a holonic manufacturing system. Raza and Harrison described a collaborative production line planning approach supported by knowledge management theory. A service-oriented architecture, supported by semantic web services, was proposed that allowed automatic discovery and execution of assembly processes by modelling and mapping assembly processes and systems. An influential architecture for integrating the PPR domains is the Virtual Factory Framework (VFF), a data model that links and stores knowledge to support engineering concurrency in the resource domain, but it does not have the granularity to model system control logic. More recently, knowledge-based mapping has been used to support the selection of function blocks for manufacturing resource components, and Ramis et al. showed how product requirements could be translated directly through to dynamically changing programmable controller logic. Chen et al. extended EAST-ADL (a language developed to model automotive electronic systems) to model production systems using MetaEdit+. Mapping within and between the concepts of Equipment, Process and Product was achieved through the EAST-ADL feature links.
Abstract. Peptides that self-assemble into nanostructures are of tremendous interest for biological, medical, photonic and nanotechnological applications. The enormous sequence space that is available from 20 amino acids likely harbours many interesting candidates, but it is currently not possible to predict supramolecular behaviour from sequence alone. Here, we demonstrate computational tools to screen for aqueous self-assembly propensity in all 8,000 possible tripeptides, and evaluate these by comparison with known examples. We applied filters to select for candidates that simultaneously optimize the apparently contradictory requirements of aggregation propensity and hydrophilicity, which resulted in a set of design rules for self-assembling sequences. A number of peptides were subsequently synthesized and characterised, including the first reported tripeptides that are able to form a hydrogel at neutral pH. These tools, which enable the peptide sequence space to be searched for supramolecular properties, enable minimalistic peptide nanotechnology to deliver on its promise.
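The structure of such a brute-force screen can be outlined in a few lines. The sketch below enumerates all 8,000 tripeptides and filters on mean Kyte-Doolittle hydropathy; this scale is a crude stand-in for the coarse-grained aggregation-propensity score used in the actual screen, and the filter window is an arbitrary placeholder for the design rules:

```python
from itertools import product

# Kyte-Doolittle hydropathy scale (standard published values).
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
      "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
      "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
      "R": -4.5}

def screen(lo=-1.0, hi=1.5):
    """Enumerate all 20^3 = 8,000 tripeptides; keep those whose mean
    hydropathy falls in a window balancing aggregation propensity
    (not too hydrophilic) against solubility (not too hydrophobic)."""
    hits = []
    for tri in product(KD, repeat=3):
        score = sum(KD[a] for a in tri) / 3
        if lo <= score <= hi:
            hits.append(("".join(tri), round(score, 2)))
    return hits

candidates = screen()
```

Exhaustive enumeration is feasible here precisely because the tripeptide space is small; the same filtering pattern would need sampling or heuristics for longer sequences.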
According to the above approaches, a lack of associativity in PLM systems was highlighted by Tremblay et al. (2006), where only the 'parent–child' (i.e. 'is part of') relationship exists. For a large-scale company, the management of relative positions of parts using matrices is implemented in PDM systems in order to be more closely related to the geometric models embedded in CAD systems, and to facilitate change management and part positioning. During the last decade, Weber et al. (2003) have proposed an advanced PDM system based on a property-driven development/design approach, introducing the handling of predicted engineering characteristics (i.e. structure, shape, and material) and properties (i.e. the product's behaviour) of the product, with their interdependencies, in a separate manner. However, information related to product relationships and assembly process engineering is not effectively treated in their proposal. More recently, PLM systems have moved towards Web-based and Web-service technologies in order to facilitate information exchange and access in distributed and extended enterprises (Huang et al. 1999, Liu and Xu 2001, Georgiev et al. 2007). An additional effort towards ontology and the semantic Web can also be found (Matsokis and Kiritsis 2010). According to these applications and approaches, a lack of support for associativity among product models using product relationships still exists and is a barrier to effective and integrated lifecycle-oriented design (Tremblay et al. 2006, Sy and Mascle 2011).
S5 Fig. Comparison of GC content of the Asian seabass genome assembly (v2) with a few selected fish genomes (A), with representatives from the different classes of vertebrates (B), and comparison of GC content with genome size of selected fishes (C). The GC content of the genomes of interest was calculated using a 20 kb sliding window (BedTools utilities). In addition to Lates calcarifer, the genomes analyzed included (A) six teleosts (Danio rerio, Gadus morhua, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis) or (B) six vertebrates (Anolis carolinensis, Callorhinchus milii, Gallus gallus, Homo sapiens, Petromyzon marinus, and Xenopus tropicalis). Sliding windows with more than 25% Ns (gaps) were discarded, and the proportion of sliding windows with a given GC content (%) was calculated and plotted. The script used to run BedTools and perform downstream processing is available at https://github.com/ramadatta/Scripts/blob/master/Average_GC_Content_Analysis/knowGC-contentrun1.sh. (C) Genome size of selected fish genomes compared with their average GC content. BP: Boleophthalmus pectinirostris; DR: Danio rerio; GM: Gadus morhua; GA: Gasterosteus aculeatus; LC: Lates calcarifer; NB: Neolamprologus brichardi; OL: Oryzias latipes; ON: Oreochromis niloticus; TR: Takifugu rubripes; TN: Tetraodon nigroviridis. (TIF)
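The windowed GC calculation described in the legend (20 kb windows, windows with more than 25% Ns discarded) can be approximated in a few lines of Python. This is an illustrative sketch using non-overlapping windows, not the linked BedTools-based script:

```python
def gc_windows(seq, window=20_000, max_n_frac=0.25):
    """Per-window GC% over non-overlapping windows of a genome sequence,
    skipping windows with more than max_n_frac ambiguous bases (N's)."""
    out = []
    for i in range(0, len(seq) - window + 1, window):
        w = seq[i:i + window].upper()
        if w.count("N") / window > max_n_frac:
            continue  # discard gap-rich windows, as in the S5 Fig protocol
        acgt = sum(w.count(b) for b in "ACGT")
        if acgt:
            out.append(100.0 * (w.count("G") + w.count("C")) / acgt)
    return out
```

A histogram of the returned values gives the per-window GC-content distribution plotted in panels (A) and (B).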
tation may have important applications in drug delivery, e.g. controlled targeting in the presence of a therapeutic nucleic acid sequence. In addition, sensing applications can be envisioned for systems that, like the FRET pair shown herein, can modulate their spectral properties in a programmable way, allowing for in vitro and in vivo monitoring of unshielding. Finally, the flexibility in design and synthesis offered by nucleic acid based materials, combined with the opportunity to tailor polymers and ligands for specific biomedical tasks, suggests materials of this type may prove useful in personalized diagnostics and patient-group stratified therapeutics.
The paper is structured as follows: We first present an estimate of how many sequences might benefit from our refinement approach in five pharmaceutical model species. Next, we validate the general idea of refining poorly annotated protein sequences by aligning the known protein sequences from human to de novo assemblies from three tissue-specific transcriptomes (brain, liver and kidney) of these species. For this purpose, we use the 20,350 manually reviewed human protein sequences in UniProtKB/Swiss-Prot (hereafter referred to as “known human protein sequences”) as reference sequences. The Swiss-Prot subset of the UniProtKB database is probably the most comprehensive resource for curated protein sequences. The number of human entries in this database has been quite stable for almost a decade, indicating that most human proteins are known. We generalise the approach used during validation of the general idea with an automated sequence refinement workflow implemented in the a&o-tool and show an example application. For our analyses, we used both publicly available data (mouse, rat, dog, pig and human) and newly generated paired-end RNA-Seq data (cynomolgus monkey).