The search for new thermodynamically stable materials (those favoured to form during synthesis, when kinetic factors are excluded) using CSP can take one of many approaches (14), but all involve a search for the lowest energy minimum in a high-dimensional configuration space. The configuration space for a periodic structure with N atoms per unit cell has dimension 3N+3, taking into consideration the rotational symmetries and unit-cell degrees of freedom, whilst the number of local minima in the space scales exponentially with N (15). Ideally, all low-lying minima would be sampled during CSP since metastable phases may be synthesised experimentally, or indeed be thermodynamically stable under different conditions; for example, graphite is the most stable allotrope of carbon under ambient conditions, but diamond can be easily synthesised under high pressure. Particularly popular approaches to CSP include the use of evolutionary algorithms to ‘breed’ new structures (15) and particle swarm optimisation (16–18).
Protein tertiary structure prediction has been an important scientific problem for several decades, especially in bioinformatics and computational biology (Eisenhaber et al., 1995). Although more and more native structures are deposited in the Protein Data Bank (PDB), the gap between sequenced proteins and solved native structures is still widening due to the exponential increase in protein sequences produced by large-scale genome and transcriptome sequencing. It is estimated that <1% of protein sequences have native structures in the PDB (Rigden et al., 2009). Therefore, accurate computational methods for protein tertiary structure prediction that are much cheaper and faster than experimental structure determination techniques are needed to reduce this large sequence-structure gap. Furthermore, computational structure prediction methods are important for obtaining the structures of membrane proteins, whose structures are hard to determine by experimental techniques such as X-ray crystallography (Yonath et al., 2011).
Because of the complexity of the problem, the long time required to analyse all possible conformations, and the fact that even for a small protein molecule the high dimensionality of the search space makes the problem intractable, only a tiny portion of protein sequences have experimentally solved three-dimensional structures. This fact has motivated further research in computational protein structure prediction methods. Different computational approaches for finding the three-dimensional structure have been proposed; the algorithms based on these strategies search for structures in a huge space of possible solutions, and can obtain several structures very close to the native one. These computational strategies can be classified into three categories: (a) ab initio, (b) homology, and (c) threading. Homology and threading methods use information from known proteins to find a solution; in contrast, ab initio methods use only the amino acid sequence, without additional structural information. Anfinsen (Nobel Prize in Chemistry, 1972) showed that, in principle, only ab initio methods can solve the protein folding problem (PFP). Ab initio is an interesting strategy for the following reasons: (a) many proteins have no homology with proteins whose native structure is known; (b) the other strategies give no information about why a protein adopts a certain structure; and (c) some proteins that show high resemblance to other proteins nevertheless adopt completely different structures. The basis of ab initio prediction lies in physical concepts expressed through energy functions, so the problem can be modelled as an optimization problem. As a result, only predictions made ab initio can be fully reliable. The algorithm proposed in this work belongs to the ab initio strategy.
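As a toy illustration of treating ab initio prediction as an optimization problem, the sketch below minimises a made-up multi-minimum "energy" over a vector of torsion angles using simulated annealing. The energy function, cooling schedule, parameters, and names are all illustrative assumptions, not the energy model or search algorithm used in this work.

```python
import math
import random

def toy_energy(angles):
    # Toy "energy function": not a physical force field, just an
    # illustration of a rugged, multi-minimum landscape over torsion angles.
    return sum(math.cos(3 * a) + 0.5 * math.cos(a - 1.0) for a in angles)

def anneal(n_angles=6, steps=20000, t0=2.0, seed=0):
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n_angles)]
    e = toy_energy(angles)
    best = (e, list(angles))
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-3      # linear cooling schedule
        i = rng.randrange(n_angles)
        old = angles[i]
        angles[i] += rng.gauss(0.0, 0.3)          # perturb one torsion angle
        e_new = toy_energy(angles)
        if e_new < e or rng.random() < math.exp((e - e_new) / t):
            e = e_new                             # accept move (Metropolis)
            if e < best[0]:
                best = (e, list(angles))
        else:
            angles[i] = old                       # reject move
    return best

e_min, conf = anneal()
print(round(e_min, 2))
```

The point of the sketch is only that, once an energy function is fixed, "prediction" reduces to a global search over conformational degrees of freedom; any real ab initio method replaces both the energy and the search with far more sophisticated choices.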
of unique protein folds exist in nature, and structure prediction of a target sequence can be performed by consulting a database of known folds and determining which fold model best fits the sequence. Both homology modeling and threading rely on the existence of known structures, and the disadvantage of such approaches is that accurate prediction requires proteins of similar structure to have already been solved. Another approach, the ab initio techniques, or prediction from first principles, bases structure prediction on known biochemical and biophysical facts about proteins. In general these are computationally very expensive methods. Machine learning methods, such as neural network and nearest-neighbor techniques, utilize a localized prediction methodology, in the sense that a window, typically of fewer than 20 amino acids, is presented to the prediction system with the aim of predicting secondary structure. However, local information accounts for only approximately 65% of secondary structure formation. Therefore, prediction can potentially be improved by incorporating a more global prediction scheme. Secondary structure prediction methods often employ neural networks (NNs), SVMs, and hidden Markov models (HMMs) [16, 17]. Neural networks and SVMs utilize an encoding scheme to represent the amino acid residues by numerical vectors. In HMM methods, on the other hand, hidden states generate segments of amino acids that correspond to the non-overlapping secondary structure segments. There are two types of protein secondary structure prediction algorithms. A single-sequence algorithm does not use information about other similar proteins; such an algorithm should be suitable for a nonhomologous sequence with no sequence similarity to any other protein sequence. Algorithms of the other type explicitly use sequences of homologous proteins, which often have similar structures. The accuracy (sensitivity) of the best current single-sequence prediction methods is below 70%.
The prediction accuracy of the best prediction methods that employ information from multiple alignments is close to 82.0%.
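The windowed encoding scheme mentioned above can be sketched as follows, assuming a simple orthogonal (one-hot) encoding and a 13-residue window; the window size, padding symbol, and function names are illustrative assumptions, not the scheme of any particular predictor.

```python
# Sliding-window one-hot encoding for secondary structure classifiers:
# each residue becomes a one-hot vector, and the window around a central
# residue is concatenated into a single input vector for the NN/SVM.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    v = [0] * (len(AMINO_ACIDS) + 1)     # extra slot for padding/unknown
    if residue in AMINO_ACIDS:
        v[AMINO_ACIDS.index(residue)] = 1
    else:
        v[-1] = 1                        # window position past sequence end
    return v

def window_features(sequence, center, w=13):
    half = w // 2
    feats = []
    for i in range(center - half, center + half + 1):
        res = sequence[i] if 0 <= i < len(sequence) else "-"
        feats.extend(one_hot(res))
    return feats

x = window_features("MKTAYIAKQRQISFVKSHFSRQ", center=5, w=13)
print(len(x))   # 13 * 21 = 273 features per central residue
```

The classifier then predicts the secondary structure class of the central residue only, which is exactly why such schemes capture local but not global structural information.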
probabilistic matching between sequence profiles generated from PSI-BLAST (15) for query and template sequences and between structural features of a template and those predicted by SPINE X (16–18) for a query sequence. Predicted structural features include secondary structure (17), backbone torsion angles (16), and residue solvent accessibility (18). For binding affinity prediction, we extracted a knowledge-based energy function, DRNA, from protein-RNA complex structures (19) based on a distance-scaled finite ideal-gas reference (DFIRE) state (20). The DFIRE reference state was found to be one of the best reference states for deriving knowledge-based energy functions for folding and binding studies (21, 22). While many template-based structure prediction methods and knowledge-based energy functions for protein-RNA interactions exist, the coupling between fold recognition by SPARKS X and binding affinity prediction by DRNA in SPOT-Seq-RNA provides the first dedicated high-resolution function prediction for RBPs.
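For orientation, the DFIRE-style pair energy is usually written in the following form (quoted from memory of the DFIRE literature; the scaling factor $\eta$ and the exponent $\alpha \approx 1.61$ are assumptions to be checked against ref. 20):

```latex
\bar{u}(i,j,r) =
\begin{cases}
-\eta RT \ln \dfrac{N_{\mathrm{obs}}(i,j,r)}
{\left(\frac{r}{r_{\mathrm{cut}}}\right)^{\alpha}
 \frac{\Delta r}{\Delta r_{\mathrm{cut}}}\,
 N_{\mathrm{obs}}(i,j,r_{\mathrm{cut}})}, & r < r_{\mathrm{cut}},\\[4pt]
0, & r \ge r_{\mathrm{cut}},
\end{cases}
```

where $N_{\mathrm{obs}}(i,j,r)$ is the number of pairs of atom types $i,j$ observed in the distance bin at $r$ in the structure database, and the denominator is the expected count under the distance-scaled finite ideal-gas reference state.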
Contact Map (CM) prediction is a bioinformatics (and specifically a protein structure prediction) classification task that is an ideal test case for a big data challenge for several reasons. As the next paragraphs will detail, CM data sets easily reach tens of millions of instances, hundreds (if not thousands) of attributes, and an extremely high class imbalance. In this section we describe in detail the steps for the creation of the data set used to train the CM prediction method of .
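As a sketch of how CM instances arise, assume the common definition of a contact as a residue pair whose representative atoms lie within 8 Å; the coordinates and threshold below are illustrative, not the exact protocol of the data set described here.

```python
import math

def contact_map(coords, threshold=8.0):
    # Binary contact map: 1 if two residues' representative atoms
    # (e.g. C-beta) are closer than the threshold, 0 otherwise.
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy coordinates for 4 residues (units: Angstroms)
coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (20, 0, 0)]
cm = contact_map(coords)
contacts = sum(sum(row) for row in cm) // 2
print(contacts)   # 3 of the 6 residue pairs are contacts here
```

At full protein scale, contacts are a small fraction of all residue pairs, which is the source of the extreme class imbalance mentioned above; the number of pairs also grows quadratically with sequence length, which is why the instance counts reach tens of millions.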
In the literature, there are some approaches for predicting H of organic compounds in water directly from chemical structure. Additionally, a number of indirect approaches predict H from vapor-liquid equilibrium data, including activity coefficients; however, their performance in predicting H has not been thoroughly assessed [5,6]. Consequently, in this paper we focus on those approaches which can predict H directly. There are two main types of correlation for prediction
collapse, there has been comparatively little literature on the comparison of prediction criteria. Tables 2a and 2b present a quantitative evaluation of these existing methods using available experimental data in the literature. In these tables, the application of the collapse criteria reviewed here is listed for several soils of different depositional histories. The tables include soils encountered by the authors as well as data reported by others in the literature. It was noted that in most cases no single criterion accurately predicted the collapsibility of a particular soil. For example, for the criterion of Denisov, where the coefficient of susceptibility to soil collapse given by the corresponding expression should be less than 1, only 5 soils out of the 16 reported were identified as collapsible. Furthermore:
Molecular Evolutionary Genetics Analysis (MEGA) software is developed for the comparative analysis of DNA and protein sequences, aimed at inferring the molecular evolutionary patterns of genes, genomes, and species over time (Kumar et al., 1994; Tamura et al., 2011). The phylogenetic analysis of the cloned AMT1 will be helpful in finding its relationship with other genes of the AMT1 family. A protein structure can be predicted from the amino acid sequence of the cloned AMT1 by structure prediction tools. The 3D structure of an unknown protein can be predicted using an experimentally determined protein template with good homology to the target protein. Comparative modelling is the most reliable and accurate protein structure prediction method (Baker and Sali, 2001). Protein structure prediction helps to reveal the biological function and mechanism of action of an unknown protein (Khan et al., 2016). The RaptorX structure prediction server (Källberg et al., 2012; Peng and Xu, 2011) helps in predicting the 3D structure of protein sequences when homologs are lacking in the Protein Data Bank (PDB). Given an input sequence, RaptorX predicts its secondary and tertiary structures, contacts, solvent accessibility, disordered regions, and binding sites. RaptorX also assigns confidence scores to indicate the quality of a predicted 3D model: a P-value for the relative global quality, GDT (global distance test) and uGDT (un-normalized GDT) for the absolute global quality, and a modeling error for each residue.
constructed homology models of Varicella Zoster virus thymidine kinase (VZV TK) based on the herpes simplex virus type 1 thymidine kinase (HSV-1 TK) structure as template. Acyclovir and ganciclovir were docked into the constructed model to investigate the predictivity of the model as well as the characteristics of binding with other substrates. It was found that there are slight differences in the way VZV TK binds the substrates with respect to HSV-1 TK. Missing loops in VZV TK were modeled using the loop search routine of SYBYL 6.8. The study suggested that these differences could be exploited for future ligand design in order to obtain more selective drugs. Li et al. built homology models for a glycogen
Breast cancer is the most common non-skin malignancy affecting women, with approximately 1.67 million cases diagnosed annually worldwide (Ferlay et al., 2013). If an individual's risk of breast cancer could be predicted, then screening, prevention, and treatment strategies could be targeted toward those women to maximize survival benefit and minimize harm. Risk prediction models are important tools to improve breast cancer care by leveraging multi-dimensional electronic health data. Traditional breast cancer risk prediction models use demographic risk factors to estimate breast cancer risk, but they demonstrate only limited discriminatory power. In clinical practice, mammography is the most common breast cancer screening test, and the only imaging modality supported by randomized trials demonstrating reduction in mortality rate. However, its effectiveness is not universally accepted (Freedman et al., 2004). Recent advances in genome-wide association studies (GWAS) have revitalized the quest for genetic variants (single-nucleotide polymorphisms, SNPs) in risk prediction. However, the optimism of these studies has been tempered by disappointment and caution (Gail, 2008, 2009; Wacholder et al., 2010).
number of nucleotides present in the structure. It is very difficult to select the best structure among all possible kissing-pair formations, so prediction is instead based on similarity. The idea is to find the sequence for a kissing pair; this dot-file sequence is then searched against the PDB data bank, which contains more than 13,600 sequences. If the dot file matches a PDB sequence 100%, its corresponding nucleotide sequence is checked. The sequence that gives the maximum similarity in nucleotides is selected as the most appropriate sequence. The algorithm Similarity therefore calculates similarity using the concept of dynamic programming.
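The scoring scheme of the Similarity algorithm is not specified here, so the sketch below uses a standard global-alignment dynamic program with hypothetical scores (match +1, mismatch -1, gap -1) to illustrate the idea of dynamic-programming sequence similarity.

```python
def similarity(a, b, match=1, mismatch=-1, gap=-1):
    # dp[i][j] = best alignment score of a[:i] against b[:j]
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                     # a[:i] aligned to gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap                     # b[:j] aligned to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[n][m]

print(similarity("GACUG", "GACG"))   # best global alignment score: 3
```

Ranking the PDB hits by such a score and keeping the maximum-similarity sequence reproduces the selection step described above.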
The former case is a classical example of domain fusion without supporting evidence. We will focus on the latter case, whose annotation history can be traced. It is encoded by gene MAC_00341, which is predicted to contain two domains, the Nup75 domain at positions 244–898 and the aconitase domain at positions 900–1899: the linker sequence at positions 878–920 encodes the C-terminal region of nucleoporin Nup75 – Figure S5 in . There are no indications from any expression or short-read data that an aconitase domain follows – see also: Data Supplement 06 in . Unfortunately, this mis-annotation has already propagated into other database entries since its original release in May 2010, in particular actual Nup75 homologs in other fungi, with GI numbers (date submitted): 531865436 (November 2012), 572277876 (December 2013), 597570643 (March 2014), 632915374 (April 2014), which do not appear to be homologous to aconitases, and yet they are characterized precisely as such in their description lines. While Pfam searches do not admit this description, the fact remains that the original entry is presented in domain architecture charts as a rare instance of the two domains joined into a single fusion protein. These cases should not only be treated differently, deploying a number of community criteria to be agreed on, but literally blacklisted in automated function prediction (AFP) efforts. Thus, examining phylogenetic distributions of genes, proteins or protein families can also be expanded to encompass phylogenetic and genomic patterns to enhance the quality of annotation.
The New Product Development (NPD) environment is one in which forecasting resource demand is particularly challenging (Anderson Jr and Joglekar, 2005; Loch and Terwiesch, 2007; Loch and Terwiesch, 1998). In most environments the goal of planning is to reduce uncertainty about events and their outcomes. Inhibiting uncertainty in NPD, however, narrows the potential for innovation, defeating the objective of developing something new. Nevertheless, not everything is uncertain, and assumptions can be applied to the main types of activities that will be required and their likely outcomes (Kerzner, 2006). The problem of prediction is a complex one: multiple activities with multiple potential outcomes dictate the activities that follow, and multiple factors can affect the likelihood of each outcome. Irrespective of the sophistication of the planning tools in which the resource data is packaged, using an estimation-based approach to generate resource forecasts results in a number of issues (Hird, 2013):
For evaluation, the performance measurement of a model usually depends on the learning process, the techniques, and the type of data. Numerous performance measures have been used in previous research: accuracy, sensitivity, specificity, Peirce skill score (PSS), Heidke skill score (HSS), AUC/ROC, precision, recall, the kappa statistic, the confusion matrix, mean square error (MSE), Matthews correlation coefficient (MCC), and more. In this study we focus on accuracy because it is the most general measure and the one most researchers use. Performance evaluation is used mostly to justify a model when improved results are achieved after a new strategy is applied, and to compare several models. However, in the context of diabetes, prediction accuracy is needed not only when the model is well trained: the model must also be able to handle big data or EHR with consistent accuracy, reliability, and optimized computational time.
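Several of the listed measures can be derived from a single binary confusion matrix; the sketch below (with made-up counts) shows accuracy alongside sensitivity, specificity, precision, and MCC.

```python
import math

def metrics(tp, fp, fn, tn):
    # Standard binary-classification measures from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)              # recall / true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    precision = tp / (tp + fp)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return accuracy, sensitivity, specificity, precision, mcc

# Hypothetical counts for a diabetes classifier on a test set of 200 records
acc, sen, spe, pre, mcc = metrics(tp=80, fp=10, fn=20, tn=90)
print(round(acc, 3), round(sen, 3), round(spe, 3))   # 0.85 0.8 0.9
```

The example also shows why accuracy alone can mislead when classes are imbalanced: the same accuracy can hide very different sensitivity/specificity trade-offs, which is one reason the other measures in the list exist.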
structure prediction largely depends upon the information available in the amino acid sequence. Evolutionary algorithms such as the simple genetic algorithm (GA), messy GA, and fast messy GA have addressed this problem. The Support Vector Machine (SVM) represents a newer approach to supervised pattern classification that has been successfully applied to a wide range of pattern recognition problems, including object recognition, speaker identification, and gene function prediction with microarray expression profiles. In these cases, the performance of SVMs either matches or is considerably higher than that of traditional machine learning approaches, including neural networks. However, SVMs are still black-box models. The ANN is a good technique for protein structure prediction that relies on the sound theory of the back-propagation algorithm. Protein secondary structure prediction has been satisfactorily performed by machine learning techniques such as artificial neural networks and support vector machines. Most secondary structure prediction programs target alpha-helix and beta-sheet structures and summarize all other structures within the random-coil pseudo-category. For the classification, the ANN is employed as a binary classifier.
16 S rRNA base-pairs (921-922)·(1395-1396) and (923-925)·(1391-1393), which are part of region 28, are unstable and an alternate arrangement, (921-923)·(1532-1534), is detected by psoralen photochemical crosslinking. Site-directed mutagenesis has been used to investigate whether changes in base-paired region 28 or the alternate secondary structure is responsible for the inactivity of the subunit. 30 S subunits with the substitution C1533A or with deletion of nucleotides 1534 to 1542 can still be inactivated like the wild-type 30 S subunit. On the other hand, 30 S subunits that contain sequence changes in the 920 to 926 region show moderate to severe decreases in tRNA binding even under activating conditions. When 30 S subunits containing these mutations were subjected to chemical probing, they failed to show the normal hyper-reactivity of nucleotide G926 and, instead, reactivity was shifted to G925 or to G928, and G929. Two mutations in the 920 region result in structures in which A1394 is base-paired rather than being unpaired as normal; deletion but not substitution of A1394 resulted in loss of tRNA binding activity and depression of the reactivity of G926. Mutations were made to insert or delete a nucleotide at position 920. The deletion mutant but not the insertion mutant has decreased tRNA
When evaluating these penalty methods, it is important to consider whether they correctly bias the search strategy to feasible regions. The GAs discussed in Section 3 that use a penalty method apply a fixed constant penalty term C per violation. This policy can cause problems if the second penalty method is applied without the extension of Patton et al. . For fixed values of C it is possible to construct examples where the structure with optimal energy under the penalty method does not correspond to the optimal energy for the HP model. It is also important to consider the efficacy of the penalty methods, to understand how well they facilitate optimization. For example, we believe that the extended formulation proposed by Patton et al. may lead to a less effective search than other methods. When the hydrophobic amino acids are prevented from contributing to the objective function because they overlap, the fitness landscape may have large flat regions, which can make the optimization problem more difficult. These considerations recommend the use of a fixed penalty approach that is adapted based on the number of hydrophobics available in the protein sequence,
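The fixed-penalty idea can be sketched on the 2D HP lattice model as follows; the move encoding, penalty constant C, and function names are illustrative assumptions, not the exact formulation of the GAs discussed in Section 3.

```python
# 2D HP lattice model with a constant penalty C per lattice-site overlap.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold(moves):
    # Decode a move string into lattice coordinates for each residue.
    pos = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = pos[-1]
        pos.append((x + dx, y + dy))
    return pos

def penalized_energy(seq, moves, C=2.0):
    pos = fold(moves)
    overlaps = len(pos) - len(set(pos))       # constraint violations
    contacts = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):      # non-adjacent in the chain
            if seq[i] == seq[j] == "H":
                dx = abs(pos[i][0] - pos[j][0])
                dy = abs(pos[i][1] - pos[j][1])
                if dx + dy == 1:
                    contacts += 1             # topological H-H contact
    return -contacts + C * overlaps           # lower is better

# A self-avoiding fold: one H-H contact, no overlaps
print(penalized_energy("HPHH", "RUL"))   # -1.0
```

With a fixed C, an infeasible conformation with many H-H contacts can still score below a feasible one, which is exactly the failure mode the text describes; adapting C to the number of hydrophobics in the sequence bounds the gain any infeasible structure can achieve.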
Prevalent approaches to software reliability modeling are black-box based, i.e., the software system is considered as a whole and only its interactions with the outside world are modeled, without looking into its internal structure. However, with the advancement and widespread use of object-oriented systems design and development, the use of component-based software development is on the rise. Software systems are developed in a heterogeneous fashion (multiple teams in different environments), and hence it may be inappropriate to model the overall failure process of such systems using only one of the several software reliability growth models. In this paper we outline the constituents of structural models. We then present an exhaustive analysis of the classes of methods where the architecture of the application is modeled either as a discrete-time Markov chain (DTMC) or a continuous-time Markov chain (CTMC), and illustrate these methods using examples.
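As an illustration of the DTMC class of structural models (in the spirit of Cheung's composite model; the three-component architecture, transition probabilities, and per-visit reliabilities below are invented for the example), system reliability can be computed as the probability of reaching the exit state without any component failure.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small linear system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# P[i][j]: transfer-of-control probabilities among components 0..2;
# exit_p[i]: probability of successful termination from component i;
# R[i]: per-visit reliability of component i. All values are made up.
P = [[0.0, 0.6, 0.4], [0.0, 0.0, 0.7], [0.0, 0.0, 0.0]]
exit_p = [0.0, 0.3, 1.0]
R = [0.99, 0.98, 0.97]

# x[i] = Pr(reach exit without failure | control is in i):
# x = diag(R) (P x + exit), i.e. (I - diag(R) P) x = diag(R) exit.
n = 3
A = [[(1.0 if i == j else 0.0) - R[i] * P[i][j] for j in range(n)]
     for i in range(n)]
b = [R[i] * exit_p[i] for i in range(n)]
reliability = solve(A, b)[0]     # system starts in component 0
print(round(reliability, 4))     # approx 0.954
```

Scaling each transition by the reliability of the component being left is what couples the architecture (the DTMC) with the per-component failure behaviour, which is the defining feature of this class of structural models.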
In the previous chapters of this thesis we have seen that, more often than not, the data sets we encounter are non-stationary in nature. We have also seen that in many important application areas, e.g. time series forecasting in Chapter 4, it is important to capture the (temporal) dependence structure between observations adequately, otherwise future predictions may be unreliable. In this chapter, we turn our attention to a non-parametric framework in which we model such non-stationary time series. Specifically, we introduce wavelets (Section 5.1) and review the literature surrounding their application within locally stationary time series modelling (Section 5.2). Finally, in Section 5.3 we review the literature surrounding detecting changepoints using the model described in Section 5.2. These ideas will be used in Chapter 6 for proposing a new method for detecting changes in variance, and in Chapter 7 we extend this to detecting changes in autocovariance.