Similar literature
20 similar documents found.
1.
Alzheimer's disease (AD) is the most common form of dementia and leads to irreversible neurodegenerative damage of the brain. However, current diagnostic tools have poor sensitivity, especially for the early stages of AD, and do not allow for diagnosis until AD has led to irreversible brain damage. Therefore, it is crucial that AD is detected as early as possible. Gathering many AD and non-AD labeled samples is hard, laborious, and time-consuming, whereas gathering unlabeled samples is easier. Since standard learning algorithms learn a diagnosis model from labeled samples only, they require many labeled samples and do not work well when the number of training samples is small. Therefore, it is very desirable to develop a predictive learning method that achieves high performance using both labeled and unlabeled samples. To address these problems, we propose semi-supervised distance metric learning using Random Forests with label propagation (SRF-LP), which incorporates labeled data to obtain good metrics and propagates labels based on them. Experimental results showed that SRF-LP outperformed standard supervised learning algorithms, i.e., RF, SVM, Adaboost and CART, and reached a maximum accuracy of 93.1%. SRF-LP outperformed them by a particularly large margin when the number of training samples was very small. Our results also suggested that SRF-LP exhibits a synergistic effect of semi-supervised distance metric learning and label propagation.
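
The label-propagation step named in this abstract can be illustrated with a minimal sketch. It assumes a precomputed, symmetric affinity matrix; in SRF-LP the similarities would come from the Random Forest-derived metric, which is not reproduced here. The function name `propagate_labels`, its parameters, and the toy graph are hypothetical, not taken from the paper.

```python
def propagate_labels(affinity, labels, n_classes, n_iter=200):
    """Iterative label propagation with clamped labeled samples.

    affinity: symmetric n x n list of non-negative similarities (assumed given).
    labels:   class index for labeled samples, None for unlabeled ones.
    """
    n = len(affinity)
    # Row-normalise the affinities so each row sums to 1.
    transition = []
    for row in affinity:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])
    # One-hot seeds: labeled rows carry their class, unlabeled rows start at zero.
    seed = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
            for i in range(n)]
    scores = [row[:] for row in seed]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(seed[i][:])  # labeled samples stay fixed
            else:
                # Unlabeled samples take the weighted average of their neighbours.
                updated.append([
                    sum(transition[i][j] * scores[j][c] for j in range(n))
                    for c in range(n_classes)
                ])
        scores = updated
    return [max(range(n_classes), key=lambda c: scores[i][c]) for i in range(n)]


# Toy chain of five samples; the weak 2-3 link separates the two groups.
affinity = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(propagate_labels(affinity, [0, None, None, None, 1], n_classes=2))
```

Only the two end samples are labeled; the remaining three receive the label of the group they are most strongly connected to.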

2.
With the application of new high-throughput sequencing technology, a large number of protein sequences are becoming available. Determining the functional characteristics of these proteins by experiment is an expensive endeavor that requires a lot of time. Furthermore, at the organismal level, such experimental functional analyses can be conducted only for a very few selected model organisms. Computational function prediction methods can be used to fill this gap. The functions of proteins are classified by Gene Ontology (GO), which contains more than 40,000 classifications in three domains: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). Additionally, since proteins have many functions, function prediction represents a multi-label and multi-class problem. We developed a new method to predict protein function from sequence. To this end, a natural language model was used to generate word embeddings of the sequence, features were learned from them by deep learning, and additional features were used to locate every protein. Our method uses the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and observe a noticeable improvement over several algorithms, such as FFPred, DeepGO, GoFDR and other methods compared on the CAFA3 datasets.

3.
As several structural proteomic projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein functions from protein structures are urgently needed. In this paper, we present a method to explore the clustering patterns of amino acids in 3-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented using a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.

4.
The question of molecular similarity is core in cheminformatics and is usually assessed via a pairwise comparison based on vectors of properties or molecular fingerprints. We recently exploited variational autoencoders to embed 6M molecules in a chemical space, such that their (Euclidean) distance within the latent space so formed could be assessed within the framework of the entire molecular set. However, the standard objective function used did not seek to manipulate the latent space so as to cluster the molecules based on any perceived similarity. Using a set of some 160,000 molecules of biological relevance, we here bring together three modern elements of deep learning to create a novel and disentangled latent space, viz. transformers, contrastive learning, and an embedded autoencoder. The effective dimensionality of the latent space was varied such that clear separation of individual types of molecules could be observed within individual dimensions of the latent space. The capacity of the network was such that many dimensions were not populated at all. As before, we assessed the utility of the representation by comparing clozapine with its near neighbors, and we also did the same for various antibiotics related to flucloxacillin. Transformers, especially when coupled with contrastive learning as here, effectively provide one-shot learning and lead to a successful and disentangled representation of molecular latent spaces that at once uses the entire training set in their construction while allowing “similar” molecules to cluster together in an effective and interpretable way.

5.
Machine learning promises to accelerate materials discovery by allowing computationally efficient property predictions from a small number of reference calculations. As a result, the literature has spent considerable effort on designing representations that capture basic physical properties. Our work focuses on the less-studied learning formulations in this context in order to exploit inner structures in the prediction errors. In particular, we propose to directly optimize basic loss functions of the prediction error metrics typically used in the literature, such as the mean absolute error or the worst-case error. In some instances, a proper choice of the loss function can directly and appreciably improve the prediction performance in the desired metric, albeit at the cost of additional computations during training. To support this claim, we describe the statistical learning theoretic foundations, and provide supporting numerical evidence with the prediction of atomization energies for a database of small organic molecules.
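
Why the choice of loss matters can be seen in a deliberately simple case: a constant predictor. This is an illustrative sketch, not the learning formulation of the paper. For a constant prediction, the median of the targets minimises the mean absolute error, while the midrange (the midpoint of the minimum and maximum) minimises the worst-case error. All function names and the toy targets are hypothetical.

```python
from statistics import median


def mean_absolute_error(prediction, targets):
    """Average absolute deviation of a constant prediction from the targets."""
    return sum(abs(prediction - t) for t in targets) / len(targets)


def worst_case_error(prediction, targets):
    """Largest absolute deviation of a constant prediction from the targets."""
    return max(abs(prediction - t) for t in targets)


def best_constant_for_mae(targets):
    # The median minimises the sum of absolute deviations.
    return median(targets)


def best_constant_for_worst_case(targets):
    # The midrange minimises the maximum absolute deviation.
    return (min(targets) + max(targets)) / 2


targets = [0.0, 1.0, 10.0]  # toy values with one outlier
for name, pred in [("MAE-optimal", best_constant_for_mae(targets)),
                   ("worst-case-optimal", best_constant_for_worst_case(targets))]:
    print(name, pred,
          mean_absolute_error(pred, targets),
          worst_case_error(pred, targets))
```

The two optimal constants differ, and each one is worse than the other when scored by the metric it was not optimised for.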

6.
Covalent labeling along with mass spectrometry is finding more use as a means of studying the higher order structure of proteins and protein complexes. Diethylpyrocarbonate (DEPC) is an increasingly used reagent for these labeling experiments because it is capable of modifying multiple residues at the same time. Pinpointing DEPC-labeled sites on proteins is typically needed to obtain more resolved structural information, and tandem mass spectrometry after protein proteolysis is often used for this purpose. In this work, we demonstrate that in certain instances, scrambling of the DEPC label from one residue to another can occur during collision-induced dissociation (CID) of labeled peptide ions, resulting in ambiguity in label site identity. From a preliminary study of over 30 labeled peptides, we find that scrambling occurs in about 25% of the peptides and most commonly occurs when histidine residues are labeled. Moreover, this scrambling appears to occur more readily under non-mobile proton conditions, meaning that low charge-state peptide ions are more prone to this reaction. For all peptides, we find that scrambling does not occur during electron transfer dissociation, which suggests that this dissociation technique is a safe alternative to CID for correct label site identification.

7.
Disulfide bonds are primary covalent cross‐links formed between two cysteine residues in the same or different protein polypeptide chains, which play important roles in the folding and stability of proteins. However, computational prediction of disulfide connectivity directly from protein primary sequences is challenging due to the nonlocal nature of disulfide bonds in the context of sequences, and the number of possible disulfide patterns grows exponentially as the number of cysteine residues increases. In previous studies, disulfide connectivity prediction was usually performed in a high‐dimensional feature space, which can cause a variety of problems in statistical learning, such as the curse of dimensionality, overfitting, and feature redundancy. In this study, we propose an efficient feature selection technique for analyzing the importance of each feature component. On the basis of this approach, we selected the most important features for predicting the connectivity pattern of intra‐chain disulfide bonds. Our results have shown that the high‐dimensional features contain redundant information, and the prediction performance can be further improved when these high‐dimensional features are reduced to a lower‐dimensional but more compact space. Our results also indicate that the global protein features contribute little to the formation and prediction of disulfide bonds, while the local sequential and structural information plays an important role. All these findings provide important insights for structural studies of disulfide‐rich proteins. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010

8.
The protein disulfide bond is a covalent bond that forms during post-translational modification by the oxidation of a pair of cysteines. In proteins, the disulfide bond is the most frequent covalent link between amino acids after the peptide bond. It plays a significant role in three-dimensional (3D) ab initio protein structure prediction (aiPSP), stabilizing protein conformation, post-translational modification, and protein folding. In aiPSP, the location of disulfide bonds can strongly reduce the conformational search space by imposing geometrical constraints. Existing experimental techniques for the determination of disulfide bonds are time-consuming and expensive. Thus, developing sequence-based computational methods for disulfide bond prediction becomes indispensable. This study proposed a stacking-based machine learning approach for disulfide bond prediction (diSBPred). Various useful sequence and structure-based features are extracted for effective training, including conservation profile, residue solvent accessibility, torsion angle flexibility, disorder probability, a sequential distance between cysteines, and more. The prediction of disulfide bonds is carried out in two stages: first, individual cysteines are predicted as either bonding or non-bonding; second, the cysteine-pairs are predicted as either bonding or non-bonding by including the results from cysteine bonding prediction as a feature. A comparison of the features employed in this study with those utilized in the existing nearest neighbor algorithm (NNA) method shows that the features used in this study yield an improvement of about 7.39% in jackknife-validation balanced accuracy. Moreover, for individual cysteine bonding prediction and cysteine-pair bonding prediction, diSBPred provides a 10-fold cross-validation balanced accuracy of 82.29% and 94.20%, respectively.
Altogether, our predictor achieves an improvement of 43.25% in balanced accuracy compared to the existing NNA-based approach. Thus, diSBPred can be utilized to annotate the cysteine bonding residues of protein sequences whose structures are unknown as well as improve the accuracy of the aiPSP method, which can further aid in experimental studies of the disulfide bond and structure determination.

9.
Precise information about protein locations in a cell facilitates understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of individual learning mechanisms, such as support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, trained using PseAAC based features is first analyzed. Classifiers are developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated for two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for the 12 and 14 subcellular locations datasets, respectively. In the independent dataset test, prediction accuracies are 87.04 and 87.33% for the 12 and 14 class datasets, respectively.
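
The voting step mentioned in this abstract can be sketched as plain majority voting over the label predictions of several classifiers. The abstract does not specify the exact voting rule used by CE-PLoc, so this is only an assumed, simplest-case illustration; `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier label lists by majority vote.

    predictions: list of label lists, one per classifier, all of equal length.
    Ties go to the label that was voted first (the earliest classifier).
    """
    fused = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Three hypothetical classifiers predicting locations for three proteins.
predictions = [
    ["nucleus", "cytoplasm", "membrane"],
    ["nucleus", "membrane", "membrane"],
    ["cytoplasm", "cytoplasm", "membrane"],
]
print(majority_vote(predictions))
```

Each protein receives the location that most of the individual classifiers agree on.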

10.
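Du Zhuokun (杜卓锟), Shao Wei (邵伟), Qin Weijie (秦伟捷). Chinese Journal of Chromatography (《色谱》), 2021, 39(3): 211-218
In proteomics research based on liquid chromatography-mass spectrometry, the retention time of a peptide is a characteristic parameter that effectively distinguishes different peptides, and it can be predicted from information such as the peptide's own sequence. Using predicted retention times to assist mass spectrometry data in identifying peptide sequences can improve identification accuracy, so retention time prediction has long received wide attention in the field. Traditional retention time prediction methods usually compute the physicochemical properties of a peptide from its amino acid sequence and then calculate its retention time under specific chromatographic conditions. In recent years, deep learning methods have made great progress and play an increasingly important role in proteomics research. A variety of deep-learning-based retention time prediction methods have now been developed; compared with traditional methods, they are more accurate, are easy to use across platforms, and can predict the retention times of modified peptides. However, predictions for some complex modifications, such as glycosylation, are still not accurate enough. Further improving prediction accuracy for modified peptides is an important research direction for deep-learning-based retention time prediction. Predicted retention times are applied in quality control and method evaluation for peptide identification, and are combined with predicted tandem mass spectra to build simulated spectral libraries, among other uses. This article reviews the latest research progress and applications of deep learning methods in retention time prediction and offers an outlook on development trends and future application directions, with the aim of providing a reference for retention time prediction research and proteome identification work.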

11.
The literature contains over fifty years of accumulated methods proposed by researchers for predicting the secondary structures of proteins in silico. A large part of this collection comprises artificial neural network-based approaches, a field of artificial intelligence and machine learning that is gaining increasing popularity in various application areas. The primary objective of this paper is to summarize works that are important but scattered in time, to help new researchers have a clear view of the domain in a single place. An informative introduction to protein secondary structure and artificial neural networks is also included for context. This review will be valuable in designing future methods to improve protein secondary structure prediction accuracy. The various neural network methods found in this problem domain employ varying architectures and feature spaces, and a handful stand out due to significant improvements in prediction. Neural networks with larger feature scope and higher architecture complexity have been found to produce better protein secondary structure prediction. The current prediction accuracy lies around the 84% mark, leaving much room for further improvement in the prediction of secondary structures in silico. It was found that the estimated limit of 88% prediction accuracy has not been reached yet, so further research is timely.

12.
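Local modeling of near-infrared spectra based on wavelet coefficients: method and application   Cited 2 times in total (self-citations: 0; citations by others: 2)
Local modeling builds a model from samples similar to the prediction sample, which can address the nonlinearity between spectral response and concentration, extend the model's range of applicability, and improve prediction accuracy. Using the wavelet transform for data compression and the Euclidean distance between wavelet coefficients as the criterion of spectral similarity, a local modeling method for quantitative near-infrared spectral analysis was implemented, which avoids dependence between samples. The method was applied to the determination of chlorine content in tobacco samples; over 100 repeated calculations, the mean root mean square error of prediction (RMSEP) was 0.0665 with a standard deviation (σ) of 0.0045, better than both global modeling and principal-component-based local modeling.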
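
The similarity search behind such a local model can be sketched as follows. The sketch makes several assumptions that the abstract does not confirm: it uses a Haar wavelet, keeps only the approximation coefficients as the compressed representation, and replaces the local regression model with the mean of the neighbours' reference values. The names `haar_approximation`, `local_predict`, and the toy spectra are hypothetical.

```python
import math


def haar_approximation(spectrum, levels=1):
    """Compress a spectrum by keeping Haar approximation coefficients only."""
    coeffs = list(spectrum)
    for _ in range(levels):
        # Each level halves the length by combining neighbouring points.
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs) - 1, 2)]
    return coeffs


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def local_predict(cal_spectra, cal_values, query, k=2, levels=1):
    """Predict from the k calibration samples closest in wavelet space.

    The local model here is a plain average (a stand-in for a regression).
    Returns the prediction and the indices of the selected neighbours.
    """
    q = haar_approximation(query, levels)
    ranked = sorted(
        range(len(cal_spectra)),
        key=lambda i: euclidean(haar_approximation(cal_spectra[i], levels), q),
    )
    nearest = ranked[:k]
    prediction = sum(cal_values[i] for i in nearest) / len(nearest)
    return prediction, nearest


# Toy calibration set: four flat "spectra" with known reference values.
cal_spectra = [[0.0] * 4, [1.0] * 4, [1.1] * 4, [5.0] * 4]
cal_values = [0.0, 1.0, 1.2, 5.0]
print(local_predict(cal_spectra, cal_values, [1.05] * 4, k=2))
```

Only the calibration samples that resemble the query contribute to the prediction, which is what allows a local model to follow a nonlinear response.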

13.
Deep learning methods for RNA secondary structure prediction have shown higher performance than traditional methods, but there is still much room to improve. It is known that the lengths of RNAs are very different, as are their secondary structures. However, the current deep learning methods all use length-independent models, so it is difficult for these models to learn very different secondary structures. Here, we propose a length-dependent model that is obtained by further training the length-independent model for different length ranges of RNAs through transfer learning. 2dRNA, a coupled deep learning neural network for RNA secondary structure prediction, is used to do this. Benchmarking shows that the length-dependent model performs better than the usual length-independent model.

14.
Several methods have been proposed for protein–sugar binding site prediction using machine learning algorithms. However, they are not effective at learning the various properties of binding site residues caused by the various interactions between proteins and sugars. In this study, we classified sugars into acidic and nonacidic sugars and showed that their binding sites have different amino acid occurrence frequencies. Using this result, we developed sugar-binding residue predictors dedicated to the two classes of sugars: an acidic sugar binding predictor and a nonacidic sugar binding predictor. We also developed a combination predictor which combines the results of the two predictors. We showed that when a sugar is known to be an acidic sugar, the acidic sugar binding predictor achieves the best performance, and that when a sugar is known to be a nonacidic sugar or is not known to be either of the two classes, the combination predictor achieves the best performance. Our method uses only amino acid sequences for prediction. A support vector machine was used as the machine learning algorithm, and the position-specific scoring matrix created by the position-specific iterative basic local alignment search tool was used as the feature vector. We evaluated the performance of the predictors using five-fold cross-validation. We have launched our system as an open source freeware tool on the GitHub repository (https://doi.org/10.5281/zenodo.61513).

15.
Solid‐state nuclear magnetic resonance (NMR) spectroscopy has been successfully applied to elucidate the atomic‐resolution structures of insoluble proteins. The major bottleneck is the difficulty of obtaining valuable long‐distance structural information. Here, we propose the use of distance restraints as long as 32 Å, obtained from the quantification of transverse proton relaxation induced by a methanethiosulfonate spin label (MTSL). Combined with dipolar proton–proton distance restraints, this method allows us to obtain protein structures with excellent precision from single spin‐labeled 1 mg protein samples using fast magic angle spinning.

16.
17.
In this paper, we evaluate three learning algorithms based on supervised projections for molecular activity prediction. Using an approach based on supervised projections of the input space to construct ensembles of classifiers, three algorithms were tested. We constructed the projections by considering only instances that were misclassified by a previous classifier using the hidden layer of an Artificial Neural Network. We applied a supervised linear projection of the input space using a Nonparametric Discriminant Analysis method. Finally, we projected onto a subspace that minimizes the weighted error for each step. Using these three methods to construct ensembles of classifiers for the in silico prediction of Ames mutagenicity, we demonstrated the improved behavior of our proposal compared to classical methods.

18.
Protein inference from the identified peptides is of primary importance in shotgun proteomics. The goal of protein inference is to determine whether each candidate protein is truly present in the sample. To date, many computational methods have been proposed to solve this problem. However, there is still no method that can fully utilize the information hidden in the input data. In this article, we propose a learning-based method named BagReg for protein inference. The method first manually extracts five features from the input data, and then chooses each feature in turn as the class feature to separately build models that predict the presence probabilities of proteins. Finally, the weak results from the five prediction models are aggregated to obtain the final result. We test our method on six publicly available data sets. The experimental results show that our method is superior to the state-of-the-art protein inference algorithms.

19.
Protein function prediction is a crucial task in the post-genomics era because of the diverse, irreplaceable roles of proteins in a biological system. Traditional methods involved cost-intensive and time-consuming molecular biology techniques, but they proved inadequate after the surge of sequencing data that followed the advent of cost-effective and advanced sequencing techniques. To keep the pace of annotation in step with that of data generation, there has been a shift to computational approaches based on homology, sequence and structure-based features, protein-protein interaction networks, phylogenetic profiles, physicochemical properties, etc. A combination of these features has proven to be promising for improving the accuracy of protein function prediction. In the present work, we have employed a combination of sequence, physicochemical property, subsequence and annotation features, with a total of 9890 features extracted and/or calculated for 171,212 reviewed prokaryotic proteins of 9 bacterial phyla from UniProtKB, to train a supervised deep learning ensemble model with the aim of categorizing the function of a bacterial hypothetical/unreviewed protein into 1739 GO terms as functional classes. The proposed system, being fully dedicated to bacterial organisms, is a novel attempt amongst the various existing machine learning based protein function prediction systems, which are based on mixed organisms. Experimental results demonstrate the success of the proposed deep learning ensemble model based on the deep neural network method, with an F1 measure of 0.7912 on the prepared Test dataset 1 of reviewed proteins.

20.
It has long been realized that connected graphs have some sort of geometric structure, in that there is a natural distance function (or metric), namely, the shortest-path distance function. In fact, there are several other natural yet intrinsic distance functions, including the resistance distance, corresponding “square-rooted” distance functions, and a so‐called “quasi‐Euclidean” distance function. Some of these distance functions are introduced here, and some are noted to satisfy not only the usual triangle inequality but also other relations such as the “tetrahedron inequality”. Granted some (intrinsic) distance function, there are different consequent graph invariants. Here attention is directed to a sequence of graph invariants which may be interpreted as: the sum of a power of the distances between pairs of vertices of G, the sum of a power of the “areas” between triples of vertices of G, the sum of a power of the “volumes” between quartets of vertices of G, etc. The Cayley–Menger formula for n-volumes in Euclidean space is taken as the defining relation for so-called “n-volumina” in terms of graph distances, and several theorems are here established for the volumina-sum invariants (when the mentioned power is 2). This revised version was published online in July 2006 with corrections to the Cover Date.
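
The first two invariants in this sequence can be computed with a short sketch. It assumes the shortest-path distance as the intrinsic distance (only one of the several distance functions the abstract mentions) and power 2. For three points with pairwise distances a, b, c, the Cayley–Menger relation reduces to 16·A² = 2a²b² + 2b²c² + 2c²a² − a⁴ − b⁴ − c⁴. The function names and the adjacency-list input format are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path_distances(adj):
    """All-pairs shortest-path distances of a connected, unweighted graph.

    adj: adjacency lists, adj[u] is the list of neighbours of vertex u.
    """
    n = len(adj)
    dist = []
    for source in range(n):
        d = [None] * n
        d[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from the source
            u = queue.popleft()
            for v in adj[u]:
                if d[v] is None:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist.append(d)
    return dist


def squared_area(a, b, c):
    """Squared 'area' of a triple from the 3-point Cayley-Menger relation."""
    return (2 * a * a * b * b + 2 * b * b * c * c + 2 * c * c * a * a
            - a ** 4 - b ** 4 - c ** 4) / 16


def volumina_sums(adj):
    """Sum of squared distances over pairs and squared areas over triples."""
    d = shortest_path_distances(adj)
    n = len(adj)
    length_sum = sum(d[i][j] ** 2 for i, j in combinations(range(n), 2))
    area_sum = sum(squared_area(d[i][j], d[j][k], d[i][k])
                   for i, j, k in combinations(range(n), 3))
    return length_sum, area_sum


triangle = [[1, 2], [0, 2], [0, 1]]  # complete graph on three vertices
path = [[1], [0, 2], [1]]            # path on three vertices
print(volumina_sums(triangle), volumina_sums(path))
```

The path graph gives a zero area sum because its three vertices are "collinear" under the shortest-path distance, whereas the triangle behaves like an equilateral triangle with unit sides.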
