Similar literature
 Found 20 similar articles (search time: 953 ms)
1.
MicroRNAs (miRNAs) have been shown to play an indispensable role in many fundamental biological processes, and the dysregulation of miRNAs is closely correlated with complex human diseases. Many studies have focused on predicting potential miRNA-disease associations. Considering the insufficient number of known miRNA-disease associations and the poor performance of many existing prediction methods, a novel model combining a gradient boosting decision tree with logistic regression (GBDT-LR) is proposed to prioritize miRNA candidates for diseases. To balance positive and negative samples, GBDT-LR first adopts k-means clustering to screen negative samples from unknown miRNA-disease associations. Then, the gradient boosting decision tree (GBDT) model, which has an intrinsic advantage in finding many distinguishing features and feature combinations, is applied to extract features. Finally, the new features extracted by the GBDT model are input into a logistic regression (LR) model to predict the final miRNA-disease association score. The experimental results show that GBDT-LR achieves an average AUC of 0.9274 in 5-fold cross-validation (CV). In addition, in the case studies, 90%, 94% and 88% of the top 50 miRNAs potentially associated with colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases. Compared with three other state-of-the-art methods, GBDT-LR achieves the best prediction performance. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.
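The negative-sample screening step mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how samples are drawn once the unknown pairs are clustered, so the code below assumes one plausible reading: cluster the unlabeled feature vectors with k-means, then draw an equal number from each cluster. The function names `kmeans` and `screen_negatives` and all parameters are hypothetical and are not taken from the GBDT-LR repository.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initialise centers from random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def screen_negatives(unlabeled, n_needed, k=3, seed=0):
    """Pick roughly n_needed indices from the unlabeled pool, spread evenly over k clusters."""
    labels = kmeans(unlabeled, k, seed=seed)
    rng = random.Random(seed)
    per_cluster = n_needed // k
    chosen = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen
```

Drawing evenly from every cluster keeps the selected negatives spread over the feature space instead of concentrated in one dense region. Whether the authors use this or a different per-cluster rule is not stated here.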

2.
As a large group of small non-coding RNAs (ncRNAs), Piwi-interacting RNAs (piRNAs) have been found to be associated with various diseases. Identifying disease-associated piRNAs can provide promising candidate molecular targets for drug design. Although a few computational ensemble methods have been developed for identifying piRNA-disease associations, the low-quality negative associations used during training, which may even include positive associations, limit predictive performance. In this study, we propose a new computational predictor named iPiDA-sHN to predict potential piRNA-disease associations. iPiDA-sHN represents piRNA-disease pairs by incorporating piRNA sequence information, the known piRNA-disease association network, and the disease semantic graph. High-level features of piRNA-disease associations are extracted by a Convolutional Neural Network (CNN). A two-step positive-unlabeled learning strategy based on Support Vector Machine (SVM) is employed to select high-quality negative samples from the unknown piRNA-disease pairs. Finally, the SVM predictor trained with the known piRNA-disease associations and the high-quality negative associations is used to predict new piRNA-disease associations. The experimental results show that iPiDA-sHN achieves superior predictive ability compared with other state-of-the-art predictors.
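The two-step positive-unlabeled idea in this abstract can be sketched in a few lines. This is only an illustration of the general strategy, not the iPiDA-sHN procedure: the SVM is swapped for a much simpler nearest-centroid rule so that the example stays dependency-free, and the helper names (`reliable_negatives`, `pu_scores`) are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reliable_negatives(positives, unlabeled, n):
    """Step 1: treat the n unlabeled samples farthest from the positive centroid as negatives."""
    c = centroid(positives)
    ranked = sorted(range(len(unlabeled)), key=lambda i: sq_dist(unlabeled[i], c), reverse=True)
    return ranked[:n]


def pu_scores(positives, unlabeled, n_neg):
    """Step 2: score every unlabeled sample with a classifier built from positives vs. step-1 negatives.

    A nearest-centroid rule stands in for the SVM here; score > 0 means "closer to the positives".
    """
    neg_idx = reliable_negatives(positives, unlabeled, n_neg)
    pos_c = centroid(positives)
    neg_c = centroid([unlabeled[i] for i in neg_idx])
    return [sq_dist(u, neg_c) - sq_dist(u, pos_c) for u in unlabeled]
```

The point of the first step is that the final classifier never sees arbitrary unlabeled pairs as negatives, only those least similar to the known positives.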

3.
In patients with depression, the use of 5-HT reuptake inhibitors can improve the condition. Machine learning methods can be used for ligand-based activity prediction. In order to predict serotonin transporter (SERT) inhibitors, SERT inhibitor data from the ChEMBL database were screened and pre-processed. Then 4 machine learning methods (LR, SVM, RF, and KNN) and 4 molecular fingerprints (CDK, Graph, MACCS, and PubChem) were used to build 16 prediction models. The 5 models with the highest accuracy (Q) in cross-validation on the training set were used to build three different ensemble learning models. On the test1 set, the VOT_CLF3 model had the largest SP (0.871), Q (0.869), AUC (0.919), and MCC (0.728). On the unbalanced test2 set, VOT_CLF3 had the largest SE (0.857), SP (0.867), Q (0.865) and MCC (0.639). VOT_CLF3 was recommended for the virtual screening of SERT inhibitors. In addition, 12 molecular structural alerts that frequently appear in SERT inhibitors were found (P < 0.05), which provide an important reference for the design of SERT inhibitors.
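For readers unfamiliar with the reported metrics, the sketch below computes sensitivity (SE), specificity (SP), overall accuracy (Q), and the Matthews correlation coefficient (MCC) from a confusion matrix, and shows a hard majority vote over several models. The abstract does not say whether the VOT_CLF models use hard or soft voting, so hard voting is an assumption, and the function names are illustrative.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """SE, SP, Q and MCC from the four confusion-matrix counts."""
    se = tp / (tp + fn)                       # sensitivity: fraction of actives recovered
    sp = tn / (tn + fp)                       # specificity: fraction of inactives rejected
    q = (tp + tn) / (tp + fn + tn + fp)       # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}


def majority_vote(predictions):
    """Hard voting: predictions is a list of per-model 0/1 label lists of equal length."""
    return [1 if 2 * sum(col) > len(col) else 0 for col in zip(*predictions)]
```

MCC is the more informative figure on an unbalanced set such as test2, because Q can stay high even when the minority class is predicted poorly.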

4.
Parkinson’s disease (PD) is a complex neurological disorder that typically worsens with age. A wide range of pathologies makes PD a very heterogeneous condition, and there are currently no reliable diagnostic tests for this disease. The application of metabolomics to the study of PD has the potential to identify disease biomarkers through the systematic evaluation of metabolites. In this study, urine metabolic profiles of 215 urine samples from 104 PD patients and 111 healthy individuals were ass...

5.
Identification of disease genes using computational methods is an important issue in biomedical and bioinformatics research. Based on the observation that diseases with the same or similar phenotypes share biological characteristics, researchers have tried to identify disease genes using machine learning tools. In recent attempts, semi-supervised learning methods known as positive-unlabeled learning have been used for disease gene identification. In this paper, we present a Perceptron ensemble of graph-based positive-unlabeled learning (PEGPUL) on three types of biological attributes: gene ontologies, protein domains and protein-protein interaction networks. In our method, a reliable set of positive and negative genes is extracted using a co-training scheme. Then, the similarity graph of genes is built using metric learning, concentrating on the multi-rank-walk method to perform inference from labeled genes. Finally, a Perceptron ensemble is learned from three weighted classifiers: multilevel support vector machine, k-nearest neighbor and decision tree. The main contributions of this paper are: (i) incorporating the statistical properties of gene data by choosing proper metrics, (ii) statistical evaluation of biological features, and (iii) the noise robustness of PEGPUL obtained by using a multilevel scheme. To assess PEGPUL, we applied it to 12950 genes, comprising 949 positive genes from six classes of diseases and 12001 unlabeled genes. Compared with some popular disease gene identification methods, the experimental results show that PEGPUL has reasonable performance.

6.
7.
The interactions between miRNAs and long non-coding RNAs (lncRNAs) have been the subject of intensive recent study owing to their critical role in gene regulation. Computational prediction of lncRNA-miRNA interactions has become a popular alternative to experimental methods for identifying underlying interactions. It is desirable to develop machine learning-based models for predicting lncRNA-miRNA interactions from the experimentally validated interactions between lncRNAs and miRNAs. The accuracy and robustness of existing machine learning models still leave room for improvement. Considering that the attributes of lncRNAs and miRNAs are of key importance in the interaction between these two RNAs, a deep learning model named LMI-DForest is proposed here, combining the deep forest and autoencoder strategies. Systematic comparison on experimentally validated lncRNA-miRNA interaction datasets demonstrates that the proposed method consistently outperforms the other machine learning models in lncRNA-miRNA interaction prediction.

8.
Many complex natural or synthetic products are analysed by either the GC–MS (gas chromatography–mass spectrometry) or the HPLC–DAD (high performance liquid chromatography–diode-array detector) technique, each of which produces a one-dimensional fingerprint for a given sample. Such a fingerprint may be used to classify different batches of a product. GC–MS and HPLC–DAD data for complex, similar substances, represented by the three common types of the TCM (traditional Chinese medicine) Rhizoma Curcumae, were analysed in the form of one- and two-dimensional matrices, first with PCA (principal component analysis), which showed a reasonable separation of the samples for each technique. However, the separation patterns were rather different for each analytical method, and PCA of the combined data matrix showed improved discrimination of the three types of object; close associations between the GC–MS and HPLC–DAD variables were observed. LDA (linear discriminant analysis), BP-ANN (back propagation-artificial neural networks) and LS-SVM (least squares-support vector machine) chemometrics methods were then applied to classify the training and prediction sets. For the one-dimensional matrices, all training models indicated that several samples would be misclassified; the same was observed for each prediction set. By comparison, in the analysis of the combined matrix, all models gave 100% classification with the training set, and the LS-SVM calibration also produced a 100% result for prediction, with the BP-ANN calibration close behind. This has important implications for comparing complex substances such as TCMs, because the one-dimensional data matrices alone clearly produce inferior results for training and prediction compared with the combined data matrix models. Thus, product samples may be misclassified when one-dimensional data are used, because of insufficient information.

9.
Accumulating studies have indicated that long non-coding RNAs (lncRNAs) play crucial roles in a large number of biological processes. Predicting lncRNA-disease associations can help biologists understand the molecular mechanisms of human disease and benefit disease diagnosis, treatment and prevention. In this paper, we introduce a computational framework based on graph autoencoder matrix completion (GAMCLDA) to identify lncRNA-disease associations. In our method, a graph convolutional network is used to encode the local graph structure and node features in order to learn latent factor vectors for lncRNAs and diseases. The inner product of the lncRNA factor vector and the disease factor vector is then used as the decoder to reconstruct the lncRNA-disease association matrix. In addition, a cost-sensitive neural network is used to deal with the imbalance between positive and negative samples. The experimental results show that GAMCLDA outperforms other state-of-the-art methods in prediction performance, as evaluated by AUC, AUPR, PPV and F1-score. Moreover, the case study shows that our method is an effective tool for predicting potential lncRNA-disease associations.
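The inner-product decoder described in this abstract is easy to show in isolation. The sketch below assumes the latent factor vectors have already been learned (the graph convolutional encoder is not reproduced) and that a sigmoid maps each inner product to a score in (0, 1); the sigmoid and the function name `decode_associations` are assumptions, not details stated in the abstract.

```python
import math


def decode_associations(lncrna_factors, disease_factors):
    """Reconstruct an association score matrix from latent factor vectors.

    Entry [i][j] is sigmoid(<z_lncRNA_i, z_disease_j>); the sigmoid is an assumed choice.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return [
        [sigmoid(sum(a * b for a, b in zip(u, v))) for v in disease_factors]
        for u in lncrna_factors
    ]
```

Pairs whose latent vectors point in similar directions receive scores near 1, while orthogonal pairs land at 0.5 and opposed pairs near 0.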

10.
11.
RNA secondary structure prediction is a key technology in RNA bioinformatics. Most algorithms for RNA secondary structure prediction use probabilistic models, in which the model parameters are trained with reliable RNA secondary structures. Because of the difficulty of determining RNA secondary structures by experimental procedures, such as NMR or X-ray crystal structural analyses, there are still many RNA sequences that could be useful for training whose secondary structures have not been experimentally determined. In this paper, we introduce a novel semi-supervised learning approach for training parameters in a probabilistic model of RNA secondary structures in which we employ not only RNA sequences with annotated secondary structures but also ones with unknown secondary structures. Our model is based on a hybrid of generative (stochastic context-free grammars) and discriminative models (conditional random fields) that has been successfully applied to natural language processing. Computational experiments indicate that the accuracy of secondary structure prediction is improved by incorporating RNA sequences with unknown secondary structures into training. To our knowledge, this is the first study of a semi-supervised learning approach for RNA secondary structure prediction. This technique will be useful when the number of reliable structures is limited.

12.
Protein function prediction is a crucial task in the post-genomics era because of the diverse and irreplaceable roles proteins play in a biological system. Traditional methods relied on cost-intensive and time-consuming molecular biology techniques, which could not keep up with the explosion of sequencing data that followed the advent of cost-effective, advanced sequencing techniques. To keep annotation in pace with data generation, there has been a shift to computational approaches based on homology, sequence and structure-based features, protein-protein interaction networks, phylogenetic profiles, physicochemical properties, and so on. A combination of these features has proven promising for improving the accuracy of protein function prediction. In the present work, we employ a combination of sequence, physicochemical property, subsequence and annotation features, with a total of 9890 features extracted and/or calculated for 171,212 reviewed prokaryotic proteins of 9 bacterial phyla from UniProtKB, to train a supervised deep learning ensemble model that assigns the function of a hypothetical or unreviewed bacterial protein to one of 1739 GO terms used as functional classes. Being fully dedicated to bacterial organisms, the proposed system is a novel attempt among the existing machine learning based protein function prediction systems, which are based on mixed organisms. Experimental results demonstrate the success of the proposed deep learning ensemble model, which is based on deep neural networks, with an F1 measure of 0.7912 on the prepared Test dataset 1 of reviewed proteins.

13.
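Influenza is a major respiratory infectious disease with a relatively high incidence in the general population and a relatively high mortality among elderly and high-risk patients. Studies have shown that inhibiting neuraminidase (NA) can block viral RNA replication, so NA is an important drug target for the effective treatment of H1N1 influenza virus. Virtual screening and prediction of NA inhibitors by computational methods have become increasingly important. Structure-based rational drug design targeting the enzyme active site to develop H1N1 neuraminidase inhibitors has become one of the hot topics in drug research. In this work, several machine learning methods (support vector machine (SVM), k-nearest neighbor (k-NN), and C4.5 decision tree (C4.5DT)) were used to build classification models for known neuraminidase inhibitors (NAIs) and non-neuraminidase inhibitors (non-NAIs). A set of 227 structurally diverse compounds (72 NAIs and 155 non-NAIs) was used to test the classification system, and recursive feature elimination was used to select the property descriptors relevant to NAI classification in order to improve prediction accuracy. For the independent validation set, the overall prediction accuracy was 75.9%-92.6%, the prediction accuracy for NA inhibitors was 64.3%-78.6%, and the prediction accuracy for non-H1N1 inhibitors was 77.5%-97.5%. The SVM method gave the best overall prediction accuracy (92.6%). This study shows that machine learning methods such as SVM can effectively predict potential NA inhibitors in unknown datasets and help to identify the molecular descriptors associated with them.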

14.
Drug-target interaction (DTI) prediction through in vitro methods is expensive and time-consuming. Computational methods, on the other hand, can save time and money while enhancing drug discovery efficiency. Most computational methods frame DTI prediction as a binary classification task. One important challenge is that the number of negative interactions in all DTI-related datasets is far greater than the number of positive interactions, leading to the class imbalance problem. As a result, a trained classifier is biased towards the majority class (the negative class), whereas the minority class (interacting pairs) is the one of interest. This class imbalance problem is not widely taken into account in DTI prediction studies, and the few previous studies that consider balancing in DTI do not focus on the imbalance issue itself. Additionally, they do not benefit from deep learning models and experimental validation. In this study, we propose a computational framework, together with experimental validation, that predicts drug-target interactions using an ensemble of deep learning models to address the class imbalance problem in the DTI domain. The objective of this paper is to mitigate the bias in DTI prediction by focusing on the impact of balancing while holding the other parameters involved constant. Our analysis shows that the proposed model outperforms unbalanced models with the same architecture trained on BindingDB, both computationally and experimentally. These findings demonstrate the significance of balancing, which reduces the bias towards the negative class and leads to better performance. It is important to note that computational results that are not experimentally validated and that rely solely on AUROC and AUPRC metrics are not credible, particularly when the testing set remains unbalanced.
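The class imbalance discussed in this abstract can be made concrete with a small balancing sketch. The abstract does not state which balancing technique the authors use, so random undersampling of the majority class is shown here purely as one common option; the function name `undersample` and its parameters are hypothetical.

```python
import random


def undersample(samples, labels, seed=0):
    """Balance a binary dataset by randomly discarding majority-class examples.

    Returns a shuffled (samples, labels) pair with equal numbers of 0s and 1s.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority example and an equally sized random subset of the majority.
    keep = minority + rng.sample(majority, len(minority))
    rng.shuffle(keep)
    return [samples[i] for i in keep], [labels[i] for i in keep]
```

In an ensemble setting, drawing a different majority subset for each member (for example by varying `seed`) lets the ensemble as a whole still see most of the negative pairs.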

15.
γ‐Secretase inhibitors have been explored for the prevention and treatment of Alzheimer's disease (AD). Methods for predicting and screening γ‐secretase inhibitors are highly desirable for facilitating the design of novel therapeutic agents against AD, especially given the incomplete knowledge of the mechanism and three‐dimensional structure of γ‐secretase. We explored two machine learning methods, support vector machine (SVM) and random forest (RF), to develop models for predicting γ‐secretase inhibitors of diverse structures. Quantitative analysis of the receiver operating characteristic (ROC) curve was performed to further examine and optimize the models. In particular, the Youden index (YI) was first introduced into the ROC curve of RF to obtain an optimal probability threshold for prediction. The developed models were validated on an external testing set, with SVM and RF achieving prediction accuracies of 96.48% and 98.83% for γ‐secretase inhibitors and 98.18% and 99.27% for noninhibitors, respectively. Different feature selection methods were used to extract the physicochemical features most relevant to γ‐secretase inhibition. To the best of our knowledge, the RF model developed in this work is the first model with a broad applicability domain; based on it, virtual screening of γ‐secretase inhibitors against the ZINC database was performed, resulting in 368 potential hit candidates. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010
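The Youden index mentioned in this abstract is defined as J = sensitivity + specificity - 1, and the optimal threshold is the one that maximizes J along the ROC curve. The sketch below scans every observed score as a candidate threshold; the function name and the tie-breaking rule (keep the lowest threshold that reaches the maximum) are illustrative choices, not details from the paper.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = SE + SP - 1.

    A sample is predicted positive when score >= threshold.
    Assumes labels contain at least one 1 and at least one 0.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp) - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold among ties
            best_t, best_j = t, j
    return best_t, best_j
```

With a random forest, `scores` would typically be the predicted class probabilities, so the returned threshold replaces the default cut-off of 0.5.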

16.
17.
To improve the prediction accuracy of O-glycosylation sites and to analyze their structure, factor analysis based prediction is proposed in this study. Our studies show that factor analysis strongly boosts the performance of machine learning algorithms in glycosylation site prediction and has advantages over principal component analysis and nonnegative matrix factorization. In addition, we found that factor analysis based linear discriminant analysis seems to be a desirable method for O-glycosylation site prediction, given its advantage over other machine learning methods in both accuracy and time complexity. To the best of our knowledge, this is the first work to employ factor analysis in glycosylation site prediction, and it will inspire further work on this topic.

18.
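Machine learning methods for building classification models of acetylcholinesterase inhibitors   Total citations: 1, self-citations: 0, citations by others: 1
We constructed 1559 descriptors characterizing the molecular composition, charge, topology, geometry, and physicochemical properties of acetylcholinesterase inhibitors. Variable selection combining Fisher score ranking-based filtering with Monte Carlo simulated annealing yielded 37 descriptors, and machine learning methods including support vector machine (SVM), artificial neural network (ANN), and k-nearest neighbor (k-NN) were then used to build classification models for acetylcholinesterase inhibitors. For the 515 samples of the training set, under five-fold cross-validation, the average prediction accuracies of the machine learning methods for positive samples, negative samples, and all samples were 87.3%-92.7%, 67.0%-81.0%, and 79.4%-88.2%, respectively. The y-scrambling method was used to check whether the SVM model arose from chance correlation; the resulting average prediction accuracies for positive samples, negative samples, and all samples were 72.7%-82.5%, 41.0%-53.0%, and 62.1%-69.1%, respectively, clearly lower than those of the actual model, indicating that the model is not the result of chance correlation. For 172 external independent test samples that took no part in model building, the prediction accuracies of the machine learning methods for positive samples, negative samples, and all samples were 93.3%-100.0%, 74.6%-89.6%, and 86.1%-95.9%, respectively. Among the models built, the SVM model had the best prediction accuracy, clearly higher than the results reported in other literature.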
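The y-scrambling check in this abstract can be sketched as follows: shuffle the class labels, re-evaluate the model, and compare the scrambled accuracies with the accuracy obtained on the true labels. To keep the example self-contained, a leave-one-out 1-nearest-neighbor classifier on one-dimensional data stands in for the SVM, which is an assumption made only for illustration; the function names are hypothetical.

```python
import random


def loo_1nn_accuracy(xs, ys):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on 1-D features."""
    correct = 0
    for i, x in enumerate(xs):
        # Nearest other sample; ties resolve to the lowest index.
        j = min((k for k in range(len(xs)) if k != i), key=lambda k: abs(xs[k] - x))
        correct += ys[j] == ys[i]
    return correct / len(xs)


def y_scramble(xs, ys, n_rounds=20, seed=0):
    """Accuracy of the same procedure after randomly permuting the labels, n_rounds times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        scores.append(loo_1nn_accuracy(xs, shuffled))
    return scores
```

If the accuracy on the true labels is clearly above the scrambled accuracies, as reported in the abstract, the model is unlikely to be a product of chance correlation between descriptors and labels.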

19.
20.
This study combines six popular machine learning approaches to improve the prediction of binding affinity between receptors (large protein molecules) and ligands (small organic molecules). Here we examine a scheme in which the affinity of ligands is predicted against a single receptor, human thrombin; thus, the models consider ligand features only. However, the suggested approach can be repurposed for other receptors. The methods include Support Vector Machine, Random Forest, CatBoost, feed-forward neural network, graph neural network, and Bidirectional Encoder Representations from Transformers. The first five methods use input features based on physico-chemical properties of molecules, while the last one is based on textual molecular representations. None of the approaches relies on atomic spatial coordinates, which avoids a potential bias from known structures, and all are capable of generalizing to compounds with unknown conformations. Within each of the methods, we have trained two models, one solving a classification task and one a regression task. Then, all models are grouped into a pipeline of two successive ensembles. The first ensemble aggregates six classification models, which vote on whether or not a ligand binds to the receptor. If a ligand is classified as active (i.e., it binds), the second ensemble predicts its binding affinity in terms of the inhibition constant Ki.
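The two-ensemble pipeline described in this abstract can be sketched as a gate followed by a regressor. The abstract does not specify how the regression outputs are aggregated or how voting ties are handled, so averaging the regressors and treating a tie as inactive are both assumptions; `two_stage_predict` is a hypothetical name.

```python
def two_stage_predict(classifier_votes, regressor_outputs):
    """Gate a regression ensemble behind a classification ensemble.

    classifier_votes: 0/1 votes, one per classification model (1 = ligand binds).
    regressor_outputs: predicted affinity values, one per regression model.
    Returns the mean predicted affinity if a strict majority votes 'binds', else None.
    """
    is_active = 2 * sum(classifier_votes) > len(classifier_votes)  # a tie counts as inactive
    if not is_active:
        return None
    return sum(regressor_outputs) / len(regressor_outputs)
```

For example, `two_stage_predict([1, 1, 1, 0, 1, 0], [6.0, 7.0, 8.0])` passes the gate and returns the mean of the regression outputs, whereas a 2-to-4 vote returns `None` without consulting the regressors.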
