Similar literature
1.
We present results of a new computational learning algorithm combining favorable elements of two well-known techniques: K nearest neighbors and recursive partitioning. Like K nearest neighbors, the method provides an independent prediction for each test sample under consideration, while like recursive partitioning, it incorporates an automatic selection of important input variables for model construction. The new method is applied to the problem of correctly classifying a set of chemical data samples designated as being either active or inactive in a biological screen. Training is performed at varying levels of intrinsic model complexity, and classification performance is compared to that of both K nearest neighbor and recursive partitioning models trained using the identical protocol. We find that, in cross-validation, the new method outperforms both of these standard techniques over a considerable range of user parameters. We discuss advantages and drawbacks of the new method, with particular emphasis on its parameter robustness, required training time, and performance with respect to chemical structural class.

2.
High-dimensional datasets can contain thousands of features, which can make classification tasks computationally very expensive. These datasets therefore need a feature selection step before classification. The main idea behind feature selection is to choose a useful subset of features that significantly improves the comprehensibility of a classifier and maximizes the performance of a classification algorithm. In this paper, we propose a one-per-class model for high-dimensional datasets. In the proposed method, we extract a different feature subset for each class in a dataset and apply the classification process to the multiple feature subsets. Finally, we merge the prediction results of the feature subsets and determine the final class label of an unknown instance. The originality of the proposed model lies in using an appropriate feature subset for each class. To show the usefulness of the proposed approach, we have developed an application method following the proposed model. Our results confirm that the method produces higher classification accuracy than previously proposed feature selection and classification methods.
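The one-per-class idea described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the nearest-centroid scoring, the layout of `models`, and the function names `class_score` and `predict` are all assumptions introduced for this example. Each class is scored only on its own feature subset, and the per-class scores are merged by picking the best one.

```python
def class_score(x, centroid, feature_idx):
    """Score sample x for one class using only that class's feature subset.

    Assumption: the score is the negative mean squared distance to a class
    centroid, averaged over the subset so that subsets of different sizes
    remain comparable.
    """
    dist = sum((x[j] - centroid[j]) ** 2 for j in feature_idx) / len(feature_idx)
    return -dist


def predict(x, models):
    """Merge per-class scores: return the label whose class-specific model scores highest.

    models maps label -> (centroid, feature_idx); both are hypothetical inputs
    that would come from a per-class feature selection step.
    """
    return max(models, key=lambda label: class_score(x, *models[label]))


# Toy example: class "A" is judged on features 0 and 1, class "B" on feature 2.
models = {
    "A": ([0.0, 0.0, 5.0], [0, 1]),
    "B": ([9.0, 9.0, 1.0], [2]),
}
print(predict([0.1, 0.2, 5.0], models))  # prints A
print(predict([5.0, 5.0, 1.1], models))  # prints B
```

Any per-class scorer could replace the centroid distance; the point of the sketch is only that each class sees its own features before the results are merged.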

3.
High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). We therefore hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem.
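To make the Sensitivity and Gmean measures concrete, here is a small sketch. It assumes the usual definition of Gmean as the geometric mean of sensitivity and specificity; the abstract does not spell out its formula, and the function names are chosen for this example only.

```python
import math


def sensitivity(tp, fn):
    """Fraction of actual positives (rare, active samples) that are detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives (inactive samples) that are correctly rejected."""
    return tn / (tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity.

    Assumes both classes are present, so neither denominator is zero.
    """
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


# A classifier that labels every compound inactive on a 10-active / 990-inactive set:
accuracy = (0 + 990) / 1000
print(accuracy)               # 0.99
print(g_mean(0, 10, 990, 0))  # 0.0
```

The toy case shows why plain accuracy is misleading on imbalanced data: ignoring the minority class entirely still yields 99% accuracy, while Gmean drops to zero.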

4.
Drug discovery processes require highly accurate drug-target interaction (DTI) prediction by virtual screening. Compared with traditional methods, deep learning requires less time and domain expertise while achieving higher accuracy. However, there is still room to improve performance with simpler architectures. Meanwhile, the field is calling for multi-task models that can solve different tasks. Here we report GanDTI, an end-to-end deep learning model for both interaction classification and binding affinity prediction. The model takes compound graphs and protein sequence data as input. It consists only of a graph neural network, an attention module, and a multi-layer perceptron, yet it outperforms state-of-the-art methods in predicting binding affinity and interaction classification on the DUD-E, human, and bindingDB benchmark datasets. This demonstrates that our refined model is highly effective and efficient for DTI prediction and provides a new strategy for performance improvement.

5.
Untargeted metabolomics based on liquid chromatography coupled with mass spectrometry (LC–MS) can detect thousands of features in samples and produce highly complex datasets. Accurate extraction of meaningful features and the building of discriminant models are two crucial steps in the untargeted metabolomics data analysis pipeline. In this study, pure ion chromatograms were extracted from a liquor dataset and a left-sided colon cancer (LCC) dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). A nonlinear low-dimensional embedding by uniform manifold approximation and projection (UMAP) then showed the separation of samples from different groups in reduced dimensions. Discriminant models were established by extreme gradient boosting (XGBoost) based on the features extracted by KPIC2. Features extracted by KPIC2 achieved 100% classification accuracy on the test sets of both the liquor and LCC datasets, compared with 92% and 96%, respectively, for XCMS, which supports the soundness of the KPIC2-based XGBoost model. XGBoost also performed better than the linear method and traditional nonlinear modeling methods on these datasets. UMAP and XGBoost have been integrated into the KPIC2 package to extend its performance in complex situations; they can process nonlinear datasets effectively and can greatly improve the accuracy of data analysis in untargeted metabolomics.

6.
7.
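To enable rapid, accurate, and non-destructive identification of heavy mineral oil evidence in forensic science, this paper proposes a spectroscopy-based method that analyzes combinations of multi-order derivative spectral data. Eighty heavy mineral oil samples of different models and from different manufacturers were collected. Raw and derivative spectral data were acquired by Fourier transform Raman spectroscopy, and classification models were built with chemometric methods. In the classification model combining principal component analysis (PCA) with a radial basis function (RBF) neural network, the training-set accuracies for the raw spectra, first-derivative spectra, and second-derivative spectra alone were 80.0%, 86.7%, and 86.2%, respectively, and the test-set accuracies were 73.3%, 80.0%, and 72.7%. For the combined data (raw + first-derivative, raw + second-derivative, and first-derivative + second-derivative spectra), the training-set accuracies were 97.0%, 96.7%, and 100%, and the test-set accuracies were 85.7%, 90.0%, and 100%, respectively. The results show that classification models built on combined derivative and raw spectra are more accurate. The PCA-RBF model built on the first-derivative + second-derivative data gave the best result, with an accuracy of 100%. The K nearest neighbor model, by contrast, was affected by the uneven distribution of samples and had low overall classification accuracy. Building classification models on combined derivative and raw spectral data allows rapid, accurate, and non-destructive discrimination of heavy mineral oil samples, and may serve as a reference for applying combined-spectra techniques in forensic science and other analytical testing fields.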
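The combination of derivative spectra can be sketched as follows. This is a minimal illustration under stated assumptions: unit spacing between spectral points, plain finite differences for the derivatives (the paper may well use a smoothing derivative filter instead), and simple concatenation as the combination step. The function names are hypothetical.

```python
def first_derivative(y):
    """Finite-difference derivative of a spectrum sampled at unit spacing.

    Central differences are used in the interior and one-sided differences
    at the two end points, so the output has the same length as the input.
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two points to differentiate")
    d = [y[1] - y[0]]
    d += [(y[i + 1] - y[i - 1]) / 2 for i in range(1, n - 1)]
    d.append(y[-1] - y[-2])
    return d


def combine(*blocks):
    """Concatenate several spectral blocks into one feature vector."""
    combined = []
    for block in blocks:
        combined.extend(block)
    return combined


raw = [0, 1, 4, 9, 16]          # toy "spectrum"
d1 = first_derivative(raw)      # first-derivative spectrum
d2 = first_derivative(d1)       # second-derivative spectrum
features = combine(d1, d2)      # first- + second-derivative combination
print(len(features))            # 10
```

The combined vector would then be passed to a dimensionality reduction and classification step such as the PCA and RBF network named in the abstract, which are not reproduced here.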

8.
Nature-inspired evolutionary algorithms have proven effective for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) with the Artificial Bee Colony (ABC) algorithm, with the goal of integrating the advantages of both. The proposed algorithm is applied to microarray gene expression profiles in order to select the most predictive and informative genes for cancer classification. Extensive experiments were conducted to test the accuracy of the proposed algorithm. Three binary microarray datasets are used: colon, leukemia, and lung. In addition, three multi-class microarray datasets are used: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique, mRMR combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared against the combination of mRMR with GA (mRMR-GA) and with Particle Swarm Optimization (mRMR-PSO). In addition, we compared the GBC algorithm with other related, recently published algorithms on all benchmark datasets. The GBC algorithm shows superior performance, achieving the highest classification accuracy along with the lowest average number of selected genes. This indicates that the GBC algorithm is a promising approach to the gene selection problem in both binary and multi-class cancer classification.

9.
Precise information about protein locations in a cell facilitates understanding of a protein's function and its interactions in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on the fusion of individual classifiers. The proposed approach uses features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. First, the performance of individual learning mechanisms (support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant) trained on PseAAC-based features is analyzed. Classifiers are then developed using the same learning mechanism but trained on PseAAC-based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, which improves prediction performance. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC-based feature space and PseAAC-based feature spaces of varying dimensions. The predictive performance of CE-PLoc is evaluated on two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47% and 83.99% are obtained for the 12- and 14-location datasets, respectively. In the independent dataset test, prediction accuracies are 87.04% and 87.33% for the 12- and 14-class datasets, respectively.
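The fusion of individual classifiers by voting can be sketched as below. The abstract only says "voting strategy", so plain majority voting is an assumption here, as are the function name and the toy location labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse the outputs of several classifiers by majority vote.

    predictions is a list with one entry per classifier; each entry is a list
    of predicted labels, one per sample, and all entries have equal length.
    Ties are resolved by whichever label Counter reports first; a real
    ensemble would need an explicit tie-breaking rule.
    """
    n_samples = len(predictions[0])
    fused = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        fused.append(votes.most_common(1)[0][0])
    return fused


# Three hypothetical base classifiers predicting locations for three proteins.
svm_preds = ["nucleus", "cytoplasm", "membrane"]
knn_preds = ["nucleus", "membrane", "membrane"]
pnn_preds = ["cytoplasm", "cytoplasm", "membrane"]
print(majority_vote([svm_preds, knn_preds, pnn_preds]))
# ['nucleus', 'cytoplasm', 'membrane']
```

Each base classifier is wrong on a different sample in this toy case, which is the situation in which voting helps most.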

10.
The docking performance of the FRED and HYBRID programs is evaluated on two standardized datasets from the Docking and Scoring Symposium of the ACS Spring 2011 national meeting. The evaluation includes cognate docking and virtual screening performance. FRED docks 70% of the structures to within 2 Å in the cognate docking test. In the virtual screening test, FRED is found to have a mean AUC of 0.75. The HYBRID program uses a modified version of FRED's algorithm that uses both ligand- and structure-based information to dock molecules, which increases its mean AUC to 0.78. HYBRID can also implicitly account for protein flexibility by making use of multiple crystal structures. Using multiple crystal structures improves HYBRID's performance (mean AUC 0.80) with a negligible increase in docking time (~15%).
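For readers unfamiliar with the AUC figures quoted above, the following sketch computes a ROC AUC as the probability that a randomly chosen active outranks a randomly chosen decoy. It assumes that a higher score means a better-ranked compound; many docking scores run the other way and would need their sign flipped first. The function name and the toy scores are made up for illustration.

```python
def roc_auc(scores_active, scores_decoy):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (active, decoy) pairs in which the active has the
    higher score; ties count as half a win. This equals the ROC AUC.
    """
    wins = 0.0
    for a in scores_active:
        for d in scores_decoy:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(scores_active) * len(scores_decoy))


print(roc_auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from decoys.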

11.
Class prediction based on DNA microarray data has emerged as one of the most important applications of bioinformatics for diagnostics and prognostics. Robust classifiers are needed that use the most biologically relevant genes embedded in the data. A consensus approach that combines multiple classifiers has attributes that mitigate this difficulty compared with a single classifier. A new classification method, consensus analysis of multiple classifiers using non-repetitive variables (CAMCUN), was proposed for the analysis of hyper-dimensional gene expression data. The CAMCUN method combined multiple classifiers, each of which was built from distinct, non-repeated genes that were selected for their effectiveness in class differentiation. CAMCUN thus used the most biologically relevant genes in the final classifier. The CAMCUN algorithm was shown to give consistently more accurate predictions for two well-known datasets, for prostate cancer and leukemia. Importantly, the CAMCUN algorithm employed an integrated 10-fold cross-validation and randomization test to assess the degree of confidence of the predictions for unknown samples.

12.
There is currently far more sequence information than structural information available, and the ability to use homology models for virtual screening applications is desirable in many cases where structures have not yet been solved. This review focuses on the application of protein kinase homology models for virtual screening use. In addition to reviewing previous cases in which kinase homology models have been used in inhibitor design, we present new data, useful for template selection in homology modeling applications, indicating that the template structure with the highest sequence or structural similarity with the target structure may not always be the best choice. This new work explored the simple hypothesis that better results might be obtained for docking a ligand to a target receptor using a homology model of the target created from a different kinase template co-crystallized with the ligand, than from a crystal structure of the actual kinase target that is unliganded or bound to an unrelated ligand. This hypothesis was tested in docking studies of staurosporine with eight different kinases: AutoDock was used to dock staurosporine to homology models of each kinase created from staurosporine-bound template structures, and the results were compared with docking staurosporine to crystal structures of the target kinase that were obtained in complex with a non-staurosporine ligand or no ligand. It was found that the homology models performed as well as or better than the crystal structures, suggesting that using a homology model created from a template crystallized with a representative ligand may in some cases be a preferred approach, especially in virtual screening experiments that focus on enriching for members of a particular inhibitor class.

13.
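Aerosols are an important component of the atmosphere and have a significant influence on climate and the ecological environment. When laser-induced breakdown spectroscopy (LIBS) is used for aerosol detection, the discrete distribution of aerosol particles causes a large number of invalid spectra to be collected. This paper proposes K-SVD-SVM, a method that uses dictionary learning to screen for valid spectral data. Seven NaCl aerosol samples of different concentrations were prepared, and 5000 spectra from the 10% NaCl solution were selected for classification, with 70% used as the training set and 30% as the test set. The model classified best when the number of dictionary basis vectors was set to 3, with accuracy, precision, recall, and the harmonic mean of precision and recall (F1) reaching 96%, 95%, 95%, and 0.95, respectively. In addition, the spectra of the seven aerosol samples were screened with K-SVD-SVM and fed into a GA-ELM model for quantitative analysis, and the unscreened raw spectra were fed into the same quantitative model for comparison. For the unscreened raw data, the test-set RMSE and R2 were 0.0303 and 0.8726; after screening, they improved to 0.0187 and 0.9809. The results show that K-SVD-SVM classifies well and that the valid data it selects can support the quantitative analysis of elements in aerosols.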
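The F1 value reported above follows directly from the reported precision and recall, as this short check shows. The function name is chosen for this example; the formula is the standard harmonic mean.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of 95% each, as reported for K-SVD-SVM.
print(round(f1_score(0.95, 0.95), 2))  # 0.95
```

Because the harmonic mean is dominated by the smaller of its two inputs, F1 equals precision and recall only when the two coincide, as they do here.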

14.
Gene expression data are characterized by thousands or even tens of thousands of measured genes on only a few tissue samples. This can lead to overfitting and the curse of dimensionality, or even to a complete failure in the analysis of microarray data. Gene selection is an important component of gene expression-based tumor classification systems. In this paper, we develop a hybrid particle swarm optimization (PSO) and tabu search (HPSOTS) approach to gene selection for tumor classification. Incorporating tabu search (TS) as a local improvement procedure enables HPSOTS to escape local optima and show satisfactory performance. The proposed approach is applied to three different microarray datasets. Moreover, we compare the performance of HPSOTS on these datasets with that of stepwise selection and of the pure TS and PSO algorithms. The results demonstrate that HPSOTS is a useful tool for gene selection and for mining high-dimensional data.

15.
16.
Background: Discovering possible drug-target interactions (DTIs) is a decisive step in detecting the effects of drugs and in drug repositioning. There is a strong incentive to develop computational methods that can effectively predict potential DTIs, as traditional laboratory DTI experiments are expensive, time-consuming, and labor-intensive. Some technologies have been developed for this purpose; however, large numbers of interactions have not yet been detected, prediction accuracy is still low, and protein sequences and structured data are rarely used together in the prediction process. Methods: This paper presents a DTI prediction model that takes advantage of the structured form of proteins and drugs. The model obtains features from protein amino-acid sequences using physical and chemical properties, and from drug SMILES (Simplified Molecular Input Line Entry System) strings using encoding techniques. In a comparison with existing methods under K-fold cross-validation, empirical results show that the model, which is based on ensemble learning algorithms, provides more accurate DTI predictions from both structure and feature data. Results: The proposed model is applied to two datasets: Benchmark (feature-only) datasets and DrugBank (structure data) datasets. Light-Boost and ExtraTree using structure and feature data achieve 98% accuracy and an F-score of 0.97, compared with 94% and 0.92 for existing methods. Moreover, the model can successfully predict additional, as yet undiscovered interactions, and hence can be used as a practical tool for drug repositioning. As a case study, the prediction model is applied to proteins known to be affected by coronaviruses in order to predict possible interactions between these proteins and existing drugs. The model is also applied to the Covid-19-related drugs listed on DrugBank. The results show that some drugs, such as DB00691 and DB05203, are predicted with 100% accuracy to interact with the ACE2 protein. This protein is a self-membrane protein that enables Covid-19 infection. Hence, the model can be used as an effective tool in drug repositioning to predict possible drug treatments for Covid-19.

17.
The objective of this study was to use linear discriminant analysis (LDA) in the interpretation of capillary electrophoresis-sodium dodecyl sulfate polymer-filled capillary gel electrophoresis (CE-SDS) meat protein profiles for the identification of meat species. The specific objectives were 1) to collect quantitative data on water-soluble and saline-soluble proteins of different meat species obtained by CE-SDS and 2) to apply LDA to the collected CE-SDS protein data to develop a statistical pattern recognition model useful in differentiating meat species. Samples were raw beef top and eye round, boneless fresh pork ham and loin, turkey leg and breast meat, and mechanically deboned turkey meat collected on six different occasions, making a total of 42 samples. Additionally, 14 samples were used as test samples to determine the classification ability of the procedure. Quantitative protein data obtained by CE-SDS were used to generate separate LDA models for water- and saline-soluble protein extracts. Although saline solution was a more efficient meat protein-extracting agent, as shown by a higher total protein concentration and a larger number of peaks, water-soluble CE-SDS protein profiles gave more distinctive discrimination among meat species. The correct classification rate given by LDA on water-soluble protein data was 100% for all meat species except pork (94%). By contrast, the correct classification rate on saline-soluble protein data was 88% for beef and mechanically deboned turkey meat, and 94% and 100% for turkey and pork meat, respectively. LDA proved to be a useful pattern recognition procedure in the interpretation of CE-SDS protein profiles for the identification of meat species.

18.
Homology modeling plays a central role in determining protein structure in the structural genomics project. The importance of homology modeling has been steadily increasing because of the large gap that exists between the overwhelming number of available protein sequences and experimentally solved protein structures, and also, more importantly, because of the increasing reliability and accuracy of the method. In fact, a protein sequence with over 30% identity to a known structure can often be predicted with an accuracy equivalent to a low-resolution X-ray structure. The recent advances in homology modeling, especially in detecting distant homologues, aligning sequences with template structures, modeling of loops and side chains, as well as detecting errors in a model, have contributed to reliable prediction of protein structure, which was not possible even several years ago. The ongoing efforts in solving protein structures, which can be time-consuming and often difficult, will continue to spur the development of a host of new computational methods that can fill in the gap and further contribute to understanding the relationship between protein structure and function.

19.
With the aim of obtaining a monitoring tool to assess the quality of water, a multivariate statistical procedure is proposed that couples cluster analysis (CA) with the soft independent modelling of class analogy (SIMCA) algorithm to provide an effective classification method. The experimental data set, collected throughout 2004, was composed of analytical parameters from 68 water sources in a vast area southwest of Paris. Nine variables carrying the most useful information were selected and investigated (nitrate, sulphate, chloride, turbidity, conductivity, hardness, alkalinity, coliforms and Escherichia coli). Principal component analysis provided considerable data reduction, with the first two principal components capturing about 92.2% of the total variance. CA grouped samples belonging to different sites, distinctly correlating them with chemical variables, and a classification model was built by SIMCA. This model was optimised and validated and then applied to a new data matrix, consisting of the parameters measured during 2005 for the same sources, providing a fast and accurate classification of all the samples. Most of the examined sources appeared unchanged over the 2-year period, but five sources were assigned to different classes, owing to statistically significant changes in some characteristic analytical parameters.

20.
