Similar articles
1.
We present results of a new computational learning algorithm combining favorable elements of two well-known techniques: K nearest neighbors and recursive partitioning. Like K nearest neighbors, the method provides an independent prediction for each test sample under consideration, while like recursive partitioning, it incorporates an automatic selection of important input variables for model construction. The new method is applied to the problem of correctly classifying a set of chemical data samples designated as being either active or inactive in a biological screen. Training is performed at varying levels of intrinsic model complexity, and classification performance is compared to that of both K nearest neighbor and recursive partitioning models trained using the identical protocol. We find that the cross-validated performance of the new method outperforms both of these standard techniques over a considerable range of user parameters. We discuss advantages and drawbacks of the new method, with particular emphasis on its parameter robustness, required training time, and performance with respect to chemical structural class.

2.
High-dimensional datasets contain up to thousands of features and can result in immense computational costs for classification tasks. Therefore, these datasets need a feature selection step before the classification process. The main idea behind feature selection is to choose a useful subset of features that significantly improves the comprehensibility of a classifier and maximizes the performance of a classification algorithm. In this paper, we propose a one-per-class model for high-dimensional datasets. In the proposed method, we extract a different feature subset for each class in a dataset and apply the classification process to the multiple feature subsets. Finally, we merge the prediction results of the feature subsets and determine the final class label of an unknown instance. The originality of the proposed model lies in using an appropriate feature subset for each class. To show the usefulness of the proposed approach, we have developed an application method following the proposed model. Our results confirm that our method produces higher classification accuracy than previously proposed feature selection and classification methods.

3.
High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but markedly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combined algorithm based on GLMBoost and SMOTE can be used extensively to tackle the imbalanced classification problem.
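
To illustrate why Gmean rather than plain accuracy is reported for imbalanced screens, the following minimal Python sketch computes sensitivity, specificity, and their geometric mean from confusion-matrix counts. The function names and the example counts are hypothetical and are not taken from the paper; Gmean is assumed here to be the usual square root of sensitivity times specificity.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true actives (minority class) that are recovered."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true inactives (majority class) that are recovered."""
    return tn / (tn + fp)


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Plain fraction of correct predictions."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical screen with 10 actives and 1000 inactives.
# A classifier that labels everything inactive looks accurate but finds nothing.
all_inactive = dict(tp=0, fn=10, tn=1000, fp=0)
# A classifier that recovers 8 actives at the cost of 100 false positives.
balanced = dict(tp=8, fn=2, tn=900, fp=100)

for name, counts in (("all inactive", all_inactive), ("balanced", balanced)):
    print(name, round(accuracy(**counts), 3), round(g_mean(**counts), 3))
```

In this made-up example the trivial classifier has the higher accuracy but a Gmean of zero, which is the failure mode that a balanced metric is meant to expose.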

4.
Untargeted metabolomics based on liquid chromatography coupled with mass spectrometry (LC–MS) can detect thousands of features in samples and produce highly complex datasets. The accurate extraction of meaningful features and the building of discriminant models are two crucial steps in the data analysis pipeline of untargeted metabolomics. In this study, pure ion chromatograms were extracted from a liquor dataset and a left-sided colon cancer (LCC) dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). Then, a nonlinear low-dimensional embedding by uniform manifold approximation and projection (UMAP) showed the separation of samples from different groups in reduced dimensions. The discriminant models were established by extreme gradient boosting (XGBoost) based on the features extracted by KPIC2. The features extracted by KPIC2 achieved 100% classification accuracy on the test sets of both the liquor dataset and the LCC dataset, compared with 92% and 96%, respectively, for the results obtained with XCMS, which supports the XGBoost model based on KPIC2. Finally, XGBoost achieved better performance than the linear method and traditional nonlinear modeling methods on these datasets. UMAP and XGBoost are integrated into the KPIC2 package to extend its performance in complex situations; they can not only process nonlinear datasets effectively but also greatly improve the accuracy of data analysis in untargeted metabolomics.

5.
6.
Precise information about protein locations in a cell facilitates the understanding of the function of a protein and its interactions in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on the fusion of individual classifiers. The proposed approach uses features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of individual learning mechanisms, namely support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, trained using PseAAC-based features is analyzed first. Classifiers are then developed using the same learning mechanism but trained on PseAAC-based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC-based feature space and PseAAC-based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated on two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47 and 83.99% are obtained for the 12- and 14-location datasets, respectively. In the independent dataset test, prediction accuracies are 87.04 and 87.33% for the 12- and 14-class datasets, respectively.
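
The abstract states that the classifiers are combined through a voting strategy but does not specify which one. As a minimal sketch, assuming plain majority voting over the labels predicted by the base classifiers, the fusion step could look like the following; the function name, the tie-breaking rule, and the example labels are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the label predicted by the most classifiers.

    Ties are broken in favour of the label that appears first in
    `predictions` (a simplifying assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of four base classifiers for one protein.
votes = ["nucleus", "cytoplasm", "nucleus", "mitochondrion"]
print(majority_vote(votes))
```

Weighted voting, for example by each classifier's cross-validated accuracy, would be a natural variant, but whether CE-PLoc uses weights is not stated in the abstract.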

7.
Drug discovery requires highly accurate drug-target interaction (DTI) prediction by virtual screening. Compared with traditional methods, deep learning requires less time and domain expertise while achieving higher accuracy. However, there is still room to improve performance with simpler architectures. Meanwhile, the field is calling for multi-task models that can solve different tasks. Here we report GanDTI, an end-to-end deep learning model for both interaction classification and binding affinity prediction. The model takes the compound graph and protein sequence data as input. It consists only of a graph neural network, an attention module, and a multilayer perceptron, yet it outperforms state-of-the-art methods in binding affinity prediction and interaction classification on the DUD-E, human, and bindingDB benchmark datasets. This demonstrates that our refined model is highly effective and efficient for DTI prediction and provides a new strategy for performance improvement.

8.
Nature-inspired evolutionary algorithms have proved effective for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) with the Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to microarray gene expression profiles in order to select the most predictive and informative genes for cancer classification. To test the accuracy of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets are used: colon, leukemia, and lung. In addition, three multi-class microarray datasets are used: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique, mRMR combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and with Particle Swarm Optimization (mRMR-PSO). In addition, we compared the GBC algorithm with other related algorithms recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance, achieving the highest classification accuracy along with the lowest average number of selected genes. This indicates that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification.

9.
The docking performance of the FRED and HYBRID programs is evaluated on two standardized datasets from the Docking and Scoring Symposium of the ACS Spring 2011 national meeting. The evaluation includes cognate docking and virtual screening performance. FRED docks 70% of the structures to within 2 Å in the cognate docking test. In the virtual screening test, FRED is found to have a mean AUC of 0.75. The HYBRID program uses a modified version of FRED's algorithm that uses both ligand- and structure-based information to dock molecules, which increases its mean AUC to 0.78. HYBRID can also implicitly account for protein flexibility by making use of multiple crystal structures. Using multiple crystal structures improves HYBRID's performance (mean AUC 0.80) with a negligible increase in docking time (~15%).
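
For readers unfamiliar with the AUC figures quoted above, the following minimal Python sketch computes a ROC AUC for a virtual screen as the probability that a randomly chosen active is ranked above a randomly chosen decoy. It assumes that a higher score means a better rank, which may be the opposite of the sign convention of a given docking program; the function name and the scores are hypothetical and unrelated to FRED or HYBRID.

```python
from typing import Sequence


def roc_auc(active_scores: Sequence[float], decoy_scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random active outranks a random decoy.

    Assumes that a higher score means a better-ranked molecule; ties count 0.5.
    """
    if not active_scores or not decoy_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical scores: three actives and two decoys.
print(roc_auc([0.9, 0.6, 0.4], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from decoys. The pairwise loop is quadratic in the number of molecules, which is fine for illustration but slow for large screens.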

10.
Computational models to predict the developmental toxicity of compounds are built on imbalanced datasets wherein the toxicants outnumber the non-toxicants. Consequently, the results are biased towards the majority class (toxicants). To overcome this problem and to obtain sensitive but also accurate classifiers, we followed an integrated approach wherein (i) the Synthetic Minority Over-sampling Technique (SMOTE) is used for re-sampling, (ii) a genetic algorithm (GA) is used for variable selection, and (iii) a support vector machine (SVM) is used for model development. The best model, M3, has (i) sensitivity (SE) = 85.54% and specificity (SP) = 85.62% in leave-one-out validation, (ii) classification accuracy of the training set = 99.67%, (iii) classification accuracy of the test set = 92.59%, and (iv) sensitivity = 92.68% and specificity = 92.31% on the test set. Consensus prediction based on models M3–M5 improved these percentages by 5% over M3. From the analysis of the results we infer that data imbalance in toxicity studies can be effectively addressed by the application of re-sampling techniques.
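
The re-sampling step can be pictured with a small sketch of the core SMOTE idea: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. This is a simplified illustration, not the implementation used in the paper; the function names, the parameter defaults, and the toy data are assumptions.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def nearest_neighbors(x: Point, pool: Sequence[Point], k: int) -> List[Point]:
    """Return the k points of `pool` closest to `x`, excluding `x` itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote(minority: Sequence[Point], n_new: int, k: int = 5,
          seed: int = 0) -> List[List[float]]:
    """Generate `n_new` synthetic minority samples by interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, k))
        gap = rng.random()  # position along the segment from x to neighbor
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic


# Toy minority class with three samples in two dimensions.
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(minority_samples, n_new=20, k=2)
print(len(new_samples), new_samples[0])
```

Because every synthetic point is an interpolation between two existing minority samples, the new points stay inside the region already occupied by that class. Re-sampling should be applied to the training data only, so that validation and test figures are not inflated by synthetic samples.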

11.
There is currently far more sequence information than structural information available, and the ability to use homology models for virtual screening applications is desirable in many cases where structures have not yet been solved. This review focuses on the application of protein kinase homology models for virtual screening use. In addition to reviewing previous cases in which kinase homology models have been used in inhibitor design, we present new data - useful for template selection in homology modeling applications - indicating that the template structure with the highest sequence or structural similarity with the target structure may not always be the best choice. This new work explored the simple hypothesis that better results might be obtained for docking a ligand to a target receptor using a homology model of the target created from a different kinase template co-crystallized with the ligand, than from a crystal structure of the actual kinase target that is unliganded or bound to an unrelated ligand. This hypothesis was tested in docking studies of staurosporine with eight different kinases: AutoDock was used to dock staurosporine to homology models of each kinase created from staurosporine-bound template structures, and the results were compared with docking staurosporine to crystal structures of the target kinase that were obtained in complex with a non-staurosporine ligand or no ligand. It was found that the homology models performed as well as or better than the crystal structures, suggesting that using a homology model created from a template crystallized with a representative ligand may in some cases be a preferred approach, especially in virtual screening experiments that focus on enriching for members of a particular inhibitor class.

12.
Sankaran S, Ehsani R, Etxeberria E. Talanta, 2010, 83(2): 574-581
In recent years, Huanglongbing (HLB), also known as citrus greening, has greatly affected citrus orchards in Florida. This disease has caused significant economic and production losses, costing about $750/acre for HLB management. Early and accurate detection of HLB is a critical management step to control the spread of this disease. This work focuses on the application of mid-infrared spectroscopy for the detection of HLB in citrus leaves. Leaf samples of healthy, nutrient-deficient, and HLB-infected trees were processed in two ways (process-1 and process-2) and analyzed using a rugged, portable mid-infrared spectrometer. Spectral absorbance data from the range of 5.15-10.72 μm (1942-933 cm−1) were preprocessed (baseline correction, negative offset correction, and removal of the water absorbance band) and used for data analysis. The first and second derivatives were calculated using the Savitzky-Golay method. The preprocessed raw dataset, first derivatives dataset, and second derivatives dataset were first analyzed by principal component analysis. Then, the selected principal component scores were classified using two classification algorithms, quadratic discriminant analysis (QDA) and k-nearest neighbor (kNN). When the spectral data from leaf samples processed using process-1 were used for data analysis, the kNN-based algorithm yielded higher classification accuracies (especially for the nutrient-deficient leaf class) than with the spectral data from process-2. The performance of the kNN-based algorithm (higher than 95%) was better than that of the QDA-based algorithm. Moreover, among the different types of datasets, the preprocessed raw dataset resulted in higher classification accuracies than the first and second derivatives datasets. The spectral peak in the region of 9.0-10.5 μm (952-1112 cm−1) was found to be distinctly different between the healthy and HLB-infected leaf samples. This carbohydrate peak could be attributed to starch accumulation in the HLB-infected citrus leaves. Thus, this study demonstrates the applicability of mid-infrared spectroscopy for HLB detection in citrus.
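
The abstract quotes the spectral range both as wavelength (μm) and as wavenumber (cm−1). As a small worked example, the conversion is simply 10,000 divided by the wavelength in micrometres; the function name below is hypothetical.

```python
def wavelength_um_to_wavenumber_cm1(wavelength_um: float) -> float:
    """Convert a wavelength in micrometres to a wavenumber in cm^-1."""
    if wavelength_um <= 0:
        raise ValueError("wavelength must be positive")
    # 1 cm = 10,000 um, so wavenumber = 10,000 / wavelength.
    return 1.0e4 / wavelength_um


# Limits of the analyzed range quoted in the abstract.
for wavelength in (5.15, 10.72):
    print(wavelength, round(wavelength_um_to_wavenumber_cm1(wavelength)))
# prints 1942 for 5.15 um and 933 for 10.72 um
```

Because the relation is reciprocal, the short-wavelength end of the range corresponds to the high-wavenumber end, which is why the limits appear in reversed order in the abstract.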

13.
14.
Homology modeling plays a central role in determining protein structure in the structural genomics project. The importance of homology modeling has been steadily increasing because of the large gap that exists between the overwhelming number of available protein sequences and experimentally solved protein structures, and also, more importantly, because of the increasing reliability and accuracy of the method. In fact, a protein sequence with over 30% identity to a known structure can often be predicted with an accuracy equivalent to a low-resolution X-ray structure. The recent advances in homology modeling, especially in detecting distant homologues, aligning sequences with template structures, modeling of loops and side chains, as well as detecting errors in a model, have contributed to reliable prediction of protein structure, which was not possible even several years ago. The ongoing efforts in solving protein structures, which can be time-consuming and often difficult, will continue to spur the development of a host of new computational methods that can fill in the gap and further contribute to understanding the relationship between protein structure and function.

15.
A sparse representation classification method for tobacco leaves based on near-infrared spectroscopy and a deep learning algorithm is reported in this paper. All training samples were used to make up a data dictionary for the sparse representation, and the test samples were represented by the sparsest linear combinations of the dictionary by sparse coding. The regression residual of the test sample with respect to each class was computed, and the sample was assigned to the class with the minimum residual. The effectiveness of the sparse representation classification method was compared with that of the K-nearest neighbor and particle swarm optimization–support vector machine algorithms. The results show that the classification accuracy of the proposed method is higher and that it is more efficient. The results suggest that near-infrared spectroscopy with the sparse representation classification algorithm may be an alternative to traditional methods for discriminating classes of tobacco leaves.

16.
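A neuron classifier based on stepwise correlated component analysis for quality identification of Chinese medicinal materials
A neuron classifier based on stepwise correlated component analysis (SCCA-HBP) is proposed and constructed, and applied to the quality pattern classification of Chinese medicinal materials. Classification-relevant components are extracted stepwise from the high-dimensional datasets obtained by chromatographic analysis to yield chemical pattern feature vectors, which reduces the dimensionality of the input pattern vectors of the neuron classifier. In addition, a hybrid BP algorithm with an output-error dead zone is proposed for training the neuron classifier, which improves both the training speed of the network and the classification accuracy. The method was examined on the quality-grade classification of 32 Angelica sinensis (danggui) samples. The classification accuracy was 100%, better than that of the PCA-BP (84.4%) and SCCA-BP (90.6%) methods, and the training time was only 54.2% of that of the BP algorithm.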

17.
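To achieve rapid, accurate, and non-destructive identification of heavy mineral oil evidence in forensic science, this paper proposes a method based on spectral analysis that combines multi-order derivative spectral data. Eighty heavy mineral oil samples of different models and from different manufacturers were collected. Fourier transform Raman spectroscopy was used to acquire the raw spectra and derivative spectra of the samples, and classification models were built in combination with chemometrics. In the classification model combining principal component analysis (PCA) with a radial basis function (RBF) neural network, the training-set accuracies for the raw spectra, first-derivative spectra, and second-derivative spectra used individually were 80.0%, 86.7%, and 86.2%, respectively, and the test-set accuracies were 73.3%, 80.0%, and 72.7%, respectively. For the combined raw + first-derivative, raw + second-derivative, and first-derivative + second-derivative data, the training-set accuracies were 97.0%, 96.7%, and 100%, and the test-set accuracies were 85.7%, 90.0%, and 100%, respectively. The results show that classification models built on combined derivative and raw spectra achieve higher accuracy. Among them, the PCA-RBF classification model built on the first-derivative + second-derivative data gave the best results, with an accuracy of 100%. The K-nearest-neighbor model, affected by the uneven distribution of samples, gave low overall classification accuracy. Building classification models from combined derivative and raw spectral data enables rapid, accurate, and non-destructive discrimination of heavy mineral oil samples and may provide a reference for the application of combined spectral techniques in forensic science and other fields of analytical testing.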

18.
Drug-likeness prediction is important for the virtual screening of drug candidates. It is challenging because the drug-likeness is presumably associated with the whole set of necessary properties to pass through clinical trials, and thus no definite data for regression is available. Recently, binary classification models based on graph neural networks have been proposed but with strong dependency of their performances on the choice of the negative set for training. Here we propose a novel unsupervised learning model that requires only known drugs for training. We adopted a language model based on a recurrent neural network for unsupervised learning. It showed relatively consistent performance across different datasets, unlike such classification models. In addition, the unsupervised learning model provides drug-likeness scores that well separate distributions with increasing mean values in the order of datasets composed of molecules at a later step in a drug development process, whereas the classification model predicted a polarized distribution with two extreme values for all datasets presumably due to the overconfident prediction for unseen data. Thus, this new concept offers a pragmatic tool for drug-likeness scoring and further can be applied to other biochemical applications.

A new quantification method of drug-likeness based on unsupervised learning. The method uses only drug molecules as the training set, without any non-drug-like molecules.

19.
A first step toward predicting the structure of a protein is to determine its secondary structure. The secondary structure information is generally used as starting point to solve protein crystal structures. In the present study, a machine learning approach based on a complete set of two-class scoring functions was used. Such functions discriminate between two specific structural classes or between a single specific class and the rest. The approach uses a hierarchical scheme of scoring functions and a neural network. The parameters are determined by optimizing the recall of learning data. Quality control is performed by predicting separate independent test data. A first set of scoring functions is trained to correlate the secondary structures of residues with profiles of sequence windows of width 15, centered at these residues. The sequence profiles are obtained by multiple sequence alignment with PSI-BLAST. A second set of scoring functions is trained to correlate the secondary structures of the center residues with the secondary structures of all other residues in the sequence windows used in the first step. Finally, a neural network is trained using the results from the second set of scoring functions as input to make a decision on the secondary structure class of the residue in the center of the sequence window. Here, we consider the three-class problem of helix, strand, and other secondary structures. The corresponding prediction scheme "SPARROW" was trained with the ASTRAL40 database, which contains protein domain structures with less than 40% sequence identity. The secondary structures were determined with DSSP. In a loose assignment, the helix class contains all DSSP helix types (α, 3-10, π), the strand class contains β-strand and β-bridge, and the third class contains the other structures. In a tight assignment, the helix and strand classes contain only α-helix and β-strand classes, respectively. 
A 10-fold cross validation showed less than 0.8% deviation in the fraction of correct structure assignments between true prediction and recall of data used for training. Using sequences of 140,000 residues as a test data set, 80.46% ± 0.35% of secondary structures are predicted correctly in the loose assignment, a prediction performance very close to the best results in the field. Most applications are done with the loose assignment. However, the tight assignment yields 2.25% better prediction performance. With each individual prediction, we also provide a confidence measure giving the probability that the prediction is correct. The SPARROW software can be used and downloaded on the Web page http://agknapp.chemie.fu-berlin.de/sparrow/ .
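
The sequence windows of width 15 centered at each residue, mentioned in the abstract, can be pictured with a short Python sketch. It uses raw residue letters as a stand-in for the PSI-BLAST profiles that SPARROW actually works on, and the padding symbol, the function name, and the example sequence are assumptions made for illustration.

```python
from typing import List


def sequence_windows(sequence: str, width: int = 15, pad: str = "-") -> List[str]:
    """Return one window per residue, centered on that residue.

    Positions that fall outside the sequence are filled with `pad`
    (the padding symbol is an assumption of this sketch).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that a center residue exists")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Hypothetical ten-residue sequence.
example_sequence = "MKTAYIAKQR"
windows = sequence_windows(example_sequence)
print(windows[0])   # window centered on the first residue
print(windows[-1])  # window centered on the last residue
```

Each residue then contributes one fixed-length input, so a per-residue scoring function or classifier can be applied uniformly along the chain, including at the termini.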

20.
A new method has been developed for prediction of homology model quality directly from the sequence alignment, using multivariate regression. Hence, the expected quality of future homology models can be estimated using only information about the primary structure. This method has been applied to protein kinases and can easily be extended to other protein families. Homology model quality for a reference set of homology models was verified by comparison to experimental structures, by calculation of root-mean-square deviations (RMSDs) and comparison of interresidue contact areas. The homology model quality measures were then used as dependent variables in a Partial Least Squares (PLS) regression, using a matrix of alignment score profiles found from the Point Accepted Mutation (PAM) 250 similarity matrix as independent variables. This resulted in a regression model that can be used to predict the accuracy of future homology models from the sequence alignment. Using this method, one can identify the target-template combinations that are most likely to give homology models of sufficient quality. Hence, this method can be used to effectively choose the optimal templates to use for the homology modeling. The method's ability to guide the choice of homology modeling templates was verified by comparison of success rates to those obtained using BLAST scores and target-template sequence identities, respectively. The results indicate that the method presented here performs best in choosing the optimal homology modeling templates. Using this method, the optimal template was chosen in 86% of the cases, as compared to 62% using BLAST scores, and 57% using sequence identities. The method presented here can also be used to identify regions of the protein structure that are difficult to model, as well as alignment errors. Hence, this method is a useful tool for ensuring that the best possible homology model is generated.
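
As a reminder of what the RMSD quality measure captures, the following minimal Python sketch computes the root-mean-square deviation between a model and a reference structure. It assumes the two coordinate lists are already superimposed and matched atom by atom; the superposition step itself is omitted, and the function name and coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(model: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """RMSD between equally long, already superimposed coordinate lists."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))


# Two hypothetical atoms: one displaced by 5 units, one in place.
model_coords = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
reference_coords = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(rmsd(model_coords, reference_coords))
```

Because the deviations are squared before averaging, a few badly placed atoms dominate the value, which is one reason the abstract pairs RMSD with a second measure based on interresidue contact areas.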
