Similar articles
1.
Li-Juan Tang  Hai-Long Wu 《Talanta》2009,79(2):260-1694
One problem with discriminant analysis of microarray data is representation of each sample by a large number of genes that are possibly irrelevant, insignificant or redundant. Methods of variable selection are, therefore, of great significance in microarray data analysis. To circumvent the problem, a new gene mining approach is proposed based on the similarity between probability density functions on each gene for the class of interest with respect to the others. This method allows the ascertainment of significant genes that are informative for discriminating each individual class rather than maximizing the separability of all classes. Then one can select genes containing important information about the particular subtypes of diseases. Based on the mined significant genes for individual classes, a support vector machine with local kernel transform is constructed for the classification of different diseases. The combination of the gene mining approach with support vector machine is demonstrated for cancer classification using two public data sets. The results reveal that significant genes are identified for each cancer, and the classification model shows satisfactory performance in training and prediction for both data sets.

2.
Active learning with support vector machines in the drug discovery process   Cited by: 6 (self-citations: 0, citations by others: 6)
We investigate the following data mining problem from computer-aided drug design: from a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from machine learning for selecting the successive batches. Our main selection strategy is based on the maximum margin hyperplane generated by support vector machines. This hyperplane separates the current set of active compounds from the inactive ones and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum margin hyperplane clearly outperform the simpler ones.
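The batch selection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes a linear hyperplane `(w, b)` has already been fitted by some SVM trainer, and it assumes the rule "pick the unlabeled compounds nearest the hyperplane", which is one common margin-based variant that the abstract does not spell out. The function names and descriptor values are hypothetical.

```python
from typing import List, Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w.x + b of one compound's descriptor vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_batch(w: Sequence[float], b: float,
                 pool: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the unlabeled compounds closest to the hyperplane.

    Assumption: closeness to the hyperplane is used as the selection
    criterion; other margin-based rules are possible.
    """
    ranked = sorted(range(len(pool)),
                    key=lambda i: abs(decision_value(w, b, pool[i])))
    return ranked[:batch_size]


# Hypothetical 2-D descriptors for four unlabeled compounds.
pool = [(2.0, 2.0), (0.1, 0.0), (-3.0, 1.0), (0.5, -0.2)]
# Hypothetical hyperplane, assumed to come from an already trained linear SVM.
w, b = (1.0, 1.0), 0.0

batch = select_batch(w, b, pool, batch_size=2)
print(batch)  # indices of the two compounds to screen next
```

In a full loop the selected compounds would be tested, labeled, added to the training set, and the hyperplane refitted before the next batch is chosen.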

3.
Qi Shen  Wei-Min Shi  Bao-Xian Ye 《Talanta》2007,71(4):1679-1683
In the analysis of gene expression profiles, the number of tissue samples with gene expression levels available is usually small compared with the number of genes. This can lead either to overfitting or even to a complete failure in the analysis of microarray data. The selection of genes that are really indicative of the tissue classification concerned is becoming one of the key steps in microarray studies. In the present paper, we have combined modified discrete particle swarm optimization (PSO) and support vector machines (SVM) for tumor classification. The modified discrete PSO is applied to select genes, while the SVM is used as the classifier or the evaluator. The proposed approach was applied to the microarray data of 22 normal and 40 colon tumor tissues and showed good prediction performance. It has been demonstrated that the modified PSO is a useful tool for gene selection and for mining high-dimensional data.

4.
Proteins are the macromolecules responsible for almost all biological processes in a cell. With the availability of a large number of protein sequences from different sequencing projects, the challenge for scientists is to characterize their functions. As wet lab methods are time consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and nearest-neighbor classification on protein–protein interaction networks have been proposed. The support vector machine is one such method that has been used successfully for several problems such as protein fold recognition and protein structure prediction. Cai et al. in 2003 used SVM for classifying proteins into different functional classes and to predict their function. They used the physico-chemical properties of proteins to represent the protein sequences. In this paper, a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine the features that contribute significantly to functional classification, Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used, and it is observed that out of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes with an accuracy ranging from 91% to 94%. A comparison shows that OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.

5.
6.
7.
Classification problems have received considerable attention in biological and medical applications. In particular, classification methods combined with microarray technology play an important role in diagnosing and predicting disease, such as cancer, in medical research. The primary objective in classification is to build an optimal classifier based on the training sample in order to predict the unknown class in the test sample. In this paper, we propose a unified approach for optimal gene classification in conjunction with functional principal component analysis (FPCA) in a functional data analysis (FNDA) framework to classify time-course gene expression profiles based on information from the patterns. To derive an optimal classifier in FNDA, we also propose to find the optimal number of bases in the smoothing step and of functional principal components in FPCA using a cross-validation technique, and compare the performance of some popular classification techniques in the proposed setting. We illustrate the proposed method with a simulation study and a real-world data analysis.

8.
9.
Most studies concerning expression data analyses exploit information on the variability of gene intensity across samples. This information is sensitive to initial data processing, which affects the final conclusions. However, expression data also contain scale-free information, which is directly comparable between different samples. We propose to use the pairwise ratios of gene expression values rather than their absolute intensities for the classification of expression data. This information is robust to data processing and thus more attractive for classification analyses. In the proposed analysis scheme, only information on relative gene expression levels within each sample is exploited. Testing on publicly available datasets leads to superior classification results.
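The scale-free property claimed in this abstract can be illustrated with a short sketch. It assumes the features are plain within-sample ratios of every gene pair; the abstract does not say which pairs or transforms the authors actually use, and the gene names and intensities below are invented.

```python
from itertools import combinations
from typing import Dict, Tuple


def pairwise_ratios(sample: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Map each gene pair (a, b) to the within-sample ratio a / b."""
    genes = sorted(sample)
    return {(a, b): sample[a] / sample[b] for a, b in combinations(genes, 2)}


# Invented intensities for one sample.
raw = {"geneA": 200.0, "geneB": 50.0, "geneC": 100.0}
# The same sample after a hypothetical global rescaling step (factor 4).
scaled = {gene: 4.0 * value for gene, value in raw.items()}

ratios_raw = pairwise_ratios(raw)
ratios_scaled = pairwise_ratios(scaled)
for pair in ratios_raw:
    print(pair, ratios_raw[pair], ratios_scaled[pair])
```

Any multiplicative per-sample factor cancels in each ratio, which is why the two sets of ratio features agree while the absolute intensities differ by a factor of four.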

10.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contains a small number of samples which have a large number of gene expression levels as features. To select relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naive Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

11.
In this work, we introduce an active learning approach for the estimation of chemical concentrations from spectroscopic data. Its main objective is to opportunely collect training samples in such a way as to minimize the error of the regression process while minimizing the number of training samples used, and thus to reduce the costs related to training sample collection. In particular, we propose two different active learning strategies developed for regression approaches based on partial least squares regression, ridge regression, kernel ridge regression, and support vector regression. The first strategy uses a pool of regressors in order to select the samples with the greatest disagreements among the different regressors of the pool, while the second one is based on adding samples that are distant from the current training samples in the feature space. For support vector regression, a specific strategy based on the selection of the samples distant from the support vectors is proposed. Experimental results on three different real data sets are reported and discussed. Copyright © 2012 John Wiley & Sons, Ltd.
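The second strategy in this abstract (adding samples that are distant from the current training samples) might look like the sketch below. It assumes Euclidean distance in the raw feature space and a "largest minimum distance" rule; the abstract does not give the exact distance or criterion, and the spectra here are invented two-feature points.

```python
import math
from typing import Sequence


def select_most_distant(train: Sequence[Sequence[float]],
                        pool: Sequence[Sequence[float]]) -> int:
    """Return the index of the pool sample farthest from its nearest training sample.

    Assumption: distance is Euclidean and the criterion is max-min distance.
    """
    def distance_to_training_set(x: Sequence[float]) -> float:
        return min(math.dist(x, t) for t in train)

    return max(range(len(pool)), key=lambda i: distance_to_training_set(pool[i]))


# Invented feature vectors (for example, two absorbance values per sample).
train = [(0.0, 0.0), (1.0, 0.0)]
pool = [(0.5, 0.1), (3.0, 3.0), (1.0, 1.0)]

chosen = select_most_distant(train, pool)
print(chosen)  # index of the pool sample to measure and add next
```

Each selected sample would then be measured, moved from the pool to the training set, and the regressor refitted before the next selection.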

12.
We explore automation of protein structural classification using supervised machine learning methods on a set of 11,360 pairs of protein domains (up to 35% sequence identity) consisting of three secondary structure elements. Fifteen algorithms from five categories of supervised algorithms are evaluated for their ability to learn for a pair of protein domains, the deepest common structural level within the SCOP hierarchy, given a one-dimensional representation of the domain structures. This representation encapsulates evolutionary information in terms of sequence identity and structural information characterising the secondary structure elements and lengths of the respective domains. The evaluation is performed in two steps, first selecting the best performing base learners and subsequently evaluating boosted and bagged meta learners. The boosted random forest, a collection of decision trees, is found to be the most accurate, with a cross-validated accuracy of 97.0% and F-measures of 0.97, 0.85, 0.93 and 0.98 for classification of proteins to the Class, Fold, Super-Family and Family levels in the SCOP hierarchy. The meta learning regime, especially boosting, improved performance by more accurately classifying the instances from less populated classes.

13.
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity of a prediction in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a naive Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison with literature results showed that the LIBLINEAR performance is competitive, although it does not match the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.

14.
Multivariate spectral analysis has been widely applied in chemistry and other fields. Spectral data consisting of measurements at hundreds and even thousands of analytical channels can now be obtained in a few seconds. It is widely accepted that, before a multivariate regression model is built, well-performed variable selection can help to improve the predictive ability of the model. In this paper, the concept of traditional wavelength variable selection has been extended and the idea of variable weighting is incorporated into the least-squares support vector machine (LS-SVM). A recently proposed global optimization method, the particle swarm optimization (PSO) algorithm, is used to search for the weights of variables and the hyper-parameters involved in LS-SVM, optimizing the training of a calibration set and the prediction of an independent validation set. The entire computation is automatic. Two real data sets are investigated and the results are compared with those of PLS, uninformative variable elimination-PLS (UVE-PLS) and LS-SVM models to demonstrate the advantages of the proposed method.

15.
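Prediction of Henry's law constants of organic compounds by the support vector machine method   Cited by: 4 (self-citations: 2, citations by others: 4)
Using four physicochemical parameters of organic compounds as input variables (molar volume V, dipolarity term π*, hydrogen-bond donor acidity αm, and hydrogen-bond acceptor basicity βm), the support vector machine method was applied to the quantitative prediction of the Henry's law constants of 72 organic compounds. The study found that the support vector machine method allows a model to be built from relatively few samples while still giving good predictions. The predictions of the support vector machine method were far better than those of linear regression.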

16.
Polymerase chain reaction (PCR) is one of the most popular molecular biological techniques and has been widely applied in many areas. However, PCR still faces challenges. During recent decades, the experimental procedure of PCR, including primer design, has been the focus of attention, while little attention has been paid to the analysis of the PCR template, and still nobody can accurately predict whether or not a DNA sequence can be readily amplified using a conventional Taq DNA polymerase-based PCR protocol. In this study, we focus on the DNA template, the subject of the PCR experiment, and introduce a support vector machine (SVM) based method to help evaluate PCR results. In the jackknife cross-validation test, our method achieves an accuracy of 92.06%, with 93.62% sensitivity and 90.53% specificity.
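The jackknife test reported in this abstract is leave-one-out cross-validation: each sequence is held out once, the model is trained on the rest, and the held-out prediction is scored. A minimal sketch of that loop and of the three reported metrics follows. The classifier is a hypothetical nearest-class-mean stand-in on a single invented feature, not the authors' SVM, and the data are made up for illustration.

```python
from typing import Callable, List, Tuple


def nearest_mean(train_x: List[float], train_y: List[int], x: float) -> int:
    """Stand-in classifier (assumption): assign x to the class with the closer mean."""
    pos = [v for v, label in zip(train_x, train_y) if label == 1]
    neg = [v for v, label in zip(train_x, train_y) if label == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return 1 if abs(x - mean_pos) < abs(x - mean_neg) else 0


def jackknife_predictions(features: List[float], labels: List[int],
                          fit_predict: Callable[[List[float], List[int], float], int]
                          ) -> List[int]:
    """Hold out each sample once, train on the rest, and predict the held-out one."""
    preds = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        preds.append(fit_predict(train_x, train_y, features[i]))
    return preds


def summarize(labels: List[int], preds: List[int]) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) for binary labels 0/1."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return (tp + tn) / len(labels), tp / (tp + fn), tn / (tn + fp)


# Invented one-feature data: label 1 = amplifiable template, 0 = not.
features = [1.0, 2.0, 3.0, 7.0, 6.0, 8.0, 9.0, 10.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

preds = jackknife_predictions(features, labels, nearest_mean)
accuracy, sensitivity, specificity = summarize(labels, preds)
print(preds, accuracy, sensitivity, specificity)
```

Sensitivity is the fraction of true positives recovered and specificity the fraction of true negatives recovered, so the three figures in the abstract summarize one confusion matrix accumulated over all held-out predictions.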

17.
The qualitative evaluation of chromatographic data in the framework of external quality assurance schemes is considered in this paper. The homogeneity in the evaluation of chromatographic data among human experts in samples with analytes close to the limit of detection of analytical methods was examined and also a Support Vector Machine (SVM) was developed as an alternative to experts for a more homogeneous and automatic evaluation. A set of 105 ion chromatograms obtained by anti-doping control laboratories was used in this study. The quality of the ion chromatograms was evaluated qualitatively by nine independent experts (associating a score from 0 to 4) and also more objectively taking into account chromatographic parameters (peak width, asymmetry, resolution and S/N ratio). Results obtained showed a high degree of variability among experts when judging ion chromatograms. Experts applying extremely outlying evaluation criteria were identified and excluded from the data used to develop the SVM. This machine was built providing the system with qualitative information (scores assigned by experts) and with objective data (parameters) of the ion chromatograms. A seven-fold cross-validation approach was used to train and to evaluate the predictive ability of the machine. According to the results obtained, the SVM developed was found to be close to the reasoning process followed by the homogeneous human expert group. This machine also could provide a scoring system to sort laboratories according to the quality of their results. The qualitative evaluation of analytical records using a scoring system allowed the identification of the main factors affecting the quality of chromatographic analytical data, such as the specific analytical technique applied and the adherence to guidelines for reporting positive results.

18.
19.
Tyrosine sulfation is a post-translational modification of many secreted and membrane-bound proteins. It governs protein-protein interactions that are involved in leukocyte adhesion, hemostasis, and chemokine signaling. However, the intrinsic features of sulfated proteins remain elusive. This investigation presents SulfoSite, a computational method based on a support vector machine (SVM) for predicting protein sulfotyrosine sites. The approach was developed to consider structural information such as the secondary structure and solvent accessibility of the amino acids that surround the sulfotyrosine sites. One hundred sixty-two experimentally verified tyrosine sulfation sites were identified using UniProtKB/SwissProt release 53.0. The results of a five-fold cross-validation evaluation suggest that solvent accessibility around the sulfotyrosine sites contributes substantially to predictive accuracy. The SVM classifier achieves an accuracy of 94.2% in five-fold cross-validation when the sequence positional weighted matrix (PWM) is coupled with values of the accessible surface area (ASA). The proposed method significantly outperforms previous methods for predicting the location of tyrosine sulfation sites. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2009

20.
By using a composite vector with increment of diversity, a position conservation scoring function, and predicted secondary structures to express the sequence information, a support vector machine (SVM) algorithm for predicting beta- and gamma-turns in proteins is proposed. The 426 and 320 nonhomologous protein chains described by Guruprasad and Rajkumar (Guruprasad and Rajkumar J. Biosci 2000, 25,143) are used for training and testing the predictive model of the beta- and gamma-turns, respectively. The overall prediction accuracy and the Matthews correlation coefficient in 7-fold cross-validation are 79.8% and 0.47, respectively, for the beta-turns. The overall prediction accuracy in 5-fold cross-validation is 61.0% for the gamma-turns. These results are significantly higher than those of other algorithms for the prediction of beta- and gamma-turns on the same datasets. In addition, the 547 and 823 nonhomologous protein chains described by Fuchs and Alix (Fuchs and Alix Proteins: Struct Funct Bioinform 2005, 59, 828) are used for training and testing the predictive model of the beta- and gamma-turns, and better results are obtained. This algorithm may help to improve the performance of protein turn prediction. To assess the ability of the SVM method to correctly classify beta-turn and non-beta-turn (gamma-turn and non-gamma-turn) residues, threshold-independent receiver operating characteristic curves are provided.
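For readers unfamiliar with the Matthews correlation coefficient quoted in this abstract, the standard binary definition is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it from confusion-matrix counts; the counts used are invented and are not taken from the paper, and returning 0.0 for a zero denominator is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts: turn residues are the positive class.
print(matthews_corrcoef(tp=3, tn=3, fp=1, fn=1))
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
```

Unlike plain accuracy, the coefficient stays informative when turn and non-turn residues are unevenly represented, which is presumably why it is reported alongside the 79.8% accuracy.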
