Similar literature
 Found 20 similar articles (search time: 46 ms)
1.
2.
In the present paper we combine the Winnow algorithm and an advanced scheme for feature generation into a tool for multiclass classification. The Winnow algorithm, specifically designed in the late 1980s to work well with high-dimensional data, by design ignores most of the irrelevant features for the scoring of each single training/test case. To augment the pool of available molecular features we use the Winnow algorithm in conjunction with a process that creates additional features from a set of given ones. We adapt a technique formerly employed in text classification termed "orthogonal sparse bigrams" and extend the use of that method to the domain of cheminformatics. Using circular molecular fingerprints as initial features, we create "molecular orthogonal sparse bigrams" (MOSBs) and report their successful application to the task of classification of bioactive molecules. Additionally, we introduce a memory-efficient way of bagging individual classifiers, avoiding the need to hold the complete training data set in memory. To compare the performance of our method with published results, we use the Hert data set of 8293 active molecules in 11 classes. We compare our method to Random Forest and find that our method is not only comparable or better in classification accuracy (up to 50% higher in MCC [Matthews correlation coefficient], 98% higher in fraction of correct predictions) but also quicker to train (by a factor between 2 and 18, depending on the feature generation), more memory efficient, and able to cope more easily with large data sets when we seeded the actives into a pool of 94290 inactive molecules. It is shown that this method can be used with different fingerprints.
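
As an illustration of the Winnow update mentioned in the abstract above, the following is a minimal sketch for binary features. It assumes the common promotion/demotion variant with threshold equal to the number of features and factor `alpha=2.0`; the function names and parameters are hypothetical and are not taken from the paper, and the MOSB feature generation and bagging steps are not shown.

```python
def train_winnow(samples, labels, n_features, alpha=2.0, epochs=10):
    """Train a binary Winnow classifier.

    Each sample is the set of indices of its active (value 1) features,
    so only the features present in a sample are ever scored or updated.
    """
    weights = [1.0] * n_features
    threshold = float(n_features)  # assumed choice; n/2 is also common
    for _ in range(epochs):
        mistakes = 0
        for active, label in zip(samples, labels):
            score = sum(weights[i] for i in active)
            predicted = score >= threshold
            if predicted == label:
                continue
            mistakes += 1
            # Promote on a false negative, demote on a false positive.
            factor = alpha if label else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break  # a full pass without errors: stop early
    return weights, threshold


def predict_winnow(weights, threshold, active):
    """Return True if the summed weight of the active features reaches the threshold."""
    return sum(weights[i] for i in active) >= threshold
```

Because updates are multiplicative and touch only active features, irrelevant features that never co-occur with mistakes keep their initial weight, which is one reason the method suits sparse, high-dimensional fingerprints.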

3.
4.
In this paper, we study the classification of unbalanced data sets of drugs. As an example we chose a data set of 2D6 inhibitors of cytochrome P450. The human cytochrome P450 2D6 isoform plays a key role in the metabolism of many drugs in the preclinical drug discovery process. We have collected a data set from annotated public data and calculated physicochemical properties with chemoinformatics methods. On top of this data, we have built classifiers based on machine learning methods. Data sets with different class distributions lead to the effect that conventional machine learning methods are biased toward the larger class. To overcome this problem and to obtain sensitive but also accurate classifiers, we combine machine learning and feature selection methods with techniques addressing the problem of unbalanced classification, such as oversampling and threshold moving. We have used our own implementation of a support vector machine algorithm as well as the maximum entropy method. Our feature selection is based on the unsupervised McCabe method. The classification results from our test set are compared structurally with compounds from the training set. We show that the applied algorithms enable the effective high-throughput in silico classification of potential drug candidates.
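
Threshold moving, named in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: the decision cutoff is chosen to maximize balanced accuracy on labeled scores, which is one common criterion and not necessarily the one used in the paper; `move_threshold` and `balanced_accuracy` are hypothetical names, and both classes must be present in `labels`.

```python
def balanced_accuracy(labels, predictions):
    """Mean of sensitivity and specificity; needs at least one sample per class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return 0.5 * (tp / positives + tn / negatives)


def move_threshold(scores, labels):
    """Return (cutoff, balanced accuracy) for the best cutoff among observed scores."""
    best_cut, best_value = None, -1.0
    for cut in sorted(set(scores)):
        predictions = [s >= cut for s in scores]
        value = balanced_accuracy(labels, predictions)
        if value > best_value:  # strict: keep the lowest cutoff on ties
            best_cut, best_value = cut, value
    return best_cut, best_value
```

With a minority class whose scores rarely exceed a default cutoff such as 0.5, lowering the cutoff in this way trades a few false positives for a large gain in sensitivity.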

5.
Class prediction based on DNA microarray data has emerged as one of the most important applications of bioinformatics for diagnostics/prognostics. Robust classifiers are needed that use the most biologically relevant genes embedded in the data. A consensus approach that combines multiple classifiers has attributes that mitigate this difficulty compared to a single classifier. A new classification method named consensus analysis of multiple classifiers using non-repetitive variables (CAMCUN) was proposed for the analysis of hyper-dimensional gene expression data. The CAMCUN method combined multiple classifiers, each of which was built from distinct, non-repeated genes that were selected for effectiveness in class differentiation. Thus, CAMCUN utilized the most biologically relevant genes in the final classifier. The CAMCUN algorithm was demonstrated to give consistently more accurate predictions for two well-known data sets for prostate cancer and leukemia. Importantly, the CAMCUN algorithm employed an integrated 10-fold cross-validation and randomization test to assess the degree of confidence of the predictions for unknown samples.

6.
Analytical Letters, 2012, 45(18): 2833-2842
Traditional gene expression programming for classification is designed for binary decisions. Herein, projection discriminant analysis for direct multiclass categorization using gene expression programming is described. Gene expression programming was first employed to examine new synthetic variables that were built as nonlinear combinations of the original features. The data were projected on planes spanned by these new synthetic variables and the nearest centroid was employed to classify new samples. A new objective function was formulated to determine optimum synthetic variables. Direct multiclass categorization using a gene expression programming algorithm was used to classify six tea varieties analyzed by near infrared spectroscopy. Compared with traditional gene expression programming, principal component analysis, and linear discriminant analysis, direct multiclass categorization with gene expression programming algorithm was more efficient. Visual inspection of high dimensional data by this approach also facilitated classification and comprehension of data.
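
The nearest-centroid step in the abstract above can be sketched as follows. This is a minimal illustration that assumes the samples have already been projected onto the synthetic variables; the gene expression programming search itself is not shown, and `fit_centroids` and `nearest_centroid` are hypothetical names.

```python
import math


def fit_centroids(points, labels):
    """Return a dict mapping each class label to the mean of its projected points."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid(centroids, point):
    """Assign the point to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))
```

Because each class is summarized by a single point, the rule handles any number of classes directly, which matches the multiclass setting the abstract describes.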

7.
Improved binary PSO for feature selection using gene expression data   Cited by: 2 (self-citations: 0, citations by others: 2)
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. Compared to the number of genes involved, available training data sets generally have a fairly small sample size in cancer type classification. These training data limitations constitute a challenge to certain classification methodologies. A reliable selection method for genes relevant for sample classification is needed in order to speed up the processing rate, decrease the predictive error rate, and avoid incomprehensibility due to the large number of genes investigated. Improved binary particle swarm optimization (IBPSO) is used in this study to implement feature selection, and the K-nearest neighbor (K-NN) method serves as an evaluator of the IBPSO for gene expression data classification problems. Experimental results show that this method effectively simplifies feature selection and reduces the total number of features needed. Compared with the best results previously published, the proposed method achieves the highest classification accuracy in nine of the 11 gene expression data test problems and comparable accuracy in the other two.

8.
Protein biomarkers in blood have been widely used in the early diagnosis of disease. However, simultaneous detection of many biomarkers in a single sample remains challenging. Herein, we show that the combination of a sandwich assay and DNA‐assisted nanopore sensing could unambiguously identify and quantify several antigens in a mixture. We use five barcode DNAs to label different gold nanoparticles that can selectively bind specific antigens. After the completion of the sandwich assay, barcode DNAs are released and subject to nanopore translocation tests. The distinct current signatures generated by each barcode DNA allow simultaneous quantification of biomarkers at picomolar level in clinical samples. This approach would be very useful for accurate and multiplexed quantification of cancer‐associated biomarkers within a very small sample volume, which is critical for non‐invasive early diagnosis of cancer.

9.
The paper presents a new method for the qualitative identification of gases. It is based on the dynamic response of a sensor array, with emphasis on the processing of discrete measurement data. The information needed to identify test samples is obtained in the course of profiling the data from calibration measurements. This operation consists of the following steps: classification of data sets, selection of representative data sets, parameterization of classifiers associated with representative data sets, and determination of data records. In our work, Discriminant Function Analysis was used for data classification. The information saved in a data record describes the sequential number of the discrete measurement, the combination of gas sensors in this measurement that is best for classifying the calibration samples, and the parameters of the associated classifier. Together these serve as identifiers of the gas class. The procedure of data record determination itself is time consuming. However, this operation is performed only at the development stage of the measurement instrument and when a malfunction is diagnosed. Routine use of the instrument is restricted to the gas identification task, which only utilizes the results of profiling. The identification of an unknown gas is performed on the basis of the data records and the measurement data obtained for this gas. Data records guide the preparation of data sets, separately for each class of gases. These data sets are used as input to the discriminant functions, whose parameter values are also indicated by the data records. The present contribution shows that the qualitative identification of nine test gas samples (vapors of ethanol, acetic acid and ethyl acetate in air) with our method was very accurate and fast.

10.
Early diagnosis is the key to the effective treatment of cancer. The detection of cancer biomarkers plays a critical role not only in early cancer diagnosis, but also in classifying and staging tumor progression and in assessing prognosis and treatment response. Currently, various molecular diagnostic techniques have been developed for cancer biomarker studies, with many of the more effective approaches requiring a separation step before detection. Capillary electrophoresis (CE) can perform rapid and efficient separation with small samples, which makes it well suited for the analysis of both small-molecule and macromolecule biomarkers in complex samples. CE offers different separation modes and can be coupled to different detectors to form a variety of platforms, for example for studies of DNA/RNA point mutations, protein misexpression, and metabolite abnormalities. Similarly, microchip capillary electrophoresis (MCE) appears as a very important biomarker screening platform with the merits of high throughput, integration, and miniaturization, which make it a promising clinical tool. Hyphenated with different detectors, or integrated with immunoassay, PCR/LDR, and related technologies, MCE can be built into diverse platforms for biomarker discovery in genomics, proteomics, and metabolomics studies. Multiplex biomarker screening via CE- or MCE-based platforms is becoming a trend. This paper focuses on studies of cancer biomarkers via CE/MCE platforms published over the past 3 years. Some recent CE applications in the field of cancer study, such as cancer theranostics, are introduced.

11.
12.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples which have a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naive Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

13.
This paper introduces the ant colony algorithm, a novel swarm intelligence based optimization method, to select appropriate wavelet coefficients from mass spectral data as a new feature selection method for ovarian cancer diagnostics. After determining proper parameters for the ant colony algorithm (ACA) based search, we perform the feature search 100 times with the number of selected features fixed at 5. The results of this study show: (1) the classification accuracy based on the five selected wavelet coefficients can reach up to 100% for all the training, validating and independent testing sets; (2) the eight most frequently selected wavelet coefficients of the 100 runs can provide 100% accuracy for the training set, 100% accuracy for the validating set, and 98.8% accuracy for the independent testing set, which suggests the robustness and accuracy of the proposed feature selection method; and (3) the mass spectral data corresponding to the eight most frequently selected wavelet coefficients can be located by reverse wavelet transformation, and these located mass spectral data still maintain high classification accuracies (100% for the training set, 97.6% for the validating set, and 98.8% for the testing set) and also provide sufficient physical and medical meaning for future ovarian cancer mechanism studies. Furthermore, the corresponding mass spectral data (potential biomarkers) are in good agreement with other studies which have used the same sample set. Together these results suggest this feature extraction strategy will benefit the development of intelligent and real-time spectroscopy instrumentation based diagnosis and monitoring systems.

14.
Existing colorectal cancer biomarkers are insufficient for providing a quick and accurate diagnosis, which is critical for a good prognosis. More appropriate biomarkers are thus needed. To identify new colorectal cancer biomarker candidates, we conducted a comprehensive differential proteomic analysis of six cancer cell lines and a normal cell line, utilizing a fluorogenic derivatization–liquid chromatography–tandem mass spectrometry (FD‐LC‐MS/MS) approach. Two sets of intracellular biomarker candidates were identified: one for colorectal cancer, and the other for metastatic colorectal cancer. Our results suggest that cooperative expression of FABP5 and cyclophilin A might be linked to Her2 signaling. Upregulation of LDHB and downregulation of GAPDH suggest the existence of a specific nonglycolytic energy production pathway in metastatic colorectal cancer cells. Downregulation of 14‐3‐3ζ/δ, cystatin‐B, Ran and thioredoxin could be a result of their secretion, which then stimulates metastasis via activity in the sera and ascitic fluids. We propose a possible flow scheme to describe the dynamics of protein expression in colorectal cancer cells leading to tumor progression and metastasis via cell proliferation, angiogenesis, disorganization of actin filaments and epithelial–mesenchymal transition. Our results suggest that colorectal tumor progression may be regulated by signaling mediated by Her2, hypoxia, and TGFβ. Copyright © 2012 John Wiley & Sons, Ltd.

15.
Point mutations can be used as biomarkers for disease diagnosis. In this study, a nanorobot for low-abundance point mutation enrichment was constructed using DNA origami. The novel design achieved limits of detection of 0.1% and 1% for synthesized DNA samples and clinical gene samples, respectively. Resettability was a key property of this method, which also involved a simpler process, lower cost and shorter detection duration than traditional enrichment methods. This novel DNA nanor...

16.
Li Mengyan, Mai Chuoying, Zou Li. 《分析试验室》 2022, (7): 842-850
Optical biosensors have been widely used in the detection of biomarkers owing to their simple operation, quick response, high sensitivity, and visualization. When constructing optical biosensors, nucleic acid amplification technology can be used to improve their analytical performance, which can further enable highly sensitive detection of biomarkers and provide more accurate information for disease diagnosis. This review covers recent advances in nucleic acid amplification-based optical biosensors for disease diagnosis, discusses problems that may arise in practical applications, and proposes future development trends. © 2022, Youke Publishing Co.,Ltd. All rights reserved.

17.
Docking scoring functions are notoriously weak predictors of binding affinity. They typically assign a common set of weights to the individual energy terms that contribute to the overall energy score; however, these weights should be gene family dependent. In addition, they incorrectly assume that individual interactions contribute toward the total binding affinity in an additive manner. In reality, noncovalent interactions often depend on one another in a nonlinear manner. In this paper, we show how support vector machines (SVMs), trained by associating sets of individual energy terms retrieved from molecular docking with the known binding affinity of each compound from high-throughput screening experiments, can be used to improve the correlation between known binding affinities and those predicted by the docking program eHiTS. We construct two prediction models: a regression model trained using IC50 values from BindingDB, and a classification model trained using active and decoy compounds from the Directory of Useful Decoys (DUD). Moreover, to address the issue of overrepresentation of negative data in high-throughput screening data sets, we have designed a multiple-planar SVM training procedure for the classification model. The increased performance that both SVMs give when compared with the original eHiTS scoring function highlights the potential for using nonlinear methods when deriving overall energy scores from their individual components. We apply the above methodology to train a new scoring function for direct inhibitors of Mycobacterium tuberculosis (M.tb) InhA. By combining ligand binding site comparison with the new scoring function, we propose that phosphodiesterase inhibitors can potentially be repurposed to target M.tb InhA. Our methodology may be applied to other gene families for which target structures and activity data are available, as demonstrated in the work presented here.

18.
The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previously built QSAR model. In this study we focus on linear regression models only, though the technique is general and could also be applied to other types of quantitative models. Our technique is based on a classification method that divides regression residuals from a previously generated model into a good class and a bad class and then builds a classifier based on this division. The trained classifier is then used to determine the class of the residual for a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and a hand-built data set. The data sets selected covered both physical and biological properties and also presented the methodology with quantitative regression models of varying quality. The results indicate that this technique can determine whether a new compound will be well or poorly predicted with weighted success rates ranging from 73% to 94% for the best classifier.
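
The first step of the approach in the abstract above, dividing regression residuals into a good and a bad class, can be sketched as follows. The abstract does not state how the division is made, so the cutoff used here (a multiple `n_sd` of the population standard deviation of the residuals) is purely an assumption for illustration, and `label_residuals` is a hypothetical name.

```python
import statistics


def label_residuals(observed, predicted, n_sd=1.0):
    """Label each compound "good" or "bad" by the size of its regression residual.

    Assumed rule: a residual is "good" if its absolute value does not exceed
    n_sd population standard deviations of all residuals.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    cutoff = n_sd * statistics.pstdev(residuals)
    return ["good" if abs(r) <= cutoff else "bad" for r in residuals]
```

The resulting labels, paired with the compounds' descriptors, would then form the training set for the good/bad classifier that is applied to new compounds.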

19.
The actual utility of capillary electrophoresis‐mass spectrometry (CE‐MS) for biomarker discovery using metabolomics still needs to be assessed. Therefore, a simulated comparative metabolic profiling study for biomarker discovery by CE‐MS was performed, using pooled human plasma samples with spiked biomarkers. Two studies were carried out in this work. Study I compared two sets of plasma samples, in which one set (class I) was spiked with five isotope‐labeled compounds, whereas the other set (class II) was spiked with six different isotope‐labeled compounds. Study II also compared two sets of plasma samples; however, the isotope‐labeled compounds were spiked into both class I and class II samples, at concentrations that differed by a factor of two between the classes (with one compound absent in each class). The aim was to determine whether CE‐MS‐based metabolomics could reveal the spiked biomarkers as the main classifiers, applying two different data analysis software tools (MetaboAnalyst and Matlab). Unsupervised analysis of the recorded metabolic profiles revealed a clear distinction between class I and class II plasma samples in both studies. This classification was mainly attributed to the spiked isotope‐labeled compounds, thereby emphasizing the utility of CE‐MS for biomarker discovery.

20.
We describe the application of ensemble methods to binary classification problems on two pharmaceutical compound data sets. Several variants of single and ensemble models of k-nearest neighbors classifiers, support vector machines (SVMs), and single ridge regression models are compared. All methods exhibit robust classification even when more features are given than observations. On two data sets dealing with specific properties of drug-like substances (cytochrome P450 inhibition and "Frequent Hitters", i.e., unspecific protein inhibition), we achieve classification rates above 90%. We are able to reduce the cross-validated misclassification rate for the Frequent Hitters problem by a factor of 2 compared to previous results obtained for the same data set with different modeling techniques.
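
One simple way to combine the members of an ensemble like those in the abstract above is a plurality vote over their predicted labels. The abstract does not say how the ensemble outputs are combined, so this is only a sketch of one common choice; `majority_vote` is a hypothetical name.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by plurality.

    predictions_per_model[m][i] is the label model m assigns to sample i.
    On a tie, the label seen first among the tied ones is returned.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

Using an odd number of binary classifiers avoids ties altogether.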
