Similar Articles
20 similar articles found.
1.
With the rapid development of DNA microarray and next-generation sequencing technologies, large amounts of genomic data have been generated, so extracting differentially expressed genes from genomic data has become a matter of urgency. Because Low-Rank Representation (LRR) performs well in studying low-dimensional subspace structures, it has attracted considerable attention in recent years. However, it does not take the intrinsic geometric structure of the data into consideration. In this paper, a new method named Laplacian-regularized Low-Rank Representation (LLRR) is proposed and applied to genomic data; it introduces graph regularization into LRR. By taking full advantage of graph regularization, LLRR can capture the intrinsic non-linear geometric information in the data. LLRR decomposes the observation matrix of genomic data into a low-rank matrix and a sparse matrix by solving an optimization problem. Because significant genes can be considered sparse signals, the differentially expressed genes are viewed as sparse perturbation signals and can therefore be selected from the sparse matrix. Finally, we use the GO tool to analyze the selected genes and compare the P-values with those of other methods. Results on simulated data and two real genomic data sets show that this method outperforms other methods in differentially expressed gene selection.
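As a small illustration of the graph-regularization idea behind LLRR, the NumPy sketch below builds a k-nearest-neighbour affinity graph over toy samples and evaluates the Laplacian regularizer tr(Z L Zᵀ). The data, the sizes, and the stand-in representation matrix Z are all illustrative assumptions; the actual low-rank optimization of LLRR is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy "expression" matrix: 20 samples in 10 dimensions (columns = samples).
X = rng.normal(size=(10, 20))

# Build a k-nearest-neighbour affinity graph over the samples and form
# the graph Laplacian L = D - W that appears in the regularizer tr(Z L Z^T).
k = 3
dist = np.linalg.norm(X.T[:, None] - X.T[None], axis=2)
W = np.zeros((20, 20))
for i in range(20):
    for j in np.argsort(dist[i])[1:k + 1]:   # skip self (distance 0)
        W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

Z = rng.normal(size=(20, 20))                # a stand-in representation matrix
reg = np.trace(Z @ L @ Z.T)                  # graph-regularization term
print(reg)                                   # non-negative: L is PSD
```

Because L is positive semi-definite, the regularizer penalizes representations that vary strongly between neighbouring samples, which is how the geometric structure enters the objective.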

2.
A sparse representation classification method for tobacco leaves based on near-infrared spectroscopy and a deep learning algorithm is reported in this paper. All training samples were used to build the dictionary for sparse representation, and each test sample was represented by the sparsest linear combination of dictionary atoms found by sparse coding. The regression residual of the test sample with respect to each class was computed, and the sample was finally assigned to the class with the minimum residual. The effectiveness of the sparse representation classification method was compared with the K-nearest-neighbor and particle swarm optimization-support vector machine algorithms. The results show that the proposed method is more accurate and more efficient, suggesting that near-infrared spectroscopy with sparse representation classification may be an alternative to traditional methods for discriminating classes of tobacco leaves.
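The classify-by-minimum-reconstruction-residual step can be sketched as follows. This is a toy illustration with made-up "spectra", and a per-class least-squares fit stands in for the full sparse-coding step of the reported method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy spectra: two classes, each drawn around its own
# 20-point template with small noise (names and sizes are illustrative).
templates = rng.normal(size=(2, 20))
train = np.vstack([t + 0.05 * rng.normal(size=(5, 20)) for t in templates])
labels = np.repeat([0, 1], 5)

def classify(y, train, labels):
    """Assign y to the class whose training samples reconstruct it with
    the smallest residual (least squares per class, a simplification of
    the sparse-coding step described in the abstract)."""
    residuals = {}
    for cls in np.unique(labels):
        D = train[labels == cls].T                    # class sub-dictionary
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)
        residuals[cls] = np.linalg.norm(y - D @ coef)
    return min(residuals, key=residuals.get)

print(classify(templates[0], train, labels),
      classify(templates[1], train, labels))
```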

3.
This paper introduces a kernel-based fuzzy clustering approach that deals with non-linearly separable problems by applying kernel Radial Basis Functions (RBF), which map the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomics data is extremely challenging for bioinformatics researchers in genome analysis. To support investigations in bioinformatics, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which inherently uses another proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches fully adapt to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we propose a preprocessing approach that generates numeric feature vectors for huge SNP sequences and scales by executing on an Apache Spark cluster; it is applied to real-world SNP datasets of two plant species, soybean and rice, taken from open internet repositories. Comparison of the proposed scalable kernelized fuzzy clustering results with similar work shows significant improvement in time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments on various SNP datasets show the effectiveness of the proposed KSRSIO-FCM in comparison with the proposed KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM.

4.
The High Dimensional Model Representation (HDMR) technique is a procedure for efficiently representing high-dimensional functions. A practical form of the technique, RS-HDMR, is based on randomly sampling the overall function and utilizing orthonormal polynomial expansions. The determination of the expansion coefficients employs Monte Carlo integration, which controls the accuracy of RS-HDMR expansions. In this article, a correlation method is used to reduce the Monte Carlo integration error. The determination of the expansion coefficients becomes an iterative procedure, and the resultant RS-HDMR expansion has much better accuracy than that achieved by direct Monte Carlo integration. For an illustration in four dimensions, a few hundred random samples are sufficient to construct an RS-HDMR expansion by the correlation method with an accuracy comparable to that obtained by direct Monte Carlo integration with thousands of samples.
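The basic ingredient, Monte Carlo estimation of an orthonormal-polynomial expansion coefficient, can be illustrated on a toy function. The correlation method itself is not reproduced here, and the target function and sample size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target on [0, 1]^2: f is exactly first order in x1, so a
# first-order RS-HDMR expansion represents it exactly.
def f(x):
    return 1.0 + (2.0 * x[:, 0] - 1.0)

# First shifted Legendre polynomial, orthonormal on [0, 1].
def phi1(t):
    return np.sqrt(3.0) * (2.0 * t - 1.0)

# Expansion coefficient alpha = ∫ f(x) phi1(x1) dx, estimated by
# direct Monte Carlo integration over random samples.
N = 20000
x = rng.uniform(size=(N, 2))
alpha_mc = np.mean(f(x) * phi1(x[:, 0]))

alpha_exact = 1.0 / np.sqrt(3.0)     # since f = f0 + alpha * phi1(x1) exactly
print(alpha_mc, alpha_exact)
```

The statistical error of this plain estimator is what the article's correlation method is designed to reduce.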

5.
A Bayesian network (BN) is a knowledge representation formalism that has proven to be a promising tool for analyzing gene expression data. Several problems still restrict its successful application. Typical gene expression databases contain measurements for thousands of genes but no more than several hundred samples, yet most existing BN learning algorithms do not scale beyond a few hundred variables, and current methods yield poor-quality BNs when applied to such high-dimensional datasets. We propose a hybrid constraint-based, score-and-search method that is effective for learning gene networks from DNA microarray data. In the first phase of this method, a novel algorithm generates a skeleton BN based on dependency analysis; the resulting BN structure is then searched using a scoring metric combined with the knowledge learned in the first phase. Computational tests have shown that the proposed method achieves more accurate results than state-of-the-art methods and scales beyond datasets with several hundred variables.

6.
To address the problems of traditional contact-based driver-fatigue detection methods, which interfere with driving and whose detection algorithms have low recognition rates, this paper proposes an eye-state recognition method based on sparse representation. The K-SVD (K-means singular value decomposition) method is used to construct an over-complete redundant dictionary from the input training set, and orthogonal matching pursuit is used to compute a sparse representation of each test image; the class of a test image, and hence the eye state, is then determined from the error between the reconstructed image and the test image. In the experiments, the K-SVD and OMP (orthogonal matching pursuit) methods were compared with other dictionary-learning and sparse-representation methods; the results show that the K-SVD dictionary-learning algorithm combined with the OMP algorithm achieves better recognition performance.
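A minimal sketch of the OMP sparse-coding step is given below, on a random dictionary rather than a K-SVD-learned one; the dictionary size, the support, and the test signal are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random over-complete dictionary with unit-norm atoms (a stand-in for
# a learned K-SVD dictionary), and a signal that is an exact 3-sparse
# combination of its columns.
D = rng.normal(size=(40, 60))
D /= np.linalg.norm(D, axis=0)
true_support = [5, 17, 40]
y = D[:, true_support] @ np.array([2.0, -1.5, 1.0])

def omp(D, y, n_iters):
    """Orthogonal matching pursuit: greedily pick the atom most
    correlated with the current residual, then refit all selected
    atoms to y by least squares."""
    support, residual, coef = [], y.copy(), None
    for _ in range(n_iters):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    return support, coef, residual

support, coef, residual = omp(D, y, n_iters=6)
print(sorted(set(support)), float(np.linalg.norm(residual)))
```

In the fatigue-detection setting, the same reconstruction residual, computed against per-class dictionaries, decides the eye state of the test image.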

7.
DNA microarray data has been widely used in cancer research because it helps distinguish between tumor classes. However, typical gene expression data usually exhibits a high-dimensional, imbalanced characteristic, which poses a severe challenge for traditional machine learning methods attempting to construct a robust classifier that performs well on both the minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered particularly well suited to high-dimensional problems. Unfortunately, almost all Relief-based methods do not take the class imbalance distribution into account. This study identifies that existing Relief-based algorithms may underestimate features with discernibility for the minority classes and ignore the distribution characteristics of minority-class samples; as a result, an additional bias towards classification into the majority classes can be introduced. To this end, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief corrects the bias towards the majority classes and considers the scattered distribution of minority-class samples when estimating feature weights. In this way, imRelief can reward features that perform well at separating the minority classes from the other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection.
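For reference, the classic binary Relief weighting that imRelief builds on can be sketched as follows; the imbalance-aware corrections of imRelief are not reproduced, and the toy data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary data set: feature 0 separates the classes, feature 1 is noise.
X0 = np.column_stack([rng.normal(0.0, 0.3, 50), rng.normal(0.0, 1.0, 50)])
X1 = np.column_stack([rng.normal(3.0, 0.3, 50), rng.normal(0.0, 1.0, 50)])
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 50)

def relief_weights(X, y):
    """Classic (binary) Relief: reward features that differ on the
    nearest miss and agree on the nearest hit."""
    n, d = X.shape
    W = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)      # L1 distances to sample i
        dist[i] = np.inf                          # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        W += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return W / n

W = relief_weights(X, y)
print(W)   # the informative feature 0 receives a much larger weight
```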

8.
9.
Principal component analysis (PCA) is a widespread technique for data analysis that relies on the covariance/correlation matrix of the analyzed data. However, to work properly with high-dimensional data sets, PCA poses severe mathematical constraints on the minimum number of different replicates, or samples, that must be included in the analysis. Generally, improper sampling arises when the number of data points is small with respect to the number of degrees of freedom that characterize the ensemble. In the life sciences it is often important to have an algorithm that can accept poorly dimensioned data sets, including degenerate ones. Here a new random projection algorithm is proposed, in which a random symmetric matrix surrogates the covariance/correlation matrix of PCA while maintaining its data clustering capacity. We demonstrate that what matters for the clustering efficiency of PCA is not the exact form of the covariance/correlation matrix, but simply its symmetry.
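The core claim, that the eigenvectors of a random symmetric matrix can stand in for the principal components for clustering purposes, can be sketched on toy data; the sizes and separations below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated Gaussian clusters in 50 dimensions (toy data).
d = 50
A = rng.normal(size=(100, d)) * 0.5
B = rng.normal(size=(100, d)) * 0.5 + 10.0
X = np.vstack([A, B])

# Instead of diagonalizing the covariance matrix (classical PCA),
# diagonalize a random symmetric matrix and project the data onto
# its leading eigenvectors.
M = rng.normal(size=(d, d))
S = (M + M.T) / 2.0                      # random symmetric surrogate
vals, vecs = np.linalg.eigh(S)
proj = X @ vecs[:, -2:]                  # top-2 "random components"

c1, c2 = proj[:100].mean(axis=0), proj[100:].mean(axis=0)
print(np.linalg.norm(c1 - c2))           # clusters remain separated
```

Because the eigenvectors of any symmetric matrix form an orthonormal basis, the projection preserves distances on average, which is the intuition behind the symmetry argument.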

10.
The High-Dimensional Model Representation (HDMR) technique is a family of approaches for efficiently interpolating high-dimensional functions. RS (Random Sampling)-HDMR is a practical form of HDMR based on randomly sampling the overall function and utilizing orthonormal polynomial expansions to approximate the RS-HDMR component functions. The determination of the expansion coefficients for the component functions employs Monte Carlo integration, which controls the accuracy of the RS-HDMR interpolation. The control variate method is an established approach to improving the accuracy of Monte Carlo integration. However, this method is often not practical for an arbitrary function f(x), because there is no general way to find an analytical control variate function h(x) that is sufficiently similar to f(x). In this article, we show that truncated RS-HDMR expansions can be used as h(x) for arbitrary f(x), and a more stable iterative ratio control variate method is developed for determining the expansion coefficients of the RS-HDMR component functions. As the asymptotic error (standard deviation) of the estimator given by the ratio control variate method is proportional to 1/N (N being the sample size), it is more efficient than direct Monte Carlo integration, whose error is proportional to 1/sqrt(N). In an illustration with a four-dimensional atmospheric model, a few hundred random samples are sufficient to construct an RS-HDMR expansion by the ratio control variate method with an accuracy comparable to that obtained by direct Monte Carlo integration with thousands of samples.
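The underlying control variate idea, shown here in its classic difference form rather than the article's iterative ratio form, can be illustrated with a one-dimensional toy integral where h(x) is similar to f(x) and has a known integral.

```python
import numpy as np

rng = np.random.default_rng(4)

# Estimate I = ∫_0^1 e^x dx with and without a control variate.
# h(x) = 1 + x resembles e^x and has the known integral 3/2.
N = 10000
x = rng.uniform(size=N)
f = np.exp(x)
h = 1.0 + x

direct = f.mean()                 # plain Monte Carlo estimator
cv = (f - h).mean() + 1.5         # control variate estimator

exact = np.e - 1.0
print(direct, cv, exact)
print(f.var(), (f - h).var())     # the integrand variance drops sharply
```

The sample variance of f - h is several times smaller than that of f, so the same number of samples yields a correspondingly more accurate estimate; the article's ratio variant pushes this further by building h from truncated RS-HDMR expansions.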

11.
We discuss the clustering of 234 environmental samples resulting from an extensive monitoring program concerning soil lead content, plant lead content, traffic density, and distance from the road at different sampling locations in former East Germany. Considering the structure of the data and the unsatisfactory results obtained with classical clustering and principal component analysis, it appeared evident that fuzzy clustering could be one of the best solutions. We used, in order, different fuzzy clustering algorithms: the fuzzy c-means (FCM) algorithm; the Gustafson-Kessel (GK) algorithm, which can detect clusters of ellipsoidal shape by introducing an adaptive distance norm for each cluster; and the fuzzy c-varieties (FCV) algorithm, which was developed for the recognition of r-dimensional linear varieties (lines, planes, or hyperplanes) in high-dimensional data. Fuzzy clustering with a convex combination of point prototypes and different multidimensional linear prototypes is also discussed and applied for the first time in analytical chemistry (environmetrics). The results obtained in this study show the advantages of the FCV and GK algorithms over the FCM algorithm. The performance of each algorithm is illustrated by graphs and evaluated by the values of several conventional cluster validity indices, which are in very good agreement with the quality of the clustering results. Figure: projection of all samples on the plane defined by the membership degrees to clusters A2 and A4, obtained using the fuzzy c-varieties (FCV) algorithm.
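A plain fuzzy c-means iteration, the baseline that the GK and FCV variants refine with adaptive and linear-variety distances, can be sketched on toy two-dimensional data; all data and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy 2-D data: two blobs (a stand-in for the environmental samples).
X = np.vstack([rng.normal(0.0, 0.3, (60, 2)),
               rng.normal(5.0, 0.3, (60, 2))])

def fcm(X, c=2, m=2.0, iters=100):
    """Plain fuzzy c-means with Euclidean distances; alternates the
    prototype update and the fuzzy membership update."""
    n = len(X)
    U = rng.dirichlet(np.ones(c), size=n)           # random fuzzy memberships
    for _ in range(iters):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]    # cluster prototypes
        d = np.linalg.norm(X[:, None, :] - V[None], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1))
                   * (d ** (-2 / (m - 1))).sum(axis=1, keepdims=True))
    return U, V

U, V = fcm(X)
print(np.sort(V[:, 0]))   # prototypes settle near the blob centres 0 and 5
```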

12.
A data compression method is presented that is generally applicable to first-order convergent iterative procedures that employ subspace expansions or extrapolations based on successive correction vectors. This method is based on the truncation of insignificant information in successive correction vectors. Although the correction vectors themselves may be severely truncated with the proposed approach, the final solution vector may be represented to arbitrary accuracy. A feature of the proposed method is that more slowly convergent iterative procedures allow the correction vectors to be more severely truncated without affecting the overall convergence rate. The method is implemented and applied to the iterative Davidson diagonalization method. If the compressed representation of the expansion vectors can be held in main computer memory, then a significant reduction in the I/O requirements is achieved.
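The self-correcting behavior that makes severe truncation of correction vectors safe can be illustrated in a much simpler setting than Davidson diagonalization, namely a convergent Richardson iteration for a toy linear system. This is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy linear system A x = b with A close to the identity, solved by the
# first-order iteration x <- x + (b - A x). Each correction vector is
# truncated: entries below 10% of its largest entry are dropped, yet the
# final solution is still accurate, because later corrections recover
# whatever the truncation discarded.
n = 30
A = np.eye(n) + 0.1 * rng.normal(size=(n, n)) / np.sqrt(n)
A = (A + A.T) / 2.0                     # symmetric, eigenvalues near 1
b = rng.normal(size=n)

x = np.zeros(n)
for _ in range(200):
    corr = b - A @ x                                       # correction vector
    corr[np.abs(corr) < 0.1 * np.abs(corr).max()] = 0.0    # truncate it
    x += corr

print(np.linalg.norm(b - A @ x))        # residual is still tiny
```

Since the truncation error is always a fixed fraction of the current (shrinking) correction, the iteration still contracts, which mirrors the abstract's point that slower-converging procedures tolerate harsher truncation.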

13.
Due to the disease mechanisms involved, many complex diseases such as cancer demonstrate significant heterogeneity, with varying behaviors including different survival times, treatment responses, and recurrence rates. The aim of tumor stratification is to identify disease subtypes, an important first step towards precision medicine. Recent advances in profiling a large number of molecular variables, such as in The Cancer Genome Atlas (TCGA), have enabled researchers to implement computational methods, including traditional clustering and bi-clustering algorithms, to systematically analyze high-throughput molecular measurements and identify tumor subtypes along with their associated biomarkers. In this study we discuss critical issues and challenges in existing computational approaches for tumor stratification. We show that the problem can be formulated as finding densely connected sub-graphs (bi-cliques) in a bipartite graph representation of genomic data. We propose a novel algorithm that takes advantage of prior biological knowledge, through a gene-gene interaction network, to find such sub-graphs, which helps simultaneously identify both tumor subtypes and their corresponding genetic markers. Our experimental results show that the proposed method outperforms current state-of-the-art methods for tumor stratification.

14.
The great size of chemical databases, and the high computational cost of atom-by-atom comparison of molecular structures when calculating the similarity between two chemical compounds, necessitate new clustering models aimed at reducing the time needed to recover from a database the set of molecules whose similarity to a given molecule pattern falls within a given range. In this paper we use the information corresponding to the cycles existing in the structure of molecules as an approach to the classification of chemical databases. The clustering method proposed here is based on representing the topological structure of the molecules stored in chemical databases through the corresponding cycle graph. This method behaves more appropriately than others described in the literature that also use the information corresponding to the cyclicity of the molecules.

15.
Discrimination of sticky note papers by difference Raman spectroscopy combined with SVM
刘津彤, 张岚泽, 姜红, 陈相全, 段斌, 刘峰. 《化学通报》 2022, 85(2): 259-263, 246
Based on difference Raman spectroscopy and a support vector machine (SVM) model, a rapid, visualized method for identifying sticky-note paper evidence is proposed. Difference Raman spectra were acquired for 40 groups of sticky-note samples of different brands. After denoising and baseline correction of the spectra with a BP neural network and the difference technique, spectral-band information was extracted using the F-test and principal component analysis, and an SVM classification model was constructed. The experimental results show that with a linear kernel, the SVM model classifies the test set with complete accuracy, and K-fold cross-validation gives satisfactory results. Compared with traditional cluster analysis, this method can select an effective feature matrix from the original high-dimensional spectral data, and the SVM model is both efficient and accurate, providing a new approach to discriminating paper evidence in forensic practice.

16.
In general, finding a one-dimensional representation of the kinetics of a high-dimensional system greatly simplifies the study of complex systems. Here, we propose a method to obtain a reaction coordinate whose potential of mean force can reproduce the commitment probability distribution of the multidimensional surface. We prove that such a relevant one-dimensional representation can be readily calculated from the equilibrium distribution of commitment probabilities, which can be obtained from simulations. We also show that this representation is complementary to a previously proposed one-dimensional representation based on a quadratic approximation of the potential energy surface. The usefulness of the method is examined with dynamics in a two-dimensional system, showing that the one-dimensional surface thus obtained can predict the existence of an intermediate and the occurrence of path switching without a priori knowledge of the morphology of the original surface. The applicability of the method to more complex and realistic reactions, such as protein folding, is also discussed.

17.
Protein dynamics evolves in a high-dimensional space comprising anharmonic, strongly correlated motional modes. Such correlation often plays an important role in analyzing protein function. In order to identify significantly correlated collective motions, we employ independent subspace analysis, based on the subspace joint approximate diagonalization of eigenmatrices algorithm, for the analysis of molecular dynamics (MD) simulation trajectories. From a 100 ns MD simulation of T4 lysozyme, we extract several independent subspaces in each of which the collective modes are significantly correlated, and identify the remaining modes as independent. This method successfully detects the modes along which long-tailed non-Gaussian probability distributions are obtained. Based on time cross-correlation analysis, we identified a series of events among domain motions and more localized motions in the protein, indicating connections between functionally relevant phenomena that had been revealed independently by experiments.

18.
Over the last couple of years an overwhelming amount of data has emerged in the field of biomolecular structure determination. Clustering techniques can be used to explore the information hidden in these structure databases. The outcome of clustering experiments depends largely, among other factors, on the way the data is represented; the choice of how to represent the molecular structure information is therefore extremely important. This article describes the influence of different representations on the clustering and how it can be analyzed by means of a dendrogram comparison method. All experiments are performed on a data set consisting of RNA trinucleotides. Besides the most basic structure representation, the Cartesian coordinates representation, several other structure representations are used.

19.
A configurational CAST (CAnonical representation of STereochemistry) coding method, which represents relative and absolute configuration, is described. The configurational CAST codes are constructed by canonical rotation of the dihedral angles of the input structure before the CAST codes are assigned. Using configurational CAST, configurational differences can be distinguished independently of conformational differences. Representation of enantiomers is also achieved by a mirror-image conversion method. The CAST representation shows the distinctive characteristics of the several diastereomers and conformers that were examined, and clearly represents the differences in their configurations. Applications to organic molecules with complex stereochemistry are also demonstrated.

20.
Multivariance in science and engineering causes problematic situations in both continuous and discrete cases. One way to overcome such situations is to decrease the multivariance level of the problem with a divide-and-conquer based method. Enhanced Multivariance Product Representation (EMPR) plays this role successfully: it provides a finite expansion that represents a multivariate function in terms of less-variate functions with the assistance of univariate support functions. This work proposes a new EMPR-based algorithm with two new features: it improves the determination of each expansion component through the Fluctuation Free Integration method, an efficient method for evaluating multiple integrals through a universal matrix representation, and it increases the approximation quality by inserting a piecewise structure into the standard EMPR algorithm. This new method is called Fluctuation Free Integration based piecewise EMPR. Numerical implementations are given to examine the performance of the proposed method.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号