Similar Documents
A total of 20 similar documents were retrieved (search time: 921 ms).
1.
Producing good low‐dimensional representations of high‐dimensional data is a common and important task in many data mining applications. Two methods that have been particularly useful in this regard are multidimensional scaling and nonlinear mapping. These methods attempt to visualize a set of objects described by means of a dissimilarity or distance matrix on a low‐dimensional display plane in a way that preserves the proximities of the objects to whatever extent is possible. Unfortunately, most known algorithms are of quadratic order, and their use has been limited to relatively small data sets. We recently demonstrated that nonlinear maps derived from a small random sample of a large data set exhibit the same structure and characteristics as those of the entire collection, and that this structure can be easily extracted by a neural network, making possible the scaling of data sets orders of magnitude larger than those accessible with conventional methodologies. Here, we present a variant of this algorithm based on local learning. The method employs a fuzzy clustering methodology to partition the data space into a set of Voronoi polyhedra, and uses a separate neural network to perform the nonlinear mapping within each cell. We find that this local approach offers a number of advantages, and produces maps that are virtually indistinguishable from those derived with conventional algorithms. These advantages are discussed using examples from the fields of combinatorial chemistry and optical character recognition. © 2001 John Wiley & Sons, Inc. J Comput Chem 22: 373–386, 2001

2.
3.
We introduce stochastic proximity embedding (SPE), a novel self-organizing algorithm for producing meaningful underlying dimensions from proximity data. SPE attempts to generate low-dimensional Euclidean embeddings that best preserve the similarities between a set of related observations. The method starts with an initial configuration, and iteratively refines it by repeatedly selecting pairs of objects at random, and adjusting their coordinates so that their distances on the map match more closely their respective proximities. The magnitude of these adjustments is controlled by a learning rate parameter, which decreases during the course of the simulation to avoid oscillatory behavior. Unlike classical multidimensional scaling (MDS) and nonlinear mapping (NLM), SPE scales linearly with respect to sample size, and can be applied to very large data sets that are intractable by conventional embedding procedures. The method is programmatically simple, robust, and convergent, and can be applied to a wide range of scientific problems involving exploratory data analysis and visualization.
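To make the update rule concrete, here is a minimal NumPy sketch of the stochastic pairwise refinement described above. It is illustrative only: the function name, the geometric learning-rate schedule, and the epsilon guard against zero distances are assumptions, not the published implementation.

```python
import numpy as np

def spe_embed(R, dim=2, n_cycles=100, n_updates=10000,
              lr_start=1.0, lr_end=0.01, eps=1e-8, seed=0):
    """Embed an n x n proximity matrix R into `dim` dimensions by
    stochastic pairwise refinement (illustrative SPE-style sketch)."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    X = rng.uniform(0.0, 1.0, size=(n, dim))    # random initial configuration
    lr = lr_start
    decay = (lr_end / lr_start) ** (1.0 / n_cycles)
    for _ in range(n_cycles):
        for _ in range(n_updates):
            i, j = rng.choice(n, size=2, replace=False)
            diff = X[i] - X[j]
            d = np.linalg.norm(diff) + eps      # current map distance
            # move the pair so its map distance approaches the proximity R[i, j]
            step = 0.5 * lr * (R[i, j] - d) / d * diff
            X[i] += step
            X[j] -= step
        lr *= decay                             # decreasing learning rate
    return X
```

Given an n × n proximity matrix `R`, `spe_embed(R)` returns an n × 2 configuration whose pairwise distances approximate the entries of `R`.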

4.
Among the many dimensionality reduction techniques that have appeared in the statistical literature, multidimensional scaling and nonlinear mapping are unique for their conceptual simplicity and ability to reproduce the topology and structure of the data space in a faithful and unbiased manner. However, a major shortcoming of these methods is their quadratic dependence on the number of objects scaled, which imposes severe limitations on the size of data sets that can be effectively manipulated. Here we describe a novel approach that combines conventional nonlinear mapping techniques with feed-forward neural networks, and allows the processing of data sets orders of magnitude larger than those accessible with conventional methodologies. Rooted in the principle of probability sampling, the method employs a classical algorithm to project a small random sample, and then "learns" the underlying nonlinear transform using a multilayer neural network trained with the back-propagation algorithm. Once trained, the neural network can be used in a feed-forward manner to project the remaining members of the population as well as new, unseen samples with minimal distortion. Using examples from the fields of image processing and combinatorial chemistry, we demonstrate that this method can generate projections that are virtually indistinguishable from those derived by conventional approaches. The ability to encode the nonlinear transform in the form of a neural network makes nonlinear mapping applicable to a wide variety of data mining applications involving very large data sets that are otherwise computationally intractable.
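The sample-then-learn idea can be sketched with off-the-shelf components. The sketch below uses scikit-learn's SMACOF-based MDS and MLPRegressor as stand-ins for the classical projection and the back-propagation network, so the sample size, layer sizes, and other settings are assumptions rather than the authors' choices.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.neural_network import MLPRegressor

def nn_nonlinear_map(X, n_sample=1000, dim=2, seed=0):
    """Project a random sample with a classical (quadratic-cost) embedding,
    then train a feed-forward network to reproduce the transform for all points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_sample, len(X)), replace=False)
    # classical embedding of the small sample only
    Y_sample = MDS(n_components=dim, random_state=seed).fit_transform(X[idx])
    # learn the high-dimensional -> low-dimensional transform
    net = MLPRegressor(hidden_layer_sizes=(30,), max_iter=5000,
                       random_state=seed).fit(X[idx], Y_sample)
    # feed-forward projection of the entire population (and of unseen samples)
    return net.predict(X)
```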

5.
With the advancement of modern techniques, complex‐valued data have become more important in chemistry and many other areas. The data collected are often multi‐dimensional. This imposes an increasing demand on the tools used for the analysis of complex‐valued data. In multivariate data analysis, projection pursuit is a useful and important technique that in many cases gives better results than principal component analysis. One important projection pursuit variant uses the real‐valued kurtosis as its projection index and has been shown to be a powerful approach to address different problems. However, using the complex‐valued kurtosis as a projection index to deal with complex‐valued data is rare. This is, to a great extent, due to the lack of simple and fast optimization algorithms. In this work, simple and rapidly executed optimization algorithms for the complex‐valued kurtosis used as a projection index are proposed. The developed algorithms have a variety of advantages: no requirement for sphering or strong‐uncorrelation transformation of the data in advance, no assumption that the latent components (source signals) are circular or non‐circular, the ability to search for maxima or minima according to users' requirements, and the option for users to choose uncorrelated scores or orthogonal projection vectors. The mathematical development of the algorithms is described, and simulated and real experimental data are employed to demonstrate the utility of the proposed algorithms. Copyright © 2015 John Wiley & Sons, Ltd.

6.
In this article, a new molecular alignment procedure to provide general‐purpose, fast, automatic, and user‐intuitive three‐dimensional molecular alignments is presented. This procedure, called the Topo‐Geometrical Superposition Approach (TGSA), is based only on comparisons of atom types and interatomic distances; hence, the procedure can handle large molecular sets within affordable computational costs. The method is able to accurately align 3D structures using the common molecular substructures, as inferred by the bonding pattern (atom correspondences), where present. The algorithm has been implemented into a program named TGSA99, and it has been tested on eight different molecular sets: flavylium salts, amino acids, indole derivatives, AZT, steroids, anilide derivatives, polyaromatic hydrocarbons, and inhibitors of thrombin. The TGSA algorithm performance is evaluated by means of computational time, number of superposed atoms, and index of fit between the compared structures. © 2000 John Wiley & Sons, Inc. J Comput Chem 22: 255–263, 2001

7.
A fast self-organizing algorithm for extracting the minimum number of independent variables that can fully describe a set of observations was recently described (Agrafiotis, D. K.; Xu, H. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 15869-15872). The method, called stochastic proximity embedding (SPE), attempts to generate low-dimensional Euclidean maps that best preserve the similarities between a set of related objects. Unlike conventional multidimensional scaling (MDS) and nonlinear mapping (NLM), SPE preserves only local relationships and, by doing so, reveals the intrinsic dimensionality and metric structure of the data. Its success depends critically on the choice of the neighborhood radius, which should be consistent with the local curvature of the underlying manifold. Here, we describe a procedure for determining that radius by examining the tradeoff between the stress function and the number of connected components in the neighborhood graph and show that it can be used to produce meaningful maps in any embedding dimension. The power of the algorithm is illustrated in two major areas of computational drug design: conformational analysis and diversity profiling of large chemical libraries.
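The radius-selection criterion above balances embedding stress against fragmentation of the neighborhood graph. The helper below sketches only the second ingredient, counting connected components for a range of candidate radii; the stress term would come from running an SPE embedding at each radius, which is omitted here. Function and parameter names are assumptions made for the example.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

def scan_neighborhood_radii(X, radii):
    """For each candidate radius, report how many connected components the
    neighborhood graph breaks into (one ingredient of the radius tradeoff)."""
    results = []
    for r in radii:
        G = radius_neighbors_graph(X, radius=r, mode="connectivity")
        n_comp, _ = connected_components(G, directed=False)
        results.append((r, n_comp))
    return results
```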

8.
Advances in sensory systems have led to many industrial applications with large amounts of highly correlated data, particularly in chemical and pharmaceutical processes. With these correlated data sets, it becomes important to consider advanced modeling approaches built to deal with correlated inputs in order to understand the underlying sources of variability and how this variability will affect the final quality of the product. In addition to the correlated nature of these data sets, it is also common to find missing elements and noise in these data matrices. Latent variable regression methods such as partial least squares or projection to latent structures (PLS) have gained much attention in industry for their ability to handle ill‐conditioned matrices with missing elements. This feature of the PLS method is accomplished through the nonlinear iterative PLS (NIPALS) algorithm, with a simple modification to consider the missing data. Moreover, in expectation maximization PLS (EM‐PLS), imputed values are provided for missing data elements as initial estimates, conventional PLS is then applied to update these elements, and the process iterates to convergence. This study is an extension of previous work for principal component analysis (PCA), where we introduced nonlinear programming (NLP) as a means to estimate the parameters of the PCA model. Here, we focus on the parameters of a PLS model. As an alternative to modified NIPALS and EM‐PLS, this paper presents an efficient NLP‐based technique to find model parameters for PLS, where the desired properties of the parameters can be explicitly posed as constraints in the optimization problem of the proposed algorithm. We also present a number of simulation studies, where we compare the effectiveness of the proposed algorithm with competing algorithms. Copyright © 2014 John Wiley & Sons, Ltd.
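For orientation, the NIPALS iteration that the NLP technique is positioned against looks roughly as follows for a single PLS component on complete, column-centered data; this is the textbook algorithm, not the paper's NLP formulation or its missing-data modification.

```python
import numpy as np

def nipals_pls_component(X, Y, tol=1e-10, max_iter=500):
    """One PLS component via classical NIPALS. X is n x m, Y is n x k,
    both column mean-centered; complete data assumed."""
    u = Y[:, [0]]                          # start from a column of Y
    for _ in range(max_iter):
        w = X.T @ u / (u.T @ u)            # X weights
        w /= np.linalg.norm(w)
        t = X @ w                          # X scores
        q = Y.T @ t / (t.T @ t)            # Y loadings
        u_new = Y @ q / (q.T @ q)          # Y scores
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break
        u = u_new
    p = X.T @ t / (t.T @ t)                # X loadings
    return t, p, w, q, u
```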

9.
Processing plants can produce large amounts of data that process engineers use for analysis, monitoring, or control. Principal component analysis (PCA) is well suited to analyzing large amounts of (possibly) correlated data and to reducing the dimensionality of the variable space. Failing online sensors, lost historical data, or missing experiments can lead to data sets with missing values, for which the current methods for obtaining the PCA model parameters may give questionable results due to the properties of the estimated parameters. This paper proposes a method based on nonlinear programming (NLP) techniques to obtain the parameters of PCA models in the presence of incomplete data sets. We show the relationship that exists between the nonlinear iterative partial least squares (NIPALS) algorithm and the optimality conditions of the squared residuals minimization problem, and how this leads to the modified NIPALS used for the missing value problem. Moreover, we compare the current NIPALS‐based methods with the proposed NLP approach using a simulation example and an industrial case study, and show how the latter is better suited when there are large amounts of missing values. The solutions obtained with the NLP and the iterative algorithm (IA) are very similar. However, when using the NLP‐based method, the loadings and scores are guaranteed to be orthogonal, and the scores will have zero mean. The latter is emphasized in the industrial case study. Also, with the industrial data used here we are able to show that the models obtained with the NLP were easier to interpret. Moreover, the NLP required many fewer iterations to obtain them. Copyright © 2010 John Wiley & Sons, Ltd.
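The modified NIPALS referred to above handles missing values by restricting every regression in the iteration to the observed entries. A minimal single-component sketch (a simplified iterative-algorithm baseline, not the proposed NLP method) might look like this, with NaNs marking missing cells:

```python
import numpy as np

def nipals_pca_missing(X, tol=1e-9, max_iter=1000):
    """One principal component of a column-centered matrix X containing NaNs:
    each regression uses only the observed entries of the relevant row/column."""
    M = ~np.isnan(X)                       # mask of observed entries
    Xf = np.where(M, X, 0.0)               # missing entries contribute zero
    t = Xf[:, np.argmax(M.sum(axis=0))].copy()   # start from the densest column
    for _ in range(max_iter):
        # loadings: regress each column on t over its observed rows
        p = (Xf.T @ t) / (M.T @ (t * t))
        p /= np.linalg.norm(p)
        # scores: regress each row on p over its observed columns
        t_new = (Xf @ p) / (M @ (p * p))
        if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
            t = t_new
            break
        t = t_new
    return t, p
```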

10.
In chemometrics, the supervised and unsupervised classification of high‐dimensional data has become a recurrent problem. Model‐based techniques for discriminant analysis and clustering are popular tools, which are renowned for their probabilistic foundations and their flexibility. However, classical model‐based techniques show disappointing behaviour in high‐dimensional spaces, which has until now limited their use within chemometrics. Recent developments in model‐based classification have overcome these drawbacks and enable the efficient classification of high‐dimensional data, even in the 'small n / large p' condition. This work presents a comprehensive review of these recent approaches, including regularization‐based techniques, parsimonious modelling, subspace classification methods and classification methods based on variable selection. The use of these model‐based methods is also illustrated on real‐world classification problems in chemometrics using R packages. Copyright © 2013 John Wiley & Sons, Ltd.

11.
12.
A two‐dimensional diagram is proposed, in which the carbon number of each formula is plotted against its nominal mass, to visualize large sets of molecular formula data that can be derived from data generated by ultrahigh‐resolution Fourier transform ion cyclotron resonance‐MS. In such a carbon versus mass (CvM) diagram, each formula (CcHhOo) is unambiguously described by c, its (nominal) mass, and the parameter i = c + o. Calculations of chemically allowable formulas illustrate that organic molecules occupy only certain spaces in such a diagram. The extension of these spaces increases with molecular mass in the x‐direction (hydrogenation) and the y‐direction (oxygenation). The data sets of molecules determined in natural organic matter (NOM) occupy only a certain range of the allowable space. The intensity of the mass spectrometric signals can be included as the third dimension in a CvM diagram. Separate CvM diagrams can be plotted for NOM molecules that include different heteroatoms. The benefits of the CvM diagram are illustrated by application to data sets of fulvic acids of riverine and marine origin, of secondary organic aerosol, including organosulfates and organonitrates, as well as of the ozonation of fulvic acids. The CvM diagram is a useful tool to visualize the elemental regularities in NOM isolates as well as the differences between isolates. It may also be applicable to large sets of molecular formula data generated in other disciplines such as petroleum biogeochemistry or metabolomics. Copyright © 2010 John Wiley & Sons, Ltd.
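As an illustration of what a CvM diagram involves computationally, the following matplotlib sketch plots carbon number against nominal mass with signal intensity as the color (third) dimension; the tuple layout of the input is an assumption made for the example.

```python
import matplotlib.pyplot as plt

def plot_cvm(formulas):
    """Plot a carbon-versus-mass (CvM) diagram.
    `formulas` is an iterable of (carbon_count, nominal_mass, intensity)."""
    c, mass, intensity = zip(*formulas)
    sc = plt.scatter(mass, c, c=intensity, s=12, cmap="viridis")
    plt.colorbar(sc, label="relative signal intensity")   # third dimension
    plt.xlabel("nominal mass")
    plt.ylabel("carbon number")
    plt.title("CvM diagram")
    plt.show()
```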

13.
The same experimental data can often be equally well described by multiple mathematically equivalent kinetic schemes. In the present work, we investigate several model‐fitting algorithms and their ability to distinguish between mechanisms and derive the correct kinetic parameters for several different reaction classes involving consecutive reactions. We have conducted numerical experiments using synthetic experimental data for six classes of consecutive reactions involving different combinations of first‐ and second‐order processes. The synthetic data mimic time‐dependent absorption data as would be obtained from spectroscopic investigations of chemical kinetic processes. The connections between mathematically equivalent solutions are investigated, and analytical expressions describing these connections are derived. Ten optimization algorithms based on nonlinear least squares methods are compared in terms of their computational cost and frequency of convergence to global solutions. Performance is discussed, and a preferred method is recommended. A response surface visualization technique of projecting five‐dimensional data onto the three‐dimensional search space of the minimal function values is developed.
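To show the kind of fitting problem involved, here is a hedged sketch for the simplest consecutive scheme, two first-order steps A → B → C, fitted to a single absorbance-like trace with SciPy's least-squares solver; it assumes k1 ≠ k2 and does not reproduce the paper's six reaction classes or its comparison of ten optimizers.

```python
import numpy as np
from scipy.optimize import least_squares

def concentrations(t, k1, k2, a0=1.0):
    """Closed-form A -> B -> C profiles for two first-order consecutive steps
    (assumes k1 != k2)."""
    a = a0 * np.exp(-k1 * t)
    b = a0 * k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
    return a, b, a0 - a - b

def fit_rates(t, signal, eps_a, eps_b, eps_c, k_guess=(1.0, 0.1)):
    """Least-squares fit of (k1, k2) to a single-wavelength absorbance trace,
    given molar absorptivities eps_a, eps_b, eps_c of the three species."""
    def residual(k):
        a, b, c = concentrations(t, *k)
        return eps_a * a + eps_b * b + eps_c * c - signal
    return least_squares(residual, x0=k_guess, bounds=(0, np.inf)).x
```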

14.
We present a novel method for the local optimization of molecular complexes. This new approach is especially suited for use in molecular docking. In molecular modeling, molecules are often described employing a compact representation to reduce the number of degrees of freedom. This compact representation is realized by fixing bond lengths and angles while permitting changes in translation, orientation, and selected dihedral angles. Gradient‐based energy minimization of molecular complexes using this representation suffers from well‐known singularities arising during the optimization process. We suggest an approach, new in the field of structure optimization, that allows gradient‐based optimization algorithms to be employed with such a compact representation. We propose to use exponential mapping to define the molecular orientation, which facilitates calculating the orientational gradient. To avoid singularities of this parametrization, the local minimization algorithm is modified to efficiently change the orientational parameters while preserving the molecular orientation, i.e., we perform well‐defined jumps on the objective function. Our approach is applicable to continuous, but not necessarily differentiable, objective functions. We evaluated our new method by optimizing several ligands with an increasing number of internal degrees of freedom in the presence of large receptors. In comparison to the method of Solis and Wets in the challenging case of a non‐differentiable scoring function, our proposed method leads to substantially improved results in all test cases, i.e., we obtain better scores in fewer steps for all complexes. © 2008 Wiley Periodicals, Inc. J Comput Chem, 2009
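The exponential-map parametrization itself is compact: a 3-vector encodes the rotation axis and angle, and Rodrigues' formula converts it to a rotation matrix, as in the sketch below. The singularity-avoiding parameter jumps described above are not shown.

```python
import numpy as np

def exp_map_to_rotation(v):
    """Rodrigues' formula: map an axis-angle 3-vector v (angle = |v|,
    axis = v/|v|) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)                      # near-zero rotation
    kx, ky, kz = v / theta
    K = np.array([[0.0, -kz,  ky],
                  [ kz, 0.0, -kx],
                  [-ky,  kx, 0.0]])           # cross-product (skew) matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```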

15.
Stochastic proximity embedding (SPE) was developed as a method for efficiently calculating lower-dimensional embeddings of high-dimensional data sets. Rather than using a global minimization scheme, SPE relies upon updating the distances of randomly selected points in an iterative fashion. This was found to generate embeddings of comparable quality to those obtained using classical multidimensional scaling algorithms. However, SPE is able to obtain these results in O(n) rather than O(n²) time and thus is much better suited to large data sets. In an effort both to speed up SPE and utilize it for even larger problems, we have created a multithreaded implementation which takes advantage of the growing general computing power of graphics processing units (GPUs). The use of GPUs allows the embedding of data sets containing millions of data points in interactive time scales.

16.
New algorithms for iterative diagonalization procedures that solve for a small set of eigenstates of a large matrix are described. The performance of the algorithms is illustrated by calculations of low- and high-lying ionized and electronically excited states using equation‐of‐motion coupled‐cluster methods with single and double substitutions (EOM‐IP‐CCSD and EOM‐EE‐CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum‐chemical calculations. The second is a recently developed solver, called the "Generalized Preconditioned Locally Harmonic Residual (GPLHR) method." We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM‐CC suite of methods in the Q‐Chem electronic structure program. © 2014 Wiley Periodicals, Inc.
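For readers unfamiliar with the base procedure, the following is a textbook Davidson iteration for the lowest eigenpair of a symmetric matrix, showing the subspace-expansion and diagonal-preconditioning pattern; the energy-shifted interior-eigenvalue solvers and the GPLHR method discussed above are not reproduced.

```python
import numpy as np

def davidson_lowest(A, tol=1e-8, max_iter=200):
    """Lowest eigenpair of a symmetric matrix A by a basic Davidson iteration."""
    n = A.shape[0]
    diag = np.diag(A)
    V = np.eye(n, 1)                         # initial guess vector
    for _ in range(max_iter):
        V, _ = np.linalg.qr(V)               # orthonormalize the search subspace
        H = V.T @ A @ V                      # subspace (Rayleigh) matrix
        theta, S = np.linalg.eigh(H)
        x = V @ S[:, 0]                      # Ritz vector for the lowest root
        r = A @ x - theta[0] * x             # residual
        if np.linalg.norm(r) < tol:
            return theta[0], x
        denom = diag - theta[0]
        denom[np.abs(denom) < 1e-10] = 1e-10
        t = r / denom                        # diagonal (Davidson) preconditioner
        V = np.column_stack([V, t])          # expand the subspace
    return theta[0], x
```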

17.
18.
Two algorithms are introduced that show exceptional promise in finding molecular conformations using distance geometry on nuclear magnetic resonance data. The first algorithm is a gradient version of the majorization algorithm from multidimensional scaling. The main contribution is a large decrease in CPU time. The second algorithm iterates between possible conformations obtained from the first algorithm and permissible data points near the configuration. These ideas are similar to alternating least squares or alternating projections onto convex sets. The iterations significantly improve the conformation from the first algorithm when applied to the small peptide E. coli STh enterotoxin. © 1993 John Wiley & Sons, Inc.
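The majorization algorithm that the first method accelerates is the familiar Guttman-transform (SMACOF) update from metric MDS, sketched below for uniform weights; the gradient acceleration and the alternating-projection step onto the NMR distance constraints are not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def smacof(D, dim=3, n_iter=200, seed=0):
    """Metric MDS by majorization: repeated Guttman-transform updates toward
    a configuration whose pairwise distances match the target matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, dim))        # random starting configuration
    for _ in range(n_iter):
        d = squareform(pdist(X))             # current pairwise distances
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(d > 0, -D / d, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))  # row sums define the diagonal
        X = B @ X / n                        # Guttman transform
    return X
```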

19.
Several parallel algorithms for Fock matrix construction are described. The algorithms calculate only the unique integrals, distribute the Fock and density matrices over the processors of a massively parallel computer, use blocking techniques to construct the distributed data structures, and use clustering techniques on each processor to maximize data reuse. Algorithms based on both square and row-blocked distributions of the Fock and density matrices are described and evaluated. Variants of the algorithms are discussed that use either triple-sort or canonical ordering of integrals, and dynamic or static task clustering schemes. The algorithms are shown to adapt to screening, with communication volume scaling down with computation costs. Modeling techniques are used to characterize algorithm performance. Given the characteristics of existing massively parallel computers, all the algorithms are shown to be highly efficient for problems of moderate size. The algorithms using the row-blocked data distribution are the most efficient. © 1996 by John Wiley & Sons, Inc.

20.
Underdetermined blind separation of nonnegative dependent sources consists in decomposing a set of observed mixed signals into a greater number of original nonnegative and dependent component (source) signals. That is an important problem for which very few algorithms exist. It is also practically relevant for contemporary metabolic profiling of biological samples, such as biomarker identification studies, where sources (a.k.a. pure components or analytes) are to be extracted from mass spectra of complex multicomponent mixtures. This paper presents a method for underdetermined blind separation of nonnegative dependent sources. The method performs nonlinear mixture‐wise mapping of observed data into a high‐dimensional reproducing kernel Hilbert space (RKHS) of functions and sparseness‐constrained nonnegative matrix factorization (NMF) therein. Thus, the original problem is converted into a new one with an increased number of mixtures, an increased number of dependent sources, and higher‐order (error) terms generated by the nonlinear mapping. Provided that the amplitudes of the original components are sparsely distributed, which is the case for mass spectra of analytes, sparseness‐constrained NMF in RKHS yields, with significant probability, improved accuracy relative to the case when the same NMF algorithm is performed on the original problem. The method is exemplified on numerical and experimental examples related respectively to the extraction of 10 dependent components from five mixtures and to the extraction of 10 dependent analytes from mass spectra of two to five mixtures. Thereby, the analytes mimic the complexity of components expected to be found in biological samples. Copyright © 2013 John Wiley & Sons, Ltd.
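A generic sparseness-constrained NMF with multiplicative updates and an L1 penalty on the source matrix conveys the factorization step; the preceding nonlinear RKHS mapping, which is the paper's key contribution, is not reproduced here, and the penalty form and parameter names are assumptions.

```python
import numpy as np

def sparse_nmf(V, rank, sparsity=0.1, n_iter=500, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (mixtures x features) as W @ H using
    multiplicative updates with an L1 penalty promoting sparse sources H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # update sources; the sparsity penalty enters the denominator
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        # standard multiplicative update for the mixing matrix
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```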
