Similar literature
1.
Principal component analysis (PCA) and other multivariate analysis methods have been used increasingly to analyse and understand depth profiles in X-ray photoelectron spectroscopy (XPS), Auger electron spectroscopy (AES) and secondary ion mass spectrometry (SIMS). These methods have proved as useful in fundamental studies as in applied work, where speed of interpretation is very valuable. Until now these methods have been difficult to apply to very large datasets such as spectra associated with 2D images or 3D depth profiles. Existing algorithms for computing PCA matrices have been either too slow or demanded more memory than is available on desktop PCs. This often forces analysts to ‘bin’ spectra on a much coarser grid than they would like, perhaps even to unity mass bins even though much higher resolution is available, or to select only part of an image for PCA, even though PCA of the full data would be preferred. We apply the new ‘random vectors’ method of singular value decomposition proposed by Halko and co-authors to time-of-flight SIMS (ToF-SIMS) data for the first time. This increases the speed of calculation by a factor of several hundred, making PCA of these datasets practical on desktop PCs for the first time. For large images or 3D depth profiles we have implemented a version of this algorithm which minimises memory needs, so that even datasets too large to store in memory can be processed into PCA results on an ordinary PC with a few gigabytes of memory in a few hours. We present results from ToF-SIMS imaging of a citrate crystal and a basalt rock sample, the largest of which is 134 GB in file size, corresponding to 67 111 mass values at each of 512 × 512 pixels. This was processed into 100 PCA components in six hours on a conventional Windows desktop PC. © 2015 The Authors. Surface and Interface Analysis published by John Wiley & Sons Ltd.
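The ‘random vectors’ idea in the abstract above can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function name `randomized_pca`, its parameters (`n_oversamples`, `n_power_iter`, `seed`) and the choice of mean-centring are assumptions made for illustration, and the memory-minimising out-of-core variant the abstract mentions is not shown.

```python
import numpy as np

def randomized_pca(X, n_components, n_oversamples=10, n_power_iter=2, seed=0):
    """Approximate PCA of X (rows = pixels, columns = mass channels).

    Hypothetical helper: a basic randomized SVD in the spirit of Halko et al.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # mean-centre each channel
    k = min(n_components + n_oversamples, min(Xc.shape))
    omega = rng.standard_normal((Xc.shape[1], k)) # random test vectors
    Y = Xc @ omega                                # sample the range of Xc
    for _ in range(n_power_iter):                 # power iterations sharpen the basis
        Y, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(Xc.T @ Y)
        Y = Xc @ Z
    Q, _ = np.linalg.qr(Y)                        # orthonormal basis for the sampled range
    B = Q.T @ Xc                                  # small (k x channels) matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    scores = (Q @ U_small)[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components]
    return scores, loadings, s[:n_components]
```

The expensive full SVD is replaced by a few matrix products with a tall, thin matrix plus an SVD of the small matrix `B`, which is where the speed-up comes from in this kind of scheme.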

2.
The transition operator method combined with second-order, self-energy corrections to the electron propagator (TOEP2) may be used to calculate valence and core-electron binding energies. This method is tested on a set of molecules to assess its predictive quality. For valence ionization energies, well-known methods that include third-order terms achieve somewhat higher accuracy, but only with much higher demands for memory and arithmetic operations. Therefore, we propose the use of the TOEP2 method for the calculation of valence electron binding energies in large molecules where third-order methods are infeasible. For core-electron binding energies, TOEP2 results exhibit superior accuracy and efficiency and are relatively insensitive to the fractional occupation numbers that are assigned to the transition orbital.

3.
Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler, Mutual Information, and distance information, (2) a text mining technique, TF-IDF, and (3) minimum redundancy-maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method to optimize classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of all methods for N=6 at all taxonomic levels.

4.
Despite great advances in X-ray absorption spectroscopy for the investigation of small molecule electronic structure, the application to biosystems of experimental techniques developed within this research field remains a challenge. To partially circumvent the problem, users resort to theoretical methods to interpret or predict the X-ray absorption spectra of large molecules. To accomplish this task, only low-cost computational strategies can be exploited. For this reason, some of them are single Slater determinant wavefunction approaches coupled with multiscale embedding techniques designed to treat large systems of biological interest. Therefore, in this work, we propose to apply the recently developed IMOM/ELMO embedding method to the determination of core-ionized states. The IMOM/ELMO technique resulted from the combination of the single Slater determinant Δself-consistent-field-initial maximum overlap approach (ΔSCF-IMOM) with the QM/ELMO (quantum mechanics/extremely localized molecular orbital) embedding strategy, a method where only the chemically relevant region of the examined system is treated at a fully quantum chemical level, while the rest is described through transferred and frozen extremely localized molecular orbitals (ELMOs). The IMOM/ELMO technique was initially validated by computing core-ionization energies for small molecules, and it was afterwards exploited to study larger biosystems. The obtained results are in line with those reported in previous studies that applied alternative ΔSCF approaches. This makes us envisage a possible future application of the proposed method to the interpretation of X-ray absorption spectra of large molecules.

5.
Rift Valley fever virus (RVFV) is a potent human and livestock pathogen endemic to sub-Saharan Africa and the Arabian Peninsula that has the potential to spread to other parts of the world. Although there is no proven effective and safe treatment for RVFV infections, a potential therapeutic target is the virally encoded nucleocapsid protein (N). During the course of infection, N binds to viral RNA, and perturbation of this interaction can inhibit viral replication. To gain insight into how N recognizes viral RNA specifically, we designed an algorithm that uses a distance matrix and multidimensional scaling to compare the predicted secondary structures of known N-binding RNAs, or aptamers, that were isolated and characterized in a previous in vitro evolution experiment. These aptamers did not exhibit overt sequence or predicted structure similarity, so we employed bioinformatic methods to propose novel aptamers based on analysis and clustering of secondary structures. We screened and scored the predicted secondary structures of novel randomly generated RNA sequences in silico and selected several of these putative N-binding RNAs whose secondary structures were similar to those of known N-binding RNAs. We found that overall the in silico generated RNA sequences bound well to N in vitro. Furthermore, introduction of these RNAs into cells prior to infection with RVFV inhibited viral replication in cell culture. This proof-of-concept study demonstrates how the predictive power of bioinformatics and the empirical power of biochemistry can be jointly harnessed to discover, synthesize, and test new RNA sequences that bind tightly to RVFV N protein. The approach would be easily generalizable to other applications.
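The ‘distance matrix and multidimensional scaling’ step mentioned above can be sketched as classical (Torgerson) MDS. The abstract does not say which MDS variant or which structure distance the authors used, so the function `classical_mds` and the assumption that the pairwise distances are already available in a symmetric matrix `D` are illustrative only.

```python
import numpy as np

def classical_mds(D, n_dims=2):
    """Embed n items in n_dims dimensions from an (n x n) distance matrix D.

    Hypothetical helper implementing classical (Torgerson) MDS.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    evals, evecs = np.linalg.eigh(B)              # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_dims]      # keep the largest n_dims
    lam = np.clip(evals[order], 0.0, None)        # guard against tiny negative values
    return evecs[:, order] * np.sqrt(lam)         # coordinates, one row per item
```

Items that end up close together in the returned coordinates have small pairwise distances in `D`, which is what makes the embedding usable for visual comparison or clustering.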

6.
Relaxation times (T1, T2, T1rho) are usually evaluated from exponential decay data by least-squares fitting methods. For this procedure, the integrals or amplitudes of signals must be determined, which can be laborious with large data sets. Moreover, the fitting requires a priori knowledge of the number of exponential components responsible for the decay. We have adapted inverse Laplace transformation (ILT) for the analysis of relaxation data. Exponential components are resolved with ILT to reciprocal space on their corresponding relaxation rate values. The ILT approach was applied to 3D linewidth-resolved 15N HSQC experiments to evaluate 15N T1 and T2 relaxation times of ubiquitin. The resulting spectrum is a true 3D spectrum, where the signals are separated by their 1H and 15N chemical shifts (HSQC correlations) and by their relaxation rate values (R1 or R2). From this spectrum, the relaxation times can be obtained directly with a simple peak-picking procedure.

7.
As several structural proteomic projects are producing an increasing number of protein structures with unknown function, methods that can reliably predict protein functions from protein structures are urgently needed. In this paper, we present a method to explore the clustering patterns of amino acids in three-dimensional space for protein function prediction. First, amino acid residues on a protein structure are clustered into spatial groups using hierarchical agglomerative clustering, based on the distance between them. Second, the protein structure is represented using a graph, where each node denotes a cluster of amino acids. The nodes are labeled with an evolutionary profile derived from the multiple alignment of homologous sequences. Then, a shortest-path graph kernel is used to calculate similarities between the graphs. Finally, a support vector machine using this graph kernel is used to train classifiers for protein function prediction. We applied the proposed method to two separate problems, namely, prediction of enzymes and prediction of DNA-binding proteins. In both cases, the results showed that the proposed method outperformed other state-of-the-art methods.

8.
Yam C, Zhang Q, Wang F, Chen G. Chemical Society Reviews, 2012, 41(10): 3821-3838.
The poor scaling of many existing quantum mechanical methods with respect to the system size hinders their application to large systems. In this tutorial review, we focus on the latest research on linear-scaling or O(N) quantum mechanical methods for excited states. Based on the locality of quantum mechanical systems, O(N) quantum mechanical methods for excited states fall into two categories, the time-domain and frequency-domain methods. The former solve the dynamics of the electronic systems in real time, while the latter involve direct evaluation of the electronic response in the frequency domain. The localized density matrix (LDM) method is the first and most mature linear-scaling quantum mechanical method for excited states. It has been implemented in both the time and frequency domains. The O(N) time-domain methods also include the approach that solves the time-dependent Kohn-Sham (TDKS) equation using the non-orthogonal localized molecular orbitals (NOLMOs). Besides the frequency-domain LDM method, other O(N) frequency-domain methods have been proposed and implemented at the first-principles level. Except for one-dimensional or quasi-one-dimensional systems, the O(N) frequency-domain methods are often not applicable to resonant responses because of the convergence problem. For linear response, the most efficient O(N) first-principles method is found to be the LDM method with Chebyshev expansion for time integration. For off-resonant response (including nonlinear properties) at a specific frequency, the frequency-domain methods with iterative solvers are quite efficient and thus practical. For nonlinear response, both on-resonance and off-resonance, the time-domain methods can be used; however, as the time-domain first-principles methods are quite expensive, time-domain O(N) semi-empirical methods are often the practical choice. Compared to the O(N) frequency-domain methods, the O(N) time-domain methods for excited states are much more mature and numerically stable, and have been applied widely to investigate the dynamics of complex molecular systems.

9.
In many sequence data mining applications, the goal is to find frequent substrings. Some of these applications, such as extracting motifs in protein and DNA sequences, look for frequently occurring approximate contiguous substrings called simple motifs. By approximate we mean that some mismatches are allowed when substrings are tested for similarity, which helps to discover unknown patterns. Structured motifs in DNA sequences are frequent structured contiguous substrings that contain two or more simple motifs. Existing methods for finding simple motifs have problems such as low scalability, high execution time, no guarantee of finding all patterns, and low flexibility in adapting to other applications. Flame is the only algorithm that can find all unknown structured patterns in a dataset and has solved most of these problems, but its scalability for very large sequences is still weak. In this research a new approach named Next-Symbol-Array based Motif Discovery (NSAMD) is presented to improve scalability in extracting all unknown simple and structured patterns. To reach this goal a new data structure called Next-Symbol-Array has been introduced. This data structure changes how NSAMD finds patterns in comparison with Flame and helps to find structured motifs faster. The proposed algorithm is as accurate as Flame and extracts all existing patterns in a dataset. Performance comparisons show that NSAMD outperforms Flame in extracting structured motifs in both execution time (51% faster) and memory usage (reduced by more than 99%). The proposed algorithm is slower in extracting simple motifs, but the considerable improvement in memory usage (more than 99%) makes NSAMD more scalable than Flame. This advantage of NSAMD is very important in biological applications that involve very large sequences.
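To make the notion of an ‘approximate’ occurrence concrete, the following brute-force sketch lists the positions where a candidate motif matches a sequence with at most a given number of mismatches. It only illustrates the matching criterion; it is neither Flame nor NSAMD, and the names `hamming` and `approx_occurrences` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def approx_occurrences(sequence, motif, max_mismatches):
    """Start positions where motif occurs in sequence with <= max_mismatches."""
    m = len(motif)
    return [
        i
        for i in range(len(sequence) - m + 1)
        if hamming(sequence[i:i + m], motif) <= max_mismatches
    ]

# Exact matching finds two occurrences; allowing one mismatch adds "ACGA" at 4.
print(approx_occurrences("ACGTACGAACGT", "ACGT", 0))  # [0, 8]
print(approx_occurrences("ACGTACGAACGT", "ACGT", 1))  # [0, 4, 8]
```

A scan like this must be repeated for every candidate motif, which is why dedicated index structures matter for very large sequences.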

10.
In order to apply ab initio wave-function-based correlation methods to metals, it is desirable to split the calculation into a mean-field part and a correlation part. Whereas the mean-field part (here Hartree-Fock) is performed in the extended periodic system, it is necessary to use for the correlation part local wave-function-based correlation methods in finite fragments of the solid. For these finite entities it is necessary to construct an embedding. The authors suggest an embedding scheme which has itself no metallic character but can mimic the metal in the internal region, where the atoms are correlated. With this embedding it is also possible to localize the metallic orbitals in the central part. The long-range nonadditive contributions of metallicity and correlation are treated with the method of increments. In this paper they present different ways to construct such an embedding and discuss the influence of the embedding on the correlation energy of the solid.

11.
Clustering methods have been widely used to group together similar conformational states from molecular simulations of biomolecules in solution. For applications such as the interaction of a protein with a surface, the orientation of the protein relative to the surface is also an important clustering parameter because of its potential effect on adsorbed-state bioactivity. This study presents cluster analysis methods that are specifically designed for systems where both molecular orientation and conformation are important, and the methods are demonstrated using test cases of adsorbed proteins for validation. Additionally, because cluster analysis can be a very subjective process, an objective procedure for identifying both the optimal number of clusters and the best clustering algorithm to be applied to analyze a given dataset is presented. The method is demonstrated for several agglomerative hierarchical clustering algorithms used in conjunction with three cluster validation techniques. © 2016 Wiley Periodicals, Inc.

12.
A method that uses the abundances of large clusters formed in electrospray ionization to determine the solution-phase molar fractions of amino acids in multi-component mixtures is demonstrated. For solutions containing either four or 10 amino acids, the relative abundances of protonated molecules differed from their solution-phase molar fractions by up to 30-fold and 100-fold, respectively. For the four-component mixtures, the molar fractions determined from the abundances of larger clusters consisting of 19 or more molecules were within 25% of the solution-phase molar fractions, indicating that the abundances and compositions of these clusters reflect the relative concentrations of these amino acids in solution, and that ionization and detection biases are significantly reduced. Lower accuracy was obtained for the 10-component mixtures where values determined from the cluster abundances were typically within a factor of three of their solution molar fractions. The lower accuracy of this method with the more complex mixtures may be due to specific clustering effects owing to the heterogeneity as a result of significantly different physical properties of the components, or it may be the result of lower S/N for the more heterogeneous clusters and not including the low-abundance more highly heterogeneous clusters in this analysis. Although not as accurate as using traditional standards, this clustering method may find applications when suitable standards are not readily available.

13.
Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised classification algorithm that has been widely used in many areas because of its simplicity and its ability to deal with hidden clusters of different sizes and shapes and with noise. However, the computational cost of the distance table and the instability in detecting the boundaries of adjacent clusters limit the application of the original algorithm to large datasets such as images. In this paper, the DBSCAN algorithm was revised and improved for image clustering and segmentation. The proposed clustering algorithm presents two major advantages over the original one. Firstly, the revised DBSCAN algorithm is applicable to large 3D image datasets (often with millions of pixels) because it uses the coordinate system of the image data. Secondly, the revised algorithm solves the instability of boundary detection in the original DBSCAN. For broader applications, the image dataset can be an ordinary 3D image or, more generally, the classification result of another type of image data, e.g. a multivariate image.
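For orientation, here is a minimal sketch of the original DBSCAN procedure with a brute-force neighbour search. It is not the revised image algorithm described above: the function name `dbscan`, the `NOISE` label and the use of Euclidean distance are assumptions, and the all-pairs neighbour search is exactly the cost that becomes prohibitive for images with millions of pixels.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE (brute-force neighbours)."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:       # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels
```

On a regular pixel grid the neighbours of a point can be read off from its coordinates instead of being searched for, which is presumably the kind of saving the revised algorithm exploits.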

14.
The method of conserved core substructure matching (CSM) for the overlay of protein-ligand complexes is described. The method relies upon distance geometry to align structurally similar substructures without regard to sequence similarity onto substructures from a reference protein empirically selected to include key determinants of binding site location and geometry. The error in ligand position is reduced in reoriented ensembles generated with CSM when compared to other overlay methods. Since CSM can only succeed when the selected core substructure is geometrically conserved, misalignments only rarely occur. The method may be applied to reliably overlay large numbers of protein-ligand complexes in a way that optimizes ligand position at a specific binding site or subsite or to align structures from large and diverse protein families where the conserved binding site is localized to only a small portion of either protein. Core substructures may be complex and must be chosen with care. We have created a database of empirically selected core substructures to demonstrate the utility of CSM alignment of ligand binding sites in important drug targets. A Web-based interface can be used to apply CSM to align large collections of protein-ligand complexes for use in drug design using these substructures or to evaluate the use of alternative core substructures that may then be shared with the larger user community. Examples show the benefit of CSM in the practice of structure-based drug design.

15.
Within the harmonic approximation to transition state theory, the biggest challenge involved in finding the mechanism or rate of transitions is the location of the relevant saddle points on the multidimensional potential energy surface. The saddle point search is particularly challenging when the final state of the transition is not specified. In this article we report on a comparison of several methods for locating saddle points under these conditions and compare, in particular, the well-established rational function optimization (RFO) methods using either exact or approximate Hessians with the more recently proposed minimum mode following methods, where only the minimum eigenvalue mode is found, either by the dimer or the Lanczos method. A test problem involving transitions in a seven-atom Pt island on a Pt(111) surface using a simple Morse pairwise potential function is used, and the number of degrees of freedom is varied by varying the number of movable atoms. In the full system, 175 atoms can move, so 525 degrees of freedom need to be optimized to find the saddle points. For testing purposes, we have also restricted the number of movable atoms to 7 and 1. Our results indicate that if attempting to make a map of all relevant saddle points for a large system (as would be necessary when simulating the long time scale evolution of a thermal system), the minimum mode following methods are preferred. The minimum mode following methods are also more efficient when searching for the lowest saddle points in a large system, and if the force can be obtained cheaply. However, if only the lowest saddle points are sought and the calculation of the force is expensive but a good approximation for the Hessian at the starting position of the search can be obtained at low cost, then the RFO approaches employing an approximate Hessian represent the preferred choice. For small and medium-sized systems where the force is expensive to calculate, the RFO approaches employing an approximate Hessian are also the more efficient, but when the force and Hessian can be obtained cheaply and only the lowest saddle points are sought, the RFO approach using an exact Hessian is the better choice. These conclusions have been reached based on a comparison of the total computational effort needed to find the saddle points and the number of saddle points found for each of the methods. The RFO methods do not perform very well with respect to the latter aspect, but starting the searches further away from the initial minimum or using the hybrid RFO version presented here improves this behavior considerably in most cases.

16.
Consider the network of all secondary structures of a given RNA sequence, where nodes are connected when the corresponding structures have base pair distance one. The expected degree of the network is the average number of neighbors, where the average may be computed with respect to either the uniform or the Boltzmann probability. Here, we describe the first algorithm, RNAexpNumNbors, that can compute the expected number of neighbors, or expected network degree, of an input sequence. For RNA sequences from the Rfam database, the expected degree is significantly less than the degree of the constrained minimum free energy structure, defined to have minimum free energy (MFE) over all structures consistent with the Rfam consensus structure. The expected degree of structural RNAs, such as purine riboswitches, paradoxically appears to be smaller than that of random RNA, yet the difference between the degree of the MFE structure and the expected degree is larger than that of random RNA. Expected degree does not seem to correlate with standard structural diversity measures of RNA, such as positional entropy and ensemble defect. The program RNAexpNumNbors is written in C, runs in cubic time and quadratic space, and is publicly available at http://bioinformatics.bc.edu/clotelab/RNAexpNumNbors . © 2014 Wiley Periodicals, Inc.
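The neighbor relation used above can be made concrete with a short sketch. It assumes structures are written in dot-bracket notation without pseudoknots and takes base pair distance to be the number of base pairs present in exactly one of the two structures. The function names `base_pairs` and `base_pair_distance` are hypothetical, and this is not the RNAexpNumNbors algorithm itself.

```python
def base_pairs(dot_bracket):
    """Set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)                 # remember the opening position
        elif ch == ")":
            pairs.add((stack.pop(), i))     # close the most recent open bracket
    return pairs

def base_pair_distance(s, t):
    """Number of base pairs that occur in exactly one of the two structures."""
    return len(base_pairs(s) ^ base_pairs(t))

# Removing the inner pair of "((..))" gives a neighbour at distance one.
print(base_pair_distance("((..))", "(....)"))  # 1
print(base_pair_distance("((..))", "......"))  # 2
```

Two structures are adjacent in the network exactly when this distance is one, that is, when one can be obtained from the other by adding or removing a single base pair.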

17.
An accurate first-principles treatment of chemical reactions for large systems remains a significant challenge facing electronic structure theory. Hybrid models, such as quantum mechanics:molecular mechanics (QM:MM) and quantum mechanics:quantum mechanics (QM:QM) schemes, provide a promising avenue for such studies. For many chemistries, including important reactions in materials science, molecular mechanics or semiempirical methods may not be appropriate, or parameters may not be available (e.g., surface chemistry of compound semiconductors such as indium phosphide or catalytic chemistry of transition metal oxides). In such cases, QM:QM schemes are of particular interest. In this work, a QM:QM electronic embedding model within the ONIOM (our own N-layer integrated molecular orbital molecular mechanics) extrapolation framework is presented. To define the embedding potential, we choose the real-system low-level Mulliken atomic charges. This results in a set of well-defined and unique embedding charges. However, the parametric dependence of the charges on molecular geometry complicates the energy gradient that is necessary for the efficient exploration of potential energy surfaces. We derive an efficient form for the forces where a single set of self-consistent field response equations is solved. Initial tests of the method and key algorithmic issues are discussed.

18.
Hierarchical clustering is the most often used method for grouping similar patterns of gene expression data. A fundamental problem with existing implementations of this clustering method is their inability to handle large data sets within reasonable time and memory limits. We propose a parallelized hierarchical clustering algorithm to solve this problem. Our implementation on a multiple instruction multiple data (MIMD) architecture shows a considerable reduction in computational time and inter-node communication overhead, especially for large data sets. We use the standard message-passing library, Message Passing Interface (MPI), so the implementation is suitable for any MIMD system.
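As a point of reference for the scaling problem described above, here is a naive serial agglomerative clustering sketch. The linkage rule (single linkage), the Euclidean metric and the function name `single_linkage` are assumptions, since the abstract does not specify them, and no MPI parallelisation is shown.

```python
from math import dist

def single_linkage(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]   # start with singletons
    while len(clusters) > n_clusters:
        best = None
        # Scan every pair of clusters for the smallest inter-cluster distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))        # b > a, so index a stays valid
    return clusters
```

The repeated scan over all cluster pairs dominates the run time of this naive version, which gives a sense of why large expression data sets call for a parallel implementation.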
