Similar articles
20 similar articles found (search time: 906 ms)
1.
We demonstrate the effectiveness of an economical scheme that uses numerical basis sets in computations with SIESTA. In the economical basis sets demonstrated, high-level double-zeta plus polarization (DZP) basis sets are applied only to strongly electronegative atoms and metal atoms, while a double-zeta (DZ) basis is applied to the remaining atoms of small proton-bound carboxylic acid clusters and sodium–organic compounds. These basis sets predict geometric structures very close to those obtained using DZP for all atoms, while saving about 30–50% of the CPU time required by the larger basis sets. This study provides a general guideline for basis set selection in SIESTA computations of large systems.

Acknowledgement. The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (project no. CityU 1033/00P).
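As an illustration of the kind of per-species basis assignment this guideline implies, the snippet below emits a SIESTA input fragment in which only the electronegative and metal species receive DZP. The `PAO.BasisSizes` block name and the species list are assumptions for illustration, not taken from the paper:

```python
# Sketch: emit a SIESTA .fdf fragment assigning DZP only to the
# electronegative/metal species and DZ to the rest (assumed species list).
dzp_species = ["O", "Na"]        # strongly electronegative atoms and metals
dz_species = ["C", "H"]          # remaining atoms get the cheaper DZ basis

lines = ["%block PAO.BasisSizes"]
lines += [f"  {s:<3} DZP" for s in dzp_species]
lines += [f"  {s:<3} DZ" for s in dz_species]
lines.append("%endblock PAO.BasisSizes")
print("\n".join(lines))
```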

2.
This paper develops a multi-parturition genetic algorithm (MPGA) for geometrically bounding the overlapped clusters in a data set for the classification of chemical data. Two new operators, namely multi-parturition, and decimation with oriented creation, have been introduced to modify the conventional genetic algorithm, improving the linear classification results and reducing the computational time. To circumvent the difficulty commonly encountered with linearly inseparable chemical data sets, the optimized linear classifier is further modified into a complementary nonlinear classifier: the space regions of the overlapped clusters are bounded by erecting half-hyperellipsoids over the linearly misclassified patterns. The proposed MPGA was applied to classify a number of chemical and other data sets with dimensionality from 4 to 14. Experimental results indicate that the proposed MPGA can classify seriously overlapped data sets with an acceptable error rate.
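The paper's multi-parturition and decimation operators are not reproduced here, but the core task, evolving a linear classifier with a genetic algorithm, can be sketched in a few lines. A minimal sketch on synthetic two-class data, with plain truncation selection and Gaussian mutation standing in for the MPGA operators:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

def error_rate(w):
    # w = [weights..., bias]; fraction of misclassified patterns
    pred = (X @ w[:-1] + w[-1] > 0).astype(int)
    return np.mean(pred != y)

pop = rng.normal(size=(40, 5))                  # population of hyperplanes
for gen in range(200):
    fitness = np.array([error_rate(w) for w in pop])
    parents = pop[np.argsort(fitness)[:20]]     # truncation selection
    children = parents + rng.normal(0, 0.3, parents.shape)  # mutation
    pop = np.vstack([parents, children])

best = pop[np.argmin([error_rate(w) for w in pop])]
print("training error:", error_rate(best))
```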

3.
Analytica Chimica Acta, 2004, 515(1): 87–100
The goal of the present work is to analyse the effect of non-informative variables (NIV) in a data set when applying cluster analysis, and to propose a method capable of detecting and removing these variables. The proposed method uses a genetic algorithm to select the variables that make the presence of groups in the data clear. The procedure has been implemented for use with k-means, using the cluster silhouettes as the fitness function of the genetic algorithm. The main problem that can appear when applying the method to real data is that, in general, the true cluster structure (number and composition of the groups) is not known a priori. The work explores the evolution of the silhouette values computed from the clusters built with k-means when non-informative variables are added to the original data, both for a literature data set and for simulated data of higher dimension. The procedure has also been applied to real data sets.
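A minimal sketch of the silhouette-as-fitness idea, assuming a simple bit-mask genetic algorithm rather than the authors' implementation: variable masks are evolved, and each mask is scored by the average silhouette of a k-means clustering restricted to the selected variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
X[:75, :3] += 4.0               # the groups live in the first 3 variables;
                                # the other 7 columns are non-informative

def fitness(mask, k=2):
    if mask.sum() < 1:
        return -1.0
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

pop = rng.random((30, 10)) < 0.5                   # population of bit masks
for gen in range(25):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-15:]]        # keep the best masks
    flips = rng.random(parents.shape) < 0.1        # bit-flip mutation
    pop = np.vstack([parents, parents ^ flips])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected variables:", np.flatnonzero(best))
```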

4.
The process of drug discovery is a complex and high-risk endeavor that requires focused attention on experimental hypotheses and the application of diverse sets of technologies and data to facilitate high-quality decision-making, all aimed at enhancing the quality of the chemical development candidate(s) through clinical evaluation and into the market. In support of the lead generation and optimization phases, high-throughput technologies such as combinatorial/high-throughput synthesis and high-throughput and ultra-high-throughput screening have allowed the rapid generation and analysis of large numbers of compounds and data. Today, for every analog synthesized, 100 or more data points can be collected and captured in centralized databases, and the analysis of thousands of compounds can very quickly become a daunting task. In this article we present the process we have developed for analyzing and prioritizing large data sets, starting from diversity and focused uHTS in support of lead generation, through secondary screens supporting lead optimization. We describe how we use informatics and computational chemistry to focus our efforts on asking relevant questions about the desired attributes of a specific library, and subsequently to guide the generation of more information-rich sets of analogs in support of both processes.

5.
Cross-validation (CV) is a common approach for determining the optimal number of components in a principal component analysis model. To guarantee independence between model testing and calibration, an observation-wise k-fold operation is commonly implemented in each cross-validation step. This operation renders the CV algorithm computationally intensive, and it is the main limitation to applying CV to very large data sets. In this paper, we carry out an empirical and theoretical investigation of the use of this operation in the element-wise k-fold (ekf) algorithm, the state-of-the-art CV algorithm. We show that when very large data sets need to be cross-validated and the computational time is a matter of concern, the observation-wise k-fold operation can be skipped. The theoretical properties of the resulting modified algorithm, referred to as the column-wise k-fold (ckf) algorithm, are derived, and its performance is evaluated with several artificial and real data sets. We suggest the ckf algorithm as a valid alternative to the standard ekf for reducing the computational time needed to cross-validate a data set.
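A toy sketch of the ingredient the ckf algorithm exploits: the PCA loadings are fitted once on the full data (no observation-wise refit per fold), and a missing-column PRESS is then accumulated over column folds. This is an illustration under simplifying assumptions, not the authors' exact algorithm:

```python
import numpy as np

def ckf_press(X, n_components, n_folds=7, seed=0):
    """Missing-column PRESS for a PCA model; loadings fit once on all data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Fit the loadings a single time: this is the step that skips the
    # observation-wise k-fold refit of the standard ekf algorithm.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                      # p x a loading matrix
    press = 0.0
    for fold in np.array_split(rng.permutation(p), n_folds):
        P_in = np.delete(P, fold, axis=0)        # loadings of kept columns
        X_in = np.delete(Xc, fold, axis=1)
        T = X_in @ P_in @ np.linalg.inv(P_in.T @ P_in)  # least-squares scores
        press += np.sum((Xc[:, fold] - T @ P[fold].T) ** 2)
    return press

X = np.random.default_rng(1).normal(size=(200, 20))
print([round(ckf_press(X, a)) for a in (1, 2, 3)])   # PRESS vs. no. of PCs
```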

6.
Drug–target interaction (DTI) prediction is a challenging step in drug repositioning, drug discovery and drug design. The advent of high-throughput technologies has brought convenience to the development of DTI prediction methods, and with the generation of large numbers of data sets, many mathematical models and computational algorithms have been developed to identify potential drug–target pairs. However, most existing methods are based on single-view data. By integrating drug and target data from different views, we aim to obtain more stable and accurate predictions. In this paper, a multiview DTI prediction method based on clustering is proposed. We first introduce a model for single-view drug–target data, formulated as an optimization problem that identifies the clusters in both the drug similarity network and the target protein similarity network while making the clusters with more known DTIs connected together. The model is then extended to multiview network data by maximizing the consistency of the clusters in each view, and an approximation method is proposed to solve the optimization problem. We apply the proposed algorithms to two views of data. Comparisons with existing algorithms show that the multiview DTI prediction algorithm produces more accurate predictions. For the considered data set, we finally predict 54 possible DTIs. Similarity analysis of the drugs/targets and enrichment analysis of the DTIs and genes in each cluster show that the predicted DTIs have a high probability of being true.
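The paper's optimization model is only summarized in this abstract, so the sketch below illustrates the multiview intuition with a plainly named stand-in, not the authors' method: per-view similarity matrices are averaged into a consensus network, which is then clustered spectrally.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(2)
n_drugs = 30
views = []
for _ in range(2):                       # e.g., a chemical-structure view
    A = rng.random((n_drugs, n_drugs))   # and a side-effect view (toy data)
    views.append((A + A.T) / 2)          # symmetrize the similarities

S = np.mean(views, axis=0)               # simple consensus: average the views
labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(S)
print(labels)
```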

7.
8.
A hierarchical clustering algorithm, NIPALSTREE, was developed that is able to analyze large data sets in high-dimensional space; the result can be displayed as a dendrogram. At each tree level the algorithm projects the data set via principal component analysis onto one dimension, sorts the data along this dimension, and splits it at the median position. To avoid distorting clusters at the median, the algorithm identifies a potentially better-suited split point left or right of the median. The procedure is applied recursively to the resulting subsets until the maximal distance between cluster members exceeds a user-defined threshold. The approach was validated in a retrospective screening study for angiotensin-converting enzyme (ACE) inhibitors, in which the resulting clusters were assessed for their purity and enrichment in actives belonging to this ligand class; enrichment was observed in individual branches of the dendrogram. In further retrospective virtual screening studies employing the MDL Drug Data Report (MDDR), COBRA, and the SPECS catalog, NIPALSTREE was compared with the hierarchical k-means clustering approach. Results show that both algorithms can be used in the context of virtual screening, and intersecting the result lists obtained with both algorithms improved enrichment factors while losing only a few chemotypes.
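The splitting scheme described above translates almost directly into code. A minimal sketch, with the refined search for a split point near the median omitted:

```python
import numpy as np

def pca_tree(X, idx, threshold, clusters):
    # Stop when the cluster is tight (max pairwise distance below threshold)
    D = np.linalg.norm(X[idx, None, :] - X[None, idx, :], axis=-1)
    if D.max() <= threshold or len(idx) < 2:
        clusters.append(idx)
        return
    Xc = X[idx] - X[idx].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                  # projection onto the first PC
    order = np.argsort(scores)
    mid = len(idx) // 2                  # median split
    pca_tree(X, idx[order[:mid]], threshold, clusters)
    pca_tree(X, idx[order[mid:]], threshold, clusters)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 8)), rng.normal(6, 1, (40, 8))])
clusters = []
pca_tree(X, np.arange(len(X)), threshold=12.0, clusters=clusters)
print([len(c) for c in clusters])
```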

9.
The Sphere Exclusion algorithm is a well-known algorithm for selecting diverse subsets from chemical-compound libraries or collections. It can be applied with any given distance measure between two structures, and it is popular because of the intuitive geometrical interpretation of the method and its good performance on large data sets. This paper describes Directed Sphere Exclusion (DISE), a modification of the Sphere Exclusion algorithm that retains all of its positive properties but generates a more even distribution of the selected compounds in chemical space. In addition, the computational requirement is significantly reduced, so DISE can be applied to very large data sets.
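For reference, the baseline that DISE modifies can be sketched compactly; the "directed" ordering of candidates is omitted here and a random visiting order is used instead:

```python
import numpy as np

# Classic sphere exclusion: pick a candidate, exclude every compound within
# `radius` of it, and repeat until no candidates remain.
def sphere_exclusion(X, radius, rng=np.random.default_rng(4)):
    remaining = list(rng.permutation(len(X)))
    selected = []
    while remaining:
        pick = remaining.pop(0)
        selected.append(pick)
        # Drop everything inside the exclusion sphere around the pick
        d = np.linalg.norm(X[remaining] - X[pick], axis=1)
        remaining = [i for i, di in zip(remaining, d) if di > radius]
    return selected

X = np.random.default_rng(5).random((500, 6))   # toy descriptor vectors
subset = sphere_exclusion(X, radius=0.5)
print(len(subset), "diverse compounds selected")
```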

10.
Gene expression data are characterized by thousands or even tens of thousands of measured genes on only a few tissue samples, which can lead to overfitting and the curse of dimensionality, or even to complete failure in the analysis of microarray data. Gene selection is therefore an important component of gene expression-based tumor classification systems. In this paper, we develop a hybrid particle swarm optimization (PSO) and tabu search (HPSOTS) approach to gene selection for tumor classification. The incorporation of tabu search (TS) as a local improvement procedure enables HPSOTS to leap over local optima and show satisfactory performance. The proposed approach is applied to three different microarray data sets, and its performance is compared with that of stepwise selection and the pure TS and PSO algorithms. It is demonstrated that HPSOTS is a useful tool for gene selection and mining high-dimensional data.
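A minimal sketch of the PSO half of the hybrid on toy data, with a sigmoid-transfer binary particle update; the tabu-search local improvement that defines HPSOTS is omitted:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 200))                 # 60 samples x 200 genes (toy)
y = rng.integers(0, 2, 60)
X[y == 1, :5] += 2.0                           # first 5 genes carry signal

def fitness(mask):
    # Cross-validated accuracy with a mild penalty on subset size
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(3), X[:, mask], y, cv=3).mean()
    return acc - 0.001 * mask.sum()

n_particles, n_genes = 20, X.shape[1]
pos = rng.random((n_particles, n_genes)) < 0.1    # binary particles
vel = rng.normal(0, 0.1, (n_particles, n_genes))
pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmax()].copy()

for it in range(30):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = (0.7 * vel
           + 1.5 * r1 * (pbest.astype(float) - pos.astype(float))
           + 1.5 * r2 * (gbest.astype(float) - pos.astype(float)))
    pos = rng.random(vel.shape) < 1 / (1 + np.exp(-vel))   # sigmoid transfer
    f = np.array([fitness(p) for p in pos])
    better = f > pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[pbest_f.argmax()].copy()

print("selected genes:", np.flatnonzero(gbest))
```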

11.
12.
Serial analysis of gene expression (SAGE) is a powerful tool for obtaining gene expression profiles, and clustering is a valuable technique for analyzing SAGE data. In this paper, we propose an adaptive clustering method for SAGE data analysis, PoissonAPS, which incorporates the Affinity Propagation (AP) clustering algorithm. While AP has demonstrated good performance on many different data sets, it also faces several limitations. PoissonAPS overcomes these by using a clustering validation measure as the cost function for merging and splitting clusters, and as a result it can automatically cluster SAGE data without user-specified parameters. We evaluated PoissonAPS and compared its performance with other methods on several real-life SAGE data sets. The experimental results show that PoissonAPS produces meaningful and interpretable clusters for SAGE data.
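PoissonAPS itself is not reproduced here; the sketch below is a stand-in that captures one aspect, steering plain affinity propagation (via scikit-learn) with a cluster validation measure, here the silhouette over a small grid of preference values:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Toy count data with two expression levels, standing in for SAGE tags
X = np.vstack([rng.poisson(5, (40, 12)), rng.poisson(15, (40, 12))]).astype(float)

best = None
for pref in [-200, -100, -50, -10]:
    labels = AffinityPropagation(preference=pref, random_state=0).fit_predict(X)
    if len(set(labels)) < 2:
        continue
    score = silhouette_score(X, labels)     # validation measure steers choice
    if best is None or score > best[0]:
        best = (score, pref, labels)

print("preference %s -> silhouette %.2f, %d clusters"
      % (best[1], best[0], len(set(best[2]))))
```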

13.
Research into the advancement of computer-aided molecular design (CAMD) tends to focus on algorithm development, often to the detriment of the selection and analysis of the data sets used in algorithm validation. Here we highlight the potential problems this can cause in the context of druglikeness classification. More rigorous efforts are applied to the selection of decoy (nondruglike) molecules from the ACD, and comparisons are made between model performance using the standard technique of random test set creation and test sets derived from explicit ontological separation by drug class. The dangers of viewing druglike space as sufficiently coherent to permit simple classification are highlighted, and the issues inherent in applying unfiltered data and random test set selection to (Q)SAR models built on large and supposedly heterogeneous databases are discussed.

14.
We describe a method for performing trilinear analysis on large data sets using a modification of the PARAFAC-ALS algorithm. Our method iteratively decomposes the data matrix into a core matrix and three loading matrices based on the Tucker1 model. The algorithm is particularly useful for data sets that are too large to load into a computer's main memory. While the performance advantage of our algorithm depends on the number of data elements and the dimensions of the data array, we have seen a significant performance improvement over running PARAFAC-ALS on the full data set. In one case, for data comprising hyperspectral images from a confocal microscope, our method was approximately 60 times faster than operating on the full data set while obtaining essentially equivalent results.
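A compact sketch of the compression idea stated in the abstract, under simplifying assumptions: the large first mode is compressed with a truncated SVD (a Tucker1 step), a plain PARAFAC-ALS is run on the small core array, and the mode-1 loadings are expanded back. The authors' iterative out-of-core scheme differs in detail:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, n_iter=50, rng=np.random.default_rng(8)):
    # Plain 3-way PARAFAC-ALS (the routine the paper accelerates)
    A = [rng.normal(size=(s, rank)) for s in T.shape]
    for _ in range(n_iter):
        for m in range(3):
            o = [A[i] for i in range(3) if i != m]
            # Khatri-Rao product of the other two factor matrices
            kr = (o[0][:, None, :] * o[1][None, :, :]).reshape(-1, rank)
            A[m] = unfold(T, m) @ kr @ np.linalg.pinv(kr.T @ kr)
    return A

rng = np.random.default_rng(9)
T = rng.normal(size=(500, 20, 20))          # first mode is the large one
U, _, _ = np.linalg.svd(unfold(T, 0), full_matrices=False)
U = U[:, :10]                               # truncated Tucker1 basis, mode 1
core = np.tensordot(U.T, T, axes=(1, 0))    # compressed 10 x 20 x 20 array
A_core, B, C = cp_als(core, rank=3)
A_full = U @ A_core                         # expand mode-1 loadings back
print(A_full.shape, B.shape, C.shape)
```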

15.
An alternative strategy for finding the minimal-energy structures of nanoclusters is presented and implemented, and used to determine the structures of metallic clusters. It consists of an unbiased search with a global minimization algorithm, conformational space annealing. First, we find the minima of a many-body phenomenological potential to create a data bank of putative minima; this procedure ensures the generation of a set of cluster configurations of large diversity. Next, the clusters in this data bank are relaxed by ab initio techniques to obtain their energies and geometric structures. The scheme is successfully applied to magic-number 13-atom clusters of rhodium, palladium, and silver. We obtain minimal-energy cluster structures not previously reported, which differ from the phenomenological minima. Moreover, they are not always highly symmetric, casting some doubt on the customary biased search scheme of relaxing, with density functional theory, global minima chosen among high-symmetry structures obtained from phenomenological potentials.
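The first stage, building a bank of phenomenological minima, can be illustrated with a stand-in: basin hopping (not conformational space annealing) on a 13-atom Lennard-Jones cluster (not a many-body metal potential). The bank produced this way would then be re-relaxed ab initio:

```python
import numpy as np
from scipy.optimize import basinhopping

def lj_energy(x):
    # Total Lennard-Jones energy of the configuration (reduced units)
    pos = x.reshape(-1, 3)
    iu = np.triu_indices(len(pos), k=1)
    r = np.linalg.norm(pos[iu[0]] - pos[iu[1]], axis=1)
    return np.sum(4.0 * (r**-12 - r**-6))

rng = np.random.default_rng(10)
x0 = rng.uniform(-1.5, 1.5, 13 * 3)          # random 13-atom starting guess
result = basinhopping(lj_energy, x0, niter=200, seed=1)
# For reference, the known LJ13 global minimum is about -44.33 reduced units
print("putative LJ13 minimum energy:", result.fun)
```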

16.
The Interval Correlation Optimised Shifting algorithm (icoshift) was recently introduced for the alignment of nuclear magnetic resonance spectra. The method is based on an insertion/deletion model that shifts intervals of spectra/chromatograms, and relies on an efficient Fast Fourier Transform-based computation core that allows the alignment of large data sets in a few seconds on a standard personal computer. The potential of this programme for the alignment of chromatographic data is outlined, with focus on the model used for the correction function. The efficacy of the algorithm is demonstrated on a chromatographic data set of 45 chromatograms of 64,000 data points each. Computation time is significantly reduced compared to the Correlation Optimised Warping (COW) algorithm, which is widely used for the alignment of chromatographic signals. Moreover, icoshift proved to perform better than COW in terms of quality of the alignment (viz. simplicity and peak factor), without the need for the computationally expensive optimisation of the warping meta-parameters required by COW. Principal component analysis (PCA) is used to show how a significant reduction in data complexity was achieved, improving the ability to highlight chemical differences among the samples.
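The FFT computation core reduces, for each interval, to picking the lag that maximizes a circular cross-correlation. A minimal sketch of that kernel; the interval bookkeeping and insertion/deletion model of icoshift are not reproduced:

```python
import numpy as np

def best_shift(segment, target, max_shift):
    n = len(segment)
    # Circular cross-correlation via FFT: ifft(fft(target) * conj(fft(segment)))
    xc = np.fft.ifft(np.fft.fft(target) * np.conj(np.fft.fft(segment))).real
    lags = np.concatenate([np.arange(n // 2), np.arange(-(n - n // 2), 0)])
    ok = np.abs(lags) <= max_shift          # restrict to allowed shifts
    return lags[ok][np.argmax(xc[ok])]

t = np.linspace(0, 1, 1000)
target = np.exp(-((t - 0.50) / 0.01) ** 2)      # reference peak
sample = np.exp(-((t - 0.53) / 0.01) ** 2)      # same peak, shifted
shift = best_shift(sample, target, max_shift=100)
print("apply shift of", shift, "points")         # about -30 here
aligned = np.roll(sample, shift)
```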

17.
Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised classification algorithm that has been widely used in many areas owing to its simplicity and its ability to deal with hidden clusters of different sizes and shapes and with noise. However, the computational cost of the distance table and the instability in detecting the boundaries of adjacent clusters limit the application of the original algorithm to large data sets such as images. In this paper, the DBSCAN algorithm is revised and improved for image clustering and segmentation. The proposed algorithm presents two major advantages over the original one. First, by using the coordinate system of the image data, the revised DBSCAN algorithm becomes applicable to large 3D image data sets (often comprising millions of pixels). Second, the revised algorithm solves the boundary-detection instability of the original DBSCAN. For broader applications, the image data set can be an ordinary 3D image or, in general, the classification result of another type of image data, e.g. a multivariate image.
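The key point, that on a pixel grid the neighbours of a point are fixed coordinate offsets rather than entries of a distance table, can be sketched directly. A 2D toy version with a simple intensity criterion (the paper addresses 3D images, and its boundary handling differs):

```python
import numpy as np
from collections import deque

def grid_dbscan(img, intensity_tol, min_pts=5):
    labels = np.full(img.shape, -1)           # -1 = noise / unvisited
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0)]
    def neighbours(i, j):
        # Epsilon-neighbourhood = fixed grid offsets + intensity closeness
        return [(i + di, j + dj) for di, dj in offsets
                if 0 <= i + di < img.shape[0] and 0 <= j + dj < img.shape[1]
                and abs(img[i + di, j + dj] - img[i, j]) <= intensity_tol]
    cluster = 0
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            if labels[i, j] != -1 or len(neighbours(i, j)) < min_pts:
                continue
            labels[i, j] = cluster            # new core pixel: grow a cluster
            queue = deque(neighbours(i, j))
            while queue:
                p = queue.popleft()
                if labels[p] == -1:
                    labels[p] = cluster
                    nb = neighbours(*p)
                    if len(nb) >= min_pts:    # expand only from core pixels
                        queue.extend(nb)
            cluster += 1
    return labels

img = np.zeros((40, 40)); img[5:15, 5:15] = 1.0; img[25:35, 20:35] = 2.0
labels = grid_dbscan(img, intensity_tol=0.1)
print(len(np.unique(labels[labels >= 0])), "clusters found")
```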

18.
In pharmaceutical research, collections of active compounds directed against specific therapeutic targets usually evolve over time. Small-molecule discovery is an iterative process: new compounds are discovered, alternative compound series are explored, some series are discontinued, and others are prioritized. The design of new compounds usually takes prior chemical and structure-activity relationship (SAR) knowledge into consideration. Hence, historically grown compound collections represent a viable source of chemical and SAR information that can be used to retrospectively analyze roadblocks in compound optimization and to further guide discovery projects. However, SAR analysis of large and heterogeneous sets of active compounds is inherently complicated. We have subjected evolving compound data sets to SAR monitoring using activity landscape models in order to evaluate how their composition and SAR characteristics change over time. Chemotype and potency distributions in evolving data sets directed against different therapeutic targets were analyzed, and alternative activity landscape representations were generated at different points in time to monitor the progression of global and local SAR features. Our results show that the evolving data sets studied here grew predominantly around seed clusters of active compounds that often emerged early on, while other SAR islands remained largely unexplored. Moreover, increasing scaffold diversity in evolving data sets did not necessarily yield new SAR patterns, indicating a rather significant influence of "me-too-ism" (i.e., introducing new chemotypes that are similar to already known ones) on the composition and SAR information content of the data sets.

19.
Chinese Journal of Structural Chemistry, 2020, 39(7): 1185–1193
Atomic clusters of subnanometer scale and variable chemical composition offer great opportunities for the rational design of functional nanomaterials. Among them, cage clusters doped with an endohedral atom are particularly interesting owing to their enhanced stability and highly tunable physical and chemical properties. In this perspective, we first give a brief overview of the history of doped cage clusters and introduce the home-developed comprehensive genetic algorithm (CGA) for the structure prediction of clusters. Then, we show a few examples of magnetic clusters and subnanometer catalysts based on doped cage clusters, which have been computationally revealed or designed with the CGA code. Finally, we give an outlook on some future directions of cluster science.

20.