Similar Documents
 20 similar documents found (search time: 31 ms)
1.
The dimensions of databases can be defined based on a variety of concepts, ranging from the standard tools of principal component analysis to context-biased approaches. The effective dimensions of databases, in particular the effective dimensions involving continua such as electron density data, provide a set of important tools for database comparisons and for the evaluation of some aspects of database quality. The problems associated with database comparisons and database mergers, such as those occurring in the process of database unification in the actual merger of two pharmaceutical companies, provide challenging tasks and opportunities for data science. Some of the tools for effective dimension reduction and dimension expansion are reviewed in the context of database quality control and conditions for database compatibility are presented. A common misconception affecting data sampling techniques for data quality evaluation is discussed and methods for circumventing the associated sampling errors are described.
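The paper does not give a formula for its effective dimension, but the PCA-based variant it starts from can be sketched: count how many principal components are needed to capture a chosen fraction of the total variance. The threshold and the toy data below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def effective_dimension(X, variance_threshold=0.95):
    """Estimate the effective dimension of a data matrix X (rows = records,
    columns = descriptors) as the number of principal components needed to
    explain `variance_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)                   # center each descriptor column
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values -> PC variances
    explained = s**2 / np.sum(s**2)           # explained-variance ratios
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, variance_threshold) + 1)

# synthetic database whose records lie in a 2-D subspace of a 5-D descriptor space
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
basis = np.array([[1.0, 0, 0, 0, 0],
                  [0, 1.0, 0, 0, 0]])
X = Z @ basis
print(effective_dimension(X))  # → 2
```

Comparing this number before and after a database merger is one cheap compatibility check in the spirit of the review.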

2.
Several procedures are available for simulating and optimising separations in ion chromatography (IC), based on the application of retention models to an extensive database of analyte retention times on a wide range of columns. These procedures are subject to errors arising from batch-to-batch variability in the synthesis of stationary phases, or when using a column having a different diameter to that used when the database was acquired originally. Approaches are described in which the retention database can be recalibrated to accommodate changes in the stationary phase (ion-exchange selectivity coefficient and ion-exchange capacity) or in the column diameter which lead to changes in phase ratio. The entire database can be recalibrated for all analytes on a particular column by performing three isocratic separations with two analyte ions. The retention data so obtained are then used to derive a "porting" equation which is employed to generate the required simulated separation. Accurate prediction of retention times is demonstrated for both anions and cations on 2 mm and 0.4 mm diameter columns under elution conditions which consist of up to five sequential isocratic or linear gradient elution steps. The proposed approach gives average errors in retention time prediction of less than 3% and a correlation coefficient of 0.9849 between predicted and observed retention times for 344 data points comprising 33 anionic or cationic analytes, 5 column internal diameters and 8 complex elution profiles.
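The abstract does not state the functional form of the porting equation, so as a hedged illustration one can assume a linear map between database and new-column retention (in log k), fitted from the few calibration runs and then applied to every analyte in the database:

```python
import numpy as np

def fit_porting(logk_db, logk_measured):
    """Fit logk_new = alpha + beta * logk_db by least squares from a small
    calibration set. The linear form is an assumption for illustration,
    not the paper's actual porting equation."""
    A = np.vstack([np.ones_like(logk_db), logk_db]).T
    (alpha, beta), *_ = np.linalg.lstsq(A, logk_measured, rcond=None)
    return alpha, beta

def port(logk_db, alpha, beta):
    """Apply the fitted porting relation to any database log k values."""
    return alpha + beta * logk_db

# synthetic example: the new column shifts every log k by 0.30 and scales by 1.05
logk_db = np.array([0.2, 0.8, 1.4])    # database values for calibration ions
logk_meas = 0.30 + 1.05 * logk_db      # values measured on the new column
alpha, beta = fit_porting(logk_db, logk_meas)
print(port(np.array([1.0]), alpha, beta))  # ≈ [1.35]
```

Once alpha and beta are known, the whole retention database is "ported" without remeasuring each analyte, which is the economy the paper exploits.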

3.
With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable and large-scale protein annotation and biological knowledge discovery. The Protein Information Resource (PIR) provides an integrated public resource of protein informatics to support genomic and proteomic research. PIR produces the Protein Sequence Database of functionally annotated protein sequences. The annotation problems are addressed by a classification-driven and rule-based method with evidence attribution, coupled with an integrated knowledge base system being developed. The approach allows sensitive identification, consistent and rich annotation, and systematic detection of annotation errors, as well as distinction of experimentally verified and computationally predicted features. The knowledge base consists of two new databases, sequence analysis tools, and graphical interfaces. PIR-NREF, a non-redundant reference database, provides a timely and comprehensive collection of all protein sequences, totaling more than 1,000,000 entries. iProClass, an integrated database of protein family, function, and structure information, provides extensive value-added features for about 830,000 proteins with rich links to over 50 molecular databases. This paper describes our approach to protein functional annotation with case studies and examines common identification errors. It also illustrates that data integration in PIR supports exploration of protein relationships and may reveal protein functional associations beyond sequence homology.

4.
Guided data capture software (GDC) is described for mass-scale abstraction from the literature of experimental thermophysical and thermochemical property data for organic chemical systems involving one, two, and three components, chemical reactions, and chemical equilibria. Property values are captured with a strictly hierarchical system based upon rigorous application of the thermodynamic constraints of the Gibbs phase rule with full traceability to source documents. Key features of the program and its adherence to scientific principles are described with particular emphasis on data-quality issues, both in terms of data accuracy and database integrity.

5.
The performance of the algorithm COMPLX for detecting protein-ligand or other macromolecular complexes has been tested for highly complex data sets. These data contain m/z values for ions of proteins of the SWISS-PROT database within simulated biological mixtures where each component shares a similar molecular weight and/or isoelectric point (pI). As many as 1600 ion signals were entered to challenge the algorithm to identify ion signals associated with a single protein complex that has been ionised and detected within a mass spectrometer. Despite the complexity of such data sets, the algorithm is shown to be able to identify the presence of individual bimolecular complexes. The output data can be re-evaluated by the user as necessary in light of any additional information that is known concerning the nature of predicted associations, as well as the quality of the data set in terms of errors in m/z values as a direct consequence of the mass calibration or resolution achieved. The data presented illustrate that the best results are obtained when output results are ranked according to the largest continuous series of ion pairs detected for a protein or macromolecule and its complex for which the ligand mass is assigned the lowest mass error.
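The core pairing step can be sketched independently of COMPLX itself: for a candidate ligand mass, search the signal list for pairs whose mass difference matches it within a tolerance. The singly-charged-ion simplification and all numbers below are assumptions for illustration, not COMPLX's actual scoring.

```python
def find_complex_pairs(mz_values, ligand_mass, tol=0.5):
    """Return (protein_mz, complex_mz) pairs whose mass difference matches
    ligand_mass within tol. Simplified sketch: singly charged ions assumed,
    no ranking by continuous ion series as the real algorithm does."""
    mz_sorted = sorted(mz_values)
    pairs = []
    for m in mz_sorted:
        target = m + ligand_mass          # expected m/z of the complex
        for m2 in mz_sorted:              # linear scan is fine for a sketch
            if abs(m2 - target) <= tol:
                pairs.append((m, m2))
    return pairs

# hypothetical signal list: one protein (12360.1) and its ligand adduct
signals = [12360.1, 12705.3, 15000.0, 16240.2]
print(find_complex_pairs(signals, 345.2, tol=0.5))
# → [(12360.1, 12705.3)]
```

The abstract's ranking criterion (longest continuous series of such pairs with lowest ligand mass error) would be applied on top of this candidate list.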

6.
This paper develops two new hybrid meta exchange-correlation functionals for thermochemistry, thermochemical kinetics, and nonbonded interactions. The new functionals are called PW6B95 (6-parameter functional based on Perdew-Wang-91 exchange and Becke-95 correlation) and PWB6K (6-parameter functional for kinetics based on Perdew-Wang-91 exchange and Becke-95 correlation). The resulting methods were comparatively assessed against the MGAE109/3 main group atomization energy database, against the IP13/3 ionization potential database, against the EA13/3 electron affinity database, against the HTBH38/4 and NHTBH38/04 hydrogen-transfer and non-hydrogen-transfer barrier height databases, against the HB6/04 hydrogen bonding database, against the CT7/04 charge-transfer complex database, against the DI6/04 dipole interaction database, against the WI7/05 weak interaction database, and against the new PPS5/05 pi-pi stacking interaction database. From the assessment and comparison of methods, we draw the following conclusions, based on an analysis of mean unsigned errors: (i) The PW6B95, MPW1B95, B98, B97-1, and TPSS1KCIS methods give the best results for a combination of thermochemistry and nonbonded interactions. (ii) PWB6K, MPWB1K, BB1K, MPW1K, and MPW1B95 give the best results for a combination of thermochemical kinetics and nonbonded interactions. (iii) PWB6K outperforms the MP2 method for nonbonded interactions. (iv) PW6B95 gives errors for main group covalent bond energies that are only 0.41 kcal/mol (as measured by mean unsigned error per bond (MUEPB) for the MGAE109 database), as compared to 0.56 kcal/mol for the second best method and 0.92 kcal/mol for B3LYP.

7.
Collecting, organizing, and reviewing the chemical information associated with screening hits is time-consuming for humans. The task depends highly on the individual, and human errors may result in missed leads or wasted resources. To overcome these hurdles, we have developed a decision support system, Hits Analysis Database (HAD). HAD is a software tool that automatically generates an ISIS database file containing compound structures, biological activities, calculated properties such as clogP, hazard fragment labels, structure classifications, etc. All data are processed by available software and packed into a single SD file. In addition to search capabilities, HAD provides an overview of structural classes and associated activity statistics. Chemical structures can be organized by maximum common substructure clustering. The ease of use and customized features make HAD a key tool in the lead selection process.

8.
For many years, MP2 served as the principal method for the treatment of noncovalent interactions. Until recently, this was the only technique that could be used to produce reasonably accurate binding energies, with binding energy errors generally below ~35%, at a reasonable computational cost. The past decade has seen the development of many new methods with improved performance for noncovalent interactions, several of which are based on MP2. Here, we assess the performance of MP2, LMP2, MP2-F12, and LMP2-F12, as well as spin component scaled variants (SCS) of these methods, in terms of their abilities to produce accurate interaction energies for binding motifs commonly found in organic and biomolecular systems. Reference data from the newly developed S66 database of interaction energies are used for this assessment, and a further set of 38 complexes is used as a test set for SCS methods developed herein. The strongly basis set-dependent nature of MP2 is confirmed in this study, with the SCS technique greatly reducing this behavior. It is found in this work that the spin component scaling technique can effectively be used to dramatically improve the performance of MP2 and MP2 variants, with overall errors being reduced by factors of about 1.5-2. SCS versions of all MP2 variants tested here are shown to give similarly accurate overall results.
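Spin component scaling itself is a one-line correction: the opposite-spin and same-spin parts of the MP2 correlation energy are scaled separately. The defaults below are Grimme's original SCS-MP2 factors; the S66-refitted coefficients developed in the paper would differ, and the energy components are hypothetical numbers for illustration.

```python
def scs_mp2(e_os, e_ss, c_os=1.2, c_ss=1.0 / 3.0):
    """Spin-component-scaled MP2 correlation energy.
    e_os, e_ss: opposite-spin and same-spin MP2 correlation components.
    Defaults are Grimme's original scaling factors, not the paper's refit."""
    return c_os * e_os + c_ss * e_ss

# hypothetical MP2 components for a noncovalent complex, in hartree
e_corr = scs_mp2(-0.300, -0.090)
print(round(e_corr, 4))  # → -0.39
```

Refitting c_os and c_ss against a benchmark set such as S66 is exactly the kind of SCS-variant development the abstract describes.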

9.
Weng Mouyi, Wang Zhi, Qian Guoyu, Ye Yaokun, Chen Zhefeng, Chen Xin, Zheng Shisheng, Pan Feng. Science China Chemistry, 2019, 62(8): 982-986
Material identification techniques are crucial to the development of structural chemistry and the materials genome project. Current methods are promising candidates to identify structures effectively, but have limited ability to deal with all structures accurately and automatically in big materials databases, because different material sources and various measurement errors lead to variation in bond lengths and bond angles. To address this issue, we propose a new paradigm based on graph theory (GT scheme) to improve the efficiency and accuracy of material identification, which focuses on processing the "topological relationship" rather than the values of bond lengths and bond angles among different structures. By using this method, automatic deduplication of a big materials database is achieved for the first time, identifying 626,772 unique structures from 865,458 original structures. Moreover, the graph theory scheme has been modified to solve some advanced problems such as identifying highly distorted structures, distinguishing structures with strong similarity, and classifying complex crystal structures in materials big data.
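The key idea, comparing topology rather than metric values, can be illustrated with a Weisfeiler-Lehman-style graph hash over the bonding graph: structures whose bond lengths differ but whose connectivity agrees get the same hash. This is a generic illustration of topological comparison, not the paper's GT scheme.

```python
from collections import Counter

def wl_hash(adjacency, labels, iterations=3):
    """Weisfeiler-Lehman-style hash of a labelled graph: iteratively refine
    node labels by their neighbours' label multisets, then hash the sorted
    multiset of final labels. Equal hashes are necessary (not sufficient)
    for topological equivalence - a crude stand-in for the paper's scheme."""
    lab = dict(labels)
    for _ in range(iterations):
        lab = {v: hash((lab[v], tuple(sorted(lab[u] for u in adjacency[v]))))
               for v in adjacency}
    return hash(tuple(sorted(Counter(lab.values()).items())))

# the same three-atom ring recorded twice with neighbours listed in
# different orders (as two database entries with different bond metrics might be)
g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
g2 = {0: [2, 1], 1: [2, 0], 2: [1, 0]}
atoms = {0: "O", 1: "C", 2: "C"}
print(wl_hash(g1, atoms) == wl_hash(g2, atoms))  # → True
```

Hashing every entry and bucketing by hash value is how such a scheme scales deduplication to hundreds of thousands of structures.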

10.
The Portuguese INAA laboratory processes approximately one thousand multi-matrix samples per year, generating fifteen thousand results in the same period, using the k0 methodology. In order to ensure that the data produced meet the required quality, every sample analysed is processed together with a reference material. Therefore, every year a large number of results for many reference materials are generated. This work analysed a large database created with the results from the reference materials irradiated in the period 2009-2013. Zeta-scores were calculated and different control charts were created as a function of the time period, irradiated mass, reference material and operator. The objective of this work was to recognise human errors, to identify deficiencies in the protocols and to improve the quality of the results generated by the laboratory.
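The zeta-score used for these control charts is the standard one: the difference between the laboratory result and the certified value, divided by the combined standard uncertainty. The numbers below are hypothetical, not from the laboratory's database.

```python
def zeta_score(measured, u_measured, certified, u_certified):
    """Zeta-score for reference-material quality control: measured vs
    certified value, normalised by the combined standard uncertainty.
    |zeta| <= 2 is conventionally considered satisfactory."""
    return (measured - certified) / (u_measured**2 + u_certified**2) ** 0.5

# hypothetical result for one element in a reference material (mg/kg)
print(round(zeta_score(10.4, 0.3, 10.0, 0.4), 2))  # → 0.8
```

Plotting these scores per operator, irradiated mass, or time period gives exactly the control charts the abstract describes.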

11.
Two different soft computing (SC) techniques (a competitive learning neural network and an integrated neural network-fuzzy logic-genetic algorithm approach) are employed in the analysis of a database subset obtained from the Cambridge Structural Database. The chemical problem chosen for study is relevant to the relationship between various metric parameters in transition metal imido (LnM=NZ, Z = carbon-based substituent) complexes and the chemical consequences of such relationships. The SC techniques confirmed and quantified the suspected relationship between the metal-nitrogen bond length and the metal-nitrogen-substituent bond angle for transition metal imidos: increased metal-nitrogen-carbon angles correlate with shortened metal-nitrogen distances. The mining effort also yielded an unexpected correlation between the NC distance and the MNC angle: shorter NC distances correlate with larger MNC angles. A fuzzy inference system is used to construct an MNred-NC-MNC hypersurface. This hypersurface suggests a complicated interdependence among NC, MNred, and the angle subtended by these two bonds. Also, major portions of the hypersurface are very flat, in regions where MNC is approaching linearity. The relationships are also seen to be influenced by whether the imido substituent is an alkyl or aryl group. Computationally, the present results are of particular interest in two respects. First, SC classification was able to isolate an "outlier" cluster. Identification of outliers is important as they may correspond to unreported experimental errors in the database or novel chemical entities, both of which warrant further investigation. Second, the SC database mining not only confirmed and quantified a suspected relationship (MNred versus MNC) within the data but also yielded a trend that was not suspected (NC versus MNC).

12.
For a set of a priori given radionuclides, extracted from a general nuclide data library, the authors use median estimates of the gamma-peak areas and estimates of their errors to produce a list of possible radionuclides matching gamma-ray line(s) and some measure of the reliability of this assignment.

An a priori determined list of nuclides is obtained by searching for a match with the energy information of the database. This procedure is performed in an interactive graphic mode by markers that superimpose the energy information provided by a general gamma-ray data library on the spectral data. This library of experimental data includes approximately 17,000 gamma-energy lines related to 756 known gamma emitter radionuclides listed by ICRP.
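The energy-matching step described above amounts to a tolerance search of measured peak energies against a sorted line library. The tiny library and tolerance below are hypothetical stand-ins for the ~17,000-line ICRP library.

```python
import bisect

def match_lines(peak_energies, library, tol=1.0):
    """Match measured gamma-peak energies (keV) against a library of
    (energy_keV, nuclide) pairs sorted by energy, within +/- tol keV.
    Returns {peak: sorted list of candidate nuclides}."""
    energies = [e for e, _ in library]
    matches = {}
    for peak in peak_energies:
        lo = bisect.bisect_left(energies, peak - tol)
        hi = bisect.bisect_right(energies, peak + tol)
        matches[peak] = sorted({nuc for _, nuc in library[lo:hi]})
    return matches

# hypothetical three-line library; a real one would hold ~17,000 lines
lib = sorted([(661.7, "Cs-137"), (1173.2, "Co-60"), (1332.5, "Co-60")])
print(match_lines([661.5, 1332.0], lib))
# → {661.5: ['Cs-137'], 1332.0: ['Co-60']}
```

A reliability measure, as in the paper, would then weight each candidate by how many of its library lines are matched and by the peak-area uncertainties.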


13.
For accurate thermochemical tests of electronic structure theory, accurate true anharmonic zero-point vibrational energies ZPVE(true) are needed. We discuss several possibilities to extract this information for molecules from density functional or wave function calculations and/or available experimental data: (1) Empirical universal scaling of density-functional-calculated harmonic ZPVE(harm)s, where we find that polyatomics require smaller scaling factors than diatomics. (2) Direct density-functional calculation by anharmonic second-order perturbation theory PT2. (3) Weighted averages of harmonic ZPVE(harm) and fundamental ZPVE(fund) (from fundamental vibrational transition frequencies), with weights (3/4, 1/4) for diatomics and (5/8, 3/8) for polyatomics. (4) Experimental correction of the PT2 harmonic contribution, i.e., the estimate ZPVE(true)PT2 + (ZPVE(fund)expt - ZPVE(fund)PT2) for ZPVE(true). The (5/8, 3/8) average of method 3 and the additive correction of method 4 have been proposed here. For our database of experimental ZPVE(true), consisting of 27 diatomics and 8 polyatomics, we find that methods 1 and 2, applied to the popular B3LYP and the nonempirical PBE and TPSS functionals and their one-parameter hybrids, yield polyatomic errors on the order of 0.1 kcal/mol. Larger errors are expected for molecules larger than those in our database. Method 3 yields errors on the order of 0.02 kcal/mol, but requires very accurate (e.g., experimental, coupled cluster, or best-performing density functional) input harmonic ZPVE(harm). Method 4 is the best-founded one that meets the requirements of high accuracy and practicality, requiring as experimental input only the highly accurate and widely available ZPVE(fund)expt and producing errors on the order of 0.05 kcal/mol that are relatively independent of functional and basis set.
As a part of our study, we also test the ability of the density functionals to predict accurate equilibrium bond lengths and angles for a data set of 21 mostly polyatomic molecules (since all calculated ZPVEs are evaluated at the correspondingly calculated molecular geometries).
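Method 3 of the abstract is simple enough to state directly in code: a weighted average of the harmonic and fundamental ZPVEs, with the weights taken from the paper. The input values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def zpve_true_estimate(zpve_harm, zpve_fund, polyatomic=True):
    """Method 3 of the abstract: ZPVE(true) estimated as a weighted average
    of harmonic and fundamental ZPVEs. Weights (5/8, 3/8) for polyatomics
    and (3/4, 1/4) for diatomics are the paper's proposed values."""
    w_harm, w_fund = (5 / 8, 3 / 8) if polyatomic else (3 / 4, 1 / 4)
    return w_harm * zpve_harm + w_fund * zpve_fund

# hypothetical harmonic and fundamental ZPVEs for a polyatomic, in kcal/mol
print(round(zpve_true_estimate(21.2, 20.4), 3))  # → 20.9
```

The weights interpolate between the harmonic value (which overestimates the true ZPVE) and the fundamental-based value (which underestimates it), which is why the average lands close to ZPVE(true).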

14.
Activity data for small molecules are invaluable in chemoinformatics. Various bioactivity databases exist containing detailed information on target proteins and quantitative binding data for small molecules extracted from journals and patents. In the current work, we have merged several public and commercial bioactivity databases into one bioactivity metabase. The molecular representation, target information, and activity data of the vendor databases were standardized. The main motivation of the work was to create a single relational database which allows fast and simple data retrieval by in-house scientists. Second, we wanted to know the amount of overlap between databases by commercial and public vendors, to see whether the former contain data complementing the latter. Third, we quantified the degree of inconsistency between data sources by comparing data points derived from the same scientific article cited by more than one vendor. We found that each data source contains unique data, owing to the different scientific articles cited by the vendors. When comparing data derived from the same article we found that inconsistencies between the vendors are common. In conclusion, using databases of different vendors is still useful since the data overlap is not complete. It should be noted that this can be partially explained by the inconsistencies and errors in the source data.
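The third analysis, flagging inconsistent data points that two vendors both attribute to the same article, can be sketched as a keyed comparison. The key structure (article ID, compound, target) and the tolerance are assumptions for illustration, not the paper's exact protocol.

```python
def inconsistent_points(vendor_a, vendor_b, rel_tol=0.01):
    """Compare activity values (e.g. Ki in nM) that two vendor databases
    both attribute to the same (article, compound, target) key; return the
    keys where the values disagree beyond a relative tolerance."""
    flagged = []
    for key in vendor_a.keys() & vendor_b.keys():   # shared data points only
        a, b = vendor_a[key], vendor_b[key]
        if abs(a - b) > rel_tol * max(abs(a), abs(b)):
            flagged.append((key, a, b))
    return sorted(flagged)

# hypothetical extracts: the second point differs tenfold (a unit error?)
a = {("PMID:1", "cpd1", "T1"): 12.0, ("PMID:1", "cpd2", "T1"): 45.0}
b = {("PMID:1", "cpd1", "T1"): 12.0, ("PMID:1", "cpd2", "T1"): 450.0}
print(inconsistent_points(a, b))
# → [(('PMID:1', 'cpd2', 'T1'), 45.0, 450.0)]
```

Order-of-magnitude disagreements like the one flagged here are typical of the unit and transcription errors the abstract reports as common.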

15.
In this work, we build upon our previous work on the theoretical spectroscopy of ammonia, NH3. Compared to our 2008 study, we include more physics in our rovibrational calculations and more experimental data in the refinement procedure, and these enable us to produce a potential energy surface (PES) of unprecedented accuracy. We call this the HSL-2 PES. The additional physics we include is a second-order correction for the breakdown of the Born-Oppenheimer approximation, and we find it to be critical for improved results. By including experimental data for higher rotational levels in the refinement procedure, we were able to greatly reduce our systematic errors for the rotational dependence of our predictions. These additions together lead to a significantly improved total angular momentum (J) dependence in our computed rovibrational energies. The root-mean-square error between our predictions using the HSL-2 PES and the reliable energy levels from the HITRAN database for J = 0-6 and J = 7/8 for 14NH3 is only 0.015 cm^-1 and 0.020/0.023 cm^-1, respectively. The root-mean-square errors for the characteristic inversion splittings are approximately 1/3 smaller than those for energy levels. The root-mean-square error for the 6002 J = 0-8 transition energies is 0.020 cm^-1. Overall, for J = 0-8, the spectroscopic data computed with HSL-2 is roughly an order of magnitude more accurate relative to our previous best ammonia PES (denoted HSL-1). These impressive numbers are eclipsed only by the root-mean-square error between our predictions for purely rotational transition energies of 15NH3 and the highly accurate Cologne database (CDMS): 0.00034 cm^-1 (10 MHz), in other words, 2 orders of magnitude smaller. In addition, we identify a deficiency in the 15NH3 energy levels determined from a model of the experimental data.
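The headline numbers above are all root-mean-square errors between computed and reference energy levels, which is a one-liner worth making explicit. The level values below are hypothetical, not taken from HITRAN or the paper.

```python
def rms_error(predicted, observed):
    """Root-mean-square deviation between computed and reference values,
    as used for the PES-vs-HITRAN comparisons in the abstract."""
    n = len(predicted)
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5

# hypothetical rovibrational levels, in cm^-1
pred = [100.003, 250.010, 399.996]
obs  = [100.000, 250.000, 400.000]
print(round(rms_error(pred, obs), 4))  # → 0.0065
```

Computing this separately per J value is what reveals the improved rotational dependence the authors emphasise.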

16.
We present work on the creation of a ceramic materials database which contains data gleaned from literature data sets as well as new data obtained from combinatorial experiments on the London University Search Instrument. At the time of this writing, the database contains data related to two main groups of materials, mainly in the perovskite family. Permittivity measurements of electroceramic materials are the first area of interest, while ion diffusion measurements of oxygen ion conductors are the second. The nature of the database design does not restrict the type of measurements which can be stored; as the available data increase, the database may become a generic, publicly available ceramic materials resource.

17.
This paper deals with the origins of errors in data interpretation when using modern GPC with dual detection (refractometer-viscometer) as a method for determining the average molecular weights of polymers. We describe the different errors, classified in two groups: typical chromatographic errors and data-treatment errors, and we show that they can lead to seriously erroneous molecular weight values. For each case, we propose the best way to avoid or correct these errors, so that modern GPC can be used as a very accurate method of polymer characterization.
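The quantities at stake, the number- and weight-average molecular weights computed from GPC slice data, follow standard definitions, so a data-treatment error in any slice propagates directly into them. The slice values below are hypothetical.

```python
def average_molecular_weights(slices):
    """Standard GPC averages from slice data. Each slice is (h_i, M_i),
    where h_i is the detector height (proportional to mass concentration)
    and M_i the molecular weight assigned to that elution slice.
    Mn = sum(h) / sum(h/M); Mw = sum(h*M) / sum(h)."""
    mn = sum(h for h, m in slices) / sum(h / m for h, m in slices)
    mw = sum(h * m for h, m in slices) / sum(h for h, m in slices)
    return mn, mw

# hypothetical three-slice chromatogram
slices = [(1.0, 10_000), (2.0, 20_000), (1.0, 40_000)]
mn, mw = average_molecular_weights(slices)
print(round(mn), round(mw), round(mw / mn, 2))  # → 17778 22500 1.27
```

Because Mn weights low-M slices heavily and Mw weights high-M slices heavily, baseline or integration-limit errors at either end of the peak, typical of the chromatographic errors discussed in the paper, distort the two averages in opposite directions.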

18.
We describe an Oracle database application for general use within virtual chemistry. The application functions as a central hub and repository for chemical data with interfaces to external calculators. It deals with the general problems of merging data from disparate sources and with scheduling of computational tasks for parallel or sequential execution in a mixed environment. The central database is used for the storage of input, intermediary, and final data as well as for job control. A calculation job is split into distinct tasks, or units of work, which are put in a queue. Tasks are dequeued and handled by specialized calculators. These calculators are in-house or commercial programs for which adaptor modules for connection to the database must be written. Tasks are handled in a transactional fashion, so that uncompleted or failed tasks are left in the queue. This makes the system stable to many types of disturbances. Sorting, filtering, and merging operations are handled by the database itself. Usage is very general, but some specific examples are (1) as a back end for a chemical property calculator Web page, (2) in an automated quantitative structure-activity relationship system, and (3) in virtual screens.
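The transactional queue behaviour, where a failed task stays queued because its status change never commits, can be sketched with a small database-backed queue. SQLite stands in for Oracle here, and the table layout and task payloads are assumptions for illustration.

```python
import sqlite3

# minimal sketch of a database-backed task queue with transactional dequeue
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, payload TEXT,"
    " status TEXT DEFAULT 'queued')"
)
con.executemany("INSERT INTO tasks (payload) VALUES (?)",
                [("calc logP",), ("calc pKa",)])
con.commit()

def run_next_task(con, calculator):
    """Dequeue and run one task. The status update commits only when the
    calculator succeeds, so a failed task remains 'queued' for retry."""
    row = con.execute(
        "SELECT id, payload FROM tasks WHERE status='queued'"
        " ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    task_id, payload = row
    try:
        result = calculator(payload)   # an adaptor module would call the real program
        con.execute("UPDATE tasks SET status='done' WHERE id=?", (task_id,))
        con.commit()
        return result
    except Exception:
        con.rollback()                 # leave the task in the queue
        raise

print(run_next_task(con, lambda p: p.upper()))  # → CALC LOGP
```

Multiple worker processes polling the same table gives the mixed parallel/sequential scheduling the abstract describes, with the database itself serialising the dequeue.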

19.
Molecular dynamics simulation is an important application in theoretical chemistry, and with the large high-performance computing resources available today the programs also generate huge amounts of output data. In particular in the life sciences, with complex biomolecules such as proteins, simulation projects regularly deal with several terabytes of data. Apart from the need for more cost-efficient storage, it is increasingly important to be able to archive data, secure its integrity against disk or file-transfer errors, provide rapid access, and facilitate the exchange of data through open interfaces. There is already a whole range of different formats in use, but few if any of them (including our previous ones) fulfill all these goals. To address these shortcomings, we present "Trajectory Next Generation" (TNG), a flexible but highly optimized and efficient file format designed with interoperability in mind. TNG provides state-of-the-art multiframe compression as well as a container framework that will make it possible to extend it with new compression algorithms without modifications in programs using it. TNG will be the new file format in the next major release of the GROMACS package, but it has been implemented as a separate library and API with liberal licensing to enable wide adoption in both academic and commercial codes. © 2013 Wiley Periodicals, Inc.

20.