Similar literature
 20 similar records found
1.
With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable and large-scale protein annotation and biological knowledge discovery. The Protein Information Resource (PIR) provides an integrated public resource of protein informatics to support genomic and proteomic research. PIR produces the Protein Sequence Database of functionally annotated protein sequences. The annotation problems are addressed by a classification-driven and rule-based method with evidence attribution, coupled with an integrated knowledge base system being developed. The approach allows sensitive identification, consistent and rich annotation, and systematic detection of annotation errors, as well as distinction of experimentally verified and computationally predicted features. The knowledge base consists of two new databases, sequence analysis tools, and graphical interfaces. PIR-NREF, a non-redundant reference database, provides a timely and comprehensive collection of all protein sequences, totaling more than 1,000,000 entries. iProClass, an integrated database of protein family, function, and structure information, provides extensive value-added features for about 830,000 proteins with rich links to over 50 molecular databases. This paper describes our approach to protein functional annotation with case studies and examines common identification errors. It also illustrates that data integration in PIR supports exploration of protein relationships and may reveal protein functional associations beyond sequence homology.

2.
The exponential growth of large-scale molecular sequence data and of the PubMed scientific literature has prompted active research in biological literature mining and information extraction to facilitate genome/proteome annotation and improve the quality of biological databases. Motivated by the promise of text mining methodologies and, at the same time, by the lack of adequate curated data for training and benchmarking, the Protein Information Resource (PIR) has developed a resource for protein literature mining, iProLINK (integrated Protein Literature INformation and Knowledge). As PIR focuses its effort on the curation of the UniProt protein sequence database, the goal of iProLINK is to provide curated data sources that can be utilized for text mining research in the areas of bibliography mapping, annotation extraction, protein named entity recognition, and protein ontology development. The data sources for bibliography mapping and annotation extraction include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter includes several hundred abstracts and full-text articles tagged with experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database. The data sources for entity recognition and ontology development include a protein name dictionary, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, as well as a protein ontology based on PIRSF protein family names. iProLINK is freely accessible at http://pir.georgetown.edu/iprolink, with hypertext links for all downloadable files.

3.
Tandem mass spectrometry is commonly used to identify peptides (and thereby proteins) that are present in complex mixtures. Peptide identification from tandem mass spectra is partially automated, but still requires human curation to resolve "borderline" peptide-spectrum matches (PSMs). SILVER is web-based software that assists manual curation of tandem mass spectra, using the recently developed intensity-based machine-learning approach to scoring PSMs of Elias et al. In this method, a large training set of peptide, fragment, and peak-intensity properties for both matched and mismatched PSMs was used to develop a score measuring consistency between each predicted fragment ion of a candidate peptide and its corresponding observed spectral peak intensity. The SILVER interface provides a visual representation of match quality between each candidate fragment ion and the observed spectrum, thereby expediting manual curation of tandem mass spectra. SILVER is available online at http://llama.med.harvard.edu/Software.html.

4.
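Li Xun, Wang Renxiao. Chinese Journal of Chemistry (《中国化学》), 2009, 27(1): 23-28
We developed a method named KIAb (Keyword-based Identification of Antibodies) for automatically identifying antibody structures in the Protein Data Bank (PDB). The method reads files in PDB format, searches them for specific antibody-related keywords, and makes a decision. Using this method, we identified 780 structure files in the PDB; manual inspection confirmed that 767 of them are antibodies, a success rate of 98.3%. The results cover essentially all entries in the antibody structure database Summary of Antibody Crystal Structures (SACS) and also include 34 antibody structures that are not in that database. The method therefore identifies antibodies in the PDB more completely and with a very low false-positive rate.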
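
The keyword-matching idea behind KIAb can be illustrated with a minimal Python sketch. The abstract does not list the actual keywords or say which PDB records are scanned, so the keyword tuple, the record types, and the function names below are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical keyword list: the real KIAb keywords are not given in the abstract.
ANTIBODY_KEYWORDS = ("ANTIBODY", "IMMUNOGLOBULIN", "FAB FRAGMENT", "MONOCLONAL")

# PDB record types whose free text is scanned (an assumption for this sketch).
TEXT_RECORDS = ("HEADER", "TITLE", "COMPND", "KEYWDS")


def looks_like_antibody(pdb_text: str) -> bool:
    """Return True if any descriptive record mentions an antibody keyword."""
    for line in pdb_text.splitlines():
        # Only descriptive records are checked; coordinate records are ignored.
        if line.startswith(TEXT_RECORDS):
            upper = line.upper()
            if any(keyword in upper for keyword in ANTIBODY_KEYWORDS):
                return True
    return False


def success_rate(confirmed: int, flagged: int) -> float:
    """Percentage of flagged files confirmed by manual inspection."""
    return 100.0 * confirmed / flagged
```

With the counts reported in the abstract, `success_rate(767, 780)` rounds to 98.3, matching the stated success rate.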

5.
Over the last two decades, the volume of chemical and biological data has increased constantly. Converting data sets into knowledge is both expensive and time-consuming; as a result, workflow technology, with platforms such as KNIME, was developed to facilitate searching through multiple heterogeneous data sources, filtering by specific criteria, and extracting hidden information from these large data sets. Before any QSAR modeling, manual data curation is strongly recommended. This is feasible for small datasets, but for the extensive data recently accumulated in public databases a manual process is hardly feasible. In this work, we suggest using KNIME as an automated workflow solution for data curation and for the development and validation of predictive QSAR models from a huge dataset. In this study, we used 250250 structures from the NCI database; only 3520 compounds passed through our workflow with their corresponding experimental log P. This property was investigated as a case study to improve some existing log P calculation algorithms.

6.
Publicly available compound and bioactivity databases provide an essential basis for data-driven applications in life-science research and drug design. By analyzing several bioactivity repositories, we discovered differences in compound and target coverage, which argue for the combined use of data from multiple sources. Using data from ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs, we assembled a consensus dataset focusing on small molecules with bioactivity on human macromolecular targets. This allowed an improved coverage of compound space and targets, and an automated comparison and curation of structural and bioactivity data to reveal potentially erroneous entries and increase confidence. The consensus dataset comprised more than 1.1 million compounds with over 10.9 million bioactivity data points with annotations on assay type and bioactivity confidence, providing a useful ensemble for computational applications in drug design and chemogenomics.
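
As a rough illustration of the automated comparison step, the following Python sketch groups bioactivity values from several sources by compound and target and flags pairs whose values disagree. The record layout, the identifiers, the use of the median, and the 1.0 log-unit threshold are all assumptions for this sketch; the abstract does not describe the actual curation rules.

```python
from collections import defaultdict
from statistics import median


def build_consensus(records, max_spread=1.0):
    """Merge (source, compound_id, target_id, p_activity) records.

    max_spread is a hypothetical threshold (in log units) above which the
    values reported by different sources are treated as conflicting.
    """
    grouped = defaultdict(list)
    for source, compound_id, target_id, p_activity in records:
        grouped[(compound_id, target_id)].append((source, p_activity))

    consensus = {}
    for key, entries in grouped.items():
        values = [value for _, value in entries]
        consensus[key] = {
            "sources": sorted({source for source, _ in entries}),
            "median": median(values),
            # Flag potentially erroneous entries for closer inspection.
            "flagged": max(values) - min(values) > max_spread,
        }
    return consensus
```

A pair reported consistently by two sources gains confidence, while a pair whose values differ by several log units is flagged rather than silently averaged.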

7.
Validated MALDI-TOF/TOF mass spectra for protein standards   Cited 5 times in total (self-citations: 0; citations by others: 5)
A current focus of proteomics research is the establishment of acceptable confidence measures in the assignment of protein identifications in an unknown sample. Development of new algorithmic approaches would greatly benefit from a standard reference set of spectra for known proteins for the purpose of testing and training. Here we describe an openly available library of mass spectra generated on an ABI 4700 MALDI TOF/TOF from 246 known, individually purified and trypsin-digested protein samples. The initial full release of the Aurum Dataset includes gel images, peak lists, spectra, search result files, decoy database analysis files, FASTA file of protein sequences, manual curation, and summary pages describing protein coverage and peptides matched by MS/MS followed by decoy database analysis using Mascot, Sequest, and X!Tandem. The data are publicly available for use at ProteomeCommons.org.

8.
Mass spectrometry imaging (MSI) is widely used for the label-free molecular mapping of biological samples. The identification of co-localized molecules in MSI data is crucial to the understanding of biochemical pathways. One of the key challenges in molecular colocalization is that complex MSI data are too large for manual annotation but too small for training deep neural networks. Herein, we introduce a self-supervised clustering approach based on contrastive learning, which shows excellent performance in clustering MSI data. We train a deep convolutional neural network (CNN) using MSI data from a single experiment without manual annotations to effectively learn high-level spatial features from ion images and classify them based on molecular colocalizations. We demonstrate that contrastive learning generates ion image representations that form well-resolved clusters. Subsequent self-labeling is used to fine-tune both the CNN encoder and linear classifier based on confidently classified ion images. This new approach enables autonomous and high-throughput identification of co-localized species in MSI data, which will dramatically expand the application of spatial lipidomics, metabolomics, and proteomics in biological research.

Contrastive learning is used to train a deep convolutional neural network to identify high-level features in mass spectrometry imaging data. These features enable self-supervised clustering of ion images without manual annotation.

9.
Mitochondrial proteins exert important functions in biological pathways; in particular, they are involved in apoptotic processes. We applied proteomics technologies to analyze the mitochondrial proteins of the neuroblastoma cell line IMR-32, which is often used in apoptosis studies. The proteins were analyzed by two-dimensional (2-D) electrophoresis followed by matrix-assisted laser desorption/ionization-mass spectrometry (MALDI-MS). A total of 185 different gene products were identified, of which approximately 55% were enzymes with a broad spectrum of catalytic activities. Sixteen proteins were detected only in this preparation; the others have been detected in two or more protein samples analyzed by MS in our laboratory. The 16 unique gene products were represented by one spot each, whereas most of the frequently detected proteins were represented by multiple spots. On average, approximately 5-10 spots corresponded to one gene product. For two thirds of the proteins identified, an annotation exists in the SWISS-PROT database about their subcellular location. They are mainly described as mitochondrial, 8 as endoplasmic reticulum, 3 as peroxisomal and only 12 low-abundance proteins are described as cytosolic proteins. The list includes about 30 unknown, hypothetical or poorly described gene products. Some of them are represented by strong spots and the present study shows that they are indeed expressed and are localized in the mitochondria.

10.
We have developed a new algorithm to identify proteins by means of peptide mass fingerprinting. Starting from the matrix-assisted laser desorption/ionization-time-of-flight (MALDI-TOF) spectra and environmental data such as species, isoelectric point and molecular weight, as well as chemical modifications or number of missed cleavages of a protein, the program performs a fully automated identification of the protein. The first step is a peak detection algorithm, which allows precise and fast determination of peptide masses, even if the peaks are of low intensity or they overlap. In the second step the masses and environmental data are used by the identification algorithm to search in protein sequence databases (SWISS-PROT and/or TrEMBL) for protein entries that match the input data. Consequently, a list of candidate proteins is selected from the database, and a score calculation provides a ranking according to the quality of the match. To define the most discriminating scoring calculation we analyzed the respective role of each parameter in two directions. The first one is based on filtering and exploratory effects, while the second direction focuses on the levels where the parameters intervene in the identification process. Thus, according to our analysis, all input parameters contribute to the score, although with different weights. Since it is difficult to estimate the weights in advance, they have been computed with a genetic algorithm, using a training set of 91 protein spectra with their environmental data. We tested the resulting scoring calculation on a test set of ten proteins and compared the identification results with those of other peptide mass fingerprinting programs.
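
The database search step of peptide mass fingerprinting can be sketched in Python as below. This is a deliberately simplified stand-in, not the algorithm of the abstract: it digests each candidate sequence in silico with trypsin rules, computes singly protonated monoisotopic peptide masses, and ranks candidates by a plain count of matched peaks. The function names, the 50 ppm tolerance, and the count-based score are assumptions; the weighted multi-parameter score and the peak detection step described above are not reproduced.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K or R (not before P), allowing some missed cleavages."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def mh_mass(peptide):
    """Monoisotopic mass of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def count_matches(sequence, observed, tol_ppm=50.0, missed_cleavages=0):
    """Number of theoretical peptides matching at least one observed peak."""
    hits = 0
    for peptide in tryptic_peptides(sequence, missed_cleavages):
        mass = mh_mass(peptide)
        if any(abs(peak - mass) / mass * 1e6 <= tol_ppm for peak in observed):
            hits += 1
    return hits


def rank_candidates(database, observed, tol_ppm=50.0):
    """Rank {name: sequence} entries by matched-peptide count, best first."""
    scores = {name: count_matches(seq, observed, tol_ppm)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A real scoring function would also weigh the environmental data (species, isoelectric point, molecular weight, modifications), which is where parameter weights such as those optimized in the abstract would enter.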

11.
The issue of clustering proteins into homologous protein families (HPFs) has attracted considerable attention from researchers. On the one hand, many databases of protein families have been developed by using popular sequence alignment tools and relatively simple clustering methods followed by extensive manual curation. On the other hand, more elaborate clustering approaches have been used, yet with a very limited degree of success. This paper advocates an approach to clustering protein families involving knowledge of the protein functions to adjust the parameter of similarity scale shift. One more source of external information is utilised as we proceed to reconstruct HPF evolutionary histories over an evolutionary tree; the consistency between these histories and information on gene arrangement in the genomes is used to narrow down the choice of the clustering.

12.
We present a data processing pipeline for Pyrolysis-Gas Chromatography/Mass Spectrometry (Py-GC/MS) data that is suitable for high-throughput analysis of lignocellulosic samples. The approach applies multivariate curve resolution by alternate regression (MCR-AR) and automated peak assignment. MCR-AR employs parallel processing of multiple chromatograms, as opposed to sequential processing used in prevailing applications. Parallel processing provides a global peak list that is consistent for all chromatograms, and therefore does not require tedious manual curation. We evaluated this approach on wood samples from aspen and Norway spruce, and found that parallel processing results in an overall higher precision of peak area from integrated peaks. To further increase the speed of data processing we evaluated automated peak assignment solely based on basepeak mass. This approach gave estimates of the proportion of lignin (as syringyl-, guaiacyl- and p-hydroxyphenyl-type lignin) and carbohydrate polymers in the wood samples that were in high agreement with those where peak assignments were based on full spectra. This method establishes Py-GC/MS as a sensitive, robust and versatile high-throughput screening platform well suited to a non-specialist operator.

13.
The peptide mass fingerprinting technique is commonly used for identifying proteins analyzed by mass spectrometry (MS) after enzymatic digestion. Our goal is to build a theoretical model that predicts the mass spectra of such digestion products in order to improve the identification and characterization of proteins using this technique. We present here the first step towards a full MS model. We have modeled MS spectra using the atomic composition of peptides and evaluated the influence that this composition may have on the MS signals. Peptides deduced from the SWISS-PROT protein sequence database were used for the calculation. To validate the model, the variability of the peptide mass distribution in SWISS-PROT was compared to two theoretical, randomly generated databases. Functions have been built that describe the behavior of the isotopic distribution according to the mass of peptides. The variability of these functions was analyzed. In particular, the influence of sulfur was studied. This work, while representing only a first step in the construction of an MS model, yields immediate practical results, as the new isotopic distribution model significantly improves peak detection in MS spectra used by protein identification algorithms.

14.
The scientific literature is an important source of experimental and chemical structure data. Very often these data have been harvested into smaller or larger data collections, leaving data quality and curation issues on the shoulders of users. The current research presents a systematic and reproducible workflow for collecting series of data points from scientific literature and assembling a database that is suitable for the purposes of high quality modelling and decision support. The quality assurance aspect of the workflow is concerned with the curation of both chemical structures and associated toxicity values at (1) single data point level and (2) collection of data points level. The assembly of a database employs a novel “timeline” approach. The workflow is implemented as a software solution and its applicability is demonstrated on the example of the Tetrahymena pyriformis acute aquatic toxicity endpoint. A literature collection of 86 primary publications for T. pyriformis was found to contain 2,072 chemical compounds and 2,498 unique toxicity values, which divide into 2,440 numerical and 58 textual values. Every chemical compound was assigned to a preferred toxicity value. Examples of the most common chemical and toxicological data curation scenarios are discussed.

15.
In order to understand the molecular mechanism underlying any disease, knowledge about the interacting proteins in the disease pathway is essential. The number of revealed protein-protein interactions (PPI) is still very limited compared to the available protein sequences of different organisms. Although experiment-based high-throughput technologies provide some data about these interactions, those data are often fairly noisy. Computational techniques for predicting protein-protein interactions therefore assume significance. A total of 1296 binary fingerprints that encode a combination of structural and geometric properties were developed using the crystallographic data of 15,000 protein complexes in the PDB server. In a case study, these fingerprints were created for proteins implicated in Type 2 diabetes mellitus. The fingerprints were input into an SVM-based model for discriminating disease proteins from non-disease proteins, yielding a classification accuracy of 78.2% (AUC value of 0.78) on an external data set composed of proteins retrieved via text mining of diabetes-related literature. A PPI network was constructed and analysed to explore new disease targets. The integrated approach exemplified here has potential for identifying disease-related proteins, functional annotation and other proteomics studies.

16.
We recently studied the protein composition of a Saccharomyces cerevisiae wine yeast strain (K310) of enological interest. About 2,500 spots of 8-250 kDa observed molecular mass were resolved by two-dimensional gel electrophoresis. Experimental molecular masses and isoelectric points were calculated for most of them. Twenty-seven proteins were subjected to Edman microsequencing. N-terminal sequences of 12/27 proteins were determined, whereas internal sequences of 6/27 proteins were obtained following in situ proteolysis. Comparison between the experimental data and those reported in the SWISS-PROT database revealed some differences between genotypic and phenotypic sequences. These are indicative of the changes a protein can undergo with respect to the primary structure coded by the genomic DNA. Our results highlight the need to complement genomic analysis with detailed proteomics in order to refine the vast amount of information provided by DNA sequencing and to find an exact correlation between genome and proteome.

17.
Complete and accurate profiling of cellular organelle proteomes, while challenging, is important for the understanding of detailed cellular processes at the organelle level. Mass spectrometry technologies coupled with bioinformatics analysis provide an effective approach for protein identification and functional interpretation of organelle proteomes. In this study, we have compiled human organelle reference datasets from large-scale proteomic studies and protein databases for 7 lysosome-related organelles (LROs), as well as the endoplasmic reticulum and mitochondria, for comparative organelle proteome analysis. Heterogeneous sources of human organelle proteins and rodent homologs are mapped to human UniProtKB protein entries based on ID and/or peptide mappings, followed by functional annotation and categorization using the iProXpress proteomic expression analysis system. Cataloging organelle proteomes allows close examination of both shared and unique proteins among various LROs and reveals their functional relevance. The proteomic comparisons show that LROs are a closely related family of organelles. The shared proteins indicate the dynamic and hybrid nature of LROs, while the unique transmembrane proteins may represent additional candidate marker proteins for LROs. This comparative analysis, therefore, provides a basis for hypothesis formulation and experimental validation of organelle proteins and their functional roles.

18.
Compound annotation using MS/MS data is the major bottleneck in interpretation of mass spectrometry data during non-targeted screening and suspect screening exposomics studies. Apart from compound identification using available databases or mass spectral libraries, the true challenge comes when completely new compounds have to be identified. Along with recent advances in MS instrumentation that have laid the groundwork for a new revolutionary age in environmental exposomics, a multitude of cheminformatics annotation approaches has been developed. Herein, we review the basic principles of the cutting-edge cheminformatics MS-based approaches employed in eco-exposome annotation. We give a solid background discussing the eco-exposome concept in relation to the advances in MS instrumentation, and define the three crucial cheminformatics tasks used in the eco-exposome annotation: molecular formula assignment, compound prioritization and compound annotation. The basic principles of compound annotation are discussed, which are based on three approaches of utilizing structural information inherent to MS data. These involve direct, indirect and joint annotation approaches. We assess their performance through the ability to annotate eco-exposome constituents. We discuss future perspectives and give directions to new annotation strategies and performance evaluation protocols aiming to solve current issues hampering the incorporation of cheminformatics annotation approaches in regular eco-exposome annotation workflows.

19.
Patent specifications are one of many information sources needed to progress drug discovery projects. Understanding compound prior art and novelty checking, validation of biological assays, and identification of new starting points for chemical explorations are a few areas where patent analysis is an important component. Cheminformatics methods can be used to facilitate the identification of so-called key compounds in patent specifications. Such methods, relying on structural information extracted from documents by expert curation or text mining, can complement or in some cases replace the traditional manual approach of searching for clues in the text. This paper describes and compares three different methods for the automatic prediction of key compounds in patent specifications using structural information alone. For this data set, the cluster seed analysis described by Hattori et al. (Hattori, K.; Wakabayashi, H.; Tamaki, K. Predicting key example compounds in competitors' patent applications using structural information alone. J. Chem. Inf. Model. 2008, 48, 135-142) is superior in terms of prediction accuracy with 26 out of 48 drugs (54%) correctly predicted from their corresponding patents. Nevertheless, the two new methods, based on frequency of R-groups (FOG) and maximum common substructure (MCS) similarity measures, show significant advantages due to their inherent ability to visualize relevant structural features. The results of the FOG method can be enhanced by manual selection of the scaffolds used in the analysis. Finally, a successful example of applying FOG analysis for designing potent ATP-competitive AXL kinase inhibitors with improved properties is described.

20.
In the past few years, NMR has been extensively utilized as a screening tool for drug discovery using various types of compound libraries. The designs of NMR-specific chemical libraries that utilize a fragment-based approach based on drug-like characteristics have been previously reported. In this article, a new type of compound library will be described that focuses on aiding in the functional annotation of novel proteins that have been identified from various ongoing genomics efforts. The NMR functional chemical library comprises small molecules with known biological activity, such as co-factors, inhibitors, metabolites and substrates. This functional library was developed through an extensive manual effort of mining several databases based on known ligand interactions with protein systems. In order to increase the efficiency of screening the NMR functional library, the compounds are screened as mixtures of 3-4 compounds, which avoids the need to deconvolute positive hits by maintaining a unique NMR resonance and function for each compound in the mixture. The functional library has been used in the identification of general biological function of hypothetical proteins identified from the Protein Structure Initiative.
