Similar literature
 Found 20 similar articles; search took 296 ms
1.
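Classification Modeling and Recognition of Protein Fold Types   Cited by: 2 (self-citations: 0; citations by others: 2)
刘岳, 李晓琴, 徐海松, 乔辉. Acta Physico-Chimica Sinica (《物理化学学报》), 2009, 25(12): 2558-2564
How a protein's amino acid sequence determines its three-dimensional structure is one of the central questions in the life sciences today. A fold type reflects the topological pattern of a protein's core structure, and fold recognition is an important part of protein sequence-structure research. We studied 36 fold types that could not be modeled independently, which together account for 41.8% of the α, β and α/β class proteins in the Astral 1.65 sequence database. Samples with sequence identity below 25% were selected as the training set and clustered hierarchically using root-mean-square deviation (RMSD) as the metric, producing a number of fold subclasses. For each subclass, a profile hidden Markov model (profile-HMM) was built from a structural alignment produced by the multiple structural alignment algorithm MUSTANG. Using the 9505 samples in Astral 1.65 with sequence identity below 95% as the test set, the 36 fold types were recognized with an average sensitivity of 90%, a specificity of 99% and a Matthews correlation coefficient (MCC) of 0.95. The results show that, for fold types with many members for which no single unified model can be built, RMSD-based systematic classification and modeling achieves high recognition accuracy in every case, offering a new method and line of thinking for protein fold recognition and laying a foundation for further research.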
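
The three figures of merit quoted in this abstract (sensitivity, specificity and MCC) all derive from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code. The function name `binary_metrics` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives for one fold type (hypothetical
    inputs; the paper does not list per-class counts in its abstract).
    """
    # Sensitivity: fraction of true members that were recognized.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-members that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc
```

For example, `binary_metrics(90, 10, 990, 10)` evaluates one invented class with 100 true members among 1100 samples. Averaging the per-class values over all fold types would give figures comparable in kind to those reported above.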

2.
In this paper, we propose a method for creating a 60-dimensional feature vector for protein sequences via the general form of pseudo amino acid composition. The feature vector is built from the content of each amino acid, the total distance of each amino acid from the first amino acid in the protein sequence, and the distribution of the 20 amino acids. The resulting cosine distance matrix (also called the similarity matrix) is used to construct a phylogenetic tree by the neighbour-joining method. To show the applicability of our approach, we tested it on three protein datasets: 1) ND5 protein sequences from nine species, 2) ND6 protein sequences from eight species, and 3) 50 coronavirus spike proteins. The results agree with the known history and with the output of the widely used multiple sequence alignment program ClustalW. We also compared our phylogenetic results with six other recently proposed alignment-free methods. These comparisons show that our method gives more consistent biological relationships than the others. In addition, its time complexity is linear and it requires less space than other alignment-free methods that use graphical representations. It should be noted that the multiple sequence alignment method has exponential time complexity.
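
The abstract names three groups of 20 features but does not give their formulas, so the following is only a sketch under stated assumptions. "Content" is taken as relative frequency, "total distance" as the summed offset of each occurrence from the first residue, and "distribution" as the variance of the occurrence positions. The names `feature_vector` and `cosine_distance` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def feature_vector(seq):
    """Build a 60-dimensional vector: three features per amino acid.

    The three definitions are assumptions made for illustration; the
    paper's exact formulas are not stated in the abstract.
    """
    n = len(seq)
    vec = []
    for aa in AMINO_ACIDS:
        # 1-based positions at which this residue occurs.
        positions = [i + 1 for i, ch in enumerate(seq) if ch == aa]
        count = len(positions)
        content = count / n if n else 0.0
        # Summed distance of every occurrence from position 1.
        total_distance = float(sum(p - 1 for p in positions))
        if count:
            mean = sum(positions) / count
            distribution = sum((p - mean) ** 2 for p in positions) / count
        else:
            distribution = 0.0
        vec.extend([content, total_distance, distribution])
    return vec


def cosine_distance(u, v):
    """Return 1 - cos(u, v); 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

Each sequence is scanned once, which is consistent with the linear time complexity claimed above. A matrix of pairwise `cosine_distance` values could then be passed to any neighbour-joining implementation.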

3.
4.
Point Accepted Mutation (PAM) is the Markov model of amino acid replacements in proteins introduced by Dayhoff and her co-workers (Dayhoff et al., 1978). The PAM matrices and other matrices based on the PAM model have been widely accepted as the standard scoring system of protein sequence similarity in protein sequence alignment tools. Here, we present Contact Accepted mutatiOn (CAO), a Markov model of protein residue contact mutations. The CAO model simulates the interchanging of structurally defined side-chain contacts, and introduces additional structural information into protein sequence alignments. Therefore, similarities between structurally conserved sequences can be detected even without apparent sequence similarity. CAO has been benchmarked on the HOMSTRAD database and a subset of the CATH database, by comparing sequence alignments with reference alignments derived from structural superposition. CAO yields scores that reflect coherently the structural quality of sequence alignments, which has implications particularly for homology modelling and threading techniques.

5.
The program ALIGN_MTX aligns two textual sequences, allowing sequence elements to be designated by any combination of several characters and allowing arbitrary user-defined substitution matrices. It can be used not only to align amino acid and nucleotide sequences but also for sequence-structure alignment in threading, for amino acid sequence alignment using a precomputed PSSM, and in other cases where biological or non-biological textual sequences must be aligned. This distinguishes it from most similar alignment programs, which as a rule align only amino acid or nucleotide sequences represented as strings of single alphabetic characters. ALIGN_MTX is available as a downloadable zip archive at http://www.imbbp.org/software/ALIGN_MTX/ and is free to use. As an application of the program, the alignment quality obtained with different types of substitution matrix was compared on sets of distantly related protein pairs. Alignment accuracy was tested with the threading matrix SORDIS, which is based on side-chain orientation relative to hydrophobic core centers, the evolutionary substitution matrix BLOSUM, and position-specific score matrices (PSSM), which use multiple sequence alignment information. The PSSM performed best, but on the reduced set with lower sequence similarity the threading matrix SORDIS performed equally well, and it was shown that a combined potential using SORDIS and PSSM can improve alignment quality for evolutionarily distantly related protein pairs.

6.
Summary: A new database of conserved amino acid residues is derived from the multiple sequence alignment of over 84 families of protein sequences that have been reported in the literature. This database contains sequences of conserved hydrophobic core patterns which are probably important for structure and function, since they are conserved for most sequences in that family. This database differs from other single-motif or signature databases reported previously, since it contains multiple patterns for each family. The new database is used to align a new sequence with the conserved regions of a family. This is analogous to reports in the literature where multiple sequence alignments are used to improve a sequence alignment. A program called Homology-Plot (suitable for IBM or compatible computers) uses this database to find homology of a new sequence to a family of protein sequences. There are several advantages to using multiple patterns. First, the program correctly identifies a new sequence as a member of a known family. Second, the search of the entire database is rapid and requires less than one minute. This is similar to performing a multiple sequence alignment of a new sequence to all of the known protein family sequences. Third, the alignment of a new sequence to family members is reliable and can reproduce the alignment of conserved regions already described in the literature. The speed and efficiency of this method are enhanced, since there is no need to score for insertions or deletions as is done in the more commonly used sequence alignment methods. In this method only the patterns are aligned. HomologyPlot also provides general information on each family, as well as a listing of patterns in a family.

7.
A new type of human calicivirus (HuCV) showing the classic cup-shaped surface morphology was identified in a stool sample from a child with symptoms of acute gastroenteritis in Seoul, Korea (SK virus). Genomic RNA was extracted directly from the stool sample, and the nucleotide sequence of 3.2 kb of the 3' end of SK virus was determined from cDNA. This region spanned sequences from the RNA-dependent RNA polymerase (RDRP) region in open reading frame 1 (ORF1) to the 3' poly A tail. The non-structural and capsid protein coding sequences were fused in a single ORF, as observed in the Manchester type (Genogroup III). However, ORF2 of Manchester virus was missing in SK virus. In the RDRP region, SK virus showed amino acid and nucleotide identities of 74-75% and 68-69%, respectively, with those of Manchester virus, while it showed 34-46% and 55-60% identities, respectively, with those of other human caliciviruses. However, the capsid protein of SK virus showed only partial (29-46%) amino acid identity with those of other caliciviruses, including the Manchester type. The closest resemblance in amino acid (97-99%) and nucleotide sequence (85-86%) identities was found in the RDRP region with the Vanderbijlpark and Pretoria isolates recently found in South Africa. These results suggest that SK virus, together with the Vanderbijlpark and Pretoria isolates, belongs to a new type different from Manchester virus.

8.
9.
1. Crystalline ribonuclease samples obtained from different commercial sources, in addition to one prepared in the laboratory, were resolved into their components, RNases I, II, III and IV, by a new two-dimensional electrophoretic technique. 2. RNase I and RNase II liberated more uridylic acid and cytidylic acid from yeast ribonucleic acid, and demonstrated a greater enzymic activity on uridine-2',3'-phosphate and cytidine-2',3'-phosphate, than either RNase III or RNase IV. RNase III and RNase IV liberated more adenylic acid and guanylic acid from yeast ribonucleic acid, and showed a greater enzymic activity on adenosine-2',3'-phosphate and guanosine-2',3'-phosphate, than either RNase I or RNase II. 3. The degree of heterogeneity of the RNase samples studied revealed the age of the preparation. 4. It is thus demonstrated that certain of the activities of "crystalline ribonuclease" reside in four different protein entities, and that some activity toward purine nucleotide esters existed in two of the four protein entities.

10.
We have created an analysis pipeline called Sprockets, which can be used to classify proteins into various hierarchical “families”, and build searchable models of these families. The construction of these families is based on data from Expressed Sequence Tags (ESTs) and Coding DNA Sequences (CDSs), making Sprockets clusters especially suitable for studying gene families in organisms for which a completely sequenced genome does not (yet) exist. The pipeline consists of two main parts: pair-wise analysis and grouping of sequences with Z-score statistics, followed by hierarchical splitting of clusters into alignable protein families. Various computational and statistical techniques applied in Sprockets allow it to act like a massive and selective multiple sequence alignment engine for combining individual sequence collections and related public sequences. The end result is a database of gene Hidden Markov Models, each related to the other by three levels of similarity: secondary structure, function and evolutionary origin. For a sample 20,000 EST set from Lactuca spp., Sprockets provided a 9% improvement in mapping of function to unknown sequences over traditional pair-wise search methods and InterPro mapping.

11.
The evolutionary relationships of organisms are traditionally delineated by alignment-based methods using DNA or protein sequences. In the post-genome era, the phylogenetics of life can be inferred from many sources, such as genomic features, and not just from the comparison of one or several genes. To investigate the possibility that the physicochemical properties of protein sequences might reflect their phylogenetic properties, an alignment-free method using a support vector machine (SVM) classifier is implemented to establish the phylogenetic relationships between protein sequences. Two types of datasets, “Enzymatic” (assigned by an EC accession) and “Proteins”, are used to train the SVM classifiers. By computing the F-score for feature selection, we find that the classification accuracies of the trained SVM classifiers can be significantly enhanced, to 84% and 80% for the “Enzymatic” and “Proteins” datasets, respectively, when the protein sequences are represented by the top 255 selected features. This shows that the selected physicochemical features of amino acid sequences are sufficient for inferring the phylogenetic properties of the protein sequences. Moreover, we find that the selected physicochemical features appear to correlate with the physiological characteristics of the taxonomic classes classified. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2010
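
The abstract does not define the F-score it uses. A common choice for SVM feature selection is the ratio of between-class to within-class spread for each feature, and the sketch below assumes that definition. The function names `f_score` and `rank_features` are hypothetical, and each class needs at least two samples.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in the two classes.

    Assumed definition: squared deviations of the class means from the
    overall mean, divided by the sum of the two sample variances.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (
        sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
        + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    )
    return between / within if within else float("inf")


def rank_features(rows_pos, rows_neg):
    """Return feature indices sorted from most to least discriminative."""
    n_features = len(rows_pos[0])
    scores = [
        f_score([r[j] for r in rows_pos], [r[j] for r in rows_neg])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

Keeping only the first 255 indices returned by `rank_features` would correspond to the "top 255 selected features" mentioned above, under the assumed definition.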

12.
RNase (ribonuclease) mapping by nucleobase-specific endonucleases combined with mass spectrometry (MS) is a powerful analytical method for characterizing ribonucleic acids such as transfer RNAs. Typical free solution enzymatic digestion of RNA samples results in a significant amount of RNase being present in the sample solution analyzed by MS. In some cases, the RNase can lead to contamination of the high performance liquid chromatography and MS instrumentation. Here we investigate and compare several different approaches for reducing or eliminating contaminating RNase from the digested RNA sample before LC-MS analysis. Approaches using immobilized RNases were found to be most effective, with no enzyme carryover into the digested sample detected. Among the various options for immobilized RNases, we show that carbodiimide-based reactions can be used to couple RNases to carboxylic acid-terminated magnetic beads. The immobilized enzymes retain biological activity, are re-usable, and do not interfere with subsequent LC-MS analysis of the expected RNase digestion products. The use of immobilized RNases provides a simple approach for eliminating enzyme contamination in mass spectrometry-based RNase mapping experiments.

13.
We demonstrate that an oligopeptide containing the C-terminal sequence of RNase A binds to RNase A in a stoichiometric and site-specific manner. Our observations are consistent with the interaction found in the major domain-swapped RNase A dimer, so that the peptide binding may be promoted through the swapping with the C-terminal beta-sheet of RNase A. Because the design of a protein-binding peptide is much simpler than other methods such as the combinatorial method, we propose that investigation using an oligopeptide may be of general application to domain swapping in proteins as well as for the development of an oligopeptide tool that specifically binds to a target protein.

14.
A statistical analytical approach has been used to analyze the secondary structure (SS) of amino acids as a function of the sequence of amino acid residues. We have used 306 non-homologous best-resolved protein structures from the Protein Data Bank for the analysis. A sequence region of 32 amino acids on either side of the residue is considered in order to calculate single amino acid propensities, di-amino acid potentials and tri-amino acid potentials. A weighted sum of predictions obtained using these properties is used to suggest a final prediction method. Our method is as good as the best-known SS prediction methods, is the simplest of all the methods, and uses no homologous sequence/family alignment data, yet gives 72% SS prediction accuracy. Since the method did not use many other factors that may increase the prediction accuracy, there is scope to achieve greater accuracy using this approach. Received: 4 May 1998 / Accepted: 17 September 1998 / Published online: 10 December 1998

15.
We describe a very efficient search for nucleotide alignments, which is analogous to the novel very efficient search for protein alignments. Protein alignment is based on 20 × 20 adjacency matrices for amino acids, obtained by superposing the labeled amino acid adjacency matrices of the proteins considered. In the same way, one can construct labeled matrices of size 4 × 4 that list the adjacencies of nucleotides in a DNA sequence. The matrix elements correspond to the 16 pairs of adjacent nucleotides. To obtain DNA alignments, one combines information in the corresponding matrices for a pair of DNA nucleotides. Matrices are obtained by inserting the sequential labels for pairs of nucleotides in the corresponding cells of the 4 × 4 tables. When two such matrices are superimposed, one can identify all segments in two DNA sequences that are shifted relative to one another by the same amount in either direction, without using trial-and-error displacements of the two sequences relative to each other to find local nucleotide alignments. © 2012 Wiley Periodicals, Inc.
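
One way to read the construction described above is sketched here. It is an interpretation of the abstract, not the authors' published procedure. Each of the 16 cells of the 4 × 4 table holds the sequential labels (1-based positions) at which that dinucleotide occurs. Superimposing two tables then means comparing labels cell by cell and counting how often each shift recurs. The names `adjacency_table` and `shift_counts` are hypothetical.

```python
from collections import Counter, defaultdict


def adjacency_table(seq):
    """Map each adjacent nucleotide pair to the list of its 1-based labels.

    The dictionary plays the role of the 4 x 4 labeled matrix: the key
    "AC" is the cell in row A, column C.
    """
    table = defaultdict(list)
    for i in range(len(seq) - 1):
        table[seq[i:i + 2]].append(i + 1)
    return table


def shift_counts(seq1, seq2):
    """Count, for every shift, how many adjacent pairs coincide under it.

    A shift d means a pair labeled p in seq1 occurs at label p + d in seq2.
    Shifts with high counts point to candidate aligned segments.
    """
    table1, table2 = adjacency_table(seq1), adjacency_table(seq2)
    counts = Counter()
    for pair, labels1 in table1.items():
        for p in labels1:
            for q in table2.get(pair, []):
                counts[q - p] += 1
    return counts
```

Calling `shift_counts("ACGTAC", "TTACGTAC").most_common(1)` picks out the dominant displacement between the two toy sequences without sliding one along the other.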

16.
A total of 49 protein sequences of alkaline proteases retrieved from GenBank, representing different species of Aspergillus, have been characterized for various physicochemical properties, homology search, multiple sequence alignment, motif and superfamily search, and phylogenetic tree construction. Sequence-level homology was obtained among different groups of alkaline protease enzymes, viz. alkaline serine protease, oryzin, calpain-like protease, serine protease and subtilisin-like alkaline proteases. Multiple sequence alignment of alkaline protease protein sequences of different Aspergillus species revealed stretches of conserved amino acid residues from 69 to 110 and from 130 to 204. The phylogenetic tree constructed indicated several Aspergillus species-specific clusters for alkaline proteases, namely Aspergillus fumigatus, Aspergillus niger, Aspergillus oryzae and Aspergillus clavatus. The distributions of ten commonly observed motifs were analyzed among these proteases. Motif 1, with a signature amino acid sequence of 50 amino acids, i.e., ASFSNYGKVVDIFAPGQDILSCWIGSTTATNTISGTSMATPHIVGLSCYL, was uniformly observed in the protease protein sequences, indicating its involvement with the structure and enzymatic function. Motif analysis of acidic proteases of Aspergillus and bacterial alkaline proteases has revealed different signature amino acid sequences. The superfamily search for these proteases revealed the presence of subtilases, serine-carboxyl proteinase, calpain large subunit and thermolysin-like superfamilies, with 45 representing the subtilases superfamily.

17.
All currently leading protein secondary structure prediction methods use a multiple protein sequence alignment to predict the secondary structure of the top sequence. In most of these methods, prior to prediction, alignment positions showing a gap in the top sequence are deleted, consequently leading to shrinking of the alignment and loss of position-specific information. In this paper we investigate the effect of this removal of information on secondary structure prediction accuracy. To this end, we have designed SymSSP, an algorithm that post-processes the predicted secondary structure of all sequences in a multiple sequence alignment by (i) making use of the alignment's evolutionary information and (ii) re-introducing most of the information that would otherwise be lost. The post-processed information is then given to a new dynamic programming routine that produces an optimally segmented consensus secondary structure for each of the multiple alignment sequences. We have tested our method on the state-of-the-art secondary structure prediction methods PHD, PROFsec, SSPro2 and JNET using the HOMSTRAD database of reference alignments. Our consensus-deriving dynamic programming strategy is consistently better at improving the segmentation quality of the predictions than the commonly used majority voting technique. In addition, we have applied several weighting schemes from the literature to our novel consensus-deriving dynamic programming routine. Finally, we have investigated the level of noise introduced by prediction errors into the consensus and show that predictions of the edges of helices and strands are wrong half the time for all four tested prediction methods.

18.
Heat shock proteins are an important class of molecular chaperones known to impart tolerance under high temperature stress. sHSP26, a member of the small heat shock protein subfamily, is specifically involved in protecting the plant's photosynthetic machinery. The present study aimed at identifying and characterizing sequence and structural variations in sHSP26 from genetically diverse progenitor and non-progenitor species of wheat. In silico analysis identified three paralogous copies of TaHSP26 on the short arm of chromosome 4A, while one homeologue each was localized on the long arms of chromosomes 4B and 4D of cultivated bread wheat. The wild DD-genome donor Aegilops tauschii carried an additional sHSP26 gene (AET4Gv20569400) that was absent from the DD genome of cultivated bread wheat. In vitro amplification of this novel gene in wild accessions of Ae. tauschii and in synthetic hexaploid wheat, but not in cultivated bread wheat, validated this finding. Further, significant length polymorphism was identified in exon 1 among the diverse sHSP26 sequences. Multiple sequence alignment of the procured sequences revealed numerous sSNPs and nsSNPs. D3A, P125L and Q242K were designated as homeolog-specific amino acid replacements, and A49G as a non-progenitor-specific one. A 9-bp indel in TmHSP26-1(GA) translated into a deletion of the SPM amino acid segment in the chloroplast-specific conserved consensus region III. The high degree of nucleotide sequence divergence between cultivated and wild species appeared in the form of higher ω values (Ka/Ks > 1), indicating positive selection during the course of evolution. Phylogenetic analysis elucidated ancestral relationships between wheat sHSP26 proteins and orthologous proteins across the plant kingdom. Overall, a data mining approach may be employed as an effective pre-breeding strategy to identify and mobilize novel stress-responsive genes and distinct allelic variants from wider germplasm collections of wheat to enhance the climate resilience of present-day elite wheat cultivars.

19.
We present a formulation of the Needleman-Wunsch type algorithm for sequence alignment in which the mutation matrix is allowed to vary under the control of a hidden Markov process. The fully trainable model is applied to two problems in bioinformatics: the recognition of related gene/protein names and the alignment and scoring of homologous proteins.
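
For orientation, here is a minimal sketch of the classic Needleman-Wunsch global alignment that this work generalizes. It uses a single fixed match/mismatch/gap scheme and does not implement the hidden-Markov-controlled mutation matrix or the training described in the abstract. The function name and default scores are hypothetical choices.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative defaults, not parameters from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

In the formulation described above, the fixed `match` and `mismatch` values would be replaced by a substitution matrix selected at each position by a hidden state. That extension is beyond this sketch.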

20.
Structural studies of the high molecular mass (HMM) glutenin subunits 1Bx7 (from cvs Hereward and Galatea) and 1Bx20 (from cv. Bidi17) of bread wheat were conducted using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOFMS) and reversed-phase high-performance liquid chromatography/electrospray ionization mass spectrometry (RP-HPLC/ESI-MS). For all three proteins, MALDI-TOFMS analysis showed that the isolated fractions contained a second component with a mass about 650 Da lower than the major component. The testing and correction of the gene-derived amino acid sequences of the three proteins were performed by direct MALDI-TOFMS analysis of their tryptic peptide mixture. Analysis of the digest was performed by recording several MALDI mass spectra of the mixture at low, medium and high mass ranges, optimizing the matrix and the acquisition parameters for each mass range. Complementary data were obtained by RP-HPLC/ESI-MS analysis of the tryptic digest. This resulted in coverage of about 98% of the sequences. In contrast to the gene-derived data, the results obtained demonstrate the insertion of the sequence QPGQGQ between Trp716 and Gln717 of subunit 1Bx7 (cv. Galatea) and a possible single amino acid substitution within the T20 peptide of subunit 1Bx20. Moreover, the mass spectrometric data demonstrated that the lower mass components present in all the fractions correspond to the major components but lack about six amino acid residues, which are probably lost from the protein C-terminus. Finally, the results obtained provide evidence for the lack of glycosylation or other post-translational modifications of these subunits.
