Similar literature
1.
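Fast nonlinear screening of high-dimensional features based on support vector machines, and peptide QSAR modeling   Cited by: 1 (self-citations: 0, citations by others: 1)
Peptide structures were characterized directly by 531 physicochemical property parameters of amino acids, and a new method for fast nonlinear screening of high-dimensional features was developed on the basis of support vector regression. The method was applied to quantitative sequence-activity relationship (QSAR) modeling of two peptide systems, bitter-tasting dipeptides and angiotensin-converting enzyme inhibitors; in each case the screening retained 10 descriptors with clear meaning. Support vector regression models built on the retained descriptors achieved substantially higher fitting accuracy, leave-one-out cross-validation accuracy and external prediction accuracy than results reported in the literature. Significance tests of the nonlinear regression, significance tests of the relative importance of single factors, and single-factor effect analysis were carried out on the models, which improved their interpretability. The new method has broad application prospects in high-dimensional regression prediction, such as QSAR modeling of peptides and proteins.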

2.
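High-dimensional feature selection based on ridge regression and SVM, and peptide QSAR modeling   Cited by: 1 (self-citations: 0, citations by others: 1)
The absolute values of the weights estimated by ridge regression reflect, to some extent, the size of the effect of the corresponding features. On this basis, a high-dimensional feature selection algorithm based on ridge regression (RR) and support vector machines (SVM) was developed. For two peptide systems, bitter-tasting dipeptides (BTT) and 9-residue cytotoxic T lymphocyte (CTL) epitope peptides, peptide structures were characterized directly by 531 physicochemical property parameters of amino acids, giving 1062 and 4779 initial features, respectively. On the training set, the initial features were ranked by ridge regression and introduced sequentially; introduction stopped when the mean squared error (MSE) of SVM leave-one-out cross-validation (LOOCV) rose significantly. Several rounds of last-place elimination then refined the selection, leaving 7 and 18 retained features with clear physicochemical meaning, respectively. Quantitative structure-activity relationship (QSAR) models were built on the training set from the retained features with support vector regression (SVR) and used to predict an independent test set; the fitting accuracy, leave-one-out cross-validation accuracy and independent prediction accuracy were all better than results reported in the literature. The new method runs fast, selects features with clear physicochemical meaning, is highly interpretable, and has fairly broad application prospects in high-dimensional regression prediction, such as QSAR modeling of peptides and proteins.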
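
The feature counts quoted above follow from concatenating one property vector per residue position (531 x 2 = 1062 for a dipeptide, 531 x 9 = 4779 for a 9-mer). A minimal sketch of that encoding follows. The table `TOY_PROPERTIES`, its three placeholder values per residue, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
# Placeholder table: 3 illustrative properties per residue (the study uses 531).
TOY_PROPERTIES = {
    "A": [1.8, 89.1, 0.0],
    "G": [-0.4, 75.1, 0.0],
    "L": [3.8, 131.2, 0.0],
    "K": [-3.9, 146.2, 1.0],
}


def encode_peptide(sequence, table):
    """Concatenate the property vector of each residue, position by position."""
    features = []
    for residue in sequence:
        features.extend(table[residue])
    return features


def feature_count(n_properties, peptide_length):
    """Number of initial features for a peptide of the given length."""
    return n_properties * peptide_length


if __name__ == "__main__":
    print(len(encode_peptide("AK", TOY_PROPERTIES)))
    print(feature_count(531, 2), feature_count(531, 9))
```

Because every position gets its own copy of each property, a retained feature can be read as "property p at position i", which is what makes the selected descriptors interpretable.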

3.
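Application of combined prediction based on SVR and k-nearest-neighbor groups in QSAR   Cited by: 1 (self-citations: 0, citations by others: 1)
To improve prediction accuracy in quantitative structure-activity relationship (QSAR) studies, a new method was developed that screens molecular structure descriptors nonlinearly with support vector regression (SVR) and then performs nonlinear combined prediction based on k-nearest-neighbor groups. First, taking minimum mean squared error (MSE) as the criterion, descriptors are screened by nonlinear SVR through several rounds of leave-one-out last-place elimination, which yields the optimal kernel function and the corresponding retained descriptors. Second, based on the Euclidean distance between the retained-descriptor vectors of the sample to be predicted and the training samples, the double leave-one-out predictions of sub-models built on different k-nearest-neighbor groups are used to reflect the heterogeneity of the sample set. Third, again with minimum MSE as the criterion, the neighbor-group sub-models are screened by nonlinear SVR through several rounds of leave-one-out last-place elimination, which yields the optimal kernel function and the corresponding retained sub-models. Finally, combined prediction is carried out on the retained sub-models by double leave-one-out. Validation on a QSAR example, the toxicity of substituted anilines and phenols to Daphnia magna, showed that the new method had the highest prediction accuracy among all reference models and captured the nonlinear relationship between descriptors and compound toxicity in finer detail. Its advantages include structural risk minimization, nonlinearity, suitability for small samples, effective avoidance of overfitting, the curse of dimensionality and local minima, nonlinear screening of descriptors and sub-models, nonlinear combined prediction, automatic selection of the optimal kernel function and its parameters, excellent generalization and high prediction accuracy. The method has broad application prospects in QSAR research.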
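
The descriptor-screening step (leave-one-out error as the criterion, one descriptor dropped per round) can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: a k-nearest-neighbor mean predictor stands in for SVR so that the sketch needs only the standard library, "last-place elimination" is read as dropping the descriptor whose removal lowers the leave-one-out MSE the most, and all function names are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Mean response of the k training samples closest to `query` (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def loo_mse(x, y, columns, k=1):
    """Leave-one-out mean squared error using only the given descriptor columns."""
    total = 0.0
    for i in range(len(x)):
        train_x = [[row[c] for c in columns] for j, row in enumerate(x) if j != i]
        train_y = [v for j, v in enumerate(y) if j != i]
        query = [x[i][c] for c in columns]
        total += (knn_predict(train_x, train_y, query, k) - y[i]) ** 2
    return total / len(x)


def backward_eliminate(x, y, k=1):
    """Each round, drop the descriptor whose removal lowers LOO MSE the most."""
    kept = list(range(len(x[0])))
    best = loo_mse(x, y, kept, k)
    while len(kept) > 1:
        trials = [(loo_mse(x, y, [c for c in kept if c != d], k), d) for d in kept]
        trial_mse, drop = min(trials)
        if trial_mse >= best:  # no removal improves the error: stop
            break
        kept.remove(drop)
        best = trial_mse
    return kept, best


if __name__ == "__main__":
    # Column 0 carries the signal, column 1 is noise.
    x = [[0, 5], [1, 0], [2, 9], [3, 2], [4, 7], [5, 1]]
    y = [0, 1, 2, 3, 4, 5]
    print(backward_eliminate(x, y))
```

Swapping `knn_predict` for an SVR fit and repeating the loop over candidate kernels would bring the sketch closer to the procedure described, at the cost of an external dependency.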

4.
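Application of Multi-KNN-SVR combined prediction in QSAR studies of fluorine-containing compounds   Cited by: 1 (self-citations: 0, citations by others: 1)
To gain deeper insight into the relationship between the biological activity and the structure of fluorine-containing pesticides, and to build a sound QSAR model, seven molecular structure descriptors, including the oil-water partition coefficient, were taken as the starting point. Based on support vector regression (SVR) and the minimum-MSE criterion, the optimal kernel function was found automatically, descriptors were screened nonlinearly, and multiple K-nearest-neighbor (KNN) prediction sub-models were built. A further nonlinear screening step selected the retained sub-models, which were then used for combined prediction (Multi-KNN-SVR). Leave-one-out combined predictions of the biological activity of 33 fluorine-containing compounds against 5 different plant diseases showed that nonlinear screening of descriptors and KNN sub-models effectively improves prediction accuracy, and that a nonlinear combination of multiple KNN sub-models improves prediction performance further. Multi-KNN-SVR combined prediction has broad application prospects in QSAR and other related prediction studies.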

5.
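Identifying highly active cytotoxic T lymphocyte (CTL) epitopes is a key step in designing tumor vaccines. HLA-A*0201-restricted 9-residue epitope peptides were characterized by 531 physicochemical property parameters of the natural amino acids. Starting from 531×9 initial descriptors, coarse screening with a binary matrix shuffling filter followed by fine screening through several rounds of last-place elimination gave 18 retained descriptors with clear physicochemical meaning. These 18 descriptors mainly involve the hydrophobicity and spatial structure of the residues at every position except positions 1 and 5. The hydrophobicity of the residue at position 3 has the largest effect on activity, and the residues at positions 2, 4 and 9 together account for 10 of the retained descriptors, which supports the current understanding that positions 2 and 9 are anchor residues, position 3 is a key site, and the residue at position 4 is a marker chain. A quantitative sequence-activity model was built on the 18 retained descriptors with support vector regression. Its coefficients of determination for fitting and for leave-one-out cross-validation, R² and Q²cv, were 0.957 and 0.708, respectively; the coefficient of determination and root mean square error for independent prediction, Q²ext and RMSEext, were 0.818 and 0.366, respectively, clearly better than values reported in the literature. Prediction over the full combinatorial set of virtual 9-mers gave several peptides whose predicted activity exceeds that of known epitope peptides and which are available for experimental validation. The work gives a fairly complete account of how residues at specific positions affect peptide affinity and provides practical guidance for the molecular design of highly active peptide vaccines.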
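
The reported statistics (R², Q²cv, Q²ext, RMSEext) are all computed from pairs of observed and predicted activities; what changes is where the predictions come from (fitted values, leave-one-out predictions, or predictions on an external set). A minimal sketch follows, assuming the common definition 1 - SS_res / SS_tot. The abstract does not state which variant of Q²ext was used, so the choice of reference mean here is an assumption, and the function names are hypothetical.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [2.0, 2.0, 3.0, 3.0]
    print(r_squared(observed, predicted), rmse(observed, predicted))
```

Passing leave-one-out predictions to `r_squared` gives a Q²cv-style value; passing external-set predictions gives a Q²ext-style value under the stated assumption.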

6.
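The refolding of hen egg white lysozyme denatured by guanidine hydrochloride, and the distribution of and transitions between the stable conformational states of the lysozyme molecule during refolding, were studied by denaturing and non-denaturing electrophoresis, high-performance size-exclusion chromatography, intrinsic fluorescence emission spectroscopy, fluorescence phase diagrams and bioactivity assays. The results show that one stable folding intermediate exists at each of two guanidine hydrochloride concentrations in the refolding buffer, about 5.0 and 2.4 mol/L, so that refolding follows a "four-state model". On the basis of this four-state refolding process, combined with the association-dissociation equilibrium between guanidine hydrochloride and lysozyme, an equation is given that describes quantitatively how the refolding yield of a protein changes with denaturant concentration during denaturant-induced refolding. The equation contains two characteristic folding parameters: the thermodynamic equilibrium constant k for the transition of the protein from one stable conformational state to another, and the average number m of denaturant molecules bound per protein molecule during that transition. These two parameters describe quantitatively how the fully unfolded state, the folding intermediates and the native state of guanidine hydrochloride-denatured lysozyme are distributed, and how they interconvert, as the guanidine hydrochloride concentration in the refolding buffer changes.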

7.
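Zhang Zhuqing* (张竹青), Acta Physico-Chimica Sinica (《物理化学学报》), 2012, 28(10): 2381-2389
De novo protein design and protein folding studies approach an important problem in structural biology, the sequence-structure-function relationship of proteins, from two different directions. Successful examples of de novo design have, to some extent, tested how accurately protein structure and interactions are understood. Most designed proteins, however, show folding kinetics that differ from those of natural proteins, which indicates that considerable challenges remain before the ultimate goal of functional design is reached. This paper reviews the development and current state of de novo protein design, progress in experimental, theoretical and simulation studies of protein folding, and the current state of research on the folding mechanisms of de novo designed proteins. It argues that a deeper understanding of how the folding mechanisms of designed and natural proteins differ can provide a useful reference for more effective rational protein design.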

8.
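Protein folding is one of the central problems in structural biology. Understanding the folding mechanism of protein structure and its relationship to biological function has long been an important topic for life scientists, and it is drawing growing attention from researchers in many other disciplines. Most proteins fold within tens of milliseconds, microseconds or several seconds, yet the fine structural transitions that occur during folding are completed on nanosecond or even shorter time scales. Because of this limit on time resolution, studying folding remains difficult both in conventional experiments and in theoretical calculations. This paper first outlines some problems in experimental and theoretical simulation studies of protein folding. It then takes Trp-cage, an artificially designed peptide with a typical structure that folds rapidly, as an example and discusses in detail three aspects: its folding transition temperature, its folding model, and the roles of key amino acid residues on the peptide chain during folding. Progress in experimental and theoretical simulation studies of the folding kinetics of the model peptide Trp-cage is reviewed. Finally, some new suggestions are made on how to untangle the network of interactions between protein residues effectively and so reduce the complexity of the folding mechanism. These suggestions help clarify the driving forces behind the rapid folding and stable formation of the mini-protein Trp-cage and can also serve as a useful reference for studies of protein folding mechanisms and for peptide design.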

9.
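Using complete enumeration, the conformational properties of two-dimensional compact protein chains with different HP sequences were studied, in particular the properties of folding sequences, that is, sequences with a unique ground-state energy. For compact protein chains of N monomers, a certain fraction of sequences were found to be folding sequences. In these folding sequences the number of hydrophobic groups (H) is 20% larger than the number of hydrophilic groups (P); this result was compared with the hydrophobic and hydrophilic group counts of 200 real protein molecules. The folding sequences were grouped into classes by the number of hydrophobic groups they contain, and the relative abundance of each class among all sequences was found to follow a Gaussian distribution. The sizes of H (or P) clusters in the sequences and their distribution were also studied, and folding sequences were found to differ from random sequences. Finally, the specific heat of different folding sequences at different chain lengths was studied; the transition temperature TC depends mainly on chain length and not on the folding sequence.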
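
Complete enumeration of this kind can be illustrated on the smallest non-trivial case, a 3×3 square (N = 9). The sketch below lists every compact conformation, scores every HP sequence on each of them, and calls a sequence "folding" when exactly one conformation reaches the lowest energy. Several choices are assumptions, since the abstract does not give them: the energy is taken as minus the number of H-H contacts between monomers that are lattice neighbors but not chain neighbors, conformations related by rotation or reflection are counted once, and the lattice size and function names are illustrative only.

```python
from itertools import product

SIZE = 3  # 3x3 lattice, chains of N = 9 monomers


def hamiltonian_paths(size):
    """All directed self-avoiding walks that visit every site of the square."""
    paths = []

    def extend(path, visited):
        if len(path) == size * size:
            paths.append(tuple(path))
            return
        r, c = path[-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < size and 0 <= nc < size and (nr, nc) not in visited:
                visited.add((nr, nc))
                path.append((nr, nc))
                extend(path, visited)
                path.pop()
                visited.remove((nr, nc))

    for start in product(range(size), repeat=2):
        extend([start], {start})
    return paths


def canonical(path, size):
    """Smallest image of a path under the 8 rotations/reflections of the square."""
    m = size - 1
    images = []
    for flip in (False, True):
        pts = [(c, r) if flip else (r, c) for r, c in path]
        for _ in range(4):
            pts = [(c, m - r) for r, c in pts]  # rotate by 90 degrees
            images.append(tuple(pts))
    return min(images)


def energy(path, sequence):
    """Minus the number of H-H lattice contacts between non-bonded monomers."""
    index = {site: i for i, site in enumerate(path)}
    contacts = 0
    for (r, c), i in index.items():
        for neighbor in ((r + 1, c), (r, c + 1)):  # count each lattice edge once
            j = index.get(neighbor)
            if j is not None and abs(i - j) > 1 and sequence[i] == "H" == sequence[j]:
                contacts += 1
    return -contacts


CONFORMATIONS = sorted({canonical(p, SIZE) for p in hamiltonian_paths(SIZE)})


def ground_state(sequence):
    """Lowest energy and the number of conformations that reach it."""
    energies = [energy(p, sequence) for p in CONFORMATIONS]
    lowest = min(energies)
    return lowest, energies.count(lowest)


def folding_fraction():
    """Fraction of all HP sequences that have a unique ground state."""
    sequences = ["".join(s) for s in product("HP", repeat=SIZE * SIZE)]
    folding = [s for s in sequences if ground_state(s)[1] == 1]
    return len(folding) / len(sequences)


if __name__ == "__main__":
    print(len(CONFORMATIONS), folding_fraction())
```

The same loop scales to larger squares only for small sizes, because the number of compact walks and of sequences (2^N) both grow very quickly.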

10.
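Classification modeling and recognition of protein fold types   Cited by: 2 (self-citations: 0, citations by others: 2)
Liu Yue (刘岳), Li Xiaoqin (李晓琴), Xu Haisong (徐海松), Qiao Hui (乔辉), Acta Physico-Chimica Sinica (《物理化学学报》), 2009, 25(12): 2558-2564
How the amino acid sequence of a protein determines its spatial structure is one of the central questions in the life sciences today. The fold type reflects the topological pattern of a protein's core structure, and fold recognition is an important part of protein sequence-structure research. We studied 36 fold types that could not each be modeled on their own and that together account for 41.8% of the proteins in the α, β and α/β classes of the Astral 1.65 sequence database. Samples with sequence identity below 25% were taken as the training set and clustered hierarchically for each fold, using root mean square deviation (RMSD) as the measure, to produce a number of fold subclasses. For each subclass, a profile hidden Markov model (profile-HMM) was built from a structural alignment produced by the multiple structural alignment algorithm MUSTANG. With the 9505 samples in Astral 1.65 whose sequence identity is below 95% as the test set, the 36 fold types had an average recognition sensitivity of 90%, a specificity of 99% and a Matthews correlation coefficient (MCC) of 0.95. The results show that for fold types with many members, for which no unified model can be built, RMSD-based systematic classification modeling achieves recognition with high accuracy. This opens up a new method and line of thinking for protein fold recognition and lays a foundation for further research.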
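
Sensitivity, specificity and the Matthews correlation coefficient are all derived from the four cells of a confusion matrix. A minimal sketch follows; the counts in the usage example are hypothetical and are not the study's data.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


if __name__ == "__main__":
    # Hypothetical counts for one fold type against all others.
    print(classification_metrics(tp=90, fp=10, tn=990, fn=10))
```

MCC is useful here because each fold type is a small positive class against a large negative class, a setting in which specificity alone tends to look high.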

11.
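QSAR modeling based on geostatistics and support vector regression   Cited by: 4 (self-citations: 0, citations by others: 4)
A new individualized prediction method for quantitative structure-activity relationships (QSAR), Weight-PCA-GS-SVR, is proposed on the basis of principal component analysis (PCA), geostatistics (GS) and support vector regression (SVR). The basic idea is as follows. PCA first reduces the dimensionality and removes redundant information among the independent variables. SVR-based nonlinear screening then removes the principal components that are unrelated to the dependent variable. The retained principal components are used to compute weighted distances between samples, and high-dimensional GS determines a common range. Each sample to be predicted, taking itself as the center, finds its own private set of k neighbors in the training set whose weighted distance is smaller than the common range, and an SVR model trained on them completes the individualized prediction. Weight-PCA-GS-SVR optimizes the model along both rows and columns, offers a new way of weighting the independent variables, suggests a new approach to the difficult problem of choosing the optimal k nearest neighbors, and keeps the original advantages of SVR. Validation on three compound activity data sets showed that the new method had the highest prediction accuracy among all reference models and was clearly better than results reported in the literature. Weight-PCA-GS-SVR has fairly broad application prospects in regression prediction fields such as QSAR.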
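
The neighbor-selection step, in which each sample to be predicted keeps only the training samples whose weighted distance falls below the common range, can be sketched as follows. How the weights and the common range are obtained (SVR screening and a variogram fit, respectively) is not reproduced; the weights, the range, the weighted Euclidean form of the distance and the function names are all assumptions for illustration.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two principal-component vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def private_neighbors(query, training, weights, common_range):
    """Indices of training samples closer to `query` than the common range."""
    return [
        i
        for i, row in enumerate(training)
        if weighted_distance(query, row, weights) < common_range
    ]


if __name__ == "__main__":
    training = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
    weights = [1.0, 0.25]  # hypothetical per-component weights
    print(private_neighbors([0.0, 0.0], training, weights, common_range=2.0))
```

Because the threshold is a distance and not a fixed count, the number of neighbors k differs from one query sample to the next, which is the sense in which the prediction is individualized.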

12.
The ability to predict protein folding rates constitutes an important step in understanding the overall folding mechanisms. Although many of the prediction methods are structure based, successful predictions can also be obtained from the sequence. We developed a novel method called prediction of protein folding rates (PPFR), for the prediction of protein folding rates from protein sequences. PPFR implements a linear regression model for each of the mainstream folding dynamics including two-, multi-, and mixed-state proteins. The proposed method provides predictions characterized by strong correlations with the experimental folding rates, which equal 0.87 for the two- and multistate proteins and 0.82 for the mixed-state proteins, when evaluated with an out-of-sample jackknife test. Based on in-sample and out-of-sample tests, PPFR's predictions are shown to be better than those of most other sequence-only and structure-based predictors and complementary to the predictions of the most recent sequence-based QRSM method. We show that simultaneous incorporation of several characteristics, including the sequence, physicochemical properties of residues, and predicted secondary structure, provides improved quality. This hybridized prediction model was analyzed to reveal the complementary factors that can be used in tandem to predict folding rates. We show that bigger proteins require more time for folding, that higher helical and coil content and the presence of Phe, Asn, and Gln may accelerate the folding process, that the inclusion of Ile, Val, Thr, and Ser may slow down the folding process, and that for the two-state proteins increased beta-strand content may decelerate the folding process. Finally, PPFR provides strong correlation when predicting sequences with low similarity.
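
An out-of-sample jackknife test refits the regression once per protein with that protein left out, predicts the held-out protein, and then correlates the held-out predictions with the experimental values. The sketch below shows this for a single predictor; PPFR itself combines several features, so the one-variable model, the made-up numbers in the usage example and the function names are simplifying assumptions.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def jackknife_predictions(x, y):
    """Predict each sample from a model fitted on all the other samples."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(slope * x[i] + intercept)
    return predictions


if __name__ == "__main__":
    # Made-up feature values and log folding rates, for illustration only.
    feature = [1.0, 2.0, 3.0, 4.0, 5.0]
    log_rate = [2.1, 3.9, 6.2, 7.8, 10.1]
    held_out = jackknife_predictions(feature, log_rate)
    print(pearson(held_out, log_rate))
```

The in-sample correlation (fit and evaluate on the same proteins) is always at least as flattering as the jackknife value, which is why the abstract reports both kinds of test.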

13.
Understanding the relationship between amino acid sequences and the folding rates of proteins is a challenging task similar to the protein folding problem. In this work, we have analyzed the relative importance of protein sequence and structure for predicting protein folding rates in terms of amino acid properties and contact distances, respectively. We found that the parameters derived from protein sequence (physical-chemical, energetic, and conformational properties of amino acid residues) show very weak correlation (|r| < 0.39) with the folding rates of 28 two-state proteins, indicating that sequence information alone is not sufficient to understand the folding rates of two-state proteins. However, the maximum positive correlation obtained for the properties, number of medium-range contacts, and alpha-helical tendency reveals the importance of local interactions in initiating protein folding. On the other hand, a remarkable correlation (r varies from -0.74 to -0.88) has been obtained between structural parameters (contact order, long-range order, and total contact distance) and protein folding rates. Further, we found that secondary structure content and solvent accessibility play a marginal role in determining the folding rates of two-state proteins. Multiple regression analysis carried out with the combination of three properties, beta-strand tendency, enthalpy change, and total contact distance, improved the correlation with protein folding rates to 0.92. The relative importance of existing methods along with the multiple-regression model proposed in this work will be discussed. Our results demonstrate that the native-state topology is the major determinant of the folding rates of two-state proteins.
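
Of the structural parameters named above, contact order is the simplest to compute: it is the average sequence separation of residue pairs that are in spatial contact, divided by chain length. The sketch below follows the commonly cited definition of relative contact order. The distance cutoff, the minimum sequence separation, the use of one coordinate per residue and the function names are assumptions; the abstract does not state which settings were used.

```python
import math


def contacts(coords, cutoff, min_separation):
    """Residue index pairs (i, j), j - i >= min_separation, within the cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def relative_contact_order(coords, cutoff=6.0, min_separation=2):
    """Mean sequence separation of contacting pairs divided by chain length."""
    pairs = contacts(coords, cutoff, min_separation)
    if not pairs:
        return 0.0
    total_separation = sum(j - i for i, j in pairs)
    return total_separation / (len(coords) * len(pairs))


if __name__ == "__main__":
    # Four residues on the corners of a unit square (toy coordinates).
    square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
    print(relative_contact_order(square, cutoff=1.0))
```

A high value means that contacts are mostly between residues far apart in sequence, which is consistent with the negative correlations with folding rate quoted in the abstract.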

14.
Machine learning algorithms have a wide range of applications in bioinformatics and computational biology, such as prediction of protein secondary structures, solvent accessibility, binding site residues in protein complexes, protein folding rates, stability of mutant proteins, and discrimination of proteins based on their structure and function. In this work, we focus on two aspects of prediction: (i) protein folding rates and (ii) stability of proteins upon mutation. We briefly introduce the concepts of protein folding rates and stability along with available databases, features for prediction methods and measures of prediction performance. Subsequently, the development of structure-based parameters and their relationship with protein folding rates will be outlined. The structure-based parameters are helpful for understanding the physical basis of protein folding and stability. Further, basic principles of major machine learning techniques will be mentioned and their applications for predicting protein folding rates and stability of mutant proteins will be illustrated. The machine learning techniques could achieve the highest accuracy in predicting protein folding rates and stability. In essence, statistical methods and machine learning algorithms complement each other in understanding and predicting protein folding rates and the stability of protein mutants. The available online resources on protein folding rates and stability will be listed.

15.
Prediction of protein folding rates from amino acid sequences is one of the most important challenges in molecular biology. In this work, I have related protein folding rates to physical-chemical, energetic and conformational properties of amino acid residues. I found that classifying proteins into different structural classes gives an excellent correlation between amino acid properties and folding rates of two- and three-state proteins, indicating the importance of native-state topology in determining protein folding rates. I have formulated a simple linear regression model for predicting protein folding rates from amino acid sequences along with structural class information and obtained an excellent agreement between predicted and experimentally observed folding rates of proteins; the correlation coefficients are 0.99, 0.96 and 0.95, respectively, for all-alpha, all-beta and mixed class proteins. This is the first available method capable of predicting protein folding rates from the amino acid sequence alone, with the aid of generic amino acid properties and structural class information.

16.
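Principal component analysis-support vector regression modeling method and its application   Cited by: 14 (self-citations: 5, citations by others: 14)
Principal component analysis (PCA) was used to extract features from near-infrared spectra and was combined with support vector regression (SVR) to give a PCA-SVR modeling method for quantitative near-infrared spectral analysis. Compared with SVR alone, the method is faster and gives models with higher prediction accuracy. PCA-SVR was applied to the determination of total sugar and total volatile base content in tobacco samples; the root mean square errors of prediction were 1.323 and 0.0477, and the recoveries were 91.8%-112.6% and 88.9%-120.2%, respectively.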

17.
Prediction of protein folding rate change upon amino acid substitution is an important and challenging problem in protein folding kinetics and design. In this work, we have analyzed the relationship between amino acid properties and folding rate change upon mutation. Our analysis showed that the correlation is not significant with any of the studied properties in a dataset of 476 mutants. Further, we have classified the mutants based on their locations in different secondary structures and their solvent accessibility. For each category, we have selected a specific combination of amino acid properties using a genetic algorithm and developed a prediction scheme based on quadratic regression models for predicting the folding rate change upon mutation. Our results showed a 10-fold cross-validation correlation of 0.72 between experimental and predicted change in protein folding rates. The correlation is 0.73, 0.65 and 0.79, respectively, in strand, helix and coil segments. The method has been further tested with an extended dataset of 621 mutants and a blind dataset of 62 mutants, and we observed a good agreement with experiments. We have developed a web server for predicting the folding rate change upon mutation and it is available at .

18.
One of the most important challenges in computational and molecular biology is to understand the relationship between amino acid sequences and the folding rates of proteins. Recent works suggest that topological parameters, amino acid properties, chain length and the composition index relate well with protein folding rates; however, sequence order information has seldom been considered as a property for predicting protein folding rates. In this study, amino acid sequence order was used to derive an effective method, based on an extended version of the pseudo-amino acid composition, for predicting protein folding rates without any explicit structural information. Using the jackknife cross-validation test, the method was demonstrated on the largest dataset (99 proteins) reported. The method was found to provide a good correlation between the predicted and experimental folding rates. The correlation coefficient is 0.81 (highly significant) and the standard error is 2.46. The reported algorithm was found to perform better than several representative sequence-based approaches using the same dataset. The results indicate that sequence order information is an important determinant of protein folding rates.

19.
It has been predicted that the fold space available to protein sequences is restricted and that a maximum of about 1000 protein folds can be expected. Although about 648 folds have been identified, the general functional features of individual folds have not been thoroughly studied. We selected the OB-fold, which is thought to be an oligonucleotide- and oligosaccharide-binding fold, to study its general functional features. The OB-fold is a small beta-barrel formed from 5 strands connected by modulating loops. We consistently observed 2 or 3 loops on the same face of the barrel acting as clamps that bind the ligand. Depending on the ligand, which can be single- or double-stranded DNA/RNA or an oligosaccharide, and on its conformational properties, the loops change in length and sequence to accommodate it. Different classes of OB-fold proteins were analyzed, and the functional features were found to be retained despite negligible sequence homology among the proteins studied.
