Similar literature
Found 20 similar articles (search time: 18 ms)
1.
In mass spectrometry-based shotgun proteomics, protein quantification and protein identification are two major computational problems. To quantify protein abundance, a list of proteins must first be inferred from the raw data. The relative or absolute protein abundance is then estimated with quantification methods such as spectral counting. Until now, most researchers have dealt with these two processes separately. In fact, the protein inference problem can be regarded as a special protein quantification problem, in the sense that the truly present proteins are those whose abundance values are not zero. Some recently published papers have discussed this possibility conceptually. However, rigorous experimental studies to test this hypothesis are still lacking. In this paper, we investigate the feasibility of using protein quantification methods to solve the protein inference problem. Protein inference methods aim to determine whether each candidate protein is present in the sample. Protein quantification methods estimate the abundance value of each inferred protein. Naturally, the abundance value of an absent protein should be zero. Thus, we argue that the protein inference problem can be viewed as a special protein quantification problem in which a protein is considered present if its abundance is not zero. Based on this idea, we use three simple protein quantification methods to solve the protein inference problem. The experimental results on six data sets show that these three methods are competitive with previous protein inference algorithms. This demonstrates that it is plausible to model the protein inference problem as a special protein quantification task, which opens the door to devising more effective protein inference algorithms from a quantification perspective. The source code of our methods is available at: http://code.google.com/p/protein-inference/.
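The idea in this abstract, that a protein counts as present when its estimated abundance is non-zero, can be illustrated with a minimal sketch. It uses plain spectral counting as the quantification step; the function names, the input layout (pairs of spectrum and protein identifiers), and the zero threshold are assumptions made for illustration, not the authors' published methods, and shared peptides are ignored.

```python
from collections import Counter

def spectral_counts(psms):
    """Count peptide-spectrum matches (PSMs) per protein.

    `psms` is a list of (spectrum_id, protein_id) pairs. Shared peptides
    are not handled here (a simplifying assumption of this sketch).
    """
    return Counter(protein for _, protein in psms)

def infer_present(candidates, psms, threshold=0):
    """Call a candidate protein present if its count exceeds `threshold`."""
    counts = spectral_counts(psms)
    return {p: counts.get(p, 0) > threshold for p in candidates}

# Hypothetical toy data: P1 has two PSMs, P2 has one, P3 has none.
psms = [("s1", "P1"), ("s2", "P1"), ("s3", "P2")]
result = infer_present(["P1", "P2", "P3"], psms)
```

Raising `threshold` above zero would trade sensitivity for fewer false positives, which is one way a quantification method could be tuned as an inference method.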

2.
The protein disulfide bond is a covalent bond that forms during post-translational modification by the oxidation of a pair of cysteines. In proteins, the disulfide bond is the most frequent covalent link between amino acids after the peptide bond. It plays a significant role in three-dimensional (3D) ab initio protein structure prediction (aiPSP), stabilizing protein conformation, post-translational modification, and protein folding. In aiPSP, knowing the location of disulfide bonds can strongly reduce the conformational search space by imposing geometrical constraints. Existing experimental techniques for the determination of disulfide bonds are time-consuming and expensive. Thus, developing sequence-based computational methods for disulfide bond prediction becomes indispensable. This study proposes a stacking-based machine learning approach for disulfide bond prediction (diSBPred). Various useful sequence- and structure-based features are extracted for effective training, including conservation profile, residue solvent accessibility, torsion angle flexibility, disorder probability, the sequential distance between cysteines, and more. The prediction of disulfide bonds is carried out in two stages: first, individual cysteines are predicted as either bonding or non-bonding; second, the cysteine pairs are predicted as either bonding or non-bonding by including the results from cysteine bonding prediction as a feature. A comparison of the features employed in this study with those used in the existing nearest neighbor algorithm (NNA) method shows that the features used in this study improve the jackknife-validation balanced accuracy by about 7.39%. Moreover, for individual cysteine bonding prediction and cysteine-pair bonding prediction, diSBPred provides a 10-fold cross-validation balanced accuracy of 82.29% and 94.20%, respectively. Altogether, our predictor achieves an improvement of 43.25% based on balanced accuracy compared to the existing NNA-based approach. Thus, diSBPred can be utilized to annotate the cysteine bonding residues of protein sequences whose structures are unknown as well as to improve the accuracy of the aiPSP method, which can further aid experimental studies of the disulfide bond and structure determination.
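Balanced accuracy, the metric reported throughout this abstract, is the mean of sensitivity and specificity, which makes it robust to the imbalance between bonding and non-bonding cysteines. The following is a minimal sketch of that standard definition; the label vectors are hypothetical and are not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = bonding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)  # fraction of bonding cases recovered
    specificity = tn / (tn + fp)  # fraction of non-bonding cases recovered
    return (sensitivity + specificity) / 2

# Hypothetical labels: 4 bonding and 2 non-bonding cysteines.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)  # (0.75 + 0.5) / 2
```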

3.
The literature contains over fifty years of accumulated methods proposed by researchers for predicting the secondary structures of proteins in silico. A large part of this collection consists of artificial neural network-based approaches, a field of artificial intelligence and machine learning that is gaining increasing popularity in various application areas. The primary objective of this paper is to summarize works that are important but scattered in time, to give new researchers a clear view of the domain in a single place. An informative introduction to protein secondary structure and artificial neural networks is also included for context. This review will be valuable in designing future methods to improve protein secondary structure prediction accuracy. The various neural network methods found in this problem domain employ varying architectures and feature spaces, and a handful stand out due to significant improvements in prediction. Neural networks with larger feature scope and higher architecture complexity have been found to produce better protein secondary structure prediction. The current prediction accuracy lies around the 84% mark, leaving much room for further improvement in the prediction of secondary structures in silico. The estimated limit of 88% prediction accuracy has not yet been reached; hence further research is timely.

4.
Organic light-emitting diode (OLED) materials have exhibited a wide range of applications. However, the further development and commercialization of OLEDs requires higher-quality OLED materials, including materials with high thermal stability. Thermal stability is associated with the glass transition temperature (Tg) and decomposition temperature (Td), but experimental determination of these two important properties is generally time-consuming and laborious. Thus, the development of a quick and accurate prediction tool is highly desirable. Motivated by this challenge, we explored machine learning (ML) by constructing a new dataset with more than 1,000 samples collected from a wide range of literature, on which ensemble learning models were explored. Models trained with the LightGBM algorithm exhibited the best prediction performance, with mean absolute error, root mean squared error, and R² values of 17.15 K, 24.63 K, and 0.77 for Tg prediction and 24.91 K, 33.88 K, and 0.78 for Td prediction. The prediction performance and the generalization of the ML models were further tested in two applications, which also gave satisfactory results. Experimental validation further demonstrated the reliability and the practical potential of the ML-based models. To extend the practical application of the ML-based models, an online prediction platform was constructed. This platform includes the optimal prediction models and all the thermal stability data under study, and it is freely available at http://www.oledtppxmpugroup.com. We expect that this platform will become a useful tool for experimental investigation of Tg and Td, accelerating the design of OLED materials with desired properties.
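The three regression metrics quoted in this abstract (mean absolute error, root mean squared error, and R²) follow standard definitions, sketched below. The temperature values are hypothetical and serve only to show the arithmetic; they are not data from the study.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired observations and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Hypothetical Tg values in kelvin.
tg_measured = [300.0, 350.0, 400.0]
tg_predicted = [310.0, 340.0, 400.0]
mae, rmse, r2 = regression_metrics(tg_measured, tg_predicted)
```

Because MAE and RMSE carry the unit of the target (kelvin here) while R² is dimensionless, the three are usually reported together, as the abstract does.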

5.
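Chen Letian, Zhang Xu, Chen An, Yao Sai, Hu Xu, Zhou Zhen. Chinese Journal of Catalysis (催化学报), 2022, 43(1): 11-32
As the conflict between growing energy demand and the depletion of fossil fuel resources becomes increasingly prominent, and as the combustion of non-renewable resources such as oil and natural gas brings environmental problems and global warming, clean renewable energy is receiving more and more attention. Sustainable technologies, including energy conversion and reversible energy use, have therefore attracted wide interest. Among them, electrocatalysis is regarded as an important route to clean energy conversion. At present, the catalysts for electrocatalytic reactions are still mainly noble metals. However, the high price of noble metals greatly...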

6.
Ischemic stroke is a common neurological disorder and is still the principal cause of serious long-term disability in the world. Selecting features related to stroke prognosis is highly valuable for effective intervention and treatment. In this study, an integrated machine learning approach was used to select features as prognostic factors of stroke on the International Stroke Trial (IST) dataset. We considered the common problems of feature selection and prediction in medical datasets. First, the importance of features was ranked by the Shapiro-Wilk algorithm and the Pearson correlations between features were analyzed. Then, we used Recursive Feature Elimination with Cross-Validation (RFECV), with linear SVC, Random-Forest-Classifier, Extra-Trees-Classifier, AdaBoost-Classifier, and Multinomial-Naïve-Bayes-Classifier each serving in turn as the estimator, to select robust features. Furthermore, the importance of the selected features was determined by Random-Forest-Classifier and the Shapiro-Wilk algorithm. Finally, twenty-three selected features were used by SVC, MLP, Random-Forest, and AdaBoost-Classifier to predict RVISINF (infarct visible on CT) in acute stroke on the IST dataset. The results suggest that the selected features can be used to infer the long-term prognosis of acute stroke with high accuracy, and can also be used to extract factors related to RVISINF, which is associated with large artery occlusion (LAO) in ischemic stroke patients.

7.
Journal of the Indian Chemical Society, 2023, 100(1): 100815
The right combination of surfactants and stabilizers in detergent formulations plays a significant role in their cleaning performance. However, it becomes a complex optimization problem when the formulation is composed of multiple ingredients and the solution has to be optimized for competing performance metrics. In recent times, machine learning techniques have been used extensively to study such processes. In this research, a detergent pre-formulation was designed using an aqueous solution of Tween-20, Ethanol and 1-Octanol. To determine the optimal values of the ingredients of the formulations, supervised machine learning models were developed and optimized for the Ross Miles Index 30 ml (RMI30) and cleaning time (CT). A full factorial experimental design was performed, and three regression models based on linear, 2FI and quadratic designs were developed for each of RMI30 and CT. ANOVA of the trained models reported an optimal p-value of 0.0018 for RMI30 and less than 0.0001 for CT. The optimal values for RMI30 and CT obtained through the regression models are 72.32 ml and 17.67 s. For multi-objective optimization, grey relational analysis was performed. Two pairs of optimal values corresponding to Rank 1 were recorded as 88.9 ml, 20 s (RMI30, CT) and 81.2 ml, 14 s (RMI30, CT), respectively. As a result, the optimal combinations of Tween-20, Ethanol and 1-Octanol for maximizing RMI30 and minimizing CT are reported. The obtained optimal values were experimentally validated.
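Grey relational analysis, which the abstract uses for the multi-objective step, can be sketched in its common textbook form: normalize each response (larger-is-better for RMI30, smaller-is-better for CT), compute grey relational coefficients against the ideal value, and average them into a grade used for ranking. The distinguishing coefficient of 0.5, the equal weighting, and the trial values below are assumptions for illustration; the abstract does not state the settings the authors used.

```python
def grey_relational_grades(rows, larger_is_better, zeta=0.5):
    """Grey relational grade per trial; rows are tuples of responses.

    Assumes every response column has at least two distinct values,
    otherwise the normalization would divide by zero.
    """
    deltas = []
    for col, larger in zip(zip(*rows), larger_is_better):
        lo, hi = min(col), max(col)
        if larger:
            norm = [(v - lo) / (hi - lo) for v in col]
        else:
            norm = [(hi - v) / (hi - lo) for v in col]
        # Deviation from the ideal normalized value of 1.
        deltas.append([1.0 - v for v in norm])
    flat = [d for col in deltas for d in col]
    d_min, d_max = min(flat), max(flat)
    coeffs = [[(d_min + zeta * d_max) / (d + zeta * d_max) for d in col]
              for col in deltas]
    # Equal-weight average of the coefficients across responses.
    return [sum(c) / len(c) for c in zip(*coeffs)]

# Hypothetical trials: (RMI30 in ml, CT in s).
trials = [(60.0, 25.0), (80.0, 20.0), (90.0, 15.0)]
grades = grey_relational_grades(trials, larger_is_better=[True, False])
best_trial = max(range(len(grades)), key=grades.__getitem__)
```

The trial with the highest grade receives Rank 1, which is how a single ranking emerges from two competing responses.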

8.
9.
In patients with depression, the use of 5-HT reuptake inhibitors can improve the condition. Machine learning methods can be used in ligand-based activity prediction. To predict SERT inhibitors, SERT inhibitor data from the ChEMBL database were screened and pre-processed. Then 4 machine learning methods (LR, SVM, RF, and KNN) and 4 molecular fingerprints (CDK, Graph, MACCS, and PubChem) were used to build 16 prediction models. The five models with the highest accuracy (Q) in cross-validation on the training set were used to build three different ensemble learning models. On the test1 set, the VOT_CLF3 model had the largest SP (0.871), Q (0.869), AUC (0.919), and MCC (0.728). On the unbalanced test2 set, VOT_CLF3 had the largest SE (0.857), SP (0.867), Q (0.865) and MCC (0.639). VOT_CLF3 was recommended for the virtual screening of SERT inhibitors. In addition, 12 molecular structural alerts that frequently appear in SERT inhibitors were found (P < 0.05), which provide an important reference for the design of SERT inhibitors.
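The abbreviations used in this abstract correspond to standard confusion-matrix metrics: SE (sensitivity), SP (specificity), Q (overall accuracy), and MCC (Matthews correlation coefficient). A minimal sketch of how they are computed follows; the counts are hypothetical and do not reproduce the reported values.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """SE, SP, Q and MCC from binary confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity (recall on actives)
    sp = tn / (tn + fp)                    # specificity (recall on inactives)
    q = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}

# Hypothetical counts for a test set of 100 compounds.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

On an unbalanced set such as test2, MCC is the more informative of these figures because Q can stay high even when the minority class is predicted poorly.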

10.
This study aimed at in silico screening of ssDNA aptamers against Escherichia coli O157:H7 by combining machine learning with the PseKNC approach. First, a total of 47 validated ssDNA aptamers and 498 random DNA sequences were taken as positive and negative training data, respectively. The sequences were then converted to numerical vectors using the PseKNC method through the Pse-in-one 2.0 web server. After that, the numerical vectors were classified by the SVM, ANN and RF algorithms available in the Orange 3.2.0 software. The performance of the tested models was evaluated using cross-validation, random sampling and ROC curve analyses. The initial results demonstrated that the ANN and RF algorithms performed adequately for the data classification. To improve the performance of these classifiers, the positive training data were triplicated and the models were re-trained. The results confirmed that increasing the data size had a significant effect on classification accuracy, especially for the RF model. Subsequently, the RF algorithm, with an accuracy of 98%, was selected for aptamer screening. The thermodynamic details of the folding process and the secondary structures of the screened aptamers were also considered as final evaluations. The results confirmed that the aptamers selected by the proposed method had appropriate structural properties and that there is no thermodynamic barrier to aptamer folding.

11.
This study unites six popular machine learning approaches to enhance the prediction of molecular binding affinity between receptors (large protein molecules) and ligands (small organic molecules). Here we examine a scheme where the affinity of ligands is predicted against a single receptor, human thrombin; thus the models consider ligand features only. However, the suggested approach can be repurposed for other receptors. The methods include Support Vector Machine, Random Forest, CatBoost, feed-forward neural network, graph neural network, and Bidirectional Encoder Representations from Transformers. The first five methods use input features based on physico-chemical properties of molecules, while the last one is based on textual molecular representations. None of the approaches relies on atomic spatial coordinates, which avoids a potential bias from known structures, and all are capable of generalizing to compounds with unknown conformations. Within each of the methods, we have trained two models that solve classification and regression tasks. Then, all models are grouped into a pipeline of two subsequent ensembles. The first ensemble aggregates six classification models which vote on whether a ligand binds to the receptor or not. If a ligand is classified as active (i.e., binds), the second ensemble predicts its binding affinity in terms of the inhibition constant Ki.
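The two-ensemble pipeline described here can be sketched as a vote followed by a conditional regression. The abstract does not specify how votes or affinity predictions are aggregated, so the strict majority rule and the arithmetic mean below are assumptions, and the three toy classifiers and two toy regressors are hypothetical stand-ins for the six trained models of each kind.

```python
from statistics import mean

def two_stage_predict(ligand, classifiers, regressors):
    """Stage 1: majority vote on activity. Stage 2: mean predicted Ki.

    Each classifier returns True/False and each regressor returns a
    number; both take the ligand's feature dictionary as input.
    """
    votes = [clf(ligand) for clf in classifiers]
    is_active = sum(votes) > len(votes) / 2   # strict majority (assumed rule)
    if not is_active:
        return False, None                    # no affinity for inactive ligands
    return True, mean(reg(ligand) for reg in regressors)

# Hypothetical stand-in models operating on ligand features only.
classifiers = [
    lambda lig: lig["logp"] < 5,
    lambda lig: lig["mw"] < 500,
    lambda lig: lig["hbd"] <= 5,
]
regressors = [lambda lig: 10.0, lambda lig: 20.0]

ligand = {"logp": 3.2, "mw": 620.0, "hbd": 2}
active, ki = two_stage_predict(ligand, classifiers, regressors)
```

Gating the regressors behind the vote means affinity is only estimated for ligands the classifiers consider binders, which mirrors the order of the two ensembles in the abstract.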

12.
A DNA microarray can track the expression levels of thousands of genes simultaneously. Previous research has demonstrated that this technology can be useful in the classification of cancers. Cancer microarray data normally contain a small number of samples which have a large number of gene expression levels as features. Selecting the relevant genes involved in different types of cancer remains a challenge. In order to extract useful gene information from cancer microarray data and reduce dimensionality, feature selection algorithms were systematically investigated in this study. Using a correlation-based feature selector combined with machine learning algorithms such as decision trees, naïve Bayes and support vector machines, we show that classification performance at least as good as published results can be obtained on acute leukemia and diffuse large B-cell lymphoma microarray data sets. We also demonstrate that a combined use of different classification and feature selection approaches makes it possible to select relevant genes with high confidence. This is also the first paper which discusses both computational and biological evidence for the involvement of zyxin in leukaemogenesis.

13.
Multi-instance multi-label (MIML) learning has been proven to be effective for the genome-wide protein function prediction problems where each training example is associated with not only multiple instances but also multiple class labels. To find an appropriate MIML learning method for genome-wide protein function prediction, many studies in the literature attempted to optimize objective functions in which dissimilarity between instances is measured using the Euclidean distance. But in many real applications, Euclidean distance may be unable to capture the intrinsic similarity/dissimilarity in feature space and label space. Unlike other previous approaches, in this paper, we propose to learn a multi-instance multi-label distance metric learning framework (MIMLDML) for genome-wide protein function prediction. Specifically, we learn a Mahalanobis distance to preserve and utilize the intrinsic geometric information of both feature space and label space for MIML learning. In addition, we try to deal with the sparsely labeled data by giving weight to the labeled data. Extensive experiments on seven real-world organisms covering the biological three-domain system (i.e., archaea, bacteria, and eukaryote; Woese et al., 1990) show that the MIMLDML algorithm is superior to most state-of-the-art MIML learning algorithms.
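The contrast between Euclidean and Mahalanobis distance drawn in this abstract can be made concrete with a short sketch. A Mahalanobis distance is parameterized by a positive semi-definite matrix M, and the Euclidean distance is the special case where M is the identity. The matrices below are hand-picked for illustration; how MIMLDML actually learns M is not described in the abstract and is not reproduced here.

```python
from math import sqrt

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    # Matrix-vector product M @ diff, row by row.
    m_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    return sqrt(sum(d_i * md_i for d_i, md_i in zip(diff, m_diff)))

x, y = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
weighted = [[2.0, 0.0], [0.0, 0.5]]   # hypothetical metric: stretch axis 0, shrink axis 1

d_euclidean = mahalanobis(x, y, identity)   # identity M reduces to Euclidean distance
d_weighted = mahalanobis(x, y, weighted)
```

A learned M reweights and correlates feature dimensions, so two instances that are far apart in Euclidean terms can be close under the metric, and vice versa.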

14.
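Micro/nanoplastic pollution in the environment has attracted great concern. Micro/nanoplastics in soil inevitably affect plants, so predicting their phytotoxicity can provide a practical handle for managing micro/nanoplastics in soil. Taking rice as the study object, this work developed a non-targeted metallomics method based on synchrotron radiation X-ray fluorescence (SRXRF) spectroscopy and machine learning to predict the toxicity of polyvinyl chloride nanoplastics (nPVC) to rice. First, rice was exposed to nPVC at different concentrations (500 ppb and 500 ppm), and the leaves were collected after 35 days of cultivation. Second, SRXRF was used to study changes in the leaf metallome after nPVC exposure. Then, machine learning methods were used to distinguish rice samples exposed to different nPVC concentrations. Unsupervised clustering of the SRXRF spectra by principal component analysis (PCA) showed that the 500 ppm group clustered well, whereas the 500 ppb group showed no obvious difference from the control group, indicating that exposure to 500 ppb nPVC is far less toxic to plants than exposure to 500 ppm nPVC. For the full SRXRF spectra, prediction models were built with the linear model k-nearest neighbors (KNN) and the nonlinear model support vector machine (SVM), and the groups were distinguished with an accuracy of up to 94.12%. To increase computation speed and reduce the computational load of the model, the competitive adaptive reweighted sampling (CARS) algorithm was used to select characteristic spectra for building prediction models, which distinguished the groups with an accuracy of 89.51%. Compared with the full-spectrum model, the accuracy of the characteristic-spectrum model dropped by 4.61 percentage points, but the number of model input parameters decreased by 99.38%, so it also shows good potential. This study shows that non-targeted metallomics based on SRXRF and machine learning can accurately predict the degree to which different concentrations of nPVC disturb the rice metallome, thereby reflecting the concentration dependence of nPVC toxicity to rice. The method can likewise be used to predict the concentration dependence of the toxicity of other micro/nanoplastics.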

15.
16.
In this study, the adsorption process for removing Hg and Ni from water with nanocomposite materials was simulated. The nanostructured material used for the adsorption study was a combination of a MOF and a layered double hydroxide, referred to as MOF-LDH in this work. The data were obtained from published sources, and different machine learning models were trained. We selected three regression models (elastic net, decision tree, and Gradient Boosting) to perform regression on the small data set with two inputs and two outputs. The inputs are ion type (Hg or Ni) and initial ion concentration in the feed solution (C0), and the outputs are equilibrium concentration (Ce) and equilibrium capacity of the adsorbent (Qe). After tuning their hyper-parameters, the final models were implemented and assessed using different metrics. In terms of the R2-score metric, all models score more than 0.97 for Ce and more than 0.88 for Qe. Gradient Boosting has an R2-score of 0.994 for Qe. Considering RMSE and MAE as well, Gradient Boosting shows acceptable errors and is the best model. Finally, the optimal values with the GB model are identical to the dataset optimum: (Ion = Ni, C0 = 250, Ce = 206.0). However, for Qe, it is different and is equal to (Ion = Hg, C0 = 121.12, Ce = 606.15). The results revealed that the developed simulation methods have a high capacity for predicting adsorption in the removal of heavy metals using nanostructured materials.

17.
Pre-packed columns have been increasingly used in process development and biomanufacturing thanks to their ease of use and consistency. Traditionally, packing quality is predicted through rate models, which require extensive calibration efforts through independent experiments to determine relevant mass transfer and kinetic rate constants. Here we propose machine learning as a complementary predictive tool for column performance. A machine learning algorithm, extreme gradient boosting, was applied to a large data set of packing quality (plate height and asymmetry) for pre-packed columns as a function of quantitative parameters (column length, column diameter, and particle size) and qualitative attributes (backbone and functional mode). The machine learning model offered excellent predictive capabilities for the plate height and the asymmetry (90 and 93%, respectively), with packing quality strongly influenced by backbone (∼70% relative importance) and functional mode (∼15% relative importance), well above all other quantitative column parameters. The results highlight the ability of machine learning to provide reliable predictions of column performance from simple, generic parameters, including strategic qualitative parameters such as backbone and functionality, usually excluded from quantitative considerations. Our results will guide further efforts in column optimization, for example, by focusing on improvements of backbone and functional mode to obtain optimized packings.

18.
In order to understand the molecular mechanism underlying any disease, knowledge about the interacting proteins in the disease pathway is essential. The number of known protein-protein interactions (PPIs) is still very limited compared to the available protein sequences of different organisms. Although experiment-based high-throughput technologies provide some data about these interactions, those data are often fairly noisy. Computational techniques for predicting protein-protein interactions therefore assume significance. A set of 1296 binary fingerprints that encode a combination of structural and geometric properties was developed using the crystallographic data of 15,000 protein complexes in the PDB server. In a case study, these fingerprints were created for proteins implicated in Type 2 diabetes mellitus. The fingerprints were input into an SVM-based model for discriminating disease proteins from non-disease proteins, yielding a classification accuracy of 78.2% (AUC value of 0.78) on an external data set composed of proteins retrieved via text mining of diabetes-related literature. A PPI network was constructed and analysed to explore new disease targets. The integrated approach exemplified here has potential for identifying disease-related proteins, functional annotation and other proteomics studies.

19.
20.
Background: Discovering possible drug-target interactions (DTIs) is a decisive step in detecting the effects of drugs as well as in drug repositioning. There is a strong incentive to develop computational methods that can effectively predict potential DTIs, as traditional DTI laboratory experiments are expensive, time-consuming, and labor-intensive. Some technologies have been developed for this purpose; however, large numbers of interactions have not yet been detected, prediction accuracy is still low, and protein sequences and structured data are rarely used together in the prediction process.
Methods: This paper presents a DTI prediction model that takes advantage of the structured form of proteins and drugs. Our model obtains features from protein amino-acid sequences using physical and chemical properties, and from drug SMILES (Simplified Molecular Input Line Entry System) strings using encoding techniques. In a comparison with existing methods under K-fold cross-validation, empirical results show that our model, based on ensemble learning algorithms, provides more accurate DTI predictions from both structure and feature data.
Results: The proposed model is applied to two datasets: Benchmark (feature-only) datasets and DrugBank (structure data) datasets. Light-Boost and ExtraTree using structure and feature data achieve 98% accuracy and an F-score of 0.97, compared with 94% and 0.92 achieved by existing methods. Moreover, our model can successfully predict more as-yet-undiscovered interactions and hence can be used as a practical tool for drug repositioning. A case study applies our prediction model to proteins known to be affected by coronaviruses, in order to predict possible interactions between these proteins and existing drugs. Our model is also applied to Covid-19-related drugs announced on DrugBank. The results show that some drugs, such as DB00691 and DB05203, are predicted with 100% accuracy to interact with the ACE2 protein. This protein is a self-membrane protein that enables Covid-19 infection. Hence, our model can be used as an effective tool in drug repositioning to predict possible drug treatments for Covid-19.
