1.

Pure Ion Chromatograms Combined with Advanced Machine Learning Methods Improve Accuracy of Discriminant Models in LC–MS-Based Untargeted Metabolomics

Miao Tian, Zhonglong Lin, Xu Wang, Jing Yang, Wentao Zhao, Hongmei Lu, Zhimin Zhang, Yi Chen 《Molecules (Basel, Switzerland)》2021,26(9)

Untargeted metabolomics based on liquid chromatography coupled with mass spectrometry (LC–MS) can detect thousands of features in samples and produce highly complex datasets. The accurate extraction of meaningful features and the building of discriminant models are two crucial steps in the data analysis pipeline of untargeted metabolomics. In this study, pure ion chromatograms were extracted from a liquor dataset and a left-sided colon cancer (LCC) dataset by the K-means-clustering-based Pure Ion Chromatogram extraction method version 2.0 (KPIC2). Then, nonlinear low-dimensional embedding by uniform manifold approximation and projection (UMAP) showed the separation of samples from different groups in reduced dimensions. The discriminant models were established by extreme gradient boosting (XGBoost) based on the features extracted by KPIC2. Results showed that features extracted by KPIC2 achieved 100% classification accuracy on the test sets of the liquor dataset and the LCC dataset, compared with 92% and 96%, respectively, for features extracted by XCMS, which supports the XGBoost model based on KPIC2. Finally, XGBoost achieved better performance than the linear method and traditional nonlinear modeling methods on these datasets. UMAP and XGBoost are integrated into the KPIC2 package to extend its performance in complex situations; they can effectively process nonlinear datasets and can greatly improve the accuracy of data analysis in untargeted metabolomics.

2.

Forecasting Bike-Sharing Rental Demand Based on a Combined Forecasting Method

Zhang Jiantong, Sun Jiaqing 《Operations Research and Management Science》2021,30(10):146-152

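Forecasting rental demand for shared bikes is essential for bike-sharing companies to improve operating efficiency and is a prerequisite for bike rebalancing. To forecast rental demand more accurately, this paper combines the strengths of three data-driven forecasting algorithms (random forest, XGBoost, and GBDT) and proposes a weighted logarithmic average combination model based on the vector projection method. The concepts of superiority, non-inferiority, and inferiority of a combination model are defined, and the method is proved to be at least a non-inferior forecasting method. The method is then applied to a real-world bike rental demand forecasting problem. The case study finds that the method can act as a superior forecasting model for bike rental demand and can contribute positively to bike rebalancing. The method offers a practical and effective direction for related research on bike rental demand forecasting.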
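A minimal sketch of the weighted logarithmic average idea described in this abstract, assuming the combined forecast is the exponential of the weighted mean of the log forecasts (a weighted geometric mean). The abstract does not give the formula for deriving weights by vector projection, so the weights here are placeholder values, and the function name and demand figures are hypothetical.

```python
import math


def log_average_combination(forecasts, weights):
    """Combine positive forecasts as exp(sum_i w_i * ln(f_i)), with weights normalized to sum to 1."""
    if len(forecasts) != len(weights):
        raise ValueError("forecasts and weights must have the same length")
    if any(f <= 0 for f in forecasts):
        raise ValueError("logarithmic averaging requires strictly positive forecasts")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    return math.exp(sum(w * math.log(f) for w, f in zip(normalized, forecasts)))


# Hypothetical hourly demand forecasts from the three base models.
base_forecasts = [120.0, 135.0, 128.0]   # random forest, XGBoost, GBDT
placeholder_weights = [0.3, 0.4, 0.3]    # assumption: not derived by vector projection
combined = log_average_combination(base_forecasts, placeholder_weights)
print(round(combined, 2))
```

With non-negative weights, the combined value always lies between the smallest and largest base forecast.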

3.

Quantification of Stellar Spectral Classification Features Based on XGBoost

Zhang Xiao, Luo A-Li 《Spectroscopy and Spectral Analysis》2019,39(10):3292-3296

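Stellar spectral classification is one of the fundamental tasks in the study of stars. The commonly used classification is based on the MK system established by Morgan and Keenan in the 1970s and gradually refined since. However, interactive decision-based classification following the MK rules has difficulty handling massive volumes of astronomical spectra. Current spectroscopic surveys generally use template matching for automated classification and neglect the measurement of spectral line features. Extracting classification features from massive spectra automatically and objectively, and using these features for classification, is crucial for the statistical analysis of the physical and chemical properties of celestial objects. To address this problem, machine learning is combined with the computation of spectral line indices to extract spectral features, and big-data analysis is used to quantitatively determine classification criteria for the characteristic lines (quantification), establishing the intensity distribution of the physically meaningful characteristic lines for each spectral class. Line indices measured from LAMOST DR4 stellar spectra are used as input, and the officially released classification results are used as the class labels. The XGBoost algorithm is used for automatic classification and feature ranking, thereby identifying the known or unknown lines most sensitive to the classification decision. First, stellar spectra with a high signal-to-noise ratio (S/N>30) that LAMOST labels as types B, A, F, and M are selected, about 4.14 million in total. Then, line indices are computed for the spectra to reduce dimensionality and filter out redundant information. Next, the processed spectra are randomly split into a training set and a test set; with the algorithm parameters suitably tuned, the training set is used to obtain the classification decision tree model and the test set is used to check its stability and usability in order to guard against overfitting, while the algorithm's built-in functions are used to extract the classification features. Finally, the decision tree models produced by the algorithm are output and organized, and the branches with relatively high probability are selected as the final decision tree model. The experiments show that, with fixed parameters, the model obtained by XGBoost is somewhat adaptive and is little affected by the dataset, with an overall accuracy of up to 88.5%. The classification decision trees it outputs agree fairly well with known features, and they yield big-data-based, quantified ranges of characteristic lines for each class, providing quantitative rules for improving feature-based classification.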

4.

Research on an Intelligent Evaluation Framework for English Essays Based on Semantic Similarity and the XGBoost Algorithm

Lü Xin, Cheng Yuxia 《Journal of Zhejiang University (Science Edition)》2020,47(3):329-336

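Intelligent essay scoring and intelligent comment generation can greatly reduce the workload of expert graders and save labor costs. At present, the accuracy and fairness of scoring and comment results are still not high. In recent years, the rapid development of machine learning and natural language processing has improved performance on tasks such as text classification and machine translation to some extent, but many new research results have not yet been applied to intelligent essay evaluation. This study combines word vectors (word2vec), paragraph vectors (paragraph2vec), part-of-speech vectors (pos2vec), and LDA (latent Dirichlet allocation) features into a semantic representation vector for each essay; uses a semantic similarity model based on the kNN (k nearest neighbors) algorithm to obtain comment labels for the essay; uses a regression model based on XGBoost (extreme gradient boosting) to compute the score of the English essay; and validates the approach on a sample of 900 English essays by university students. The results show that the proposed intelligent evaluation framework is more accurate than traditional methods in both automatic scoring and comment generation for English essays.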
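The kNN-based semantic similarity step could look roughly like the sketch below. It assumes cosine similarity between essay vectors and a majority vote over the k most similar labelled essays; the abstract does not specify the similarity measure or the voting rule, and the vectors, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_comment_label(query_vector, labelled_essays, k=3):
    """Return the most common comment label among the k essays most similar to the query."""
    ranked = sorted(
        labelled_essays,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D semantic vectors paired with comment labels.
corpus = [
    ([1.0, 0.0], "clear argument"),
    ([0.9, 0.1], "clear argument"),
    ([0.0, 1.0], "weak structure"),
]
print(knn_comment_label([1.0, 0.05], corpus, k=3))
```

In the framework described above, the query vector would be the concatenated word2vec, paragraph2vec, pos2vec, and LDA features rather than a toy 2-D vector.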

5.

Nuclear spin-spin coupling constants prediction based on XGBoost and LightGBM algorithms

Xin-xin Zhang, Tong Deng 《Molecular Physics》2020,118(14)

Nuclear magnetic resonance (NMR) is a robust method for analyzing complex molecular structures, and measurement of the nuclear spin–spin coupling constant is key. In this paper, based on the 3D coordinates of the atoms in the molecule, the spin–spin coupling constants of atom pairs are directly predicted using Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM). The result calculated by the DFT method is taken as the target value. Experiments show that the overall performance of LightGBM (R²: 0.93) is better than that of XGBoost. In some molecules, the predicted fit (R²) of the coupling constant between atoms even reached 1.00. This research avoids complex quantum mechanics and can assist NMR in gaining insight into the structure and dynamics of molecules, thereby enriching the methods for analyzing nuclear magnetic interaction data.
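For readers unfamiliar with the fit metric quoted here, the sketch below computes the coefficient of determination in its usual form, R² = 1 − SS_res / SS_tot. It assumes this standard definition is the one the abstract reports; the coupling-constant values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all target values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical DFT target values and model predictions (illustrative numbers only).
dft_targets = [84.8, -11.3, 3.2, 90.1]
predictions = [83.9, -10.8, 4.0, 91.0]
print(round(r_squared(dft_targets, predictions), 4))
```

An R² of 1.00 means the predictions reproduce the target values exactly, while 0 means they do no better than always predicting the mean.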

6.

Holistic Prediction of the pKa in Diverse Solvents Based on a Machine-Learning Approach

Qi Yang, Yao Li, Dr. Jin-Dong Yang, Yidi Liu, Dr. Long Zhang, Prof. Dr. Sanzhong Luo, Prof. Dr. Jin-Pei Cheng 《Angewandte Chemie (Weinheim an der Bergstrasse, Germany)》2020,132(43):19444-19453

7.

Coal and Gangue Identification Method Based on XGBoost and Visible-Near-Infrared Spectroscopy

Li Rui, Li Bo, Wang Xuewen, Liu Tao, Li Lianjie, Fan Shuxiang 《Spectroscopy and Spectral Analysis》2022,42(9):2947-2955

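Intelligent identification of coal and gangue is a new technology urgently needed to make fully mechanized top-coal caving mining intelligent. Visible-near-infrared spectroscopy is environmentally friendly and works in real time, which meets the requirements of intelligent coal-gangue sorting. To solve the problem of coal-gangue identification based on visible-near-infrared spectra, the extreme gradient boosting (XGBoost) algorithm, which has performed well in data science competitions, is introduced. A visible-near-infrared spectroscopy experimental platform was built to collect reflectance spectra in the 370–1049 nm band from lump coal and gangue samples from the Ximing (Shanxi), Shenmu (Shaanxi), and Balongtu (Inner Mongolia) coal mines. The raw spectra were preprocessed by black-and-white correction, removal of the first and last bands, Savitzky-Golay (SG) convolution smoothing, and standard normal variate (SNV) transformation to reduce the effects of uneven illumination, noise, and optical path differences. Based on the differences between the reflectance spectra of coal and gangue samples from the three mines, an experimental group and a test group were formed: the experimental group, with very small differences, was used to compare the performance of different models and select the best algorithm; the test group, with more obvious differences, was used to test how the best algorithm performs on other mines and to check its applicability across mines. In the experimental-group experiments, a coal-gangue classification model was first built with XGBoost and compared with the commonly used machine learning classifiers k-nearest neighbors (KNN), random forest (RF), and support vector machine (SVM). XGBoost performed best, with mean ten-fold cross-validation accuracy (ACC₁₀), classification accuracy (ACC), and AUC reaching 0.9572, 0.9705, and 0.9716, respectively, showing strong stability and classification ability. Next, to reduce data dimensionality and model computation, recursive feature elimination (RFE), the successive projections algorithm (SPA), and competitive adaptive reweighted sampling (CARS) were each used to select characteristic wavelengths and were combined with the four classifiers above to build simplified classification models. In testing, the simplified model combining RFE with XGBoost performed best, with ACC₁₀, ACC, and AUC of 0.9657, 0.9803, and 0.9803, respectively, and the data dimensionality reduced to 9, improving the model's stability and classification ability while reducing dimensionality. In the test-group experiments, models built with the selected XGBoost and RFE-XGB algorithms likewise identified coal and gangue from other mining areas stably and accurately, and the simplified model performed better, consistent with the experimental-group results.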
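Of the preprocessing steps listed in this abstract, the standard normal variate (SNV) transformation is simple enough to sketch. The version below assumes the common definition, in which each spectrum is centered on its own mean and scaled by its own sample standard deviation; the abstract does not state which variant was used, and the reflectance values are made up.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: center one spectrum on its mean and scale by its sample std."""
    mean = statistics.fmean(spectrum)
    std = statistics.stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        raise ValueError("SNV is undefined for a constant spectrum")
    return [(value - mean) / std for value in spectrum]


# Hypothetical reflectance values at five wavelengths for one sample.
raw_spectrum = [0.21, 0.25, 0.30, 0.28, 0.26]
corrected = snv(raw_spectrum)
print([round(v, 3) for v in corrected])
```

Because the transformation is applied per spectrum, it removes additive offsets and multiplicative scaling differences between samples.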

8.

Rapid Identification of Microplastics in Seawater by Near-Infrared Spectroscopy

Wu Xue, Feng Weiwei, Cai Zongqi, Wang Qing 《Spectroscopy and Spectral Analysis》2022,42(11):3501-3506

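Combining spectroscopy with machine learning algorithms for rapid identification of microplastics provides strong technical support for on-site microplastic detection and is a new field attracting great attention. Near-infrared spectroscopy is fast, highly sensitive, and non-destructive, and it can measure samples directly without pretreatment; it is widely used in chemical analysis, quality inspection, and other fields. Based on near-infrared spectroscopy, this paper compares two machine learning classification algorithms, Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost), to build a fast and effective model for identifying and classifying microplastics. A miniature near-infrared spectrometer was used to collect spectra from 20 common microplastic standard samples. To prevent overfitting, each sample type was measured multiple times, giving 1,260 microplastic samples in total, each containing 512 data points. The XGBoost algorithm was used to rank feature importance, and the 65 data points with the greatest influence on identification accuracy were extracted. SVM and XGBoost were each used to build rapid microplastic identification models on the 65 data points retained after dimensionality reduction, and grid search (GridSearchCV) was used to select the most influential XGBoost hyperparameters; the best values of n_estimators, learning_rate, min_child_weight, max_depth, and gamma were 700, 0.07, 1, 1, and 0.0, respectively. To improve the stability, identification speed, and generalization ability of the models, 10-fold cross-validation and confusion matrices were used for evaluation. The results show an identification accuracy of 97% for the XGBoost model and 95% for the SVM model, so XGBoost identifies microplastics more accurately than SVM. In summary, the overall performance of the XGBoost model for microplastic identification is better than that of the SVM model, providing technical support for rapid identification of microplastics in practice.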
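As an illustration of the confusion-matrix evaluation mentioned in this abstract, the sketch below builds a confusion matrix and derives overall accuracy from it in plain Python. The polymer labels and predictions are invented and the helper names are hypothetical; the study's own evaluation code is not described in the abstract.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in the order given by `labels`."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Overall accuracy: correctly classified samples (the diagonal) over all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical polymer labels for four test spectra.
polymer_labels = ["PE", "PP", "PS"]
true_labels = ["PE", "PE", "PP", "PS"]
predicted_labels = ["PE", "PP", "PP", "PS"]
matrix = confusion_matrix(true_labels, predicted_labels, polymer_labels)
print(matrix)
print(accuracy_from_matrix(matrix))
```

Off-diagonal entries show which polymer types are confused with each other, which a single accuracy figure cannot reveal.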

9.

Hybrid Basketball Game Outcome Prediction Model by Integrating Data Mining Methods for the National Basketball Association

Wei-Jen Chen, Mao-Jhen Jhou, Tian-Shyug Lee, Chi-Jie Lu 《Entropy (Basel, Switzerland)》2021,23(4)

The sports market has grown rapidly over the last several decades. Sports outcome prediction is an attractive sports analytics challenge because it provides useful information for operations in the sports market. In this study, a hybrid basketball game outcome prediction scheme is developed for predicting the final score of National Basketball Association (NBA) games by integrating five data mining techniques: extreme learning machine, multivariate adaptive regression splines, k-nearest neighbors, eXtreme gradient boosting (XGBoost), and stochastic gradient boosting. Designed features are generated by merging different game-lag information from fundamental basketball statistics and are used in the proposed scheme. This study collected data from all the games of the NBA 2018–2019 season. There are 30 teams in the NBA and each team plays 82 games per season, so a total of 2460 NBA game data points were collected. Empirical results showed that the proposed hybrid basketball game prediction scheme achieves high prediction performance and identifies suitable game-lag information and relevant game features (statistics). The findings suggest that a two-stage XGBoost model using four pieces of game-lag information achieves the best prediction performance among all competing models. The six designed features (averaged defensive rebounds, averaged two-point field goal percentage, averaged free throw percentage, averaged offensive rebounds, averaged assists, and averaged three-point field goal attempts) from four game-lags have a greater effect on the prediction of final scores of NBA games than other game-lags. The findings of this study provide relevant insights and guidance for research on outcome prediction in other team or individual sports.
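One plausible reading of a game-lag feature is a team statistic averaged over that team's previous few games. The sketch below illustrates that reading only; the abstract does not define the exact construction, so the helper `lagged_average` and the rebound counts are hypothetical.

```python
def lagged_average(values, lag):
    """For each game, average the statistic over the previous `lag` games.

    Games with fewer than `lag` earlier games get None, since no full window exists.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    features = []
    for i in range(len(values)):
        if i < lag:
            features.append(None)
        else:
            window = values[i - lag:i]  # strictly earlier games, so no leakage from game i
            features.append(sum(window) / lag)
    return features


# Hypothetical defensive rebounds for one team's first six games.
defensive_rebounds = [34, 30, 38, 36, 32, 40]
print(lagged_average(defensive_rebounds, lag=4))

# Dataset size stated in the abstract: 30 teams x 82 games per team.
print(30 * 82)
```

Using only earlier games in each window keeps the feature available before tip-off, which matters when the target is the final score of the current game.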

10.

Research on the Direction of Stock Returns Based on Data Mining Techniques

Gou Xiaoju, Wang Qian 《Operations Research and Management Science》2021,30(1):163-169

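This paper uses data mining techniques to study the direction of change in stock returns. Through wavelet multiscale decomposition, stock prices are transformed into sub-series in different frequency domains, and the high-frequency series are denoised. Extreme gradient boosting (XGBoost) and other mainstream machine learning algorithms are built to fit and predict the rise and fall of constituent stocks of the CSI 300 and CSI 500 indices. The study finds that the average accuracy of XGBoost reaches 54.69% and 55.13%, respectively, and that an investment strategy built on the prediction signals can generate stable returns, indicating that the method has strong predictive ability. On this basis, the "black box" problem of machine learning algorithms is discussed and studied, and the logic of the model's stock selection is analyzed: a measure of factor weight is proposed, and indicators such as the price-to-book ratio, price-to-earnings ratio, and on-balance volume are found to be relatively important discriminating indicators in the model. Partial dependence is used to measure the marginal effect of each factor on the direction of stock price movement, leading to conclusions such as that the model tends to select stocks with lower price-to-earnings and price-to-book ratios, which makes the logic of the algorithm clearer.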
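The decomposition and denoising step can be illustrated with a single level of the Haar wavelet transform followed by soft thresholding of the high-frequency (detail) coefficients. This is a sketch under stated assumptions: the abstract names neither the wavelet family, the number of levels, nor the thresholding rule, so Haar, one level, and the threshold value are all illustrative choices, and the prices are invented.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    scale = math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal


def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by `threshold`; small ones become zero."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coefficients]


# Hypothetical closing prices for eight trading days.
prices = [10.0, 10.4, 10.1, 10.9, 11.2, 11.0, 11.6, 11.5]
approx, detail = haar_step(prices)
denoised = inverse_haar_step(approx, soft_threshold(detail, threshold=0.2))
print([round(p, 3) for p in denoised])
```

A multiscale decomposition repeats `haar_step` on the approximation coefficients, producing one detail series per scale plus a final low-frequency trend.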