Similar literature
 20 similar articles found (search time: 328 ms)
1.
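This paper discusses three problems in Chinese text mining: word segmentation, keyword extraction, and text classification. For word segmentation, it introduces the ICTCLAS method, which is based on a cascaded hidden Markov model, and the WDM method, which treats the boundaries between words as missing data and solves for them with the EM algorithm. For keyword extraction, it proposes a Bayes factor method and introduces the CCS method, which uses sparse regression. For text classification, it introduces a method that builds a classifier from keyword frequencies and a method that first fits a topic model and then builds a classifier from the topic probabilities. The methods are compared on two text data sets, and recommendations for their use are given.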

2.
陶朝杰  杨进 《经济数学》2020,37(3):214-220
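Fake reviews are an unavoidable problem in the development of e-commerce. To address class imbalance in online review data, a fake review identification method based on the BalanceCascade-GBDT algorithm is proposed. BalanceCascade progressively shrinks the majority-class sample space by setting the false positive rate of each classifier, and then combines all base classifiers into the final classifier. GBDT is widely used in classification problems for its high accuracy and interpretability and, being an algorithm that is unstable under sample perturbation, is a very suitable base classification model. The model is built on the Yelp review data set, uses the AUC as the evaluation metric, and is compared with logistic regression, random forest, and neural network algorithms; the experiments demonstrate the effectiveness of the method.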
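
The cascade described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are one-dimensional, the base learner is a toy mean-distance scorer standing in for GBDT, and the names `balance_cascade`, `predict`, and `keep_rate` are hypothetical.

```python
import random


def balance_cascade(minority, majority, rounds=4, seed=0):
    """Toy BalanceCascade on 1-D features; returns (ensemble, surviving majority)."""
    rng = random.Random(seed)
    # Fraction of the majority pool kept per round (the per-classifier
    # false positive rate), chosen so the pool shrinks towards len(minority).
    keep_rate = (len(minority) / len(majority)) ** (1.0 / (rounds - 1))
    mu_pos = sum(minority) / len(minority)
    pool = list(majority)
    ensemble = []
    for _ in range(rounds):
        # Balanced training set: all minority points plus an equal-sized
        # random subset of the remaining majority pool.
        subset = rng.sample(pool, min(len(minority), len(pool)))
        mu_neg = sum(subset) / len(subset)

        # Stand-in base learner (NOT GBDT): a positive score means the point
        # is closer to the minority mean than to the majority-subset mean.
        def score(x, a=mu_pos, b=mu_neg):
            return abs(x - b) - abs(x - a)

        ensemble.append(score)
        # Drop the majority points this learner separates most easily and
        # keep only the hardest ones for the next round.
        pool.sort(key=score, reverse=True)
        keep = max(len(minority), round(keep_rate * len(pool)))
        pool = pool[:keep]
    return ensemble, pool


def predict(ensemble, x):
    """Combine the base learners by summing their scores."""
    return sum(score(x) for score in ensemble) > 0
```

With 10 minority and 100 majority points and four rounds, the majority pool in this sketch shrinks from 100 to 46, 21, and then 10 points, so later base learners concentrate on the majority samples closest to the minority class.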

3.
石子烨  梁恒  白峰杉 《计算数学》2014,36(3):325-334
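Data segmentation, whose basic content is the classification and clustering of data, is one of the core problems of data mining and is widely applied in practice. Research on directed network data in particular is at the frontier of the field. However, the asymmetric structure of such problems makes the construction of models and algorithms fundamentally difficult, so relatively few results exist. Drawing on ideas from molecular dynamics, this paper proposes a new class of semi-supervised classification models and algorithms for network data. The algorithm applies not only to undirected network data with symmetric relations but also to directed networks with asymmetric relations. Finally, numerical experiments on journal citation network data show the feasibility and effectiveness of the model and algorithm.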

4.
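Spam product reviews reduce, to some extent, the reference value of review information. This paper aims to build an identification model that removes spam reviews from review text and retains genuine product reviews. First, the characteristics of product reviews are analysed, and 14 features are extracted across four modules: data collection, text preprocessing, mutual information testing, and text representation. Then, exploiting their high complementarity, a combined classifier model based on the KNN and Bayes algorithms is built. Finally, the model is tested by cross-validation on product reviews of the iPhone 6 Plus, giving a correct identification rate of 75.3%, a recall of 82.1%, and an F1 value of 77.5%.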

5.
蒋翠清  梁坤  丁勇  段锐 《运筹与管理》2017,26(2):135-139
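Adaboost-based credit evaluation methods in the online lending setting exhibit high base-classifier disagreement and high sample misclassification cost. Existing research does not consider the effect of disagreement and misclassification cost on the sample weights of the base classifiers, which reduces the effectiveness of online lending credit evaluation results. To address this, a credit evaluation method based on an improved Adaboost is proposed. The method adjusts the sample weighting strategy of the Adaboost model according to the misclassification rate of the base classifiers, the degree of disagreement among the classification results of a sample on different base classifiers, and the misclassification cost of the sample. The improved Adaboost model can thus learn in a targeted way from hard-to-classify samples and samples with high misclassification cost, which improves the effectiveness of online lending credit evaluation results. Experiments on data from the 拍拍贷 (Paipaidai) platform show that the proposed method is significantly better than the traditional Adaboost-based credit evaluation method in terms of classification accuracy and misclassification cost.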
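
The sample reweighting step that this abstract modifies can be illustrated with the standard AdaBoost update. The abstract does not give the improved formula, so the optional `costs` multiplier below is purely a hypothetical stand-in for a cost-aware adjustment, and the disagreement term is omitted; `adaboost_reweight` is an illustrative name.

```python
import math


def adaboost_reweight(weights, y_true, y_pred, costs=None):
    """One AdaBoost round: return (alpha, new normalised sample weights)."""
    if costs is None:
        costs = [1.0] * len(weights)  # plain AdaBoost
    total = sum(weights)
    # Weighted error rate of the current base classifier.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    # Vote of this classifier; requires 0 < err < 1.
    alpha = 0.5 * math.log((1.0 - err) / err)
    new = []
    for w, t, p, c in zip(weights, y_true, y_pred, costs):
        if t != p:
            # Misclassified: increase weight, scaled by the (hypothetical) cost.
            new.append(w * math.exp(alpha * c))
        else:
            # Correctly classified: decrease weight.
            new.append(w * math.exp(-alpha))
    norm = sum(new)
    return alpha, [w / norm for w in new]
```

With four equally weighted samples and one error, the plain update gives the misclassified sample a weight of 0.5; setting its cost to 2 in this sketch pushes that weight higher still, which is the kind of targeted emphasis the abstract describes.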

6.
张文  王强  唐子旭  秦广杰  李健 《运筹与管理》2022,31(11):167-173
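Advances in machine learning have improved the accuracy of online fake review identification, but machine learning models currently lack a sufficient amount of labelled data for training. Based on the generative adversarial network (GAN), this paper proposes a review data set expansion method, GAN-RDE (GAN-Review Dataset Expansion), to address the scarcity of training data in fake review identification. Specifically, the initial review data are first divided into a genuine review data set and a fake review data set, and a GAN is trained on each to generate vectors that follow the feature distributions of genuine and fake reviews, respectively. The vectors obtained from GAN training are then merged with the feature-word vector matrix of the initial review data set to expand the training data. Finally, naive Bayes, a multilayer perceptron, and a support vector machine are used as base classifiers to compare fake review identification before and after data expansion. The experimental results show that, after the review data set is expanded with GAN-RDE, the accuracy of machine learning models in identifying fake reviews improves significantly.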

7.
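In equipment fault diagnosis, text data such as operating instructions and maintenance records are of great practical value, and fully mining and using such data can greatly improve the efficiency of fault diagnosis. Existing studies commonly mine text data with semantic feature extraction and unsupervised clustering to assist fault localization, but such methods usually cannot explain the cause of a fault or justify the maintenance plan they suggest, so the resulting plans are hard to understand. Building on the mature pre-trained language model BERT (bidirectional encoder representations from transformers), the article proposes a fault localization method that combines a BERT-based short-text classification model with a knowledge graph, in order to fully mine and use the knowledge and regularities contained in the text data of railway CIR equipment. The method first determines the faulty module from the functional hierarchy of the CIR equipment, then achieves preliminary fault localization with BERT-based text classification, and finally uses the knowledge graph to further determine information such as the fault cause to assist diagnosis. Maintenance plans derived from the fault diagnosis knowledge accumulated in the knowledge graph are easy for maintenance personnel to understand and help with knowledge management and engineering efficiency. For text classification, the article runs experiments on fault maintenance ledger records of railway CIR equipment; the results show that the BERT-based short-text classification model performs considerably better than traditional classification models. In terms of fault diagnosis ...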

8.
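E-commerce has become a national strategic emerging industry; it not only drives China's economic growth but has also changed the way people live. Statistical analysis of the opinion information in product reviews on e-commerce platforms is important for understanding what consumers care about in a product, improving the platform shopping experience, and prompting manufacturers to improve and upgrade products. In the Internet era, data types have expanded from purely structured data to unstructured data such as text and images. Text mining is the process of processing and analysing large amounts of unstructured data. Opinion mining adds artificial intelligence on top of text mining and can analyse the opinion information in text data more effectively. Using user reviews of the Meizu MX3 on JD.com (京东商城) as the base data, the article applies the conditional random field model from opinion mining and adds to the model a feature indicating whether a sentence is evaluative, which improves the performance of the conditional random field model. Comparative experiments verify the effectiveness of the feature, and the opinion information is then classified and visualised.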

9.
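This paper studies Bayesian classification with normal mixture models. Bayesian classification uses maximum posterior probability as its criterion, and computing the posterior requires estimating the relevant conditional distributions. When continuous data are classified, the data are a mixture of several classes and are hard to describe with a single distribution; a mixture model is then a good choice and can be fitted with the EM algorithm. Simulation experiments show that Bayesian classification based on normal mixture models is feasible and effective. When there are many features, different features affect classification differently. The paper applies the normal-mixture Bayesian classification method to each feature to build a base classifier, then uses ensemble learning: the AdaBoost algorithm assigns a weight to each classifier, and the classifiers are combined linearly into the final classifier. Validation on the Wine Data Set from the UCI repository shows that combining the proposed classification method with ensemble learning yields high accuracy and stable classification.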
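
The maximum-posterior rule with mixture class-conditional densities can be sketched for a single feature as below. This is an illustration only: the mixture parameters are assumed to be already fitted (the EM step is not shown), and the function names and the example parameters are hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """components: list of (weight, mean, std) with weights summing to 1."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)


def bayes_classify(x, priors, mixtures):
    """Return the class with the largest prior * class-conditional density."""
    return max(priors, key=lambda c: priors[c] * mixture_pdf(x, mixtures[c]))
```

For example, if class "A" is a two-component mixture centred at -2 and 2 and class "B" is a single narrow component at 0, a point at 0 is assigned to "B" and a point at 2 to "A", which a single normal per class could not express.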

10.
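Based on all posts from Tianya Zatan (天涯杂谈) in 2015, text mining is performed on the post titles; the titles are classified with an LDA topic model and topic ratios are computed. Word frequencies are then counted for the top 100 posts by each of four indicators (clicks, replies, reply-to-click ratio, and sustained popularity), giving the TOP100 hot posts for each indicator. The topic ratios of the TOP100 hot posts are further compared with those of all posts. The results capture the topics that Tianya users paid most attention to in 2015 and, combined with sentiment analysis, clearly outline the direction of online public opinion and the attitudes of users in the Tianya Zatan section.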

11.
Extreme learning machine (ELM) is not only an effective classifier in supervised learning but can also be applied to unsupervised and semi-supervised learning. The unsupervised extreme learning machine (US-ELM) and the semi-supervised extreme learning machine (SS-ELM) have the same model structure as ELM; they differ in the cost function. We introduce a kernel function into US-ELM and propose the unsupervised extreme learning machine with kernel (US-KELM); SS-KELM has also been proposed. Wavelet analysis has the characteristics of multivariate interpolation and sparse change, and wavelet kernel functions have been widely used in support vector machines. Therefore, to combine the wavelet kernel function with US-ELM and SS-ELM, this paper proposes the unsupervised extreme learning machine with wavelet kernel function (US-WKELM) and the semi-supervised extreme learning machine with wavelet kernel function (SS-WKELM). The experimental results show the feasibility and validity of US-WKELM and SS-WKELM in clustering and classification.
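
The abstract does not state which wavelet kernel is used. As an illustration only, the sketch below implements the Morlet-type wavelet kernel commonly used with support vector machines, K(x, y) = prod_i cos(1.75 u_i) exp(-u_i^2 / 2) with u_i = (x_i - y_i) / a; the function name and the default dilation `a` are assumptions.

```python
import math


def wavelet_kernel(x, y, a=1.0):
    """Morlet-type wavelet kernel between two equal-length vectors."""
    k = 1.0
    for xi, yi in zip(x, y):
        u = (xi - yi) / a  # difference scaled by the dilation parameter
        k *= math.cos(1.75 * u) * math.exp(-0.5 * u * u)
    return k
```

The kernel equals 1 for identical inputs and decays with oscillation as the inputs move apart, with `a` controlling how quickly.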

12.
In many classification applications and face recognition tasks, unlabelled data are available for training along with labelled samples. The use of unlabelled data can improve the performance of a classifier. In this paper, a semi-supervised growing neural gas is proposed for learning with such partly labelled datasets in face recognition applications. The classifier is first trained on the labelled data; unlabelled data are then gradually classified and added to the training data, the classifier is retrained, and so on. The proposed iterative algorithm conforms to the EM framework and is demonstrated, on both artificial and real datasets, to significantly boost the classification rate with the use of unlabelled data. The improvement is particularly great when the labelled dataset is small. Comparison with support vector machine classifiers is also given. The algorithm is computationally efficient and easy to implement.
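
The train, classify, add, and retrain loop in this abstract can be sketched as generic self-training. This is not the growing neural gas of the paper: a nearest-centroid classifier on one-dimensional data stands in for it, at least two classes are assumed, and `centroids`, `self_train`, and `batch` are hypothetical names.

```python
def centroids(points, labels):
    """Mean of the 1-D points of each class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def self_train(labelled, unlabelled, batch=2):
    """labelled: list of (x, label); unlabelled: list of x. Needs >= 2 classes."""
    train = list(labelled)
    pool = list(unlabelled)
    while pool:
        cents = centroids(*zip(*train))  # retrain on the current training set

        def margin(x):
            # Confidence: gap between the two nearest class centroids.
            d = sorted(abs(x - c) for c in cents.values())
            return d[1] - d[0]

        # Move the most confidently classified points into the training set.
        pool.sort(key=margin, reverse=True)
        chosen, pool = pool[:batch], pool[batch:]
        for x in chosen:
            label = min(cents, key=lambda y: abs(x - cents[y]))
            train.append((x, label))
    return centroids(*zip(*train)), train
```

Adding the most confident points first is one common design choice; it limits the risk that early mislabelled points pull the classifier in the wrong direction.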

13.
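This paper studies the convergence of the intrinsic steepest descent algorithm on the group of symmetric positive definite matrices. First, for a semi-supervised metric learning model that can be converted into an unconstrained optimization problem on the symmetric positive definite matrix group, an intrinsic steepest descent algorithm with adaptive variable step size on this group is proposed. Then, using the Taylor expansion with integral remainder of a smooth function on a Lie group at an arbitrary point, the proposed algorithm is proved to converge linearly on the symmetric positive definite matrix group. Finally, numerical experiments on classification problems illustrate the effectiveness of the algorithm.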

14.
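Driven by the substantial economic value of recommender systems in e-commerce, malicious users carry out shilling attacks for illegal profit, manipulating recommendation results and exposing recommender systems to serious information security threats. Identifying and detecting shilling attacks has therefore become key to securing recommender systems. The traditional support vector machine (SVM) approach is constrained by both small sample size and data imbalance. To address this, a shilling attack detection method combining semi-supervised SVM with an asymmetric ensemble strategy is proposed. First, an initial SVM is trained; then the K-nearest-neighbour method is introduced to improve the labelling quality of samples near the decision surface, and a mixed sample set of labelled and unlabelled data is used to reduce the need for labelled data. Finally, an asymmetric weighted ensemble strategy is designed that focuses on the classification accuracy of attack samples and reduces the sensitivity of the ensemble classifier to data imbalance. Experimental results show that the method effectively handles the small-sample and imbalanced-distribution problems and achieves good detection performance.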

15.
Previous studies on financial distress prediction (FDP) mostly construct FDP models based on a balanced data set, or only use traditional classification methods for FDP modelling based on an imbalanced data set, which often results in an overestimation of an FDP model's recognition ability for distressed companies. Our study focuses on support vector machine (SVM) methods for FDP based on imbalanced data sets. We propose a new imbalance-oriented SVM method that combines the synthetic minority over-sampling technique (SMOTE) with the Bagging ensemble learning algorithm and uses SVM as the base classifier. It is named the SMOTE-Bagging-based SVM-ensemble (SB-SVM-ensemble) and is theoretically more effective for FDP modelling based on imbalanced data sets with a limited number of samples. For comparison, the traditional SVM method and three classical imbalance-oriented SVM methods (cost-sensitive SVM, SMOTE-SVM, and data-set-partition-based SVM-ensemble) are also introduced. We collect an imbalanced data set for FDP from Chinese publicly traded companies and carry out 100 experiments to empirically test the effectiveness of the proposed method. The experimental results indicate that the new SB-SVM-ensemble method outperforms the traditional methods and is a useful tool for imbalanced FDP modelling.
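
The SMOTE step named in this abstract creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that step, under simple assumptions (Euclidean distance, points as tuples, at least two minority points); the Bagging and SVM stages are not shown, and the function and parameter names are illustrative.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbours of the base point (squared distance).
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.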

16.
Multi-category classification algorithms play an important role in both the theory and practice of machine learning. In this paper, we consider an approach to multi-category classification based on minimizing a convex surrogate of the nonstandard misclassification loss. We bound the excess misclassification error by the excess convex risk. We construct an adaptive procedure to search for the classifier and furthermore obtain its convergence rate to the Bayes rule.

17.
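Solving the problem of imbalanced data classification is of far-reaching practical significance. The Mahalanobis-Taguchi system (MTS) uses a single normal class to construct the reference space and the measurement scale, and builds a data classification model from them, which makes it well suited to imbalanced data classification. Based on the traditional MTS method and combining the signal-to-noise ratio with classification accuracy measures such as F-value and G-mean, this paper builds a reference space optimization model based on a genetic algorithm and, together with the Bagging ensemble algorithm, constructs an improved MTS algorithm, GBMTS. Experiments with different classification methods on relevant data sets show that GBMTS handles imbalanced data classification more effectively than the other classification algorithms.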
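
The reference space built from the normal class alone can be illustrated with the scaled Mahalanobis distance at the core of MTS. This sketch assumes two features, standardises with the population standard deviation (so the average distance over the reference samples is exactly 1), and leaves out the genetic algorithm and Bagging parts of GBMTS; all names are hypothetical.

```python
import math


def fit_reference(normal):
    """Estimate means, standard deviations, and correlation from normal samples."""
    n = len(normal)
    m1 = sum(p[0] for p in normal) / n
    m2 = sum(p[1] for p in normal) / n
    s1 = math.sqrt(sum((p[0] - m1) ** 2 for p in normal) / n)
    s2 = math.sqrt(sum((p[1] - m2) ** 2 for p in normal) / n)
    r = sum((p[0] - m1) * (p[1] - m2) for p in normal) / (n * s1 * s2)
    return m1, m2, s1, s2, r


def scaled_md(point, ref):
    """Mahalanobis distance z^T R^-1 z divided by the number of features (2)."""
    m1, m2, s1, s2, r = ref
    z1 = (point[0] - m1) / s1
    z2 = (point[1] - m2) / s2
    # Closed-form inverse of the 2x2 correlation matrix [[1, r], [r, 1]].
    return (z1 * z1 - 2.0 * r * z1 * z2 + z2 * z2) / ((1.0 - r * r) * 2.0)
```

A new observation whose scaled distance is far above 1 lies outside the reference space and would be flagged as abnormal; choosing that threshold is a separate step not covered here.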

18.
In credit scoring, low-default portfolios (LDPs) are those for which very little default history exists. This makes it problematic for financial institutions to estimate a reliable probability of a customer defaulting on a loan. Banking regulation (Basel II Capital Accord) and best practice, however, necessitate an accurate and valid estimate of the probability of default. In this article, the suitability of semi-supervised one-class classification (OCC) algorithms as a solution to the LDP problem is evaluated. The performance of OCC algorithms is compared with the performance of supervised two-class classification algorithms. This study also investigates the suitability of oversampling, which is a common approach to dealing with LDPs. Assessment of the performance of one- and two-class classification algorithms using nine real-world banking data sets, which have been modified to replicate LDPs, is provided. Our results demonstrate that only in the near or complete absence of defaulters should semi-supervised OCC algorithms be used instead of supervised two-class classification algorithms. Furthermore, we demonstrate for data sets whose class labels are unevenly distributed that optimising the threshold value on classifier output yields, in many cases, an improvement in classification performance. Finally, our results suggest that oversampling produces no overall improvement to the best performing two-class classification algorithms.
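
Optimising the threshold on classifier output, as mentioned in this abstract, can be sketched as a sweep over candidate cut-offs. The abstract does not say which performance measure is optimised; F1 on the positive (default) class is used below as an assumption, and `best_threshold` is an illustrative name.

```python
def best_threshold(scores, labels):
    """Return (threshold, F1) maximising F1 when predicting 1 for score >= threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):  # every distinct score is a candidate
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

On imbalanced data the scores of the rare class often sit well below 0.5, so a default cut-off of 0.5 may predict no positives at all, whereas a tuned threshold can recover them.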

19.
Unsupervised classification is a highly important task in machine learning. Although it has achieved great success in supervised classification, the support vector machine (SVM) is much less utilized to classify unlabeled data points, and SVM-based approaches to this task have many drawbacks, including sensitivity to nonlinear kernels and random initializations, high computational cost, and unsuitability for imbalanced datasets. In this paper, to utilize the advantages of SVM and overcome the drawbacks of SVM-based clustering methods, we propose a completely new two-stage unsupervised classification method with no initialization: a new unsupervised kernel-free quadratic surface SVM (QSSVM) model is proposed to avoid selecting kernels and related kernel parameters, and a golden-section algorithm is then designed to generate the appropriate classifier for balanced and imbalanced data. By studying certain properties of the proposed model, a convergent decomposition algorithm is developed to implement this non-convex QSSVM model effectively and efficiently (in terms of computational cost). Numerical tests on artificial and public benchmark data indicate that the proposed unsupervised QSSVM method outperforms well-known clustering methods (including SVM-based and other state-of-the-art methods), particularly in terms of classification accuracy. Moreover, we extend and apply the proposed method to credit risk assessment by incorporating T-test-based feature weights. The promising numerical results on benchmark personal credit data and real-world corporate credit data strongly demonstrate the effectiveness, efficiency, and interpretability of the proposed method and indicate its significant potential in certain real-world applications.
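
The abstract does not say which quantity its golden-section algorithm searches over. As background only, the sketch below is the textbook golden-section search for the minimum of a unimodal function on an interval; how the paper applies it to classifier selection is not shown, and the function name and tolerance are assumptions.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-6):
    """Approximate the minimiser of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1 / golden ratio, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0
```

Each iteration reuses one of the two previous function evaluations, so only one new evaluation is needed per step, which matters when each evaluation means fitting a model.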

20.
In this paper, we propose a kernel-free semi-supervised quadratic surface support vector machine model for binary classification. The model is formulated as a mixed-integer programming problem, which is equivalent to a non-convex optimization problem with absolute-value constraints. Using relaxation techniques, we derive a semi-definite programming problem for semi-supervised learning. By solving this problem, the proposed model is tested on some artificial and public benchmark data sets. Preliminary computational results indicate that the proposed method outperforms some existing well-known methods for solving semi-supervised support vector machines with a Gaussian kernel in terms of classification accuracy.
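
A kernel-free quadratic surface classifier separates the classes directly with a quadratic function in the input space. The sketch below assumes the form f(x) = 0.5 x^T W x + b^T x + c that is common in the quadratic surface SVM literature; the abstract does not spell out the formula, the parameters are taken as given (the semi-definite program that finds them is not shown), and the names are illustrative.

```python
def quadratic_surface(x, W, b, c):
    """Evaluate f(x) = 0.5 * x^T W x + b^T x + c for plain Python lists."""
    n = len(x)
    quad = sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))
    lin = sum(bi * xi for bi, xi in zip(b, x))
    return 0.5 * quad + lin + c


def classify(x, W, b, c):
    """Assign +1 on or outside the surface f(x) = 0 and -1 inside it."""
    return 1 if quadratic_surface(x, W, b, c) >= 0 else -1
```

For instance, with W equal to twice the identity, b zero, and c equal to -1, the surface is the unit circle: points inside it are labelled -1 and points on or outside it +1, with no kernel function or kernel parameter involved.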
