
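Improved Random Forest Algorithm Based on Spark
Cite this article: SUN Yue, YUAN Jian. Improved Random Forest Algorithm Based on Spark[J]. Electronic Science and Technology, 2019, 32(4): 60-64.
Authors: SUN Yue, YUAN Jian
Affiliation: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093
Funding: National Natural Science Foundation of China (61775139)
Abstract: The classical single-machine random forest algorithm cannot meet the demands of processing massive data. To address this problem, the paper designs and implements an improved random forest algorithm using Spark distributed storage and computing technology. First, the importance of each feature is calculated, and the features are divided into public features, unique features, and non-important features. Next, features are selected at random from each feature subspace, in order and in proportion. Finally, experiments on a Spark cluster analyze the classification performance, speedup, and efficiency of the improved algorithm. The results confirm that the improved algorithm makes random forest construction more efficient, can be applied to massive data mining problems, and scales well.

Keywords: random forest; Spark; feature space; ReliefF algorithm; high-dimensional data; classification model
Received: 2018-03-18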
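
The abstract outlines the feature selection step only at a high level, so the following is a minimal sketch of one way it might look, not the authors' implementation. It assumes a hypothetical per-class importance matrix `class_scores` (for example, scores produced by a ReliefF-style procedure) and a hypothetical `threshold`. It also assumes that a "public" feature is one that is important for every class and a "unique" feature is one that is important for only some classes; the abstract does not define these terms. The group proportions are likewise illustrative.

```python
import random


def partition_features(class_scores, threshold):
    """Split feature indices into public, unique, and non-important groups.

    class_scores[c][j] is an assumed importance score of feature j for class c.
    The grouping rule below is an assumption, not taken from the paper.
    """
    n_features = len(class_scores[0])
    public, unique, unimportant = [], [], []
    for j in range(n_features):
        # Count the classes for which feature j clears the threshold.
        hits = sum(1 for row in class_scores if row[j] >= threshold)
        if hits == len(class_scores):
            public.append(j)       # important for every class
        elif hits > 0:
            unique.append(j)       # important for some classes only
        else:
            unimportant.append(j)  # important for no class
    return public, unique, unimportant


def sample_subspace(groups, proportions, m, rng):
    """Draw m features, visiting the groups in order and taking a share from each."""
    chosen = []
    for group, share in zip(groups, proportions):
        # Never take more than the group holds or more than is still needed.
        k = min(len(group), round(share * m), m - len(chosen))
        chosen.extend(rng.sample(group, k))
    # Top up at random if rounding or small groups left a shortfall.
    remaining = [j for g in groups for j in g if j not in chosen]
    shortfall = min(m - len(chosen), len(remaining))
    chosen.extend(rng.sample(remaining, shortfall))
    return chosen


# Hypothetical scores for 2 classes and 6 features.
scores = [
    [0.9, 0.8, 0.1, 0.7, 0.05, 0.2],
    [0.9, 0.1, 0.8, 0.6, 0.10, 0.0],
]
groups = partition_features(scores, threshold=0.5)
subset = sample_subspace(groups, (0.5, 0.25, 0.25), m=4, rng=random.Random(0))
```

In a random forest, a routine like `sample_subspace` would be called once per tree (or per node) so that every candidate feature set draws from all three groups rather than from the full feature space uniformly.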

Improved Random Forest Algorithm Based on Spark
SUN Yue, YUAN Jian. Improved Random Forest Algorithm Based on Spark[J]. Electronic Science and Technology, 2019, 32(4): 60-64.
Authors: SUN Yue, YUAN Jian
Institution: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
Abstract: The classical single-machine random forest algorithm cannot meet the demands of processing massive data. To address this, an improved random forest algorithm was designed and implemented using Spark distributed in-memory computing technology. First, the importance of each feature was calculated and the features were divided into public features, unique features, and non-important features. Then, features were randomly selected from each feature subspace, in order and in proportion. Finally, experiments were run on a Spark cluster to analyze the classification performance, speedup ratio, and efficiency of the improved random forest algorithm. The results showed that the improved algorithm increased the efficiency of random forest construction, can be applied to massive data mining problems, and scales well.
Keywords: random forest; Spark; feature space; ReliefF algorithm; high-dimensional data; classification model
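
The abstract reports speedup ratio and efficiency without stating how they were computed. The sketch below uses the conventional definitions from parallel computing, which are assumed here rather than taken from the paper: speedup is the single-node run time divided by the run time on the cluster, and efficiency is speedup divided by the number of nodes. The timings in the usage lines are made-up placeholders, not results from the paper.

```python
def speedup(t_single, t_parallel):
    """Conventional speedup: single-node run time over cluster run time."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, n_nodes):
    """Conventional parallel efficiency: speedup per node."""
    return speedup(t_single, t_parallel) / n_nodes


# Placeholder timings in seconds, for illustration only.
s = speedup(120.0, 40.0)
e = efficiency(120.0, 40.0, n_nodes=4)
```

An efficiency close to 1 would indicate that adding nodes shortens training time almost proportionally, which is one common way to quantify the scalability the abstract claims.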
This article is indexed in Wanfang Data and other databases.