Improved Random Forest Algorithm Based on Spark


Cite this article:	SUN Yue, YUAN Jian. Improved Random Forest Algorithm Based on Spark[J]. Electronic Science and Technology, 2019, 32(4): 60-64.

Authors:	SUN Yue, YUAN Jian

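Affiliation:	School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China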

Funding:	National Natural Science Foundation of China (61775139)

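Abstract:	Classical single-machine random forest algorithms cannot meet the demands of massive data processing. To address this, the paper designs and implements an improved random forest algorithm using Spark distributed storage and computing technology. First, the importance of each feature is calculated, and the features are divided into public features, unique features, and non-important features. Next, features are randomly selected from each feature subspace, in order and in proportion. Finally, experiments on a Spark cluster analyze the classification performance, speedup, and efficiency of the improved random forest algorithm. The results confirm that the improved algorithm raises the efficiency of random forest construction, can be applied to massive data mining problems, and scales well.
Keywords:	random forest; Spark; feature space; ReliefF algorithm; high-dimensional data; classification model
Received:	2018-03-18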
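
The abstract states that features are drawn from the public, unique, and non-important subspaces "in order and in proportion", but does not give the proportions or the exact rule. The following Python sketch shows one way such a draw could work. The function name `sample_feature_subset`, the default proportions `(0.5, 0.3, 0.2)`, and the rule for handling a subspace that is too small are all assumptions made for illustration, not details taken from the paper. The partition itself (for example, from ReliefF scores) is taken as given input.

```python
import random


def sample_feature_subset(public, unique, non_important, n_select,
                          proportions=(0.5, 0.3, 0.2), rng=None):
    """Draw up to n_select feature indices, visiting the subspaces in order.

    The proportions are hypothetical; the abstract does not state them.
    """
    if rng is None:
        rng = random.Random()
    groups = [list(public), list(unique), list(non_important)]
    selected = []
    remaining = n_select
    for i, (group, share) in enumerate(zip(groups, proportions)):
        if i == len(groups) - 1:
            # Assumption: the last subspace absorbs any shortfall.
            quota = remaining
        else:
            quota = round(n_select * share)
        # Never take more than is still needed or than the subspace holds.
        quota = min(quota, remaining, len(group))
        # Sample without replacement inside this subspace.
        selected.extend(rng.sample(group, quota))
        remaining -= quota
    return selected


# Usage with a made-up partition of 40 feature indices.
public = list(range(0, 10))
unique = list(range(10, 20))
non_important = list(range(20, 40))
subset = sample_feature_subset(public, unique, non_important, 10,
                               rng=random.Random(0))
print(sorted(subset))
```

With these placeholder settings each tree would receive five public, three unique, and two non-important features, so the informative subspaces are always represented while some randomness across trees is preserved.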
Improved Random Forest Algorithm Based on Spark

SUN Yue, YUAN Jian. Improved Random Forest Algorithm Based on Spark[J]. Electronic Science and Technology, 2019, 32(4): 60-64.

Authors:	SUN Yue, YUAN Jian

Institution:	School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Abstract:	Because the classical single-machine random forest algorithm cannot meet the demands of processing massive data, an improved random forest algorithm was designed and implemented using Spark distributed memory computing technology. First, the importance of each feature was calculated, and the features were divided into public features, unique features, and non-important features. Then, features were randomly selected from each feature subspace, in order and in proportion. Finally, experiments were performed on a Spark cluster to analyze the classification performance, speedup ratio, and efficiency of the improved random forest algorithm. The results demonstrated that the improved algorithm improves the efficiency of random forest construction and can be used to solve massive data mining problems with good scalability.

Keywords:	random forest; Spark; feature space; ReliefF algorithm; high-dimensional data; classification model
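
The abstract reports speedup ratio and efficiency without defining them. Assuming the conventional parallel-computing definitions (speedup is single-node time divided by cluster time, and efficiency is speedup divided by the number of nodes), the two metrics could be computed as in the sketch below. The function names and the timing values are placeholders chosen for illustration and are not results from the paper.

```python
def speedup(t_single, t_parallel):
    """Speedup ratio: single-node run time divided by cluster run time."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_single / t_parallel


def efficiency(t_single, t_parallel, n_nodes):
    """Parallel efficiency: speedup divided by the number of nodes."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return speedup(t_single, t_parallel) / n_nodes


# Placeholder timings in seconds (not measurements from the paper).
t1, t8 = 100.0, 25.0
print(speedup(t1, t8))        # 4.0
print(efficiency(t1, t8, 8))  # 0.5
```

An efficiency close to 1 would indicate near-linear scaling, while lower values reflect communication and scheduling overhead on the cluster.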
This article is indexed in Wanfang Data and other databases.