A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture


Citation:	QU Xinyuan, XU Yu. A Parallelism Strategy Optimization Search Algorithm Based on Three-dimensional Deformable CNN Acceleration Architecture[J]. Journal of Electronics & Information Technology, 2022, 44(4): 1503-1512.

Authors:	QU Xinyuan, XU Yu

Institution:	1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China

Funding:	National Natural Science Foundation of China; Beijing Municipal Science and Technology Major Project

Received:	2021-01-08

Keywords:	field programmable gate array (FPGA); convolutional neural network (CNN); hardware acceleration

Abstract:	Field Programmable Gate Arrays (FPGAs) are widely used for hardware acceleration of Convolutional Neural Networks (CNNs). To improve accelerator performance, Qu et al. (2021) proposed a three-dimensional transformable CNN acceleration structure. However, this structure causes the parallelism exploration space to grow explosively, so the time needed to search for the optimal parallelism surges, which severely reduces the feasibility of implementing the accelerator. To address this, this paper proposes a fine-grained, iteratively optimized parallelism search algorithm. Through multiple rounds of iterative data filtering, the algorithm efficiently eliminates redundant parallelism schemes and compresses the search space by more than 99%. It also applies pruning to remove invalid computation branches, reducing the computation time from the order of 10^6 h to under 10 s. The algorithm is applicable to FPGA chips of different specifications and models, and the optimal parallelism schemes it finds achieve an average computing resource utilization (R1, R2) of (0.957, 0.962) across different chips.
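
The abstract names two ideas: filtering out redundant parallelism schemes, and pruning invalid branches during the search. The Python sketch below illustrates both on a toy problem. It is not the authors' algorithm. The layer sizes, the multiplier budget, the tiling cost model, and every function name are assumptions introduced only for illustration. A factor is treated as redundant when the next smaller factor gives the same tile counts on every layer, since it would cost more resources for no gain. A branch is pruned as soon as the product of the factors exceeds the budget.

```python
from itertools import product
from math import ceil

# Hypothetical workload: sizes of the three parallelizable loop dimensions per layer.
LAYERS = [(16, 3, 28), (32, 16, 14), (64, 32, 7)]
BUDGET = 64  # assumed number of multiplier units available on the chip


def cycles(p):
    """Assumed cost model: each dimension is tiled by its parallelism factor."""
    return sum(ceil(a / p[0]) * ceil(b / p[1]) * ceil(c / p[2]) for a, b, c in LAYERS)


def useful_factors(dim):
    """Filtering step: keep factor f only if it changes some layer's tile count
    compared with f - 1; otherwise f - 1 is equally fast and cheaper."""
    sizes = [layer[dim] for layer in LAYERS]

    def tiles(f):
        return tuple(ceil(s / f) for s in sizes)

    limit = min(max(sizes), BUDGET)
    return [f for f in range(1, limit + 1) if f == 1 or tiles(f) != tiles(f - 1)]


def pruned_search():
    """Search the filtered factors, cutting branches that exceed the budget."""
    fa, fb, fc = (useful_factors(d) for d in range(3))
    best_cost, best_p, visited = float("inf"), None, 0
    for pa in fa:
        for pb in fb:
            if pa * pb > BUDGET:
                break  # factors ascend, so every later pb is infeasible too
            for pc in fc:
                if pa * pb * pc > BUDGET:
                    break  # same argument for pc
                visited += 1
                cost = cycles((pa, pb, pc))
                if cost < best_cost:
                    best_cost, best_p = cost, (pa, pb, pc)
    return best_cost, best_p, visited


def brute_force():
    """Reference: evaluate every feasible triple with no filtering."""
    feasible = [p for p in product(range(1, BUDGET + 1), repeat=3)
                if p[0] * p[1] * p[2] <= BUDGET]
    return min(map(cycles, feasible)), len(feasible)


if __name__ == "__main__":
    cost, scheme, visited = pruned_search()
    ref_cost, n_feasible = brute_force()
    print("best scheme:", scheme, "cycles:", cost)
    print("schemes evaluated:", visited, "of", n_feasible, "feasible")
```

Because a dropped factor is always matched by a cheaper kept factor with identical tile counts, the filtered search reaches the same minimum cycle count as the exhaustive reference under this toy cost model. The reduction it achieves here says nothing about the 99% figure reported in the abstract.
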
This article is indexed in Wanfang Data and other databases.