Similar Literature
1.
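Coarse-grained reconfigurable architectures have a clear advantage in energy efficiency, but the power cost of storing and transferring their instructions is too high. Experiments show that instructions are markedly similar to one another, so this paper proposes a compression technique based on instruction similarity. By compressing, transferring, and decompressing instructions, it optimizes the power and area of the architecture without reducing performance. An instruction distribution model and an instruction register model are proposed for homogeneous and heterogeneous platforms, respectively. Combined with compilation strategy optimization, the approach improves area efficiency by 36% and 181% and power efficiency by 33% and 118%, respectively, compared with two conventional structures.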
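The abstract does not say how the similarity is encoded, so the following is only a minimal sketch of the general idea, assuming 32-bit instruction words and a hypothetical XOR-delta scheme: each word is stored as its difference from the previous word, so similar consecutive instructions produce deltas with few set bits. The names `compress`, `decompress`, and `changed_bits` are illustrative, not taken from the paper.

```python
from typing import List

WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit instruction words


def compress(instructions: List[int]) -> List[int]:
    """XOR-delta encode: the first word is kept verbatim (XOR with 0),
    every later word is stored as its XOR with the previous word."""
    deltas = []
    prev = 0
    for word in instructions:
        deltas.append((word ^ prev) & WORD_MASK)
        prev = word
    return deltas


def decompress(deltas: List[int]) -> List[int]:
    """Invert compress() by accumulating the XOR deltas."""
    words = []
    prev = 0
    for delta in deltas:
        prev = (prev ^ delta) & WORD_MASK
        words.append(prev)
    return words


def changed_bits(deltas: List[int]) -> int:
    """Count set bits across the deltas: a rough proxy for how much
    information actually differs between consecutive instructions."""
    return sum(bin(d).count("1") for d in deltas)
```

With three hypothetical words that differ only in their low bits, such as `0x12340001`, `0x12340002`, and `0x12340003`, the deltas after the first word have only a handful of set bits, which is the property a similarity-based compressor would exploit.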

2.
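Current research on loop mapping for coarse-grained reconfigurable architectures focuses mainly on operation placement and temporary data routing, and rarely considers data mapping. This paper proposes a modulo scheduling mapping flow based on memory partitioning and path reuse. First, fine-grained memory partitioning finds a suitable data mapping and increases the parallelism of data accesses. Modulo scheduling then determines operation placement and temporary data routing. Finally, a routing cost model balances the use of memory routing and processing-element routing, and a path reuse strategy optimizes routing resources. Experimental results show that the method performs well in terms of loop initiation interval, instructions per cycle, and execution latency.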

3.
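The interconnection network is very important in a coarse-grained reconfigurable array (CGRA) and strongly affects its performance, area, and power. To reduce the area overhead and power caused by the interconnection network and to improve CGRA performance, this paper proposes a self-routing, non-blocking interconnection network built on a hierarchical network topology. Through this network, any pair of processing elements can establish a connection and exchange data, and the connection is self-routing and non-blocking. Experimental results show that, compared with existing structures, the proposed structure achieves an overall performance improvement of up to 46.2% at the cost of at most 14.1% additional area.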

4.
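This paper first describes the characteristics of coarse-grained configurable computing architectures in data processing. Based on these characteristics, it proposes the CTaiJi coarse-grained configurable architecture and describes each of its components in detail. The system architecture combines ring and mesh structures, and the PEZ (processing element group) is well suited to loops and conditional statements. Loops, conditionals, and simple arithmetic operations can be implemented together in a single PEZ, which facilitates real-time configuration and raises PE (processing element) utilization. The mapping of common digital signal processing algorithms onto this architecture is also presented.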

5.
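In video encoding and decoding, the discrete cosine transform (DCT) is a crucial stage that determines the quality and efficiency of video compression. For the 8×8 two-dimensional DCT, this paper proposes a hardware circuit structure based on a coarse-grained reconfigurable array (CGRA). Exploiting the reconfigurability of the array, it supports the 8×8 2D DCT of multiple video compression coding standards on a single platform. Experimental results show that the structure processes 8 pixels in parallel per clock cycle, with a throughput of up to 1.157×10^9 pixels/s. Compared with existing structures, design efficiency and power efficiency improve by up to 4.33 times and 12.3 times, respectively, and the structure can decode 4096×2048 video sequences in 4:2:0 format at up to 30 frames/s.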
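The figures in this abstract can be cross-checked with a little arithmetic. The sketch below takes the throughput as 1.157×10^9 pixels/s and the frame size as 4096×2048, and assumes that 4:2:0 video carries 1.5 samples per pixel (one luma sample plus two quarter-resolution chroma planes). The function names are illustrative, and the implied clock frequency is a derived value, not a number stated in the abstract.

```python
PIXELS_PER_CYCLE = 8        # stated in the abstract
PEAK_THROUGHPUT = 1.157e9   # pixels/s, stated in the abstract


def implied_clock_hz(throughput: float, pixels_per_cycle: int) -> float:
    """Clock frequency needed to sustain the throughput at a fixed
    number of pixels per cycle."""
    return throughput / pixels_per_cycle


def required_sample_rate(width: int, height: int, fps: int,
                         samples_per_pixel: float = 1.5) -> float:
    """Samples per second for a video stream; 1.5 samples per pixel
    corresponds to 4:2:0 chroma subsampling."""
    return width * height * samples_per_pixel * fps


clock = implied_clock_hz(PEAK_THROUGHPUT, PIXELS_PER_CYCLE)
demand = required_sample_rate(4096, 2048, 30)
print(f"implied clock: {clock / 1e6:.3f} MHz")
print(f"4096x2048 4:2:0 @ 30 fps needs {demand / 1e6:.1f} Msamples/s")
print(f"headroom vs. peak throughput: {PEAK_THROUGHPUT / demand:.2f}x")
```

Under these assumptions the transform stage alone has roughly three times the sample rate that the stated video format requires; the abstract does not say which other stages bound the 30 frames/s figure.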

6.
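Excessive configuration time is a major factor limiting the overall performance of reconfigurable systems, and suitable task scheduling can effectively reduce it. For coarse-grained dynamically reconfigurable systems (CGDRS) and stream applications with data dependencies, this paper proposes a three-dimensional task scheduling model. Based on this model, a task scheduling algorithm built on a preconfiguration strategy (CPSA) is first designed. Then, according to the configuration reusability between tasks, interval configuration reuse and continuous configuration reuse strategies are proposed and used to improve CPSA. Experimental results show that CPSA effectively resolves scheduling deadlock, reduces the execution time of stream applications, and raises the scheduling success rate. Compared with other scheduling algorithms, it reduces stream application execution time by 6.13% to 19.53% on average.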

7.
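Hardware Trojan detection has become a research hotspot in chip security. Most existing detection algorithms target ASIC and FPGA circuits and rely on a golden chip free of hardware Trojans, which makes them ill-suited to coarse-grained reconfigurable array circuits composed of large numbers of reconfigurable units. Based on the structural characteristics of coarse-grained reconfigurable cryptographic arrays, this paper therefore proposes a hardware Trojan detection algorithm using partitioning and multi-variant logic fingerprints. The algorithm divides the circuit into multiple regions, uses logic fingerprint features as region identifiers, and compares the multi-variant logic fingerprints of the partitions in both the temporal and spatial dimensions, achieving hardware Trojan detection and diagnosis without a golden chip. Experimental results show that the proposed algorithm has a high detection success rate and a low misjudgment rate.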

8.
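Design of a coarse-grained reconfigurable unit for video processing   Total citations: 1, self-citations: 1, citations by others: 0
For video processing algorithms, a coarse-grained reconfigurable processing unit is designed. It performs addition, subtraction, multiplication, multiply-add, and absolute difference of two numbers on 8-bit data, and effectively supports computation-intensive video processing algorithms. The unit is designed in Verilog and synthesized with DC in a 0.18 μm CMOS process; its area is 97,913 μm², its critical path is 4.51 ns, and its total dynamic power is 4.2 mW. One 2D-DCT of an 8×8 pixel block and one full-search block-matching MAD computation take 10 and 15 clock cycles, respectively.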
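The absolute-difference operation listed above is the building block of the MAD (mean absolute difference) criterion used in block matching. The following is a plain software sketch of that criterion only, assuming blocks are given as lists of rows of 8-bit sample values; it says nothing about how the unit schedules the work into 15 clock cycles, and the names `abs_diff` and `mad` are illustrative.

```python
from typing import List

Block = List[List[int]]  # rows of 8-bit sample values (assumed layout)


def abs_diff(a: int, b: int) -> int:
    """Absolute difference of two samples, the primitive the abstract
    lists among the unit's operations."""
    return a - b if a >= b else b - a


def mad(current: Block, candidate: Block) -> float:
    """Mean absolute difference between two equally sized blocks."""
    total = 0
    count = 0
    for row_cur, row_cand in zip(current, candidate):
        for a, b in zip(row_cur, row_cand):
            total += abs_diff(a, b)
            count += 1
    return total / count
```

A full-search matcher would evaluate `mad` for every candidate position in the search window and keep the position with the smallest value.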

9.
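The inherent parallelism of media processing algorithms has driven media processors toward computing-array architectures. After analyzing how algorithm mapping affects circuit execution, this work combines array design with algorithm mapping, proposes a pipelined mapping scheme for using the array effectively, and analyzes the effect of this mapping on system performance. On this basis, a pipeline model is extracted using the IDCT algorithm in H.264 as an example, and a coarse-grained reconfigurable array is designed from the model. Experimental results show that the array has clear advantages in power, speed, and device utilization, and has good application value.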

10.
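A fast design space search method for reconfigurable architectures   Total citations: 1, self-citations: 0, citations by others: 1
Based on an evaluation model for reconfigurable architectures, this work studies methods for estimating the area, performance, and power of a reconfigurable architecture at the algorithm level. The design space is searched in two steps according to area, performance, and power. First, for each architecture in the architecture domain, the minimum cost of implementing all the algorithms is found; second, the optimal architecture is searched for in the architecture design space. The method does not depend on any specific architecture, evaluates reconfigurable architectures comprehensively, and quickly obtains a globally optimal search result. Application examples show that the method effectively guides design in the early stage of reconfigurable architecture development.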

11.
With a huge increase in demand for various kinds of compute-intensive applications in electronic systems, researchers have focused on coarse-grained reconfigurable architectures because of their advantages: high performance and flexibility. This paper presents FloRA, a coarse-grained reconfigurable architecture with floating-point support. A two-dimensional array of integer processing elements in FloRA is configured at run-time to perform floating-point operations as well as integer operations. For a chip fabricated in a 130 nm process, the total area overhead due to additional hardware for floating-point operations is about 7.4% compared to the previous architecture, which does not support floating-point operations. The fabricated chip runs at a 125 MHz clock frequency with a 1.2 V power supply. Experiments show an 11.6× speedup on average compared to ARM9 with a vector-floating-point unit for integer-only benchmark programs as well as programs containing floating-point operations. Compared with other similar approaches including XPP and Butter, the proposed architecture shows much higher performance for integer applications, while maintaining about half the performance of Butter for floating-point applications.

12.
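An asynchronous reconfigurable architecture is introduced and the design of its asynchronous reconfigurable unit is studied. By generating the evaluation-completion signal early and implementing the unit's arithmetic circuit in DSDCVS logic, the control circuit of the asynchronous reconfigurable unit is improved. The control circuit is implemented with three-input C-elements. Simulation results show that the asynchronous reconfigurable architecture offers low power and high performance and is suitable for integration as an IP into a system-on-chip to form a low-power, high-performance reconfigurable computing platform.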

13.
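In the early stage of reconfigurable computing chip design, determining the number of each kind of interconnect resource is a key problem. Too few interconnect resources may make some algorithms in the application domain impossible to implement, while too many waste chip area. Based on the characteristics of reconfigurable computing, three types of interconnect resources are analyzed: adjacent connections, routing connections, and near-neighbor connections. By building a stochastic model for interconnect resource estimation, a method is proposed for estimating the number of each kind of interconnect resource in a reconfigurable computing chip. Simulation results show that the method estimates these numbers fairly accurately, thereby guiding interconnect design and reducing design risk.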

14.
In this paper, we propose a methodology for accelerating application segments by partitioning them between reconfigurable hardware blocks of different granularity. Critical parts are sped up on the coarse-grain reconfigurable hardware to meet the timing requirements of application code mapped on the reconfigurable logic. The reconfigurable processing units are embedded in a generic hybrid system architecture which can model a large number of existing heterogeneous reconfigurable platforms. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware is realized by a high-performance data-path that we developed. The methodology mainly consists of three stages: the analysis, the mapping of the application parts onto fine and coarse-grain reconfigurable hardware, and the partitioning engine. A prototype software framework realizes the partitioning flow. In this work, the methodology is validated using five real-life applications. Analytical partitioning experiments show that the speedup relative to the all-FPGA mapping solution ranges from 1.5 to 4.0, while the specified timing constraints are satisfied for all the applications.

15.
This paper describes a novel reconfigurable architecture for digital signal processing (DSP). This architecture consists of a two-level array of cells and interconnections. On the upper level, fundamental DSP operations such as multiplication and addition are mapped onto blocks of 4-bit cells. On the lower level, each cell uses a 4 × 4 matrix of smaller “elements” to perform the necessary computations. Cells also contain pipeline latches for increased throughput. The architecture features a simple VLSI implementation that combines the flexibility of memory elements with the speed of DOMINO logic. Initial prototypes have been fabricated using a modest 0.5-μm CMOS technology. Circuit simulations of the cell in 0.25-μm technology indicate that the design achieves a clock frequency of 200 MHz.

16.
The high efficiency video coding (HEVC) transform algorithm for residual coding uses 2-dimensional (2D) 4×4 transforms with higher precision than H.264's 4×4 transforms, resulting in increased hardware complexity. In this paper, we present a shared architecture that can compute the 4×4 forward discrete cosine transform (DCT) and inverse discrete cosine transform (IDCT) of HEVC using a new mapping scheme in the video processor array structure. The architecture is implemented with only adders and shifts to achieve an area-efficient design. The proposed architecture is synthesized using ISE14.7 and implemented using the BEE4 platform with the Virtex-6 FF1759 LX550T field programmable gate array (FPGA). The result shows that the video processor array structure achieves a maximum operation frequency of 165.2 MHz. The architecture and its implementation are presented in this paper to demonstrate its programmability and high performance.

17.
In this paper, we have analyzed the register complexity of direct-form and transpose-form structures of FIR filter and explored the possibility of register reuse. We find that the direct-form structure involves significantly fewer registers than the transpose-form structure, and it allows register reuse in parallel implementation. We analyze further the LUT consumption and other resources of DA-based parallel FIR filter structures, and find that the input delay unit, coefficient storage unit and partial product generation unit are also shared besides LUT words when multiple filter outputs are computed in parallel. Based on these findings, we propose a design approach and use it to derive a DA-based architecture for a reconfigurable block-based FIR filter, which is scalable for larger block-sizes and higher filter-lengths. Interestingly, the number of registers of the proposed structure does not increase proportionately with the block-size. This is a major advantage for area-delay and energy efficient high-throughput implementation of reconfigurable FIR filters of higher block-sizes. Theoretical comparison shows that the proposed structure for block-size 8 and filter-length 64 involves 60% more flip-flops, 6.2 times more adders, 3.5 times more AND-OR gates, and offers 8 times higher throughput. ASIC synthesis result shows that the proposed structure for block-size 8 and filter-length 64 involves 1.8 times less area-delay product (ADP) and energy per sample (EPS) than the existing design, and it can support 8 times higher throughput. The proposed structure for block sizes 4 and 8, respectively, consumes 38% and 50% less power than the existing structure for the same throughput rates on average for different supply voltages.
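The register-reuse observation can be illustrated in software. The sketch below is a plain direct-form FIR evaluated one block of outputs at a time; it is not the DA-based (distributed arithmetic) architecture of the paper, and the function name `fir_direct_block` is illustrative. It assumes zero initial history. The point it shows is that all outputs in a block read from one shared window of input samples, which is what makes the input delay line shareable in a parallel direct-form implementation.

```python
from typing import List, Sequence


def fir_direct_block(h: Sequence[float], x: Sequence[float],
                     block_size: int) -> List[float]:
    """Direct-form FIR, y[n] = sum_k h[k] * x[n - k], computed
    block_size outputs at a time with zero initial history."""
    n_taps = len(h)
    # Zeros in front so that x[n - k] is defined for the first outputs.
    padded = [0.0] * (n_taps - 1) + list(x)
    y: List[float] = []
    for start in range(0, len(x), block_size):
        stop = min(start + block_size, len(x))
        # One shared window of inputs serves every output in this block.
        window = padded[start:stop + n_taps - 1]
        for i in range(stop - start):
            y.append(sum(h[k] * window[i + n_taps - 1 - k]
                         for k in range(n_taps)))
    return y
```

A block of size L with N taps needs only L + N - 1 input samples in its window rather than L separate N-sample delay lines, so the storage grows additively with the block size in this software view.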

18.
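This paper analyzes multi-sensor visual information processing algorithms and, based on the parallel computing parameter model of a reconfigurable processor, proposes a parallel computing simulation method. In a multicore processor environment, each thread runs on a separate core and the threads execute concurrently. Concurrent threads are used to emulate the operation of the reconfigurable array's processing elements (PEs): OpenMP is called to set up multiple threads that execute in parallel, simulating the computation of the reconfigurable processor on a multicore computer platform. With this method, before a concrete PE interconnection scheme exists, compute cores can stand in for PEs so that the algorithm can be mapped onto the multicore environment. Analyzing the concurrent execution efficiency of the algorithm on the multicore computer then guides optimization of the mapping of visual information algorithms onto the reconfigurable array.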
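The thread-per-PE idea can be sketched in a few lines. The paper uses OpenMP; the example below swaps in Python's standard `concurrent.futures.ThreadPoolExecutor` purely for illustration, with one worker thread standing in for each simulated PE. The names `run_on_pe_array` and `sum_of_squares` and the tile layout are hypothetical. Note that CPython threads do not run CPU-bound code truly in parallel, so this models the structure of the mapping rather than its timing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_on_pe_array(kernel: Callable[[Sequence[int]], int],
                    tiles: List[Sequence[int]],
                    num_pes: int) -> List[int]:
    """Apply kernel to every tile using at most num_pes worker threads,
    each thread playing the role of one simulated PE."""
    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        # map() returns results in the same order as the input tiles.
        return list(pool.map(kernel, tiles))


def sum_of_squares(tile: Sequence[int]) -> int:
    """Toy per-PE kernel: a small multiply-accumulate over one tile."""
    return sum(v * v for v in tile)


if __name__ == "__main__":
    image_tiles = [[1, 2], [3, 4], [5]]
    print(run_on_pe_array(sum_of_squares, image_tiles, num_pes=2))
```

Varying `num_pes` and the tile partitioning is the software analogue of trying different array sizes and mappings before a PE interconnection scheme is fixed.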

19.
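Guo Zhenhua, Wu Yanxia, Zhang Guoyin, Dai Kui. Acta Electronica Sinica, 2016, 44(8): 1956-1961
To address the incomplete support for data parallelism and reuse in the memory systems that current reconfigurable compilation techniques generate when producing loop-pipelined arrays for applications with affine-like array subscripts, this paper proposes a parameterized parallel memory structure template. The template follows a modular design: according to the data access characteristics, it generates a memory structure composed of multi-bank interleaved parallel memory submodules, single-bank serial memory submodules, RAW Buffer cache submodules, and Smart Buffer cache submodules. To generate the memory structure flexibly and fully exploit data parallelism and reuse, the parameter values of the template are computed with a memory access data dependence graph method. Compared with related work, hardware generated from the proposed template achieves higher execution speed while occupying fewer hardware resources.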

20.
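A reconfigurable architecture for high-speed implementation of DES, 3DES, and AES   Total citations: 1, self-citations: 2, citations by others: 1
Gao Nana, Li Zhancai, Wang Qin. Acta Electronica Sinica, 2006, 34(8): 1386-1390
Reconfigurable cryptographic chips improve the security and flexibility of cryptographic chips and have good application prospects. However, the throughput of current reconfigurable cryptographic chips is far below that of dedicated chips, so raising processing speed is the key problem in their design. This paper analyzes the reconfigurability of the common symmetric ciphers DES, 3DES, and AES and, using pipelining, parallel processing, and reconfigurable techniques, proposes a reconfigurable architecture. DES, 3DES, and AES implemented on this architecture reach throughputs of 7 Gbps, 2.3 Gbps, and 1.4 Gbps, respectively, at a 110 MHz operating frequency. Compared with similar designs, the proposed design has a considerable advantage in processing speed and is well suited to reconfigurable cryptographic chip design.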
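The stated throughputs can be turned into per-cycle figures as a sanity check. The sketch below assumes that Gbps means 10^9 bits per second and uses the standard block sizes (64 bits for DES and 3DES, 128 bits for AES); the function names are illustrative, and the derived cycle counts are inferences from the abstract's numbers, not values reported by the paper.

```python
CLOCK_MHZ = 110.0  # operating frequency stated in the abstract


def bits_per_cycle(throughput_gbps: float, clock_mhz: float) -> float:
    """Average number of bits processed per clock cycle."""
    return (throughput_gbps * 1e9) / (clock_mhz * 1e6)


def cycles_per_block(throughput_gbps: float, clock_mhz: float,
                     block_bits: int) -> float:
    """Average number of clock cycles spent per cipher block."""
    return block_bits / bits_per_cycle(throughput_gbps, clock_mhz)


for name, gbps, block_bits in [("DES", 7.0, 64),
                               ("3DES", 2.3, 64),
                               ("AES", 1.4, 128)]:
    print(f"{name}: {bits_per_cycle(gbps, CLOCK_MHZ):.1f} bits/cycle, "
          f"{cycles_per_block(gbps, CLOCK_MHZ, block_bits):.1f} cycles/block")
```

Under these assumptions DES comes out at close to one 64-bit block per cycle, 3DES at roughly three cycles per block, and AES at roughly ten cycles per 128-bit block.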
