Similar Literature
 Found 20 similar articles; search took 15 ms
1.
Many current applications in battery-powered devices come from the digital signal processing, telecommunication, and multimedia domains. These applications typically set high requirements for computational performance, and parallelism is often the key to meeting them. To exploit parallel processing units, the memory must be able to feed the data path with data, which calls for a memory organization that supports parallel memory accesses. In this paper, a conflict-resolving parallel data memory system for application-specific instruction-set processors is described. The memory structure is generic and reusable, so it can support various application-specific designs. The proposed memory system does not employ any predefined access format signals for memory addressing. The proposed parallel memory system is attached to an application-specific instruction-set processor core, and comparisons of area, power, and critical path are shown. The experiments show that significant power savings can be obtained by using the parallel memory system instead of multi-port memory.
Jarmo Takala

2.
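On-Chip Memory Allocation Strategy Based on an Extended Control Flow Graph   Cited by: 1 (self-citations: 0, citations by others: 1)
Wang Xuexiang, Pu Hanlai, Yang Jun. Acta Electronica Sinica (《电子学报》), 2007, 35(8): 1558-1562
This paper proposes a scratch-pad memory (SPM) allocation strategy based on an extended control flow graph (ECFG). The strategy first partitions the program into nodes such as global variables, the global stack, and instruction blocks, and describes the application with an ECFG that captures the nodes and the relationships between them. It then uses an improved knapsack algorithm that takes the relationships between nodes into account to allocate the selected nodes to the SPM. Experiments show that the strategy reduces application execution time by 11% compared with an SPM allocation strategy based on the plain knapsack algorithm, and by 56% compared with not using an SPM, substantially improving the performance of the SoC memory subsystem.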
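
The abstract above uses a plain knapsack allocation as its baseline. A minimal sketch of that baseline follows, assuming each program node has a size and an estimated benefit from being placed in the SPM. The node names, sizes, and benefit values are hypothetical, and the paper's improved algorithm, which also accounts for relationships between ECFG nodes, is not reproduced here.

```python
def knapsack_spm(nodes, capacity):
    """Plain 0/1 knapsack over program nodes.

    nodes: list of (name, size, benefit) tuples; size is a positive integer.
    capacity: SPM capacity in the same unit as size.
    Returns (total_benefit, tuple_of_chosen_names).
    """
    # best[c] holds the best (benefit, chosen names) achievable within capacity c.
    best = [(0, ())] * (capacity + 1)
    for name, size, benefit in nodes:
        # Iterate capacities downwards so each node is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    return best[capacity]


# Hypothetical nodes: (name, size in KB, estimated cycles saved).
example_nodes = [
    ("global_array", 4, 40),
    ("stack", 3, 50),
    ("loop_body", 2, 100),
    ("isr", 5, 95),
]
print(knapsack_spm(example_nodes, 7))  # (195, ('loop_body', 'isr'))
```

In this sketch every node is valued independently; a strategy of the kind the abstract describes would adjust the benefit of a node depending on which related nodes are already selected.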

3.
Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup

4.
Providing quality mobile video applications in hand-held mobile devices requires increased computational capability. Using Single Instruction Multiple Data (SIMD) techniques to expose and accelerate the data parallelism inherent in video processing increases performance in handheld and wireless systems. The paper introduces a new 64-bit SIMD coprocessor of the Intel® XScale® microarchitecture which is optimized for low-power handheld applications. The architecture blends the SIMD media processing style with the capabilities of the XScale microarchitecture. This paper provides an overview of the architecture, its instruction set, programming model, the pipeline organization and functional units. The paper also describes how key features of the architecture improve the performance of video applications as compared to a scalar implementation. The performance and power improvements based upon measured results are analyzed to show how the opportunities for power savings by reducing the frequency and voltage can be realized.

Nigel C. Paver has 13 years of experience with the ARM architecture, and in the Intel PCA Components group in Austin, Texas, he is responsible for the architecture and implementation of multimedia coprocessors for the Intel XScale micro-architecture. He is also involved in product architecture and definition of Intel PCA processors. Before Intel, Nigel was one of the lead designers of the early AMULET asynchronous ARM microprocessors at the University of Manchester. He was also vice president in a startup company which used asynchronous design techniques to produce a low-power asynchronous DSP core. Nigel holds a Master of Science degree and Ph.D. in computer science from the University of Manchester and a Bachelor of Science degree in electronics from UMIST.

Moinul Khan is a multimedia product architect at Intel Corporation PCA Components group. He is responsible for PCA graphics and security architecture. His research interests are virtual prototyping, signal processing algorithms and architecture, and communications networking. Before joining Intel he was a technology specialist and founding member of a startup at ATDC, Georgia. He worked on his doctoral research at the Georgia Center for Advanced Telecommunications Technology at Georgia Institute of Technology. He received his B.Tech from Indian Institute of Technology and MSEE from Georgia Tech. He also worked as a research member for Canadian Institute for Telecommunications Research and Bell Communications Laboratories.

Bradley C. Aldrich joined Intel in 1997 where he is currently an architect within the PCA Components Group. His current work includes the development of coprocessor instruction support in addition to image capture and display technologies for XScale based application processors. He was previously a member of the Intel/Analog Devices joint development architecture team responsible for video enhancements for the Micro Signal Architecture. Prior to that he was a video system architect in Intel's Digital Imaging and Video Division working on CMOS sensors, still cameras, and tethered PC based video peripherals. He has also worked as a device engineer for Motorola and as a test engineer for Tektronix. He received a BSEE in 1988 and MSEE in 1994 from the University of Texas at San Antonio.

Christopher D. Emmons received a Bachelor of Science degree in Computer Science from the University of Texas at Austin in 2003. He joined Intel in 2001 and is currently a multimedia architect responsible for algorithm development and performance optimization for handheld products within the PCA Components Group. Prior to this he worked as an applications engineer providing performance and power analysis in support of product marketing groups. His research interests include video compression, operating system design, and dynamic resource management.

5.
Many different video processor architectures exist. Its architecture gives a processor strength for a particular application. Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and peripheral support.
Jonah Probell

6.
This paper presents a Computational Memory architecture for MPEG-4 applications on mobile devices. The proposed architecture is used for real-time block-based motion estimation, which is the most computationally intensive task in the video encoder. It uses the exhaustive block-matching algorithm (EBMA) for motion estimation. The proposed architecture consists of embedded SRAMs and a number of block-matching units working in parallel to process video data while it is stored in the memory. The block-matching units access the embedded SRAMs simultaneously, which increases the speed of the architecture. The architecture processes CIF format video sequences (i.e., the frame size is 352 × 288 pixels) with a block size of 16 × 16 pixels and a ±15 pixel search range. The proposed architecture has been designed, prototyped, and simulated for 0.18 μm TSMC CMOS technology. The simulation shows that the proposed architecture processes up to 126 CIF frames per second at a clock frequency of 100 MHz. The synthesized prototype of the proposed architecture includes 200 KB of memory, has an area of 33.75 mm², and consumes 986.96 mW at 100 MHz.

Mohammed Sayed received his B.Sc. degree from Zagazig University, Zagazig, Egypt, in 1997 and a postgraduate diploma in VLSI design from the Information Technology Institute (ITI), Cairo, Egypt, in 1998. In 2003 he received his M.Sc. degree from University of Calgary, Calgary, Canada. From 1998 to 2001 he was a research and teaching assistant at the Electronics & Communications Engineering Department, Zagazig University, Egypt. In 2001 he became a research assistant at the Department of Electrical and Computer Engineering, University of Calgary, Canada. His current research interests are System-on-Chip, Embedded Memories, and Digital Video Processing. Mr. Sayed received a number of scholarships and awards such as the iCORE Scholarship from 2003 to 2005, the SMC Industrial Collaboration Award in June 2003, and the Micronet Annual Workshop Best Paper Award in April 2002. He has a number of journal and conference publications and a number of contributions to the MPEG-4 standard (ISO/IEC JTC1/SC29/WG11 MPEG2002/ M8562 and M8563).

Wael Badawy is an associate professor in the Department of Electrical and Computer Engineering. He holds an adjunct professorship in the Department of Mechanical Engineering, University of Alberta. Dr. Badawy's research interests are in the areas of microelectronics, VLSI architectures for video applications with low-bit rate applications, digital video processing, low power design methodologies, and VLSI prototyping. His research involves designing new models, techniques, algorithms, architectures and low-power prototypes for novel systems and consumer products. Dr. Badawy authored and co-authored more than 100 peer reviewed journal and conference papers and about 30 technical reports. He is the Guest Editor for the special issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and Computer Engineering, the Technical Chair for the 2002 International Workshop on SoC for real-time applications, and a technical reviewer for several IEEE journals and conferences. He is currently a member of the IEEE-CAS Technical Committee on Communication. Dr. Badawy was honored with the "2002 Petro Canada Young Innovator Award", the "2001 Micralyne Microsystems Design Award" and the 1998 Upsilon Pi Epsilon Honor Society and IEEE Computer Society Award for Academic Excellence in Computer Disciplines. He is currently the Chairman of the Canadian Advisor Committee (CAC) and Head of the Canadian Delegation on ISO/IEC/JTC1/SC6 "Telecommunications and Information Exchange Between Systems". He is a member of the Canadian Advisory Committee for the Standards Council of Canada, Subcommittee 29: Coding of Audio, Picture, Multimedia and Hypermedia Information, and a Canadian delegate to the ISO/IEC MPEG standard committee. He is a voting member of the VSI Alliance. He is also the Chair of the IEEE-Southern Alberta Society-Computer Chapter.
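
The exhaustive block-matching search named in the abstract of this entry can be sketched in software as follows. This is a plain sequential illustration, assuming the sum of absolute differences (SAD) as the matching cost, which the abstract does not state; the function names and the small test frame are hypothetical, and the parallel in-memory hardware organization is not modelled.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, +search]^2 that stays inside `ref`.
    Returns (best_cost, dx, dy); ties keep the first candidate found."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical 8 x 8 frames: `cur` is `ref` shifted by one pixel in x and y.
ref_frame = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur_frame = [[ref_frame[min(y + 1, 7)][max(x - 1, 0)] for x in range(8)]
             for y in range(8)]
print(full_search(cur_frame, ref_frame, 2, 2, 2, 1))  # (0, -1, 1)
```

For the parameters quoted in the abstract, a CIF frame holds 22 × 18 = 396 blocks of 16 × 16 pixels, and a ±15 pixel range gives 31 × 31 = 961 candidate displacements per block. Ignoring frame-boundary effects, that is roughly 396 × 961 × 256 ≈ 97 million pixel differences per frame, which suggests why the search is run on many block-matching units in parallel.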

7.
Efforts to reduce high-speed memory interface power have led to the adoption of data bus inversion or bus-invert coding. This study compares two popular algorithms, which seek to limit the number of simultaneously transitioning signals and bias the state of transmitted data toward a preferred binary level, respectively. A new algorithm, which provides a compromise between transition frequency and preferred signal level, is proposed, and the three algorithms are compared in terms of their impact on power consumption, power supply noise reduction, and general signal integrity enhancement when used in conjunction with a variety of link topologies.
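
The two existing bus-invert schemes that this abstract compares can be sketched as below. This is a minimal illustration assuming an 8-bit bus with one extra invert flag line; the function names are hypothetical, the transition of the flag line itself is not counted, and the new compromise algorithm proposed in the study is not reproduced.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def popcount(value):
    """Number of 1 bits in value."""
    return bin(value).count("1")


def encode_limit_transitions(word, prev_bus):
    """Invert the word when more than half of the lines would toggle
    relative to the previous bus state. Returns (bus_word, invert_flag)."""
    if popcount((word ^ prev_bus) & MASK) > WIDTH // 2:
        return (~word) & MASK, 1
    return word & MASK, 0


def encode_preferred_level(word, preferred=1):
    """Invert the word when more than half of the lines would sit at the
    non-preferred binary level. Returns (bus_word, invert_flag)."""
    ones = popcount(word & MASK)
    non_preferred = WIDTH - ones if preferred == 1 else ones
    if non_preferred > WIDTH // 2:
        return (~word) & MASK, 1
    return word & MASK, 0


def decode(bus_word, invert_flag):
    """Recover the original word at the receiver."""
    return (~bus_word) & MASK if invert_flag else bus_word & MASK


print(encode_limit_transitions(0xFF, 0x00))  # (0, 1): 8 toggles avoided
print(encode_preferred_level(0x01))          # (254, 1): seven 0s become 1s
```

The first encoder depends on the previous bus state while the second depends only on the current word, which is why the two schemes trade off transition count against signal level differently.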

8.
A dynamically reconfigurable memory chip is fabricated, to which the on-chip memories of an SoC can be moved to increase the efficiency of memory usage; it is stacked on a logic chip using three-dimensional packaging technology. In the memory chip, many RAM macros are arrayed and connected through two-dimensional mesh network interconnects. By using memory-specific network interconnects, the area overhead of the network interconnects for the memory chip is reduced by 63% and the latency overhead by 43%. Signal lines between the two chips are directly connected by 10-μm-pitch inter-chip electrodes, resulting in fast and low-energy inter-chip transmission.

9.
This paper investigates the relationship between bioinformatics data processing and its underlying computing architecture within the context of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC exchanges sequence data on a daily basis across its three member organizations in the USA, UK and Japan. We studied how this sequence database in MySQL can best take advantage of the increased transfer bandwidth of a grid-based storage architecture. Within the context of the UK Government Project "Grid-oriented Storage (GOS)" and the EC Project "EuroAsiaGrid," GOS has been developed in our lab; it incorporates a parallel streaming technique to meet the needs of WAN/Grid-based virtual organizations. A real-world test shows that the INSDC sequence database backup operation, mysqldump, over the pipelined GOS architecture outperforms the same operation over the classic infrastructures by a factor of six over the link between Cambridge and Tokyo. When performing a genomic sequence search against one million records via the underlying GOS architecture, a performance improvement of 67.3% was achieved.
Frank Zhigang Wang

10.
This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation. The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM usage for storing twiddle factors. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware. The synthesis results match the theoretical analysis and it can be observed that more than 20% reduction can be achieved in total memory logic. In addition, the dynamic power consumption can be reduced by as much as 15% by reducing memory accesses.
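
The idea of computing twiddle factors with CORDIC rotations rather than reading them from a ROM can be sketched as follows. This is a floating-point illustration of generic rotation-mode CORDIC, assuming 16 iterations; the function names are hypothetical, and the paper's addressing scheme and angle generator logic are not reproduced.

```python
import math

ITERATIONS = 16
# Elementary rotation angles atan(2^-i) and the accumulated CORDIC gain.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_rotate(x, y, angle):
    """Rotate (x, y) by `angle` radians (|angle| up to about 1.74) using
    only additions and multiplications by powers of two."""
    z = angle
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def twiddle(k, n):
    """Approximate W_n^k = exp(-2*pi*j*k/n) as (real, imag) without a table."""
    theta = -2.0 * math.pi * (k % n) / n      # in (-2*pi, 0]
    if theta <= -math.pi:
        theta += 2.0 * math.pi                # now in (-pi, pi]
    x, y = 1.0, 0.0
    # Pre-rotate by +/- 90 degrees so the remaining angle is in CORDIC range.
    if theta > math.pi / 2:
        x, y, theta = 0.0, 1.0, theta - math.pi / 2
    elif theta < -math.pi / 2:
        x, y, theta = 0.0, -1.0, theta + math.pi / 2
    return cordic_rotate(x, y, theta)


re, im = twiddle(1, 8)
print(re, im)  # close to cos(pi/4) and -sin(pi/4)
```

In hardware the multiplications by `2.0 ** -i` become bit shifts, and the division by `GAIN` is typically folded into a constant scaling step, so the only stored constants are the `ITERATIONS` elementary angles.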

11.
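This paper introduces typical big data applications in the radio and television broadcasting industry, including viewing behavior analysis, customer profile insight, and marketing analysis, and describes the system architecture of an enterprise data application center in the big data setting, including the technical, data, functional, and deployment architectures.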

12.
This paper presents a memory organization for SDR inner modem baseband processors that focuses on exploiting ILP. This memory organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to a high-ILP processor. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks, and to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware, and are evaluated for a number of different wireless communication standards. For the 11a|n benchmarks, the overhead of stall cycles resulting from unresolved bank conflicts can be reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead, and without any bank-aware compiler support.
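
The effect of spreading conflicting accesses over multiple banks can be illustrated with a small model. This sketch assumes four single-ported banks, low-order interleaving, and a simple skewed mapping as a stand-in for the bank address rotation mentioned above; the exact rotation function used in the paper is not given in the abstract, and the queues in the memory interface are not modelled.

```python
from collections import Counter

NUM_BANKS = 4


def bank_plain(addr):
    """Low-order interleaving: consecutive words go to consecutive banks."""
    return addr % NUM_BANKS


def bank_rotated(addr):
    """Assumed rotation: shift the bank index by the row number, so that
    words in the same column of consecutive rows land in different banks."""
    return (addr + addr // NUM_BANKS) % NUM_BANKS


def stall_cycles(addresses, bank_of):
    """Extra cycles needed when each single-ported bank serves one access per
    cycle and all `addresses` are issued together."""
    per_bank = Counter(bank_of(a) for a in addresses)
    return max(per_bank.values()) - 1


strided = [0, 4, 8, 12]  # stride equal to the number of banks
print(stall_cycles(strided, bank_plain))    # 3: every access hits bank 0
print(stall_cycles(strided, bank_rotated))  # 0: accesses spread over 4 banks
```

Unit-stride accesses such as `[0, 1, 2, 3]` remain conflict-free under both mappings in this model, so the rotation mainly helps the strided patterns that would otherwise pile up on one bank.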

13.
Multicore processors are now attracting interest worldwide. As system-on-chip design technology has developed, observing and controlling a processor core's internal state has become harder, so multicore processor debugging is difficult and time-consuming. A reliable and efficient debugger is therefore needed to find bugs. In this paper, we propose an on-chip debug architecture for multicore processors that is easily adaptable and flexible. It is based on the JTAG standard and supports monitoring mode debugging, which is different from run-stop mode debugging. Compared with a debug architecture that supports run-stop mode debugging, the proposed architecture is easily applied to a debugger and has the advantage of a desirable gate count and execution cycle. To verify the on-chip debug architecture, it is applied to the debugger of a prototype multicore processor and tested by connecting it to a software debugger based on GDB and configured for the target processor.

14.
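Multicore processors have become the mainstream of processors and have developed into the mainstream processing platform for a wide range of communication and media applications. The communication structure is one of the core technologies of a multicore system, and the efficiency of inter-core communication is an important factor in multicore processor performance. There are currently three main communication architectures: the bus structure, the crossbar switch network, and the network on chip (NoC). The bus structure is relatively easy to design, consumes less hardware, and costs less; the crossbar switch is a switching network structure suited to building high-capacity systems; the network on chip is a higher-level, larger-scale on-chip network system that can address the multicore architecture problem and is one of the most promising solutions for multicore systems. This paper analyzes the basic principles, system structure, and functions of the NoC structure, and also provides the design and implementation of some of its units.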

15.
As the network environment is rapidly changing, network interfaces demand highly intelligent traffic management (on the control plane) in addition to the basic requirement of wire-speed packet forwarding (on the data plane). Several vendors are releasing various network processors (NPs) in order to handle these demands, but they are optimized mostly for data plane throughput. As demands for control plane applications (e.g., quality of service) grow, efficient control plane processing will become increasingly important to the performance of a network interface. In this paper, we explore acceleration techniques to improve the performance of control plane network applications. Three applications, including media transcoding and transaction applications, are analyzed in detail. The workload characterization shows that a wide-issue configuration saturates early in performance, and sensitivity analysis shows that there is no common bottleneck among the applications. Therefore, we study giving each application its own hardware acceleration module in order to achieve the throughput required at OC-768 or higher. Our approach includes an array-style accelerator for media transcoding applications and a partitioned lookup mechanism for lookup-table-related applications. Performance analysis of the proposed techniques shows significant improvement over the baseline configuration. Such hardware accelerators provide large packet-level parallelism proportional to the number of processing elements added. Our analyses of the proposed techniques suggest future directions for the design of high-performance NPs.

16.
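Li Qi, Zhong Jiang, Li Xue, Li Qing. Acta Electronica Sinica (《电子学报》), 2019, 47(3): 664-670
With the rapid development of the Internet and cloud computing, existing dynamic random access memory (DRAM) can no longer meet the performance and energy requirements of some real-time systems. The emergence of new non-volatile memory (NVM) brings new opportunities for the development of computer memory systems. For a hybrid NVM and DRAM memory architecture, this paper proposes an efficient hybrid memory page management mechanism. Based on the different write characteristics of the memory media, the mechanism places data pages with different access characteristics in the appropriate memory space, reducing the number of migration operations and thereby improving system performance. The mechanism also uses a two-way linked list to distribute write operations on the NVM medium more evenly and so extend its lifetime. Finally, the proposed mechanism is evaluated by simulation experiments in the Linux kernel and compared with existing memory management mechanisms; the experimental results demonstrate the effectiveness of the proposed method.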

17.
18.
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is important to study how to parallelize and vectorize loop nests with awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops while reducing the misaligned memory references in innermost loops. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7% to 37% (on average 18.4%) of misaligned memory references can be eliminated. The simulations on CELL show that a 1.1x speedup can be reached by reducing the misaligned data, while a 6.14x speedup can be achieved by enhancing the parallelism for multi-core.

19.
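Based on the operating practice of the non-linear editing and script networks at the Yuhuan County Radio and Television Station, this paper presents experience and methods for keeping the station's non-linear editing production system running safely and efficiently. It covers the security architecture of the non-linear editing and script networks, secure data exchange between the two networks, secure storage and backup of script network data, scheduled secure backup of the non-linear editing network database, the use of data recovery software on the non-linear editing network, and disk defragmentation methods.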

20.
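This paper proposes an automatic extended-instruction selection method that combines bit-operation analysis and transformation. The method introduces new intermediate-representation nodes for bit operations into the data flow graph, allowing bit-access operations to be described concisely. The compiler can apply selective loop unrolling and bit-operation analysis optimizations to the program's data flow graph and transform it into a data flow graph with nodes that directly represent bit-assignment operations. Experimental results show that selecting extended instructions on the basis of the new data flow graph effectively improves the performance of bit-operation-intensive applications.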
