Found 20 similar documents; search took 15 ms.
1.
Teemu Pitkänen Jarno K. Tanskanen Risto Mäkinen Jarmo Takala 《Journal of Signal Processing Systems》2009,57(1):21-32
Many of the current applications used in battery powered devices are from digital signal processing, telecommunication, and
multimedia domains. These applications typically set high requirements for computational performance and often parallelism
is the key solution to meet the performance requirements. In order to exploit the parallel processing units, memory should
be able to feed the data path with data. This calls for a memory organization supporting parallel memory accesses. In this
paper, a conflict resolving parallel data memory system for application-specific instruction-set processors is described.
The memory structure is generic and reusable to support various application-specific designs. The proposed memory system does
not employ any predefined access format signals for memory addressing. The proposed parallel memory system is attached to
an application-specific instruction-set processor core and comparison on area, power, and critical path are shown. The experiments
show that significant power savings can be obtained by exploiting the parallel memory system instead of multi-port memory.
2.
3.
Architecture and Compiler Optimizations for Data Bandwidth Improvement in Configurable Processors (Total citations: 1, self-citations: 0, citations by others: 1)
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(9):986-997
Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup
4.
N. C. Paver M. H. Khan B. C. Aldrich C. D. Emmons 《The Journal of VLSI Signal Processing》2005,41(1):21-34
Providing quality mobile video applications in hand-held mobile devices requires increased computational capability. Using Single Instruction Multiple Data (SIMD) techniques to expose and accelerate the data parallelism inherent in video processing increases performance in handheld and wireless systems. The paper introduces a new 64-bit SIMD coprocessor of the Intel® XScale® microarchitecture which is optimized for low-power handheld applications. The architecture blends the SIMD media processing style with the capabilities of the XScale microarchitecture. This paper provides an overview of the architecture, its instruction set, programming model, the pipeline organization and functional units. The paper also describes how key features of the architecture improve the performance of video applications as compared to a scalar implementation. The performance and power improvements based upon measured results are analyzed to show how the opportunities of power savings by reducing the frequency and voltage can be realized.
Nigel C. Paver has 13 years' experience with the ARM architecture, and in the Intel PCA Components group in Austin, Texas, he is responsible for the architecture and implementation of multimedia coprocessors for the Intel XScale micro-architecture. He is also involved in product architecture and definition of Intel PCA processors. Before Intel, Nigel was one of the lead designers of the early AMULET asynchronous ARM microprocessors at the University of Manchester. He was also vice president in a startup company which used asynchronous design techniques to produce a low-power asynchronous DSP core. Nigel holds a Master of Science degree and Ph.D. in computer science from the University of Manchester and a Bachelor of Science degree in electronics from UMIST.
Moinul Khan is a multimedia product architect at Intel Corporation PCA Components group. He is responsible for PCA graphics and security architecture. His research interests are virtual prototyping, signal processing algorithms and architecture and communications networking. Before joining Intel he was a technology specialist and founding member of a startup at ATDC, Georgia. He worked on his doctoral research at Georgia Center for Advanced Telecommunications Technology at Georgia Institute of Technology. He received his B.Tech from the Indian Institute of Technology and MSEE from Georgia Tech. He also worked as a research member for Canadian Institute for Telecommunications Research and Bell Communications Laboratories.
Bradley C. Aldrich joined Intel in 1997 where he is currently an architect within the PCA Components Group. His current work includes the development of coprocessor instruction support in addition to image capture and display technologies for XScale based application processors. He was previously a member of the Intel/Analog Devices joint development architecture team responsible for video enhancements for the Micro Signal Architecture. Prior to that he was a video system architect in Intel’s Digital Imaging and Video Division working on CMOS sensors, still cameras, and tethered PC based video peripherals. He has also worked as a device engineer for Motorola and as a test engineer for Tektronix. He received a BSEE in 1988 and MSEE in 1994 from the University of Texas at San Antonio.
Christopher D. Emmons received a Bachelor of Science degree in Computer Science from the University of Texas at Austin in 2003. He joined Intel in 2001 and is currently a multimedia architect responsible for algorithm development and performance optimization for handheld products within the PCA Components Group. Prior to this he worked as an applications engineer providing performance and power analysis in support of product marketing groups. His research interests include video compression, operating system design, and dynamic resource management.
5.
Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39
Many different video processor architectures exist. A processor's architecture gives it strength for a particular application.
Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support
multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor
architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor
level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized
for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth
rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video
processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and
peripheral support.
6.
This paper presents a Computational Memory architecture for MPEG-4 applications with mobile devices. The proposed architecture
is used for real-time block-based motion estimation, which is the most computational intensive task in the video encoder.
It uses the exhaustive block-matching algorithm (EBMA) for motion estimation. The proposed architecture consists of embedded
SRAMs and a number of block-matching units working in parallel to process video data while stored in the memory. The block-matching
units access the embedded SRAMs simultaneously, which increases the speed of the architecture.
The architecture processes CIF format video sequences (i.e., the frame size is 352 × 288 pixels) with block size of 16 × 16
pixels and ±15 pixels search range. The proposed architecture has been designed, prototyped, and simulated for 0.18 μm TSMC
CMOS technology. The simulation shows that the proposed architecture processes up to 126 CIF frames per second with clock
frequency 100 MHz. The synthesized prototype of the proposed architecture includes 200 KB memory and it has an area of 33.75
mm² and consumes 986.96 mW @100 MHz.
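As an illustration of the exhaustive block-matching algorithm (EBMA) named in the abstract, the following is a minimal software sketch, not the authors' hardware design. It assumes frames stored as lists of rows of integer luminance values and uses the sum of absolute differences (SAD) as the matching cost; the function names `sad` and `ebma` are hypothetical, and the abstract does not state which cost function the architecture uses.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def ebma(cur, ref, bx, by, n=16, search=15):
    """Test every displacement within +/-search pixels and return
    (best_dx, best_dy, best_sad). Candidates that would fall outside
    the reference frame are skipped (an assumption of this sketch)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

With the parameters quoted in the abstract, a CIF frame holds 22 × 18 = 396 blocks of 16 × 16 pixels, and a ±15 pixel range gives 31 × 31 = 961 candidate positions per block. Ignoring frame-edge clipping, that is roughly 396 × 961 × 256 ≈ 97.4 million pixel differences per frame, which suggests why the block-matching units are run in parallel.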
Mohammed Sayed received his B.Sc. degree from Zagazig University, Zagazig, Egypt, in 1997 and a postgraduate diploma in VLSI design from
the Information Technology Institute (ITI), Cairo, Egypt, in 1998. In 2003 he received his M.Sc. degree from University of
Calgary, Calgary, Canada. From 1998 to 2001 he was a research and teaching assistant at the Electronics & Communications Engineering
Department, Zagazig University, Egypt. In 2001 he became a research assistant at the Department of Electrical and Computer
Engineering, University of Calgary, Canada. His current research interests are System-on-Chip, Embedded Memories, and Digital
Video Processing.
Mr. Sayed received a number of scholarships and awards such as iCORE Scholarship from 2003 to 2005, SMC Industrial Collaboration
Award in June 2003, and the Micronet Annual Workshop Best Paper Award in April 2002. He has a number of journal and conference
publications and a number of contributions to the MPEG-4 standard (ISO/IEC JTC1/SC29/WG11 MPEG2002/ M8562 and M8563).
Wael Badawy is an associate professor in the Department of Electrical and Computer Engineering. He holds an adjunct professorship in the
Department of Mechanical Engineering, University of Alberta.
Dr. Badawy's research interests are in the areas of: Microelectronics, VLSI architectures for video applications with low-bit
rate applications, digital video processing, low power design methodologies, and VLSI prototyping. His research involves designing
new models, techniques, algorithms, architectures and low power prototype for novel system and consumer products. Dr. Badawy
authored and co-authored more than 100 peer reviewed Journal and Conference papers and about 30 technical reports. He is the
Guest Editor for the special issue on System on Chip for Real-Time Applications in the Canadian Journal on Electrical and
Computer Engineering, the Technical Chair for the 2002 International Workshop on SoC for real-time applications, and a technical
reviewer in several IEEE journals and conferences. He is currently a member of the IEEE-CAS Technical Committee on Communication.
Dr. Badawy was honored with the “2002 Petro Canada Young Innovator Award”, “2001 Micralyne Microsystems Design Award” and
the 1998 Upsilon Pi Epsilon Honor Society and IEEE Computer Society Award for Academic Excellence in Computer Disciplines.
He is currently the Chairman of the Canadian Advisor Committee (CAC) and Head of the Canadian Delegation on ISO/IEC/JTC1/SC6
“Telecommunications and Information Exchange Between Systems”. Member, The Canadian Advisory Committee for the Standards Council
of Canada—Subcommittee 29: Coding of Audio, Picture Multimedia and Hypermedia Information, and Canadian Delegate, The ISO/IEC
MPEG standard committee. He is a voting Member on the VSI Alliance. He is also the Chair of the IEEE-Southern Alberta Society-Computer
Chapter.
7.
Efforts to reduce high-speed memory interface power have led to the adoption of data bus inversion or bus-invert coding. This study compares two popular algorithms, which seek to limit the number of simultaneously transitioning signals and bias the state of transmitted data toward a preferred binary level, respectively. A new algorithm, which provides a compromise between transition frequency and preferred signal level, is proposed, and the three algorithms are compared in terms of their impact on power consumption, power supply noise reduction, and general signal integrity enhancement when used in conjunction with a variety of link topologies.
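The two existing bus-invert schemes described in the abstract can be sketched as follows. This is a minimal illustration assuming an 8-bit data bus with one extra inversion-flag line; the names `dbi_ac`, `dbi_dc`, and `decode` are hypothetical, and the compromise algorithm proposed in the paper is not reproduced here because the abstract does not specify it.

```python
WIDTH = 8                      # assumed bus width for this sketch
MASK = (1 << WIDTH) - 1


def popcount(value):
    """Number of 1 bits in value."""
    return bin(value).count("1")


def dbi_ac(word, prev_wire):
    """Transition-limiting scheme: invert the word when more than half
    of the data lines would toggle relative to the previous wire state.
    The flag line's own transition is ignored in this sketch."""
    toggles = popcount((word ^ prev_wire) & MASK)
    if toggles > WIDTH // 2:
        return (~word) & MASK, 1
    return word & MASK, 0


def dbi_dc(word, preferred=1):
    """Level-biasing scheme: invert the word when more than half of its
    bits sit at the non-preferred logic level."""
    ones = popcount(word & MASK)
    non_preferred = WIDTH - ones if preferred == 1 else ones
    if non_preferred > WIDTH // 2:
        return (~word) & MASK, 1
    return word & MASK, 0


def decode(wire, flag):
    """Receiver side: undo the inversion when the flag is set."""
    return (~wire) & MASK if flag else wire & MASK
```

For example, sending `0xFF` after `0x00` would toggle all eight lines, so `dbi_ac` transmits `0x00` with the flag set, and the receiver recovers `0xFF` through `decode`.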
8.
Saito H. Nakajima M. Okamoto T. Yamada Y. Ohuchi A. Iguchi N. Sakamoto T. Yamaguchi K. Mizuno M. 《Solid-State Circuits, IEEE Journal of》2010,45(1):15-22
A dynamic-reconfigurable memory chip is fabricated, by which on-chip memories of an SoC chip can be moved to the memory chip to increase the efficiency of memory usage, and stacked on a logic chip by using three dimensional packaging technology. In the memory chip, many RAM-macros are arrayed and they are connected through two dimensional mesh network interconnects. By using memory-specified network interconnects, area overhead of network interconnects for the memory chip is reduced by 63% and the latency overhead by 43%. Signal lines between the two chips are directly connected by 10-μm-pitch inter-chip electrodes, resulting in fast and low-energy inter-chip transmission.
9.
Frank Zhigang Wang Na Helian Sining Wu Yuhui Deng Vineet Khare Chris Thompson 《The Journal of VLSI Signal Processing》2007,48(3):311-324
This paper examines and investigates the relationship between bioinformatics data processing and its underlying computing
architecture within the context of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC exchanges sequence
data on a daily basis across its three member organizations in USA, UK and Japan. We studied how this sequence database in
MySQL can best take advantage of the increased transfer bandwidth of a grid-based storage architecture. Within the context
of the UK Government Project “Grid-oriented Storage (GOS)” and the EC Project “EuroAsiaGrid,” GOS has been developed in our
lab, which melds parallel streaming technique to meet the needs of WAN/Grid-based virtual organizations. A real-world test
shows that the INSDC sequence database backup operation, mysqldump, over the pipelined GOS architecture beats those over
the classic infrastructures by six times over the link between Cambridge and Tokyo. When performing genomic sequence search
against one million records via the underlying GOS architecture, the performance improvement of 67.3% has been achieved.
10.
This paper presents a pipelined, reduced memory and low power CORDIC-based architecture for fast Fourier transform implementation.
The proposed algorithm utilizes a new addressing scheme and the associated angle generator logic in order to remove any ROM
usage for storing twiddle factors. As a case study, the radix-2 and radix-4 FFT algorithms have been implemented on FPGA hardware.
The synthesis results match the theoretical analysis and it can be observed that more than 20% reduction can be achieved in
total memory logic. In addition, the dynamic power consumption can be reduced by as much as 15% by reducing memory accesses.
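To show how CORDIC can stand in for a stored twiddle-factor table, here is a minimal floating-point sketch of rotation-mode CORDIC applied to an FFT twiddle multiplication. It is an illustration only: the paper's addressing scheme and angle generator logic are not described in the abstract, so the function names, the 16-iteration depth, and the small arctangent table below are all assumptions of this sketch.

```python
import math

ITER = 16
# Elementary rotation angles atan(2^-i); hardware would hold these as constants.
ATAN = [math.atan(2.0 ** -i) for i in range(ITER)]
# Accumulated CORDIC gain, divided out after the micro-rotations.
GAIN = 1.0
for i in range(ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_rotate(x, y, angle):
    """Rotate (x, y) counterclockwise by `angle` (|angle| <= pi/2 here)
    using only shift-and-add style micro-rotations."""
    z = angle
    for i in range(ITER):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN[i]
    return x / GAIN, y / GAIN


def twiddle_multiply(re, im, k, n):
    """Multiply (re + j*im) by W_n^k = exp(-j*2*pi*k/n) without a
    sine/cosine table, using a quadrant pre-rotation plus CORDIC."""
    angle = -2.0 * math.pi * (k % n) / n
    if angle < -math.pi:            # fold into [-pi, pi)
        angle += 2.0 * math.pi
    if angle > math.pi / 2:         # pre-rotate by +90 degrees
        re, im, angle = -im, re, angle - math.pi / 2
    elif angle < -math.pi / 2:      # pre-rotate by -90 degrees
        re, im, angle = im, -re, angle + math.pi / 2
    return cordic_rotate(re, im, angle)
```

Calling `twiddle_multiply(1.0, 0.0, 1, 8)` approximates exp(-jπ/4), with accuracy limited by the number of iterations.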
11.
12.
Bjorn De Sutter Osman Allam Praveen Raghavan Roeland Vandebriel Hans Cappelle Tom Vander Aa Bingfeng Mei 《Journal of Signal Processing Systems》2010,61(2):157-179
This paper presents a memory organization for SDR inner modem baseband processors that focus on exploiting ILP. This memory
organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to high-ILP
processors. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks, and
to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts
of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware, and are evaluated
for a number of different wireless communication standards. For the 11a|n benchmarks, the overhead of stall cycles resulting
from unresolved bank conflicts can be reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding
wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead,
and without any bank-aware compiler support.
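The effect of interleaving and bank address rotation on conflicts can be illustrated with a small model. This sketch assumes low-order interleaving (bank = address mod number of banks) and a simple skewed mapping as a stand-in for rotation; the abstract does not give the paper's actual rotation function, so `bank_rotated` is a hypothetical example, and the queue-based conflict resolution is not modelled.

```python
from collections import Counter


def bank_plain(addr, nbanks):
    """Low-order interleaving: consecutive addresses hit consecutive banks."""
    return addr % nbanks


def bank_rotated(addr, nbanks):
    """Assumed skewing: rotate the bank index by the row number so that
    strides equal to the bank count no longer land on a single bank."""
    return (addr + addr // nbanks) % nbanks


def worst_bank_load(addrs, nbanks, mapping):
    """Largest number of simultaneous accesses that fall on one bank.
    A value above 1 means a single-ported bank must serialize them."""
    return max(Counter(mapping(a, nbanks) for a in addrs).values())
```

With four banks, the strided access set `[0, 4, 8, 12]` puts all four accesses on bank 0 under `bank_plain`, while `bank_rotated` spreads them over four different banks.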
13.
Nowadays, the multicore processor is watched with interest by people all over the world. As the design technology of system on chip has developed, observing and controlling the processor core's internal state has not been easy. Therefore, multicore processor debugging is very difficult and time‐consuming. Thus, we need a reliable and efficient debugger to find the bugs. In this paper, we propose an on‐chip debug architecture for multicore processors that is easily adaptable and flexible. It is based on the JTAG standard and supports monitoring mode debugging, which is different from run‐stop mode debugging. Compared with the debug architecture that supports the run‐stop mode debugging, the proposed architecture is easily applied to a debugger and has the advantage of having a desirable gate count and execution cycle. To verify the on‐chip debug architecture, it is applied to the debugger of the prototype multicore processor and is tested by interconnecting it with a software debugger based on GDB and configured for the target processor.
14.
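Multicore processors have become the mainstream processor design and the dominant processing platform for a wide range of communication and media applications. The communication architecture is one of the core technologies of a multicore system, and the efficiency of inter-core communication is an important factor in multicore processor performance. There are currently three main communication architectures: the bus, the crossbar switch network, and the network-on-chip (NoC). A bus is relatively easy to design, consumes little hardware, and is low in cost; a crossbar is a switching network structure suited to building high-capacity systems; and a network-on-chip is a higher-level, larger-scale on-chip network system that can address multicore architecture problems and is one of the most promising solutions for multicore systems. This paper analyzes the basic principles, system architecture, and functions of the NoC structure, and also presents the design and implementation of some of its units.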
15.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(12):1691-1697
16.
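With the rapid development of the Internet and cloud computing, existing dynamic random access memory (DRAM) can no longer meet the performance and energy requirements of some real-time systems. The emergence of new non-volatile memory (NVM) brings new opportunities for the development of computer memory systems. For a hybrid NVM and DRAM memory architecture, this paper proposes an efficient hybrid-memory page management mechanism. Based on the different write characteristics of the memory media, the mechanism places data pages with different access characteristics in the appropriate memory space, reducing the number of migration operations and thereby improving system performance. The mechanism also uses a two-way linked list to distribute write operations on the NVM medium more evenly and so extend its lifetime. Finally, the proposed mechanism is evaluated in simulation experiments in the Linux kernel and compared with existing memory management mechanisms; the experimental results demonstrate the effectiveness of the proposed method.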
17.
18.
Yi Wang Linfeng Pan Zili Shao Yong Guan Minyi Guo 《Journal of Signal Processing Systems》2014,74(2):137-150
Multimedia SIMD extensions are commonly employed today to speed up media processing. When performing vectorization for SIMD architectures, one of the major issues is to handle the problem of memory alignment. Prior study focused on either vectorizing loops with all memory references being properly aligned, or introducing extra operations to deal with the misaligned memory references. On the other hand, multi-core SIMD architectures require coarse-grain parallelism. Therefore, it is an important problem to study how to parallelize and vectorize loop nests with the awareness of data misalignments. This paper presents a loop transformation scheme that maximizes the parallelism of outermost loops, while the misaligned memory references in innermost loops are reduced. The basic idea of our technique is to align each level of loops in the nest, considering the constraint of dependence relations. To reduce the data misalignments, we establish a mathematical model with a concept of offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose some rules to analyze the outermost loop. When transformations are applied, the inner loops are involved to maximize the parallelism. To avoid introducing more data misalignments, the involved innermost loop is handled from other levels of loops. Experimental results show that 7 % to 37 % (on average 18.4 %) misaligned memory references can be reduced. The simulations on CELL show that 1.1x speedup can be reached by reducing the misaligned data, while 6.14x speedup can be achieved by enhancing the parallelism for multi-core.
19.
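Based on operational experience with the nonlinear-editing and script networks at Yuhuan County Radio and Television Station, this paper describes experience and methods for keeping the station's nonlinear-editing production system running safely and efficiently. It covers the security architecture of the two networks, secure data exchange between the nonlinear-editing network and the script network, secure storage and backup of script-network data, scheduled secure backup of the nonlinear-editing network database, the use of data recovery software on the nonlinear-editing network, and disk defragmentation methods.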