Variable Pipeline Cool Mega Array (VPCMA) is a low-power Coarse Grained Reconfigurable Architecture (CGRA) based on the concept of CMA (Cool Mega Array). It implements a pipeline structure that can be configured depending on performance requirements, and uses silicon on thin buried oxide (SOTB) technology, which allows its body bias voltage to be controlled to balance performance and leakage power. In this...
Parsing of code is mandatory for writing and constructing compilers, interpreters and optimizers. The syntactic analysis of the input intermediate code into its component parts is known as parsing, and implementing such parsers requires a Context Free Grammar (CFG), which helps to analyze the input code. This paper mainly focuses on extension of the CFG in an EBNF form...
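To make the idea concrete, here is a minimal recursive-descent parser for a toy arithmetic grammar, written as EBNF in the comments. This is an illustrative sketch only; the grammar and function names are hypothetical and are not taken from the paper above.

```python
import re

# Toy grammar (EBNF):
#   expr   = term { ("+" | "-") term } ;
#   term   = factor { ("*" | "/") factor } ;
#   factor = NUMBER | "(" expr ")" ;

def tokenize(src):
    # Split the source into numbers and single-character operators.
    return re.findall(r"\d+|[+\-*/()]", src)

def parse_expr(tokens, i=0):
    val, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in "+-":
        op = tokens[i]
        rhs, i = parse_term(tokens, i + 1)
        val = val + rhs if op == "+" else val - rhs
    return val, i

def parse_term(tokens, i):
    val, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in "*/":
        op = tokens[i]
        rhs, i = parse_factor(tokens, i + 1)
        val = val * rhs if op == "*" else val / rhs
    return val, i

def parse_factor(tokens, i):
    if tokens[i] == "(":
        val, i = parse_expr(tokens, i + 1)
        return val, i + 1          # skip the closing ")"
    return int(tokens[i]), i + 1   # NUMBER

def evaluate(src):
    val, _ = parse_expr(tokenize(src))
    return val
```

Each nonterminal of the CFG becomes one function, which is exactly the correspondence between grammar and parser that makes CFG/EBNF descriptions the natural input for parser construction.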
Graphics Processing Units (GPUs) have been used extensively for accelerating parallelizable applications in general, and scientific computations in particular. Stencil-based algorithms are used intensively in various research areas and represent good candidates for GPU-based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil-based double precision...
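As a reference point for what a stencil computation looks like, here is one double-precision sweep of a 1D 3-point Jacobi-style stencil in plain Python. The neighbour-access pattern shown here (each output depends on a small fixed window of inputs) is what makes such kernels map well to GPU thread blocks; the function and parameter names are illustrative, not from the paper.

```python
def jacobi_step(u, alpha=0.25):
    # One 3-point stencil sweep: each interior point is updated from
    # itself and its two neighbours, in double precision. Boundary
    # values are held fixed.
    return [u[0]] + [
        u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]
```

On a GPU, one thread would compute one (or a tile of) output points, with the shared neighbour loads staged through fast on-chip memory.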
The Partitioned Global Address Space (PGAS) programming model strikes a balance between the locality-aware, but explicit, message-passing model (e.g. MPI) and the easy-to-use, but locality-agnostic, shared memory model (e.g. OpenMP). However, the PGAS rich memory model comes at a performance cost which can hinder its potential for scalability and performance. To contain this overhead and achieve full...
Compiler writers have developed various techniques, such as constant folding, subexpression elimination, loop transformation and vectorization, to help compilers in code optimization for performance improvement. Yet, they have been far less successful in developing techniques or cost models that compilers can rely on to simplify parallel programming and tune the performance of parallel applications...
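As an example of the first technique mentioned, here is a minimal constant-folding pass over Python expression ASTs. It is a sketch of the general transformation (fold any operator whose operands are both constants, bottom-up), not any particular compiler's implementation; it requires Python 3.9+ for `ast.unparse`.

```python
import ast
import operator

# Supported binary operators and their Python semantics.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

class Folder(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold children first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node

def fold_constants(expr):
    # Fold constant subexpressions in an expression string and
    # return the simplified source.
    tree = Folder().visit(ast.parse(expr, mode="eval"))
    return ast.unparse(tree)
```

Because folding runs bottom-up, `2 * 3` inside `x + 2 * 3` collapses first, leaving `x + 6`; fully constant expressions collapse to a single literal.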
For set-associative caches, accessing cache ways in parallel results in significant energy waste, as only one way contains the desired data. In this paper, we propose Tag Check Elision (TCE): a non-speculative approach for accessing set-associative caches without a tag check to save energy. TCE can eliminate up to 86% of the tag checks (67% on average), without sacrificing any performance. These direct...
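A toy behavioural model makes the energy argument concrete: if consecutive accesses to a set fall in the same cache block, the way found by the previous lookup can be reused and the parallel tag comparison skipped. This is a deliberately simplified sketch of that general idea (single remembered block per set, trivial replacement), not the paper's actual TCE mechanism.

```python
class TCECache:
    # Toy model of tag-check elision in a set-associative cache.
    def __init__(self, sets=4, ways=2, block=64):
        self.sets, self.ways, self.block = sets, ways, block
        self.tags = [[None] * ways for _ in range(sets)]
        self.last = [None] * sets   # (tag, way) of last access per set
        self.tag_checks = 0         # energy proxy: parallel tag checks done

    def access(self, addr):
        blk = addr // self.block
        s, tag = blk % self.sets, blk // self.sets
        if self.last[s] is not None and self.last[s][0] == tag:
            return "hit-elided"     # same block as before: no tag check
        self.tag_checks += 1        # full parallel tag check of all ways
        if tag in self.tags[s]:
            self.last[s] = (tag, self.tags[s].index(tag))
            return "hit"
        self.tags[s][0] = tag       # trivial replacement policy
        self.last[s] = (tag, 0)
        return "miss"
```

Running a sequence of same-block accesses through this model shows most tag checks disappearing, which is the effect the abstract quantifies (up to 86%, 67% on average) with its real, non-speculative detection scheme.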
The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without...
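The general idea of a bitmapped block format can be sketched as follows: each block stores a start column, a bitmap marking which positions hold nonzeros, and only the nonzero values packed densely, so explicit zeros are neither stored nor multiplied. This is an illustration of that general class of formats under assumed conventions (fixed block width, row-major blocks), not the paper's exact mapped blocked row layout.

```python
def to_bitmapped_blocks(row, block=4):
    # Pack one dense row into (start, bitmap, values) blocks; bit i of
    # the bitmap is set iff column start+i holds a nonzero.
    blocks = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        bitmap = sum(1 << i for i, v in enumerate(chunk) if v != 0.0)
        if bitmap:
            blocks.append((start, bitmap, [v for v in chunk if v != 0.0]))
    return blocks

def spmv_row(blocks, x, block=4):
    # Dot product of one bitmapped-block row with dense vector x:
    # decode the bitmap to recover column indices of packed values.
    acc = 0.0
    for start, bitmap, vals in blocks:
        k = 0
        for i in range(block):
            if bitmap >> i & 1:
                acc += vals[k] * x[start + i]
                k += 1
    return acc
```

The bitmap replaces per-entry column indices within a block, which is where such formats save bandwidth relative to plain compressed-row storage.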
Nowadays, every newly added hardware feature must not change the underlying instruction set architecture (ISA), in order to avoid adaptation or recompilation of existing code. Nevertheless, this need for compatibility imposes a great number of restrictions on designers, because it keeps them tied to a specific ISA and all its legacy hardware issues. Considering that the market is mainly dominated...
On-chip memory in today's networking SoCs takes up >50% of total area and consumes >40% of total power. As demand for high-performance networks grows, so will the memory content on future SoCs. This paper presents IBM's 32nm HKMG SOI embedded memory offering, discussing the considerations associated with the design of key networking memory functions. With memory limiting system performance and...
High Level Synthesis for Systems on Chip is an attractive way to cut development time while maintaining a good level of performance. But HLS tools are limited by the abstraction level of the description when performing some high-level transforms. This paper evaluates the impact of such high-level transforms for ASICs. We have evaluated recursive and non-recursive filters for signal processing an...
Most modern processors have cache memories that are much faster than main memory, so it is important to utilize them effectively for efficient execution. Cache memories work well when temporal or spatial locality in the program is enhanced. Therefore, cache efficiency can be improved by making accesses to the same array or structure contiguous. We propose a new cache optimization...
When applying a Superword Level Parallelism (SLP) auto-vectorization compiling system to Digital Signal Processing (DSP), specialized features of the DSP framework, such as the specific addressing model, the wide variety of registers, irregular data branches, dependence relations that obstruct vectorization, non-aligned data, and other factors, mean that the compiler can not...
Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory...
Many processors, such as the Intel Xeon processor 5100 series and the AMD Athlon 64, support the SIMD computation model with the Streaming SIMD Extensions (SSE), SSE2 and SSE3. Double-precision SSE/SSE2/SSE3 instructions can process two packed double-precision floating-point data elements simultaneously in 128-bit XMM vector registers, which greatly improves floating-point performance. Sometimes non-consecutive...
Fast Fourier Transform (FFT) is the basis of Digital Signal Processing (DSP). In this paper, a high performance FFT library using the radix-2 decimation in frequency (DIF) algorithm is presented, which is well suited to SIMD architectures. SIMD architecture microprocessors, such as those from Intel and AMD, allow parallel floating point operations on contiguous data in memory. A 128-point FFT based radix-2 DIF algorithm...
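For reference, the radix-2 DIF recursion the abstract names can be written compactly: the even-indexed outputs are the FFT of pairwise sums of the two halves, and the odd-indexed outputs are the FFT of twiddle-scaled pairwise differences. This is a scalar reference sketch of the textbook algorithm, not the paper's SIMD library; in the SIMD version the butterfly additions and twiddle multiplications below operate on packed lanes.

```python
import cmath

def fft_dif(a):
    # Radix-2 decimation-in-frequency FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a[:]
    half = n // 2
    w = [cmath.exp(-2j * cmath.pi * i / n) for i in range(half)]
    top = [a[i] + a[i + half] for i in range(half)]           # -> even outputs
    bot = [(a[i] - a[i + half]) * w[i] for i in range(half)]  # -> odd outputs
    even, odd = fft_dif(top), fft_dif(bot)
    out = [0j] * n
    out[0::2], out[1::2] = even, odd
    return out
```

Each level performs n/2 butterflies on contiguous halves of the array, which is the access pattern that lets SIMD units process several butterflies per instruction.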
Recent computer architectures can be configured in many different ways. To explore this huge design space, system simulators are typically used. As performance is no longer the only decisive factor, but also e.g. power usage or the resource usage of the system, it has become very hard for designers to select optimal configurations. In this article we use a multi-objective design space exploration tool...
Just-in-Time (JIT) compiler technology offers portability while facilitating target- and context-specific specialization. Single-Instruction-Multiple-Data (SIMD) hardware is ubiquitous and markedly diverse, but can be difficult for JIT compilers to efficiently target due to resource and budget constraints. We present our design for a synergistic auto-vectorizing compilation scheme. The scheme is composed...
Parallel Sparse Matrix Vector Multiplication (PSpMV) is a compute intensive kernel used in iterative solvers like Conjugate Gradient, GMRES and Lanczos. Numerous attempts at optimizing this function have been made that require fine tuning of many hardware and software parameters to achieve optimal performance. We attempt to offer a simple framework that involves (i) Employing a greedy algorithm to...
The reuse rate of vector registers is one of the most important factors that influence SIMD performance. However, reusing vector registers can lead to non-contiguous memory accesses and cache misses. Based on register reuse analysis, we establish a cost model that guides the generation of multiple code versions. Then we perform NPB tests with different problem scales, and the results show...