The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has remained unanswered so far though is to...
In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have a tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes the portability and reusability across...
Multiprocessor System-on-Chip (MPSoC) platforms have become increasingly popular for high-performance embedded applications. Each processing element (PE) on such platforms can be tuned to match the computational demands of the tasks executing on it, creating a heterogeneous multiprocessor system. Extensible processor cores, where the base instruction-set architecture can be augmented with application-specific...
When a processor implementation is synthesized from a specification using an automatic framework, this implementation still should be verified against its specification to ensure the automatic framework introduced no error. This paper presents our effort in integrating fully automated formal verification with a high-level processor pipeline synthesis framework. As an integral part of the pipeline...
Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The light weight processor tiles and the reconfigurable hardware tiles in our architecture...
This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics...
Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances of the graphics processing units (GPU) help real-time dose calculation become a reachable goal. However, the performance of the known GPU methods for volume ray tracing is all bounded by the memory-throughput, which leads to inefficient usage...
Elliptic Curve Cryptography (ECC) is popular for digital signatures and other public-key crypto-applications in embedded contexts. However, ECC is computationally intensive, and in particular the performance of the underlying modular arithmetic remains a concern. We investigate the design space of ECC on TI's OMAP 3530 platform, with a focus on using OMAP's DSP core to accelerate ECC computations...
The ability to naturally interact with devices is becoming increasingly important. Speech recognition is one well-known solution to provide easy, hands-free user-device interaction. However, speech recognition has significant computation and memory bandwidth requirements, making it challenging to offer at high performance, real-time and ultra-low power for handheld devices. In this paper, we present...
Application-specific programmable processors are increasingly being replaced by FPGAs, which offer high levels of logic density, rich sets of embedded hardware blocks, and a high degree of customizability and reconfigurability. New FPGA features such as Dynamic Partial Reconfiguration (DPR) can be leveraged to reduce resource utilization and power consumption while still providing high levels of performance...
Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is used since printers have a limited availability of inks and cannot reproduce all the color intensities in a continuous image. Error Diffusion is an algorithm in halftoning that iteratively quantizes pixels in a neighborhood dependent fashion. This...
GPU device typically has a higher off-chip bandwidth than FPGA-based systems. Thus typically GPU should perform better for bandwidth-bounded massive parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs...
The availability of huge amounts of nucleotide sequences catalyzes the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can only handle small scale problems. When querying large genomes, their performance becomes unacceptable. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirement is high...
In the last decade, there has been a dramatic growth in research and development of massively parallel many-core architectures like graphics hardware, both in academia and industry. This changed also the way programs are written in order to leverage the processing power of a multitude of cores on the same hardware. In the beginning, programmers had to use special graphics programming interfaces to...
The graphics processor unit (GPU) is able to provide a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map the real-world applications to GPU and fully utilize the computational power of GPU. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application:...
Programmable multi-core and many-core platforms increase exponentially the challenge of task mapping and scheduling, provided that enough task-parallelism does exist for each application. This problem worsens when dealing with small ecosystems such as embedded systems-on-chip. In fact, in this case, the assumption of exploiting a traditional operating system is out of context given the memory available...
Application Robustification, a promising approach for reducing processor power, converts applications into numerical optimization problems and solves them using gradient descent and conjugate gradient algorithms [1]. The improvement in robustness, however, comes at the expense of performance when compared to the baseline non-iterative versions of these applications. To mitigate the performance loss...
Throughput of wireless communication standards ever increases. Computation requirements for systems implementing those standards increase even more. On battery operated devices, next to high performance a low power implementation is also crucial. Reaching this is only possible by utilizing parallelizations at all levels. The ADRES processor is an embedded coarse-grained reconfigurable baseband processor...
This paper proposes a new hardware accelerator to speed up the performance of vector graphics applications on complex embedded systems. The resulting hardware accelerator is synthesized on a field-programmable gate array (FPGA) and integrated with software components. The paper also introduces a hardware/software co-verification environment which provides in-system at-speed functional verification...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.