The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
This, paper presents the FISH (FPGA-Initiated Software-Handled) framework which allows FPGA accelerators to make system calls to the Linux operating system in CPU-FPGA systems. A special FISH Linux kernel module running on the CPU provides a system call interface for FPGA accelerators, much like the ABI which exists for software programs. We provide a proof-of-concept implementation of this framework...
Modern high performance processors are equipped with very wide SIMD instruction set. SVE (Scalable Vector Extension) is an ARM® SIMD technology that supports vector lengths from 128 bits to 2048 bits. One of its promising features is to offer "vector-length agnostic" programming to allow the same SVE code to run on hardware of any vector length without any modification of the code. This...
Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real arithmetic counter-parts. These instructions, such as shuffles and addsub, are often bottlenecks for many complex arithmetic kernels as modern architectures usually can perform more real arithmetic operations than execute instructions...
In this paper we propose a vectorized sorted set intersection approach for the task of counting the exact number of triangles of a graph on CPU cores. The computation is factorized into reordering and counting kernels where the reordering kernel builds upon the Reverse Cuthill-McKee heuristic.
Today's standard embedded device technology is not robust against Fault Injection (FI) attacks such as Voltage Fault Injection (V-FI). FI attacks can be used to alter the intended behavior of software and hardware of embedded devices. Most FI research focuses on breaking the implementation of cryptographic algorithms. However, this paper's contribution is in showing that FI attacks are effective at...
The Beaglebone Black single-board computer is well-suited for real-time embedded applications because its system-on-a-chip contains two "Programmable Real-time Units" (PRUs): 200-MHz microcontrollers that run concurrently with the main 1-GHz CPU that runs Linux. This paper introduces "Cyclops": a web-browser-based IDE that facilitates the development of embedded applications on...
The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer – the Sunway TaihuLight. The design of the limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication"...
Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...
GPUs have been widely used in the past decade to speed up the execution of general purpose applications with high level of parallelism. The efficiency of running general purpose applications on GPUs depends on how well the processing and memory demands of the application is balanced with the hardware resources available on the target GPU and it can significantly affect the power and performance of...
In order to facilitate the development and maintenance of device drivers integrated into the operating system, a model driven approach is proposed in this pater for driver design and verification before codding. Architecture model and behavior model are created to illustrate both static and dynamic characteristics of device drivers, in company with device model and device-driver-O.S. interaction model...
Due to increasing demand of low power computing, and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient programmable accelerators. This paper proposes an Integrated Programmable-Array accelerator (IPA) architecture based on an innovative execution model, targeted to accelerate both data and control-flow parts of deeply embedded...
This work presents an efficient hardware accelerator design of deep residual learning algorithms, which have shown superior image recognition accuracy (>90% top-5 accuracy on ImageNet database). Two key objectives of the acceleration strategy are to (1) maximize resource utilization and minimize data movements, and (2) employ scalable and reusable computing primitives to optimize physical design...
The fast evolution of mobile devices has made them the center of attention for not only the research industry, but also malicious actors, as smartphones are used to store, transmit and process sensitive information. The diversity and number of typically installed applications create windows of opportunity for attackers. Attackers can use vulnerable applications to gain control over the device or change...
CNNs (Convolutional Neural Networks) have demonstrated superior results in a wide range of applications. However, the time-consuming convolution operations required by CNNs pose great challenges to designers. GPGPUs (General Purpose Graphic Processing Units) have been widely used to exploiting the massive parallelism of convolution operations. This paper proposes a software-based loop-unrolling technique...
NVIDIA GPUDirect is a family of technologiesaimed at optimizing data movement among GPUs (P2P) orbetween GPUs and third-party devices (RDMA). GPUDirectAsync, introduced in CUDA 8.0, is a new addition whichallows direct synchronization between GPU and third partydevices. For example, Async allows an NVIDIA GPU to directlytrigger and poll for completion of communication operationsqueued to an InfiniBand...
Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, it was the case that there is limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...
Performance and power consumption are key features for evaluating any processor design. In this paper, we present close attention to the impact on power and energy consumption of customized Instruction SetArchitecture (ISA) designed by means of High Level Synthesis (HLS) tools. We compare these results against a full ISA soft processor, Microblaze. Our customized ISA processors greatly reduce the...
As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors caused by high-energy particle strikes will grow increasingly important. GPU designers must develop tools and techniques to understand the effect of these soft errors on applications. This paper presents an error injection-based...
GPUs are capable of running a variety of applications, however their generic parallel-architecture can lead to inefficient use of resources and reduced power efficiency, due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and re-distribution as a solution that can be leveraged by...
Video scaling is a process of resizing a digital frame for preferred view-ability without losing the original content of the video, involving a trade-off between efficiency, smoothness and sharpness. In this research paper, a Generic Algorithm is proposed for enhancement of a motion picture with a given scaling factor without compromising on the picture quality. The proposed algorithm has been verified...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.