Search results

chapter

FISH: Linux system calls for FPGA accelerators

Kevin Nam, Blair Fort, Stephen Brown

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 4

This paper presents the FISH (FPGA-Initiated Software-Handled) framework, which allows FPGA accelerators to make system calls to the Linux operating system in CPU-FPGA systems. A special FISH Linux kernel module running on the CPU provides a system call interface for FPGA accelerators, much like the ABI which exists for software programs. We provide a proof-of-concept implementation of this framework...

chapter

Preliminary Performance Evaluation of Application Kernels Using ARM SVE with Multiple Vector Lengths

Yuetsu Kodama, Tetsuya Odajima, Motohiko Matsuda, Miwako Tsuji et al.

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 677 - 684

Modern high-performance processors are equipped with very wide SIMD instruction sets. SVE (Scalable Vector Extension) is an ARM® SIMD technology that supports vector lengths from 128 bits to 2048 bits. One of its promising features is "vector-length agnostic" programming, which allows the same SVE code to run on hardware of any vector length without any modification of the code. This...

chapter

Mixed data layout kernels for vectorized complex arithmetic

Doru T. Popovici, Franz Franchetti, Tze Meng Low

2017 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 7

Implementing complex arithmetic routines with Single Instruction Multiple Data (SIMD) instructions requires the use of instructions that are usually not found in their real arithmetic counterparts. These instructions, such as shuffles and addsub, are often bottlenecks for many complex arithmetic kernels as modern architectures usually can perform more real arithmetic operations than execute instructions...

chapter

Triangle counting via vectorized set intersection

Shahir Mowlaei

2017 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 5

In this paper we propose a vectorized sorted set intersection approach for the task of counting the exact number of triangles of a graph on CPU cores. The computation is factorized into reordering and counting kernels where the reordering kernel builds upon the Reverse Cuthill-McKee heuristic.

chapter

Escalating Privileges in Linux Using Voltage Fault Injection

Niek Timmers, Cristofaro Mune

2017 Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC) > 1 - 8

Today's standard embedded device technology is not robust against Fault Injection (FI) attacks such as Voltage Fault Injection (V-FI). FI attacks can be used to alter the intended behavior of software and hardware of embedded devices. Most FI research focuses on breaking the implementation of cryptographic algorithms. However, this paper's contribution is in showing that FI attacks are effective at...

chapter

Cyclops: PRU programming framework for precise timing applications

Amr Alanwar, Fatima M. Anwar, Yi-Fan Zhang, Justin Pearson et al.

2017 IEEE International Symposium on Precision Clock Synchronization for Measurement, Control, and Communication (ISPCS) > 1 - 6

The Beaglebone Black single-board computer is well-suited for real-time embedded applications because its system-on-a-chip contains two "Programmable Real-time Units" (PRUs): 200-MHz microcontrollers that run concurrently with the main 1-GHz CPU that runs Linux. This paper introduces "Cyclops": a web-browser-based IDE that facilitates the development of embedded applications on...

chapter

Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor

James Lin, Zhigeng Xu, Akira Nukada, Naoya Maruyama et al.

2017 46th International Conference on Parallel Processing (ICPP) > 432 - 441

The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer – the Sunway TaihuLight. The design of the limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication"...

chapter

Autotuning GPU Kernels via Static and Predictive Analysis

Robert Lim, Boyana Norris, Allen Malony

2017 46th International Conference on Parallel Processing (ICPP) > 523 - 532

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...

chapter

Optimum Power-Performance GPU Configuration Prediction Based on Code Attributes

Ali Jooya, Nikitas Dimopoulos, Amirali Baniasadi

2017 International Conference on High Performance Computing & Simulation (HPCS) > 418 - 425

GPUs have been widely used in the past decade to speed up the execution of general-purpose applications with a high level of parallelism. The efficiency of running general-purpose applications on GPUs depends on how well the processing and memory demands of the application are balanced with the hardware resources available on the target GPU, and it can significantly affect the power and performance of...

chapter

A Model Driven Approach for Device Driver Development

Yunwei Dong, Yuanyuan He, Yin Lu, Hong Ye

2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C) > 122 - 129

To facilitate the development and maintenance of device drivers integrated into the operating system, a model-driven approach is proposed in this paper for driver design and verification before coding. An architecture model and a behavior model are created to illustrate both static and dynamic characteristics of device drivers, together with a device model and a device-driver-OS interaction model...

chapter

A 142MOPS/mW integrated programmable array accelerator for smart visual processing

Satyajit Das, Davide Rossi, Kevin J. M. Martin, Philippe Coussy et al.

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

Due to increasing demand of low power computing, and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient programmable accelerators. This paper proposes an Integrated Programmable-Array accelerator (IPA) architecture based on an innovative execution model, targeted to accelerate both data and control-flow parts of deeply embedded...

chapter

End-to-end scalable FPGA accelerator for deep residual networks

Yufei Ma, Minkyu Kim, Yu Cao, Sarma Vrudhula et al.

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

This work presents an efficient hardware accelerator design of deep residual learning algorithms, which have shown superior image recognition accuracy (>90% top-5 accuracy on ImageNet database). Two key objectives of the acceleration strategy are to (1) maximize resource utilization and minimize data movements, and (2) employ scalable and reusable computing primitives to optimize physical design...

chapter

Droidsentry: Efficient Code Integrity and Control Flow Verification on TrustZone Devices

Darius-Andrei Suciu, Radu Sion

2017 21st International Conference on Control Systems and Computer Science (CSCS) > 156 - 158

The fast evolution of mobile devices has made them the center of attention for not only the research industry, but also malicious actors, as smartphones are used to store, transmit and process sensitive information. The diversity and number of typically installed applications create windows of opportunity for attackers. Attackers can use vulnerable applications to gain control over the device or change...

chapter

A software technique to enhance register utilization of Convolutional Neural Networks on GPGPUs

Che-Huai Lin, An-Ting Cheng, Bo-Cheng Lai

2017 International Conference on Applied System Innovation (ICASI) > 614 - 617

CNNs (Convolutional Neural Networks) have demonstrated superior results in a wide range of applications. However, the time-consuming convolution operations required by CNNs pose great challenges to designers. GPGPUs (General Purpose Graphic Processing Units) have been widely used to exploit the massive parallelism of convolution operations. This paper proposes a software-based loop-unrolling technique...

chapter

Offloading Communication Control Logic in GPU Accelerated Applications

Elena Agostini, Davide Rossetti, Sreeram Potluri

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) > 248 - 257

NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or between GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition which allows direct synchronization between GPU and third-party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand...

chapter

Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms

Jie Wang, Xinfeng Xie, Jason Cong

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 72 - 81

Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...

chapter

Power Analysis of HLS-Designed Customized Instruction Set Architectures

Tejaswini Ananthanarayana, Sonia Lopez, Marcin Lukowiak

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 207 - 212

Performance and power consumption are key features for evaluating any processor design. In this paper, we pay close attention to the impact on power and energy consumption of customized Instruction Set Architectures (ISAs) designed by means of High Level Synthesis (HLS) tools. We compare these results against a full ISA soft processor, Microblaze. Our customized ISA processors greatly reduce the...

chapter

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation

Siva Kumar Sastry Hari, Timothy Tsai, Mark Stephenson, Stephen W. Keckler et al.

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) > 249 - 258

As GPUs become more pervasive in both scalable high-performance computing systems and safety-critical embedded systems, evaluating and analyzing their resilience to soft errors caused by high-energy particle strikes will grow increasingly important. GPU designers must develop tools and techniques to understand the effect of these soft errors on applications. This paper presents an error injection-based...

chapter

Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration

Ashutosh Dhar, Deming Chen

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) > 48 - 55

GPUs are capable of running a variety of applications; however, their generic parallel architecture can lead to inefficient use of resources and reduced power efficiency, due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and re-distribution as a solution that can be leveraged by...

chapter

VLSI Realization of Lanczos Interpolation for a Generic Video Scaling Algorithm

S. Safinaz, A. V. Ravi Kumar

2017 International Conference on Recent Advances in Electronics and Communication Technology (ICRAECT) > 17 - 23

Video scaling is a process of resizing a digital frame for preferred view-ability without losing the original content of the video, involving a trade-off between efficiency, smoothness and sharpness. In this research paper, a Generic Algorithm is proposed for enhancement of a motion picture with a given scaling factor without compromising on the picture quality. The proposed algorithm has been verified...

INFONA - science communication portal
