Search results

chapter

Assessing Sparse Triangular Linear System Solvers on GPUs

Daniel Erguiz, Ernesto Dufrechou, Pablo Ezzatti

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 37 - 42

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)

An important number of Numerical Linear Algebra methods to tackle problems in diverse fields of science and engineering, rely heavily on the solution of one or many sparse triangular linear systems. Since the early years, this has motivated numerous efforts that seek to produce efficientimplementations of this kernel for most hardware platforms. However, this operation implies strong data dependencies...

chapter

Linear time complexity GF(256) RaptorQ implementation on GPU

Sunwoong Joo

2017 International Conference on Information and Communication Technology Convergence (ICTC) > 282 - 284

2017 International Conference on Information and Communication Technology Convergence (ICTC)

RaptorQ is the most advanced raptor code and has an overhead-failure curve close to the random fountain code over the GF(256) finite field. Theoretically, it is possible to encode and decode with linear time complexity by an inactivation decoding algorithm, which is a hybrid algorithm of belief propagation and Gaussian elimination. However, achieving linear time complexity in a real-world implementation...

chapter

Optimization of GPU and CPU acceleration for neural networks layers implemented in python

Radu Dogaru, Ioana Dogaru

2017 5th International Symposium on Electrical and Electronics Engineering (ISEEE) > 1 - 6

2017 5th International Symposium on Electrical and Electronics Engineering (ISEEE)

Many neural architectures including RBF, SVM, FSVC classifiers, or deep-learning solutions require the efficient implementation of neurons layers, each of them having a given number of m neurons, a specific set of parameters and operating on a training or test set of N feature vectors having each a dimension n. Herein we investigate how to allocate the computation on GPU kernels and how to better...

chapter

Evaluation of GPU/CPU co-processing models for JPEG 2000 packetization

Volker Bruns, Miguel A. Martinez-del-Amor, Heiko Sparenberg

2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP) > 1 - 6

2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP)

With the bottom-line goal of increasing the throughput of a GPU-accelerated JPEG 2000 encoder, this paper evaluates whether the post-compression rate control and packetization routines should be carried out on the CPU or on the GPU. Three co-processing models that differ in how the workload is split among the CPU and GPU are introduced. Both routines are discussed and algorithms for executing them...

chapter

Dataflow Programming for Stream Processing

Marcos P. Rocha, Felipe M.G. Franca, Alexandre S. Nery, Leandro S. Guedes

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 103 - 108

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)

Stream processing applications have high-demanding performance requirements that are hard to tackle using traditional parallel models on modern many-core architectures, such as GPUs. On the other hand, recent dataflow computing models can naturally exploit parallelism for a wide class of applications. This work presents an extension to an existing dataflow library for Java. The library extension implements...

chapter

Parallel MLEM algorithm using GPU

T. A. Valencia-Perez

2017 14th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE) > 1 - 6

2017 14th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE)

The Computed Tomography (CT) is a imaging method based on X-rays to obtain cross-sectional images from an object. It is a widely used method in several areas, such as medicine, archeology or material sciences. Tomographic reconstruction techniques, use the projections of images from multiple directions. There are several algorithms for this purpose but can be classified according to their reconstruction...

chapter

Introducing parallel computing concepts in computer system related courses

Han Wan, Xiaopeng Gao, Xiang Long, Bo Jiang

2017 IEEE Frontiers in Education Conference (FIE) > 1 - 7

2017 IEEE Frontiers in Education Conference (FIE)

All semiconductor market domains are converging to concurrent platforms. This trend has certainly led real challenge to develop applications software that effectively uses these concurrent processors to achieve efficiency and performance goals. This paper argues that the Computer System related courses are natural places to introduce the parallelism, and the earlier to parallel computing concepts...

chapter

Performance characterization, prediction, and optimization for heterogeneous systems with multi-level memory interference

Shin-Ying Lee, Carole-Jean Wu

2017 IEEE International Symposium on Workload Characterization (IISWC) > 43 - 53

2017 IEEE International Symposium on Workload Characterization (IISWC)

Modern computer systems are accelerator-rich, equipped with many types of hardware accelerators to speed up computation. For example, graphics processing units (GPUs) are a type of accelerators that are widely employed to accelerate parallel workloads. In order to well utilize different accelerators to gain better execution time speedup or reduce total energy consumption, many scheduling algorithms...

chapter

GScheduler: Optimizing resource provision by using GPU usage pattern extraction in cloud environments

Zhuqing Xu, Fang Dong, Jiahui Jin, Junzhou Luo, more

2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC) > 3225 - 3230

2017 IEEE International Conference on Systems, Man and Cybernetics (SMC)

GPU-based clusters are widely chosen for accelerating a variety of scientific applications in high-end cloud environments. With their growing popularity, there is a necessity for improving the system throughput and decreasing the turnaround time for co-executing applications on the same GPU device. However, resource contention among multiple applications on a multi-tasked GPU leads to the performance...

chapter

General-purpose computing on GPU: Pixel processing

Milos Ockay

2017 Communication and Information Technologies (KIT) > 1 - 4

2017 Communication and Information Technologies (KIT)

Presented paper explains general purpose approach to the parallel pixel processing on GPU. It presents essential dataset structuring, correct type assignment and kernel configuration for CUDA application interface. Paper also explains data movement and optimal computation saturation. Transfers are also analyzed in correlation with the computation especially for the embarrassingly parallel problem...

chapter

VLAG: A very fast locality approximation model for GPU kernels with regular access patterns

Mohsen Kiani, Amir Rajabzadeh

2017 7th International Conference on Computer and Knowledge Engineering (ICCKE) > 260 - 265

2017 7th International Conference on Computer and Knowledge Engineering (ICCKE)

Performance modeling plays an important role for optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source code-level information to estimate per memory-access instruction, per data array, and per kernel localities within GPU kernels. VLAG...

chapter

Moka: Model-based concurrent kernel analysis

Leiming Yu, Xun Gong, Yifan Sun, Qianqian Fang, more

2017 IEEE International Symposium on Workload Characterization (IISWC) > 197 - 206

2017 IEEE International Symposium on Workload Characterization (IISWC)

GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device...

chapter

FLiT: Cross-platform floating-point result-consistency tester and workload

Geof Sawaya, Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, more

2017 IEEE International Symposium on Workload Characterization (IISWC) > 229 - 238

2017 IEEE International Symposium on Workload Characterization (IISWC)

Understanding the extent to which computational results can change across platforms, compilers, and compiler flags can go a long way toward supporting reproducible experiments. In this work, we offer the first automated testing aid called FLiT (Floating-point Litmus Tester) that can show how much these results can vary for any user-given collection of computational kernels. Our approach is to take...

chapter

Fault simulation acceleration for TRAX dictionary construction using GPUs

Matthew Beckler, R. D. Blanton

2017 IEEE International Test Conference (ITC) > 1 - 9

2017 IEEE International Test Conference (ITC)

To ensure robustness of integrated systems, the TRAnsition-X (TRAX) fault model has been used with on-chip test and diagnosis hardware, utilizing fault dictionaries for diagnosis. Generating a fault dictionary requires fault simulation with no fault dropping, requiring extensive computational resources. This paper presents the design and implementation of an efficient fault simulator for the TRAX...

chapter

Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations

Yu-Sheng Lin, Wei-Chao Chen, Shao-Yi Chien

2017 IEEE International Conference on Computer Vision (ICCV) > 4587 - 4595

2017 IEEE International Conference on Computer Vision (ICCV)

Recently, convolutional neural networks (CNNs) have achieved great success in fields such as computer vision, natural language processing, and artificial intelligence. Many of these applications utilize parallel processing in GPUs to achieve higher performance. However, it remains a daunting task to optimize for GPUs, and most researchers have to rely on vendor-provided libraries for such purposes...

chapter

A programming model and runtime system for approximation-aware heterogeneous computing

Ioannis Parnassos, Nikolaos Bellas, Nikolaos Katsaros, Nikolaos Patsiatzis, more

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 4

2017 27th International Conference on Field Programmable Logic and Applications (FPL)

Heterogeneous platforms that include diverse architectures such as multicore CPUs, FPGAs and GPUs are becoming very popular due to their superior performance and energy efficiency. Besides heterogeneity, a promising approach for minimizing energy consumption is through approximate computing which relaxes the requirement that all parts of a program are considered equally important to the output quality,...

chapter

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

Yingchao Huang, Dong Li

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 166 - 177

2017 IEEE International Conference on Cluster Computing (CLUSTER)

A heterogeneous memory system (HMS) consists of multiple memory components with different properties. GPU is a representative architecture with HMS. It is challenging to decide optimal placement of data objects on HMS because of the large exploration space and complicated memory hierarchy on HMS. In this paper, we introduce performance modeling techniques to predict performance of various data placements...

chapter

cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs

Behnam Pourghassemi, Aparna Chandramowlishwaran

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 725 - 732

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although...

chapter

Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs

Lester Kalms, Diana Gohringer

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 4

2017 27th International Conference on Field Programmable Logic and Applications (FPL)

Due to energy efficiency, heterogeneous computing is gaining more and more attention. Since FPGA implementations are time consuming, high-level synthesis (HLS) is used to close the productivity gap. OpenCL has become accepted as a good programming model for HLS, due to its portability, good capability of design verification and rich instruction set. This work implements different optimization strategies...

chapter

GPU-accelerated fault dictionary generation for the TRAX fault model

Matthew Beckler, R. D. Shawn Blanton

2017 International Test Conference in Asia (ITC-Asia) > 34 - 39

2017 International Test Conference in Asia (ITC-Asia)

This paper presents the design and implementation of a fault simulator for the TRAnsition-X fault model (TRAX for short) on a graphics processing unit (GPU). Fault dictionaries are an important aspect of on-chip fault detection and diagnosis. Generating a fault dictionary requires fault simulation with no fault dropping, requiring extensive computational resources. The inherent parallelism of the...

INFONA - science communication portal

Search results

Assessing Sparse Triangular Linear System Solvers on GPUs

Linear time complexity GF(256) RaptorQ implementation on GPU

Optimization of GPU and CPU acceleration for neural networks layers implemented in python

Evaluation of GPU/CPU co-processing models for JPEG 2000 packetization

Dataflow Programming for Stream Processing

Parallel MLEM algorithm using GPU

Introducing parallel computing concepts in computer system related courses

Performance characterization, prediction, and optimization for heterogeneous systems with multi-level memory interference

GScheduler: Optimizing resource provision by using GPU usage pattern extraction in cloud environments

General-purpose computing on GPU: Pixel processing

VLAG: A very fast locality approximation model for GPU kernels with regular access patterns

Moka: Model-based concurrent kernel analysis

FLiT: Cross-platform floating-point result-consistency tester and workload

Fault simulation acceleration for TRAX dictionary construction using GPUs

Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations

A programming model and runtime system for approximation-aware heterogeneous computing

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs

Exploration of OpenCL for FPGAs using SDAccel and comparison to GPUs and multicore CPUs

GPU-accelerated fault dictionary generation for the TRAX fault model

Filter options

Publication date

Keywords

INFONA - science communication portal

Search results

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options