Search results

chapter

Implementing Lattice QCD Application with XcalableACC Language on Accelerated Cluster

Masahiro Nakao, Hitoshi Murai, Hidetoshi Iwashita, Akihiro Tabuchi, more

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 429 - 438

Accelerated clusters, which are distributed memory systems equipped with accelerators, have been used in various fields. For accelerated clusters, programmers often implement their applications by a combination of MPI and CUDA (MPI+CUDA). However, the approach faces programming complexity issues. This paper introduces the XcalableACC (XACC) language, which is a hybrid model of XcalableMP (XMP) and...

chapter

A GPU-Friendly Skiplist Algorithm

Nurit Moscovici, Nachshon Cohen, Erez Petrank

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 246 - 259

We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent...

chapter

Parallel triangle counting and k-truss identification using graph-centric methods

Chad Voegele, Yi-Shan Lu, Sreepathi Pai, Keshav Pingali

2017 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 7

We describe CPU and GPU implementations of parallel triangle-counting and k-truss identification in the Galois and IrGL systems. Both systems are based on a graph-centric abstraction called the operator formulation of algorithms. Depending on the input graph, our implementations are two to three orders of magnitude faster than the reference implementations provided by the IEEE HPEC static graph challenge.

chapter

In-memory Data Flow Processor

Daichi Fujiki, Scott Mahlke, Reetuparna Das

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 375

Recent development of Non-Volatile Memories (NVMs) has opened up a new horizon for in-memory computing. By re-purposing memory structures, certain NVMs have been shown to have in-situ analog computation capability. For example, resistive memories (ReRAMs) store the data in the form of resistance of titanium oxides, and by injecting voltage into the word line and sensing the resultant current on the...

chapter

Directive-Based Partitioning and Pipelining for Graphics Processing Units

Xuewen Cui, Thomas R. W. Scogland, Bronis R. de Supinski, Wu-chun Feng

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 575 - 584

The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication...

chapter

AnalyzeThat: A Programmable Shared-Memory System for an Array of Processing-In-Memory Devices

Sangkuen Lee, Hyogi Sim, Youngjae Kim, Sudharshan S. Vazhkudai

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) > 619 - 624

Processing In Memory (PIM), the concept of integrating processing directly with memory, has been attracting a lot of attention since PIM can assist in overcoming the throughput limitation caused by data movement between CPU and memory. The challenge, however, is that it requires the programmers to have a deep understanding of the PIM architecture to maximize the benefits such as data locality and...

chapter

Towards a GraphBLAS Library in Chapel

Ariful Azad, Aydin Buluc

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1095 - 1104

The adoption of a programming language is positively influenced by the breadth of its software libraries. Chapel is a modern and relatively young parallel programming language. Consequently, not many domain-specific software libraries exist that are written for Chapel. Graph processing is an important domain with many applications in cyber security, energy, social networking, and health. Implementing...

chapter

HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems

Yonghong Yan, Jiawen Liu, Kirk W. Cameron, Mariam Umar

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 788 - 798

Heterogeneous computing systems, e.g., those with accelerators in addition to the host CPUs, offer accelerated performance for a variety of workloads. However, most parallel programming models require platform-dependent, time-consuming hand-tuning efforts for collectively using all the resources in a system to achieve efficient results. In this work, we explore the use of OpenMP parallel language extensions...

chapter

A scalable and composable map-reduce system

Mahwish Arif, Hans Vandierendonck, Dimitrios S. Nikolopoulos, Bronis R. de Supinski

2016 IEEE International Conference on Big Data (Big Data) > 2233 - 2242

This paper presents a novel map-reduce runtime system that is designed for scalability and for composition with other parallel software. We use a modified programming interface that expresses reduction operations over data containers as opposed to key-value pairs. This design choice admits higher efficiency as the programmer can select appropriate data structures. Our runtime targets shared memory...

chapter

A 28nm HKMG super low power embedded NVM technology based on ferroelectric FETs

M. Trentzsch, S. Flachowsky, R. Richter, J. Paul, more

2016 IEEE International Electron Devices Meeting (IEDM) > 11.5.1 - 11.5.4

We successfully implemented a one-transistor (1T) ferroelectric field effect transistor (FeFET) eNVM into a 28nm gate-first super low power (28SLP) CMOS technology platform using two additional structural masks. The electrical baseline properties remain the same for the FeFET integration and the JTAG-controlled 64 kbit memory shows clearly separated states. High temperature retention up to 250 °C...

chapter

A Directive-Based Data Layout Abstraction for Performance Portability of OpenACC Applications

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) > 1147 - 1154

Directive-based programming interfaces such as OpenACC and OpenMP are becoming more prevalent in application development targeting accelerators, in particular when porting existing CPU-only code. Unlike vendor-specific alternatives such as CUDA, they are designed to be portable across different accelerators, and therefore once necessary directives are added to an existing CPU-only code, it can be...

chapter

OpenACC Cache Directive: Opportunities and Optimizations

Ahmad Lashgar, Amirali Baniasadi

2016 Third Workshop on Accelerator Programming Using Directives (WACCPD) > 46 - 56

OpenACC's programming model presents a simple interface to programmers, offering a trade-off between performance and development effort. OpenACC relies on compiler technologies to generate efficient code and optimize for performance. Among the directives that are difficult to implement is the cache directive. The cache directive allows the programmer to utilize the accelerator's hardware- or software-managed...

chapter

PGAS Communication Runtime for Extreme Large Data Computation

Ryo Matsumiya, Toshio Endo

2016 Second International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) > 10 - 16

For partitioned global address space (PGAS) runtimes, supporting out-of-core data computation is an important issue. Some researchers showed that flash SSDs are useful for out-of-core data computation. In this paper, we introduce ComEx-PM, a PGAS communication runtime. ComEx-PM supports out-of-core data computation using a flash SSD. ComEx-PM launches multiple processes in each node. Memory region...

chapter

Teaching MPI from Mental Models

Victor Eijkhout

2016 Workshop on Education for High-Performance Computing (EduHPC) > 14 - 18

The Message Passing Interface (MPI) is the de facto standard for programming large scale parallelism, with up to millions of individual processes. Its dominant paradigm of Single Program Multiple Data (SPMD) programming is different from threaded and multicore parallelism, to an extent that students have a hard time switching models. In contrast to threaded programming, which allows for a view of...

chapter

Application of PGAS Programming to Power Grid Simulation

Bruce Palmer

2016 PGAS Applications Workshop (PAW) > 33 - 40

This paper will describe the application of the PGAS Global Arrays (GA) library to power grid simulations. The GridPACK™ framework has been designed to enable power grid engineers to develop parallel simulations of the power grid by providing a set of templates and libraries that encapsulate most of the details of parallel programming in higher level abstractions. The communication portions of the...

chapter

An Extension of OpenACC Directives for Out-of-Core Stencil Computation with Temporal Blocking

Nobuhiro Miki, Fumihiko Ino, Kenichi Hagihara

2016 Third Workshop on Accelerator Programming Using Directives (WACCPD) > 36 - 45

In this paper, aiming at realizing directive-based temporal blocking for out-of-core stencil computation, we present an extension of OpenACC directives and a source-to-source translator capable of accelerating out-of-core stencil computation on a graphics processing unit (GPU). Out-of-core stencil computation here deals with large data that cannot be entirely stored in GPU memory. Given an OpenACC-like...

chapter

The BigDAWG polystore system and architecture

Vijay Gadepally, Peinan Chen, Jennie Duggan, Aaron Elmore, more

2016 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

Organizations are often faced with the challenge of providing data management solutions for large, heterogeneous datasets that may have different underlying data and programming models. For example, a medical dataset may have unstructured text, relational data, time series waveforms and imagery. Trying to fit such datasets in a single data management system can have adverse performance and efficiency...

chapter

Exploiting recent SIMD architectural advances for irregular applications

Linchuan Chen, Peng Jiang, Gagan Agrawal

2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 47 - 58

A broad class of applications involve indirect or data-dependent memory accesses and are referred to as irregular applications. Recent developments in SIMD architectures — specifically, the emergence of wider SIMD lanes, combination of SIMD parallelism with many-core MIMD parallelism, and more flexible programming APIs — are providing new opportunities as well as challenges for this class of applications...

chapter

DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications

Xitian Fan, Huimin Li, Wei Cao, Lingli Wang

2016 26th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 9

This paper presents a new type of coarse-grained reconfigurable architecture (CGRA) for the object inference domain in machine learning. The proposed CGRA is optimized for stream processing, and a corresponding programming model, called the dual-track model, is proposed. The CGRA is realized in Verilog HDL and implemented in the SMIC 55 nm process, with a footprint of 3.79 mm² and consuming 1.79 W at 500 MHz...

chapter

An OpenACC Optimizer for Accelerating Histogram Computation on a GPU

Kei Ikeda, Fumihiko Ino, Kenichi Hagihara

2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) > 468 - 477

This paper presents a source-to-source OpenACC optimizer that automatically optimizes a histogram computation code for a graphics processing unit (GPU). Parallel histogram computation codes typically deploy multiple copies of histograms and update them with atomic operations. This duplication method can be implemented as an OpenACC code. However, the structure of sequential code blocks must be manually...

INFONA - science communication portal
