Search results

chapter

Performance and Energy Analysis of OpenMP Runtime Systems with Dense Linear Algebra Algorithms

Joao V.F. Lima, Issam Rais, Laurent Lefevre, Thierry Gautier

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 7-12

In this paper, we analyse performance and energy consumption of four OpenMP runtime systems over a NUMA platform. We present an experimental study to characterize OpenMP runtime systems on the three main kernels in dense linear algebra algorithms (Cholesky, LU and QR) in terms of performance and energy consumption. Our experimental results suggest that OpenMP runtime systems can be considered as a...

chapter

Vectorization-Aware Loop Optimization with User-Defined Code Transformations

Hiroyuki Takizawa, Thorsten Reimann, Kazuhiko Komatsu, Takashi Soga, et al.

2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 685-692

The cost of maintaining an application code would significantly increase if the application code is branched into multiple versions, each of which is optimized for a different architecture. In this work, default and vector versions of a real-world application code are refactored into a single version, and the differences between the versions are expressed as user-defined code transformations. As a...

chapter

LibHSA: One step towards mastering the era of heterogeneous hardware accelerators using FPGAs

Marc Reichenbach, Philipp Holzinger, Konrad Haublein, Tobias Lieske, et al.

2017 Conference on Design and Architectures for Signal and Image Processing (DASIP), pp. 1-6

Various signal and image processing applications require substantial acceleration to enable real-time processing and meet power-consumption constraints. On FPGAs, these applications can be implemented as application-specific circuits. Although IP cores for various applications exist, even interfacing with them usually requires expert knowledge of hardware design. Using FPGAs or other accelerators...

chapter

Loop Overhead Reduction Techniques for Coarse Grained Reconfigurable Architectures

Kanishkan Vadivel, Mark Wijtvliet, Roel Jordans, Henk Corporaal

2017 Euromicro Conference on Digital System Design (DSD), pp. 14-21

Due to their flexibility and high performance, Coarse-Grained Reconfigurable Arrays (CGRAs) are a topic of increasing research interest. However, CGRAs also have the potential to achieve very high energy efficiency compared with other reconfigurable architectures when hardware optimizations are applied. Some of these optimizations are common in more traditional processors but can also lead to large...

chapter

Automatic Control Flow Generation for OpenVX Graphs

Merten Popp, Stef van Son, Orlando Moreira

2017 Euromicro Conference on Digital System Design (DSD), pp. 198-204

Heterogeneous platforms with large numbers of processing elements (PEs) have been proposed to satisfy the computational requirements of computer vision applications. Limiting the resulting communication cost is key to meeting the power constraints of embedded devices. We present a new heuristic to reduce communication among PEs and to external memory by aggregating inter-process communication and...

chapter

Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors

Athena Elafrou, Georgios Goumas, Nectarios Koziris

2017 46th International Conference on Parallel Processing (ICPP), pp. 292-301

This paper presents a low-overhead optimizer for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel. Architectural diversity among different processors, together with structural diversity among different sparse matrices, leads to bottleneck diversity. This justifies an SpMV optimizer that is both matrix- and architecture-adaptive through runtime specialization. To this end, we present...
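For readers unfamiliar with the kernel being optimized, the following is a minimal reference sketch of SpMV (y = A·x) for a matrix stored in Compressed Sparse Row (CSR) format. It assumes CSR storage and a plain scalar loop; it is not the optimizer described in the paper, and the function name `spmv_csr` is hypothetical.

```python
from typing import List


def spmv_csr(row_ptr: List[int], col_idx: List[int], values: List[float],
             x: List[float]) -> List[float]:
    """Compute y = A @ x for a sparse matrix A stored in CSR format.

    The nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1
    of col_idx (their column indices) and values (their values).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Indirect access to x through col_idx: its pattern depends on
        # the matrix structure, one reason SpMV bottlenecks vary by matrix.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# A = [[4, 0, 1],
#      [0, 0, 2],
#      [3, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
print(spmv_csr(row_ptr, col_idx, values, x))  # [7.0, 6.0, 13.0]
```

The inner loop length varies with the number of nonzeros per row, so load balance and memory traffic both depend on the sparsity pattern, which is the kind of variation an adaptive optimizer has to account for.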

chapter

Hierarchical Read/Write Analysis for Pointer-Based OpenCL Programs on RRAM

Lin-Ya Yu, Shao-Chung Wang, Jenq-Kuen Lee

2017 46th International Conference on Parallel Processing Workshops (ICPPW), pp. 45-52

Heterogeneous computing platforms, containing a wide range of computing resources from CPUs to specialized hardware accelerators, are today's trend, driven by the physical limits on processor speed and the increasing demand for computing performance. Hence, many optimization strategies have been studied to achieve higher throughput and lower energy consumption in heterogeneous systems. Various memory...

chapter

Understanding the Performances of Sparse Compression Formats Using Data Parallel Programming Model

Ichrak Mehrez, Olfa Hamdi-Larbi, Thomas Dufaud, Nahid Emad

2017 International Conference on High Performance Computing & Simulation (HPCS), pp. 667-674

Several applications in numerical scientific computing involve very large sparse matrices with a regular or irregular sparse structure. These matrices can be stored using special compression formats (storing only non-zero elements) to reduce memory space and processing time. The choice of the optimal format is a critical process that involves several criteria. The general context of this work is to...
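To illustrate what "storing only non-zero elements" means in practice, here is a minimal sketch that converts a dense matrix to CSR, one widely used compression format chosen here purely for illustration (not necessarily one of the formats studied in the paper). The helper names `dense_to_csr` and `csr_entries` are hypothetical.

```python
from typing import List, Tuple


def dense_to_csr(dense: List[List[float]]) -> Tuple[List[int], List[int], List[float]]:
    """Convert a dense row-major matrix to CSR, keeping only nonzero entries."""
    row_ptr = [0]
    col_idx: List[int] = []
    values: List[float] = []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)   # column index of the nonzero
                values.append(v)    # the nonzero value itself
        row_ptr.append(len(values))  # offset where the next row starts
    return row_ptr, col_idx, values


def csr_entries(n_rows: int, nnz: int) -> int:
    """Stored entries in CSR: nnz values, nnz column indices, n_rows + 1 pointers."""
    return 2 * nnz + n_rows + 1


dense = [[4.0, 0.0, 1.0],
         [0.0, 0.0, 2.0],
         [3.0, 5.0, 0.0]]
print(dense_to_csr(dense))
# ([0, 2, 3, 5], [0, 2, 2, 0, 1], [4.0, 1.0, 2.0, 3.0, 5.0])

# For a hypothetical 1000 x 1000 matrix with 5000 nonzeros:
print(csr_entries(1000, 5000), "vs", 1000 * 1000)  # 11001 vs 1000000
```

For the tiny 3 x 3 example, CSR actually stores more entries (14) than the dense layout (9); compression pays off only when nonzeros are a small fraction of the matrix, which is one reason the choice of format depends on the matrix at hand.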

chapter

Evaluation of HPC-Big Data Applications Using Cloud Platforms

Shweta Salaria, Kevin Brown, Hideyuki Jitsumoto, Satoshi Matsuoka

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 1053-1061

The path to HPC-Big Data convergence has produced numerous studies that demonstrate the performance trade-off between running applications on supercomputers and on cloud platforms. Previous studies typically focus either on scientific HPC benchmarks or on previous cloud configurations, failing to consider all the new opportunities offered by current cloud offerings. We present a comparative study...

chapter

vStarCloud: An operating system architecture for Cloud computing

Zhang Lufei, Chen Zuoning

2017 IEEE 2nd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp. 271-275

In recent years, cloud computing has increasingly abstracted away the operating system, allowing developers to focus higher up the stack, on applications rather than infrastructure. In this position paper, we analyze the deficiencies of traditional operating systems in the cloud environment and the features of existing cloud operating systems, and we propose an operating system architecture for cloud...

chapter

An improved automatic MPI code generation algorithm for parallelizing compilation

Yangxia Xiang, Caisen Chen, Hongyan Wang, Zeyun Zhou

2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), pp. 1623-1626

Open64 is an open-source compiler with powerful analysis capabilities and is widely used as a research and commercial development platform. However, it was not designed to support MPI parallelization. This paper makes several contributions. First, the Open64 compiler infrastructure is presented. Second, the location of MPI code generation in the Open64 compiler architecture is analyzed. Third,...

chapter

From exaflop to exaflow

Tobias Becker, Pavel Burovskiy, Anna Maria Nestorov, Hristina Palikareva, et al.

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 404-409

Exascale computing faces a gap between the ever-increasing demand for application performance and the underlying chip technology, which no longer delivers the expected exponential increases in CPU performance. The industry is now progressively moving towards dedicated accelerators to deliver high performance and better energy efficiency. However, the question of programmability still remains...

chapter

Exploiting loop-dependent Stream Reuse for stream processors

Xuejun Yang, Ying Zhang, Jingling Xue, Ian Rogers, et al.

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 22-31

Memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers, reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside stream processors does not directly enable the...

chapter

NetSlices: Scalable multi-core packet processing in user-space

Tudor Marian, Ki Suh Lee, Hakim Weatherspoon

2012 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pp. 27-38

Modern commodity operating systems do not provide developers with user-space abstractions for building high-speed packet processing applications. The conventional raw socket is inefficient and unable to take advantage of emerging hardware such as multi-core processors and multi-queue network adapters. In this paper we present the NetSlice operating system abstraction. Unlike the conventional raw...

chapter

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

Xing Su, Xiangke Liao, Jingling Xue

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 122-133

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). Therefore, either performance or portability suffers. We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in...
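To make the term "micro-kernel" concrete: in blocked GEMM implementations, C += A·B is typically computed by an innermost routine that updates a small MR × NR block of C, held in registers, through a sequence of rank-1 updates along the k dimension. The sketch below shows only that loop structure in plain Python, assuming illustrative block sizes MR = NR = 2 and omitting the panel packing and cache blocking real libraries use. It is not Poca's generated code, says nothing about performance, and the names `micro_kernel` and `gemm` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]

MR, NR = 2, 2  # assumed register-block sizes (illustrative only)


def micro_kernel(A: Matrix, B: Matrix, C: Matrix, i0: int, j0: int,
                 mr: int, nr: int, k: int) -> None:
    """C[i0:i0+mr, j0:j0+nr] += A[i0:i0+mr, :] @ B[:, j0:j0+nr]."""
    # Local accumulator stands in for the register block of a real micro-kernel.
    acc = [[0.0] * nr for _ in range(mr)]
    for p in range(k):  # one rank-1 update per step along k
        for i in range(mr):
            a = A[i0 + i][p]
            for j in range(nr):
                acc[i][j] += a * B[p][j0 + j]
    # Write the finished block back to C once.
    for i in range(mr):
        for j in range(nr):
            C[i0 + i][j0 + j] += acc[i][j]


def gemm(A: Matrix, B: Matrix, C: Matrix) -> None:
    """C += A @ B, tiling C into MR x NR blocks handled by micro_kernel."""
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, MR):
        for j0 in range(0, n, NR):
            # Edge blocks may be smaller than MR x NR.
            micro_kernel(A, B, C, i0, j0, min(MR, m - i0), min(NR, n - j0), k)


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
C = [[0.0] * 3 for _ in range(3)]
gemm(A, B, C)
print(C)  # [[1.0, 2.0, 8.0], [3.0, 4.0, 18.0], [5.0, 6.0, 28.0]]
```

Because nearly all the arithmetic happens inside the micro-kernel, its code quality largely determines GEMM performance, which is why the choice between hand-written assembly and compiler-generated code matters.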

chapter

NCAP: Network-Driven, Packet Context-Aware Power Management for Client-Server Architecture

Mohammad Alian, Ahmed H. M. O. Abulila, Lokesh Jindal, Daehoon Kim, et al.

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 25-36

The rate of network packets encapsulating requests from clients can significantly affect the utilization, and thus performance and sleep states of processors in servers deploying a power management policy. To improve energy efficiency, servers may adopt an aggressive power management policy that frequently transitions a processor to a low-performance or sleep state at a low utilization. However, such...

chapter

Accelerating Integrity Verification on Secure Processors by Promissory Hash

Mizuki Miyanaga, Hidetsugu Irie, Shuichi Sakai

2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 20-29

Most digital content nowadays is protected by a digital rights management (DRM) framework to prevent piracy. Since much content is distributed throughout the world, modern DRM frameworks must protect the confidentiality and integrity of the data, even from rootkits or physical tampering. Secure processors have been proposed to ensure secure executions by performing memory encryption and integrity...

chapter

Extending FreeRTOS to support dynamic and distributed mapping in multiprocessor systems

G. Abich, M. G. Mandelli, F. R. Rosa, F. Moraes, et al.

2016 IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp. 712-715

As the complexity of both embedded application workloads and multiprocessor platforms keeps increasing, so does the demand for efficient mapping heuristics capable of allocating several application workloads at runtime. The majority of proposed mapping techniques are bespoke implementations that assume an in-house operating system developed for a particular architecture, restricting its adoption...

chapter

Reliable and Efficient Performance Monitoring in Linux

Maria Dimakopoulou, Stephane Eranian, Nectarios Koziris, Nicholas Bambos

SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 396-408

Processor hardware performance counters have recently improved in quality and features, while performance monitoring support in Linux has been significantly revamped with the development of the perf_events subsystem, which has helped make performance analysis an increasingly common practice among developers. However, no performance analysis is possible without an efficient monitoring interface...

chapter

A Comparative Study of SYCL, OpenCL, and OpenMP

Hercules Cardoso Da Silva, Flavia Pisani, Edson Borin

2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 61-66

Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices, including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however, creating efficient code for them may require programmers to manage memory assignments and use specialized APIs, compilers, or runtime systems, thus making their...

INFONA - science communication portal