Items from 1 to 14 out of 14 results

chapter

StructSlim: A lightweight profiler to guide structure splitting

Probir Roy, Xu Liu

2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 36 - 46

2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Memory access latency continues to be a dominant bottleneck in a large class of applications on modern architectures. To optimize memory performance, it is important to utilize the locality in the memory hierarchy. Structure splitting can significantly improve memory locality. However, pinpointing inefficient code and providing insightful guidance for structure splitting is challenging. Existing tools...

chapter

Performance Evaluation and Tuning of 2D Jacobi Iteration on Many-Core Machines

Zhengxiong Hou, Christian Perez

2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing > 603 - 610

2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC)

Multi-core machines are dominating the HPC (High Performance Computing) community, and some many-core architectures are newly emerging. Whether HPC applications scale well with the number of cores is a main concern. For the performance evaluation and tuning of many-core machines, the 2D Jacobi iteration was chosen as a typical HPC application of stencil computation. We present performance oriented tuning...

chapter

Cache prefetching and speculation on multi-threaded processors

Tarik Ono, Mark R. Greenstreet

2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM) > 206 - 211

2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM)

Data prefetching is an important mechanism for hiding memory latency in single-threaded, desktop workloads. For multi-threaded, commercial workloads, prefetching offers much more modest improvements in performance at a high cost in cache power and bandwidth to the higher level caches. This paper shows that by combining speculation with a selective prefetching scheme, we can reduce the cache access...

chapter

Large-scale energy-efficient graph traversal: A path to efficient data-intensive supercomputing

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Pradeep Dubey

2012 International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 11

2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing among others. There has been a push for HPC machines to be rated not just in Petaflops, but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single...

chapter

Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, et al.

2012 39th Annual International Symposium on Computer Architecture (ISCA) > 440 - 451

2012 ACM/IEEE 39th International Symposium on Computer Architecture (ISCA)

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking...

chapter

Performance comparison of Single Board Computer: A case study of kernel on ARM architecture

Naufal Alee, Mostafijur Rahman, R. B. Ahmad

2011 6th International Conference on Computer Science & Education (ICCSE) > 521 - 524

2011 6th International Conference on Computer Science & Education (ICCSE 2011)

Due to advances in computer hardware technologies, software developers are focusing more on developing embedded operating systems. GNU/Linux has become a common operating system widely used in embedded technologies. In this paper, we report performance results on a TS-7800 Single Board Computer with different versions of the kernel released by the hardware provider. We compare the performance between...

chapter

Implementing the Himeno benchmark with CUDA on GPU clusters

Everett H Phillips, Massimiliano Fatica

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) > 1 - 10

2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers...

chapter

Accelerating numerical calculation on the Cray XMT

C. Scherrer, T. Shippert, A. Marquez

2009 IEEE International Symposium on Parallel & Distributed Processing > 1 - 7

2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

The Cray XMT provides hardware support for parallel algorithms that would be communication- or memory-bound on other machines. Unfortunately, even if an algorithm meets these criteria, performance suffers if the algorithm is too numerically intensive. We present a lookup-based approach that achieves a significant performance advantage over explicit calculation. We describe an approach to balancing...

chapter

The world's fastest CPU and SMP node: Some performance results from the NEC SX-9

T. Zeiser, G. Hager, G. Wellein

2009 IEEE International Symposium on Parallel & Distributed Processing > 1 - 8

2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

Classic vector systems have all but vanished from recent TOP500 lists. Looking at the newly introduced NEC SX-9 series, we benchmark its memory subsystem using the low level vector triad and employ an advanced lattice Boltzmann flow solver kernel to demonstrate that classic vectors still combine excellent performance with a well-established optimization approach. Results for commodity x86-based systems...

chapter

Way Stealing: Cache-assisted automatic Instruction Set Extensions

T. Kluter, P. Brisk, P. Ienne, E. Charbon

2009 46th ACM/IEEE Design Automation Conference > 31 - 36

2009 46th ACM/IEEE Design Automation Conference (DAC)

This paper introduces way stealing, a simple architectural modification to a cache-based processor to increase data bandwidth to and from application-specific instruction set extensions (ISEs). Way stealing provides more bandwidth to the ISE-logic than the register file alone and does not require expensive coherence protocols, as it does not add memory elements to the processor. When enhanced with...

chapter

Effects of MSHR and Prefetch Mechanisms on an On-Chip Cache of the Vector Architecture

A. Musa, Y. Sato, T. Soga, R. Egawa, et al.

2008 IEEE International Symposium on Parallel and Distributed Processing with Applications > 335 - 342

2008 IEEE International Symposium on Parallel and Distributed Processing with Applications

Vector supercomputers have been encountering the memory wall problem and their memory bandwidth per flop/s rate has decreased. To cover the insufficient memory bandwidth per flop/s rate, an on-chip vector cache has been proposed for the vector processors. Although vector caching is effective to increase the sustained performance to a certain degree, it still needs software and hardware supporting...

chapter

Jetter: a multi-pattern parallel I/O benchmark

Liqiang Cao, Hongbing Luo, Baoyin Zhang

2008 IEEE International Conference on Cluster Computing > 459 - 463

2008 IEEE International Conference on Cluster Computing (CLUSTER)

This paper proposes a new multi-pattern parallel I/O benchmark called Jetter, which evaluates parallel I/O throughput with either the contiguous I/O pattern or the non-contiguous I/O pattern, in either the share-one-file model or the file-per-process model, by either the POSIX interface or the MPI-I/O interface. Jetter helps end users make sense of the pattern performance law, and helps them develop...

chapter

DLM: A distributed Large Memory System using remote memory swapping over cluster nodes

H. Midorikawa, M. Kurokawa, R. Himeno, M. Sato

2008 IEEE International Conference on Cluster Computing > 268 - 273

2008 IEEE International Conference on Cluster Computing (CLUSTER)

Emerging 64-bit OS's supply a huge amount of memory address space that is essential for new applications using very large data. It is expected that the memory in connected nodes can be used to store swapped pages efficiently, especially in a dedicated cluster which has a high-speed network such as 10 GbE and InfiniBand. In this paper, we propose the distributed large memory system (DLM), which...

chapter

A genetic algorithms approach to modeling the performance of memory-bound computations

Mustafa M Tikir, Laura Carrington, Erich Strohmaier, Allan Snavely

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07) > 1 - 12

2007 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors which causes many calculations to be memory-bound. We present a scheme for predicting the performance of HPC applications based on the results of such benchmarks. A Genetic Algorithm approach is used to "learn" bandwidth...

Filter options

Keywords:
ARRAYS
BANDWIDTH
BENCHMARK TESTING

Keywords

OPTIMIZATION (5)
KERNEL (3)
PREFETCHING (3)
PROGRAM PROCESSORS (3)
COMPUTERS (2)
DATA MINING (2)
DATA TRANSFER (2)
MEMORY ARCHITECTURE (2)
MEMORY BANDWIDTH (2)
MESSAGE PASSING (2)
PARALLEL MACHINES (2)
PARALLEL PROGRAMMING (2)
REGISTERS (2)
VECTOR ARCHITECTURE (2)
ADDRESS SAMPLING (1)
APPLICATION SPECIFIC INTEGRATED CIRCUITS (1)
APPLICATION-SPECIFIC INSTRUCTION SET EXTENSIONS (1)
APPLICATION-SPECIFIC PROCESSORS (1)
ARCHITECTURAL MODIFICATION (1)
AUTOMATIC IDENTIFICATION (1)
CACHE BANDWIDTH (1)
CACHE STORAGE (1)
CACHE-ASSISTED AUTOMATIC INSTRUCTION SET EXTENSIONS (1)
CACHE-BASED PROCESSOR (1)
CLUSTER NODE (1)
CMOS INTEGRATED CIRCUITS (1)
CMOS TECHNOLOGY (1)
COHERENCE (1)
COMPUTATIONAL MODELING (1)
COMPUTER GRAPHIC EQUIPMENT (1)
CPU (1)
CRAY XMT (1)
CUDA CLUSTER (1)
DATA LOCALITY (1)
DATA MODELS (1)
DISTRIBUTED LARGE MEMORY SYSTEM (1)
DISTRIBUTED SHARED MEMORY SYSTEMS (1)
EMBEDDED LINUX (1)
ENERGY CONSUMPTION (1)
EQUATIONS (1)
FILE-PER-PROCESS MODEL (1)
GENETIC ALGORITHMS (1)
GFLOPS (1)
GPU CLUSTER (1)
GPU EXECUTION (1)
GRAPHICS PROCESSING UNIT (1)
GROUND PENETRATING RADAR (1)
HARDWARE SUPPORT (1)
HIGH-SPEED NETWORK (1)
HIMENO BENCHMARK (1)
INPUT-OUTPUT PROGRAMS (1)
INSTRUCTION SET EXTENSIONS (1)
INSTRUCTION SETS (1)
INSTRUMENTS (1)
JACOBIAN MATRICES (1)
JETTER (1)
KERNEL SWAP PARAMETER (1)
LATTICE BOLTZMANN FLOW SOLVER KERNEL (1)
LATTICES (1)
LIGHTWEIGHT PROFILING (1)
LINEAR SCALING (1)
LINUX (1)
LOOKUP-BASED APPROACH (1)
MACHINE LEARNING (1)
MATHEMATICAL MODEL (1)
MEMORY ADDRESS (1)
MEMORY BANDWIDTH UTILIZATION (1)
MEMORY BOUND APPLICATIONS (1)
MEMORY COHERENCE (1)
MEMORY MANAGEMENT (1)
MEMORY SUBSYSTEM (1)
MEMORY SYSTEM (1)
MEMORY WALL PROBLEM (1)
MICROPROCESSOR CHIPS (1)
MISS STATUS HANDLING REGISTERS (1)
MONITORING (1)
MPI (1)
MPI-I/O INTERFACE (1)
MSHR (1)
MULTI-THREADING (1)
MULTIGPU IMPLEMENTATION (1)
MULTIPATTERN PARALLEL I/O SYSTEM (1)
NEC SX-9 (1)
NVIDIA TESLA C1060 GPU (1)
ON-CHIP FLOATING POINT CAPABILITIES (1)
ONCHIP CACHE (1)
PARALLEL ALGORITHMS (1)
PARALLEL I/O THROUGHPUT (1)
PARALLEL PROCESSING (1)
PATTERN PERFORMANCE LAW (1)
PERFORMANCE CHARACTERIZATION (1)
PERFORMANCE EVALUATION (1)
PERFORMANCE MEASUREMENT (1)
PERFORMANCE MODELING AND PREDICTION (1)
POSIX INTERFACE (1)
PREDICTIVE MODELS (1)
PREFETCH (1)

INFONA - science communication portal