Memory access latency continues to be a dominant bottleneck in a large class of applications on modern architectures. To optimize memory performance, it is important to utilize the locality in the memory hierarchy. Structure splitting can significantly improve memory locality. However, pinpointing inefficient code and providing insightful guidance for structure splitting is challenging. Existing tools...
Multi-core machines are dominating the HPC (High Performance Computing) community, and some many-core architectures are newly emerging. Whether HPC applications scale well with the number of cores is a main concern. For the performance evaluation and tuning of many-core machines, the 2D Jacobi iteration was chosen as a typical HPC application of stencil computation. We present performance-oriented tuning...
Data prefetching is an important mechanism for hiding memory latency in single-threaded, desktop workloads. For multi-threaded, commercial workloads, prefetching offers much more modest improvements in performance at a high cost in cache power and bandwidth to the higher level caches. This paper shows that by combining speculation with a selective prefetching scheme, we can reduce the cache access...
Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing among others. There has been a push for HPC machines to be rated not just in Petaflops, but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single...
Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking...
With the growth of computer hardware technologies, software developers are focusing more on developing embedded operating systems. GNU/Linux has become a common operating system widely used in embedded technologies. In this paper, we report performance results on a TS-7800 Single Board Computer with different versions of the kernel released by the hardware provider. We compare the performance between...
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers...
The Cray XMT provides hardware support for parallel algorithms that would be communication- or memory-bound on other machines. Unfortunately, even if an algorithm meets these criteria, performance suffers if the algorithm is too numerically intensive. We present a lookup-based approach that achieves a significant performance advantage over explicit calculation. We describe an approach to balancing...
Classic vector systems have all but vanished from recent TOP500 lists. Looking at the newly introduced NEC SX-9 series, we benchmark its memory subsystem using the low level vector triad and employ an advanced lattice Boltzmann flow solver kernel to demonstrate that classic vectors still combine excellent performance with a well-established optimization approach. Results for commodity x86-based systems...
This paper introduces way stealing, a simple architectural modification to a cache-based processor to increase data bandwidth to and from application-specific instruction set extensions (ISEs). Way stealing provides more bandwidth to the ISE-logic than the register file alone and does not require expensive coherence protocols, as it does not add memory elements to the processor. When enhanced with...
Vector supercomputers have been encountering the memory wall problem, and their memory bandwidth per flop/s rate has decreased. To cover the insufficient memory bandwidth per flop/s rate, an on-chip vector cache has been proposed for vector processors. Although vector caching is effective in increasing the sustained performance to a certain degree, it still needs software and hardware supporting...
This paper proposes a new multi-pattern parallel I/O benchmark called Jetter, which evaluates parallel I/O throughput with either the contiguous I/O pattern or the non-contiguous I/O pattern, in either the share-one-file model or the file-per-process model, by either the POSIX interface or the MPI-I/O interface. Jetter helps end users make sense of the pattern performance law, and helps them develop...
Emerging 64-bit OSs supply a huge amount of memory address space that is essential for new applications using very large data. It is expected that the memory in connected nodes can be used to store swapped pages efficiently, especially in a dedicated cluster which has a high-speed network such as 10 GbE and Infiniband. In this paper, we propose the distributed large memory system (DLM), which...
Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors which causes many calculations to be memory-bound. We present a scheme for predicting the performance of HPC applications based on the results of such benchmarks. A Genetic Algorithm approach is used to "learn" bandwidth...