The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates...
Heterogeneous memory management combined with server virtualization in datacenters is expected to increase the software and OS management complexity. State-of-the-art solutions rely exclusively on the hypervisor (VMM) for expensive page hotness tracking and migrations, limiting the benefits from heterogeneity. To address this, we design HeteroOS, a novel application-transparent OS-level solution for...
We have been experiencing two very important movements in computing. On the one hand, a tremendous amount of resources has been invested in innovative applications such as first-principles methods, deep learning, and cognitive computing. On the other hand, the industry has been taking a technological path where application performance and energy efficiency vary by more than two orders of magnitude...
The intense demand for video applications on mobile devices challenges hardware design, especially with respect to energy consumption. This work presents a design space exploration to define energy-efficient cache memory configurations for Motion Estimation (ME), considering different video sequences and HEVC encoder configurations. We focus on the ME process, known as the most...
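The kind of exploration described above can be illustrated with a toy energy model over candidate cache configurations; all constants and the `best_config` helper below are our own assumptions, not the paper's model.

```python
def cache_energy(size_kb, accesses, miss_rate,
                 e_hit_pj_per_kb=0.5, e_miss_pj=200.0):
    """Toy per-configuration energy estimate: hit energy grows with
    cache size, each miss pays a fixed off-chip access cost."""
    e_hit = e_hit_pj_per_kb * size_kb
    return accesses * (e_hit + miss_rate * e_miss_pj)

def best_config(configs, accesses):
    """configs: (size_kb, measured_miss_rate) pairs for one video
    sequence; pick the lowest-energy cache configuration."""
    return min(configs, key=lambda c: cache_energy(c[0], accesses, c[1]))
```

For instance, with `[(8, 0.2), (32, 0.05), (128, 0.01)]` the mid-sized cache wins: the small one pays for misses, the large one for per-access energy.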
This paper presents an efficient new snapshot mechanism, referred to as Live Save, that makes real-time backups of the VM state to the local host. Live Save iteratively sends the state data, stores the snapshot file on the local host, and sends the entire file directly to a remote host when necessary, saving significant bandwidth. We also develop an advanced Improved Live Save, which...
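The iterative copy-then-resend idea can be sketched as follows; this is a simplification under assumed names, while the real mechanism tracks dirty pages at the hypervisor level.

```python
def live_save(pages, dirty_rounds, max_rounds=3):
    """Iterative local snapshot sketch: copy every page once, then in
    each round re-copy only the pages dirtied since the last pass."""
    snapshot = dict(pages)           # full first pass
    transferred = len(pages)
    for r in range(max_rounds):
        dirty = dirty_rounds[r] if r < len(dirty_rounds) else set()
        if not dirty:                # converged: no pages dirtied
            break
        for p in dirty:
            snapshot[p] = pages[p]   # incremental re-copy
        transferred += len(dirty)
    return snapshot, transferred
```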
Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small...
This paper presents the design and test of a 2×2 channel emulator, optimized especially for LTE-Hi testing. It adopts the conventional method of convolving with the channel impulse response in the time domain. The design requires less memory by sharing the delay data buffer, which reduces the cost of an efficient VLSI implementation. In addition, the bidirectional design offers flexible and reliable...
The presented study analyses 563 representative benchmark sparse matrices with respect to their partitioning into uniformly-sized blocks. The aim is to minimize memory footprints of matrices. Different block sizes and different ways of storing blocks in memory are considered and statistically evaluated. Memory footprints of partitioned matrices are additionally compared with lower bounds and the CSR...
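As a rough illustration of the footprint comparison involved (with assumed 4-byte indices and 8-byte values; the study considers a wider range of block storage formats):

```python
def csr_footprint(n_rows, nnz, idx_bytes=4, val_bytes=8):
    """CSR stores values, column indices, and n_rows + 1 row pointers."""
    return nnz * (val_bytes + idx_bytes) + (n_rows + 1) * idx_bytes

def dense_block_footprint(coords, br, bc, idx_bytes=4, val_bytes=8):
    """Partition nonzeros (row, col) into br-by-bc blocks and store each
    nonempty block as a dense tile plus its block coordinates."""
    blocks = {(r // br, c // bc) for r, c in coords}
    return len(blocks) * (br * bc * val_bytes + 2 * idx_bytes)
```

For an 8×8 diagonal matrix, CSR needs 132 bytes while 2×2 dense blocks need 160: blocking only pays off when nonzeros cluster densely inside blocks.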
Iterative stencils are kernels in application domains such as numerical simulation and medical imaging that merit FPGA acceleration. The best architecture depends on many factors, such as the target platform, off-chip memory bandwidth, problem size, and performance requirements. We generate a family of FPGA stencil accelerators targeting emerging System-on-Chip platforms (e.g., Xilinx Zynq...
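A minimal member of this kernel class, for reference, is the 1-D 3-point Jacobi stencil (our own toy example, not one of the generated accelerators):

```python
def jacobi_1d(u, steps):
    """Iteratively average each interior point with its neighbours;
    boundary values are held fixed."""
    for _ in range(steps):
        u = [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3
                      for i in range(1, len(u) - 1)] + [u[-1]]
    return u
```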
Latency-critical workloads such as web search engines, social networks, and financial market applications are sensitive to tail latencies for meeting Service Level Objectives (SLOs). Since unexpected tail latencies are caused by sharing hardware resources with other co-executing workloads, a service provider executes the latency-critical workload alone. Thus, the data center for the latency-critical...
Most applications running on supercomputers achieve only a fraction of a system's peak performance. It has been demonstrated that the co-scheduling of applications can improve the overall system utilization. However, following this approach, applications need to fulfill certain criteria such that the mutual slowdown is kept at a minimum. In this paper, we present an HPC scheduler that applies co-scheduling...
The increase in memory capacity is substantially behind the increase in computing power in today's supercomputers. In order to alleviate the effect of this gap, diverse options such as NVM - non-volatile memory (less expensive but slow) and HBM - high bandwidth memory (fast but expensive) are being explored. In this paper, we present a common approach using parallel runtime techniques for utilizing...
A comparison of the most mature and promising emerging memory technologies with respect to mainstream NAND and DRAM, and the challenges for their introduction in the market for high-density applications.
As the SIMD width of modern microprocessors has been widening to keep up with the computational demands of HPC systems, vector architectures have recently returned to the spotlight. Meanwhile, a modern vector architecture that maintains a large SIMD width and a high B/F ratio has survived and evolved in the HPC community. In this paper, to clarify the potential of the modern vector architecture,...
Die-stacked DRAM (a.k.a., on-chip DRAM) provides much higher bandwidth and lower latency than off-chip DRAM. It is a promising technology to break the "memory wall". Die-stacked DRAM can be used either as a cache (i.e., DRAM cache) or as a part of memory (PoM). A DRAM cache design would suffer from more page faults than a PoM design as the DRAM cache cannot contribute towards capacity of...
This paper describes a pipelined stochastic gradient descent (SGD) algorithm and its hardware architecture with a distributed-memory structure. In the proposed architecture, a pipeline stage takes charge of multiple layers: a “layer block.” The layer-block-wise pipeline requires far fewer weight parameters per worker for network training than conventional multithreading, because the weight memory is distributed to workers...
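The layer-block idea, contiguous layers grouped per pipeline stage so each worker holds only its own slice of the weights, might be sketched like this (greedy balancing is our illustrative choice, not necessarily the paper's partitioning rule):

```python
def partition_layers(layer_sizes, n_stages):
    """Split layer indices into n_stages contiguous 'layer blocks',
    closing a block once it holds roughly an equal share of weights."""
    target = sum(layer_sizes) / n_stages
    blocks, current, acc = [], [], 0
    for i, size in enumerate(layer_sizes):
        current.append(i)
        acc += size
        if acc >= target and len(blocks) < n_stages - 1:
            blocks.append(current)   # this stage is full
            current, acc = [], 0
    if current:
        blocks.append(current)       # remaining layers form the last stage
    return blocks
```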
The trend toward heterogeneous computing and HW/SW co-design approaches allows performance to be increased significantly while reducing power consumption. One of the main challenges when combining multiple processing devices is communication, as an inefficient communication configuration can become a bottleneck for overall system performance. To address this problem, we present a methodology that...
We address the problem of optimizing global shared memory usage in deeply heterogeneous accelerators in the context of HPC systems running multiple applications with different quality-of-service levels. We explore predictive memory allocation algorithms, allowing up to 28% more high-priority requests to be served when using moving-average-based prediction in a low-workload scenario.
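A moving-average predictor of high-priority demand, in the spirit of the description above (the function names and the admission rule are our assumptions):

```python
def moving_average(history, window=4):
    """Predict next-interval demand as the mean of recent observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def admit_low_priority(request_mb, free_mb, history):
    """Grant a low-priority allocation only if it leaves headroom for
    the predicted high-priority demand."""
    return request_mb <= free_mb - moving_average(history)
```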
A novel, efficient, in-place, multithreaded, and cache-friendly parallel 2-D wavelet transform algorithm based on the lifting scheme is introduced. To maximize cache utilization and consequently minimize memory bus bandwidth usage, the threads compete to work on a small memory area, maximizing the chance of finding it in the cache, and their synchronization is done with very low overhead...
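The lifting scheme underlying the transform can be illustrated with one in-place level of the (unnormalized) Haar wavelet, the simplest predict/update lifting pair:

```python
def haar_lift_inplace(x):
    """One lifting level: odd samples become details (predict step),
    even samples become pair averages (update step), all in place."""
    assert len(x) % 2 == 0
    for i in range(0, len(x), 2):
        x[i + 1] -= x[i]          # predict: detail = odd - even
        x[i] += x[i + 1] / 2      # update: even -> average of the pair
    return x
```

Both steps overwrite the input buffer, which is what makes lifting attractive for the in-place, cache-friendly design described above.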
Memory interference is a critical impediment to system performance in CMP systems. To address this problem, we first propose a Dynamically Proportional Bandwidth Throttling policy (DPBT), which dynamically throttles back memory-intensive applications based on their memory access behavior. DPBT achieves a more balanced memory bandwidth partitioning. Moreover, we improve the previous memory channel partitioning...
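The proportional-throttling idea can be sketched as follows (an illustrative policy, not the paper's exact DPBT algorithm): applications at or below their fair share keep their bandwidth, and memory-intensive ones split what remains in proportion to their demand.

```python
def proportional_throttle(rates, budget):
    """rates: per-application demanded bandwidth; returns throttled
    shares that sum to at most `budget`."""
    if sum(rates) <= budget:
        return list(rates)                    # no contention, no throttling
    fair = budget / len(rates)
    light_sum = sum(r for r in rates if r <= fair)
    heavy_sum = sum(r for r in rates if r > fair)
    scale = (budget - light_sum) / heavy_sum  # throttle only heavy apps
    return [r if r <= fair else r * scale for r in rates]
```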