The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
He VMware ESXi hypervisor attracts a wide range of customers and is deployed in domains ranging from desktop computing to server computing. While the software systems are increasingly moving towards consolidation, hardware has already transitioned into multi-socket Non-Uniform Memory Access (NUMA)-based systems. The marriage of increasing consolidation and the multi-socket based systems warrants low-overhead,...
The increasing use of digital signal processors (DSPs) in wireless communications and signal processing necessitates the optimization of compilers to support special hardware features. In this paper, we propose a compiler transformation method for zero overhead loop (ZOL). It supports very long instruction word (VLIW), internal branches and the loops whose iterative times are known at runtime and...
Speculative Multithreading (SpMT) technology is an effective mechanism for automatic parallelization of irregular programs. For a sequential program, they can be executed speculatively in parallel by speculating that many data dependences are unlikely during runtime. Although speculative parallelization can potentially deliver significant speedup, several overheads associated with this technique can...
Efficiently programming shared-memory machines is a difficult challenge because mapping application threads onto the memory hierarchy has a strong impact on the performance. However, optimizing such thread placement is difficult: architectures become increasingly complex and application behavior changes with implementations and input parameters, e.g problem size and number of threads. In this work,...
Modern processors are becoming increasingly complex with features that improve performance and add new functionality. However, such improvements are a double-edged sword: they improve performance and functionality but also introduce security-critical bugs into the processor that attackers can leverage to bypass a system's security policies. Existing solutions require hardware extensions and often...
With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU that offered FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing...
In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to...
Code injection attacks are an undeniable threat in today's cyberworld. Instruction Set Randomization (ISR) was initially proposed in 2003. This technique was designed to protect systems against code injection attacks by creating an unique instruction set for each machine, thanks to randomization. It is a promising technique in the growing embedded system and Internet of Things (IoT) devices ecosystem,...
A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) [1] tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete...
Thread-Level Speculation (TLS) mechanism has been extensively studied due to its capability of simplifying parallel programming and achieving effective performance speedup. In this paper, we investigate the study of improving current TLS models for high efficiency on present multi-core architectures. Particularly, we propose a new TLS model called Cache Copy-on-Write (CCoW). The main features of our...
Nowadays, there are many embedded systems with different architectures that have incorporated GPUs. However, it is difficult to develop CPU-GPU embedded systems using component-based development (CBD), since existing CBD approaches have no support for GPU development. In this context, when targeting a particular CPU-GPU platform, the component developer is forced to construct hardware-specific components,...
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general...
Power is a critical factor that limits the performance and scalability of modern high performance computer systems. Considering power as a first-order constraint and a scarce system resource, power-bounded computing represents a new perspective to address the power challenge in HPC.In this work we present an application-aware, multi-dimensional power allocation framework to support power-bounded parallel...
Auto-tuning has become increasingly popular for optimizing non-functional parameters of parallel programs. The typically large search space requires sophisticated techniques to find well performing parameter values in a reasonable amount of time. Different parts of a program often perform best with different parameter values. We therefore subdivide programs into several regions, and try to optimize...
A novel efficient inplace, multithreaded, and cachefriendly parallel 2-D wavelet transform algorithm based on the lifting transform is introduced. In order to maximize the cache utilization and consequently minimize the memory bus bandwidth use, the threads compete to work on a small memory area maximizing the chance of finding it in the cache and their synchronization is done with very low overhead...
Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...
The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data-sharing and communications are allowed during kernel execution. This model limits the number of applications that can harness the...
This paper presents a visionary proposal for a distributed (or decentralized) power/thermal control mechanism that applies the bio-inspired artificial intelligence paradigm of swarm intelligence. The target use case is a future many-core processor. The paper reports a high-level concept-phase specification of the proposed solution approach in a research setting. The emphasis is on highlighting the...
Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.
Continuous demand for higher performance is adding more pressure on hardware designers to provide faster machines with low energy consumption. Recent technological advancements allow placing a group of silicon dies on top of a conventional interposer (silicon layer), which provides space to integrate logic and interconnection resources to manage active processing cores. However, such large resource...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.