This paper proposes algorithms for optimizing the Remote Core Locking (RCL) synchronization method in multithreaded programs. An algorithm for initializing RCL locks and algorithms for optimizing thread affinity are developed. The algorithms take the structure of hierarchical computer systems and non-uniform memory access (NUMA) into account to minimize the execution time of RCL programs. The...
This paper presents heuristic algorithms for optimizing communications in parallel PGAS programs and minimizing their execution time. This is achieved by taking the hierarchical structure of computer systems into account during reduction operations. The developed algorithms are implemented for the PGAS language Cray Chapel.
The rapidly growing design complexity has become a major obstacle and has dramatically increased the time required for SystemC simulation. In this case study, we exploit different levels of parallelism, including thread- and data-level parallelism, to accelerate the simulation of a Bitcoin miner model in SystemC. Our experiments are performed on two multi-core processors and one many-core Intel® Xeon...
Modern SoCs contain CPU and GPU cores to execute both general purpose and highly-parallel graphics workloads. While the primary use of the GPU is for rendering graphics, the effects of graphics workloads on the overall system have received little attention. The primary reason for this is the lack of efficient tools and simulators for modern graphics applications. In this work, we present GLTraceSim,...
Fault tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures and consequently a higher probability of errors. Among the different software fault-tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers and the de facto standard for large-scale systems. Although...
In this paper the authors present the results of an effectiveness comparison between variants of Radix-2 Decimation in Time (DIT) Fast Fourier Transform (FFT) algorithm implementations on graphics processing units (GPUs), which differ in the way the calculations are distributed among the GPU's computational resources. The conducted experiments show that the partitioning of the FFT computational...
In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to...
The Go language lacks built-in data structures that allow fine-grained concurrent access. In particular, its map data type, one of only two generic collections in Go, limits concurrency to the case where all operations are read-only; any mutation (insert, update, or remove) requires exclusive access to the entire map. The tight integration of this map into the Go language and runtime precludes its...
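The whole-map exclusive locking described here is commonly relaxed by sharding: keys hash onto independent locks, so mutations of different shards never contend. The sketch below is that standard workaround, not the paper's data structure; all names are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const nShards = 16

// shard pairs one reader/writer lock with one plain Go map.
type shard struct {
	sync.RWMutex
	m map[string]int
}

// shardedMap spreads keys over nShards independent shards so that
// concurrent mutations usually take different locks.
type shardedMap struct {
	shards [nShards]*shard
}

func newShardedMap() *shardedMap {
	s := &shardedMap{}
	for i := range s.shards {
		s.shards[i] = &shard{m: make(map[string]int)}
	}
	return s
}

func (s *shardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%nShards]
}

func (s *shardedMap) Store(key string, v int) {
	sh := s.shardFor(key)
	sh.Lock()
	sh.m[key] = v
	sh.Unlock()
}

func (s *shardedMap) Load(key string) (int, bool) {
	sh := s.shardFor(key)
	sh.RLock()
	v, ok := sh.m[key]
	sh.RUnlock()
	return v, ok
}

func main() {
	m := newShardedMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.Store(fmt.Sprintf("k%d", i), i)
		}(i)
	}
	wg.Wait()
	v, ok := m.Load("k42")
	fmt.Println(v, ok)
}
```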
Graphics processing units (GPUs) are increasingly applied to accelerate tasks such as graph problems and discrete-event simulation that are characterized by irregularity, i.e., a strong dependence of the control flow and memory accesses on the input. The core data structures in many of these irregular tasks are priority queues, which guide the progress of the computations and can easily become the...
The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected, separate entities. At best, each executes independent tasks; more commonly, the CPU idles while waiting for results from the GPU. No data sharing or communication is allowed during kernel execution. This model limits the number of applications that can harness the...
Many dynamic hybrid race detectors aim at detecting violations of the lockset discipline in execution traces of multithreaded programs. They are designed to abstract the memory accesses appearing in traces as contexts. However, they keep these contexts to different extents and partition the sets of contexts into equivalence classes of different granularity. In our case study, we compare three detectors...
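The lockset discipline these detectors check can be sketched as Eraser-style lockset refinement: a variable's candidate set starts as "all locks" and is intersected with the locks held at each access; an empty result means no single lock consistently guards the variable. This is the textbook discipline the detectors build on, not any one tool's exact algorithm.

```go
package main

import "fmt"

// locksetCheck takes, for one shared variable, the list of lock sets
// held at each of its accesses, and reports whether the lockset
// discipline holds (some lock is held at every access).
func locksetCheck(accesses [][]string) bool {
	var candidate map[string]bool // nil means "all locks" (before first access)
	for _, held := range accesses {
		heldSet := make(map[string]bool)
		for _, l := range held {
			heldSet[l] = true
		}
		if candidate == nil {
			candidate = heldSet
			continue
		}
		// Refine: keep only locks held at this access too.
		for l := range candidate {
			if !heldSet[l] {
				delete(candidate, l)
			}
		}
	}
	return len(candidate) > 0
}

func main() {
	// Lock "b" is held at every access: discipline respected.
	fmt.Println(locksetCheck([][]string{{"a", "b"}, {"b"}, {"b", "c"}}))
	// No common lock across accesses: potential race.
	fmt.Println(locksetCheck([][]string{{"a"}, {"b"}}))
}
```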
A worksharing model is presented to enhance parallel compression of data-intensive bitmap indices. To increase spatial locality, our approach interleaves multiple independent bitmaps in a combined file. Each file block, which fits entirely in cache, is processed by independent threads. Results show that our model significantly outperforms embarrassingly-parallel designs.
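The worksharing scheme, independent threads each processing a cache-sized block of the combined file, can be sketched as below. The `compress` stand-in just counts set bits; a real implementation would run the bitmap compression codec on each block.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processBlocks hands out independent file blocks to a pool of workers
// over a channel; each result slot is written by exactly one worker,
// so no further synchronization is needed on out.
func processBlocks(blocks [][]byte, compress func([]byte) int) []int {
	out := make([]int, len(blocks))
	work := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range work {
				out[i] = compress(blocks[i])
			}
		}()
	}
	for i := range blocks {
		work <- i
	}
	close(work)
	wg.Wait()
	return out
}

// popcount counts the set bits in a block (stand-in for compression).
func popcount(b []byte) int {
	n := 0
	for _, x := range b {
		for ; x != 0; x &= x - 1 {
			n++
		}
	}
	return n
}

func main() {
	blocks := [][]byte{{0xFF, 0x00}, {0x0F}, {0x01, 0x01}}
	fmt.Println(processBlocks(blocks, popcount)) // [8 4 2]
}
```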
In this paper, we present an optimized framework that can efficiently perform massive spatial queries on current GPUs. To let the widely adopted filter-and-verify paradigm benefit from GPUs, skewed workloads are first associated with cells in a scaled spatial grid, so that the subsequent range-verification cost against the massive set of spatial objects can be significantly reduced. Particularly...
Parallel programming is becoming more and more prevalent in this era of concurrent computing. Because of the nondeterministic nature of parallel programs, concurrency bugs are notoriously difficult to debug; moreover, an attempt to fix one bug may introduce a deadlock or other concurrency bugs. Though many static and dynamic data race detection tools have been proposed in recent years, none of them is interactive...
Current monitor-based systems have several disadvantages for multi-object operations. They require programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, and (3) use global locks or perform busy waiting for operations that depend on a condition spanning multiple objects. Transactional memory...
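Disadvantage (1) above, manually fixing a lock order for multi-object operations, can be sketched as follows. This is a minimal illustration of the discipline transactional memory would make unnecessary; the account type and ids are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// account carries a fixed id used to impose a global lock order.
type account struct {
	id      int
	mu      sync.Mutex
	balance int
}

// transfer locks both accounts in ascending id order. Every
// multi-object operation must agree on this one order, or two
// concurrent transfers a->b and b->a can deadlock.
func transfer(from, to *account, amount int) {
	first, second := from, to
	if second.id < first.id {
		first, second = second, first
	}
	first.mu.Lock()
	second.mu.Lock()
	from.balance -= amount
	to.balance += amount
	second.mu.Unlock()
	first.mu.Unlock()
}

func main() {
	a := &account{id: 1, balance: 100}
	b := &account{id: 2, balance: 100}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); transfer(a, b, 1) }()
		go func() { defer wg.Done(); transfer(b, a, 1) }()
	}
	wg.Wait()
	fmt.Println(a.balance + b.balance) // total is conserved: 200
}
```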
Dynamic vectors are among the most commonly used data structures in programming. They provide constant time random access and resizable data storage. Additionally, they provide constant time insertion (pushback) and deletion (popback) at the end of the sequence. However, in a multithreaded system, concurrent pushback and popback operations attempt to update the same shared object, creating a synchronization...
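The shared-object contention described here can be made concrete with a minimal fixed-capacity vector whose pushback reserves a slot with a single atomic increment; every concurrent pushback contends on that one size counter. This is an illustration of the bottleneck, not the paper's design, and a growable lock-free vector would additionally need a safe resize protocol, omitted here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vec is a fixed-capacity vector; size is the shared object that
// every concurrent pushback must update.
type vec struct {
	size int64
	data []int64
}

// pushback claims index i with one atomic add, then fills the slot.
func (v *vec) pushback(x int64) bool {
	i := atomic.AddInt64(&v.size, 1) - 1
	if i >= int64(len(v.data)) {
		atomic.AddInt64(&v.size, -1) // roll back: vector is full
		return false
	}
	atomic.StoreInt64(&v.data[i], x)
	return true
}

func main() {
	v := &vec{data: make([]int64, 1000)}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(i int64) { defer wg.Done(); v.pushback(i) }(int64(i))
	}
	wg.Wait()
	var sum int64
	for _, x := range v.data {
		sum += x
	}
	fmt.Println(v.size, sum) // 1000 and 0+1+...+999 = 499500
}
```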
Modern high-performing systems make extensive use of multiple CPU cores. These multi-threaded systems are complex to design, build, and understand. Debugging performance of these multi-threaded systems is especially challenging. This requires the developer to understand the relative execution of dozens of threads and their inter-dependencies, including data-sharing and synchronization behaviors. We...
Data movement is increasingly becoming the bottleneck for both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two methods for inter-thread communication: shared memory or global memory. However, a new warp shuffle instruction has been introduced...
In this paper, we provide a comparison of language features and runtime systems of commonly used threading parallel programming models for high performance computing, including OpenMP, Intel Cilk Plus, Intel TBB, OpenACC, Nvidia CUDA, OpenCL, C++11 and PThreads. We then report our performance comparison of OpenMP, Cilk Plus and C++11 for data and task parallelism on CPU using benchmarks. The results show...
With the spreading of multi-core architectures, operating systems and applications are becoming increasingly more concurrent and their scalability is often limited by the primitives used to synchronize the different hardware threads. In this paper, we address the problem of how to optimize the throughput of a system with multiple producer and consumer threads. Such applications typically synchronize...
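The multiple-producer/multiple-consumer pattern the abstract targets can be sketched with Go's built-in bounded buffer, a buffered channel, whose internal lock is exactly the kind of synchronization primitive whose throughput such work studies. A minimal sketch, not the paper's technique:

```go
package main

import (
	"fmt"
	"sync"
)

// run pushes producers*perProducer items through one bounded buffer
// and returns the total drained by the consumers.
func run(producers, consumers, perProducer int) int {
	ch := make(chan int, 64)

	var prodWg sync.WaitGroup
	for p := 0; p < producers; p++ {
		prodWg.Add(1)
		go func() {
			defer prodWg.Done()
			for i := 1; i <= perProducer; i++ {
				ch <- i
			}
		}()
	}
	// Close the buffer once every producer is done.
	go func() { prodWg.Wait(); close(ch) }()

	var consWg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for c := 0; c < consumers; c++ {
		consWg.Add(1)
		go func() {
			defer consWg.Done()
			for v := range ch {
				mu.Lock()
				total += v
				mu.Unlock()
			}
		}()
	}
	consWg.Wait()
	return total
}

func main() {
	fmt.Println(run(4, 4, 1000)) // 4 * (1+...+1000) = 2002000
}
```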