The ForkJoin framework is a widely used parallel programming framework upon which both core concurrency libraries and real-world applications are built. Beneath its simple and user-friendly APIs, ForkJoin is a sophisticated managed parallel runtime unfamiliar to many application programmers: at its core is a work-stealing scheduler that handles fine-grained tasks and sustains the pressure from automatic...
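The fork/join pattern behind this abstract can be illustrated with a minimal sketch: a recursive range sum split into subtasks that the work-stealing pool distributes across worker threads. The class and threshold below are illustrative choices, not taken from the paper.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal fork/join sketch: recursively split a range sum into subtasks
// that the work-stealing scheduler distributes across worker threads.
public class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, sum sequentially
    private final long[] data;
    private final int lo, hi;

    public RangeSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {          // small enough: compute directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        RangeSum left = new RangeSum(data, lo, mid);
        RangeSum right = new RangeSum(data, mid, hi);
        left.fork();                         // schedule left half asynchronously
        long rightSum = right.compute();     // compute right half in this thread
        return left.join() + rightSum;       // wait for the forked (possibly stolen) half
    }

    public static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new RangeSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(RangeSum.parallelSum(data)); // sum of 1..100000
    }
}
```

Idle workers steal the forked subtasks from busy workers' deques, which is the fine-grained scheduling behavior the abstract refers to.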
The rapidly growing design complexity has become a big obstacle and dramatically increased the time required for SystemC simulation. In this case study, we exploit different levels of parallelism, including thread- and data-level parallelism, to accelerate the simulation of a Bitcoin miner model in SystemC. Our experiments are performed on two multi-core processors and one many-core Intel Xeon...
Markov Chain Monte Carlo methods provide a tool for tackling high-dimensional problems. With many-core systems readily available today, it is no surprise that leveraging parallelism in these samplers has been a subject of recent research. The focus has been on solutions for shared-memory architectures; however, these perform poorly in a distributed-memory environment. This paper introduces a fully...
State Machine Replication (SMR) is a well-known technique to implement fault-tolerant systems. In SMR, servers are replicated and client requests are deterministically executed in the same order by all replicas. To improve performance in multi-processor systems, some approaches have proposed to parallelize the execution of non-conflicting requests. Such approaches perform remarkably well in workloads...
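The core SMR invariant the abstract describes can be sketched in a few lines: replicas that deterministically apply the same ordered request log converge to the same state. The `Replica` class and the "key:delta" request encoding below are hypothetical illustrations, not the paper's system.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical SMR sketch: replicas deterministically execute the same
// ordered request log, so they all end up in identical states.
public class Replica {
    private final Map<String, Integer> state = new HashMap<>();

    // A request is modeled as "key:delta"; execution is deterministic.
    public void apply(String request) {
        String[] parts = request.split(":");
        state.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
    }

    public Map<String, Integer> state() {
        return state;
    }

    public static void main(String[] args) {
        List<String> orderedLog = List.of("x:1", "y:2", "x:3"); // agreed total order
        Replica r1 = new Replica();
        Replica r2 = new Replica();
        for (String req : orderedLog) { r1.apply(req); r2.apply(req); }
        System.out.println(r1.state().equals(r2.state())); // identical state
    }
}
```

The parallel variants the abstract mentions relax this strict ordering only for requests that touch disjoint state (here, different keys), which preserves the convergence guarantee.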
This paper presents the results of an investigation of a developed synchronization system based on an optimal receiver under the minimum-average-risk criterion. This receiver is intended to detect the reflected radar pulse under conditions of Gaussian and Johnson noise. In a digital experiment we used an M-sequence with single-pulse phase-shift keying as the reference signal. The use of the developed self-tuning system...
Transactional Memory (TM) promises both to provide a scalable mechanism for synchronization in concurrent programs, and to offer ease-of-use benefits to programmers. The most straightforward use of TM in real-world programs is in the form of Transactional Lock Elision (TLE). In TLE, critical sections are attempted as transactions, with a fall-back to a lock if conflicts manifest. Thus TLE expects...
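The TLE pattern the abstract describes (optimistic attempt first, lock fallback on conflict) has a software analogue in Java's `StampedLock`. The sketch below is that analogy, not hardware lock elision: readers try a lock-free optimistic pass and acquire the real lock only when a concurrent writer invalidates the attempt.

```java
import java.util.concurrent.locks.StampedLock;

// Software analogue of the TLE pattern (hypothetical sketch, not hardware
// elision): try an optimistic, lock-free attempt first, and fall back to
// the pessimistic lock only when a conflict manifests.
public class ElidedCounter {
    private final StampedLock lock = new StampedLock();
    private long value = 0;

    public void increment() {
        long stamp = lock.writeLock();         // writers always take the lock
        try { value++; } finally { lock.unlockWrite(stamp); }
    }

    public long read() {
        long stamp = lock.tryOptimisticRead(); // optimistic attempt, no lock held
        long v = value;
        if (!lock.validate(stamp)) {           // a writer intervened: conflict
            stamp = lock.readLock();           // fall back to the real lock
            try { v = value; } finally { lock.unlockRead(stamp); }
        }
        return v;
    }

    public static void main(String[] args) {
        ElidedCounter c = new ElidedCounter();
        for (int i = 0; i < 5; i++) c.increment();
        System.out.println(c.read());
    }
}
```

As with TLE, the optimistic path pays nothing for synchronization in the common, conflict-free case; the fallback preserves correctness when conflicts do occur.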
In this paper, we present an optimized framework that can efficiently perform massive spatial queries on current GPUs. To let the widely adopted filter-and-verify paradigm benefit from GPUs, the skewed workloads are first associated with certain cells in a scaled spatial grid, so that the subsequent range-verification cost against the massive spatial objects can be significantly reduced. Particularly...
The concept of complex event processing (CEP) and complex-event-aware services have been extensively studied to retrieve relevant information from massive amounts of real-time streaming events. In mobile environments, the Mobility-aware CEP (MCEP) system was proposed to address the synchronization problem between different query ranges and MCEP operators. We noticed that MCEP systems lack the...
In this paper, we provide a comparison of language features and runtime systems of commonly used threading parallel programming models for high performance computing, including OpenMP, Intel Cilk Plus, Intel TBB, OpenACC, Nvidia CUDA, OpenCL, C++11 and PThreads. We then report our performance comparison of OpenMP, Cilk Plus and C++11 for data and task parallelism on CPU using benchmarks. The results show...
The tasking model of OpenMP 4.0 supports both nesting and the definition of dependences between sibling tasks. A natural way to parallelize many codes with tasks is to first taskify the high-level functions and then to further refine these tasks with additional subtasks. However, this top-down approach has some drawbacks since combining nesting with dependencies usually requires additional measures...
Many applications have irregular behavior — e.g., input-dependent solvers, irregular memory accesses, or unbiased branches — that cannot be captured using today's automated performance modeling techniques. We describe new hierarchical critical path analyses for the Palm model generation tool. To obtain a good tradeoff between model accuracy, generality, and generation cost, we combine static and dynamic...
HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To address these challenging demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified...
This paper presents a GPU-accelerated parallel MAX-MIN Ant System (MMAS) algorithm based on an approach named grouped roulette wheel selection (G-Roulette). A data parallel model is adapted in the proposed approach with special consideration for GPU architecture. We propose a G-Roulette strategy to enhance the parallel computation of fitness-proportionate selection. The G-Roulette strategy includes...
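The fitness-proportionate selection that G-Roulette parallelizes can be shown as a sequential baseline: spin a point on a wheel whose slices are proportional to fitness, then walk the cumulative distribution. This is a generic sketch of roulette wheel selection, not the paper's GPU formulation; names are illustrative.

```java
import java.util.Random;

// Baseline roulette wheel (fitness-proportionate) selection: the sequential
// primitive that grouped, GPU-parallel schemes like G-Roulette build on.
public class RouletteWheel {
    // Returns the index of the chosen individual; 'spin' is a uniform draw in [0,1).
    public static int select(double[] fitness, double spin) {
        double total = 0;
        for (double f : fitness) total += f;
        double target = spin * total;          // point on the wheel
        double cumulative = 0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];          // walk the cumulative distribution
            if (target < cumulative) return i;
        }
        return fitness.length - 1;             // guard against rounding at 1.0
    }

    public static void main(String[] args) {
        double[] fitness = {1.0, 3.0, 6.0};    // individual 2 owns 60% of the wheel
        Random rng = new Random(42);
        int[] counts = new int[fitness.length];
        for (int i = 0; i < 10_000; i++) counts[select(fitness, rng.nextDouble())]++;
        System.out.printf("%d %d %d%n", counts[0], counts[1], counts[2]);
    }
}
```

The cumulative scan makes each selection inherently sequential, which is why a grouped, data-parallel reformulation is needed to map it well onto GPU threads.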
The increase in speed and capacity of FPGAs is faster than the development of effective design tools to fully utilize it, and routing of nets remains as one of the most time-consuming stages of the FPGA design flow. While existing works have proposed methods of accelerating routing through parallelization, they are limited by the memory architecture of the system that they target. In this paper, we...
Parallel patterns (e.g., map, reduce) have gained traction as an abstraction for targeting parallel accelerators and are a promising answer to the performance portability problem. However, compiling high-level programs into efficient low-level parallel code is challenging. Current approaches start from a high-level parallel IR and proceed to emit GPU code directly in one big step. Fixed strategies...
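The map and reduce patterns named in the abstract can be expressed with Java parallel streams as a hedged illustration of the idea: the program states what to compute and the runtime decides how to split the work. The method name below is an illustrative choice, not from the paper.

```java
import java.util.stream.LongStream;

// Sketch of the map/reduce parallel patterns as a high-level abstraction:
// the program declares the computation; the runtime partitions the range.
public class MapReducePattern {
    // Sum of squares of 1..n via map (x -> x*x) then reduce (+).
    public static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                         .parallel()             // runtime splits the range
                         .map(x -> x * x)        // map pattern
                         .reduce(0L, Long::sum); // reduce pattern
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10)); // 385
    }
}
```

Because the patterns carry the parallel structure, a compiler (or here, the stream runtime) is free to choose tiling, fusion, and scheduling strategies per target, which is the portability argument the abstract makes.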
As the number of processing elements in modern chips keeps increasing, the evaluation of new designs will need to account for various challenges at the NoC level. To cope with the impractically long run times when simulating large NoCs, we introduce a novel GPU-based parallel simulation method that can speed up simulations by over 250×, while offering RTL-like accuracy. These promising results make...
In machine learning (ML), the model we use is increasingly important, and the model's parameters, the core of ML, are adjusted by iteratively processing a training dataset until convergence. Although data-parallel ML systems often exploit the error tolerance of this process when synchronizing the model parameters to maximize parallelism, the synchronization of model parameters may be delayed in completion,...
Extreme-scale HPC systems are expected to reach exascale performance around the year 2020. While it is widely known that these systems pose new challenges regarding energy efficiency of architectures, concurrency and resiliency, they also challenge developers of applications trying to efficiently utilize resources: managing parallel control flows, hardware resources and dependencies is a complex...
When parallel and distributed processing is used for wireless network emulation, the results differ if each receiver node runs events independently without synchronization among them. In our previous study of parallel wireless network emulation, we proposed a synchronization method to obtain the same result for collision events among the terminals that receive the...
Parallel computing has been gaining interest due to physical constraints that prevent further frequency scaling. Therefore, in order to achieve high performance on multicore systems, programmers need to focus on parallelizing their programs. Although many parallel APIs written by experts are available to ease coding, they do not automatically guarantee good performance. This paper...