We discuss the feasibility of an in-house Schrödinger equation solver on the Intel Broadwell Xeon processor with a built-in FPGA, with a particular focus on the performance of large-scale sparse matrix-vector multiplication (SpMV) that is the core numerical operation of electronic structure simulations for multi-million atomic systems. The double-precision SpMV section in our solver is offloaded to...
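For readers unfamiliar with the kernel this abstract centers on: a minimal sketch of double-precision SpMV over a matrix in Compressed Sparse Row (CSR) form. This is only an illustration of the operation the paper offloads; the FPGA-offloaded implementation itself is not reproduced here, and the function name is our own.

```c
#include <stddef.h>

/* Double-precision sparse matrix-vector multiply, y = A*x, with A stored
 * in Compressed Sparse Row (CSR) form: row_ptr delimits each row's slice
 * of (col_idx, vals). A plain CPU reference kernel, not the paper's
 * FPGA-offloaded version. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y) {
    for (size_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

For the multi-million-atom systems mentioned above, the matrix is large and very sparse, so this kernel's irregular, memory-bound access pattern (the indirect load `x[col_idx[j]]`) is what makes it a natural offload candidate.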
The interconnect is a crucial component of any HPC machine, and its performance is one of the contributing factors to the overall performance of an HPC system. The most popular interface for connecting a Network Interface Card (NIC) to the CPU is PCI Express (PCIe). With denser core counts in compute servers and steadily maturing fabric interconnect speeds, there is a need to maximize the packet data movement...
Among the high-radix and low-diameter networks, the fat-tree topology is commonly used in HPC and datacenter systems. Resource and job management is critically important for mitigating application interference in order to achieve high system performance and utilization. Preliminary studies have shown the effect of job placement on the performance of parallel scientific applications. In this work we study interference...
A new page swap protocol is proposed for a user-level remote memory paging system to accelerate out-of-core processing with multi-threaded user programs and libraries written in OpenMP and pthread. The original swap protocol has a bottleneck that limits efficient page swapping when swaps are requested by multiple threads in a user program, because all MPI communications to memory servers and page...
Convolutional neural networks (CNNs) have been widely applied in various applications. However, the computation-intensive convolutional layers and memory-intensive fully connected layers have brought many challenges to the implementation of CNNs on embedded platforms. To overcome this problem, this work proposes a power-efficient accelerator for CNNs, and different methods are applied to optimize the...
Due to the rapidly increasing use of big data, machines are under pressure to provide more computing power at higher energy efficiency while maintaining simpler and more scalable computing paradigms. Transactional Memory (TM) is one such technique: it can be used for synchronization instead of the conventional locks used in critical sections, since it has simpler paradigms, is scalable, and has better energy...
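The execute-validate-retry cycle that transactional memory builds on can be sketched in plain software. The following is only an illustration of the optimistic-concurrency principle using C11 atomics, not a model of hardware TM (e.g. Intel TSX) or of any particular paper's scheme; the names are our own.

```c
#include <stdatomic.h>

/* Optimistic "transaction" on a shared counter: read a snapshot, compute
 * speculatively, and commit only if no other thread changed the value in
 * the meantime; otherwise abort and re-execute. This is the basic cycle
 * TM generalizes to arbitrary read/write sets. */
static _Atomic long counter = 0;

void txn_increment(void) {
    for (;;) {                                   /* abort -> retry loop */
        long snapshot = atomic_load(&counter);   /* speculative read    */
        long result = snapshot + 1;              /* transaction body    */
        /* commit: succeeds only if counter still equals the snapshot,
         * i.e. no conflicting writer intervened */
        if (atomic_compare_exchange_weak(&counter, &snapshot, result))
            return;
    }
}
```

Unlike a lock, no thread ever blocks here; contention shows up as aborted-and-retried transactions instead, which is where the energy and scalability arguments above come from.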
Accelerated clusters, which are distributed memory systems equipped with accelerators, have been used in various fields. For accelerated clusters, programmers often implement their applications using a combination of MPI and CUDA (MPI+CUDA). However, this approach faces programming complexity issues. This paper introduces the XcalableACC (XACC) language, which is a hybrid model of XcalableMP (XMP) and...
Almost all performance analysis tools in the HPC space perform some form of aggregation to compute summary information of a series of performance measurements, from summations to more complex operations like histograms. Aggregation not only reduces data volumes and consequently storage space requirements and overheads, but is also crucial to extract insights from recorded measurement data. In current...
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memories (NVM) provides a solution for building fault-tolerant HPC. Data in NVM-based main memory are not lost when the system crashes, because of the non-volatile nature of NVM. However, because of volatile caches, data must be logged and explicitly flushed from caches into NVM to ensure consistency and correctness...
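The log-then-flush discipline this abstract describes can be sketched as undo logging: the old value must reach NVM before the in-place update does, so the log entry is flushed out of the volatile cache first. This is a generic illustration of the technique, not the paper's scheme; `flush_to_nvm` is a hypothetical stand-in for a cache-line write-back plus fence (e.g. `clwb`; `sfence` on x86), made a no-op here so the sketch runs anywhere.

```c
#include <stddef.h>

typedef struct { long *addr; long old_val; int valid; } undo_entry;

/* Stand-in for cache-line write-back + fence; a no-op in this sketch. */
static void flush_to_nvm(const void *p, size_t n) { (void)p; (void)n; }

void logged_store(undo_entry *log, long *addr, long new_val) {
    log->addr = addr;                 /* 1. record the old value ...       */
    log->old_val = *addr;
    log->valid = 1;
    flush_to_nvm(log, sizeof *log);   /* 2. ... and force the log to NVM   */
    *addr = new_val;                  /* 3. only then update in place      */
    flush_to_nvm(addr, sizeof *addr);
    log->valid = 0;                   /* 4. commit: invalidate the log     */
    flush_to_nvm(&log->valid, sizeof log->valid);
}

/* After a crash, a still-valid log entry means the update may be torn:
 * roll the location back to the logged old value. */
void recover(undo_entry *log) {
    if (log->valid) {
        *log->addr = log->old_val;
        log->valid = 0;
    }
}
```

The ordering between steps 2 and 3 is exactly what the volatile caches threaten: without the explicit flush, the new data could reach NVM before the log entry, leaving no way to roll back after a crash.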
System monitoring is an established tool for measuring the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological...
This paper proposes a novel hybrid transactional memory scheme based on both abort prediction and an adaptive retry policy. First, the proposed scheme can predict not only conflicts between transactions running concurrently, but also the capacity and other aborts of transactions by collecting the information of previously executed transactions. Second, the proposed scheme can provide an adaptive retry...
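To make the adaptive-retry idea concrete: in a hybrid TM, the hardware reports why a transaction aborted, and the retry budget can depend on that cause before falling back to a software path such as a global lock. The policy below is a made-up illustrative sketch of this pattern, not the paper's predictor, and all names and budget values are our own.

```c
/* Abort causes as a hardware TM might classify them. */
typedef enum { ABORT_CONFLICT, ABORT_CAPACITY, ABORT_OTHER } abort_cause;

/* Adaptive retry policy sketch: returns how many more hardware retries
 * remain, or 0 to take the software fallback (e.g. a global lock). */
int retries_for(abort_cause cause, int attempts_so_far) {
    switch (cause) {
    case ABORT_CONFLICT:   /* transient contention: likely to succeed soon */
        return attempts_so_far < 5 ? 5 - attempts_so_far : 0;
    case ABORT_CAPACITY:   /* working set exceeds HW limits: deterministic,
                            * retrying in hardware cannot help */
        return 0;
    default:               /* interrupts, faults, etc.: allow one more try */
        return attempts_so_far < 1 ? 1 : 0;
    }
}
```

The key distinction the abstract draws is visible here: conflict aborts are worth retrying, whereas capacity aborts repeat deterministically, so recognizing them early avoids wasted re-executions.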
Scientific computing requires trust in results. In high-performance computing, trust is impeded by silent data corruption (SDC), i.e., corruption that remains unnoticed. Numerical integration solvers are especially sensitive to SDCs because an SDC introduced in a certain step affects all the following steps. SDCs can even cause the solver to become unstable. Adaptive solvers can change the...
Parallelization on a GPU (graphics processing unit) cluster is an effective approach to reducing the huge computation time of backprojection, which is the most accurate SAR (synthetic aperture radar) imaging algorithm for reconstructing images with no errors caused by the platform motion. To obtain accurate imagery in real-time, we developed a distributed parallel backprojection algorithm for stripmap...
Efficiently programming shared-memory machines is challenging because mapping application threads onto the memory hierarchy has a strong impact on performance. However, optimizing such thread placement is difficult: architectures become increasingly complex, and application behavior changes with implementations and input parameters, e.g., problem size and number of threads. In this work,...
We present EclipseMR, a novel MapReduce framework prototype that efficiently utilizes a large distributed memory in cluster environments. EclipseMR consists of double-layered consistent hash rings: a decentralized DHT-based file system and an in-memory key-value store that employs consistent hashing. The in-memory key-value store in EclipseMR is designed not only to cache local data but also remote...
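The consistent-hashing lookup underlying such hash rings can be sketched briefly. This is only the textbook single-ring primitive, not EclipseMR's double-layered design (virtual nodes and replication are also omitted), and the toy hash and names are our own.

```c
#include <stddef.h>

/* Ring positions live in [0, RING). A key is owned by the first node at
 * or after the key's hash position, wrapping around the ring. */
#define RING 1024u

/* node_pos must be sorted ascending; h is a position on the ring. */
size_t owner_of(const unsigned *node_pos, size_t n_nodes, unsigned h) {
    for (size_t i = 0; i < n_nodes; ++i)
        if (node_pos[i] >= h)
            return i;
    return 0;                         /* wrap around to the first node */
}

/* Toy mixing hash for the sketch; a real system would use a stronger one. */
unsigned ring_hash(unsigned key) {
    key ^= key >> 16; key *= 0x45d9f3bu; key ^= key >> 16;
    return key % RING;
}

size_t lookup(const unsigned *node_pos, size_t n_nodes, unsigned key) {
    return owner_of(node_pos, n_nodes, ring_hash(key));
}
```

The property that makes this attractive for a distributed cache like the one described above: removing a node only remaps the keys that node owned (to its ring successor), leaving every other key's placement untouched.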
High-radix, low-diameter, hierarchical networks based on the Dragonfly topology are common choices for building next-generation HPC systems. However, effective tools are lacking for analyzing the network performance and exploring the design choices of such emerging networks at scale. In this paper, we present visual analytics methods that couple data aggregation techniques with interactive visualizations...
A heterogeneous memory system (HMS) consists of multiple memory components with different properties. GPUs are a representative architecture with an HMS. It is challenging to decide the optimal placement of data objects on an HMS because of the large exploration space and the complicated memory hierarchy. In this paper, we introduce performance modeling techniques to predict performance of various data placements...
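A toy version of the kind of model such placement decisions rest on: estimate the access time of a data object in each candidate memory from latency and bandwidth, and place it where the estimate is lowest. This is a generic sketch, not the paper's model; the numbers used in any example are illustrative, not measurements.

```c
#include <stddef.h>

/* One candidate memory in the heterogeneous system. */
typedef struct { const char *name; double latency_us; double gb_per_s; } mem_t;

/* Predicted time (microseconds) to stream `bytes` from memory m:
 * fixed latency plus transfer time. 1 GB/s = 1e3 bytes/us. */
double predict_us(mem_t m, double bytes) {
    return m.latency_us + bytes / (m.gb_per_s * 1e3);
}

/* Index of the memory with the lowest predicted access time. */
size_t best_placement(const mem_t *mems, size_t n, double bytes) {
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (predict_us(mems[i], bytes) < predict_us(mems[best], bytes))
            best = i;
    return best;
}
```

Even this trivial model captures why placement is non-obvious: a high-bandwidth but high-latency memory wins for large objects, while a low-latency one wins for small objects, and the crossover depends on the object size.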
Multi-/many-core CPU-based architectures are seeing widespread adoption due to their unprecedented compute performance in a small power envelope. With the increasingly large number of cores on each node, applications spend a significant portion of their execution time in intra-node communication. While shared memory is commonly used for intra-node communication, it needs to copy each message once...