Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart model has been proposed to decrease the time of failure...
Multi-/many-core CPU-based architectures are seeing widespread adoption due to their unprecedented compute performance in a small power envelope. With the increasingly large number of cores on each node, applications spend a significant portion of their execution time in intra-node communication. While shared memory is commonly used for intra-node communication, it needs to copy each message once...
Studying the interaction among applications, MPI runtimes, and the fabric they run on is critical to understanding application performance. There exists no high-performance and scalable tool that enables understanding this interplay on modern multi-petaflop systems. Designing such a tool is non-trivial and involves multiple components including 1) data profiling/collection from network/MPI library,...
Deep Learning over Big Data (DLoBD) is becoming one of the most important research paradigms to mine value from the massive amount of gathered data. Many emerging deep learning frameworks start running over Big Data stacks, such as Hadoop and Spark. With the convergence of HPC, Big Data, and Deep Learning, these DLoBD stacks are taking advantage of RDMA and multi-/many-core based CPUs/GPUs. Even though...
Broadcast operations (e.g. MPI_Bcast) have been widely used in deep learning applications to exchange a large amount of data among multiple graphics processing units (GPUs). Recent studies have shown that leveraging the InfiniBand hardware-based multicast (IB-MCAST) protocol can enhance scalability of GPU-based broadcast operations. However, these initial designs with IB-MCAST are not optimized for...
While GPUs are becoming common in HPC systems, the CPU is still responsible for managing both GPU-side and CPU-side compute, communication, and synchronization operations. For instance, if a result from a GPU-side computation is to be transferred to a remote destination, the CPU must synchronize on GPU compute completion before issuing a communication operation. Both CPU cycles and energy are consumed...
Distributed key-value store-based caching solutions are being increasingly used to accelerate Big Data applications on modern HPC clusters. This has necessitated incorporating fault-tolerance capabilities into high-performance key-value stores such as Memcached that are otherwise volatile in nature. In-memory replication is being used as the primary mechanism to ensure resilient data operations. However,...
High-speed interconnects (e.g. InfiniBand) have been widely deployed on modern HPC clusters. With the emergence of HPC in the cloud, high-speed interconnects have paved their way into the cloud with recently introduced Single Root I/O Virtualization (SR-IOV) technology, which is able to provide efficient sharing of high-speed interconnect resources and achieve near-native I/O performance. However,...
Running Big Data applications in the cloud has become extremely popular in recent times. To enable the storage of data for these applications, cloud-based distributed storage solutions are a must. OpenStack Swift is an object storage service which is widely used for such purposes. Swift is one of the main components of the OpenStack software package. Although Swift has become extremely popular in...
Big Data Systems are becoming increasingly complex and generally have very high operational costs. Cloud computing offers attractive solutions for managing large scale systems. However, one of the major bottlenecks in VM performance is virtualized I/O. Since Big Data applications and middleware rely heavily on high performance interconnects such as InfiniBand, the performance of virtualized InfiniBand...
CUDA 6.0 introduced the Managed Memory feature to boost productivity on GPU systems. It removes from the programmer the burden of explicit memory management and data movement between the host and the accelerator. However, these benefits restrict the pinning of memory and hence limit performance by precluding the use of performance-centric features like CUDA-IPC and GPUDirect RDMA. On another...
The MPI programming model has been used by several graph processing systems. Out of these systems, only the MPI two-sided programming model is used for data movements. The MPI one-sided programming model, which achieves better overlap of communication and computation, has been seen as advantageous for applications with irregular communication patterns. However, the benefits of the MPI one-sided programming model...
UPC++ is an emerging Partitioned Global Address Space (PGAS) library. The performance of UPC++ applications is highly dependent on the efficiency of the communication runtime. The current UPC++ runtime relies on MPI- and native IB verbs-based conduit implementations of the GASNet communication middleware. The hybrid MPI+UPC++ programming model is seen as an attractive way to program next-generation systems. However,...
The performance of Hadoop components can be significantly improved by leveraging advanced features such as Remote Direct Memory Access (RDMA) on modern HPC clusters, where high-performance networks like InfiniBand (IB) and RoCE have been deployed widely. With the emergence of high-performance computing in the cloud (HPC Cloud), high-performance networks have paved their way into the cloud with recently...
Fault tolerance for the upcoming exascale generation has long been an area of active research. One component of a fault-tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to the lack of a fixed peer. Note that...
Hadoop is gaining more and more popularity in virtualized environments because of the flexibility and elasticity offered by cloud-based systems. Hadoop supports topology-awareness through topology-aware designs in all of its major components. However, there exists no service that can automatically detect the underlying network topology in a scalable and efficient manner, and provide this information...
Deep learning frameworks have recently gained widespread popularity due to their highly accurate prediction capabilities and availability of low cost processors that can perform training over a large dataset quickly. Given the high core count in modern generation high performance computing systems, training deep networks over large data has now become practical. In this work, while targeting the Computational...
Existing InfiniBand drivers require communication buffers to be pinned in physical memory during communication. Most runtimes leave these buffers pinned until the end of the run. This limits the swappable memory space available to applications. To address these concerns, Mellanox has recently introduced the On-Demand Paging (ODP) feature for InfiniBand. With ODP, communication buffers are paged...