2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

book

IEEE

chapter

Elastic Consistent Hashing for Distributed Storage Systems

Wei Xie, Yong Chen

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 876 - 885

Elastic distributed storage systems have been increasingly studied in recent years because power consumption has become a major problem in data centers. Much progress has been made in improving the agility of resizing small- and large-scale distributed storage systems. However, most of these studies focus on metadata based distributed storage systems. On the other hand, emerging consistent hashing...

chapter

Community Detection on the GPU

Md. Naim, Fredrik Manne, Mahantesh Halappanavar, Antonino Tumeo

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 625 - 634

We present and evaluate a new GPU algorithm based on the Louvain method for community detection. Our algorithm is the first for this problem that parallelizes the access to individual edges. In this way we can fine tune the load balance when processing networks with nodes of highly varying degrees. This is achieved by scaling the number of threads assigned to each node according to its degree. Extensive...

chapter

SimProf: A Sampling Framework for Data Analytic Workloads

Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 595 - 604

Today, there is a steep rise in the amount of data being collected from diverse applications. Consequently, data analytic workloads are gaining popularity to gain insight that can benefit the application, e.g., financial trading, social media analysis. To study the architectural behavior of the workloads, architectural simulation is one of the most common approaches. However, because of the long-running...

chapter

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight

Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 615 - 624

To explore the potential of training complex deep neural networks (DNNs) on commercial chips other than GPUs, we report our work on swDNN, which is a highly efficient library for accelerating deep learning applications on the newly announced world-leading supercomputer, Sunway TaihuLight. Targeting the SW26010 processor, we derive a performance model that guides us in the process of identifying...

chapter

Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores

Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 635 - 645

Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for such tasks is the Breadth-First Search (BFS) algorithm, the foundation for many other important graph algorithms such as calculating the shortest path or finding the maximum flow in graphs. In this paper, we share our experience of designing...

chapter

Generating Families of Practical Fast Matrix Multiplication Algorithms

Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 656 - 667

Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace and...

chapter

Automatic-Signal Monitors with Multi-object Synchronization

Wei-Lun Hung, Vijay K. Garg

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 927 - 936

Current monitor based systems have some disadvantages for multi-object operations. They require the programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory...

chapter

Scalable Lock-Free Vector with Combining

Ivan Walulya, Philippas Tsigas

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 917 - 926

Dynamic vectors are among the most commonly used data structures in programming. They provide constant time random access and resizable data storage. Additionally, they provide constant time insertion (pushback) and deletion (popback) at the end of the sequence. However, in a multithreaded system, concurrent pushback and popback operations attempt to update the same shared object, creating a synchronization...

chapter

Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 957 - 966

Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installing more hardware resources than can be fully powered under a given power limit coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated...

chapter

Optimal Algorithms for a Mesh-Connected Computer with Limited Additional Global Bandwidth

Yujie An, Quentin F. Stout

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 937 - 946

We give efficient algorithms to solve fundamental data movement problems on mesh-connected computers augmented with limited global bandwidth. Adding a small amount of global bandwidth makes a practical design that combines aspects of mesh and fully connected models to achieve the benefits of each. We give algorithms for sorting, finding the median, finding a spanning tree, and determining various...

chapter

A Robust Parallel Preconditioner for Indefinite Systems Using Hierarchical Matrices and Randomized Sampling

Pieter Ghysels, Xiaoye Sherry Li, Christopher Gorman, Francois-Henry Rouet

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 897 - 906

We present the design and implementation of a parallel and fully algebraic preconditioner based on an approximate sparse factorization using low-rank matrix compression. The sparse factorization uses a multifrontal algorithm with fill-in occurring in dense frontal matrices. These frontal matrices are approximated as hierarchically semi-separable matrices, which are constructed using a randomized sampling...

chapter

Runtime Aware Architectures

Mateo Valero

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 819

chapter

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

Benjamin Klenk, Holger Froening, Hans Eberle, Larry Dennison

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 855 - 865

Accelerators, such as GPUs, have proven to be highly successful in reducing execution time and power consumption of compute-intensive applications. Even though they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers as CPUs are handling all communication tasks. However, we observe that accelerators...

chapter

FlexVC: Flexible Virtual Channel Management in Low-Diameter Networks

Pablo Fuentes, Enrique Vallejo, Ramon Beivide, Cyriel Minkenberg, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 842 - 854

Deadlock avoidance mechanisms for lossless low-distance networks typically increase the order of virtual channel (VC) index with each hop. This restricts the number of buffer resources depending on the routing mechanism and limits performance due to an inefficient use. Dynamic buffer organizations increase implementation complexity and only provide small gains in this context because a significant...

chapter

PaPar: A Parallel Data Partitioning Framework for Big Data Applications

Hao Wang, Jing Zhang, Da Zhang, Sarunya Pumma, et al.

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 605 - 614

Today, big data applications can generate large-scale data sets at an unprecedented rate; and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle the computational skew, it is difficult to efficiently implement skew-resistant mechanisms, because the runtime of different partitions...

chapter

Partitioning Trillion-Edge Graphs in Minutes

George M. Slota, Sivasankaran Rajamanickam, Karen Devine, Kamesh Madduri

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 646 - 655

We introduce XtraPuLP, a new distributed-memory graph partitioner designed to process trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated as a viable means to produce high quality partitions with minimal computation time. On a collection of large sparse graphs, we show that XtraPuLP partitioning quality is comparable...

chapter

Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations

Tobias Wicky, Edgar Solomonik, Torsten Hoefler

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 678 - 687

We present a new parallel algorithm for solving triangular systems with multiple right hand sides (TRSM). TRSM is used extensively in numerical linear algebra computations, both to solve triangular linear systems of equations as well as to compute factorizations with triangular matrices, such as Cholesky, LU, and QR. Our algorithm achieves better theoretical scalability than known alternatives, while...

chapter

RCube: A Power Efficient and Highly Available Network for Data Centers

Zhenhua Li, Yuanyuan Yang

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 718 - 727

Designing a cost-effective network for data centers that can deliver sufficient bandwidth and provide high availability has drawn tremendous attention recently. In this paper, we propose a novel server-centric network structure called RCube, which is energy efficient and can deploy a redundancy scheme to improve the availability of data centers. Moreover, RCube shares many good properties with BCube,...

chapter

Cooling-Aware Job Scheduling and Node Allocation for Overprovisioned HPC Systems

Thang Cao, Wei Huang, Yuan He, Masaaki Kondo

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 728 - 737

A limited power budget is becoming one of the most crucial challenges in developing supercomputer systems. Hardware overprovisioning, which installs a larger number of nodes beyond the limitations of the power constraint, is an attractive way to design next-generation supercomputers. In air-cooled HPC centers, about half of the total power is consumed by cooling facilities. Reducing cooling power and...
