Similarity joins for high‐dimensional data using Spark

Chuitian Rong; Xiaohai Cheng; Ziliang Chen; Na Huo

doi:10.1002/cpe.5339

RESEARCH ARTICLE

Source

Concurrency and Computation: Practice and Experience, Volume 31, Issue 20

Abstract

Similarity join on high‐dimensional data is a primitive operation: given a data set, it finds all pairs of data items whose distance under a specific distance measure is no more than ϵ. As the scale and dimensionality of the data set increase, the computation cost grows rapidly. Hadoop and Spark have become popular platforms for big‐data analysis. Because Spark has native advantages in iterative computation, we adopt it as our platform for performing similarity joins on high‐dimensional data sets. To resolve the data imbalance, data duplication, and redundant computation problems of existing works, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce the dimensionality of the data using a symbolic aggregation method. We then apply a vertical partition operation to the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters are used to prune false positives at an early stage. Finally, the partial results generated from each partition are aggregated and verified to obtain the final results. The proposed algorithm can significantly improve the efficiency of similarity joins on high‐dimensional data. To verify the efficiency and scalability of our methods, we implemented them using MapReduce and Spark. We compared our methods with existing works on public data sets, and the experimental results show that the new methods are more efficient and scalable under different running environments.
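The filter-and-verify idea described in the abstract can be illustrated with a small single-machine sketch. It is not the authors' algorithm: it assumes Euclidean distance, substitutes plain piecewise aggregate approximation (the averaging step that symbolic aggregation builds on) for the full symbolic representation, and omits vertical partitioning and Spark entirely. The names `paa`, `similarity_join`, and `segments` are hypothetical and chosen only for this illustration.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa(x, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment.

    Assumes len(x) is divisible by `segments`.
    """
    width = len(x) // segments
    return [sum(x[i * width:(i + 1) * width]) / width for i in range(segments)]


def similarity_join(data, eps, segments):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    n = len(data[0])
    # sqrt(n / segments) * dist(paa(x), paa(y)) never exceeds dist(x, y),
    # so a pair whose reduced distance already exceeds eps can be pruned safely.
    scale = math.sqrt(n / segments)
    reduced = [paa(x, segments) for x in data]

    results = []
    for i, j in combinations(range(len(data)), 2):
        # Filter step: cheap lower bound on the reduced vectors.
        if scale * euclidean(reduced[i], reduced[j]) > eps:
            continue
        # Verification step: exact distance on the original vectors.
        if euclidean(data[i], data[j]) <= eps:
            results.append((i, j))
    return results


data = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0, 5.0],
    [5.0, 5.1, 5.0, 5.2],
]
print(similarity_join(data, eps=0.5, segments=2))  # [(0, 1), (2, 3)]
```

Because the reduced distance is a lower bound, the filter can only discard pairs that are truly farther apart than ϵ, so the result matches a brute-force join while most distant pairs skip the full-dimensional distance computation.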