Author(s):
Ninad Adi

Abstract:
This paper presents a comparative performance analysis of distributed table join algorithms implemented with the MPI4PY library and the MapReduce framework. Two MPI-based hash joins, one using point-to-point communication and one using collective communication, are benchmarked against a reduce-side MapReduce join. The results reveal no universally optimal algorithm: the naive nested join outperforms the others for datasets of fewer than 20 rows, single-process MPI is optimal for mid-sized data (20–100,000 rows), and multi-process MPI excels beyond that. The study highlights performance trade-offs and the impact of cluster inconsistencies, and proposes future improvements, including the use of Hadoop for MapReduce scalability.

Pages: 740-751
