ISSN 0253-2778

CN 34-1054/N

Open AccessOpen Access JUSTC Original Paper

PipelineJoin:A new MapReduce-based multi-table join algorithm

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2015.10.006
  • Received Date: 27 August 2015
  • Accepted Date: 29 September 2015
  • Rev Recd Date: 29 September 2015
  • Publish Date: 30 October 2015
  • MapReduce, a parallel and distributed computing model, has been widely used to process join operations for two or more large tables. The existing MapReduce-based multi-table join algorithms all have some limitations when dealing with chain join. Some methods can not process join operations for multi large tables, and others involve sequentially running too many MapReduce tasks, which leads to low efficiency. Here a new MapReduce-based multi-table join algorithm, PipelineJoin, is proposed to process chain join of a number of tables. PipelineJoin adopts a pipeline model and a scheduler to allow the overlapping execution of a series of Map tasks and Reduce tasks in the whole join process so as to enhance the efficiency of multi-table join, while effectively overcoming the deficiency of the existing methods. Extensive experimental results based on various synthetic datasets show that the proposed algorithm can greatly reduce join operation time compared with the existing chain join algorithms.
    MapReduce, a parallel and distributed computing model, has been widely used to process join operations for two or more large tables. The existing MapReduce-based multi-table join algorithms all have some limitations when dealing with chain join. Some methods can not process join operations for multi large tables, and others involve sequentially running too many MapReduce tasks, which leads to low efficiency. Here a new MapReduce-based multi-table join algorithm, PipelineJoin, is proposed to process chain join of a number of tables. PipelineJoin adopts a pipeline model and a scheduler to allow the overlapping execution of a series of Map tasks and Reduce tasks in the whole join process so as to enhance the efficiency of multi-table join, while effectively overcoming the deficiency of the existing methods. Extensive experimental results based on various synthetic datasets show that the proposed algorithm can greatly reduce join operation time compared with the existing chain join algorithms.
  • loading
  • [1]
    Slagter K, Hsu C H, Chung Y C, et al. SmartJoin: A network-aware multiway join for MapReduce[J]. Cluster Computing, 2014, 17(3): 629-641.
    [2]
    Jiang D W, Tung A K H, Chen G. MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299-1311.
    [3]
    Afrati F N, Ullman J D. Optimizing multiway joins in a map-reduce environment[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1282-1298.
    [4]
    Kimmett, Thomo A, Venkatesh S. Three-way joins on MapReduce: An experimental study[C]// Proceedings of the 5th International Conference on Information, Intelligence, Systems and Applications. Crete, Greece: IEEE Press, 2014: 227-232.
    [5]
    Yan K, Zhu H. Two MRJs for multi-way theta-join in MapReduce[C]// Proceedings of the 6th International Conference on Internet and Distributed Computing Systems. Hangzhou, China: Springer, 2013:321-332.
    [6]
    Dittrich J, Quiané-Ruiz J A, Jindal A, et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)[C]// Proceedings of the 36th International Conference on Very Large Data Bases. Singapore: ACM Press, 2010: 518-529.
    [7]
    Eltabakh M Y, Tian Y Y, ?zcan F, et al. Cohadoop: Flexible data placement and its exploitation in hadoop[C]// Proceedings of the 37th International Conference on Very Large Data Bases. Seattle, USA: ACM Press, 2011: 575-585.
    [8]
    Yang H C, Dasdan A, Hsiao R L, et al. Map-reduce-merge: Simplified relational data processing on large clusters[C]// Proceedings of the ACM SIGMOD International Conference on Management of Data. Beijing, China: ACM Press, 2007: 1029-1040.
    [9]
    Blanas S, Patel J M, Ercegovac V, et al. A comparison of join algorithms for log processing in MapReduce[C]// Proceedings of the ACM SIGMOD International Conference on Management of Data. Indianapolis, USA: ACM Press, 2010: 975-986.
    [10]
    Zhang X X, Guo Z L, Guo H L, et al. CasJoin: A Cascade Chain for Text Similarity Joins[C]// Proceedings of the 19th ACM international conference on Information and knowledge management. New York, USA: ACM Press, 2010: 1725-1728.
    [11]
    Blanas S, Li Y N, Patel J M. Design and evaluation of main memory hash join algorithms for multi-core CPUs[C]// Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. Athens, Greece: ACM Press, 2011:37-48.
    [12]
    Yuan Y, Wang D, Liu J C. Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments[C]// Proceedings of the IEEE Conference on Computer Communications. Toronto, Canada: ACM Press, 2014: 2175-2183.
    [13]
    Hunt P, Konar M, Junqueira F P, et al. ZooKeeper: Wait-free coordination for Internet-scale systems[C]// Proceedings of the Annual Technical Conference. Boston, USA: ACM Press, 2010:11-24.)
  • 加载中

Catalog

    [1]
    Slagter K, Hsu C H, Chung Y C, et al. SmartJoin: A network-aware multiway join for MapReduce[J]. Cluster Computing, 2014, 17(3): 629-641.
    [2]
    Jiang D W, Tung A K H, Chen G. MAP-JOIN-REDUCE: toward scalable and efficient data analysis on large clusters[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299-1311.
    [3]
    Afrati F N, Ullman J D. Optimizing multiway joins in a map-reduce environment[J]. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1282-1298.
    [4]
    Kimmett, Thomo A, Venkatesh S. Three-way joins on MapReduce: An experimental study[C]// Proceedings of the 5th International Conference on Information, Intelligence, Systems and Applications. Crete, Greece: IEEE Press, 2014: 227-232.
    [5]
    Yan K, Zhu H. Two MRJs for multi-way theta-join in MapReduce[C]// Proceedings of the 6th International Conference on Internet and Distributed Computing Systems. Hangzhou, China: Springer, 2013:321-332.
    [6]
    Dittrich J, Quiané-Ruiz J A, Jindal A, et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)[C]// Proceedings of the 36th International Conference on Very Large Data Bases. Singapore: ACM Press, 2010: 518-529.
    [7]
    Eltabakh M Y, Tian Y Y, ?zcan F, et al. Cohadoop: Flexible data placement and its exploitation in hadoop[C]// Proceedings of the 37th International Conference on Very Large Data Bases. Seattle, USA: ACM Press, 2011: 575-585.
    [8]
    Yang H C, Dasdan A, Hsiao R L, et al. Map-reduce-merge: Simplified relational data processing on large clusters[C]// Proceedings of the ACM SIGMOD International Conference on Management of Data. Beijing, China: ACM Press, 2007: 1029-1040.
    [9]
    Blanas S, Patel J M, Ercegovac V, et al. A comparison of join algorithms for log processing in MapReduce[C]// Proceedings of the ACM SIGMOD International Conference on Management of Data. Indianapolis, USA: ACM Press, 2010: 975-986.
    [10]
    Zhang X X, Guo Z L, Guo H L, et al. CasJoin: A Cascade Chain for Text Similarity Joins[C]// Proceedings of the 19th ACM international conference on Information and knowledge management. New York, USA: ACM Press, 2010: 1725-1728.
    [11]
    Blanas S, Li Y N, Patel J M. Design and evaluation of main memory hash join algorithms for multi-core CPUs[C]// Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. Athens, Greece: ACM Press, 2011:37-48.
    [12]
    Yuan Y, Wang D, Liu J C. Joint scheduling of MapReduce jobs with servers: Performance bounds and experiments[C]// Proceedings of the IEEE Conference on Computer Communications. Toronto, Canada: ACM Press, 2014: 2175-2183.
    [13]
    Hunt P, Konar M, Junqueira F P, et al. ZooKeeper: Wait-free coordination for Internet-scale systems[C]// Proceedings of the Annual Technical Conference. Boston, USA: ACM Press, 2010:11-24.)

    Article Metrics

    Article views (25) PDF downloads(77)
    Proportional views

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return