ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Article

Hybrid fault tolerance in distributed in-memory storage systems

Cite this:
https://doi.org/10.52396/JUSTC-2022-0125
More Information
  • Corresponding author: Email: siwu5938@ustc.edu.cn
  • Received Date: 06 September 2022
  • Accepted Date: 30 April 2023
  • An in-memory storage system provides submillisecond latency and improves the concurrency of user applications by caching data into memory from external storage. Fault tolerance of in-memory storage systems is essential, as the loss of cached data requires access to data from external storage, which markedly increases the response latency. Typically, replication and erasure code (EC) are two fault-tolerant schemes that pose different trade-offs between access performance and storage usage. To obtain the best performance-space trade-off, we design ElasticMem, a hybrid fault-tolerant distributed in-memory storage system that supports elastic redundancy transition to dynamically change the fault-tolerant scheme. ElasticMem exploits a novel EC-oriented replication (EOR) that carefully designs the data placement of replication according to the future data layout of EC to enhance the I/O efficiency of the redundancy transition. ElasticMem solves the consistency problem caused by concurrent data accesses via a lightweight table-based scheme combined with data bypassing: it detects correlated read and write requests and serves the subsequent read requests with local data. We implement a prototype of ElasticMem based on Memcached. Experiments show that ElasticMem remarkably reduces the redundancy transition time, the overall latency of correlated concurrent data accesses, and the latency of single data accesses among them.
    • Replication and erasure code (EC) are used to tolerate faults and have different time and space costs. To obtain the best of both worlds, we support hybrid fault tolerance and dynamic redundancy transition between the two schemes (see the storage-overhead sketch after this list).
    • EC-oriented replication is introduced to improve the I/O efficiency of the redundancy transition.
    • Local data can be leveraged to serve correlated requests while avoiding concurrency-induced consistency problems.
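
To make the replication-versus-EC trade-off concrete, here is a minimal C sketch (not ElasticMem code) that compares the memory overhead of r-way replication with that of (k, m) erasure coding; the parameters (k, m, r) = (4, 2, 3) match the configuration used in the evaluation figures below.

```c
#include <stdio.h>

/* Storage overhead relative to the raw object size:
 *  - r-way replication keeps r full copies           -> overhead = r
 *  - (k, m) erasure coding keeps m parity blocks for
 *    every k data blocks                             -> overhead = (k + m) / k */
static double replication_overhead(int r) { return (double)r; }
static double ec_overhead(int k, int m)   { return (double)(k + m) / (double)k; }

int main(void) {
    int k = 4, m = 2, r = 3;  /* (k, m, r) = (4, 2, 3), as in the experiments */
    printf("replication: %.2fx memory per object\n", replication_overhead(r)); /* 3.00x */
    printf("erasure code: %.2fx memory per object\n", ec_overhead(k, m));      /* 1.50x */
    return 0;
}
```

Replication pays the higher memory cost for fast single-copy reads and simple recovery, while EC roughly halves that cost at the price of encoding work and multi-block accesses; the elastic redundancy transition switches between the two schemes on demand.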

  • [1]
    Reinsel D, Gantz J, Rydning J. The Digitization of the World from Edge to Core. Framingham, MA: International Data Corporation, 2018.
    [2]
    Apache. HDFS Design. 2021. https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html. Accessed July 01, 2022.
    [3]
    Ghemawat S, Gobioff H, Leung S T. The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. New York: ACM, 2003, 29–43.
    [4]
    Apache. Spark. 2021. https://spark.apache.org/. Accessed July 01, 2022.
    [5]
    Apache. Storm. 2021. https://storm.apache.org/index.html. Accessed July 01, 2022.
    [6]
    Zawodny J. Redis: Lightweight key/value store that goes the extra mile. 2009, 79. www.linux.com/news/redis-lightweight-keyvalue-store-goes-extra-mile. Accessed July 01, 2022.
    [7]
    Fitzpatrick B. Distributed caching with memcached. Linux Journal, 2004, 124: 5. doi: 10.5555/1012889.1012894
    [8]
    Nishtala R, Fugal H, Grimm S, et al. Scaling memcache at Facebook. In: 10th USENIX Symposium on Networked Systems Design and Implementation. Lombard, IL: USENIX Association, 2013, 385–398.
    [9]
    Twitter Inc. Twemcache is the Twitter memcached. 2012. https://github.com/twitter/twemcache. Accessed July 01, 2022.
    [10]
    Goel A, Chopra B, Gerea C, et al. Fast database restarts at Facebook. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. New York: ACM, 2014, 541–549.
    [11]
    Budhiraja N, Marzullo K, Schneider F B, et al. The primary-backup approach. In: Distributed Systems. New York: ACM Press/Addison-Wesley Publishing Co., 1993: 199–216.
    [12]
    van Renesse R, Schneider F. Chain replication for supporting high throughput and availability. In: 6th Conference on Symposium on Operating Systems Design & Implementation. San Francisco, CA: USENIX Association, 2004, 91–104.
    [13]
    Lai C, Jiang S, Yang L, et al. Atlas: Baidu’s key-value storage system for cloud data. In: 2015 31st Symposium on Mass Storage Systems and Technologies (MSST). Santa Clara, USA: IEEE, 2015, 1–14.
    [14]
    Li S, Zhang Q, Yang Z, et al. BCStore: Bandwidth-efficient in-memory KV-store with batch coding. In: 33rd International Conference on Massive Storage Systems and Technologies. Santa Clara, USA: MSST, 2017.
    [15]
    Rashmi K V, Chowdhury M, Kosaian J, et al. EC-Cache: Load-balanced, low-latency cluster caching with online erasure coding. In: Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation. New York: ACM, 2016, 401–417.
    [16]
    Yiu M M T, Chan H H W, Lee P P C. Erasure coding for small objects in in-memory KV storage. In: Proceedings of the 10th ACM International Systems and Storage Conference. New York, USA: ACM, 2017, 14.
    [17]
    Xu L, Lyu M, Li Q, et al. SelectiveEC: Towards balanced recovery load on erasure-coded storage systems. IEEE Transactions on Parallel and Distributed Systems, 2022, 33 (10): 2386–2400. doi: 10.1109/tpds.2021.3129973
    [18]
    MacWilliams F J, Sloane N J A. The theory of error-correcting codes. In: North-Holland Mathematical Library. Murray Hill, USA: Bell Laboratories, 1977, 16.
    [19]
    Yang J, Yue Y, Rashmi K V. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), Berkeley CA, USA: USENIX Association, 2020, 191–208.
    [20]
    Xia M, Saxena M, Blaum M, et al. A tale of two erasure codes in HDFS. In: 13th USENIX conference on file and storage technologies (FAST’ 15). Santa Clara, USA: USENIX Association, 2015, 213–226.
    [21]
    Yao Q, Hu Y, Cheng L, et al. StripeMerge: Efficient wide-stripe generation for large-scale erasure-coded storage. In: 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS). DC, USA: IEEE, 2021, 483–493.
    [22]
    Wu S, Shen Z, Lee P P C. Enabling I/O-efficient redundancy transitioning in erasure-coded KV stores via elastic Reed–Solomon codes. In: 2020 International Symposium on Reliable Distributed Systems (SRDS). Shanghai, China: IEEE, 2020, 246–255.
    [23]
    Chen H, Zhang H, Dong M, et al. Efficient and available in-memory KV-store with hybrid erasure coding and replication. ACM Transactions on Storage, 2017, 13 (3): 25. doi: 10.1145/3129900
    [24]
    Kerrisk M. The Linux Programming Interface. San Francisco, USA: No Starch Press, 2010.
    [25]
    Libmemcached. https://libmemcached.org/libMemcached.html. Accessed July 01, 2022.
    [26]
    Plank J S. A tutorial on Reed–Solomon coding for fault‐tolerance in RAID‐like systems. Software:Practice and Experience, 1997, 27 (9): 995–1012. doi: 10.1002/(sici)1097-024x(199709)27:9<995::aid-spe111>3.0.co;2-6
    [27]
    Stuedi P, Trivedi A, Pfefferle J, et al. Unification of temporary storage in the nodekernel architecture. In: Proceedings of the 2019 USENIX Conference on USENIX Annual Technical Conference. New York: ACM, 2019, 767–781.
    [28]
    Klimovic A, Wang Y, Stuedi P, et al. Pocket: Elastic ephemeral storage for serverless analytics. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). Carlsbad, USA: USENIX Association, 2018, 427–444.
    [29]
    Plank J S, Simmerman S, Schuman C D. Jerasure: A library in C/C++ facilitating erasure coding for storage applications. Knoxville, TN, USA: University of Tennessee, 2007.
    [30]
    Plank J S, Miller E L, Houston W B. GF-Complete: A comprehensive open source library for Galois Field arithmetic. Knoxville, TN, USA: University of Tennessee, 2013.
    [31]
    Xu B, Huang J, Cao Q, et al. TEA: A traffic-efficient erasure-coded archival scheme for In-memory stores. In: Proceedings of the 48th International Conference on Parallel Processing. New York: ACM, 2019, 24.
    [32]
    Li R, Hu Y, Lee P P C. Enabling efficient and reliable transition from replication to erasure coding for clustered file systems. IEEE Transactions on Parallel and Distributed Systems, 2017, 28: 2500–2513. doi: 10.1109/tpds.2017.2678505
    [33]
    Atikoglu B, Xu Y, Frachtenberg E, et al. Workload analysis of a large-scale key-value store. In: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems. New York: ACM, 2012, 53–64.
    [34]
    Li H, Berger D S, Hsu L, et al. Pond: CXL-based memory pooling systems for cloud platforms. In: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2023, 2: 574–587.
    [35]
    Guo Z, Shan Y, Luo X, et al. Clio: A hardware-software co-designed disaggregated memory system. In: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2022: 417–433.
    [36]
    Cai Q, Guo W, Zhang H, et al. Efficient distributed memory management with RDMA and caching. Proceedings of the VLDB Endowment, 2018, 11: 1604–1617. doi: 10.14778/3236187.3236209
    [37]
    Zuo P, Sun J, Yang L, et al. One-sided RDMA-conscious extendible Hashing for disaggregated memory. In: Proceedings of the 2021 USENIX Annual Technical Conference. Berkeley, CA, USA: USENIX, 2021, 15–29.
    [38]
    Ongaro D, Ousterhout J. In search of an understandable consensus algorithm. In: Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference. New York: ACM, 2014, 305–320.
    [39]
    Wang Z, Li T, Wang H, et al. CRaft: An erasure-coding-supported version of raft for reducing storage cost and network cost. In: FAST'20: Proceedings of the 18th USENIX Conference on File and Storage Technologies. New York: ACM, 2020, 297–308.

Figures and Tables

    Figure  1.  Storing an object with three-way replication on Memcached. One main copy (on node3) and two replicas (on node4 and node1) are stored at the same time.
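
As a rough illustration of this write path, the sketch below stores the same KV pair on three Memcached nodes with libmemcached, using one handle per node so that each copy is pinned to a specific server; the host names and key are hypothetical, and ElasticMem's own replication logic differs in detail.

```c
#include <libmemcached/memcached.h>
#include <string.h>

/* Sketch of Figure 1: write one main copy and two replicas of the same KV
 * pair to three different Memcached nodes. Host names are placeholders. */
int main(void) {
    const char *nodes[3] = { "node3", "node4", "node1" }; /* main, replica, replica */
    const char *key = "obj", *val = "value";

    for (int i = 0; i < 3; i++) {
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, nodes[i], 11211);
        memcached_set(memc, key, strlen(key), val, strlen(val),
                      (time_t)0, (uint32_t)0);
        memcached_free(memc);
    }
    return 0;
}
```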

    Figure  2.  RS(k = 2, m = 1) on Memcached. A KV pair is split and encoded into multiple data and parity blocks (i.e., <k, v0>, <k, v1>, and <k, v2>), which are separately stored.
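
A minimal sketch of this split-and-encode step, assuming the m = 1 special case in which the single Reed–Solomon parity block reduces to the XOR of the data blocks (a general (k, m) code would use a library such as Jerasure [29]); the value and block size are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

#define K 2           /* data blocks */
#define BLOCK_SIZE 8  /* bytes per block, illustrative only */

int main(void) {
    const char value[] = "hello___world___";  /* padded to K * BLOCK_SIZE bytes */
    unsigned char data[K][BLOCK_SIZE], parity[BLOCK_SIZE];

    /* Split the value into <k, v0> and <k, v1>. */
    for (int i = 0; i < K; i++)
        memcpy(data[i], value + i * BLOCK_SIZE, BLOCK_SIZE);

    /* Encode the parity block <k, v2>: with m = 1 it is v0 XOR v1. */
    for (int j = 0; j < BLOCK_SIZE; j++)
        parity[j] = data[0][j] ^ data[1][j];

    /* Each of the three blocks would then be stored on a different node,
     * so that any single node failure leaves the value recoverable. */
    printf("%d data blocks + 1 parity block of %d bytes each\n", K, BLOCK_SIZE);
    return 0;
}
```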

    Figure  3.  Transition from two-way replication to EC (k = 3, m = 1). The naive transition process includes replica read, EC encoding, and EC block writes.
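
The following client-side sketch walks through these three steps with libmemcached, again using the m = 1 XOR special case; the host names and the "<key>.bN"/"<key>.p0" block-key naming are hypothetical, and ElasticMem performs the transition inside the storage servers rather than through a client like this.

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "node1", 11211);
    memcached_server_add(memc, "node2", 11211);
    memcached_server_add(memc, "node3", 11211);
    memcached_server_add(memc, "node4", 11211);

    const char *key = "obj";
    size_t vlen; uint32_t flags; memcached_return_t rc;

    /* 1. Replica read: fetch one full copy of the object. */
    char *value = memcached_get(memc, key, strlen(key), &vlen, &flags, &rc);
    if (value == NULL || rc != MEMCACHED_SUCCESS) { memcached_free(memc); return 1; }

    /* 2. EC encoding: zero-pad, split into k = 3 data blocks, and XOR them
     *    into one parity block (the m = 1 special case). */
    const int k = 3;
    size_t blen = (vlen + k - 1) / k;
    char *padded = calloc(k, blen);
    char *parity = calloc(1, blen);
    memcpy(padded, value, vlen);
    for (int i = 0; i < k; i++) {
        const char *block = padded + (size_t)i * blen;
        for (size_t j = 0; j < blen; j++) parity[j] ^= block[j];

        /* 3. EC block write: store each data block under a derived key. */
        char bkey[64];
        snprintf(bkey, sizeof(bkey), "%s.b%d", key, i);
        memcached_set(memc, bkey, strlen(bkey), block, blen, (time_t)0, (uint32_t)0);
    }
    memcached_set(memc, "obj.p0", strlen("obj.p0"), parity, blen, (time_t)0, (uint32_t)0);

    /* 4. Delete the replica that the EC stripe now makes redundant. */
    memcached_delete(memc, key, strlen(key), (time_t)0);

    free(padded); free(parity); free(value); memcached_free(memc);
    return 0;
}
```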

    Figure  4.  New data layout of replicas for two-way replication, which is similar to that of EC(k = 3, m).

    Figure  5.  Overall architecture. Modules identified by solid boxes are introduced to support our design.

    Figure  6.  Transition from EOR(3,3) to EC(3,1). I/O overhead is reduced due to the new data placement of EOR.

    Figure  7.  Consistency problem triggered by writing and reading the same KV concurrently. Wrong data are returned.

    Table  1.  Structure of the work table. IO state is either get or set; context indicates the address of the data to be written or read in local memory; future means the data may not yet be in place, and synchronization is needed until the data arrives.
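
For illustration, a plausible C layout of one work-table entry is sketched below; the field names follow the caption (IO state, context, future), but the exact structure and the synchronization primitive are assumptions, not ElasticMem's actual definition.

```c
#include <stddef.h>
#include <pthread.h>

enum io_state { IO_GET, IO_SET };      /* kind of the in-flight request        */

struct work_entry {
    char            key[251];          /* Memcached keys are at most 250 bytes */
    enum io_state   state;             /* IO state: get or set                 */
    void           *context;           /* local-memory address of the data
                                          being written or read                */
    size_t          length;            /* length of that data                  */
    int             future;            /* nonzero: data may not be in place
                                          yet; readers must synchronize        */
    pthread_mutex_t lock;              /* protects this entry                  */
    pthread_cond_t  arrived;           /* signaled once the data is in place   */
};
```

Under such a scheme, a read that finds a pending set for the same key in the work table can be served directly from context (data bypassing), waiting on arrived while future is set, instead of fetching possibly inconsistent remote blocks.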

    Table  2.  Key management is composed of two key-extension schemes, one for data blocks and one for the others.

    Figure  8.  Normal read and write performance with different redundancy schemes under (k, m, r) = (4, 2, 3).

    Figure  9.  Transition performance under different (k,m,r).

