[1] BOTTOU L. Stochastic gradient learning in neural networks[J]. Proceedings of Neuro-Nimes, 1991, 91(8): 12.
[2] BOTTOU L. Large-scale machine learning with stochastic gradient descent[C]//Proceedings of COMPSTAT'2010. Berlin, Germany: Springer, 2010: 177-186.
[3] RAKHLIN A, SHAMIR O, SRIDHARAN K. Making gradient descent optimal for strongly convex stochastic optimization[C]//Proceedings of the 29th International Conference on Machine Learning. Madison, WI, USA: Omnipress, 2012: 449-456.
[4] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Advances in Neural Information Processing Systems. New York, USA: Curran Associates Inc, 2012: 1097-1105.
[5] DAHL G E, YU D, DENG L, et al. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(1): 30-42.
[6] COLLOBERT R, WESTON J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proceedings of the 25th International Conference on Machine Learning. New York, USA: ACM, 2008: 160-167.
[7] DEAN J, CORRADO G S, MONGA R, et al. Large scale distributed deep networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. New York, USA: Curran Associates Inc, 2012: 1223-1231.
[8] XING E P, HO Q, DAI W, et al. Petuum: A new platform for distributed machine learning on big data[J]. IEEE Transactions on Big Data, 2015, 1(2): 49-67.
[9] ABADI M, AGARWAL A, BARHAM P, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.
[10] LI M, ANDERSEN D G, PARK J W, et al. Scaling distributed machine learning with the parameter server[C]//Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. Berkeley, CA, USA: USENIX Association, 2014: 583-598.
[11] ZHANG S, CHOROMANSKA A E, LECUN Y. Deep learning with elastic averaging SGD[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2015: 685-693.
[12] LIAN X, HUANG Y, LI Y, et al. Asynchronous parallel stochastic gradient for nonconvex optimization[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2015: 2737-2745.
[13] CHEN J, PAN X, MONGA R, et al. Revisiting distributed synchronous SGD[J]. arXiv preprint arXiv:1604.00981, 2016.
[14] TANDON R, LEI Q, DIMAKIS A G, et al. Gradient coding: Avoiding stragglers in distributed learning[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 3368-3376.
[15] HARLAP A, CUI H, DAI W, et al. Addressing the straggler problem for iterative convergent parallel ML[C]//Proceedings of the Seventh ACM Symposium on Cloud Computing. New York, NY, USA: ACM, 2016: 98-111.
[16] MCMAHAN H B, STREETER M. Delay-tolerant algorithms for asynchronous distributed online learning[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2014: 2915-2923.
[17] CHAN W, LANE I. Distributed asynchronous optimization of convolutional neural networks[C]//Proceedings of INTERSPEECH 2014. Singapore: ISCA, 2014.
[18] ZHENG S, MENG Q, WANG T, et al. Asynchronous stochastic gradient descent with delay compensation[C]//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017: 4120-4129.
[19] HO Q, CIPAR J, CUI H, et al. More effective distributed ML via a stale synchronous parallel parameter server[C]//Proceedings of the 26th International Conference on Neural Information Processing Systems. New York, USA: Curran Associates Inc, 2013: 1223-1231.
[20] GUPTA S, ZHANG W, WANG F. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study[C]//Proceedings of the 16th IEEE International Conference on Data Mining. New York, NY, USA: IEEE, 2016: 171-180.
[21] ZHANG W, GUPTA S, LIAN X, et al. Staleness-aware async-SGD for distributed deep learning[C]//Proceedings of the 25th International Joint Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press, 2016: 2350-2356.
[22] BASU S, SAXENA V, PANJA R, et al. Balancing stragglers against staleness in distributed deep learning[C]//Proceedings of the 25th IEEE International Conference on High Performance Computing. New York, NY, USA: IEEE, 2018: 12-21.
[23] BOTTOU L, CURTIS F E, NOCEDAL J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2): 223-311.
[24] DUTTA S, JOSHI G, GHOSH S, et al. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD[J]. arXiv preprint arXiv:1803.01113, 2018.