ISSN 0253-2778

CN 34-1054/N

Open Access JUSTC Original Paper

Corpus selection for Uyghur-Chinese machine translation based on bilingual sentence coverage

Cite this:
https://doi.org/10.3969/j.issn.0253-2778.2017.04.001
  • Received Date: 01 March 2016
  • Rev Recd Date: 17 September 2016
  • Publish Date: 30 April 2017
  • Corpus selection must account for redundancy not only at the vocabulary level but also at the sentence level. Existing methods mainly select corpora by vocabulary-level coverage. They effectively reduce the redundancy of words and phrases but do not take sentence-level coverage into account. To select a smaller training corpus from a large-scale bilingual corpus while obtaining a translation system as good as or better than one trained on the full data, the corpus was selected mainly by sentence coverage, combining the unseen n-grams method with edit distance. Experimental results show that the proposed method uses less training data yet achieves performance almost equivalent to that of the original training corpus.
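
    The abstract names two ingredients, unseen n-grams and edit distance, without giving the selection procedure. The following is a minimal sketch of one way the two could be combined, not the authors' algorithm. The function names, the whitespace tokenization, the greedy single pass, the source-side-only near-duplicate check, and the `max_n` and `min_dist` parameters are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Set[Tuple[str, ...]]:
    """All n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def select_pairs(pairs: List[Tuple[str, str]],
                 max_n: int = 2,
                 min_dist: float = 0.3) -> List[Tuple[str, str]]:
    """Greedy selection (assumed scheme): keep a sentence pair only if it
    adds an unseen n-gram on either side AND its source sentence is not a
    near-duplicate of an already selected source sentence."""
    seen_src: Set[Tuple[str, ...]] = set()
    seen_tgt: Set[Tuple[str, ...]] = set()
    selected: List[Tuple[str, str]] = []
    selected_src_tokens: List[List[str]] = []

    for src, tgt in pairs:
        s, t = src.split(), tgt.split()  # assumption: pre-segmented text
        src_grams, tgt_grams = ngrams(s, max_n), ngrams(t, max_n)

        # Vocabulary-level coverage: skip pairs that add nothing new.
        if not (src_grams - seen_src) and not (tgt_grams - seen_tgt):
            continue

        # Sentence-level coverage: skip near-duplicates of selected sources,
        # using edit distance normalized by the longer sentence length.
        if any(edit_distance(s, k) / max(len(s), len(k), 1) < min_dist
               for k in selected_src_tokens):
            continue

        selected.append((src, tgt))
        selected_src_tokens.append(s)
        seen_src |= src_grams
        seen_tgt |= tgt_grams
    return selected
```

    The near-duplicate check compares each candidate against every selected sentence, so this sketch is quadratic in the size of the selected set; a large corpus would need an index or a candidate filter before the edit-distance step.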