Unsupervised identification of Malay domain multiword expressions

WANG Lin; LIU Wuying

doi:10.3969/j.issn.0253-2778.2019.07.001

Open Access JUSTC

1. Xianda College of Economics and Humanities, Shanghai International Studies University, Shanghai 200083, China
2. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, China

https://doi.org/10.3969/j.issn.0253-2778.2019.07.001

Received Date: 15 June 2018
Revision Received Date: 18 September 2018
Publish Date: 31 July 2019

Abstract

Multiword expressions (MWEs) are an optimal granularity of language reuse. However, the lack of explicit formal boundaries between MWEs and other words poses a serious problem for the automatic identification of MWEs in some less common languages. We address the identification of Malay domain MWEs and propose an unsupervised extraction and clustering algorithm based on natural annotations. The algorithm first applies a binary classification to each space character to extract Malay MWEs of varying length, then transfers natural document-level category annotations to the MWE level to cluster the extracted MWEs, and finally distills several domain datasets of MWEs after filtering out general MWEs. Experimental results on a Malay dataset of 272,783 text documents show that the algorithm extracts MWEs precisely and dispatches them into domain lexicons efficiently.

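The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how each space is classified or how general MWEs are filtered, so the sketch assumes a raw bigram-frequency threshold (`min_count`) for the space classifier and a dominant-category share (`min_share`) for the general/domain split. All function names, parameters, and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def extract_mwes(tokens, bigram_counts, min_count=2):
    """Step 1: classify every space as 'join' or 'split'.

    Assumption: a space joins its two neighbours when their bigram occurs
    at least `min_count` times in the corpus. Runs of joined words (two or
    more) are returned as MWEs, so MWE length is not fixed in advance.
    """
    mwes = []
    if not tokens:
        return mwes
    current = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        if bigram_counts[(left, right)] >= min_count:
            current.append(right)      # space classified as MWE-internal
        else:
            if len(current) > 1:       # space classified as a boundary
                mwes.append(" ".join(current))
            current = [right]
    if len(current) > 1:
        mwes.append(" ".join(current))
    return mwes


def build_domain_lexicons(docs, min_count=2, min_share=0.8):
    """Steps 2 and 3: transfer document categories to MWEs, then filter.

    `docs` is a list of (category, text) pairs, where the category is the
    natural annotation already attached to each document.
    """
    tokenized = [(cat, text.lower().split()) for cat, text in docs]

    bigram_counts = Counter()
    for _, tokens in tokenized:
        bigram_counts.update(zip(tokens, tokens[1:]))

    # Step 2: each MWE inherits the category of every document it occurs in.
    category_votes = defaultdict(Counter)
    for cat, tokens in tokenized:
        for mwe in extract_mwes(tokens, bigram_counts, min_count):
            category_votes[mwe][cat] += 1

    # Step 3: MWEs without a dominant category are treated as general.
    lexicons, general = defaultdict(set), set()
    for mwe, votes in category_votes.items():
        top_cat, top_count = votes.most_common(1)[0]
        if top_count / sum(votes.values()) >= min_share:
            lexicons[top_cat].add(mwe)
        else:
            general.add(mwe)
    return dict(lexicons), general


# Toy corpus (invented for illustration, not taken from the paper's dataset).
SAMPLE_DOCS = [
    ("ekonomi", "bank negara menaikkan kadar faedah"),
    ("ekonomi", "kadar faedah bank negara kekal"),
    ("sukan", "pasukan itu menang piala dunia"),
    ("sukan", "piala dunia bermula esok"),
    ("ekonomi", "perdana menteri umum bajet"),
    ("sukan", "perdana menteri sambut pasukan"),
]

lexicons, general = build_domain_lexicons(SAMPLE_DOCS)
```

On this toy corpus, "perdana menteri" appears equally in both categories and is therefore filtered out as a general MWE, while the remaining expressions land in a single domain lexicon each. A real system would replace the frequency threshold with a stronger association measure or learned classifier.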