Efficient Pre-training of Masked Language Model via Concept-based Curriculum Masking

Mingyu Lee, Jun Hyung Park, Junho Kim, Kang Min Kim, Sang Keun Lee

Research output: Contribution to conference › Paper › peer-review


Abstract

Masked language modeling (MLM) has been widely used for pre-training effective bidirectional representations, but incurs substantial training costs. In this paper, we propose a novel concept-based curriculum masking (CCM) method to efficiently pre-train a language model. CCM differs from existing curriculum learning approaches in two key ways that reflect the nature of MLM. First, we introduce a carefully designed linguistic difficulty criterion that evaluates the MLM difficulty of each token. Second, we construct a curriculum that gradually masks words related to the previously masked words by retrieving related concepts from a knowledge graph. Experimental results show that CCM significantly improves pre-training efficiency. Specifically, the model trained with CCM achieves performance comparable to the original BERT on the General Language Understanding Evaluation (GLUE) benchmark at half the training cost. Code is available at https://github.com/KoreaMGLEE/Concept-based-curriculum-masking.
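The abstract describes the curriculum only at a high level, so the following is a minimal, hypothetical sketch of the core idea: a set of currently maskable concepts is expanded stage by stage with knowledge-graph neighbors, and only tokens in the active set are eligible for masking. The toy knowledge graph, tokenization, and function names below are illustrative assumptions, not the authors' released implementation; see the linked repository for that.

```python
import random

# Toy knowledge graph: each concept maps to related concepts.
# (Illustrative stand-in; the paper retrieves relations from a real knowledge graph.)
KNOWLEDGE_GRAPH = {
    "dog": ["animal", "bark"],
    "animal": ["dog", "cat"],
    "cat": ["animal", "fur"],
    "bark": ["dog", "sound"],
}

def expand_concepts(active, graph):
    """Grow the active concept set by one hop in the knowledge graph."""
    expanded = set(active)
    for concept in active:
        expanded.update(graph.get(concept, []))
    return expanded

def curriculum_mask(tokens, active_concepts, mask_prob=0.15, mask_token="[MASK]"):
    """Mask only tokens whose concept is currently part of the curriculum."""
    masked = []
    for tok in tokens:
        if tok in active_concepts and random.random() < mask_prob:
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked

# Start from a small seed of "easy" concepts and expand the maskable set each stage.
active = {"dog"}
sentence = "the dog saw a cat and started to bark".split()
for stage in range(3):
    print(f"stage {stage}: maskable = {sorted(active)}")
    print("  " + " ".join(curriculum_mask(sentence, active, mask_prob=0.5)))
    active = expand_concepts(active, KNOWLEDGE_GRAPH)
```

In this sketch, early stages expose the model only to masks over a small seed vocabulary, and later stages broaden masking to knowledge-graph neighbors of previously masked words, mirroring the easy-to-hard progression the abstract describes.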

Original language: English
Pages: 7417-7427
Number of pages: 11
DOIs
State: Published - 2022
Event: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: 7 Dec 2022 - 11 Dec 2022

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 7/12/22 - 11/12/22

Bibliographical note

Publisher Copyright:
© 2022 Association for Computational Linguistics.
