Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations

Hyunjae Kim; Jaehyo Yoo; Seunghyun Yoon; Jaewoo Kang

Main: Information Extraction Main-poster Paper

Poster Session 3: Information Extraction (Poster)

Conference Room: Frontenac Ballroom and Queen's Quay

Conference Time: July 11, 09:00-10:30 (EDT) (America/Toronto)

Global Time: July 11, Poster Session 3 (13:00-14:30 UTC)

Keywords: named entity recognition and relation extraction

Abstract: Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in the many domains where such dictionaries do not exist. A recent study used a phrase retrieval model to automatically construct pseudo-dictionaries from entities retrieved from Wikipedia, but these dictionaries often have limited coverage because the retriever tends to return popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, because high-coverage dictionaries produce noisier weak labels, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce false positives. We demonstrate that HighGEN outperforms the previous best model by an average of 4.7 F1 points across five NER benchmark datasets.
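
The verification idea described in the abstract can be illustrated with a minimal sketch. This is not HighGEN's implementation: the abstract does not say how mention or type embeddings are computed or which distance and threshold are used, so the cosine distance, the `max_distance` value, the toy 2-dimensional vectors, and the function names `cosine_distance` and `verify_mentions` are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def verify_mentions(candidates, type_embeddings, max_distance=0.5):
    """Keep a weak label only if the mention lies close to its entity type.

    candidates: list of (mention, entity_type, mention_embedding) tuples
                produced by dictionary matching (hypothetical format).
    type_embeddings: dict mapping entity type -> embedding (hypothetical).
    max_distance: assumed threshold; not taken from the paper.
    """
    kept = []
    for mention, entity_type, embedding in candidates:
        distance = cosine_distance(embedding, type_embeddings[entity_type])
        if distance <= max_distance:
            kept.append((mention, entity_type))
    return kept


# Toy data: hand-picked vectors, not real phrase representations.
type_embeddings = {"DISEASE": [1.0, 0.0], "LOCATION": [0.0, 1.0]}
candidates = [
    ("influenza", "DISEASE", [0.9, 0.1]),   # close to DISEASE: kept
    ("Paris", "DISEASE", [0.1, 0.9]),       # far from DISEASE: dropped
    ("Toronto", "LOCATION", [0.2, 0.8]),    # close to LOCATION: kept
]

verified = verify_mentions(candidates, type_embeddings)
print(verified)
```

In this toy run, the false-positive label ("Paris" tagged as DISEASE) is filtered out because its embedding is far from the DISEASE type embedding, while the two plausible labels are retained. A stricter `max_distance` would trade coverage for precision.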