google-research-datasets/clse

CLSE: Corpus of Linguistically Significant Entities

Description

The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by expert linguists. It includes 34 languages and covers 74 semantic types, supporting applications that range from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.

For more details, see the docs/ directory and the paper.

License

The contents of this repository are licensed under CC-BY.

Paper

Make sure to cite the following paper when using this dataset:

@inproceedings{clse2022,
  title={CLSE: Corpus of Linguistically Significant Entities},
  author={Chuklin, Aleksandr and Zhao, Justin and Kale, Mihir},
  booktitle={Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2022) at EMNLP 2022},
  year={2022}
}

https://arxiv.org/abs/2211.02423
