CLSE: Corpus of Linguistically Significant Entities

Description

The Corpus of Linguistically Significant Entities (CLSE) is a dataset of named entities annotated by expert linguists. It covers 34 languages and 74 semantic types, supporting applications ranging from airline ticketing to video games. The aim of the corpus is to facilitate the creation of more linguistically diverse NLG datasets.

For more details, see the docs/ directory and the paper.

License

The contents of this repository are licensed under CC-BY.

Paper

Make sure to cite the following paper when using this dataset:

@inproceedings{clse2022,
  title={CLSE: Corpus of Linguistically Significant Entities},
  author={Chuklin, Aleksandr and Zhao, Justin and Kale, Mihir},
  booktitle={Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2022) at EMNLP 2022},
  year={2022}
}

https://arxiv.org/abs/2211.02423
