Lemmatization Dataset

The dataset was curated from the resources published at https://www.isical.ac.in/~utpal/resources.php. The raw text comes from Rabindranath Tagore’s short stories and from news articles covering various domains.

Dataset

Each of the dataset files contains words paired with their lemma forms.
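The exact file layout is not described here, so the following is only a minimal sketch of how such word and lemma pairs might be read. It assumes one pair per line with the word and its lemma separated by a tab; the function name `parse_word_lemma_pairs`, the `delimiter` parameter, and the sample strings are hypothetical and should be adapted to the actual files.

```python
from typing import Iterable, List, Tuple


def parse_word_lemma_pairs(
    lines: Iterable[str], delimiter: str = "\t"
) -> List[Tuple[str, str]]:
    """Parse lines of the ASSUMED form '<word><delimiter><lemma>' into pairs."""
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(delimiter)
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed two-column layout
        word, lemma = parts[0].strip(), parts[1].strip()
        pairs.append((word, lemma))
    return pairs


# Placeholder lines standing in for real dataset content (hypothetical).
sample_lines = ["word1\tlemma1", "", "word2\tlemma2"]
pairs = parse_word_lemma_pairs(sample_lines)
```

To read a real file, pass an open file handle instead of a list, for example `open(path, encoding="utf-8")`, since Bangla text needs an explicit UTF-8 encoding on some platforms. If the files use a different separator, change `delimiter` accordingly.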

Licensing

The original dataset does not provide any license information.

Citation

Please cite the following papers if you use the data:

@article{alam2021review,
  title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal={arXiv preprint arXiv:2107.03844},
  year={2021}
}

@inproceedings{chakrabarty-etal-2017-context,
  title={Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks},
  author={Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
  address={Vancouver, Canada},
  publisher={Association for Computational Linguistics},
  pages={1481--1491},
  doi={10.18653/v1/P17-1136},
  url={https://www.aclweb.org/anthology/P17-1136},
  year={2017}
}