Skip to the content.

Authorship Attribution Dataset

The authorship attribution dataset curated from the work of Bangla Text Classification using Transformers. The dataset is originally hosted on Mendeley: https://data.mendeley.com/datasets/6d9jrkgtvv/2.

Dataset

The dataset contains writings of 14 different authors from an online Bangla e-library (e.g., novels, story, series, etc.).

Directory Structure:

In the authorship.tar.gz file is a zip version of train, dev, and test set of combined dataset.

To unzip the file, use the following command:

tar -xvzf authorship.tar.gz

Licensing

The original dataset is licensed under CC BY 4.0. The data split version is licensed under MIT license.

Citation

Please cite the following papers if you are using the data:

@article{alam2020bangla,
  title={Bangla Text Classification using Transformers},
  author={Alam, Tanvirul and Khan, Akib and Alam, Firoj},
  journal={arXiv preprint arXiv:2011.04446},
  year={2020}
}
@inproceedings{khatun2019authorship,
 author = {Khatun, Aisha and Rahman, Anisur and Islam, Md Saiful and others},
 booktitle = {2019 22nd International Conference on Computer and Information Technology (ICCIT)},
 organization = {IEEE},
 pages = {1--5},
 title = {Authorship Attribution in Bangla literature using Character-level {CNN}},
 year = {2019}
}