The Bhojpuri LT Resources (BHLTR) project was initiated by me (Atul) at Jawaharlal Nehru University (JNU), New Delhi, during my doctoral research. The BHLTR data contains monolingual, parallel (English-Bhojpuri), and part-of-speech (POS) annotated monolingual corpora. The POS annotation follows the Bureau of Indian Standards (BIS) POS tagset.
bho-resources/
├─ mono-bho-corpus/
│  ├─ monolingual.bho
│  ├─ README.md
│  ├─ pos-annotated/
│  │  └─ pos-tagged.bho
│  └─ treebank/
│     └─ README.md
├─ parallel-corpora/
│  ├─ README.md
│  └─ eng-bho/
│     ├─ eng-bho.en
│     └─ eng-bho.bho
├─ additional-resources.md
├─ license.md
├─ README.md
└─ README.txt
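The parallel corpus can be read as sentence pairs with a few lines of Python. The following is a minimal sketch, not part of the BHLTR distribution: it assumes that `eng-bho.en` and `eng-bho.bho` are UTF-8 text files with one sentence per line and that line N of one file corresponds to line N of the other. The function name `read_parallel` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def read_parallel(src_path: str, tgt_path: str) -> List[Tuple[str, str]]:
    """Return (source, target) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8 and aligned line by line.
    """
    with open(src_path, encoding="utf-8") as f:
        src = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt = [line.rstrip("\n") for line in f]

    # Refuse to pair sentences if the files are not the same length,
    # since a silent truncation would misalign the corpus.
    if len(src) != len(tgt):
        raise ValueError(
            f"Line count mismatch: {len(src)} source vs {len(tgt)} target lines"
        )
    return list(zip(src, tgt))


if __name__ == "__main__":
    # Paths follow the directory layout shown above.
    base = Path("bho-resources/parallel-corpora/eng-bho")
    en_file, bho_file = base / "eng-bho.en", base / "eng-bho.bho"
    if en_file.exists() and bho_file.exists():
        pairs = read_parallel(str(en_file), str(bho_file))
        print(f"Loaded {len(pairs)} sentence pairs")
```

If the line counts differ, the function raises an error instead of truncating, so an alignment problem is noticed before the pairs are used for training or evaluation.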
I would like to thank my doctoral supervisor, Prof. Girish Nath Jha, and the Sanskrit Computational Lab, JNU, New Delhi.
If you use this data, please cite:
@article{ojha2019english,
  title={English-Bhojpuri SMT System: Insights from the Karaka Model},
  author={Ojha, Atul Kr},
  journal={arXiv preprint arXiv:1905.02239},
  year={2019}
}
Other papers and references about BHLTR:
@inproceedings{karakanta2019proceedings,
  title={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
  author={Karakanta, Alina and Ojha, Atul Kr and Liu, Chao-Hong and Washington, Jonathan and Oco, Nathaniel and Lakew, Surafel Melaku and Malykh, Valentin and Zhao, Xiaobing},
  booktitle={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
  year={2019}
}

@article{kumar2018automatic,
  title={Automatic identification of closely-related Indian languages: Resources and experiments},
  author={Kumar, Ritesh and Lahiri, Bornini and Alok, Deepak and Ojha, Atul Kr and Jain, Mayank and Basit, Abdul and Dawer, Yogesh},
  journal={arXiv preprint arXiv:1803.09405},
  year={2018}
}

@inproceedings{ojha2015training,
  title={Training \& evaluation of POS taggers in Indo-Aryan languages: a case of Hindi, Odia and Bhojpuri},
  author={Ojha, Atul Kr. and Behera, Pitambar and Singh, Srishti and Jha, Girish N},
  booktitle={the proceedings of 7th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics},
  pages={524--529},
  year={2015}
}
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: BHLTR v1.0
License: CC BY-NC-SA 4.0
Includes text: yes
Contributors: Ojha, Atul Kr.
Copyright (©) holder: Ojha, Atul Kr.
Contact: [email protected]
===============================================================================