MasakhaNER: Named Entity Recognition for African Languages

We take a step towards addressing the underrepresentation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders.

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent.

A survey on machine learning techniques in movie revenue prediction

With the growing body of literature on movie revenue prediction using machine learning techniques in recent years, a systematic review will help strengthen the understanding of this research domain. Therefore, this article aims to determine the sources of data, the techniques, the features, and the evaluation metrics used in movie revenue prediction.

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society.

