We will be accepting submissions until the end of the evaluation phase of the VarDial Evaluation Campaign 2021 on January 28, 2021. Participants who submit results by this date will be invited to submit a system description paper to appear in the proceedings of VarDial 2021.
| Rank | Team name | Link to paper | Method | Relevant macro F1 |
|------|-----------|---------------|--------|-------------------|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.8138 |
| 3 | Phlyers | | Naive Bayes classifier trained on character 3-grams and 4-grams | 0.7584 |
| 4 | Phlyers | | SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) | 0.4727 |
| 5 | NRC | VarDial2020 | Deep neural network with adaptation to the test set | 0.2996 |
| 6 | NRC | VarDial2020 | Ensemble of 6 deep neural networks | 0.2872 |
| 7 | NRC | VarDial2020 | Deep neural network | 0.2514 |
| Rank | Team name | Link to paper | Method | Relevant micro F1 |
|------|-----------|---------------|--------|-------------------|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.9668 |
| 3 | Phlyers | | SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) | 0.7793 |
| 4 | NRC | VarDial2020 | Ensemble of 6 deep neural networks | 0.2596 |
| 5 | NRC | VarDial2020 | Deep neural network with adaptation to the test set | 0.1547 |
| 6 | NRC | VarDial2020 | Deep neural network | 0.1359 |
| Rank | Team name | Link to paper | Method | Macro F1 |
|------|-----------|---------------|--------|----------|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.9079 |
| 2 | Phlyers | | Naive Bayes classifier trained on character 3-grams and 4-grams | 0.8753 |
| 4 | NRC | VarDial2020 | Deep neural network with adaptation to the test set | 0.6751 |
| 5 | NRC | VarDial2020 | Deep neural network | 0.6628 |
| 6 | NRC | VarDial2020 | Ensemble of 6 deep neural networks | 0.6356 |
As training data for the relevant languages, we use the Wanca 2016 corpus. In total, the corpus contains 646,043 unique sentences, ranging from 19 sentences for Kemi Sami to 214,225 sentences for Northern Sami. The source version of the corpus can be downloaded from urn:nbn:fi:lb-2020022902. The test data includes new sentences from the as yet unpublished Wanca 2017 corpus and will be provided to the participants by the task organizers at the beginning of the evaluation period. Not all of the 29 relevant languages in the training set are attested in the test set: the distribution of languages in the test set is close to the actual distribution of new sentences in the forthcoming Wanca 2017 corpus.
In addition to the relevant languages, the test set includes sentences in 149 other languages. The three largest Uralic languages have been included in this category. The download links for the training data for these non-relevant languages are distributed by the task organizers only to participating teams. In total, the training data for this task consists of 63,772,445 sentences in non-relevant languages and 646,043 sentences in relevant languages, totaling 64,418,488 sentences.
The training data for both the relevant and the non-relevant languages must be considered noisy; for example, some sentences will be incorrectly labeled (though not intentionally). The Wanca 2016 corpus includes an HTTP address for each sentence, and the form of these addresses themselves can be used in the task as well. For example, our current pipeline allows only one of two close languages to be found on the same page, and this kind of information can be used to clean the corpora if the participants deem it helpful.
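One hedged sketch of such a URL-based cleaning pass: group sentences by their source page and flag pages whose sentences carry labels from a pair of easily confused languages. The `(sentence, url, label)` row layout and the notion of a participant-chosen `close_pairs` list are our own assumptions for illustration, not part of the released data format.

```python
from collections import defaultdict

def suspicious_pages(rows, close_pairs):
    """Return URLs whose sentences mix labels from a confusable pair.

    rows: iterable of (sentence, url, label) tuples, an assumed layout
    since the Wanca release pairs each sentence with its HTTP address.
    close_pairs: 2-element frozensets of language codes that the
    participant considers easily confused.
    """
    labels_by_url = defaultdict(set)
    for _sentence, url, label in rows:
        labels_by_url[url].add(label)
    # A page is suspicious when both members of a confusable pair
    # appear among its sentence labels.
    return {
        url for url, labels in labels_by_url.items()
        if any(pair <= labels for pair in close_pairs)
    }

rows = [
    ("sentence one", "http://example.com/p1", "lang_a"),
    ("sentence two", "http://example.com/p1", "lang_b"),
    ("sentence three", "http://example.com/p2", "lang_a"),
]
print(suspicious_pages(rows, [frozenset({"lang_a", "lang_b"})]))
```

Flagged pages could then be re-checked or dropped, depending on how aggressive a cleaning strategy the participant prefers.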
The shared task is divided into three different tracks. All of the tracks are closed, so no data or models other than the 64,418,488 sentences in the training set can be used for training. All the tracks use the same training data.
The first track of the shared task considers all the relevant languages equal in value, and the aim is to maximize their average F-score. This is important when one is interested in finding even the very rare languages among the relevant set. The score is calculated as a macro-F1 over the relevant languages in the training set. For example, if you predict a relevant language for test sentences when that language does not occur in the test set at all, your precision, and thus your F1-score, for that language is zero. The result is the average of the F1-scores of all 29 relevant languages.
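As an illustration of this scoring, here is a minimal macro-F1 computation in plain Python. The function names, the example labels, and the exact tie-breaking for absent labels are our own assumptions; the official scorer may differ in detail.

```python
def per_label_f1(gold, pred, label):
    """One-vs-rest F1 for a single language over sentence-level labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # covers both spurious predictions and absent labels
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred, labels):
    """Average of per-label F1 over a fixed label set (Track 1 style)."""
    return sum(per_label_f1(gold, pred, l) for l in labels) / len(labels)

# Illustrative labels only: "sjk" is predicted but never occurs in the
# gold labels, so its F1 is zero and it drags the macro average down.
gold = ["sme", "sme", "fin", "deu"]
pred = ["sme", "fin", "fin", "sjk"]
print(round(macro_f1(gold, pred, ["sme", "fin", "sjk"]), 4))  # 0.4444
```

Because every relevant language contributes equally to the average, a single wrongly predicted rare language costs as much as a wrongly predicted frequent one.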
The second track considers every sentence in the test set that is written in, or predicted to be in, a relevant language as equal. Compared to the first track, this track gives less weight to the very rare languages, as their precision matters less when the resulting F-score is calculated. The score is calculated as a micro-F1 over the test sentences that are in a relevant language, together with those that you predicted to be in a relevant language.
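A sketch of this micro-averaged variant, under the same caveats as above (our own function names and example labels; details of the official scorer may differ): counts are pooled over all sentences whose gold or predicted label is relevant, so frequent languages dominate the score.

```python
def micro_f1(gold, pred, relevant):
    """Micro-F1 restricted to a relevant label set (Track 2 style)."""
    rel = set(relevant)
    # Pooled counts: a gold-relevant sentence predicted as a different
    # relevant language counts as both a false positive and a false negative.
    tp = sum(1 for g, p in zip(gold, pred) if g in rel and g == p)
    fp = sum(1 for g, p in zip(gold, pred) if p in rel and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g in rel and g != p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels only ("deu" stands in for a non-relevant language).
gold = ["sme", "fin", "deu", "fin"]
pred = ["sme", "deu", "fin", "fin"]
print(round(micro_f1(gold, pred, ["sme", "fin"]), 4))  # 0.6667
```

Under this pooling, misclassifying one sentence of a rare language barely moves the score, which is exactly the contrast with Track 1 described above.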
In the first two tracks, no distinction is made between the non-relevant languages when the F1-scores are calculated. The third track, however, does not concentrate on the 29 relevant languages; instead, the target is to maximize the average F-score over all 178 languages present in the training set. This will be the language identification (LI) shared task with the largest number of languages to date (ALTW 2010 included 74 languages). The score is calculated as a macro-F1 over all the languages in the training set.
The training set contains sentences in the 178 languages below.