ULI SHARED TASK

We will be accepting submissions until the end of the evaluation phase of the VarDial Evaluation Campaign 2021 on January 28, 2021. Participants who submit results by this date will be invited to submit a system description paper to appear in the proceedings of VarDial 2021.

Current Top Results

ULI-RLE

| Rank | Team name | Link to paper | Method | Relevant macro F1 |
|---|---|---|---|---|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.8138 |
| 2 | SUKI | baseline | HeLI | 0.8004 |
| 3 | Phlyers | | Naive Bayes classifier trained on character 3grams and 4grams | 0.7584 |
| 4 | Phlyers | | SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) | 0.4727 |
| 5 | NRC | VarDial2020 | deep neural network with adaptation to the test set | 0.2996 |
| 6 | NRC | VarDial2020 | ensemble of 6 deep neural networks | 0.2872 |
| 7 | NRC | VarDial2020 | deep neural network | 0.2514 |

ULI-RSS

| Rank | Team name | Link to paper | Method | Relevant micro F1 |
|---|---|---|---|---|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.9668 |
| 2 | SUKI | baseline | HeLI | 0.9632 |
| 3 | Phlyers | | SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) | 0.7793 |
| 4 | NRC | VarDial2020 | ensemble of 6 deep neural networks | 0.2596 |
| 5 | NRC | VarDial2020 | deep neural network with adaptation to the test set | 0.1547 |
| 6 | NRC | VarDial2020 | deep neural network | 0.1359 |

ULI-178

| Rank | Team name | Link to paper | Method | Macro F1 |
|---|---|---|---|---|
| 1 | NRC | Method description | Probabilistic classifier (similar to Naive Bayes) using character 5-grams | 0.9079 |
| 2 | Phlyers | | Naive Bayes classifier trained on character 3grams and 4grams | 0.8753 |
| 3 | SUKI | baseline | HeLI | 0.8004 |
| 4 | NRC | VarDial2020 | deep neural network with adaptation to the test set | 0.6751 |
| 5 | NRC | VarDial2020 | deep neural network | 0.6628 |
| 6 | NRC | VarDial2020 | ensemble of 6 deep neural networks | 0.6356 |

Training and testing

Read the task descriptions below. You can download the training data from here and the testing data from here. If you have any questions, or when you wish to have your results evaluated, contact the first author of this article.

Task description

As training data for the relevant languages, we use the Wanca 2016 corpus. In total, the corpus contains 646,043 unique sentences, ranging from 19 sentences of Kemi Sami to 214,225 sentences of Northern Sami. The source version of the corpus can be downloaded from urn:nbn:fi:lb-2020022902. The test data includes new sentences from the as yet unpublished Wanca 2017 corpus and will be provided to the participants by the task organizers at the beginning of the evaluation period. Not all of the 29 relevant languages in the training set are attested in the test set: the distribution of languages in the test set is close to the actual distribution of new sentences in the forthcoming Wanca 2017 corpus.

In addition to the relevant languages, the test set includes sentences in 149 other languages. The three largest Uralic languages have been included in this category. The download links for the training data for these non-relevant languages are distributed by the task organizers only to participating teams. In total, the training data for this task consists of 63,772,445 sentences in non-relevant and 646,043 sentences in relevant languages, totaling 64,418,488 sentences.

The training data for both the relevant and the non-relevant languages must be considered noisy: some sentences will be incorrectly labeled (though not intentionally). The Wanca 2016 corpus includes an HTTP address for each sentence, and the form of these addresses can also be used in the task. For example, our current pipeline allows only one of two close languages to be found on the same page, and participants may use this kind of information to clean the corpora if they find it helpful.
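As one possible illustration of page-level cleaning, the sketch below groups sentences by their address and flags those whose label is a small minority on its page. This is not the organizers' pipeline: the `(url, lang, sentence)` tuple layout, the function name `flag_page_minorities`, the `min_share` threshold, and the placeholder labels `rel_a` and `rel_b` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def flag_page_minorities(rows, min_share=0.2):
    """Return rows whose label covers less than min_share of its page.

    rows: list of (url, lang, sentence) tuples. This layout is a
    hypothetical one chosen for the example, not the corpus format.
    """
    # Count how many sentences of each label every page contains.
    per_page = defaultdict(Counter)
    for url, lang, _ in rows:
        per_page[url][lang] += 1

    flagged = []
    for url, lang, sentence in rows:
        total = sum(per_page[url].values())
        # A label that is rare on its own page is a candidate labeling error.
        if per_page[url][lang] / total < min_share:
            flagged.append((url, lang, sentence))
    return flagged


# Placeholder data: nine sentences of one label and one stray label on
# the first page, a single sentence on the second page.
rows = [("http://example.org/p1", "rel_a", f"s{i}") for i in range(9)]
rows.append(("http://example.org/p1", "rel_b", "s9"))
rows.append(("http://example.org/p2", "rel_b", "s10"))
suspicious = flag_page_minorities(rows)
```

Flagged sentences are only candidates for removal or relabeling; whether such filtering helps would have to be checked on held-out data.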

The shared task is divided into three tracks. All of the tracks are closed, so no data or models other than the 64,418,488 sentences in the training set can be used for training. All the tracks use the same training data.

Track 1: ULI-RLE (Relevant languages as equals)

The first track of the shared task considers all the relevant languages equal in value, and the aim is to maximize their average F-score. This is important when one is also interested in finding the very rare languages included in the set of relevant languages. The F-score is calculated as a macro-F1 score over the relevant languages in the training set. For example, if you predict a relevant language that does not occur in the test set at all, your precision, and thus your F1-score, for that language goes to zero. The result is the average of the F1-scores of all the 29 relevant languages.
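The macro-F1 over relevant languages can be sketched as follows. This is an illustration, not the official evaluation script. The function name, the placeholder labels `rel_a`, `rel_b`, and `rel_c`, and the `empty_score` parameter are assumptions. In particular, the text does not say how a relevant language with no gold and no predicted sentences is scored; the default of 1.0 is a guess suggested by the statement that a false prediction makes the score go to zero.

```python
from collections import Counter


def relevant_macro_f1(gold, pred, relevant, empty_score=1.0):
    """Average per-language F1 over the relevant languages only."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = []
    for lang in relevant:
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        # Assumption: a language that is neither in the gold labels nor
        # predicted gets empty_score (not specified in the task text).
        scores.append(2 * tp[lang] / denom if denom else empty_score)
    return sum(scores) / len(scores)


# Placeholder labels; "fin" stands for a non-relevant language.
gold = ["rel_a", "rel_a", "fin", "fin"]
pred = ["rel_a", "fin", "fin", "rel_b"]
score = relevant_macro_f1(gold, pred, ["rel_a", "rel_b", "rel_c"])
```

In this toy run, `rel_b` is predicted once although it never occurs in the gold labels, so its F1 is zero and it pulls the average down regardless of how many sentences are involved.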

Track 2: ULI-RSS (Relevant sentences as equals)

The second track treats all sentences in the test set that are written in, or are predicted to be in, a relevant language as equals. Compared to the first track, this track gives less importance to the very rare languages, as their precision matters less when the resulting F-score is calculated. The F-score is calculated as a micro-F1 over the sentences in the test set that are in relevant languages together with those that you have predicted to be in relevant languages.
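One reading of this micro-F1 is sketched below: true positives, false positives, and false negatives are pooled over all relevant languages before a single F1 is computed. The function name and the placeholder labels are assumptions, and the official scorer may treat edge cases differently (for example, a sentence whose gold and predicted labels are two different relevant languages counts here as both a false positive and a false negative).

```python
def relevant_micro_f1(gold, pred, relevant):
    """Micro-averaged F1 pooled over the relevant languages."""
    relevant = set(relevant)
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g in relevant:
                tp += 1  # correct prediction of a relevant language
        else:
            if p in relevant:
                fp += 1  # wrongly predicted a relevant language
            if g in relevant:
                fn += 1  # missed a sentence in a relevant language
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder labels; "fin" stands for a non-relevant language.
gold = ["rel_a", "rel_a", "fin", "fin"]
pred = ["rel_a", "fin", "fin", "rel_b"]
score = relevant_micro_f1(gold, pred, ["rel_a", "rel_b", "rel_c"])
```

Because counts are pooled, a single false prediction of a rare language costs one false positive, the same as an error on a frequent language.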

Track 3: ULI-178 (All 178 languages as equals)

In the first two tracks, there is no difference between the non-relevant languages when the F1-scores are calculated. The third track, however, does not focus specifically on the 29 relevant languages; instead, the target is to maximize the average F-score over all the 178 languages present in the training set. This track will be the language identification (LI) shared task with the largest number of languages to date (ALTW 2010 included 74 languages). The F-score is calculated as a macro-F1 score over all the languages in the training set.

Languages

The training set contains sentences in the 178 languages below.

The 29 relevant languages are:

The 149 non-relevant languages are:

Afrikaans (afr), Tosk Albanian (als), Amharic (amh), Arabic (ara), Assamese (asm), North Azerbaijani (azj), Bashkir (bak), Bavarian (bar), Central Bikol (bcl), Belarusian (bel), Bengali (ben), Bosnian (bos), Bishnupriya (bpy), Breton (bre), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Czech (ces), Chechen (che), Chuvash (chv), Mandarin Chinese (cmn), Corsican (cos), Welsh (cym), Danish (dan), German (deu), Dimli (diq), Dhivehi (div), Standard Estonian (ekk), Modern Greek (ell), English (eng), Esperanto (epo), Basque (eus), Extremaduran (ext), Faroese (fao), Finnish (fin), French (fra), Western Frisian (fry), Irish (gle), Galician (glg), Manx (glv), Goan Konkani (gom), Guarani (grn), Swiss German (gsw), Gujarati (guj), Haitian (hat), Hebrew (heb), Fiji Hindi (hif), Hindi (hin), Croatian (hrv), Upper Sorbian (hsb), Hungarian (hun), Ido (ido), Iloko (ilo), Interlingua (ina), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kalaallisut (kal), Kannada (kan), Georgian (kat), Kazakh (kaz), Kirghiz (kir), Korean (kor), Karachay-Balkar (krc), Kölsch (ksh), Latin (lat), Latvian (lav), Limburgan (lim), Lithuanian (lit), Lombard (lmo), Luxembourgish (ltz), Ganda (lug), Lushai (lus), Malayalam (mal), Marathi (mar), Minangkabau (min), Macedonian (mkd), Malagasy (mlg), Maltese (mlt), Mongolian (mon), Maori (mri), Mirandese (mwl), Mazanderani (mzn), Low German (nds), Nepali (nep), Newari (new), Dutch (nld), Norwegian Nynorsk (nno), Norwegian Bokmål (nob), Pedi (nso), Occitan (oci), Oriya (ori), Ossetian (oss), Pampanga (pam), Panjabi (pan), Iranian Persian (pes), Pfaelzisch (pfl), Piemontese (pms), Western Panjabi (pnb), Polish (pol), Portuguese (por), Pushto (pus), Quechua (que), Romansh (roh), Romanian (ron), Russian (rus), Yakut (sah), Sicilian (scn), Scots (sco), Samogitian (sgs), Sinhala (sin), Slovak (slk), Slovenian (slv), Shona (sna), Somali (som), Southern Sotho (sot), Spanish (spa), Sardinian (srd), Serbian (srp), Sundanese (sun), 
Swahili (swa), Swedish (swe), Tamil (tam), Tatar (tat), Telugu (tel), Tajik (tgk), Tagalog (tgl), Thai (tha), Tsonga (tso), Turkmen (tuk), Turkish (tur), Uighur (uig), Ukrainian (ukr), Urdu (urd), Northern Uzbek (uzn), Venetian (vec), Vietnamese (vie), Vlaams (vls), Volapük (vol), Walloon (wln), Wu Chinese (wuu), Xhosa (xho), Mingrelian (xmf), Yiddish (yid), Zeeuws (zea), Standard Malay (zsm), Zulu (zul).