ULI SHARED TASK

We were accepting submissions until the end of the evaluation phase of the VarDial Evaluation Campaign 2021 on February 2, 2021. Participants who submitted results are all invited to submit a system description paper to appear in the proceedings of VarDial 2021.

The results list can be updated again after the workshop, which will be held in April (on the 19th or 20th, as of current knowledge).

Current Top Results

ULI-RLE

Rank	Team name	Link to paper	Method	Relevant macro F1
1	NRC	VarDial2021	Probabilistic classifier (similar to Naive Bayes) using character 5-grams	0.8138
2	Phlyers	VarDial2021	Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5	0.8085
3	SUKI	baseline	HeLI	0.8004
4	Phlyers	VarDial2021	Naive Bayes classifier trained on character 5-grams	0.7977
5	LAST	VarDial2021	Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25	0.7758
6	LAST	VarDial2021	Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25	0.7755
7	Phlyers	VarDial2021	SVM binary classifier (char n-grams 3-4) followed by Naive Bayes classifier (char n-grams 3-5)	0.7740
8	LAST	VarDial2021	Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25	0.7727
9	Phlyers	VarDial2021	Naive Bayes classifier trained on character 3-grams and 4-grams	0.7584
10	NRC	VarDial2021	BERT-style deep neural network with early stopping	0.7430
11	NRC	VarDial2021	BERT-style deep neural network	0.6866
12	Phlyers	VarDial2021	SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5)	0.6783
13	CRF-LI22		CRF with many different types of features	0.6140
14	NRC	VarDial2020	deep neural network with adaptation to the test set	0.2996
15	NRC	VarDial2020	ensemble of 6 deep neural networks	0.2872
16	NRC	VarDial2020	deep neural network	0.2514

ULI-RSS

Rank	Team name	Link to paper	Method	Relevant micro F1
1	NRC	VarDial2021	Probabilistic classifier (similar to Naive Bayes) using character 5-grams	0.9668
2	SUKI	baseline	HeLI	0.9632
3	NRC	VarDial2021	BERT-style deep neural network with early stopping	0.9530
4	LAST	VarDial2021	Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25	0.9496
5	LAST	VarDial2021	Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25	0.9492
6	LAST	VarDial2021	Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25	0.9484
7	CRF-LI22		CRF with many different types of features	0.8693
8	Phlyers	VarDial2021	SVM binary classifier (char n-grams 3-4) followed by Naive Bayes classifier (char n-grams 3-5)	0.8389
9	NRC	VarDial2021	BERT-style deep neural network	0.8177
10	Phlyers	VarDial2021	SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5)	0.7595
11	Phlyers	VarDial2021	Naive Bayes classifier trained on character 5-grams	0.5934
12	Phlyers	VarDial2021	Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5	0.5932
13	NRC	VarDial2020	ensemble of 6 deep neural networks	0.2596
14	NRC	VarDial2020	deep neural network with adaptation to the test set	0.1547
15	NRC	VarDial2020	deep neural network	0.1359

ULI-178

Rank	Team name	Link to paper	Method	Macro F1
1	SUKI	baseline	HeLI	0.9252
2	LAST	VarDial2021	Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25	0.9164
3	LAST	VarDial2021	Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25	0.9131
4	LAST	VarDial2021	Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25	0.9125
5	NRC	VarDial2021	Probabilistic classifier (similar to Naive Bayes) using character 5-grams	0.9079
6	NRC	VarDial2021	BERT-style deep neural network with early stopping	0.9039
7	Phlyers	VarDial2021	Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5	0.8847
8	Phlyers	VarDial2021	Naive Bayes classifier trained on character 5-grams	0.8831
9	Phlyers	VarDial2021	Naive Bayes classifier trained on character 3-grams and 4-grams	0.8753
10	CRF-LI22		CRF with many different types of features	0.8644
11	NRC	VarDial2021	BERT-style deep neural network	0.8366
12	NRC	VarDial2020	deep neural network with adaptation to the test set	0.6751
13	NRC	VarDial2020	deep neural network	0.6628
14	NRC	VarDial2020	ensemble of 6 deep neural networks	0.6356

Training and testing

Read the task descriptions below. You can download the training data from here and the testing data from here. If you have any questions, or when you wish to have your results evaluated, contact the first author of this article.

Task description

As training data for the relevant languages, we use the Wanca 2016 corpus. In total, the corpus contains 646,043 unique sentences, ranging from 19 sentences of Kemi Sami to 214,225 sentences of Northern Sami. The source version of the corpus can be downloaded from urn:nbn:fi:lb-2020022902. The test data includes new sentences from the as yet unpublished Wanca 2017 corpus and will be provided to the participants by the task organizers at the beginning of the evaluation period. Not all of the 29 relevant languages in the training set are attested in the test set: the distribution of languages in the test set is close to the actual distribution of new sentences in the forthcoming Wanca 2017 corpus.

In addition to the relevant languages, the test set includes sentences in 149 other languages. The three largest Uralic languages have been included in this category. The download links for the training data for these non-relevant languages are distributed by the task organizers only to participating teams. In total, the training data for this task consists of 63,772,445 sentences in non-relevant and 646,043 sentences in relevant languages, totaling 64,418,488 sentences.

The training data for both the relevant and the non-relevant languages must be considered noisy, e.g. there will be incorrectly labeled sentences (not intentionally, though). The Wanca 2016 corpus includes an HTTP address for each sentence, and the form of these addresses can also be used in the task. For example, our current pipeline allows only one of two close languages to be found on the same page, and this kind of information can be used to clean the corpora if participants find it helpful.
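As an illustration of the page-level cleaning idea mentioned above, here is a minimal sketch. It is not the organizers' pipeline. The row format (url, language code, sentence) and the list of close language pairs are assumptions made for this example only; the task page specifies neither. The heuristic keeps, for each page and each close pair, only the language with more sentences on that page, and leaves ties untouched.

```python
from collections import Counter, defaultdict

# Hypothetical pairs of closely related languages (an assumption for this
# sketch; the task description does not list which pairs are "close").
CLOSE_PAIRS = [("krl", "olo"), ("fit", "fkv")]


def drop_minority_of_close_pairs(rows, close_pairs=CLOSE_PAIRS):
    """Filter rows of (url, lang, sentence), assumed format.

    For every page (url) on which both languages of a close pair occur,
    the sentences labeled with the less frequent language are dropped.
    Ties are left untouched because they give no evidence either way.
    """
    rows = list(rows)

    # Count sentences per language on each page.
    per_page = defaultdict(Counter)
    for url, lang, _ in rows:
        per_page[url][lang] += 1

    # Collect the (url, lang) combinations to discard.
    dropped = set()
    for url, counts in per_page.items():
        for a, b in close_pairs:
            if counts[a] and counts[b] and counts[a] != counts[b]:
                loser = a if counts[a] < counts[b] else b
                dropped.add((url, loser))

    return [row for row in rows if (row[0], row[1]) not in dropped]
```

Whether such filtering helps is an empirical question: it removes likely mislabeled sentences but also discards genuinely mixed pages, so it is worth checking on held-out data before adopting it.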

The shared task is divided into three different tracks. All of the tracks are closed, so no other data or models can be used for training in addition to the 64,418,488 sentences in the training set. All the tracks use the same training data.

Track 1: ULI-RLE (Relevant languages as equals)

The first track of the shared task considers all the relevant languages equal in value, and the aim is to maximize their average F-score. This is important when one is also interested in finding the very rare languages included in the set of relevant languages. The F-score is calculated as a macro-F1 score over the relevant languages in the training set. For example, if you predict a relevant language that does not actually occur in the test set, your precision, and thus your F1-score, for that language goes to zero. The result is the average of the F1-scores of all the 29 relevant languages.
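The macro-F1 described above can be sketched as follows. This is not the official scoring script. The function name and the empty_score parameter are assumptions for illustration: the description does not say what score a language receives when it is neither present in the test set nor ever predicted, so that value is left configurable (1.0 here).

```python
from collections import Counter


def macro_f1(gold, pred, languages, empty_score=1.0):
    """Average of per-language F1 scores over the given languages.

    gold, pred  -- equally long sequences of language codes
    languages   -- the languages to average over (e.g. the relevant ones)
    empty_score -- ASSUMPTION: F1 assigned to a language that never occurs
                   in gold and is never predicted
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1  # the true language was missed
            fp[p] += 1  # the predicted language was wrong

    scores = []
    for lang in languages:
        denom = 2 * tp[lang] + fp[lang] + fn[lang]
        scores.append(2 * tp[lang] / denom if denom else empty_score)
    return sum(scores) / len(scores)


gold = ["sme", "sme", "fin", "krl"]
pred = ["sme", "fin", "fin", "fkv"]
# fkv is predicted but absent from gold, so its F1 is zero.
print(macro_f1(gold, pred, ["sme", "krl", "fkv"]))
```

In this toy example sme scores 2/3 while krl and fkv score 0, giving an average of 2/9. Passing the full list of training languages instead of only the relevant ones gives a macro-F1 over all languages.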

Track 2: ULI-RSS (Relevant sentences as equals)

The second track treats as equal every sentence in the test set that is written in, or is predicted to be in, a relevant language. Compared to the first track, this track gives less importance to the very rare languages, as their precision matters less when the resulting F-score is calculated. The resulting F-score is calculated as a micro-F1 over the sentences in the test set that are in relevant languages as well as those that you have predicted to be in relevant languages.
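One possible reading of this micro-F1 is sketched below. It is not the official scoring script. True positives, false positives and false negatives are pooled over the relevant languages only, so a sentence confused between two relevant languages counts once as a miss and once as a false alarm. The function name and the 0.0 returned when no relevant sentence occurs or is predicted are assumptions for illustration.

```python
def relevant_micro_f1(gold, pred, relevant):
    """Micro-averaged F1 pooled over the relevant languages.

    gold, pred -- equally long sequences of language codes
    relevant   -- collection of relevant language codes
    """
    relevant = set(relevant)
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g in relevant:
                tp += 1  # relevant sentence, correctly identified
            continue
        if g in relevant:
            fn += 1  # relevant sentence that was missed
        if p in relevant:
            fp += 1  # sentence wrongly assigned to a relevant language
    denom = 2 * tp + fp + fn
    # ASSUMPTION: return 0.0 if nothing relevant occurs or is predicted.
    return 2 * tp / denom if denom else 0.0


gold = ["sme", "sme", "fin", "krl"]
pred = ["sme", "fin", "fin", "fkv"]
print(relevant_micro_f1(gold, pred, {"sme", "krl", "fkv"}))
```

In this toy example there is one true positive, one false positive (fkv) and two false negatives (one sme, one krl), so the score is 2 / (2 + 1 + 2) = 0.4. The correctly labeled fin sentence does not enter the calculation at all.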

Track 3: ULI-178 (All 178 languages as equals)

In the first two tracks, there is no difference between the non-relevant languages when the F1-scores are calculated. The third track, however, does not focus specifically on the 29 relevant languages; instead, the target is to maximize the average F-score over all the 178 languages present in the training set. This track will be the LI shared task with the largest number of languages to date (ALTW 2010 included 74 languages). The F-score is calculated as a macro-F1 score over all the languages in the training set.

Languages

The training set contains sentences in the 178 languages below.

The 29 relevant languages are:

fit Tornedalen Finnish (meänkieli)
fkv Kven (kvääni)
izh Ingrian (ižoran keel)
kca Khanty (ханты ясанг)
koi Komi-Permyak (перем коми кыв)
kpv Komi-Zyrian (Коми кыв)
krl Karelian (karjal)
liv Liv (līvõ kēļ)
lud Ludian (lüüdin kiel')
mdf Moksha (мокшень)
mhr Eastern and Meadow Mari (марий йылме)
mns Mansi (мāньси лāтыӈ)
mrj Western or Hill Mari (Кырык мары)
myv Erzya (эрзянь)
nio Nganasan (ня”)
olo Livvi (Olonets / livvin karjal)
sjd Kildin Sami (Кӣллт са̄мь кӣлл)
sjk Kemi Sami (samääškiela)
sju Ume Sami (uumajanlappi)
sma Southern Sami (åarjel-saemien)
sme Northern Sami (davvisámi, davvisámegiella)
smj Lule Sami (julevsábme)
smn Inari Sami (anarâškielâ)
sms Skolt Sami (sää´mǩiõll)
udm Udmurt (удмурт кыл)
vep Veps (vepsän kel')
vot Votic (vad̕d̕a ceeli)
vro Võro (võro kiil)
yrk Nenets (ненэцяʼ вада)

The 149 non-relevant languages are:

Afrikaans (afr), Tosk Albanian (als), Amharic (amh), Arabic (ara), Assamese (asm), North Azerbaijani (azj), Bashkir (bak), Bavarian (bar), Central Bikol (bcl), Belarusian (bel), Bengali (ben), Bosnian (bos), Bishnupriya (bpy), Breton (bre), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Czech (ces), Chechen (che), Chuvash (chv), Mandarin Chinese (cmn), Corsican (cos), Welsh (cym), Danish (dan), German (deu), Dimli (diq), Dhivehi (div), Standard Estonian (ekk), Modern Greek (ell), English (eng), Esperanto (epo), Basque (eus), Extremaduran (ext), Faroese (fao), Finnish (fin), French (fra), Western Frisian (fry), Irish (gle), Galician (glg), Manx (glv), Goan Konkani (gom), Guarani (grn), Swiss German (gsw), Gujarati (guj), Haitian (hat), Hebrew (heb), Fiji Hindi (hif), Hindi (hin), Croatian (hrv), Upper Sorbian (hsb), Hungarian (hun), Ido (ido), Iloko (ilo), Interlingua (ina), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kalaallisut (kal), Kannada (kan), Georgian (kat), Kazakh (kaz), Kirghiz (kir), Korean (kor), Karachay-Balkar (krc), Kölsch (ksh), Latin (lat), Latvian (lav), Limburgan (lim), Lithuanian (lit), Lombard (lmo), Luxembourgish (ltz), Ganda (lug), Lushai (lus), Malayalam (mal), Marathi (mar), Minangkabau (min), Macedonian (mkd), Malagasy (mlg), Maltese (mlt), Mongolian (mon), Maori (mri), Mirandese (mwl), Mazanderani (mzn), Low German (nds), Nepali (nep), Newari (new), Dutch (nld), Norwegian Nynorsk (nno), Norwegian Bokmål (nob), Pedi (nso), Occitan (oci), Oriya (ori), Ossetian (oss), Pampanga (pam), Panjabi (pan), Iranian Persian (pes), Pfaelzisch (pfl), Piemontese (pms), Western Panjabi (pnb), Polish (pol), Portuguese (por), Pushto (pus), Quechua (que), Romansh (roh), Romanian (ron), Russian (rus), Yakut (sah), Sicilian (scn), Scots (sco), Samogitian (sgs), Sinhala (sin), Slovak (slk), Slovenian (slv), Shona (sna), Somali (som), Southern Sotho (sot), Spanish (spa), Sardinian (srd), Serbian (srp), Sundanese (sun), 
Swahili (swa), Swedish (swe), Tamil (tam), Tatar (tat), Telugu (tel), Tajik (tgk), Tagalog (tgl), Thai (tha), Tsonga (tso), Turkmen (tuk), Turkish (tur), Uighur (uig), Ukrainian (ukr), Urdu (urd), Northern Uzbek (uzn), Venetian (vec), Vietnamese (vie), Vlaams (vls), Volapük (vol), Walloon (wln), Wu Chinese (wuu), Xhosa (xho), Mingrelian (xmf), Yiddish (yid), Zeeuws (zea), Standard Malay (zsm), Zulu (zul).