The Fourth Project: Cognate Detection

Due: April 10th, 2025

Cognates are pairs of words descended from the same word in a language directly ancestral to the two languages from which the pair is drawn. They are to be distinguished from loanwords (or borrowings), which are “borrowed” into one language from another. Borrowing is like adoption, whereas the relationship between cognates is like that between blood siblings.

In this project, you will build a model to automatically identify cognates. You will be given data from four closely related languages:

Ukhrul (Tangkhul)
Kachai
Huishu
Tusom

These are Tibeto-Burman languages spoken in the northeast corner of Manipur State, in Northeast India. They diverged around 500–1000 years ago, and their phonologies are quite different, but it is still possible to identify cognates if you know what you’re looking for.

The Datasets

Each dataset will contain a list of Ukhrul words that have cognates in the other language (form and gloss [= translation]) and a list of all the words in the other language that I have collected (form and gloss).

Sets will be provided as tab-delimited files. Each set will have at least two files (one for Ukhrul and one for the other language), each with two columns (form and gloss). The gold labels will be provided as a single-column file with the same number of rows as the Ukhrul file; the nth row gives the form that is cognate to the nth Ukhrul form.
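A minimal sketch of loading one set, assuming the layout described above and no header row. The helper names `read_pairs` and `read_gold` are hypothetical, and the actual file names should be taken from the archive rather than guessed.

```python
import csv

def read_pairs(path):
    """Read a two-column tab-delimited file into a list of (form, gloss) tuples."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote-like characters in the forms untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Assumption: no header row; rows with fewer than two columns are skipped.
        return [(row[0], row[1]) for row in reader if len(row) >= 2]

def read_gold(path):
    """Read the single-column gold file: one cognate form per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle]
```

Because the gold file is aligned row by row with the Ukhrul file, `zip(read_pairs(ukhrul_path), read_gold(gold_path))` would pair each Ukhrul entry with its gold cognate form.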

You will be provided with the following sets:

Ukhrul–Huishu (development data with gold cognacy judgements)
Ukhrul–Tusom
Ukhrul–Kachai

The data is in student_dataset.zip

The Task

The task will be to generate a tab-separated file with ten columns corresponding to the five most likely cognates (forms and glosses) in the other language for the corresponding Ukhrul word (the nth row consists of the top five cognate candidates for the nth Ukhrul word). The best candidate should be in the first and second columns, the second-best candidate in the third and fourth columns, and so on.

The format should look like this:

[FORM1]\t[GLOSS1]\t[FORM2]\t[GLOSS2]...[FORM5]\t[GLOSS5]\n
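A sketch of writing predictions in this format. It assumes `ranked` holds one entry per Ukhrul word, each entry being a list of at least five (form, gloss) tuples ordered best first; the function name and this data structure are hypothetical.

```python
def write_predictions(path, ranked, k=5):
    """Write one row per Ukhrul word: k (form, gloss) pairs flattened into 2*k columns."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for candidates in ranked:
            top = candidates[:k]
            if len(top) != k:
                raise ValueError(f"expected {k} candidates, got {len(top)}")
            # Flatten [(form1, gloss1), (form2, gloss2), ...] into one list of fields.
            fields = [field for form, gloss in top for field in (form, gloss)]
            handle.write("\t".join(fields) + "\n")
```

Raising on short rows is a design choice for this sketch: it catches a misaligned output file before it is uploaded.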

Baseline

The baseline system takes the phonological similarity and semantic similarity of the candidate cognates into account. It uses a simple method of producing phonological embeddings that is based upon the tf-idf of IPA character 1-, 2-, and 3-grams. The algorithm simply ranks words according to the cosine similarity of their embeddings. The semantic similarity metric is “exact match on the gloss.” If the glosses of the Ukhrul word and the candidate match completely, 1 is added to the score. Otherwise 0 is added to the score.
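The idea can be sketched in plain Python as follows. This is not the baseline notebook's code: the smoothed idf weighting, fitting tf-idf on the forms of both languages together, and all function names are assumptions made for illustration. "Character" here means a Unicode code point, so IPA diacritics count as separate characters.

```python
import math
from collections import Counter

def char_ngrams(form, n_max=3):
    """All character 1- to n_max-grams of a form, with repeats."""
    return [form[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(form) - n + 1)]

def tfidf_vectors(forms):
    """Map each form to a sparse tf-idf vector (a dict) over character n-grams."""
    counts = [Counter(char_ngrams(f)) for f in forms]
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(forms)
    # Smoothed idf: an assumption; the baseline may weight differently.
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def rank_candidates(ukhrul, other, k=5):
    """For each Ukhrul (form, gloss), return the k best (form, gloss) candidates."""
    vectors = tfidf_vectors([f for f, _ in ukhrul] + [f for f, _ in other])
    u_vecs, o_vecs = vectors[:len(ukhrul)], vectors[len(ukhrul):]
    ranked = []
    for (u_form, u_gloss), u_vec in zip(ukhrul, u_vecs):
        scored = []
        for (o_form, o_gloss), o_vec in zip(other, o_vecs):
            # Phonological similarity plus 1 for an exact gloss match.
            score = cosine(u_vec, o_vec) + (1.0 if u_gloss == o_gloss else 0.0)
            scored.append((score, (o_form, o_gloss)))
        scored.sort(key=lambda item: item[0], reverse=True)
        ranked.append([cand for _, cand in scored[:k]])
    return ranked
```

Since cosine similarity is at most 1, an exact gloss match outranks any candidate that relies on phonological similarity alone (apart from ties with an identical form), which shows how heavily this scoring leans on the glosses.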

The baseline code is here: https://colab.research.google.com/drive/1Duh7ZU7oSDNEvXckpG9sWUUjxr892XFX?usp=sharing

The baseline scores are as follows:

Language Pair	Mean Reciprocal Rank
Ukhrul-Huishu	0.61
Ukhrul-Kachai	0.66
Ukhrul-Tusom	0.42

Evaluation

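Results will be evaluated with Mean Reciprocal Rank (MRR). The reciprocal rank of a query response (a ranked list of candidate cognates) is the multiplicative inverse of the rank of the first correct answer. If the first answer is correct, it is 1; if the second is correct, it is 1/2; if the third is correct, it is 1/3. If none of the answers is correct, it is 0. The formula is as follows:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where Q is the set of queries (the Ukhrul words) and rank_i is the rank of the first correct answer for the ith query, with 1/rank_i taken to be 0 when no correct answer is returned.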

The function used for evaluation is as follows:

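def mean_reciprocal_rank(gold, inputs, preds):
  # gold maps each tab-joined input to its gold answer;
  # preds[idx] is the ranked candidate list for inputs[idx].
  total = []
  for idx, inp in enumerate(inputs):
    answer = gold["\t".join(inp)]
    top_five = preds[idx][:5]
    if answer in top_five:
      total.append(1.0 / (top_five.index(answer) + 1))
    else:
      # No correct answer in the top five: the reciprocal rank is 0.
      total.append(0.0)

  return sum(total) / len(total) if total else 0.0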
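As a sanity check on the metric, here is a toy calculation that follows the definition in the text, where a miss scores 0 and still counts toward the mean. The helper `reciprocal_rank` and the single-letter strings are placeholders, not real data or part of the evaluation code.

```python
def reciprocal_rank(answer, candidates, k=5):
    """1/rank of `answer` within the first k candidates, or 0.0 if absent."""
    top = candidates[:k]
    return 1.0 / (top.index(answer) + 1) if answer in top else 0.0

# Three toy queries with placeholder strings.
golds = ["a", "b", "c"]
preds = [
    ["a", "x", "y", "z", "w"],  # correct at rank 1 -> 1
    ["x", "y", "b", "z", "w"],  # correct at rank 3 -> 1/3
    ["x", "y", "z", "w", "v"],  # not in the top five -> 0
]
scores = [reciprocal_rank(g, p) for g, p in zip(golds, preds)]
mrr = sum(scores) / len(scores)
print(round(mrr, 3))  # 0.444
```

The mean is (1 + 1/3 + 0) / 3, so the third query lowers the score even though it contributes nothing to the sum.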

You will upload your outputs to Gradescope with the names:

ukhrul-tusom_out.tsv
ukhrul-kachai_out.tsv