Subword Modeling


The Third Project

In certain North American subcultures, there is a trend towards using creatively spelled or unconventional names for children like Ashleigh, Peytyn, Jaxxon, and Qwaylon. These names are sometimes called “tragedeighs.” The goal of this project is to build a dataset of phonemically transcribed tragedeighs and a model to predict the phonemic transcriptions of names of this type.

This project will have two phases:

Phase I: Data collection

Each student will collect 50 unique tragedeighs and transcribe them phonemically. A guide to phonemic transcription will be provided. Because the pronunciation of some of the names will not be obvious (especially to non-native speakers), students are encouraged to consult native speakers or Internet resources in order to obtain accurate transcriptions.

The data should be submitted via Canvas before March 12th at 11:59pm as a CSV file with two columns: name and transcription. The instructors will collate and deduplicate the data, then make standard train, dev, and test splits of 70%/15%/15%. The train and dev sets will be provided to the students. The test set will be held out (only the names will be provided) and used for the final evaluation of the models.
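The collation, deduplication, and splitting described above can be sketched roughly as follows. This is only an illustration, not the instructors' actual procedure: the function names, the case-insensitive deduplication rule, the fixed random seed, and the header names `name` and `transcription` are all assumptions.

```python
import csv
import io
import random


def load_pairs(csv_text):
    """Parse CSV text with 'name' and 'transcription' columns into (name, transcription) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["name"].strip(), row["transcription"].strip()) for row in reader]


def deduplicate(pairs):
    """Keep the first entry seen for each name (assumed rule: names compared case-insensitively)."""
    seen = set()
    unique = []
    for name, transcription in pairs:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            unique.append((name, transcription))
    return unique


def split_data(pairs, seed=0):
    """Shuffle, then split into roughly 70% train, 15% dev, and the remainder as test."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises in the split sizes.
    n_train = len(shuffled) * 70 // 100
    n_dev = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Calling `split_data(deduplicate(load_pairs(text)))` on the contents of a submitted file returns three disjoint lists. Fixing the seed keeps the split reproducible across runs.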

Phase II: Modeling

Each student will develop either a machine learning model or a rule-based system that will generate phonemic transcriptions of tragedeighs given orthographic representations. Each student should submit a text file consisting of a list of transcriptions (one per line) corresponding to the names in the test set. Students may submit as many times as they like, but only the last submission will be counted. The final submission must be made by April 2nd at 11:59pm.
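As a rough sketch of the submission format, the helper below turns a list of test names into one transcription per line, in the same order as the names. The function names and the placeholder `toy_transcribe` are hypothetical; a real submission would use the student's own model or rule system in its place.

```python
from pathlib import Path


def format_predictions(names, transcribe):
    """Return one predicted transcription per line, in the same order as `names`."""
    lines = []
    for name in names:
        prediction = transcribe(name).strip()
        if "\n" in prediction:
            # A line break inside a prediction would misalign every later line.
            raise ValueError(f"prediction for {name!r} contains a line break")
        lines.append(prediction)
    return "".join(line + "\n" for line in lines)


def write_predictions(names, transcribe, path="test.txt"):
    """Write the formatted predictions to `path` as UTF-8 text."""
    Path(path).write_text(format_predictions(names, transcribe), encoding="utf-8")


def toy_transcribe(name):
    """Hypothetical placeholder: spells the name out letter by letter, not a real transcription."""
    return " ".join(name.lower())
```

For example, `write_predictions(test_names, toy_transcribe)` produces a file with exactly as many lines as there are test names. Writing UTF-8 explicitly matters if the transcriptions contain IPA symbols.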

Constraints

In general, the only constraints are as follows:

In other words, students are given great freedom to develop their models/systems:

Metrics

Submissions will be evaluated using two metrics: exact match rate (EM) and phoneme error rate (PER).

To obtain full credit, the student’s model must achieve an exact match rate at least as high as, and a phoneme error rate at least as low as, that of the instructor-provided baseline system. Partial credit will be awarded based on a linear function of the difference between the student’s and the baseline system’s scores. Submit your predictions to the autograder as a list of outputs, where each line is one prediction. Name this file test.txt.
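For checking a model on the dev set, exact match rate and phoneme error rate could be computed along the following lines. This is a sketch under stated assumptions, not the autograder's code: it assumes phonemes within a transcription are separated by spaces, and that the phoneme error rate is the total Levenshtein distance divided by the total number of reference phonemes. The actual scoring script may tokenize or average differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def score(references, hypotheses):
    """Return (exact_match_rate, phoneme_error_rate) for parallel lists of transcriptions."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    # Assumption: phonemes are whitespace-separated symbols.
    ref_seqs = [r.split() for r in references]
    hyp_seqs = [h.split() for h in hypotheses]
    total_phonemes = sum(len(r) for r in ref_seqs)
    if total_phonemes == 0:
        raise ValueError("references contain no phonemes")
    exact = sum(1 for r, h in zip(ref_seqs, hyp_seqs) if r == h)
    edits = sum(edit_distance(r, h) for r, h in zip(ref_seqs, hyp_seqs))
    return exact / len(ref_seqs), edits / total_phonemes
```

With two reference transcriptions of four and five phonemes, where one prediction is perfect and the other has a single substituted phoneme, `score` returns an exact match rate of 0.5 and a phoneme error rate of 1/9.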

The baseline scores to beat are as follows:

PER EM
0.562 0.326

Code to reproduce the baseline can be found here.