Subword Modeling


The Third Project

In certain North American subcultures, there is a trend towards giving children creatively spelled or unconventional names, such as Ashleigh, Peytyn, Jaxxon, and Qwaylon. These names are sometimes called “tragedeighs.” The goal of this project is to build a dataset of phonemically transcribed tragedeighs and a model to predict the phonemic transcriptions of names of this type.

This project will have two phases:

Phase I: Data collection

Each student will collect 50 unique tragedeighs and transcribe them phonemically. A guide to phonemic transcription will be provided. Because the pronunciation of some of the names will not be obvious (especially to non-native speakers), students are encouraged to consult native speakers or Internet resources in order to obtain accurate transcriptions.

The data should be submitted via Gradescope as a TSV file with two columns (name and transcription) before March XYZ at 11:59pm. The instructors will collate and deduplicate the data, then provide it to the students. Standard train, dev, and test splits will be made, and the train and dev sets will be provided to the students. The splits will be 70%/17%/17% (train/dev/test). The test set will be held out (only the names will be provided) and used for the final evaluation of the models.
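Before uploading, it may help to check the file locally. The following is a minimal sketch, not an official validator: the function name `validate_tsv` is hypothetical, and it assumes the file has no header row, that names are compared case-insensitively for uniqueness, and that each student submits exactly 50 rows. The sample transcriptions in the usage note are illustrative only.

```python
import csv
import io


def validate_tsv(text, expected_count=50):
    """Return a list of problems found in a name/transcription TSV.

    Assumptions (not stated in the handout): no header row, and names
    that differ only in letter case count as duplicates.
    """
    problems = []
    seen = set()
    # QUOTE_NONE keeps any quote characters in a transcription literal.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader, start=1):
        if len(row) != 2:
            problems.append(f"line {i}: expected 2 columns, found {len(row)}")
            continue
        name, transcription = (field.strip() for field in row)
        if not name or not transcription:
            problems.append(f"line {i}: empty name or transcription")
            continue
        key = name.lower()
        if key in seen:
            problems.append(f"line {i}: duplicate name {name!r}")
        seen.add(key)
    if len(seen) != expected_count:
        problems.append(
            f"expected {expected_count} unique names, found {len(seen)}"
        )
    return problems
```

For example, `validate_tsv("Ashleigh\tˈæʃli\nJaxxon\tˈdʒæksən\n", expected_count=2)` returns an empty list, while a file with a repeated name or a missing column returns one message per problem.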

Phase II: Modeling

Each student will develop either a machine learning model or a rule-based system that will generate phonemic transcriptions of tragedeighs given orthographic representations. Each student should submit a text file consisting of a list of transcriptions (one per line) corresponding to the names in the test set. Students may make as many submissions as they like, but only the last submission will be counted. The final submission must be made by March XYZ at 11:59pm.
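Because the submission is matched to the test names line by line, a single empty or multi-line prediction would shift every later transcription out of alignment. The sketch below shows one way to guard against that. The names `format_submission` and `write_submission` are hypothetical, `predict` stands for whatever model or rule-based system the student builds, and UTF-8 encoding is an assumption.

```python
def format_submission(test_names, predict):
    """Return the submission text: one transcription per line, in test order.

    `predict` is any callable mapping a name (str) to a transcription (str).
    """
    lines = []
    for name in test_names:
        transcription = predict(name).strip()
        # An empty or multi-line prediction would misalign all later lines.
        if not transcription or "\n" in transcription:
            raise ValueError(f"invalid transcription for {name!r}")
        lines.append(transcription)
    return "\n".join(lines) + "\n"


def write_submission(test_names, predict, path):
    """Write the formatted submission to `path` (UTF-8 is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_submission(test_names, predict))
```

A call such as `write_submission(test_names, my_model, "submission.txt")` would then produce a file with exactly as many lines as there are test names, where `my_model` and the file name are placeholders.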

Constraints

In general, the only constraints are as follows:

In other words, students are given great freedom to develop their models/systems:

Metrics

Submissions will be evaluated using two metrics: exact match rate and phoneme error rate.

To obtain full credit, the student’s model must achieve an exact match rate at least as high as, and a phoneme error rate at least as low as, the instructor-provided baseline system. Partial credit will be awarded based on a linear function of the difference between the student’s and the baseline system’s scores.
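To track progress on the dev set, students may want to compute both metrics themselves. The following is a sketch under stated assumptions, not the instructors' scoring script: it assumes each transcription has already been split into a list of phonemes, that exact match means the two phoneme lists are identical, and that phoneme error rate is the total Levenshtein distance divided by the total number of reference phonemes (a common definition, but not one the handout confirms).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phonemes."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference phoneme
                curr[j - 1] + 1,         # insert a hypothesis phoneme
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def exact_match_rate(refs, hyps):
    """Fraction of names whose predicted phoneme list equals the reference."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference phonemes (assumed)."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phonemes = sum(len(r) for r in refs)
    return total_edits / total_phonemes
```

With references `[["æ", "ʃ", "l", "i"], ["dʒ", "æ", "k", "s", "ə", "n"]]` and a hypothesis that gets the first name right but substitutes one vowel in the second, the exact match rate is 0.5 and the phoneme error rate is 1/10. How transcriptions are split into phonemes (for example, whether an affricate counts as one unit) affects both numbers, so the segmentation should follow the provided transcription guide.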