This is some old corpus augmentation code I wrote around 2014. The main algorithm is in Filter.cpp and is based on the paper "Submodularity for Data Selection in Statistical Machine Translation". There are also supporting scripts for data prep, for transforming the output to a WFST (arpa2fst), for the mining itself (mine_google.py, the main mining script), etc.
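For readers unfamiliar with the idea behind the paper, here is a minimal Python sketch of greedy data selection under a feature-based submodular objective. It is not a port of Filter.cpp. The function names (`ngrams`, `greedy_select`), the uniform feature weights, the square-root concave function, and the budget expressed as a sentence count are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield every 1..n_max-gram of a token list as a tuple."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def greedy_select(sentences, budget, n_max=2):
    """Greedily pick up to `budget` sentence indices.

    Objective (assumed for this sketch): f(S) = sum over n-grams u of
    sqrt(count of u in S). The square root gives diminishing returns,
    so sentences that repeat already-covered n-grams gain less.
    """
    feats = [Counter(ngrams(s.split(), n_max)) for s in sentences]
    totals = Counter()  # n-gram counts accumulated over the selected set
    selected = []
    remaining = set(range(len(sentences)))

    def gain(i):
        # Marginal increase in f(S) from adding sentence i.
        return sum(
            math.sqrt(totals[u] + c) - math.sqrt(totals[u])
            for u, c in feats[i].items()
        )

    while remaining and len(selected) < budget:
        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
        totals.update(feats[best])
    return selected
```

With the corpus `["a b", "a b", "c d"]` and a budget of 2, the duplicate sentence has a smaller marginal gain once the first copy is chosen, so the sketch picks the first and third sentences. This version recomputes every gain on each round, which is fine for a small demo but slow on a large corpus; lazy evaluation with a priority queue is a common optimization for this kind of objective.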
TimeDelta/submodularity-for-data-selection