GitHub - TimeDelta/submodularity-for-data-selection: Some old code I wrote around 2014 based on "Submodularity for Data Selection in Statistical Machine Translation"

This is some old corpus augmentation code I wrote around 2014. The main algorithm is in Filter.cpp and is based on the paper "Submodularity for Data Selection in Statistical Machine Translation". The repository also includes supporting scripts for data prep, for transforming the output to a WFST (arpa2fst), and for mining (mine_google.py, the main mining script), among others.
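For readers unfamiliar with the approach, here is a minimal sketch of greedy submodular data selection. It is not the code in Filter.cpp. It assumes a feature-based objective of the form f(S) = sum over n-grams u of sqrt(count of u in S), a unit cost per sentence, and a budget given as a sentence count. The names `ngrams` and `greedy_select` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield every n-gram of order 1..n_max as a tuple of tokens."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def greedy_select(sentences, budget, n_max=2):
    """Greedily pick up to `budget` sentence indices.

    Assumed objective (illustrative, not taken from Filter.cpp):
        f(S) = sum_u sqrt(m_u(S))
    where m_u(S) is the total count of n-gram u in the selected set S.
    The square root is concave, so repeated n-grams give diminishing
    returns and the objective is monotone submodular.
    """
    feats = [Counter(ngrams(s.split(), n_max)) for s in sentences]
    totals = Counter()          # m_u(S) for the current selection
    selected = []
    remaining = set(range(len(sentences)))

    def gain(i):
        # Marginal gain f(S + {i}) - f(S), computed per n-gram.
        return sum(
            math.sqrt(totals[u] + c) - math.sqrt(totals[u])
            for u, c in feats[i].items()
        )

    while remaining and len(selected) < budget:
        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break               # nothing left adds any value
        selected.append(best)
        totals.update(feats[best])
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    corpus = [
        "the cat sat",
        "the cat sat",
        "a dog barked loudly",
        "the cat",
    ]
    for idx in greedy_select(corpus, budget=2):
        print(idx, corpus[idx])
```

Because the gain of a sentence shrinks as its n-grams accumulate in the selection, exact duplicates are picked late or not at all. A production implementation would typically use lazy (priority-queue) evaluation of gains and a length-based cost instead of rescanning every candidate each round.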

Repository contents:
- bil_word_lm_bench/scripts
- scripts
- token_mappers
- README.md
