// This code is based on the paper "Submodularity for Data Selection in Statistical Machine Translation"
// by Katrin Kirchhoff and Jeff Bilmes
/*
Greedy algorithm for submodular data selection:
[Pseudo-code]
> obtain all n-gram counts from each set separately
> define feature set (U) as: ({features in dev data} UNION {features in test data}) INTERSECT {features in unfiltered data}
> calculate and store the cost of each sentence for future reference
> calculate and store the weight of each feature for future reference
> while the total cost < budget
    > from the available sentences, choose the one with the best ratio of submodular function value to cost
    > add the chosen sentence to the output
    > remove the chosen sentence from the available sentences
> use the selection from the previous iteration (i.e. from before cost >= budget was satisfied)

submodular function (basic):
[Pseudo-code]
> for each feature (n-gram)
    > let X = ({already selected output} UNION {sentence to potentially add})
    // calculate the total relevance score for the current feature in X
    > for each sentence in X
        > total relevance += relevance score of the current sentence
    > function value += feature weight * concave function (total relevance score for the current feature in X)

time complexity optimization for submodular function evaluation:
[Explanation]
The pseudo-code for the basic submodular function is wasteful: it redoes many of its calculations on every evaluation.
By definition, M_u(X) = SUM_over_x (m_u(x)), which is the total relevance score for feature u over the set of sentences X.
M_u is modular, and the relevance score m_u(x) is 0 if feature u does not appear in sentence x (b/c tf(u,x) = 0).
Therefore M_u(X_i) = M_u(X_{i-1}) for any feature that does not appear in the new sentence, and
M_u(X_i) = M_u(X_{i-1}) + m_u(x_i) for any feature that does, where:
- u is a feature (n-gram)
- i is the iteration number
- x_i is the sentence to potentially add at iteration i
- X_i is the set of selected sentences after iteration i
- m_u(x) = tf(u,x) * idf^train(u) is the relevance score
- w_u is the weight of feature u
---------------------------------
[Pseudo-code]
before entering the main loop (while total cost < budget), calculate m_u(x) for each u in each x:
> for each sentence x
    > for each feature u in x
        > calculate and store m_u(x) for future reference
store the previously calculated total relevance score [M_u(X_{i-1})] for each feature
store the previously calculated value of the submodular function
> for each feature in the sentence to potentially add
    > update the total relevance score for the current feature: M_u(X_i) = M_u(X_{i-1}) + m_u(x_i)
    > subtract w_u * phi_u(previous total relevance score for the current feature) from the stored function value
    > add w_u * phi_u(updated total relevance score for the current feature) to the stored function value
This reduces the time complexity of one evaluation of the submodular function from O(n * |U|) to O(|u(x_i)|), where:
    n is the number of sentences
    U is the feature set as defined above
    |u(x_i)| is the number of distinct features in the sentence that is to be potentially added
*/
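// Illustrative example of why the concave phi (sqrt here) encourages diversity (numbers made up):
// for a feature u with w_u = 1 whose every occurrence contributes m_u(x) = 1, successive selections
// of sentences containing u gain sqrt(1) - sqrt(0) = 1, then sqrt(2) - sqrt(1) ~= 0.41, then
// sqrt(3) - sqrt(2) ~= 0.32, and so on: diminishing returns, so already-covered features contribute
// less and the greedy selection is steered toward sentences with new features.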
#include <unordered_map>
#include <unordered_set>
#include <queue>
#include <vector>
#include <string>
#include <limits>
#include <utility>
#include <algorithm>
#include <cmath>
#include <iostream>
#include <fstream>
#include <sstream>
#include <boost/program_options.hpp>
#include <boost/functional/hash.hpp>
#include <boost/tokenizer.hpp>
typedef unsigned int uint;
typedef unsigned short wordid;
typedef std::vector<wordid> ngram;
typedef std::vector<wordid> wordvec;
typedef std::vector<ngram> ngramvec;
typedef std::unordered_set<ngram,boost::hash<ngram> > ngramset;
typedef std::unordered_map<std::string,wordid> wordmap;
typedef unsigned char uchar;
typedef unsigned int sentindex;
typedef unsigned int Cost;
struct Feature {
uint seedCount = 0;
uint minedCount = 0;
double idf = 0; // inverse document frequency
double weight = 0;
double prevTotalRelevance = 0;
// force constructors to be generated by the compiler
Feature() = default;
Feature(const Feature& f) = default; // copy constructor
};
struct Sentence {
// storing the actual string for the sentence uses way too much memory, so just store
// the index of the sentence within the corpus and loop over the corpus at the end,
// printing out the line corresponding to each chosen sentence's index
sentindex index;
// this was originally a map from Feature* to uchar, until I realized that using two
// parallel vectors instead reduces memory use by a factor of about 3, b/c in a map each
// element takes up roughly 32 bytes more than the combined size of its key and value
std::vector<Feature*> features;
// should not have any feature appear more than max value for uchar in a single sentence
std::vector<uchar> counts;
uchar length; // sentence length (# words) should never be more than max value for uchar
double maxMarginalGain; // upper bound on marginal gain by choosing this sentence
Sentence(){
features = std::vector<Feature*>();
counts = std::vector<uchar>();
}
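// increment this sentence's count for the given feature, adding the feature on first
// sight; the parallel-vector layout (see the note above) trades a linear scan here
// for the lower memory use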
void incCount(Feature* const feature){
int i = 0;
bool found = false;
for (auto f : features){
if (f == feature){
found = true;
break;
}
i++;
}
if (!found){
features.push_back(feature);
counts.push_back(1);
} else
counts[i]++;
}
// define a less-than operator so the priority queue knows how to order the sentences
bool operator < (const Sentence &s) const {
return (maxMarginalGain < s.maxMarginalGain);
}
};
// note: must define Sentence & Feature structs before these
// use unordered version for amortized constant lookup time
typedef std::unordered_map<ngram,Feature,boost::hash<ngram> > umap;
// potential updates - map from address of object to new value for object
typedef std::unordered_map<Feature*,Feature> ufmap;
typedef std::vector<Sentence> sentvec;
typedef std::priority_queue<Sentence> sentqueue;
// word id reserved to signify a missing word - word ids start at MISSING_WORD_ID + 1
const wordid MISSING_WORD_ID=0;
const uint NGRAM_ORDER=3;
// count types (allow re-use of feature counting code)
const uchar SEED=0;
const uchar MINED=1;
const uchar MAX_SENT_LENGTH=std::numeric_limits<uchar>::max();
// Definition:
// SUM_over_u w_u * phi_u ( SUM_over_x m_u(x) ) , where:
// u is a feature in the feature set
// x is a sentence from the set of selected sentences
// w_u is the weight for feature u
// phi_u is a non-negative non-decreasing concave function for feature u (sqrt in our case)
// m_u(x) is the relevance score for feature u in sentence x (TFIDF in our case)
// NOTE: Actual implementation is equivalent to definition but reuses calculations from
// previous evaluations of the submodular function to greatly improve time complexity.
// Time complexity optimization also relies on the fact that freq(u,x) = 0 if feature u
// does not appear in sentence x.
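// Worked example of one incremental term update (illustrative numbers): for a feature u with
// w_u = 0.5, previous total relevance M_u(X_{i-1}) = 4.0, and new contribution m_u(x_i) = 5.0,
// the function value changes by 0.5 * (sqrt(4.0 + 5.0) - sqrt(4.0)) = 0.5 * (3 - 2) = 0.5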
std::pair<double,ufmap> submodularFunction(const umap &features, Sentence &sentence, double prevFuncValue){
double funcValue = prevFuncValue;
ufmap updatedFeatures = ufmap();
auto countIter = sentence.counts.begin();
for (auto featureIter = sentence.features.begin(); featureIter != sentence.features.end(); ++featureIter){
Feature feature(*(*featureIter));
uchar count = *countIter;
// add tf(u,x) * idf(u) to the previous total to get the updated total relevance score
double frequency = (double)count / (double)sentence.length;
double newTotalRelevance = frequency * feature.idf + feature.prevTotalRelevance;
// update the function value: swap the old concave term for the new one
funcValue -= feature.weight * sqrt(feature.prevTotalRelevance);
funcValue += feature.weight * sqrt(newTotalRelevance);
// store the updated total relevance score for the current feature in case this sentence is chosen
feature.prevTotalRelevance = newTotalRelevance;
updatedFeatures[*featureIter] = std::move(feature);
++countIter;
}
// this line is part of a second time complexity optimization (LAZYGREED)
sentence.maxMarginalGain = funcValue - prevFuncValue;
return std::pair<double,ufmap>(funcValue, std::move(updatedFeatures));
}
// SUMMARY: the cost of adding the specified sentence
// some possible cost definitions are:
// 1 per sentence - budget is the total # of sentences in the output
// # words in sentence - budget is the total # of words (non-unique) in the output
// # ngrams in sentence - budget is the total # of ngrams (non-unique) in the output
// # unique words in sentence - budget is the # of unique words in the output; this should
// allow tighter control over the size of the output than
// just taking the # of words
// # unique ngrams in sentence - budget is the # of unique ngrams in the output; this should
// allow tighter control over the size of the output than
// just taking the # of ngrams
//
// In the case of the last 2 definitions, this function could also be modified to take an
// extra input representing the set of unique words / ngrams chosen so far, so that the
// budget represents the total # of unique words / ngrams in the output. That might sound
// good, but in practice it would have major negative consequences. It would break the
// LAZYGREED optimization, which uses a priority queue to avoid evaluating every sentence
// on each iteration of the budget loop, b/c a sentence's cost would change between
// iterations. It would also cause division by zero whenever a sentence contains only
// words / ngrams that have already been chosen, and it would make the algorithm favor
// sentences that don't contain many new words / ngrams. The likely result is a larger
// filtered corpus with more redundant information; that shouldn't affect the size of the
// final LM (which depends on the number of unique ngrams in the corpus), but it would
// increase the time required to build the LM from the corpus.
//
// Experimentation with the different cost definitions is still needed to determine /
// verify their effect on the quality and size of the output.
Cost cost(const Sentence &sentence){
return (Cost)sentence.length;
}
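// Sketches of two of the alternative definitions listed above (illustrative only, not used):
// Cost cost(const Sentence &sentence){ return (Cost)1; } // 1 per sentence; budget = # sentences
// Cost cost(const Sentence &sentence){ return (Cost)sentence.features.size(); } // # of distinct
// feature-set ngrams in the sentence, an approximation of "# unique ngrams"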
// evaluate the efficacy of adding the specified sentence
std::pair<double,ufmap> evaluate(const umap &features, Sentence &sentence, double prevFuncValue){
auto result = submodularFunction(features, sentence, prevFuncValue);
// the cost is supposed to be accounted for in the value used for ordering within the priority queue
// note that this does NOT break the foundation for the LAZYGREED optimization b/c both sides of the
// inequality are always divided by the same number, so the relationship still holds
sentence.maxMarginalGain /= cost(sentence);
return result;
}
// update the total relevance scores for all of the features in the chosen sentence
// note: the Feature* keys in the updates map point into the features map, so assigning
// through them mutates the stored Feature objects in place
void updateFeatures(umap &features, ufmap &updates){
for (auto update = updates.begin(); update != updates.end(); ++update)
*(update->first) = std::move(update->second);
}
// SUMMARY: split a string on spaces and store each word id as an element in a vector
// if addMissingWords is false, any spot corresponding to a word without an id will
// have a placeholder value of MISSING_WORD_ID in order to make it possible to ignore
// ngrams containing words that are not in the seed corpus as an optimization
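// e.g. (illustrative): with wordids = { "the"->1, "cat"->2 } and addMissingWords == false,
// the line "the cat sat" tokenizes to {1, 2, MISSING_WORD_ID}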
wordvec tokenize(std::string &line, wordmap &wordids, bool addMissingWords){
wordvec words = wordvec();
boost::char_separator<char> sep(" \t");
boost::tokenizer<boost::char_separator<char> > tok(line, sep);
for (auto word = tok.begin(); word != tok.end(); ++word){
wordid wid;
// if the current word doesn't have an id, define one for it
auto result = wordids.find(*word);
if (result == wordids.end()){
if (addMissingWords){
// note: wordid is an unsigned short, so this id scheme supports at most 65,535 distinct words
wordid id = wordids.size() + MISSING_WORD_ID + 1; // word ids start at MISSING_WORD_ID + 1
wordids[*word] = id;
wid = id;
} else
wid = MISSING_WORD_ID;
} else
wid = result->second;
words.push_back(wid);
}
return words;
}
// get all ngrams up to order NGRAM_ORDER, skipping ngrams containing a word
// that is not in the feature set
// note: ngrams may appear multiple times in returned vector
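// e.g. (illustrative): tokens {1,2,3} with NGRAM_ORDER == 3 yield {1},{2},{3},{1,2},{2,3},{1,2,3};
// if token 2 were MISSING_WORD_ID, only {1} and {3} would survive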
ngramvec getNgrams(const wordvec &tokens){
ngramvec ngrams = ngramvec();
for (uint order = 0; order < NGRAM_ORDER && order < tokens.size(); ++order){
for (uint start = 0; start < tokens.size() - order; ++start){
// get the current ngram
bool cancelled = false;
ngram gram = ngram();
for (uint word = start; word <= start + order; ++word){
wordid id = tokens[word];
if (id == MISSING_WORD_ID){
// time optimization - skip over ngrams that we know will contain a missing word
start += word - start;
cancelled = true;
break;
}
gram.push_back(std::move(id));
}
if (!cancelled) // only add the ngram if it doesn't contain any missing words
ngrams.push_back(std::move(gram));
}
}
return ngrams;
}
// get the set of ngram features in a corpus
ngramset getFeatures(const std::string &corpusFileName, wordmap &wordids, bool addMissingWords){
ngramset features;
std::string line;
std::ifstream corpus(corpusFileName);
while (getline(corpus, line)){
ngramvec ngrams = getNgrams(tokenize(line, wordids, addMissingWords));
for (auto gram : ngrams)
features.insert(gram);
}
corpus.close();
return features;
}
// feature set definition: {U_seed INTERSECT U_mined}
umap getFeatureSet(const std::string &seedFileName,
const std::string &minedFileName,
wordmap &wordids,
bool quiet){
if (!quiet) std::cerr << "Getting seed features\n";
ngramset seedFeatures = getFeatures(seedFileName, wordids, true);
if (!quiet) std::cerr << "Getting mined features\n";
ngramset minedFeatures = getFeatures(minedFileName, wordids, false);
// intersection - add all of the features to the feature map
// note: if a feature is not in the seed corpus, then it cannot
// be in the intersection, so just loop over the seed features
// and search for them in the mined features b/c there are fewer
// seed features than mined features
if (!quiet) std::cerr << "Taking feature intersection\n";
umap features = umap();
auto b = seedFeatures.begin();
auto e = seedFeatures.end();
while (b != e) {
if (minedFeatures.find(*b) != minedFeatures.end())
features[*b] = Feature();
++b;
}
return features;
}
// count the occurrences of each feature from the feature set in a given corpus,
// while simultaneously storing information about each sentence in the mined corpus
void countFeatures(umap &features,
const std::string &corpusFileName,
sentvec &sentences,
uchar countType,
wordmap &wordids){
std::string line;
std::ifstream corpusFile(corpusFileName);
sentindex lineIndex = 0;
while (getline(corpusFile, line)){
// store word tokens to avoid tokenizing the same line twice
wordvec tokens = tokenize(line, wordids, false);
// skip sentences whose length would overflow a uchar, since overflow could cause problems
// (e.g. division by 0) in the frequency calculations of the submodular function;
// this is okay b/c a valid sentence should never be more than 255 words long
if (countType == MINED && tokens.size() > MAX_SENT_LENGTH){
lineIndex++;
continue;
}
// only used for the mined corpus
Sentence sentence = Sentence();
// features added to the sentence so far - used for idf calculations
ngramset featuresAdded = ngramset();
ngramvec ngrams = getNgrams(tokens);
for (auto gram : ngrams){
// only count ngrams that are in the feature set
auto result = features.find(gram);
if (result != features.end()){
switch (countType){
case SEED:
features[gram].seedCount++;
break;
case MINED: {
// for idf calculations - count the number of sentences in which each feature appears
auto added = featuresAdded.find(gram);
if (added == featuresAdded.end()){
featuresAdded.insert(gram);
features[gram].idf++;
}
features[gram].minedCount++;
Feature* feature = &(features[gram]);
sentence.incCount(feature);
break;
}
}
}
}
// add to the available sentences iff (if and only if) the sentence contains at least one
// feature from the feature set, b/c otherwise it could never be chosen
if (countType == MINED && sentence.counts.size() > 0){
sentence.index = lineIndex;
sentence.length = tokens.size();
sentences.push_back(sentence);
}
lineIndex++;
}
corpusFile.close();
}
umap parseCorpora(const std::string &seedFileName,
const std::string &minedFileName,
sentvec &sentences,
bool quiet){
// don't care what the word ids map to after counting is done,
// so let it go out of scope and be destroyed
wordmap wordids;
umap features = getFeatureSet(seedFileName, minedFileName, wordids, quiet);
if (!quiet) std::cerr << "Counting features in " << seedFileName << "\n";
countFeatures(features, seedFileName, sentences, SEED, wordids);
if (!quiet) std::cerr << "Counting features in " << minedFileName << "\n";
countFeatures(features, minedFileName, sentences, MINED, wordids);
return features;
}
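// compute each feature's weight, sqrt(seedCount / minedCount), and its idf, ln(N / df), where N is
// the number of available mined sentences and df is the number of mined sentences containing the feature
// e.g. (illustrative numbers): seedCount = 4, minedCount = 16 gives weight sqrt(0.25) = 0.5;
// N = 1000 and df = 100 gives idf = ln(10) ~= 2.3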
void calculateFeatureStats(umap &features, uint sentenceCount){
for (auto feature = features.begin(); feature != features.end(); ++feature){
feature->second.weight = sqrt((double)feature->second.seedCount / (double)feature->second.minedCount);
feature->second.idf = log(sentenceCount/feature->second.idf);
}
}
int main(int argc, char const *argv[]) {
namespace po = boost::program_options;
std::string seedFileName;
std::string minedFileName;
uint budget;
bool quiet = false;
po::options_description desc("Sentence filter based on the paper\n\"Submodularity for Data Selection in Statistical Machine Translation\"\nby Katrin Kirchhoff and Jeff Bilmes\nOptions");
desc.add_options()
("help,h", "Prints help message")
("seed,s", po::value<std::string>(&seedFileName)->required(), "The seed corpus (must be cleaned first)")
("mined,m", po::value<std::string>(&minedFileName)->required(), "The unfiltered corpus (must be cleaned first)")
("budget,b", po::value<uint>(&budget)->required(), "Once the cost of the chosen sentences exceeds the budget, the algorithm stops.")
("quiet,q", po::bool_switch(&quiet), "Don't print status messages")
;
po::positional_options_description pos;
pos.add("seed", 1);
pos.add("mined", 1);
pos.add("budget", 1);
po::variables_map vm;
try {
po::store(po::command_line_parser(argc, argv).options(desc).positional(pos).run(), vm);
if (vm.count("help")) {
std::cout << desc << std::endl;
return 0;
}
po::notify(vm); // throws on error, so do after help in case there are any problems
} catch (boost::program_options::error& e) {
std::cerr << "ERROR: " << e.what() << std::endl << std::endl;
std::cerr << desc << std::endl;
return 1;
}
// initial calculations
sentvec sentences = sentvec();
if (!quiet) std::cerr << "Parsing corpora\n";
umap features = parseCorpora(seedFileName, minedFileName, sentences, quiet);
if (!quiet) std::cerr << "Calculating feature weights and IDFs\n";
calculateFeatureStats(features, sentences.size());
// sentence selection
// first, initialize the marginal gain for each sentence
if (!quiet) std::cerr << "Calculating initial marginal gains\n";
for (sentindex i = 0; i < sentences.size(); ++i)
evaluate(features, sentences[i], 0);
sentqueue availableSentences = sentqueue(std::less<Sentence>(), std::move(sentences));
// NOTE: the use of the priority queue is a time complexity optimization
// it takes advantage of the fact that the marginal gain from the submodular
// function is non-increasing, so we only have to find a sentence whose
// new marginal gain is >= upper bound of previously calculated marginal gains.
// the upper bound is obtained from the top element in the priority queue since
// the greatest element is always at the top and the marginal gain can never
// increase, which means the max previously calculated marginal gain is always >=
// the max marginal gain given updated values for each sentence
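// e.g. (illustrative): if the top sentence's stale gain of 5.0 drops to 3.0 when re-evaluated,
// it is pushed back only if some other sentence's stale upper bound exceeds 3.0; otherwise it
// must still be the best choice and is selected without re-evaluating the rest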
if (!quiet) std::cerr << "Choosing sentences\n";
std::vector<sentindex> chosen;
double prevSubmodValue = 0;
sentindex i = 0;
Cost totalCost = 0;
while (totalCost < budget){
// handle edge cases to avoid errors being thrown
if (availableSentences.size() == 1){
chosen.push_back(availableSentences.top().index);
availableSentences.pop();
}
if (availableSentences.size() == 0){
if (!quiet) std::cerr << "\nNo more sentences left\n";
break;
}
// note: due to priority_queue's stl implementation, there's always going to be an unnecessary
// object copy here b/c pop() returns void and top() returns a const reference so std::move()
// cannot be used
Sentence sentence = availableSentences.top();
availableSentences.pop();
std::pair<double,ufmap> result = evaluate(features, sentence, prevSubmodValue);
while (sentence.maxMarginalGain < availableSentences.top().maxMarginalGain){
availableSentences.push(std::move(sentence));
// the same unnecessary object copy happens on the line below
sentence = availableSentences.top();
availableSentences.pop();
result = evaluate(features, sentence, prevSubmodValue);
}
prevSubmodValue = result.first;
totalCost += cost(sentence);
updateFeatures(features, result.second);
chosen.push_back(sentence.index);
i++;
if (!quiet) // display progress if requested
std::cerr << "\r" << (int)(100*(double)(totalCost)/(double)budget) << "% done - " << i << " sentences chosen";
}
if (!quiet) std::cerr << "\n"; // b/c of progress display
// sort chosen sentence indices so that we only have to read over the corpus once
std::sort(chosen.begin(),chosen.end());
// output sentences
std::ifstream mined(minedFileName);
std::string line;
getline(mined,line); // read the first line so that `line` corresponds to index 0
sentindex current = 0;
for (auto next : chosen){
while (current < next){
getline(mined,line);
current++;
}
std::cout << line << "\n";
}
mined.close();
return 0;
}