Assignment 2
in the course Mathematical Linguistics II
Stockholm University, March 2004
Simon Clematide (siclemat AT ifi.unizh.ch)
Task
Implement a simple bigram tagger in Perl using the Viterbi algorithm. Integrate your own simple but hopefully effective treatment of unknown words and unseen bigrams.
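As an illustration of the decoding step only, here is a minimal Viterbi sketch for a bigram tagger. It is written in Python rather than Perl and is not meant as the expected solution. The function name `viterbi`, the sentence-start symbol `<s>`, the dictionary layout of `trans` and `emit`, and the fixed log-probability floor for unseen events are all assumptions made for this example; the floor in particular is only a crude placeholder for a proper treatment of unknown words and unseen bigrams.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Return the most probable tag sequence for `words` under a bigram model.

    trans[(tag1, tag2)] and emit[(tag, word)] hold log probabilities.
    Missing entries fall back to a fixed floor (an assumption of this sketch).
    """
    if not words:
        return []
    floor = math.log(1e-6)  # assumed penalty for unseen bigrams / unknown words

    # best[i][t] = (log score of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{}]
    for t in tags:
        score = trans.get((start, t), floor) + emit.get((t, words[0]), floor)
        best[0][t] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            # Pick the predecessor tag that maximizes path score + transition.
            prev, score = max(
                ((p, best[i - 1][p][0] + trans.get((p, t), floor)) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + emit.get((t, words[i]), floor), prev)

    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```

Working in log space avoids numerical underflow on long sentences; the table has one cell per position and tag, so the running time is proportional to the sentence length times the square of the tagset size.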
Resources
Deliverables
- 2 programs
- bitag.perl (Bigram Tagger)
- $ bitag.perl para.12 para.lex raw.txt > tagged.txt
- Format: raw.txt contains verticalized text (one token per line) in paragraph mode: sentences are separated by an empty line. tagged.txt contains the raw text with the most probable tag sequence in a second column. The formats of para.12 and para.lex are described below.
- bitagpara.perl (estimation and compilation of the parameter data) [could be done in Prolog if you want]
- $ bitagpara.perl corpus.txt
- Format of corpus.txt: Word TAB Tag NL (see sample.txt)
- Creates two files:
- corpus.12 with tag unigram and bigram counts. Each unigram line comes first and is followed by the counts of the bigrams that have this unigram tag as their first element. In the bigram lines the first tag is omitted. (Therefore, bigram lines start with TAB.)
- Format of unigram lines: TAG TAB UNIGRAMCOUNT
- Format of bigram lines: TAB TAG2 TAB BIGRAMCOUNT
- This format should make it as easy as possible to read in the data and compute the corresponding probabilities from it.
NN	30147
	NN	3546
	IN	7468
...
- corpus.lex with word-tag counts
- Format: WORD TAB TOTALCOUNT TAB TAG1 TAB TAG1COUNT TAB TAG2 TAB TAG2COUNT TAB ... TAB LASTTAG TAB LASTTAGCOUNT NL
enforces 1 VBZ 1
engage 4 VB 3 VBP 1
engaged 7 VBD 3 VBN 4
engaging 1 VBG 1
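To make the two file formats concrete, here is a small sketch of the counting and serialization steps, again in Python for illustration only and not as the Perl deliverable. All function names (`count_corpus`, `format_12`, `format_lex`, `read_12`) are hypothetical. The sketch assumes that corpus.txt separates sentences with an empty line and that bigrams are not counted across sentence boundaries; neither point is fixed by the format description above. `read_12` uses plain relative frequencies without any smoothing.

```python
from collections import Counter, defaultdict

def count_corpus(lines):
    """Count tag unigrams, tag bigrams and word-tag pairs from 'Word TAB Tag' lines."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)  # bigrams[tag1][tag2]
    lexicon = defaultdict(Counter)  # lexicon[word][tag]
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:                # assumed: empty line marks a sentence boundary
            prev = None
            continue
        word, tag = line.split("\t")
        unigrams[tag] += 1
        lexicon[word][tag] += 1
        if prev is not None:
            bigrams[prev][tag] += 1
        prev = tag
    return unigrams, bigrams, lexicon

def format_12(unigrams, bigrams):
    """Unigram line 'TAG TAB COUNT', then its bigram lines 'TAB TAG2 TAB COUNT'."""
    out = []
    for tag in sorted(unigrams):
        out.append(f"{tag}\t{unigrams[tag]}")
        following = bigrams.get(tag, {})
        for tag2 in sorted(following):
            out.append(f"\t{tag2}\t{following[tag2]}")
    return "\n".join(out) + "\n"

def format_lex(lexicon):
    """One line per word: WORD, total count, then TAG/COUNT pairs, TAB-separated."""
    out = []
    for word in sorted(lexicon):
        fields = [word, str(sum(lexicon[word].values()))]
        for tag in sorted(lexicon[word]):
            fields += [tag, str(lexicon[word][tag])]
        out.append("\t".join(fields))
    return "\n".join(out) + "\n"

def read_12(text):
    """Read the .12 format back; estimate P(tag2 | tag1) = c(tag1, tag2) / c(tag1)."""
    unigrams, trans = {}, {}
    current = None
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("\t"):   # bigram line: belongs to the last unigram seen
            _, tag2, count = line.split("\t")
            trans[(current, tag2)] = int(count) / unigrams[current]
        else:                       # unigram line
            current, count = line.split("\t")
            unigrams[current] = int(count)
    return unigrams, trans
```

Because every bigram line directly follows the unigram line of its first tag, the reader needs only a single pass and one variable for the current tag, which is presumably what the format is designed for.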
Documentation
Deadline
- Due on 8th of April 2004.