Assignment 2

in the course Mathematical Linguistics II

Stockholm University, March 2004

Simon Clematide (siclemat AT ifi.unizh.ch)

Task

Implement a simple bigram tagger in Perl using the Viterbi algorithm. Integrate your own simple but hopefully effective treatment of unknown words and unseen bigrams.
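As an orientation for the task, the Viterbi recursion for a bigram tagger can be sketched as follows. The sketch is written in Python for illustration only; the deliverable itself is a Perl program. The function name `viterbi`, the start symbol `<s>`, the dictionary layout of `trans` and `emit`, and the `floor` value used for missing entries are all assumptions of this sketch and are not prescribed by the assignment. The floor is merely a placeholder for a real treatment of unknown words and unseen bigrams.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>", floor=1e-12):
    """Return the most probable tag sequence for `words` under a bigram model.

    trans[(t1, t2)] is P(t2 | t1); emit[(word, tag)] is P(word | tag).
    Missing entries fall back to `floor` (a placeholder, not real smoothing).
    """
    if not words:
        return []

    def lp(table, key):
        # Log probability, so that long sentences do not underflow.
        return math.log(table.get(key, floor))

    # best[i][t]: log probability of the best tag path ending in t at position i
    best = [{t: lp(trans, (start, t)) + lp(emit, (words[0], t)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Best predecessor tag for t at position i.
            prev = max(tags, key=lambda p: best[i - 1][p] + lp(trans, (p, t)))
            best[i][t] = (best[i - 1][prev] + lp(trans, (prev, t))
                          + lp(emit, (words[i], t)))
            back[i][t] = prev

    # Follow the back pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Working in log space replaces the products of the recursion by sums; the back pointers are what distinguishes this from the forward algorithm, which sums over predecessors instead of taking the maximum.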

Resources

Deliverables

Two programs:
- bitag.perl (Bigram Tagger)
  - $ bitag.perl para.12 para.lex raw.txt > tagged.txt
  - Format: raw.txt contains verticalized text (one token per line) in paragraph mode, that is, sentences are separated by an empty line. tagged.txt contains the raw text with the most probable tag sequence in a second column. The format of para.12 and para.lex is described below.
- bitagpara.perl (compilation of the parameter estimation data) [could be done in Prolog if you want]
  - $ bitagpara.perl corpus.txt
  - Format of corpus.txt: Word TAB Tag NL (see sample.txt)
  - Creates two files:
    - corpus.12 with tag unigram and bigram counts. Each unigram line comes first, followed by the counts of the bigrams that have this unigram tag as their first element. On bigram lines, the first tag is omitted. (Therefore, bigram lines start with TAB.)
      - Format of unigram lines: TAG TAB UNIGRAMCOUNT
      - Format of bigram lines: TAB TAG2 TAB BIGRAMCOUNT
      - This format should make it as easy as possible to read in the data and derive the corresponding probabilities from it.

NN	30147
	NN	3546
	IN	7468
	...
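Reading this format back in could look roughly like the sketch below (Python for illustration; the tagger itself is to be written in Perl). The function names `read_tag_counts` and `transition_probabilities` are hypothetical. The sketch also assumes that the unigram count is used as the denominator of the relative-frequency estimate and that the elided lines ("...") are not literally present in the file.

```python
def read_tag_counts(lines):
    """Parse corpus.12-style lines into unigram and bigram count tables."""
    unigrams, bigrams = {}, {}
    current = None  # tag of the most recent unigram line
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("\t"):
            # Bigram line: TAB TAG2 TAB BIGRAMCOUNT; the first tag is `current`.
            _, tag2, count = line.split("\t")
            bigrams[(current, tag2)] = int(count)
        else:
            # Unigram line: TAG TAB UNIGRAMCOUNT
            tag, count = line.split("\t")
            unigrams[tag] = int(count)
            current = tag
    return unigrams, bigrams

def transition_probabilities(unigrams, bigrams):
    """Relative-frequency estimate P(t2 | t1) = count(t1, t2) / count(t1)."""
    return {(t1, t2): c / unigrams[t1] for (t1, t2), c in bigrams.items()}
```

Because every bigram line inherits its first tag from the preceding unigram line, a single pass with one remembered variable is enough to rebuild both tables.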

    - corpus.lex with word-tag counts
      - Format: WORD TAB TOTALCOUNT TAB TAG1 TAB TAG1COUNT TAB TAG2 TAB TAG2COUNT TAB ... TAB LASTTAG TAB LASTTAGCOUNT NL

enforces	1	VBZ	1
engage	4	VB	3	VBP	1
engaged	7	VBD	3	VBN	4
engaging	1	VBG	1
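The counting step behind both output files might be sketched as follows (Python for illustration; bitagpara.perl itself is to be written in Perl or Prolog). The names `count_corpus` and `lex_lines` are hypothetical. The sketch assumes that corpus.txt separates sentences by an empty line, as raw.txt does, and that no bigram is counted across a sentence boundary; neither point is stated in the format description. Sorting words and tags alphabetically is likewise only a choice made here.

```python
from collections import Counter, defaultdict

def count_corpus(lines):
    """Count word-tag pairs, tag unigrams and tag bigrams from Word TAB Tag lines."""
    lexicon = defaultdict(Counter)
    unigrams, bigrams = Counter(), Counter()
    prev = None  # tag of the previous token in the same sentence
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            prev = None  # assumed sentence boundary
            continue
        word, tag = line.split("\t")
        lexicon[word][tag] += 1
        unigrams[tag] += 1
        if prev is not None:
            bigrams[(prev, tag)] += 1
        prev = tag
    return lexicon, unigrams, bigrams

def lex_lines(lexicon):
    """Format the lexicon as WORD TAB TOTALCOUNT TAB TAG1 TAB TAG1COUNT ... lines."""
    out = []
    for word in sorted(lexicon):
        fields = [word, str(sum(lexicon[word].values()))]
        for tag in sorted(lexicon[word]):
            fields += [tag, str(lexicon[word][tag])]
        out.append("\t".join(fields))
    return out
```

The total count per word is redundant, since it is the sum of the tag counts, but having it in the second column spares the tagger a summation when it reads the file.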

Documentation

Pseudo-code for the Viterbi algorithm in the style of slide 29. The corrected text version of the forward algorithm on the web is usable as a starting point; see also the Perl version of this code.
A short description of how you handled unknown words and unseen bigrams.
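To illustrate what such a treatment could look like, here is one deliberately simple possibility, sketched in Python. It is not the expected solution. The function names and table layouts are hypothetical, add-one smoothing for the tag bigrams is just one of several common options, and giving an unknown word the same emission value for every tag (so that only the tag context decides) is an assumption of this sketch.

```python
def transition(bigrams, unigrams, t1, t2):
    """Add-one estimate of P(t2 | t1): unseen bigrams keep a small nonzero value."""
    seen = bigrams.get((t1, t2), 0)
    return (seen + 1) / (unigrams[t1] + len(unigrams))

def emission(lexicon, unigrams, word, tag):
    """Estimate P(word | tag) from the word-tag counts.

    Unknown words get the same value for every tag (an assumption of this
    sketch), so their tag is chosen by the transition probabilities alone.
    """
    if word not in lexicon:
        return 1 / sum(unigrams.values())
    return lexicon[word].get(tag, 0) / unigrams[tag]
```

With counts `unigrams = {"NN": 4, "VB": 2}` and `bigrams = {("NN", "VB"): 2}`, the seen bigram gets (2 + 1) / (4 + 2) and the unseen bigram VB NN gets (0 + 1) / (2 + 2). A known word with a tag it never occurred with still yields zero here, which a tagger working with logarithms has to handle separately.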

Deadline

Due on 8th of April 2004.