Tutorial: Extracting a Lexicon

This section provides a short tutorial on extracting a lexicon from the reference list and using the result to help in processing an associated text. The pipeline in runbibtutorial contains an example which implements the points discussed here.

This approach is potentially useful because identifying proper names can be tricky, and so far the grammars have done this based solely on the character-level form of the tokens, with capitalisation being the main restriction. Furthermore, as noted in the discussion of the citation rules in $TTT/GRAM/sgml/citationrules.gr, syntactic citations in which the dates are not bracketed are not covered by the grammar: we allow "Jones (1990)" but not "Jones, 1990". Given that the latter is an acceptable form of citation, the grammar is clearly too restrictive.

It is a simple matter to relax the restriction, of course - the simplest method is just to edit the "bracketed_date" rule to make the brackets optional. So we just change this:

<RULE	name="bracketed_date" targ="&A-VAL; &B-REW; &C-VAL;">
  <REL	            match="&LPAR;" var="A"> </REL>
  <REL	type="REF"  match="simple_date_extent" m_mod="PLUS" var="B"> </REL>
  <REL	            match="&RPAR;" var="C"> </REL>
</RULE>

to this:

<RULE	name="bracketed_date" targ="&A-VAL; &B-REW; &C-VAL;">
  <REL	            match="&LPAR;" m_mod="QUEST" var="A"> </REL>
  <REL	type="REF"  match="simple_date_extent" m_mod="PLUS" var="B"> </REL>
  <REL	            match="&RPAR;" m_mod="QUEST" var="C"> </REL>
</RULE>

The $TTT/GRAM/sgml/citationrules.gr rule file contains both versions of this rule, with the latter commented out. The problem is that the relaxed version is likely to produce lots of false hits. As noted, the basic description of a proper name is simply that it is a capitalised word, so we will get matches in all the examples below:

   In 1990, it rained a lot.
   There was no sun in August 1998.
   Apparently 1940 was a good summer.

The program will identify "In 1990", "August 1998", and "Apparently 1940" as citations. One obvious way to approach this problem is to include general rules about the possible parts of speech of a word, and another is to use an existing proper name list (or both, of course). Another option that we have in the present case is to process the reference list associated with a document, extract the author names, and use the result to identify the citations precisely.
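The false hits are easy to reproduce outside the grammar formalism. The following is a minimal Python sketch, not the TTT grammar itself: the regular expression NAIVE_CITATION and the function naive_hits are hypothetical stand-ins for "a capitalised word followed by an optionally bracketed year".

```python
import re

# Hypothetical approximation of the relaxed citation rule: a capitalised
# word, an optional comma, and a four-digit year with optional brackets.
NAIVE_CITATION = re.compile(r"\b[A-Z][a-z]+,? \(?\d{4}\)?")


def naive_hits(text):
    """Return every substring that looks like 'Name 1990' or 'Name (1990)'."""
    return NAIVE_CITATION.findall(text)


sentences = [
    "In 1990, it rained a lot.",
    "There was no sun in August 1998.",
    "Apparently 1940 was a good summer.",
    "This was shown by Jones, 1990 and Jones (1990).",
]

for sentence in sentences:
    print(naive_hits(sentence))
```

The first three sentences each yield one spurious match, while the last yields the two genuine citations; nothing in the character-level form distinguishes the two cases.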

The first stage in the suggested process is handled by simply running the reference list rules described earlier. In this case, however, we assume that we have a single file (see $TTT/EGS/sgml/bibtutorial) which contains both the text and the reference list. The text is contained in TEXT tags and the reference list inside REFERENCES. This means that the initial stage in the processes we have already seen - using the script in $TTT/SCRIPTS/bibplain2xml to construct a basic XML file - is not required.

The first part of the pipeline marks up the references as before, although this time we explicitly look for the refs inside the REFERENCES tag:

cat $argv[1] \
| bin/fsgmatch -q ".*/REFERENCES" GRAM/char/bibparas.gr \
| SCRIPTS/openangle.perl \
| bin/fsgmatch -q ".*/P" GRAM/char/bibwords.gr \
| SCRIPTS/openangle.perl \
| bin/fsgmatch -q ".*/P" GRAM/sgml/pubrules.gr \
| bin/fsgmatch -max_pos 100 -q ".*/P" GRAM/sgml/refrules.gr \
| bin/sgdelmarkup -q ".*/W" > $argv[1].rmu

This is very similar to runbiblio, except that the name of the input file is taken from the first argument to the script ($argv[1]) and the output is simplified (using sgdelmarkup) before being redirected to a file which has the same name as the input plus the extension "rmu". The references section of this file will be in the following format:

<REF>
   <AUTHOR>
     <NAME><SURNAME>Abelson</SURNAME>, <INVERTED>D.</INVERTED></NAME>
   ,</AUTHOR> 
   (<DATE>1990</DATE>). 
    <TITLE>Preferential, cooperative binding of topoisomerase II
          to scaffold associated regions.
    </TITLE> 
    <JOURNAL><JNAME>EMBO J.</JNAME> 
             <VOLUME>8</VOLUME>
             <RANGE>3997-4006</RANGE>
    </JOURNAL>.
</REF>

<REF>
   <AUTHOR>
     <NAME><SURNAME>Baader</SURNAME>, <INVERTED>C.</INVERTED></NAME>,
     <NAME><SURNAME>Cabelli</SURNAME>, <INVERTED>H.F.</INVERTED></NAME>
   </AUTHOR> 
  [<DATE>1990</DATE>]. 
   <TITLE>Chromosome assembly in vitro: topoisomerase II is
         required for condensation.</TITLE>
   <JOURNAL><JNAME>Cell</JNAME> 
            <VOLUME>64</VOLUME> 
            <RANGE>137-148</RANGE>
   </JOURNAL>.
</REF>

We want to construct a lexicon using the names in these structures, of course, so from the two reference list items above we want to form the following lexicon:

Abelson   SURNAME
Baader    SURNAME
Cabelli   SURNAME

The simplest way to do this in the present context is to use the sgmlperl program to identify the elements in question and print them out in the required format. For the examples above, the following sgmlperl rule is all that is required:

<rule query=".*/SURNAME/#">
   print "$_  SURNAME\n"
</rule>

The query is in the standard XML form: we're looking for the string which is the value of the path .*/SURNAME, and the Perl part of the rule then simply prints the string followed by "SURNAME" and a linefeed. Note, by the way, that rather than use sgdelmarkup to remove the word-level markup, we could have got to the necessary strings by making the query .*/SURNAME/W/#.
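For readers without sgmlperl to hand, the same extraction can be approximated with a generic XML parser. This is only an illustrative sketch in Python: the function name surname_entries and the SAMPLE string are our own, the sample is a cut-down version of the references shown earlier, and the sketch assumes that the word-level markup has already been removed.

```python
import xml.etree.ElementTree as ET

# Cut-down sample of the marked-up reference list (illustrative only).
SAMPLE = """<REFERENCES>
<REF><AUTHOR><NAME><SURNAME>Abelson</SURNAME>, <INVERTED>D.</INVERTED></NAME>,</AUTHOR>
(<DATE>1990</DATE>).</REF>
<REF><AUTHOR><NAME><SURNAME>Baader</SURNAME>, <INVERTED>C.</INVERTED></NAME>,
<NAME><SURNAME>Cabelli</SURNAME>, <INVERTED>H.F.</INVERTED></NAME></AUTHOR>
[<DATE>1990</DATE>].</REF>
</REFERENCES>"""


def surname_entries(xml_text):
    """Return one 'Name  SURNAME' lexicon line per SURNAME element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        f"{element.text.strip()}  SURNAME"
        for element in root.iter("SURNAME")
        if element.text and element.text.strip()
    ]


for entry in surname_entries(SAMPLE):
    print(entry)
```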

This rule is not quite enough to handle names in the current context, however. Some names have internal punctuation, specifically quotes and hyphens, as in "O'Brien" and "Stainton-Ellis". The grammar in $TTT/GRAM/sgml/refrules.gr assumes that the quote and the hyphen have been identified as separate words, and there are specific rules in the grammar which deal with names in this form. To make lexical entries using names of this type, we must list the parts of the name separately and ensure that the entry is phrasal, so we will actually need the following representations:

O ' Brien          :: SURNAME
Stainton - Ellis   :: SURNAME

It is fairly straightforward to get sgmlperl to produce the required forms. The following rule will suffice:

<rule query=".*/SURNAME/#">
   s%'% ' %;
   s%-% - %;
   if    ($_=~/\-|\'/) {print "$_  :: SURNAME\n"}
   else  {print "$_  SURNAME\n"}
</rule>

The query is the same as before, but now the Perl script adds spaces around quotes and hyphens and then prints out phrasal entries if the string contains either. The default is a `normal' lexical entry as before. This rule is provided in the file $TTT/SCRIPTS/sgmlperlrule. To run it under Unix, we simply need the command sgmlperl sgmlperlrule INFILE, where INFILE would contain the marked-up reference list. In the current case, we shall run the program on the intermediate file which contains the marked-up reference list, of course, and save the lexicon in $TTT/LEX/names.lex. The pipeline in runbibtutorial thus continues with:

cat $argv[1].rmu \
| bin/sgmlperl SCRIPTS/sgmlperlrule \
| sort -u -o LEX/names.lex

Here we have piped the intermediate file through sgmlperl, using the rule described above. The result is then sorted and `uniqed' (to remove duplicates), and output to the lexicon directory.
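The lexicon-building step as a whole (spacing out the punctuation, choosing between phrasal and normal entries, then sorting and removing duplicates) can be sketched in Python as follows. The function names lexicon_entry and build_lexicon are hypothetical. Note two assumptions: the sketch spaces out every quote and hyphen, whereas the Perl substitutions as written affect only the first occurrence of each, and Python's sorted orders by code point, which may differ from the locale-dependent order produced by sort.

```python
import re


def lexicon_entry(surname):
    """Format one surname as a lexicon line.

    Names containing a quote or hyphen become phrasal entries
    ('O ' Brien  :: SURNAME'); all others are normal entries.
    """
    # Put a space on either side of every quote and hyphen.
    spaced = re.sub(r"(['-])", r" \1 ", surname)
    if spaced != surname:
        return f"{spaced}  :: SURNAME"
    return f"{surname}  SURNAME"


def build_lexicon(surnames):
    """Return the sorted, duplicate-free list of lexicon lines."""
    return sorted({lexicon_entry(name) for name in surnames})


for line in build_lexicon(["Baader", "O'Brien", "Abelson", "Stainton-Ellis", "Baader"]):
    print(line)
```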

All that remains is to ensure that the citation rules use the information in the new lexicon. The citation rule file $TTT/GRAM/sgml/citationrules.gr contains everything necessary with the lexicon definition and the two relevant rules commented out. The first alteration, therefore, is to remove the comments around the pointer to the new lexicon, to leave:

<LEX    type="PHRASE"
	file_name="LEX/names.lex"
	alias="name_lex">
</LEX>

The "simple_surname" rule should now just perform a lexical look-up, so all we need is the version of this which is commented out in $TTT/GRAM/sgml/citationrules.gr:

<RULE   name="simple_surname" type="PHRASE" targ="&S-VAL;">
  <REL  match="W">
    <CONSTR check_in="name_lex" check_tags="SURNAME" > </CONSTR>
  </REL>
</RULE>

Note that in this case we want the string to match exactly, so the constraint on the lexical look-up doesn't allow different case matches (all the previous constraints that we have seen in the context of bibliographic material included a check_mod="LOWERCASE" specification). Finally, we just need to use the version of "bracketed_date" in which the brackets are optional:

<RULE	name="bracketed_date" targ="&A-VAL; &B-REW; &C-VAL;">
  <REL	            match="&LPAR;" m_mod="QUEST" var="A"> </REL>
  <REL	type="REF"  match="simple_date_extent" m_mod="PLUS" var="B"> </REL>
  <REL	            match="&RPAR;" m_mod="QUEST" var="C"> </REL>
</RULE>

Now the citation rules will only find citations in which the name has been previously listed, but they will match things like "Arkwright 1998" which don't have brackets around the date. We have included a citation grammar in $TTT/GRAM/sgml/bibtutorialrules.gr which loads the lexicon and uses the alternative version of the rules, and so the final part of the pipeline in runbibtutorial is:

cat $argv[1].rmu \
| bin/fsgmatch -q ".*/TEXT" GRAM/char/bibparas.gr \
| SCRIPTS/openangle.perl \
| bin/fsgmatch -q ".*/P" GRAM/char/bibwords.gr \
| SCRIPTS/openangle.perl \
| bin/fsgmatch -q ".*/P" GRAM/sgml/bibtutorialrules.gr \
| sgmltrans -r OUTPUT/SCRIPTS/bibtrans \
| OUTPUT/SCRIPTS/bibtrans2html.perl

The file with the marked-up references is processed again, almost exactly as in runcitations, but using the new grammar. To run the tutorial example described here, we need a slightly different command:

runbibtutorial EGS/sgml/bibtutorial > OUTPUT/HTML/tutorial.html

This will process the references and text, as described, and redirect the complete marked-up output to OUTPUT/HTML/tutorial.html. Looking at the actual output, it is clear that some citations not captured by the previous grammar are now marked up correctly.
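The effect of constraining the surname by lexical look-up can be illustrated with a small Python sketch. It is an approximation under stated assumptions, not the fsgmatch implementation: CANDIDATE, load_surnames and find_citations are hypothetical names, the look-up is an exact, case-sensitive match, and phrasal entries (those containing "::") are ignored for simplicity.

```python
import re

# Candidate citation: capitalised word, optional comma, optionally bracketed year.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+),? (\(?\d{4}\)?)")


def load_surnames(lexicon_lines):
    """Collect single-word SURNAME entries; phrasal entries are skipped in this sketch."""
    names = set()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) == 2 and fields[1] == "SURNAME":
            names.add(fields[0])
    return names


def find_citations(text, surnames):
    """Keep only candidates whose name matches a lexicon entry exactly."""
    return [
        match.group(0)
        for match in CANDIDATE.finditer(text)
        if match.group(1) in surnames
    ]


lexicon = ["Arkwright   SURNAME", "Jones   SURNAME", "O ' Brien  :: SURNAME"]
surnames = load_surnames(lexicon)
print(find_citations("In 1990, Arkwright 1998 and Jones (1990) disagreed.", surnames))
```

Here "In 1990" is still a candidate, but it is rejected because "In" is not in the lexicon, while "Arkwright 1998" is accepted despite the missing brackets.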