TTT: Text Tokenisation Tool

Chapter 9. Finding and Structuring Bibliographic Information

Table of Contents
Converting Plain Text to XML
Character-Level Processing
Reference Lists
Processing the Input in Stages
Grammars for Reference Lists
Publication Information
In-Text Citations
Pipelines
Tutorial: Extracting a Lexicon

This chapter discusses the use of the TTT system to parse reference lists and find in-text citations. We describe three main grammars ($TTT/GRAM/sgml/refrules.gr and $TTT/GRAM/sgml/pubrules.gr for reference lists, and $TTT/GRAM/sgml/citationrules.gr for citations in running text). These capture the basic structure of the data and are meant to illustrate the potential of the system; they are not intended to provide comprehensive coverage of bibliographic material, and places where extensions would be necessary are pointed out below. However, we do claim that the general approach is viable for large-coverage systems, and we provide a tutorial on the staged process of extracting information from the reference list in order to make the subsequent text processing more accurate. We believe this methodology is potentially very fruitful for many kinds of large-scale markup tasks.

We assume that there are two main types of bibliographic information in documents: the reference list (or `bibliography') which usually appears at the end of the text, and the in-text citations which normally point to items in the reference list. Typically, then, a reference list looks something like this:

Abelson, D., (1990). Preferential, cooperative binding of topoisomerase II
to scaffold associated regions. EMBO J. 8 3997-4006.

Cabelli, H.F., 1990.  Promoter occlusion: transcription through a
promoter may inhibit its activity.  Cell 29 939-944.

van Dijk, D., (1990). Regulation of the higher-order structure of
chromatin by histones H1 and H5. J. Cell Biol. 90 279-288.

and so on. The order and presentation of the material vary fairly widely, of course, depending to some extent on publishers' individual conventions. Nevertheless, there are many common factors which make it viable to use grammars to describe the material, as we hope to illustrate. Note that the examples used here are drawn largely from a real reference list which was provided for the BibEdit project (Matheson and Dale 1993) by Harcourt Brace Jovanovich. Occasionally the material is edited to illustrate particular points, but in general the data is presented in the form in which it was sent to the publishers, before being copy edited.

The other form of bibliographic material, the in-text citation, comes in two main forms: `syntactic' and `parenthetic'. A syntactic citation is part of the sentence which contains it, as in the examples below:

This is argued by Abelson (1990) and others, and Jones (1987) further
claims that .....

Parenthetic citations, on the other hand, take the form of parenthetic comments, for instance:

This has often been claimed (Abelson [1990]; Jones [1987]), and the
data suggest that ....

The distinction is useful in that publishers typically insist on different forms for the two types.
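The distinction can be illustrated with a small Python sketch. This is not the grammar in $TTT/GRAM/sgml/citationrules.gr; the patterns, the particle list and the function name find_citations are assumptions made purely for illustration, and they cover only single-surname citations with a four-digit year in round or square brackets.

```python
import re

# Hypothetical patterns for illustration only; not the rules in citationrules.gr.
# A four-digit year in round or square brackets, e.g. (1990) or [1990].
YEAR = r"[\(\[]\d{4}[\)\]]"
# A capitalised surname, optionally preceded by one of a few name particles.
NAME = r"\b(?:(?:van|von|de) )?[A-Z][A-Za-z]+"

# Parenthetic: a bracketed group of "Name year" items separated by semicolons.
PARENTHETIC = re.compile(rf"\(\s*{NAME} {YEAR}(?:\s*;\s*{NAME} {YEAR})*\s*\)")
# Syntactic: a name in running text followed directly by a bracketed year.
SYNTACTIC = re.compile(rf"{NAME} {YEAR}")


def find_citations(text):
    """Return (kind, matched_text) pairs, parenthetic groups first."""
    results = []
    covered = []  # spans already claimed by parenthetic groups
    for m in PARENTHETIC.finditer(text):
        results.append(("parenthetic", m.group(0)))
        covered.append(m.span())
    for m in SYNTACTIC.finditer(text):
        # Skip "Name year" pairs that sit inside a parenthetic group.
        if not any(s <= m.start() and m.end() <= e for s, e in covered):
            results.append(("syntactic", m.group(0)))
    return results
```

Applied to the two example sentences above, this sketch reports Abelson (1990) and Jones (1987) as two syntactic citations, and (Abelson [1990]; Jones [1987]) as a single parenthetic group.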

The first two sections below discuss the initial stages of processing bibliographic material: the basic translation into XML and the character-level grammars used. The following section looks more closely at the structure of reference list items and describes the grammars used to extract the overall structure. The next two sections do the same for the publication information and for citations, and the subsequent section looks briefly at the pipelines which are necessary to run the processes. Finally, we provide a tutorial example in which author information is extracted from the reference list as a means of making the text processing more accurate and less restrictive.

Converting Plain Text to XML

This topic has been covered in some detail already, so here we need only note two main points. Firstly, the Perl program $TTT/SCRIPTS/bibplain2xml.perl wraps texts in basic XML. It is the same as the program $TTT/SCRIPTS/plain2xml.perl discussed in Chapter 3 (The runplain pipeline), except that it refers to a different DTD, $TTT/RES/biblio.dtd,xml. This DTD contains general information on the structure of citations and reference list items.

The second point is that the input files are split into paragraphs in exactly the same way as in the other examples; in particular, reference lists are paragraphed in the same way as texts. After running the initial script and the paragraph grammar in $TTT/GRAM/char/bibparas.gr, a reference list will therefore look like this:

<?xml version='1.0'?>
<!DOCTYPE DOCS SYSTEM "RES/biblio.dtd,xml" >
<DOCS>
<TEXT>
<P>Abelson, D., (1990). Preferential, cooperative binding of topoisomerase II
to scaffold associated regions. EMBO J. 8 3997-4006.</P>

<P>Baader, C., [1990]. Chromosome assembly in vitro: topoisomerase II is
required for condensation. Cell 64 137-148.</P>
</TEXT>
</DOCS>

Angle brackets have been converted using $TTT/SCRIPTS/openangle.perl, of course. The material is now suitable for tokenisation using the character-level grammar, as described in the next section.
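
For readers who want a concrete picture of the output format, the following Python sketch produces the same kind of wrapper from plain text. It is not the TTT implementation: TTT does this with a Perl script followed by the paragraph grammar, whereas the sketch does it in one step. The function name wrap_paragraphs, the use of blank lines as paragraph separators and the use of entity escaping for angle brackets are all assumptions made for illustration.

```python
import re
from xml.sax.saxutils import escape


def wrap_paragraphs(text, dtd="RES/biblio.dtd,xml"):
    """Wrap blank-line-separated paragraphs in DOCS/TEXT/P elements.

    Hypothetical helper: mimics the shape of the output shown above,
    not the behaviour of bibplain2xml.perl or bibparas.gr.
    """
    # Assumption: one or more blank lines separate reference list items.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    lines = [
        "<?xml version='1.0'?>",
        f'<!DOCTYPE DOCS SYSTEM "{dtd}" >',
        "<DOCS>",
        "<TEXT>",
    ]
    # escape() replaces &, < and > so the paragraph text cannot break the markup.
    lines.append("\n\n".join(f"<P>{escape(p)}</P>" for p in paragraphs))
    lines += ["</TEXT>", "</DOCS>"]
    return "\n".join(lines) + "\n"
```

Calling wrap_paragraphs on a two-item reference list yields a document with two P elements inside TEXT, with line breaks inside each item preserved.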
