Centre for Computer Analysis of Language And Speech
School of Computer Studies
The University of Leeds, LEEDS LS2 9JT, Yorkshire, England
tel:0113-2335761 fax:0113-2335468 email:eric@scs.leeds.ac.uk
This tutorial has been published as Chapter 9 of Jenny Thomas and Mick Short (eds), "Using Corpora for Language Research: Studies in the Honour of Geoffrey Leech", pp151-166, Longman, Harlow. 1996.
The pioneers of modern Corpus Linguistics saw their principal application areas as linguistic theory and English language teaching. For example, (Leech et al 1983a), an early paper on the LOB Corpus tagging project, cited "the need for accurate textual evidence for statements about the language"; and (Leech 1986) discussed applications in computer-assisted language learning (CALL). Even today, the bulk of regular attenders at annual ICAME English Corpus Linguistics conferences (*footnote 1*) are from English linguistics and language teaching departments. Implicit in terms like ELT and even CALL is the assumption that the learners are humans.
With the growth of Information Technology has come growing demand for computers to process language in various ways, and this has led to a subfield of IT with a number of names: "Natural Language Processing" or "Computational Linguistics" amongst academics (see Gazdar and Mellish 1989), "Speech And Language Technology" ("SALT") to UK research funding agencies such as DTI, EPSRC and HEFCs' NTI, and "Language Engineering" to European Union funding agencies. (*footnote 2*) All SALT systems need some sort of language model, and one way to get this model into the computer system is to use a "machine learning" algorithm, with a corpus as "training data". Much of the terminology has parallels with human language learning, but there are major differences in the underlying learning process. Most significantly, human learners of English as a second or foreign language start off knowing some other Natural Language; the learning task is one of mapping between the new and known languages. Learners obviously focus on the differences between English and their native tongue, but there is a great deal of overlap between any two Natural Languages which can be taken for granted. For example, when learning the English word COMPUTER the non-native speaker naturally relates this word to the native-language translation, and has mental lexical, syntactic, semantic and pragmatic concepts and structured interrelationships already available to "hook onto". In machine learning, by contrast, the computer starts from scratch. A word like COMPUTER is simply an ASCII character sequence or string, unless and until more complex language patterns are learnt and associated with the word.
The kinds of "language patterns" which can be learnt may be quite different from those in human language processing. For SALT systems, this is not necessarily a failing: it may not be essential to mimic human language learning and processing functionality. SALT systems can achieve acceptable performance rates in their specific task without full human linguistic abilities. For example, the task of a speech recognition system is to map the input acoustic signal onto the corresponding sequence of ASCII strings; it is possible to do this surprisingly accurately without "understanding" the speech signal, using models of vocabulary and grammar which are quite different to those learnt by EFL students. While general-purpose SALT systems generally have to be robust (able to cope with variations and "ill-formedness" found in unconstrained English input), the level of linguistic analysis may be shallow or skeletal. To ensure computation in a reasonable time, it is more pragmatic to accept a minimally sufficient analysis, deliberately ignoring linguistic issues which are not essential or relevant to the application. Current successful SALT systems focus on tasks for which a robust shallow linguistic analysis suffices.
This "noisy English" representation inside a recognition system (be it for speech, handwriting or printed text input) is typically a sequence of sets of candidate-words, referred to as a word recognition lattice. (Atwell 1993a) gives the following illustration (much simplified compared to realistic systems). On `hearing' the sentence "Stephen left school last year", an English speech recognition system may produce the following lattice of word-candidates (in order of decreasing similarity to input speech signal):
stephen   stiffen   stiffens
left      lift      loft
school    scowl     scull
lest      last      least
yearn     your      year
.

In speech recognition, alternative candidates at each point are phonetically similar; in script recognition, candidates are graphemically similar. The task of detecting errors in word-processed English text can also be cast in terms of a word recognition lattice, if the system artificially "ambiguates" each word as it is typed in: (Atwell 1987a) proposed that "...this could be done by generating COHORTS for each input word, and then choosing the cohort-member word which fits the context best". If the "best" choice is not the word actually typed in, this becomes the suggested replacement for an error.
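By way of illustration, a lattice of this kind might be held inside a recognition system as something as simple as a list of candidate sets, one per word position. The sketch below is not taken from any of the systems cited; the cohort_lexicon is a hypothetical stand-in for whatever confusability resource a text-checking system might use to "ambiguate" each typed word into a cohort.

lattice = [
    ["stephen", "stiffen", "stiffens"],
    ["left", "lift", "loft"],
    ["school", "scowl", "scull"],
    ["lest", "last", "least"],
    ["yearn", "your", "year"],
    ["."],
]

def ambiguate(word, cohort_lexicon):
    # Hypothetical cohort generation for word-processed text checking:
    # return the typed word plus the confusable words listed for it.
    return [word] + cohort_lexicon.get(word, [])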
The language model's task is to find the most linguistically plausible sequence of words through the lattice. Most language models for lattice disambiguation provide only limited coverage of the linguistic knowledge available (Jelinek 1990, Young 1995). This is because the system has to search through all possible combinations of candidates; in the above example this amounts to 3*3*3*3*3 = 243 possible sentences. Analysis of recognition lattices involves traversing a much larger search space than when analysing 'known' sentences. Because of this, sophisticated language analysis systems may be too slow and unwieldy to disambiguate recognition lattices in reasonable time. For example, (Atwell 1994) found that a probabilistic context-free chart parser (based on the outline implementation in Gazdar and Mellish 1989) required impractically long computation times to discover a very large number of ambiguous analyses of even simple spoken word recognition lattices. Similarly, (Keenan 1992) reported impractically long compute times when attempting to use the Alvey Natural Language Toolkit (ANLT) chart parser (Phillips and Thompson 1987) to disambiguate handwritten word recognition lattices. There is arguably a need for language models which strike a pragmatic balance, incorporating a range of linguistic knowledge while remaining computationally practical.
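The brute-force search which makes sophisticated parsers impractical is easy to state: enumerate every path through the lattice and score each one with the language model. A minimal sketch, assuming only that some scoring function is available:

from itertools import product

def best_sequence(lattice, score):
    # Enumerate every path through the lattice (3*3*3*3*3*1 = 243 sentences
    # for the example above) and return the one the language model scores highest.
    # 'score' is any function from a word sequence to a plausibility value.
    return max(product(*lattice), key=score)

For realistic vocabularies and lattice depths this exhaustive enumeration is exactly what cannot be afforded, which is why the trade-off between linguistic coverage and computational cost matters.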
Given this wider range of training data, a wider range of linguistic information is machine-learnable, including the following:
Some of these models are well known and used by linguists and ELT researchers; others have no immediate analogy in human language learning, but may still be useful in applications like speech word-lattice traversal.
The alternative to machine learning of language models is for linguists to "hand-craft" their linguistic expertise into SALT systems, directly encoding their knowledge about language into software. As a half-way house between these two extremes, the linguist can be guided by corpus examples in building a grammar; or, starting from the other end, the results of machine learning can be improved and fine-tuned by augmenting or merging with linguists' hand-crafted resources.
Even this watered-down use of machine learning has advantages over purely hand-crafted language modelling. The main advantage is that the building of SALT language models is broken into two separate subtasks: building corpus resources, and then extracting computable models from these corpus resources. For many linguists, annotating examples in a corpus or lexical database is more straightforward and 'natural' than encoding linguistic knowledge in a computable formalism. Both subtasks yield recyclable resources: an annotated corpus has uses other than machine learning of language models (e.g. lexicography, ELT, ...); and a successful language-model-learning algorithm can in principle be re-applied to other annotated corpora to extract models of rival tagging/parsing schemes. For example, (Atwell 1988) described how a context-free phrase-structure grammar and parser were learnt from the Lancaster/Leeds Treebank (see below); essentially the same technique was used in (Souter and Atwell 1992) to glean a context-free phrase-structure grammar from the Polytechnic of Wales treebank (annotated with Systemic Functional Grammar parsetrees) and in (Atwell 1994) to extract a third context-free phrase structure grammar from the Spoken English Corpus Treebank (annotated using Lancaster/IBM skeletal grammar). Furthermore, a machine learning algorithm which works on English corpus resources should work with minimal modifications with corpora in other languages. For example, the IBM research team led by Jelinek (Jelinek 1990) has been highly successful at gleaning corpus-based language models for English speech recognition; (Gros et al 1994) reused the same machine learning techniques to learn corpus-based language models for a Slovene speech recogniser.
A concept related to learning is adaptation. It is desirable that SALT systems can automatically adapt to variations of language style, genre, accent, etc, even single-speaker variations and degradations over time. Once a basic default language model has been learnt, a machine learning system can be transformed into a user-adaptive system by adding a "feedback-loop". The system then incrementally updates the initial language model from its own analyses. For example, (Atwell et al 1988) reported on a probabilistic parser which learnt its grammar model from the Polytechnic of Wales treebank; (Hughes 1989) described a modified version of this parser, which adapted its learnt grammar to account for parses of new sentences. In contrast, grammars hand-crafted by linguists must be explicitly updated or maintained by linguists.
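The feedback loop itself can be very simple; the sketch below is an illustrative outline only (not the cited parsers' code), and assumes the learnt model reduces to a table of frequency counts which are topped up with events drawn from the system's own analyses of new sentences.

class AdaptiveModel:
    # Sketch of a feedback loop: counts learnt from a treebank are
    # incrementally updated with events (rules, tag pairs, etc.) taken
    # from the system's own analyses of new input.
    def __init__(self, initial_counts):
        self.counts = dict(initial_counts)

    def update(self, analysis_events):
        for event in analysis_events:
            self.counts[event] = self.counts.get(event, 0) + 1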
Corpus-based approaches force awareness of practical issues which are important in working, reusable SALT systems. In hand-crafting a language model, linguists can overlook issues such as punctuation and/or prosodic markers, capitalisation, neologisms or "out-of-vocabulary" unknown words, segmentation into words and sentences, etc. A machine learning system must have ways of coping with these in processing its training set, and these "low-level" routines can be re-used in the end-product SALT system. The prime example is the careful treatment of punctuation, capitalisation, etc. in the Constituent Likelihood Automatic Wordtagging System CLAWS, used to wordtag the LOB Corpus (Leech et al 1983a,b, Atwell et al 1984, Johansson et al 1986). The wordtag-pair model (also known as a wordtag N-gram or Markov model) at the heart of the CLAWS wordtag-disambiguation routine is learnt from a tagged corpus training set, and the routines for handling punctuation etc are also used in analysing new text.
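For concreteness, the following sketch shows the two halves of a CLAWS-like wordtag-pair model: learning tag-pair frequencies from a tagged corpus, and using them to choose among candidate tags for a new sentence. It is an illustrative reconstruction with crude add-one smoothing, not the CLAWS implementation itself.

from collections import defaultdict

def learn_tag_pairs(tagged_corpus):
    # tagged_corpus: a list of sentences, each a list of (word, tag) pairs.
    # Count adjacent wordtag pairs - the training step of a CLAWS-like model.
    pair_freq = defaultdict(int)
    for sentence in tagged_corpus:
        tags = [tag for _, tag in sentence]
        for t1, t2 in zip(tags, tags[1:]):
            pair_freq[(t1, t2)] += 1
    return pair_freq

def best_tag_path(candidate_tags, pair_freq):
    # candidate_tags: one list of possible tags per word in the new sentence.
    # Dynamic programming over tag-pair frequencies (add-one smoothed)
    # chooses the most likely tag for each word in context.
    best = {t: (1, [t]) for t in candidate_tags[0]}
    for column in candidate_tags[1:]:
        new_best = {}
        for t2 in column:
            scored = [(s * (pair_freq.get((t1, t2), 0) + 1), p + [t2])
                      for t1, (s, p) in best.items()]
            new_best[t2] = max(scored)
        best = new_best
    return max(best.values())[1]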
Last but not least, I find machine learning intellectually challenging because it allows researchers to explore an undiscovered country of novel models of language, quite different to those dreamt up by traditional theoretical linguists. (*footnote 3*) Some of these are illustrated below.
A rather more insidious problem is that machine-learnable syntax and semantics models tend to focus on what is learnable, and so tend to be skeletal or non-compositional; it is difficult to see how to learn models based on the more esoteric linguistic theories such as GPSG, HPSG or LFG. The tendency to take due account of punctuation, capitalisation, etc was listed above as a positive advantage of corpus-based machine learning; however, this could also be seen as undue attention to 'trivia', at the possible expense of overlooking theoretically significant linguistic issues. A related problem is that these theories aim to translate a parse-tree into a semantic or knowledge representation amenable to inferencing and 'understanding'; the kinds of machine-learnt skeletal language models currently in use do not readily deliver rich representations to pass to semantic interpretation or inferencing modules of an intelligent Natural Language Understanding system. There is a need for further research to extend machine learning algorithms to cover richer linguistic formalisms, and language processing beyond skeletal parsing and semantics. (Leech and Fligelstone 1992) noted: "A context-free phrase structure grammar is manifestly too weak a model for good results, and one direction of progress is in developing more sophisticated models, such as a probabilistic unification grammar. ... Another direction in which to advance is to integrate a corpus parser within a grander scheme for corpus analysis, including semantic analysis."
This success of word N-gram language models may seem counterintuitive and even implausible to some linguists. An analogy with computer chess may be helpful. Grand Masters become chess experts by learning thousands of specialised move-sequences and strategies; but computer chess programs generally work by "brute force" computation of all possible combinations of moves and their relative scores, within a constrained "look-ahead". The Grand Master may well perceive a "best score" move sequence predicted by a chess program as a structured sequence of moves following one or more known generalised strategies; but the chess program does not need this higher-level knowledge. The lack of "psychological validity" in the computer approach to chess is irrelevant, as long as the models used yield the required results. Both human and computer approaches to chess are suited to their respective processing capabilities. In language modelling for speech recognition, the computer must compute all possible sequences of words and their relative scores, using word sequence patterns in a training corpus as a guide. Speech recognition researchers have shown empirically that a word N-gram model can approximate the score of a word-sequence without recourse to higher-level knowledge, at least for some tasks. It is up to linguists to find speech recognition tasks where their wares are needed!
However, given that high-level linguistic knowledge MAY be useful, at least it may be worth developing some machine-learnt language models to try out. The following are some novel models of language which contrast with traditional linguistic concepts. Because of space limitations, these models can only be described here in outline; but see the references for fuller details.
One possible response is to start again from scratch, formulating computable machine learning criteria for putting words into wordclasses. For example, (Atwell 1987b, Atwell and Drakos 1987, Atwell and Elliott 1987) computed new wordclasses for words in a training corpus by clustering words with similar collocational behaviour; other researchers have experimented with other clustering and similarity metrics. However, these machine learning experiments also led to a range of incompatible results: different algorithms for clustering, and different definitions of collocational criteria, led to different word-classes. As a further response, (Hughes 1994, Hughes and Atwell 1994) proposed an algorithm for automatically evaluating a range of competing word-classifications against a specific 'target'.
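One simple version of this clustering approach is sketched below, under assumptions of my own choosing (a one-word collocation window and cosine similarity): build a co-occurrence vector for each word and measure how alike two words' collocational behaviour is. The incompatibility problem noted above arises precisely because many other windows, similarity metrics and clustering methods are equally defensible.

from collections import Counter, defaultdict
from math import sqrt

def collocation_vectors(corpus, window=1):
    # corpus: a list of word tokens. For each word, count the words seen
    # within 'window' positions of it - its collocational behaviour.
    vectors = defaultdict(Counter)
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                vectors[w][corpus[j]] += 1
    return vectors

def cosine(u, v):
    # One of many possible similarity metrics between collocation vectors;
    # words with similar vectors would be clustered into the same class.
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0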
In practice, many researchers would prefer to reuse an existing tagged corpus and corpus tagset. At present, this means they are restricted to a single corpus; but (Atwell et al 1994, Hughes et al 1995) discussed experiments with machine learning a mapping between corpus tagsets, using a Parallel Annotated Corpus as a training set.
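A crude baseline for such a tagset mapping can be learnt by counting, for each word token in the Parallel Annotated Corpus, which tag it receives under each scheme, and taking the commonest correspondence. The sketch below illustrates only this baseline, not the fuller methods of the cited papers.

from collections import defaultdict, Counter

def learn_tagset_mapping(parallel_tags):
    # parallel_tags: (tag_in_scheme_A, tag_in_scheme_B) pairs for the same
    # word tokens in a Parallel Annotated Corpus. Return the most frequent
    # scheme-B tag for each scheme-A tag - a simple baseline mapping only.
    cooccurrence = defaultdict(Counter)
    for tag_a, tag_b in parallel_tags:
        cooccurrence[tag_a][tag_b] += 1
    return {tag_a: counts.most_common(1)[0][0]
            for tag_a, counts in cooccurrence.items()}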
A treebank entry for the sentence "Stephen left school last year" could be encoded as:

[S [N Stephen_NP1 N][V left_VVD [N school_NN1 N][Nr last_MD year_NNT1 Nr]V] ._. S]
This is a compact textfile encoding of the following 2-dimensional syntax tree diagram:
                  S
              /   |   \
             /    |    \
            N     V     \
           /    / | \    \
          /    /  |  \    \
         /    /   N   Nr   \
        /    /    |   | \   \
       /    /     |   |  \   \
     NP1  VVD   NN1  MD  NNT1  .
      |    |      |   |    |   |
  Stephen left school last year .
This phrase structure tree can be generated by, or broken down into, a set of context-free rules:
S  --> N V .
N  --> NP1
V  --> VVD N Nr
N  --> NN1
Nr --> MD NNT1
There is also a set of lexical rewrite rules: "NP1 --> Stephen", etc. In this way, a context-free grammar and parser can be 'learnt' from a treebank: for each sentence-tree, extract a context-free rule for each non-terminal node. If the machine learning algorithm also keeps a count of how frequently each context-free rule is found in the treebank, this yields a probabilistic context-free grammar (see Atwell 1988, 1994, Souter and Atwell 1992, Leech and Fligelstone 1992). However, as mentioned above, a probabilistic context-free chart parser can require impractically long computation times for speech recognition applications, so alternative grammar-extraction methods may be worth investigating.
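The learning step can be sketched as follows, with the treebank entry recast as a hypothetical nested-pair data structure rather than the real treebank file format:

from collections import defaultdict

def extract_rules(tree, rule_counts):
    # tree: a nested (category, children) pair; a leaf child is a plain word.
    # For every node, count one rewrite rule: mother --> daughters.
    category, children = tree
    daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(category, daughters)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rule_counts)

# The "Stephen left school last year" tree from above, as nested pairs:
tree = ("S", [("N", [("NP1", ["Stephen"])]),
              ("V", [("VVD", ["left"]),
                     ("N", [("NN1", ["school"])]),
                     ("Nr", [("MD", ["last"]), ("NNT1", ["year"])])]),
              (".", ["."])])
counts = defaultdict(int)
extract_rules(tree, counts)
# counts now holds e.g. ("S", ("N", "V", ".")): 1 and ("Nr", ("MD", "NNT1")): 1,
# plus the lexical rules such as ("NP1", ("Stephen",)): 1.

Dividing each rule count by the total count for its mother category would give the rule probabilities of a probabilistic context-free grammar.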
One alternative approach to parsing involves a Markov or tag-N-gram model; this was originally proposed as an extension to the CLAWS wordtagging algorithm (Atwell 1983, Atwell et al 1984), but could not be implemented until training resources such as the SEC treebank became available (Atwell 1993c, 1994). The model used is a variant of standard N-gram theory, in that both the training set and desired output are required to be an alternating sequence of wordtags and labelled bracket combinations; for example, the above parse-tree (leaving off the words for simplification) is processed as the tag-sequence:
[S [N NP1 N][V VVD [N NN1 N][Nr MD _ NNT1 Nr]V] . S]
Whereas the standard word N-gram model learns word-pairs, word-triples etc (see above), the tag-N-gram parser learns tag-triples such as:
[S [N NP1
NP1 N][V VVD
VVD [N NN1
NN1 N][Nr MD
MD _ NNT1
NNT1 Nr]V] .
. S] [S
The parser implementation uses this adapted model for a "bracket-insertion" procedure. First, the words are wordtagged, using a CLAWS-like tag N-gram model; then phrase-structure brackets are inserted between pairs of wordtags, guided by the tag-triples table. A final "tree-closing" procedure ensures that parse trees are correctly balanced. In experiments on parsing lattices with equivalent-sized training sets, (Pocock and Atwell 1993, Atwell 1994) found that the Markov-model-based parser is much faster and more robust than a probabilistic chart parser developed from the same training treebank. Its optimal parsetree is not always structurally correct, but it is more likely to dominate the correct word-sequence, which is adequate for lattice disambiguation.
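The learning half of this procedure can be sketched as follows, assuming the training treebank has already been converted into the alternating bracket/tag sequences shown above ('_' marking "no brackets"); the sentence-boundary contexts, which the full model also records, are omitted for brevity.

from collections import defaultdict

def learn_bracket_triples(training_sequences):
    # Each training item alternates bracket combinations and wordtags, e.g.
    # ['[S [N', 'NP1', 'N][V', 'VVD', '[N', 'NN1', 'N][Nr', 'MD', '_',
    #  'NNT1', 'Nr]V]', '.', 'S]'].
    # Count (wordtag, brackets, wordtag) triples; at parse time the table
    # guides which brackets, if any, to insert between each pair of wordtags.
    triples = defaultdict(int)
    for seq in training_sequences:
        brackets, tags = seq[0::2], seq[1::2]
        for i in range(1, len(tags)):
            triples[(tags[i - 1], brackets[i], tags[i])] += 1
    return triples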
A variant N-gram approximation to phrase-structure syntax is used in the Vertical Strip Parser VSP (O'Donoghue 1993). In learning from a treebank, syntax trees are chopped into a series of Vertical Strips from root to leaves; for example, the above syntax tree is analysed into:
S     S     S     S     S     S
|     |     |     |     |     |
N     V     V     V     V     .
|     |     |     |     |
NP1   VVD   N     Nr    Nr
            |     |     |
            NN1   MD    NNT1
The VSP learning algorithm stores the set of vertical strips occurring with each wordtag. In analysing new sentences, the parser finds a vertical strip for each wordtag which is compatible with its neighbours, and combines these into a well-formed syntax tree by merging nodes from the root down until they diverge.
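Chopping a tree into its vertical strips is straightforward; the sketch below reuses the hypothetical nested-pair tree representation from the grammar-extraction sketch above, and is illustrative rather than the VSP code itself.

def vertical_strips(tree, path=()):
    # tree: a nested (category, children) pair; a leaf child is a plain word.
    # Yield one root-to-wordtag strip of category labels per word.
    category, children = tree
    for child in children:
        if isinstance(child, str):
            yield path + (category,)
        else:
            yield from vertical_strips(child, path + (category,))

# For the example tree, the strips are:
#   ('S', 'N', 'NP1')      ('S', 'V', 'VVD')      ('S', 'V', 'N', 'NN1')
#   ('S', 'V', 'Nr', 'MD') ('S', 'V', 'Nr', 'NNT1')   ('S', '.')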
"...(in time) one or ones before the one mentioned or now..."
and the LDOCE definition of "year" includes:
"...a measure of time equal to about 365 days..."
These definitions both contain the 'semantic primitive' "time", indicating a semantic overlap favouring cooccurrence of these two candidates; so the score of all word-lattice sequences including "last year" is incremented. This procedure is applied to all candidate-pairs in a sequence to evaluate each possible sequence, and the highest-scoring candidate sequence should be the most semantically consistent.
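A minimal sketch of this scoring procedure, with a toy dictionary of definition words standing in for the LDOCE definition texts:

def definition_overlap(word1, word2, definitions):
    # Count the words shared by the dictionary sense definitions of two
    # candidates, e.g. "last" and "year" sharing the primitive "time".
    # 'definitions' maps each word to the list of words in its definitions.
    return len(set(definitions.get(word1, [])) & set(definitions.get(word2, [])))

def semantic_score(candidate_sequence, definitions):
    # Sum the pairwise overlaps over all candidate pairs in one lattice path;
    # the most semantically consistent path scores highest.
    words = list(candidate_sequence)
    return sum(definition_overlap(words[i], words[j], definitions)
               for i in range(len(words))
               for j in range(i + 1, len(words)))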
LDOCE also has a set of semantic field markers, which provide a hierarchical taxonomic semantics at a higher level of abstraction than the sense-definitions. Each word has a small number of associated semantic field markers, and (Jost and Atwell 1993, Jost 1994) showed that these can be used as semantic tags in a Markovian disambiguation algorithm. An alternative semantic tag set has been produced at Lancaster University (Wilson and Rayson 1993, Eyes and Leech 1993), and this may also prove applicable to semantic constraints for speech and handwriting recognition.
footnote 1: ICAME = International Computer Archive of Modern English, an international network of corpus linguists, with a base at Bergen University. ICAME publishes an annual Journal, and holds annual International Conferences; see (Souter and Atwell 1993) for a list of past venues and proceedings.
footnote 2: DTI = Department of Trade and Industry, which subsidises Industrial SALT research and development projects; EPSRC = Engineering and Physical Sciences Research Council, which funds University SALT research projects; HEFCs' NTI = Higher Education Funding Councils' New Technologies Initiative, which funds IT infrastructure for UK Universities, including SALT training resources.
footnote 3: As fellow Star Trek fans will know, the Undiscovered Country is the Future...
footnote 4: The Association for Computational Linguistics has recently set up a Special Interest Group on Natural Language Learning, ACL-SIGNLL; for further details, contact its President, David Powers, Flinders University (powers@cs.flinders.edu.au) or Secretary, Walter Daelemans, Tilburg University (walter@kub.nl).
REFERENCES
Arnfield, Simon and Eric Atwell 1993 "A syntax based grammar of stress sequences." in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp71-77, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Arnfield, Simon 1994 "Prosody and syntax in corpus based analysis of spoken English" PhD thesis, School of Computer Studies and Psychology Department, University of Leeds.
Atwell, Eric Steven 1983 "Constituent Likelihood Grammar" ICAME Journal no.7, pp34-67, Norwegian Computing Centre for the Humanities, Bergen.
Atwell, Eric Steven, Geoffrey Leech and Roger Garside 1984 "Analysis of the LOB Corpus: progress and prospects" in Jan Aarts and Willem Meijs (eds.) "Corpus Linguistics: Proceedings of the ICAME 4th International Conference", pp40-52, Rodopi, Amsterdam.
Atwell, Eric Steven 1987a "How to detect grammatical errors in a text without parsing it" in Bente Maegaard (ed), "Proceedings of the Third Conference of European Chapter of the Association for Computational Linguistics", pp38-45, Association for Computational Linguistics, New Jersey.
Atwell, Eric Steven 1987b "A parsing expert system which learns from corpus analysis" in Willem Meijs (ed.) "Corpus Linguistics and Beyond: Proceedings of the ICAME 7th International Conference", pp227-235, Rodopi, Amsterdam.
Atwell, Eric Steven, and Stephen Elliott 1987 "Dealing with ill-formed English text" in Roger Garside, Geoffrey Leech and Geoffrey Sampson (eds) "The computational analysis of English: a corpus-based approach", pp.120-138, Longman, Harlow.
Atwell, Eric Steven, and Nikos Drakos 1987 "Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text" in Bente Maegaard (ed.) "Proceedings of the Third Conference of European Chapter of the Association for Computational Linguistics", pp56-63, Association for Computational Linguistics, New Jersey.
Atwell, Eric Steven 1988 "Transforming a Parsed Corpus into a Corpus Parser" In Merja Kyto, Ossi Ihalainen and Matti Risanen (eds) "Corpus Linguistics, hard and soft: Proceedings of the ICAME 8th International Conference", pp61-70, Rodopi, Amsterdam.
Atwell, Eric Steven, Clive Souter and Tim O'Donoghue 1988 "Prototype Parser 1" COMMUNAL report no.17 to MoD, School of Computer Studies, Leeds University.
Atwell, Eric 1993a "Linguistic Constraints for Large-Vocabulary Speech Recognition" in Eric Atwell (ed) "Knowledge at Work in Universities", pp26-32, Leeds University Press, Leeds.
Atwell, Eric 1993b "Introduction and Overview of the HEFC Knowledge Based Systems Initiative" in Eric Atwell (ed) "Knowledge at Work in Universities", pp1-5, Leeds University Press, Leeds.
Atwell, Eric 1993c "Corpus-Based Statistical Modelling of English Grammar" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp195-215, Rodopi, Amsterdam.
Atwell, Eric, Simon Arnfield, George Demetriou, Steve Hanlon, John Hughes, Uwe Jost, Rob Pocock, Clive Souter and Joerg Ueberla 1993 "Multi-level Disambiguation Grammar Inferred from English Corpus, Treebank and Dictionary" in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp91-97, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Atwell, Eric 1994 "Speech-Oriented Probabilistic Parser Project: Final Report to MoD" School of Computer Studies, Leeds University.
Atwell, Eric, John Hughes and Clive Souter 1994 "AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models" in Judith Klavans (ed.) "Proceedings of ACL workshop on The Balancing Act: Combining Symbolic and Statistical Approaches to Language", pp 21-28, Association for Computational Linguistics, New Jersey.
Atwell, Eric, and Paul McKevitt 1994 "Pragmatic linguistic constraint models for large-vocabulary speech processing" in Paul McKevitt (ed.) "Integrating Speech and Natural Language Processing: AAAI94 Workshop Proceedings" pp58-64, American Association for Artificial Intelligence, Washington USA.
Atwell, Eric, Gavin Churcher and Clive Souter 1995 "Developing a corpus-based grammar model within a commercial continuous speech recognition package" In Collingham R (ed) Papers for the Institute of Acoustics Speech Group meeting on Integrating Speech Recognition and Natural Language Processing, pp5-6, Durham University.
Black, Ezra, Roger Garside, and Geoffrey Leech (eds.) 1993 "Statistically-driven computer grammars of English: the IBM/Lancaster approach", Rodopi, Amsterdam.
Demetriou, George 1992 "Lexical Disambiguation Using Constraint Handling In Prolog (CHIP)" MSc thesis, School of Computer Studies, Leeds University.
Demetriou, George 1993 "Lexical Disambiguation Using CHIP (Constraint Handling In Prolog)" in "Proceedings of the sixth European conference of the Association for Computational Linguistics", pp431-436, Association for Computational Linguistics, New Jersey.
Demetriou, George, and Eric Atwell 1994 "Machine-Learnable, Non-Compositional Semantics for Domain Independent Speech or Text Recognition" in "Proceedings of 2nd Hellenic-European Conference on Mathematics and Informatics (HERMIS)", pp103-104, Athens University of Economics and Business.
Eyes, Elizabeth, and Geoffrey Leech 1993 "Progress in UCREL research: improving corpus annotation practices" in Jan Aarts, Pieter de Haan and Nelleke Oostdijk (eds.) "English language corpora: design, analysis and exploitation", pp123-144, Rodopi, Amsterdam.
Gazdar, Gerald, and Christopher Mellish 1989 "Natural Language Processing in Pop-11: an introduction to computational linguistics" Addison-Wesley, Reading.
Gros, Jerneja, France Mihelic and Nikola Pavesic 1994 "Sentence hypothesization in a speech recognition and understanding system for the Slovene spoken language" in Lindsay Evett and Tony Rose (eds) "Proceedings of AISB workshop on Computational Linguistics for Speech and Handwriting Recognition", pp 91-96, Leeds University.
Haigh, Robin, Geoffrey Sampson and Eric Atwell 1988 "Project APRIL - a progress report" in "Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL)", pp104-112, Association for Computational Linguistics, New Jersey.
Hanlon, Stephen 1994 "A computational theory of contextual knowledge in machine reading" PhD thesis, School of Computer Studies, Leeds University.
Heritage, John 1988 "Explanations as accounts: a conversation analytic perspective" in Charles Antaki (ed.), "Analysing everyday explanation: a casebook of methods", pp127-144 Sage Publications, London.
Howarth, Peter 1995 "A computer-assisted study of collocations in academic prose, with special reference to grammatical structure and stylistic value" PhD thesis, School of English, Leeds University.
Hughes, John 1989 "A learning interface to the Realistic Annealing Parser" BSc Project Report, School of Computer Studies, Leeds University.
Hughes, John 1994 "Automatically acquiring a classification of words" PhD thesis, School of Computer Studies, Leeds University.
Hughes, John, and Eric Atwell 1994 "The Automated Evaluation of Inferred Word Classifications" in Tony Cohn (ed.) "Proceedings of European Conference on Artificial Intelligence (ECAI)", pp535-539, John Wiley, Chichester.
Hughes, John, Clive Souter and Eric Atwell 1995 "Automatic extraction of tagset mappings from Parallel-Annotated Corpora" in Evelyne Tzoukermann and Susan Armstrong (eds) "From text to tags: issues in multilingual language analysis" pp10-17, Proceedings of ACL-SIGDAT workshop, University College Dublin.
Jelinek, Fred 1990 "Self-organized language modeling for speech recognition" in Alex Waibel and Kai-Fu Lee (eds) "Readings in Speech Recognition", pp450-506 Morgan Kaufmann, San Mateo California.
Johansson, Stig, Eric Steven Atwell, Roger Garside and Geoffrey Leech 1986 "The Tagged LOB Corpus" Norwegian Computing Centre for the Humanities, Bergen.
Jost, Uwe and Eric Atwell 1993 "Deriving a probabilistic grammar of semantic markers from unrestricted English text" in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp91-97, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Jost, Uwe 1994 "Probabilistic language modelling for speech recognition" MSc thesis, School of Computer Studies, Leeds University.
Keenan, Francis 1992 "Large vocabulary syntactic analysis for text recognition" PhD thesis, Department of Computing, Nottingham Trent University.
Leech, Geoffrey 1983 "Principles of pragmatics" Longman, Harlow.
Leech, Geoffrey, Roger Garside, and Eric Atwell 1983a "Recent developments in the use of computer corpora in English language research" in Transactions of the Philological Society, Volume 1983, pp.23-40, Basil Blackwell, Oxford.
Leech, Geoffrey, Roger Garside, and Eric Steven Atwell 1983b "The automatic grammatical tagging of the LOB corpus" ICAME Journal no.7, pp13-33 Norwegian Computing Centre for the Humanities, Bergen.
Leech, Geoffrey 1986 "Automatic grammatical analysis and its educational applications" in Geoffrey Leech and Christopher Candlin (eds) "Computers in English language teaching and research: selected papers from the British Council Symposium" pp204-214, Longman, Harlow.
Leech, Geoffrey, and Roger Garside 1991 "Running a grammar factory: the production of syntactically analysed corpora or 'treebanks'" in Stig Johansson and Anna-Brita Stenstrom (eds) "English computer corpora", pp15-32 Mouton de Gruyter, Berlin.
Leech, Geoffrey, and Steven Fligelstone 1992 "Computers and corpus analysis" in Christopher Butler (ed) "Computers and written texts" pp115-140, Basil Blackwell, Oxford.
McKevitt, Paul 1991 "Analysing coherence of intention in natural-language dialogue" PhD thesis, Department of Computer Science, Exeter University.
Modd, Dan, and Eric Atwell 1994 "A Word Hypothesis Lattice Corpus - a benchmark for linguistic constraint models" in Lindsay Evett and Tony Rose (eds) "Proceedings of AISB workshop on Computational Linguistics for Speech and Handwriting Recognition", pp191-198, Leeds University.
O'Donoghue, Tim 1993 "Reversing the process of generation in systemic grammar" PhD thesis, School of Computer Studies, Leeds University.
Phillips, J D and Henry Thompson 1987 "A parsing tool for the Natural Language theme" Software Paper no.5, Department of Artificial Intelligence, Edinburgh University.
Pocock, Rob, and Eric Atwell 1993 "Treebank-Trained Probabilistic Parsing of Lattices" Technical Report 93.30, School of Computer Studies, Leeds University.
Souter, Clive 1993 "Harmonising a lexical database with a corpus-based grammar" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp181-193, Rodopi, Amsterdam.
Souter, Clive, and Eric Atwell 1992 "A Richly Annotated Corpus for Probabilistic Parsing" in "Proceedings of AAAI workshop on Statistically-Based NLP Techniques", pp28-38, American Association for Artificial Intelligence, San Jose California.
Souter, Clive, and Eric Atwell 1994 "Using Parsed Corpora: A review of current practice" in Nelleke Oostdijk and Pieter de Haan (eds) "Corpus-based Research Into Language", pp143-158 Rodopi, Amsterdam.
Taylor, Lita, Geoffrey Leech and Steve Fligelstone 1991 "A survey of English machine-readable corpora" in Stig Johansson and Anna-Brita Stenstrom (eds) "English computer corpora", pp319-354 Mouton de Gruyter, Berlin.
Wilson, Andrew, and Paul Rayson 1993 "Automatic content analysis of spoken discourse" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp215-227 Rodopi, Amsterdam.
Young, Steve 1995 "The state of the art in speech recognition" In Collingham R (ed) Papers for the Institute of Acoustics Speech Group meeting on Integrating Speech Recognition and Natural Language Processing, Durham University.