Centre for Computer Analysis of Language And Speech
School of Computer Studies
The University of Leeds, LEEDS LS2 9JT, Yorkshire, England
tel:0113-2335761 fax:0113-2335468 email:eric@scs.leeds.ac.uk
This tutorial has been published as Chapter 9 of Jenny Thomas and Mick Short (eds), "Using Corpora for Language Research: Studies in the Honour of Geoffrey Leech", pp151-166, Longman, Harlow. 1996.
The pioneers of modern Corpus Linguistics saw their principal application areas as linguistic theory and English language teaching. For example, (Leech et al 1983a), an early paper on the LOB Corpus tagging project, cited "the need for accurate textual evidence for statements about the language"; and (Leech 1986) discussed applications in computer-assisted language learning (CALL). Even today, the bulk of regular attenders at annual ICAME English Corpus Linguistics conferences (*footnote 1*) are from English linguistics and language teaching departments. Implicit in terms like ELT and even CALL is the assumption that the learners are humans.
With the growth of Information Technology has come growing demand for computers to process language in various ways, and this has led to a subfield of IT with a number of names: "Natural Language Processing" or "Computational Linguistics" amongst academics (see Gazdar and Mellish 1989), "Speech And Language Technology" ("SALT") to UK research funding agencies such as DTI, EPSRC and HEFCs' NTI, and "Language Engineering" to European Union funding agencies. (*footnote 2*) All SALT systems need some sort of language model, and one way to get this model into the computer system is to use a "machine learning" algorithm, with a corpus as "training data". Much of the terminology has parallels with human language learning, but there are major differences in the underlying learning process. Most significantly, human learners of English as a second or foreign language start off knowing some other Natural Language; the learning task is one of mapping between the new and known languages. Learners obviously focus on the differences between English and their native tongue, but there is a great deal of overlap between any two Natural Languages which can be taken for granted. For example, when learning the English word COMPUTER the non-native speaker naturally relates this word to the native-language translation, and has mental lexical, syntactic, semantic and pragmatic concepts and structured interrelationships already available to "hook onto". In machine learning, by contrast, the computer starts from scratch. A word like COMPUTER is simply an ASCII character sequence or string, unless and until more complex language patterns are learnt and associated with the word.
The kinds of "language patterns" which can be learnt may be quite different from those in human language processing. For SALT systems, this is not necessarily a failing: it may not be essential to mimic human language learning and processing functionality. SALT systems can achieve acceptable performance rates in their specific task without full human linguistic abilities. For example, the task of a speech recognition system is to map the input acoustic signal onto the corresponding sequence of ASCII strings; it is possible to do this surprisingly accurately without "understanding" the speech signal, using models of vocabulary and grammar which are quite different to those learnt by EFL students. While general-purpose SALT systems generally have to be robust (able to cope with variations and "ill-formedness" found in unconstrained English input), the level of linguistic analysis may be shallow or skeletal. To ensure computation in a reasonable time, it is more pragmatic to accept a minimally sufficient analysis, deliberately ignoring linguistic issues which are not essential or relevant to the application. Current successful SALT systems focus on tasks for which a robust shallow linguistic analysis suffices.
This "noisy English" representation inside a recognition system (be it for speech, handwriting or printed text input) is typically a sequence of sets of candidate-words, referred to as a word recognition lattice. (Atwell 1993a) gives the following illustration (much simplified compared to realistic systems). On `hearing' the sentence "Stephen left school last year", an English speech recognition system may produce the following lattice of word-candidates (in order of decreasing similarity to input speech signal):
stephen   stiffen   stiffens
left      lift      loft
school    scowl     scull
lest      last      least
yearn     your      year
.

In speech recognition, alternative candidates at each point are phonetically similar; in script recognition, candidates are graphemically similar. The task of detecting errors in word-processed English text can also be cast in terms of a word recognition lattice, if the system artificially "ambiguates" each word as it is typed in: (Atwell 1987a) proposed that "...this could be done by generating COHORTS for each input word, and then choosing the cohort-member word which fits the context best". If the "best" choice is not the word actually typed in, this becomes the suggested replacement for an error.
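By way of illustration, a lattice of this kind might be held inside a recognition system as something as simple as a list of candidate sets, one per word position. The sketch below is not taken from any of the systems cited; the cohort_lexicon is a hypothetical stand-in for whatever confusability resource a text-checking system might use to "ambiguate" each typed word into a cohort.

lattice = [
    ["stephen", "stiffen", "stiffens"],
    ["left", "lift", "loft"],
    ["school", "scowl", "scull"],
    ["lest", "last", "least"],
    ["yearn", "your", "year"],
    ["."],
]

def ambiguate(word, cohort_lexicon):
    # Hypothetical cohort generation for word-processed text checking:
    # return the typed word plus the confusable words listed for it.
    return [word] + cohort_lexicon.get(word, [])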
The language model's task is to find the most linguistically plausible sequence of words through the lattice. Most language models for lattice disambiguation provide only limited coverage of the linguistic knowledge available (Jelinek 1990, Young 1995). This is because the system has to search through all possible combinations of candidates; in the above example this amounts to 3*3*3*3*3 = 243 possible sentences. Analysis of recognition lattices involves traversing a much larger search space than when analysing 'known' sentences. Because of this, sophisticated language analysis systems may be too slow and unwieldy to disambiguate recognition lattices in reasonable time. For example, (Atwell 1994) found that a probabilistic context-free chart parser (based on the outline implementation in Gazdar and Mellish 1989) required impractically long computation times to discover a very large number of ambiguous analyses of even simple spoken word recognition lattices. Similarly, (Keenan 1992) reported impractically long compute times when attempting to use the Alvey Natural Language Toolkit (ANLT) chart parser (Phillips and Thompson 1987) to disambiguate handwritten word recognition lattices. There is arguably a need for language models which strike a pragmatic balance, incorporating a range of linguistic knowledge while remaining computationally practical.
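The brute-force search which makes sophisticated parsers impractical is easy to state: enumerate every path through the lattice and score each one with the language model. A minimal sketch, assuming only that some scoring function is available:

from itertools import product

def best_sequence(lattice, score):
    # Enumerate every path through the lattice (3*3*3*3*3*1 = 243 sentences
    # for the example above) and return the one the language model scores highest.
    # 'score' is any function from a word sequence to a plausibility value.
    return max(product(*lattice), key=score)

For realistic vocabularies and lattice depths this exhaustive enumeration is exactly what cannot be afforded, which is why the trade-off between linguistic coverage and computational cost matters.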
Given this wider range of training data, a wider range of linguistic information is machine-learnable, including the following:
Some of these models are well known and used by linguists and ELT researchers; others have no immediate analogy in human language learning, but may still be useful in applications like speech word-lattice traversal.
The alternative to machine learning of language models is for linguists to "hand-craft" their linguistic expertise into SALT systems, directly encoding their knowledge about language into software. As a half-way house between these two extremes, the linguist can be guided by corpus examples in building a grammar; or, starting from the other end, the results of machine learning can be improved and fine-tuned by augmenting or merging with linguists' hand-crafted resources.
Even this watered-down use of machine learning has advantages over purely hand-crafted language modelling. The main advantage is that the building of SALT language models is broken into two separate subtasks: building corpus resources, and then extracting computable models from these corpus resources. For many linguists, annotating examples in a corpus or lexical database is more straightforward and 'natural' than encoding linguistic knowledge in a computable formalism. Both subtasks yield recyclable resources: an annotated corpus has uses other than machine learning of language models (e.g. lexicography, ELT, ...); and a successful language-model-learning algorithm can in principle be re-applied to other annotated corpora to extract models of rival tagging/parsing schemes. For example, (Atwell 1988) described how a context-free phrase-structure grammar and parser were learnt from the Lancaster/Leeds Treebank (see below); essentially the same technique was used in (Souter and Atwell 1992) to glean a context-free phrase-structure grammar from the Polytechnic of Wales treebank (annotated with Systemic Functional Grammar parsetrees) and in (Atwell 1994) to extract a third context-free phrase structure grammar from the Spoken English Corpus Treebank (annotated using Lancaster/IBM skeletal grammar). Furthermore, a machine learning algorithm which works on English corpus resources should work with minimal modifications with corpora in other languages. For example, the IBM research team led by Jelinek (Jelinek 1990) has been highly successful at gleaning corpus-based language models for English speech recognition; (Gros et al 1994) reused the same machine learning techniques to learn corpus-based language models for a Slovene speech recogniser.
A concept related to learning is adaptation. It is desirable that SALT systems can automatically adapt to variations of language style, genre, accent, etc, even single-speaker variations and degradations over time. Once a basic default language model has been learnt, a machine learning system can be transformed into a user-adaptive system by adding a "feedback-loop". The system then incrementally updates the initial language model from its own analyses. For example, (Atwell et al 1988) reported on a probabilistic parser which learnt its grammar model from the Polytechnic of Wales treebank; (Hughes 1989) described a modified version of this parser, which adapted its learnt grammar to account for parses of new sentences. In contrast, grammars hand-crafted by linguists must be explicitly updated or maintained by linguists.
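The feedback loop itself can be very simple; the sketch below is an illustrative outline only (not the cited parsers' code), and assumes the learnt model reduces to a table of frequency counts which are topped up with events drawn from the system's own analyses of new sentences.

class AdaptiveModel:
    # Sketch of a feedback loop: counts learnt from a treebank are
    # incrementally updated with events (rules, tag pairs, etc.) taken
    # from the system's own analyses of new input.
    def __init__(self, initial_counts):
        self.counts = dict(initial_counts)

    def update(self, analysis_events):
        for event in analysis_events:
            self.counts[event] = self.counts.get(event, 0) + 1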
Corpus-based approaches force awareness of practical issues which are important in working, reusable SALT systems. In hand-crafting a language model, linguists can overlook issues such as punctuation and/or prosodic markers, capitalisation, neologisms or "out-of-vocabulary" unknown words, segmentation into words and sentences, etc. A machine learning system must have ways of coping with these in processing its training set, and these "low-level" routines can be re-used in the end-product SALT system. The prime example is the careful treatment of punctuation, capitalisation, etc. in the Constituent Likelihood Automatic Wordtagging System CLAWS, used to wordtag the LOB Corpus (Leech et al 1983a,b, Atwell et al 1984, Johansson et al 1986). The wordtag-pair model (also known as a wordtag N-gram or Markov model) at the heart of the CLAWS wordtag-disambiguation routine is learnt from a tagged corpus training set, and the routines for handling punctuation etc are also used in analysing new text.
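For concreteness, the following sketch shows the two halves of a CLAWS-like wordtag-pair model: learning tag-pair frequencies from a tagged corpus, and using them to choose among candidate tags for a new sentence. It is an illustrative reconstruction with crude add-one smoothing, not the CLAWS implementation itself.

from collections import defaultdict

def learn_tag_pairs(tagged_corpus):
    # tagged_corpus: a list of sentences, each a list of (word, tag) pairs.
    # Count adjacent wordtag pairs - the training step of a CLAWS-like model.
    pair_freq = defaultdict(int)
    for sentence in tagged_corpus:
        tags = [tag for _, tag in sentence]
        for t1, t2 in zip(tags, tags[1:]):
            pair_freq[(t1, t2)] += 1
    return pair_freq

def best_tag_path(candidate_tags, pair_freq):
    # candidate_tags: one list of possible tags per word in the new sentence.
    # Dynamic programming over tag-pair frequencies (add-one smoothed)
    # chooses the most likely tag for each word in context.
    best = {t: (1, [t]) for t in candidate_tags[0]}
    for column in candidate_tags[1:]:
        new_best = {}
        for t2 in column:
            scored = [(s * (pair_freq.get((t1, t2), 0) + 1), p + [t2])
                      for t1, (s, p) in best.items()]
            new_best[t2] = max(scored)
        best = new_best
    return max(best.values())[1]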
Last but not least, I find machine learning intellectually challenging because it allows researchers to explore an undiscovered country of novel models of language, quite different to those dreamt up by traditional theoretical linguists. (*footnote 3*) Some of these are illustrated below.
A rather more insidious problem is that machine-learnable syntax and semantics models tend to focus on what is learnable, and so tend to be skeletal or non-compositional; it is difficult to see how to learn models based on the more esoteric linguistic theories such as GPSG, HPSG or LFG. The tendency to take due account of punctuation, capitalisation, etc was listed above as a positive advantage of corpus-based machine learning; however, this could also be seen as undue attention to 'trivia', at the possible expense of overlooking theoretically significant linguistic issues. A related problem is that these theories aim to translate a parse-tree into a semantic or knowledge representation amenable to inferencing and 'understanding'; the kinds of machine-learnt skeletal language models currently in use do not readily deliver rich representations to pass to semantic interpretation or inferencing modules of an intelligent Natural Language Understanding system. There is a need for further research to extend machine learning algorithms to cover richer linguistic formalisms, and language processing beyond skeletal parsing and semantics. (Leech and Fligelstone 1992) noted: "A context-free phrase structure grammar is manifestly too weak a model for good results, and one direction of progress is in developing more sophisticated models, such as a probabilistic unification grammar. ... Another direction in which to advance is to integrate a corpus parser within a grander scheme for corpus analysis, including semantic analysis."
This success of word N-gram language models may seem counterintuitive and even implausible to some linguists. An analogy with computer chess may be helpful. Grand Masters become chess experts by learning thousands of specialised move-sequences and strategies; but computer chess programs generally work by "brute force" computation of all possible combinations of moves and their relative scores, within a constrained "look-ahead". The Grand Master may well perceive a "best score" move sequence predicted by a chess program as a structured sequence of moves following one or more known generalised strategies; but the chess program does not need this higher-level knowledge. The lack of "psychological validity" in the computer approach to chess is irrelevant, as long as the models used yield the required results. Both human and computer approaches to chess are suited to their respective processing capabilities. In language modelling for speech recognition, the computer must compute all possible sequences of words and their relative scores, using word sequence patterns in a training corpus as a guide. Speech recognition researchers have shown empirically that a word N-gram model can approximate the score of a word-sequence without recourse to higher-level knowledge, at least for some tasks. It is up to linguists to find speech recognition tasks where their wares are needed!
However, given that high-level linguistic knowledge MAY be useful, at least it may be worth developing some machine-learnt language models to try out. The following are some novel models of language which contrast with traditional linguistic concepts. Because of space limitations, these models can only be described here in outline; but see the references for fuller details.
One possible response is to start again from scratch, formulating computable machine learning criteria for putting words into wordclasses. For example, (Atwell 1987b, Atwell and Drakos 1987, Atwell and Elliott 1987) computed new wordclasses for words in a training corpus by clustering words with similar collocational behaviour; other researchers have experimented with other clustering and similarity metrics. However, these machine learning experiments also led to a range of incompatible results: different algorithms for clustering, and different definitions of collocational criteria, led to different word-classes. As a further response, (Hughes 1994, Hughes and Atwell 1994) proposed an algorithm for automatically evaluating a range of competing word-classifications against a specific 'target'.
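One simple version of this clustering approach is sketched below, under assumptions of my own choosing (a one-word collocation window and cosine similarity): build a co-occurrence vector for each word and measure how alike two words' collocational behaviour is. The incompatibility problem noted above arises precisely because many other windows, similarity metrics and clustering methods are equally defensible.

from collections import Counter, defaultdict
from math import sqrt

def collocation_vectors(corpus, window=1):
    # corpus: a list of word tokens. For each word, count the words seen
    # within 'window' positions of it - its collocational behaviour.
    vectors = defaultdict(Counter)
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j != i:
                vectors[w][corpus[j]] += 1
    return vectors

def cosine(u, v):
    # One of many possible similarity metrics between collocation vectors;
    # words with similar vectors would be clustered into the same class.
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0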
In practice, many researchers would prefer to reuse an existing tagged corpus and corpus tagset. At present, this means they are restricted to a single corpus; but (Atwell et al 1994, Hughes et al 1995) discussed experiments with machine learning a mapping between corpus tagsets, using a Parallel Annotated Corpus as a training set.
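A crude baseline for such a tagset mapping can be learnt by counting, for each word token in the Parallel Annotated Corpus, which tag it receives under each scheme, and taking the commonest correspondence. The sketch below illustrates only this baseline, not the fuller methods of the cited papers.

from collections import defaultdict, Counter

def learn_tagset_mapping(parallel_tags):
    # parallel_tags: (tag_in_scheme_A, tag_in_scheme_B) pairs for the same
    # word tokens in a Parallel Annotated Corpus. Return the most frequent
    # scheme-B tag for each scheme-A tag - a simple baseline mapping only.
    cooccurrence = defaultdict(Counter)
    for tag_a, tag_b in parallel_tags:
        cooccurrence[tag_a][tag_b] += 1
    return {tag_a: counts.most_common(1)[0][0]
            for tag_a, counts in cooccurrence.items()}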
A treebank entry for the sentence "Stephen left school last year" could be encoded as:

[S [N Stephen_NP1 N][V left_VVD [N school_NN1 N][Nr last_MD year_NNT1 Nr]V] ._. S]
This is a compact textfile encoding of the following 2-dimensional syntax tree diagram:
                  S
              /   |   \
             /    |    \
            N     V     \
           /    / | \    \
          /    /  |  \    \
         /    /   N   Nr   \
        /    /    |   | \   \
       /    /     |   |  \   \
     NP1  VVD   NN1  MD  NNT1  .
      |    |      |   |    |   |
  Stephen left school last year .
This phrase structure tree can be generated by, or broken down into, a set of context-free rules:
S  --> N V .
N  --> NP1
V  --> VVD N Nr
N  --> NN1
Nr --> MD NNT1
There is also a set of lexical rewrite rules: "NP1 --> Stephen", etc. In this way, a context-free grammar and parser can be 'learnt' from a treebank: for each sentence-tree, extract a context-free rule for each non-terminal node. If the machine learning algorithm also keeps a count of how frequently each context-free rule is found in the treebank, this yields a probabilistic context-free grammar (see Atwell 1988, 1994, Souter and Atwell 1992, Leech and Fligelstone 1992). However, as mentioned above, a probabilistic context-free chart parser can require impractically long computation times for speech recognition applications, so alternative grammar-extraction methods may be worth investigating.
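The learning step can be sketched as follows, with the treebank entry recast as a hypothetical nested-pair data structure rather than the real treebank file format:

from collections import defaultdict

def extract_rules(tree, rule_counts):
    # tree: a nested (category, children) pair; a leaf child is a plain word.
    # For every node, count one rewrite rule: mother --> daughters.
    category, children = tree
    daughters = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_counts[(category, daughters)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rule_counts)

# The "Stephen left school last year" tree from above, as nested pairs:
tree = ("S", [("N", [("NP1", ["Stephen"])]),
              ("V", [("VVD", ["left"]),
                     ("N", [("NN1", ["school"])]),
                     ("Nr", [("MD", ["last"]), ("NNT1", ["year"])])]),
              (".", ["."])])
counts = defaultdict(int)
extract_rules(tree, counts)
# counts now holds e.g. ("S", ("N", "V", ".")): 1 and ("Nr", ("MD", "NNT1")): 1,
# plus the lexical rules such as ("NP1", ("Stephen",)): 1.

Dividing each rule count by the total count for its mother category would give the rule probabilities of a probabilistic context-free grammar.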
One alternative approach to parsing involves a Markov or tag-N-gram model; this was originally proposed as an extension to the CLAWS wordtagging algorithm (Atwell 1983, Atwell et al 1984), but could not be implemented until training resources such as the SEC treebank became available (Atwell 1993c, 1994). The model used is a variant of standard N-gram theory, in that both the training set and desired output are required to be an alternating sequence of wordtags and labelled bracket combinations; for example, the above parse-tree (leaving off the words for simplification) is processed as the tag-sequence:
[S [N NP1 N][V VVD [N NN1 N][Nr MD _ NNT1 Nr]V] . S]
Whereas the standard word N-gram model learns word-pairs, word-triples etc (see above), the tag-N-gram parser learns tag-triples such as:
[S [N NP1
NP1 N][V VVD
VVD [N NN1
NN1 N][Nr MD
MD _ NNT1
NNT1 Nr]V] .
. S] [S
The parser implementation uses this adapted model for a "bracket-insertion" procedure. First, the words are wordtagged, using a CLAWS-like tag N-gram model; then phrase-structure brackets are inserted between pairs of wordtags, guided by the tag-triples table. A final "tree-closing" procedure ensures that parse trees are correctly balanced. In experiments on parsing lattices with equivalent-sized training sets, (Pocock and Atwell 1993, Atwell 1994) found that the Markov-model-based parser is much faster and more robust than a probabilistic chart parser developed from the same training treebank. Its optimal parsetree is not always structurally correct, but it is more likely to dominate the correct word-sequence, which is adequate for lattice disambiguation.
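The learning half of this procedure can be sketched as follows, assuming the training treebank has already been converted into the alternating bracket/tag sequences shown above ('_' marking "no brackets"); the sentence-boundary contexts, which the full model also records, are omitted for brevity.

from collections import defaultdict

def learn_bracket_triples(training_sequences):
    # Each training item alternates bracket combinations and wordtags, e.g.
    # ['[S [N', 'NP1', 'N][V', 'VVD', '[N', 'NN1', 'N][Nr', 'MD', '_',
    #  'NNT1', 'Nr]V]', '.', 'S]'].
    # Count (wordtag, brackets, wordtag) triples; at parse time the table
    # guides which brackets, if any, to insert between each pair of wordtags.
    triples = defaultdict(int)
    for seq in training_sequences:
        brackets, tags = seq[0::2], seq[1::2]
        for i in range(1, len(tags)):
            triples[(tags[i - 1], brackets[i], tags[i])] += 1
    return triples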
A variant N-gram approximation to phrase-structure syntax is used in the Vertical Strip Parser VSP (O'Donoghue 1993). In learning from a treebank, syntax trees are chopped into a series of Vertical Strips from root to leaves; for example, the above syntax tree is analysed into:
S     S     S     S     S     S
|     |     |     |     |     |
N     V     V     V     V     .
|     |     |     |     |
NP1   VVD   N     Nr    Nr
            |     |     |
            NN1   MD    NNT1
The VSP learning algorithm stores the set of vertical strips occurring with each wordtag. In analysing new sentences, the parser finds a vertical strip for each wordtag which is compatible with its neighbours, and combines these into a well-formed syntax tree by merging nodes from the root down until they diverge.
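Chopping a tree into its vertical strips is straightforward; the sketch below reuses the hypothetical nested-pair tree representation from the grammar-extraction sketch above, and is illustrative rather than the VSP code itself.

def vertical_strips(tree, path=()):
    # tree: a nested (category, children) pair; a leaf child is a plain word.
    # Yield one root-to-wordtag strip of category labels per word.
    category, children = tree
    for child in children:
        if isinstance(child, str):
            yield path + (category,)
        else:
            yield from vertical_strips(child, path + (category,))

# For the example tree, the strips are:
#   ('S', 'N', 'NP1')      ('S', 'V', 'VVD')      ('S', 'V', 'N', 'NN1')
#   ('S', 'V', 'Nr', 'MD') ('S', 'V', 'Nr', 'NNT1')   ('S', '.')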
"...(in time) one or ones before the one mentioned or now..."
and the LDOCE definition of "year" includes:
"...a measure of time equal to about 365 days..."
These definitions both contain the 'semantic primitive' "time", indicating a semantic overlap favouring cooccurrence of these two candidates; so the score of all word-lattice sequences including "last year" is incremented. This procedure is applied to all candidate-pairs in a sequence to evaluate each possible sequence, and the highest-scoring candidate sequence should be the most semantically consistent.
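A minimal sketch of this scoring procedure, with a toy dictionary of definition words standing in for the LDOCE definition texts:

def definition_overlap(word1, word2, definitions):
    # Count the words shared by the dictionary sense definitions of two
    # candidates, e.g. "last" and "year" sharing the primitive "time".
    # 'definitions' maps each word to the list of words in its definitions.
    return len(set(definitions.get(word1, [])) & set(definitions.get(word2, [])))

def semantic_score(candidate_sequence, definitions):
    # Sum the pairwise overlaps over all candidate pairs in one lattice path;
    # the most semantically consistent path scores highest.
    words = list(candidate_sequence)
    return sum(definition_overlap(words[i], words[j], definitions)
               for i in range(len(words))
               for j in range(i + 1, len(words)))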
LDOCE also has a set of semantic field markers, which provide a hierarchical taxonomic semantics at a higher level of abstraction than the sense-definitions. Each word has a small number of associated semantic field markers, and (Jost and Atwell 1993, Jost 1994) showed that these can be used as semantic tags in a Markovian disambiguation algorithm. An alternative semantic tag set has been produced at Lancaster University (Wilson and Rayson 1993, Eyes and Leech 1993), and this may also prove applicable to semantic constraints for speech and handwriting recognition.
footnote 1: ICAME = International Computer Archive of Modern English, an international network of corpus linguists, with a base at Bergen University. ICAME publishes an annual Journal, and holds annual International Conferences; see (Souter and Atwell 1993) for a list of past venues and proceedings.
footnote 2: DTI = Department of Trade and Industry, which subsidises Industrial SALT research and development projects; EPSRC = Engineering and Physical Sciences Research Council, which funds University SALT research projects; HEFCs' NTI = Higher Education Funding Councils' New Technologies Initiative, which funds IT infrastructure for UK Universities, including SALT training resources.
footnote 3: As fellow Star Trek fans will know, the Undiscovered Country is the Future...
footnote 4: The Association for Computational Linguistics has recently set up a Special Interest Group on Natural Language Learning, ACL-SIGNLL; for further details, contact its President, David Powers, Flinders University (powers@cs.flinders.edu.au) or Secretary, Walter Daelemans, Tilburg University (walter@kub.nl).
REFERENCES
Arnfield, Simon and Eric Atwell 1993 "A syntax based grammar of stress sequences." in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp71-77, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Arnfield, Simon 1994 "Prosody and syntax in corpus based analysis of spoken English" PhD thesis, School of Computer Studies and Psychology Department, University of Leeds.
Atwell, Eric Steven 1983 "Constituent Likelihood Grammar" ICAME Journal no.7, pp34-67, Norwegian Computing Centre for the Humanities, Bergen.
Atwell, Eric Steven, Geoffrey Leech and Roger Garside 1984 "Analysis of the LOB Corpus: progress and prospects" in Jan Aarts and Willem Meijs (eds.) "Corpus Linguistics: Proceedings of the ICAME 4th International Conference", pp40-52, Rodopi, Amsterdam.
Atwell, Eric Steven 1987a "How to detect grammatical errors in a text without parsing it" in Bente Maegaard (ed), "Proceedings of the Third Conference of European Chapter of the Association for Computational Linguistics", pp38-45, Association for Computational Linguistics, New Jersey.
Atwell, Eric Steven 1987b "A parsing expert system which learns from corpus analysis" in Willem Meijs (ed.) "Corpus Linguistics and Beyond: Proceedings of the ICAME 7th International Conference", pp227-235, Rodopi, Amsterdam.
Atwell, Eric Steven, and Stephen Elliott 1987 "Dealing with ill-formed English text" in Roger Garside, Geoffrey Leech and Geoffrey Sampson (eds) "The computational analysis of English: a corpus-based approach", pp.120-138, Longman, Harlow.
Atwell, Eric Steven, and Nikos Drakos 1987 "Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text" in Bente Maegaard (ed.) "Proceedings of the Third Conference of European Chapter of the Association for Computational Linguistics", pp56-63, Association for Computational Linguistics, New Jersey.
Atwell, Eric Steven 1988 "Transforming a Parsed Corpus into a Corpus Parser" In Merja Kyto, Ossi Ihalainen and Matti Risanen (eds) "Corpus Linguistics, hard and soft: Proceedings of the ICAME 8th International Conference", pp61-70, Rodopi, Amsterdam.
Atwell, Eric Steven, Clive Souter and Tim O'Donoghue 1988 "Prototype Parser 1" COMMUNAL report no.17 to MoD, School of Computer Studies, Leeds University.
Atwell, Eric 1993a "Linguistic Constraints for Large-Vocabulary Speech Recognition" in Eric Atwell (ed) "Knowledge at Work in Universities", pp26-32, Leeds University Press, Leeds.
Atwell, Eric 1993b "Introduction and Overview of the HEFC Knowledge Based Systems Initiative" in Eric Atwell (ed) "Knowledge at Work in Universities", pp1-5, Leeds University Press, Leeds.
Atwell, Eric 1993c "Corpus-Based Statistical Modelling of English Grammar" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp195-215, Rodopi, Amsterdam.
Atwell, Eric, Simon Arnfield, George Demetriou, Steve Hanlon, John Hughes, Uwe Jost, Rob Pocock, Clive Souter and Joerg Ueberla 1993 "Multi-level Disambiguation Grammar Inferred from English Corpus, Treebank and Dictionary" in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp91-97, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Atwell, Eric 1994 "Speech-Oriented Probabilistic Parser Project: Final Report to MoD" School of Computer Studies, Leeds University.
Atwell, Eric, John Hughes and Clive Souter 1994 "AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models" in Judith Klavans (ed.) "Proceedings of ACL workshop on The Balancing Act: Combining Symbolic and Statistical Approaches to Language", pp 21-28, Association for Computational Linguistics, New Jersey.
Atwell, Eric, and Paul McKevitt 1994 "Pragmatic linguistic constraint models for large-vocabulary speech processing" in Paul McKevitt (ed.) "Integrating Speech and Natural Language Processing: AAAI94 Workshop Proceedings" pp58-64, American Association for Artificial Intelligence, Washington USA.
Atwell, Eric, Gavin Churcher and Clive Souter 1995 "Developing a corpus-based grammar model within a commercial continuous speech recognition package" In Collingham R (ed) Papers for the Institute of Acoustics Speech Group meeting on Integrating Speech Recognition and Natural Language Processing, pp5-6, Durham University.
Black, Ezra, Roger Garside, and Geoffrey Leech (eds.) 1993 "Statistically-driven computer grammars of English: the IBM/Lancaster approach", Rodopi, Amsterdam.
Demetriou, George 1992 "Lexical Disambiguation Using Constraint Handling In Prolog (CHIP)" MSc thesis, School of Computer Studies, Leeds University.
Demetriou, George 1993 "Lexical Disambiguation Using CHIP (Constraint Handling In Prolog)" in "Proceedings of the sixth European conference of the Association for Computational Linguistics", pp431-436, Association for Computational Linguistics, New Jersey.
Demetriou, George, and Eric Atwell 1994 "Machine-Learnable, Non-Compositional Semantics for Domain Independent Speech or Text Recognition" in "Proceedings of 2nd Hellenic-European Conference on Mathematics and Informatics (HERMIS)", pp103-104, Athens University of Economics and Business.
Eyes, Elizabeth, and Geoffrey Leech 1993 "Progress in UCREL research: improving corpus annotation practices" in Jan Aarts, Pieter de Haan and Nelleke Oostdijk (eds.) "English language corpora: design, analysis and exploitation", pp123-144, Rodopi, Amsterdam.
Gazdar, Gerald, and Christopher Mellish 1989 "Natural Language Processing in Pop-11: an introduction to computational linguistics" Addison-Wesley, Reading.
Gros, Jerneja, France Mihelic and Nikola Pavesic 1994 "Sentence hypothesization in a speech recognition and understanding system for the Slovene spoken language" in Lindsay Evett and Tony Rose (eds) "Proceedings of AISB workshop on Computational Linguistics for Speech and Handwriting Recognition", pp 91-96, Leeds University.
Haigh, Robin, Geoffrey Sampson and Eric Atwell 1988 "Project APRIL - a progress report" in "Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL)", pp104-112, Association for Computational Linguistics, New Jersey.
Hanlon, Stephen 1994 "A computational theory of contextual knowledge in machine reading" PhD thesis, School of Computer Studies, Leeds University.
Heritage, John 1988 "Explanations as accounts: a conversation analytic perspective" in Charles Antaki (ed.), "Analysing everyday explanation: a casebook of methods", pp127-144 Sage Publications, London.
Howarth, Peter 1995 "A computer-assisted study of collocations in academic prose, with special reference to grammatical structure and stylistic value" PhD thesis, School of English, Leeds University.
Hughes, John 1989 "A learning interface to the Realistic Annealing Parser" BSc Project Report, School of Computer Studies, Leeds University.
Hughes, John 1994 "Automatically acquiring a classification of words" PhD thesis, School of Computer Studies, Leeds University.
Hughes, John, and Eric Atwell 1994 "The Automated Evaluation of Inferred Word Classifications" in Tony Cohn (ed.) "Proceedings of European Conference on Artificial Intelligence (ECAI)", pp535-539, John Wiley, Chichester.
Hughes, John, Clive Souter and Eric Atwell 1995 "Automatic extraction of tagset mappings from Parallel-Annotated Corpora" in Evelyne Tzoukermann and Susan Armstrong (eds) "From text to tags: issues in multilingual language analysis" pp10-17, Proceedings of ACL-SIGDAT workshop, University College Dublin.
Jelinek, Fred 1990 "Self-organized language modeling for speech recognition" in Alex Waibel and Kai-Fu Lee (eds) "Readings in Speech Recognition", pp450-506 Morgan Kaufmann, San Mateo California.
Johansson, Stig, Eric Steven Atwell, Roger Garside and Geoffrey Leech 1986 "The Tagged LOB Corpus" Norwegian Computing Centre for the Humanities, Bergen.
Jost, Uwe and Eric Atwell 1993 "Deriving a probabilistic grammar of semantic markers from unrestricted English text" in Simon Lucas (ed) "Grammatical Inference: theory, applications, and alternatives", pp91-97, Colloquium Proceedings 1993/092, Institution of Electrical Engineers, London.
Jost, Uwe 1994 "Probabilistic language modelling for speech recognition" MSc thesis, School of Computer Studies, Leeds University.
Keenan, Francis 1992 "Large vocabulary syntactic analysis for text recognition" PhD thesis, Department of Computing, Nottingham Trent University.
Leech, Geoffrey 1983 "Principles of pragmatics" Longman, Harlow.
Leech, Geoffrey, Roger Garside, and Eric Atwell 1983a "Recent developments in the use of computer corpora in English language research" in Transactions of the Philological Society, Volume 1983, pp.23-40, Basil Blackwell, Oxford.
Leech, Geoffrey, Roger Garside, and Eric Steven Atwell 1983b "The automatic grammatical tagging of the LOB corpus" ICAME Journal no.7, pp13-33 Norwegian Computing Centre for the Humanities, Bergen.
Leech, Geoffrey 1986 "Automatic grammatical analysis and its educational applications" in Geoffrey Leech and Christopher Candlin (eds) "Computers in English language teaching and research: selected papers from the British Council Symposium" pp204-214, Longman, Harlow.
Leech, Geoffrey, and Roger Garside 1991 "Running a grammar factory: the production of syntactically analysed corpora or 'treebanks'" in Stig Johansson and Anna-Brita Stenstrom (eds) "English computer corpora", pp15-32 Mouton de Gruyter, Berlin.
Leech, Geoffrey, and Steven Fligelstone 1992 "Computers and corpus analysis" in Christopher Butler (ed) "Computers and written texts" pp115-140, Basil Blackwell, Oxford.
McKevitt, Paul 1991 "Analysing coherence of intention in natural-language dialogue" PhD thesis, Department of Computer Science, Exeter University.
Modd, Dan, and Eric Atwell 1994 "A Word Hypothesis Lattice Corpus - a benchmark for linguistic constraint models" in Lindsay Evett and Tony Rose (eds) "Proceedings of AISB workshop on Computational Linguistics for Speech and Handwriting Recognition", pp191-198, Leeds University.
O'Donoghue, Tim 1993 "Reversing the process of generation in systemic grammar" PhD thesis, School of Computer Studies, Leeds University.
Phillips, J D and Henry Thompson 1987 "A parsing tool for the Natural Language theme" Software Paper no.5, Department of Artificial Intelligence, Edinburgh University.
Pocock, Rob, and Eric Atwell 1993 "Treebank-Trained Probabilistic Parsing of Lattices" Technical Report 93.30, School of Computer Studies, Leeds University.
Souter, Clive 1993 "Harmonising a lexical database with a corpus-based grammar" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp181-193, Rodopi, Amsterdam.
Souter, Clive, and Eric Atwell 1992 "A Richly Annotated Corpus for Probabilistic Parsing" in "Proceedings of AAAI workshop on Statistically-Based NLP Techniques", pp28-38, American Association for Artificial Intelligence, San Jose California.
Souter, Clive, and Eric Atwell 1994 "Using Parsed Corpora: A review of current practice" in Nelleke Oostdijk and Pieter de Haan (eds) "Corpus-based Research Into Language", pp143-158 Rodopi, Amsterdam.
Taylor, Lita, Geoffrey Leech and Steve Fligelstone 1991 "A survey of English machine-readable corpora" in Stig Johansson and Anna-Brita Stenstrom (eds) "English computer corpora", pp319-354 Mouton de Gruyter, Berlin.
Wilson, Andrew, and Paul Rayson 1993 "Automatic content analysis of spoken discourse" in Clive Souter and Eric Atwell (eds.) "Corpus-based Computational Linguistics", pp215-227 Rodopi, Amsterdam.
Young, Steve 1995 "The state of the art in speech recognition" In Collingham R (ed) Papers for the Institute of Acoustics Speech Group meeting on Integrating Speech Recognition and Natural Language Processing, Durham University.