Introduction to Noun Phrase Detection

What is noun phrase detection?

Noun phrase detection is simply finding noun phrases in a given input document. For example, using a noun phrase detector on this document would yield terminology such as: regular expression, noun phrase, noun phrase detection, among others.

How is noun phrase finding done?

Noun phrase detection is traditionally done in two steps.

  1. A part of speech is assigned to each word in the document;
  2. A fixed pattern of part of speech tags is searched for, and the word/tag pairs matching the pattern are pulled out from the text as noun phrases.

Note that this approach works for more than just noun phrases. However, the detection and tagging of noun phrases is traditionally the focus of phrase detection. For noun phrases, the pattern, written as a regular expression over part of speech tags, is the following:

(Adjective | Noun)* (Noun Preposition)? (Adjective | Noun)* Noun

This regular expression is read in the following manner: zero or more adjectives or nouns, followed by an optional group of a noun and a preposition, followed again by zero or more adjectives or nouns, followed by a single noun. A sequence of tags matching this pattern indicates that the corresponding words make up a noun phrase.
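As an illustration, the pattern can be applied to a sequence of word/tag pairs with an ordinary regular expression engine. The following is a minimal sketch in Python, not termer's implementation: the tag names (ADJ, NOUN, PREP), the single-letter codes, and the function find_noun_phrases are all assumptions made for this example.

```python
import re

# Hypothetical single-letter codes for coarse part of speech tags.
# Any tag not listed here is encoded as "O" (other).
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PREP": "P"}

# (Adjective | Noun)* (Noun Preposition)? (Adjective | Noun)* Noun
NP_PATTERN = re.compile(r"[AN]*(?:NP)?[AN]*N")

def find_noun_phrases(tagged):
    """Return the word sequences whose tags match the noun phrase pattern."""
    # One character per token, so match offsets are also token indexes.
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases

tagged = [
    ("the", "DET"), ("regular", "ADJ"), ("expression", "NOUN"),
    ("matches", "VERB"), ("a", "DET"), ("noun", "NOUN"), ("phrase", "NOUN"),
]
print(find_noun_phrases(tagged))  # ['regular expression', 'noun phrase']
```

Encoding each tag as a single character lets the tag pattern be written as a plain string regular expression, at the cost of limiting the tag set to what fits in one character each.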

In addition to simply pulling out the phrases, it is common to do some simple post-processing to link variants together (for example, unpluralizing plural variants).
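The post-processing step might look something like the sketch below, which groups phrases that differ only in case or in the plural form of their last word. The suffix rules are a deliberately rough assumption for illustration. They will mishandle irregular plurals, and nothing here is taken from termer itself.

```python
from collections import defaultdict

def singularize(word):
    """Very rough English unpluralizer; a real system needs an exception list."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # boundaries -> boundary
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                # boxes -> box
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]                # phrases -> phrase
    return word

def link_variants(phrases):
    """Group phrases under a lowercased key with a singular last word."""
    groups = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        if not words:
            continue                    # skip empty strings
        words[-1] = singularize(words[-1])
        groups[" ".join(words)].append(phrase)
    return dict(groups)

print(link_variants(["noun phrase", "Noun phrases", "segment boundaries"]))
```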

Why is noun phrase finding important?

Noun phrase finding is especially important to indexing utilities that track the content of a document. It is well known that closed class words (words belonging to a part of speech with a limited number of members), such as determiners and other function words, carry very little information content. Open class words (such as nouns, adjectives, and verbs), for the most part, carry the bulk of the information. So by extracting these interesting components we get a good idea of what the document is about. The frequencies and distributions of noun phrases and other content phrases in a document help many search engines compute a profile of the document for matching user queries.

In any case, there are many other applications that use noun phrase finding as a basis. In our specific case, termer is tuned for a slightly different purpose: finding multi-paragraph segment boundaries. Multi-paragraph topical segmentation uses term distributions to calculate likely places at which an article switches topics. Please refer to an introduction to segmentation for more details.

How are the approaches taken here different from other ones?

In essence, noun phrase finding has always consisted of the same two steps: tagging the text and recognizing the noun phrases. At Columbia, we have implemented a simplistic noun phrase finder called termer as a front end utility for other uses, such as the segmentation program mentioned in the previous section.

Perhaps the most noticeable difference between other noun phrase detectors and the termer utility is that termer does not use a generalized part of speech tagger as its first step. Many detectors use a part of speech tagger program to assign a tag to each word. These programs aim to make the most accurate assignment possible, and occasionally they err. For the purposes of termer, we wanted to maximize the retrieval of noun phrases (that is, recall), and so we were willing to accept wrong tags as long as all possible noun phrases were caught (along with some phrases that are not noun phrases). This was accomplished using a specialized dictionary, derived from New York University's COMLEX dictionary, that was crafted to categorize any word that could be construed as a noun as a noun. Part of speech is then assigned by dictionary lookup, a process much faster than traditional part of speech tagging.
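The idea of recall-oriented tagging by dictionary lookup can be sketched as follows. The toy lexicon, the tag names, and the lookup_tag function are hypothetical stand-ins and do not reflect the format of COMLEX or of termer's dictionary. Treating unknown words as nouns is also an assumption made here in the spirit of favoring recall, not something stated above.

```python
# Hypothetical toy lexicon: each word maps to every part of speech it can take.
LEXICON = {
    "the": {"DET"},
    "of": {"PREP"},
    "article": {"NOUN"},
    "switches": {"NOUN", "VERB"},
    "likely": {"ADJ", "ADV"},
}

def lookup_tag(word, lexicon=LEXICON, unknown_tag="NOUN"):
    """Assign one tag per word, preferring NOUN whenever it is possible."""
    readings = lexicon.get(word.lower())
    if readings is None:
        return unknown_tag          # assumption: unknown words count as nouns
    if "NOUN" in readings:
        return "NOUN"               # any possible noun reading wins
    return sorted(readings)[0]      # otherwise an arbitrary but stable choice

sentence = "The article switches topics".split()
print([(word, lookup_tag(word)) for word in sentence])
```

Because each word is tagged with a single table lookup and no context is examined, "switches" is tagged as a noun even where it is used as a verb. That is the kind of wrong tag this approach accepts in exchange for speed and recall.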

A second difference is that termer isn't strictly just a noun phrase finder. It is also crafted to find pronouns and certain discourse markers, which makes it useful for the application that termer serves: finding segment boundaries.


min@columbia.edu | Version 1.0 | Created on: Tue Jul 1 10:44:45 EDT 1997 | Last Modified: Tue Jul 1 10:44:45 EDT 1997