Morphological Analysis and Lexicon Construction (Lecture 11)
Lecturer: Martin Volk
The original is available at: A gentle introduction to SGML.
XML stands for "Extensible Markup Language". An important source of information is the W3C, the World Wide Web Consortium.
The Text Encoding Initiative (TEI) is a committee founded in 1987 by three major professional organizations: ACL (Association for Computational Linguistics), ALLC (Association for Literary and Linguistic Computing), and ACH (Association for Computers and the Humanities). Its goal is to develop recommendations for a standard data format for text interchange in the humanities. These Guidelines are now available as a book [Sperberg-McQueen and Burnard 94] and can also be accessed on the WWW (TEI-Guidelines).
<tei.2>
  <teiHeader>
    <fileDesc>
      <titleStmt> <title> The shortest TEI Document</title> </titleStmt>
      <publicationStmt> <p> Published as part of TEI P2 </p> </publicationStmt>
      <sourceDesc> <p> no source </p> </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body> <p> Hello World! </p> </body>
  </text>
</tei.2>
<p> for paragraphs
<emph> for emphasis
<q> for quotations
<address> for addresses
<date> for dates
<abbr> for abbreviations
<ptr> for cross-references
<list> for lists
<l> for lines (e.g. in poems)
<tei.2>
  <teiHeader> ... </teiHeader>
  <text>
    <front> ... e.g. the preface of a book or the user's guide to a dictionary </front>
    <body> <p> Hello World! </p> </body>
    <back> ... e.g. appendices </back>
  </text>
</tei.2>
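The division of <text> into <front>, <body>, and <back> can be explored programmatically. The following is a minimal sketch, assuming a well-formed XML variant of the document (every end tag written out); TEI documents of this generation are SGML and may omit end tags, which an XML parser would reject. The sample content and the function name `text_parts` are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an XML-well-formed variant of the skeleton,
# with invented placeholder text in <front> and <back>.
SAMPLE = """
<tei.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>The shortest TEI Document</title></titleStmt>
      <publicationStmt><p>Published as part of TEI P2</p></publicationStmt>
      <sourceDesc><p>no source</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front><p>Preface</p></front>
    <body><p>Hello World!</p></body>
    <back><p>Appendix</p></back>
  </text>
</tei.2>
"""


def text_parts(xml_source):
    """Return (tag, paragraph texts) for each direct child of <text>."""
    root = ET.fromstring(xml_source.strip())
    text = root.find("text")
    return [(part.tag, [p.text.strip() for p in part.iter("p")])
            for part in text]


for tag, paragraphs in text_parts(SAMPLE):
    print(tag, paragraphs)
```

The header is skipped on purpose: only the children of <text> are visited, so the sketch mirrors the front/body/back split and nothing else.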
The TEI has also developed special tags for marking up dictionaries. They are described in [Sperberg-McQueen and Burnard 94] (chapter 12) and on the WWW, as a local copy at 12: Print Dictionaries (130 KB; approx. 55 pages) and in the original at 12: Print Dictionaries.
<text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
<front> contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.
<body> contains the whole body of a single unitary text, excluding any front or back matter.
<back> contains any appendixes, etc. following the main part of a text.
<div> contains a subdivision of the front, body, or back of a text.
<div0> contains the largest possible subdivision of the body of a text.
<div1> contains a first-level subdivision of the front, body, or back of a text (the largest, if <div0> is not used, the second largest if it is).
<entry> contains a reasonably well-structured dictionary entry.
<entryFree> contains a dictionary entry which does not necessarily conform to the constraints imposed by the <entry> element.
<superEntry> groups successive entries for a set of homographs.
The following elements are used to encode these top-level constituents:
<form> groups all the information on the written and spoken forms of one headword.
<gramGrp> groups morpho-syntactic information about a lexical item, e.g. <pos>, <gen>, <number>, <case>, or <itype> (inflectional class).
<def> contains definition text in a dictionary entry.
<trans> contains translation text and related information (within an entry in a multilingual dictionary).
<eg> (in a dictionary) contains an example text containing at least one occurrence of the word form, used in the sense being described; examples may be quoted from (named) authors or contrived.
<usg> contains usage information in a dictionary entry.
<xr> contains a phrase, sentence, or icon referring the reader to some other location in this or another text.
<etym> encloses the etymological information in a dictionary entry.
<re> contains a dictionary entry for a lexical item related to the headword, such as a compound phrase or derived form, embedded inside a larger entry.
<note> contains a note or annotation.
In a simple entry with no internal hierarchy, all top-level constituents appear at the <entry> level.
...
To simplify the electronic presentation of this document on systems with limited character sets, many of the pronunciations are presented using the transliteration found in the electronic edition of the Oxford Advanced Learner's Dictionary. Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated with |, regardless of their rendition in the source text. `` com.peti.tor /k@m"petit@(r)/ n person who competes. [OALD] ''
<entry>
  <form>
    <orth>competitor</orth>
    <hyph>com|peti|tor</hyph>
    <pron>k@m"petit@(r)</pron>
  </form>
  <gramGrp> <pos>n</pos> </gramGrp>
  <def>person who competes.</def>
</entry>
For the elements which appear within the <form> and <gramGrp> elements of this example, see section 12.3.1, Information on Written and Spoken Forms, and section 12.3.2, Grammatical Information.
As mentioned above, any top-level constituent can appear at any level when the hierarchical structure of the entry is more complex. The most obvious examples are <def> and <trans>, which appear at the <sense> level when several senses or translations exist: `` disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. [CED] ''
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n='1'><def>facts that disprove something.</def></sense>
  <sense n='2'><def>the act of disproving.</def></sense>
</entry>
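An entry with numbered senses maps naturally onto a small lexicon record. The sketch below assumes a well-formed XML version of the "disproof" entry and uses Python's standard-library `xml.etree.ElementTree`; the function name `read_entry` and the dictionary layout are hypothetical choices, not part of the TEI Guidelines.

```python
import xml.etree.ElementTree as ET

# Well-formed XML version of the "disproof" entry.
ENTRY = """
<entry>
  <form><orth>disproof</orth><pron>dIs"pru:f</pron></form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1"><def>facts that disprove something.</def></sense>
  <sense n="2"><def>the act of disproving.</def></sense>
</entry>
"""


def read_entry(xml_source):
    """Collect headword, parts of speech, and numbered definitions."""
    entry = ET.fromstring(xml_source.strip())
    return {
        "orth": entry.findtext("form/orth"),
        "pos": [pos.text for pos in entry.findall("gramGrp/pos")],
        # Map each sense number (attribute n) to its definition text.
        "senses": {s.get("n"): s.findtext("def")
                   for s in entry.findall("sense")},
    }


record = read_entry(ENTRY)
print(record["orth"], record["pos"])
for number, definition in record["senses"].items():
    print(number, definition)
```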
In the following example, <gramGrp> is used to distinguish two homographs: `` bray /breI/ n cry of an ass; sound of a trumpet. ▪ vt [VP2A] make a cry or sound of this kind. [OALD] ''
<entry>
  <form>
    <orth>bray</orth>
    <pron>breI</pron>
  </form>
  <hom>
    <gramGrp><pos>n</pos></gramGrp>
    <def>cry of an ass; sound of a trumpet.</def>
  </hom>
  <hom>
    <gramGrp> <pos>vt</pos> <subc>VP2A</subc> </gramGrp>
    <def>make a cry or sound of this kind.</def>
  </hom>
</entry>
Information of the same kind can appear at different levels within the same entry; here, grammatical information occurs both at entry and sense level. `` ca.reen /k@"ri:n/ vt,vi 1 [VP6A] turn (a ship) on one side for cleaning, repairing, etc. 2 [VP6A, 2A] (cause to) tilt, lean over to one side. [OALD] ''
<entry>
  <form>
    <orth>careen</orth>
    <hyph>ca|reen</hyph>
    <pron>k@"ri:n</pron>
  </form>
  <gramGrp> <pos>vt</pos> <pos>vi</pos> </gramGrp>
  <sense n='1'>
    <gramGrp><subc>VP6A</subc></gramGrp>
    <def>turn (a ship) on one side for cleaning, repairing, etc.</def>
  </sense>
  <sense n='2'>
    <gramGrp> <subc>VP6A</subc> <subc>VP2A</subc> </gramGrp>
    <def>(cause to) tilt, lean over to one side.</def>
  </sense>
</entry>
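When grammatical information is spread over several levels, a lexicon builder has to decide how to combine it. The sketch below makes one possible choice: it assumes that the entry-level <pos> values apply to every sense and attaches the sense-level <subc> codes to them. This inheritance rule and the function name `grammar_by_sense` are assumptions made for illustration; the markup itself does not prescribe them.

```python
import xml.etree.ElementTree as ET

# Well-formed XML version of the "careen" entry.
ENTRY = """<entry>
  <form><orth>careen</orth><hyph>ca|reen</hyph><pron>k@"ri:n</pron></form>
  <gramGrp><pos>vt</pos><pos>vi</pos></gramGrp>
  <sense n="1">
    <gramGrp><subc>VP6A</subc></gramGrp>
    <def>turn (a ship) on one side for cleaning, repairing, etc.</def>
  </sense>
  <sense n="2">
    <gramGrp><subc>VP6A</subc><subc>VP2A</subc></gramGrp>
    <def>(cause to) tilt, lean over to one side.</def>
  </sense>
</entry>"""


def grammar_by_sense(xml_source):
    """Combine entry-level <pos> with the <subc> codes of each sense."""
    entry = ET.fromstring(xml_source)
    # "gramGrp/pos" matches only the gramGrp that is a direct child of <entry>.
    entry_pos = [e.text for e in entry.findall("gramGrp/pos")]
    result = {}
    for sense in entry.findall("sense"):
        result[sense.get("n")] = {
            "pos": list(entry_pos),  # assumed to be inherited by every sense
            "subc": [e.text for e in sense.findall("gramGrp/subc")],
        }
    return result


for number, grammar in grammar_by_sense(ENTRY).items():
    print(number, grammar)
```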
Alone among the constituent groups, <form> can appear at the <superEntry> level as well as at the <entry>, <hom>, and <sense> levels: `` a.ban.don 1 /@"b&nd@n/ v [T1] 1 to leave completely and for ever; desert: The sailors abandoned the burning ship. 2 ... abandon 2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control: The people were so excited that they jumped and shouted with abandon / in gay abandon. [LDOCE] ''
<superEntry>
  <form>
    <orth>abandon</orth>
    <hyph>a|ban|don</hyph>
    <pron>@"b&nd@n</pron>
  </form>
  <entry n='1'>
    <gramGrp> <pos>v</pos> <subc>T1</subc> </gramGrp>
    <sense n='1'><def>to leave completely and for ever ...</def> <!-- ... --> </sense>
    <sense n='2'> <!-- ... --> </sense>
  </entry>
  <entry n='2'>
    <gramGrp> <pos>n</pos> <subc>U</subc> </gramGrp>
    <def>the state when one's feelings and actions are uncontrolled; freedom from control</def>
    <!-- ... -->
  </entry>
</superEntry>
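A <form> placed at the <superEntry> level is shared by the homograph entries it groups. The following sketch lists the homographs of such a group, using an entry's own <form> if it has one and otherwise the shared one. It assumes a well-formed XML version of the example, in which the `&` of the pronunciation has to be written as `&amp;`; the fallback rule and the function name `homographs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed XML version of the "abandon" superEntry; elided parts
# of the source remain elided as comments.
SUPER_ENTRY = """<superEntry>
  <form>
    <orth>abandon</orth><hyph>a|ban|don</hyph><pron>@"b&amp;nd@n</pron>
  </form>
  <entry n="1">
    <gramGrp><pos>v</pos><subc>T1</subc></gramGrp>
    <sense n="1"><def>to leave completely and for ever ...</def></sense>
    <sense n="2"><!-- ... --></sense>
  </entry>
  <entry n="2">
    <gramGrp><pos>n</pos><subc>U</subc></gramGrp>
    <def>the state when one's feelings and actions are uncontrolled; freedom from control</def>
  </entry>
</superEntry>"""


def homographs(xml_source):
    """Return (headword, homograph number, part of speech) per entry."""
    root = ET.fromstring(xml_source)
    shared_orth = root.findtext("form/orth")
    rows = []
    for entry in root.findall("entry"):
        # Fall back to the superEntry-level form if the entry has none.
        orth = entry.findtext("form/orth", default=shared_orth)
        rows.append((orth, entry.get("n"), entry.findtext("gramGrp/pos")))
    return rows


for row in homographs(SUPER_ENTRY):
    print(row)
```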
The individual constituents are declared below, each in the section which documents it in more detail.