Lexicon Organization with SGML

Morphological Analysis and Lexicon Construction (Lecture 11)

Lecturer: Martin Volk

Overview


Lexical Databases

A short introduction to SGML (German, approx. 8 pages)

A longer introduction to SGML (English, approx. 25 pages; local copy).

The original is available at: A gentle introduction to SGML.

Some clarifications about SGML

When should SGML be used?

Differences between SGML and XML

XML stands for "Extensible Markup Language". An important source of information is the W3C, the World Wide Web Consortium.

Goals of XML

Compared with SGML:
Compared with HTML:

Special features of XML

SGML lexicon entries following the TEI recommendation

The Text Encoding Initiative (TEI) is a committee founded in 1987 by the three major professional organizations ACL (Association for Computational Linguistics), ALLC (Association for Literary and Linguistic Computing) and ACH (Association for Computers and the Humanities). Goal: to develop recommendations for a standard data format for text interchange in the humanities. These Guidelines are now available as a book [Sperberg-McQueen and Burnard 94] and can also be accessed on the WWW (TEI Guidelines).

Outline of the Guidelines 'TEI Proposal 3'

Minimal TEI text structure

<tei.2>
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title> The shortest TEI Document</title>
    </titleStmt>
    <publicationStmt>
      <p> Published as part of TEI P2 </p>
    </publicationStmt>
    <sourceDesc>
      <p> no source </p>
    </sourceDesc>
  </fileDesc>
</teiHeader>

<text>
  <body>
  <p> Hello World! </p>
  </body>
</text>
</tei.2>  
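
The nesting of the minimal document can be checked mechanically. The following is only a minimal sketch in Python: it uses the standard-library XML parser (not an SGML parser), so it assumes that every end tag is written out. The function name `missing_parts` and the list `REQUIRED_PATHS` are hypothetical, and treating these paths as "required" is an assumption of the sketch, not a validation against the TEI DTD.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the minimal document (all end tags written out).
MINIMAL_TEI = """
<tei.2>
<teiHeader>
  <fileDesc>
    <titleStmt><title>The shortest TEI Document</title></titleStmt>
    <publicationStmt><p>Published as part of TEI P2</p></publicationStmt>
    <sourceDesc><p>no source</p></sourceDesc>
  </fileDesc>
</teiHeader>
<text><body><p>Hello World!</p></body></text>
</tei.2>
"""

# Assumption of this sketch: the parts that the minimal example contains.
REQUIRED_PATHS = [
    "teiHeader/fileDesc/titleStmt/title",
    "teiHeader/fileDesc/publicationStmt",
    "teiHeader/fileDesc/sourceDesc",
    "text/body",
]

def missing_parts(xml_text):
    """Return the paths from REQUIRED_PATHS that are absent from the document."""
    root = ET.fromstring(xml_text)
    return [path for path in REQUIRED_PATHS if root.find(path) is None]

print(missing_parts(MINIMAL_TEI))
```

A document without a header would report the three header paths as missing; a real check would validate against the DTD instead.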

SGML tags that can be used in all TEI documents.

<p>         for paragraphs
<emph>      for emphasis
<q>         for quotations
<address>   for addresses
<date>      for dates
<abbr>      for abbreviations
<ptr>       for pointers (cross-references)
<list>      for lists
<l>         for lines (e.g. in poems)

Standard text structure

<tei.2>
<teiHeader> ... </teiHeader>
<text>
  <front> 
  ... e.g. the preface of a book or the user's guide for a dictionary
  </front>
  <body>
  <p> Hello World! </p>
  </body>
  <back>
  ... e.g. appendices
  </back>
</text>
</tei.2>  

SGML tags for dictionaries (Print Dictionaries)

The TEI has also developed special tags for marking up dictionaries. They are described in [Sperberg-McQueen and Burnard 94] (Chapter 12) and on the WWW, locally at 12: Print Dictionaries (130 KB; approx. 55 pages) and in the original at 12: Print Dictionaries.

Overall Structure

<text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.

<front> contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.

<body> contains the whole body of a single unitary text, excluding any front or back matter.

<back> contains any appendixes, etc. following the main part of a text.

<div> contains a subdivision of the front, body, or back of a text.

<div0> contains the largest possible subdivision of the body of a text.

<div1> contains a first-level subdivision of the front, body, or back of a text (the largest, if <div0> is not used, the second largest if it is).

<entry> contains a reasonably well-structured dictionary entry.

<entryFree> contains a dictionary entry which does not necessarily conform to the constraints imposed by the <entry> element.

<superEntry> groups successive entries for a set of homographs.

Groups and Constituents

As noted above, dictionary entries, and subordinate levels within dictionary entries, may comprise several constituent parts, each providing a different type of information about the word treated. The top-level constituents of dictionary entries are:

Any of the hierarchical levels (<entry>, <entryFree>, <hom>, <sense>) may contain any of these top-level constituents, since information about word form, particular grammatical information, special pronunciation, usage information, etc., may apply to an entire entry, or to only one homograph, or only to a particular sense. The examples below illustrate this point.

The following elements are used to encode these top-level constituents:

In a simple entry with no internal hierarchy, all top-level constituents appear at the <entry> level.

...

To simplify the electronic presentation of this document on systems with limited character sets, many of the pronunciations are presented using the transliteration found in the electronic edition of the Oxford Advanced Learner's Dictionary. Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated with |, regardless of their rendition in the source text. `` com.peti.tor /k@m"petit@(r)/ n person who competes. [OALD] ''

<entry>
  <form>
    <orth>competitor</orth>
    <hyph>com|peti|tor</hyph>
    <pron>k@m"petit@(r)</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <def>person who competes.</def>
</entry>
For the elements which appear within the <form> and <gramGrp> elements of this example, see section 12.3.1, Information on Written and Spoken Forms, and section 12.3.2, Grammatical Information.
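
To illustrate how such a flat entry might be read by a program, here is a minimal sketch in Python. It relies on the standard-library XML parser, which works here only because all end tags are present; the function name `flat_entry` and the keys of the resulting dictionary are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry>
  <form>
    <orth>competitor</orth>
    <hyph>com|peti|tor</hyph>
    <pron>k@m"petit@(r)</pron>
  </form>
  <gramGrp>
    <pos>n</pos>
  </gramGrp>
  <def>person who competes.</def>
</entry>
"""

def flat_entry(xml_text):
    """Turn an entry without internal hierarchy into a flat dictionary."""
    entry = ET.fromstring(xml_text)
    hyph = entry.findtext("form/hyph")
    return {
        "orth": entry.findtext("form/orth"),
        # The transcription marks hyphenation points with "|".
        "syllables": hyph.split("|") if hyph else [],
        "pron": entry.findtext("form/pron"),
        "pos": entry.findtext("gramGrp/pos"),
        "def": entry.findtext("def"),
    }

record = flat_entry(ENTRY)
print(record["orth"], record["syllables"], record["pos"])
```

Because all constituents sit directly at the <entry> level, fixed paths such as `form/orth` are sufficient; entries with <hom> or <sense> levels need more care.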

As mentioned above, any top-level constituent can appear at any level when the hierarchical structure of the entry is more complex. The most obvious examples are <def> and <trans>, which appear at the <sense> level when several senses or translations exist: `` disproof (dIs"pru:f) n. 1. facts that disprove something. 2. the act of disproving. [CED] ''

<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n='1'><def>facts that disprove something.</def></sense>
  <sense n='2'><def>the act of disproving.</def></sense>
</entry>
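
When definitions move down to the <sense> level, a program has to iterate over the senses instead of reading a single <def>. The sketch below shows one possible way in Python; `numbered_senses` is a hypothetical helper, and the standard-library XML parser is assumed to be an acceptable stand-in for an SGML parser because the sample is written with all end tags.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n='1'><def>facts that disprove something.</def></sense>
  <sense n='2'><def>the act of disproving.</def></sense>
</entry>
"""

def numbered_senses(xml_text):
    """Return (sense number, definition) pairs in document order."""
    entry = ET.fromstring(xml_text)
    return [(s.get("n"), s.findtext("def")) for s in entry.findall("sense")]

for number, definition in numbered_senses(ENTRY):
    print(number, definition)
```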

In the following example, <gramGrp> is used to distinguish two homographs: `` bray /breI/ n cry of an ass; sound of a trumpet. ▪ vt [VP2A] make a cry or sound of this kind. [OALD] ''

<entry>
  <form>
    <orth>bray</orth>
    <pron>breI</pron>
  </form>
  <hom>
    <gramGrp><pos>n</pos></gramGrp>
    <def>cry of an ass; sound of a trumpet.</def>
  </hom>
  <hom>
    <gramGrp>
      <pos>vt</pos>
      <subc>VP2A</subc>
    </gramGrp>
    <def>make a cry or sound of this kind.</def>
  </hom>
</entry>
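
A lexicon tool might want one record per homograph, each carrying the form that is stated only once at entry level. The following Python sketch shows this under the assumption that the entry-level <form> applies to every <hom>; the function name `homograph_records` and the record layout are hypothetical.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry>
  <form>
    <orth>bray</orth>
    <pron>breI</pron>
  </form>
  <hom>
    <gramGrp><pos>n</pos></gramGrp>
    <def>cry of an ass; sound of a trumpet.</def>
  </hom>
  <hom>
    <gramGrp>
      <pos>vt</pos>
      <subc>VP2A</subc>
    </gramGrp>
    <def>make a cry or sound of this kind.</def>
  </hom>
</entry>
"""

def homograph_records(xml_text):
    """Build one record per <hom>, copying the headword from the entry level."""
    entry = ET.fromstring(xml_text)
    headword = entry.findtext("form/orth")
    records = []
    for hom in entry.findall("hom"):
        records.append({
            "orth": headword,
            "pos": hom.findtext("gramGrp/pos"),
            "subc": hom.findtext("gramGrp/subc"),  # None if no <subc> is given
            "def": hom.findtext("def"),
        })
    return records

for rec in homograph_records(ENTRY):
    print(rec["orth"], rec["pos"], rec["subc"])
```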
 

Information of the same kind can appear at different levels within the same entry; here, grammatical information occurs both at entry and homograph level. `` ca.reen /k@"ri:n/ vt,vi 1 [VP6A] turn (a ship) on one side for cleaning, repairing, etc. 2 [VP6A, 2A] (cause to) tilt, lean over to one side. [OALD] ''

<entry>
  <form>
    <orth>careen</orth>
    <hyph>ca|reen</hyph>
    <pron>k@"ri:n</pron>
  </form>
  <gramGrp>
    <pos>vt</pos>
    <pos>vi</pos>
  </gramGrp>
  <sense n='1'>
    <gramGrp><subc>VP6A</subc></gramGrp>
    <def>turn (a ship) on one side for cleaning,
         repairing, etc.</def>
  </sense>
  <sense n='2'>
    <gramGrp>
      <subc>VP6A</subc>
      <subc>VP2A</subc>
    </gramGrp>
    <def>(cause to) tilt, lean over to one side.</def>
  </sense>
</entry>
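
One way to work with grammatical information spread over two levels is to let each sense fall back to the entry level for whatever it does not state itself. This fallback rule is an interpretation made for this sketch, not something prescribed by the Guidelines; the function name `grammar_per_sense` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

ENTRY = """
<entry>
  <form>
    <orth>careen</orth>
    <hyph>ca|reen</hyph>
    <pron>k@"ri:n</pron>
  </form>
  <gramGrp>
    <pos>vt</pos>
    <pos>vi</pos>
  </gramGrp>
  <sense n='1'>
    <gramGrp><subc>VP6A</subc></gramGrp>
    <def>turn (a ship) on one side for cleaning, repairing, etc.</def>
  </sense>
  <sense n='2'>
    <gramGrp>
      <subc>VP6A</subc>
      <subc>VP2A</subc>
    </gramGrp>
    <def>(cause to) tilt, lean over to one side.</def>
  </sense>
</entry>
"""

def grammar_per_sense(xml_text):
    """Map each sense number to its part of speech and subcategorization.

    Values missing at the sense level are taken from the entry level.
    """
    entry = ET.fromstring(xml_text)
    # "gramGrp/pos" only matches the <gramGrp> that is a direct child of <entry>.
    entry_pos = [e.text for e in entry.findall("gramGrp/pos")]
    entry_subc = [e.text for e in entry.findall("gramGrp/subc")]
    result = {}
    for sense in entry.findall("sense"):
        pos = [e.text for e in sense.findall("gramGrp/pos")]
        subc = [e.text for e in sense.findall("gramGrp/subc")]
        result[sense.get("n")] = {
            "pos": pos or entry_pos,
            "subc": subc or entry_subc,
        }
    return result

for number, info in grammar_per_sense(ENTRY).items():
    print(number, info["pos"], info["subc"])
```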

Alone among the constituent groups, <form> can appear at the <superEntry> level as well as at the <entry>, <hom>, and <sense> levels: `` a.ban.don 1 /@"b&nd@n/ v [T1] 1 to leave completely and for ever; desert: The sailors abandoned the burning ship. 2 ... abandon 2 n [U] the state when one's feelings and actions are uncontrolled; freedom from control: The people were so excited that they jumped and shouted with abandon / in gay abandon. [LDOCE] ''

<superEntry>
  <form>
    <orth>abandon</orth>
    <hyph>a|ban|don</hyph>
    <pron>@"b&amp;nd@n</pron>
  </form>
  <entry n='1'>
    <gramGrp> <pos>v</pos> <subc>T1</subc> </gramGrp>
    <sense n='1'><def>to leave completely and for ever ...</def>
      <!-- ... -->
    </sense>
    <sense n='2'>
      <!-- ... -->
    </sense>
  </entry>
  <entry n='2'>
    <gramGrp> <pos>n</pos> <subc>U</subc> </gramGrp>
    <def>the state when one's feelings and actions are
            uncontrolled; freedom from control</def>
    <!-- ... -->
  </entry>
</superEntry>
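
A <form> stated once at the <superEntry> level can be distributed to the contained entries when the data is loaded. The Python sketch below uses an abridged copy of the example (senses and definitions left out) and assumes that an entry without its own <form> uses the shared one; `expand_super_entry` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Abridged version of the example: only form and grammatical information.
SUPER_ENTRY = """
<superEntry>
  <form>
    <orth>abandon</orth>
    <hyph>a|ban|don</hyph>
    <pron>@"b&amp;nd@n</pron>
  </form>
  <entry n='1'>
    <gramGrp><pos>v</pos><subc>T1</subc></gramGrp>
  </entry>
  <entry n='2'>
    <gramGrp><pos>n</pos><subc>U</subc></gramGrp>
  </entry>
</superEntry>
"""

def expand_super_entry(xml_text):
    """Return one (orth, pron, n, pos, subc) tuple per contained entry."""
    root = ET.fromstring(xml_text)
    shared_form = root.find("form")
    rows = []
    for entry in root.findall("entry"):
        form = entry.find("form")
        if form is None:
            # Assumption: fall back to the form given at <superEntry> level.
            form = shared_form
        rows.append((
            form.findtext("orth"),
            form.findtext("pron"),
            entry.get("n"),
            entry.findtext("gramGrp/pos"),
            entry.findtext("gramGrp/subc"),
        ))
    return rows

for row in expand_super_entry(SUPER_ENTRY):
    print(row)
```

The parser resolves the entity reference `&amp;`, so the pronunciation string in each row contains a plain `&`.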

The individual constituents are declared below, each in the section which documents it in more detail.


Martin Volk
Source: http://www.ifi.unizh.ch