SGML is an international standard for the description of marked-up electronic text. More exactly, SGML is a metalanguage, that is, a means of formally describing a language, in this case, a markup language. Before going any further we should define these terms.
Historically, the word markup has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special markup codes inserted into electronic texts to govern formatting, printing, or other processing.
Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. At a banal level, all printed texts are encoded in this sense: punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might be regarded as a kind of markup, the function of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from scriptio continua: a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be interpreted.
By markup language we mean a set of markup conventions used together for encoding texts. A markup language must specify what markup is allowed, what markup is required, how markup is to be distinguished from text, and what the markup means. SGML provides the means for doing the first three; documentation such as these Guidelines is required for the last.
The present chapter attempts to give an informal introduction---much less formal than the standard itself---to those parts of SGML of which a proper understanding is necessary to make best use of these Guidelines.
With descriptive instead of procedural markup the same document can readily be processed by many different pieces of software, each of which can apply different processing instructions to those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different sorts of processing instructions can be associated with the same parts of the file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, might print names of persons and places in a distinctive typeface.
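The point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text and the tag names note and name are hypothetical (they are not TEI element names), and regular expressions stand in for a real SGML processor.

```python
import re

# Hypothetical sample: <note> and <name> are illustrative tags only.
TEXT = ("<p><name>Rosalind</name> mocks the verses"
        "<note>Compare Act 3.</note> left by <name>Orlando</name>.</p>")

def strip_tags(s):
    """Remove every tag, keeping the character data."""
    return re.sub(r"<[^>]+>", "", s)

def for_content_analysis(s):
    """Disregard footnotes entirely, then discard the remaining tags."""
    return strip_tags(re.sub(r"<note>.*?</note>", "", s))

def collect_notes(s):
    """Gather footnote text so that a formatter could print it elsewhere."""
    return re.findall(r"<note>(.*?)</note>", s)

def index_names(s):
    """Extract the tagged names, for example to build an index."""
    return sorted(set(re.findall(r"<name>(.*?)</name>", s)))

def highlight_names(s):
    """Render the same names distinctively (here, in upper case)."""
    return re.sub(r"<name>(.*?)</name>", lambda m: m.group(1).upper(), s)
```

Each function applies a different processing instruction to the same descriptive markup; none of them requires the text itself to be changed.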
If documents are of known types, a special purpose program (called a parser) can be used to process a document claiming to be of a particular type and check that all the elements required for that document type are indeed present and correctly ordered. More significantly, different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document structure information, and which can thus behave in a more intelligent fashion.
Structural units of this kind are most often used to identify specific locations or reference points within a text (``the third sentence of the second paragraph in chapter ten''; ``canto 10, line 1234''; ``page 412,'' etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (``is the average sentence length of section 2 different from that of section 5?'' ``how many paragraphs separate each occurrence of the word nature?'' ``how many pages?''). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text (``the 93rd speech by Horatio in Act 2'') than for facilitating comparisons between the words used by one character and those of another, or those used by the same character at different points of the play.
In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of white space and so forth.
These textual structures overlap with each other in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's Tristram Shandy for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and page divisions. For many types of research, it is the interplay between different levels of analysis which is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.
Within a marked up text (a document instance), each element must be explicitly marked or tagged in some way. The standard provides for a variety of different ways of doing this, the most commonly used being to insert a tag at the beginning of the element (a start-tag) and another at its end (an end-tag). The start- and end-tag pair are used to bracket off the element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows:
    ... Rosalind's remarks <quote>This is the silliest stuff that ere I heard of!</quote> clearly indicate ...
As this example shows, a start-tag takes the form <name>, where the opening angle bracket indicates the start of the start-tag, ``name'' is the generic identifier of the element which is being delimited, and the closing angle bracket indicates the end of a tag. An end-tag takes an identical form, except that the opening angle bracket is followed by a solidus (slash) character, so that the corresponding end-tag would be </name>.
To illustrate this, we will consider a very simple structural model. Let us assume that we wish to identify within an anthology only poems, their titles, and the stanzas and lines of which they are composed. In SGML terms, our document type is the anthology, and it consists of a series of poems. Each poem has embedded within it one element, a title, and several occurrences of another, a stanza, each stanza having embedded within it a number of line elements. Our example is taken from William Blake's Songs of innocence and experience (1794); the markup is designed for illustrative purposes and is not TEI-conformant. Fully marked up, a text conforming to this model might appear as follows:
    <anthology>
      <poem><title>The SICK ROSE</title>
        <stanza>
          <line>O Rose thou art sick.</line>
          <line>The invisible worm,</line>
          <line>That flies in the night</line>
          <line>In the howling storm:</line>
        </stanza>
        <stanza>
          <line>Has found out thy bed</line>
          <line>Of crimson joy:</line>
          <line>And his dark secret love</line>
          <line>Does thy life destroy.</line>
        </stanza>
      </poem>
      <!-- more poems go here -->
    </anthology>
It should be stressed that this example does not use the same names as are proposed for corresponding elements elsewhere in these Guidelines: the above is not a valid TEI document. It will however serve as an introduction to the basic notions of SGML. White space and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the SGML encoding itself. Also, the line
    <!-- more poems go here -->
is an SGML comment and is not treated as part of the text.
This example makes no assumptions about the rules governing, for example, whether or not a title can appear in places other than preceding the first stanza, or whether lines can appear which are not included in a stanza: that is why its markup appears so verbose. In such cases, the beginning and end of every element must be explicitly marked, because there are no identifiable rules about which elements can appear where. In practice, however, rules can usually be formulated to reduce the need for so much tagging. For example, considering our greatly over-simplified model of a poem, we could state the following rules:
    <anthology>
      <poem><title>The SICK ROSE
        <stanza>
          <line>O Rose thou art sick.
          <line>The invisible worm,
          <line>That flies in the night
          <line>In the howling storm:
        <stanza>
          <line>Has found out thy bed
          <line>Of crimson joy:
          <line>And his dark secret love
          <line>Does thy life destroy.
      <poem>
        <!-- more poems go here -->
    </anthology>
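How a processor might recover the omitted end-tags can be sketched as follows. This is a minimal illustration, not an SGML parser: it assumes a hypothetical table giving each element of our toy model exactly one permitted parent, and it closes open elements whenever a start-tag appears that cannot be nested inside them.

```python
import re

# Assumed nesting rules for the toy model: each element names its only
# permitted parent. The table is an illustration, not part of SGML itself.
PARENT = {"anthology": None, "poem": "anthology", "title": "poem",
          "stanza": "poem", "line": "stanza"}

TOKEN = re.compile(r"<!--.*?-->|</?\w+>|[^<]+", re.S)

def normalize(text):
    """Return the text with every omitted end-tag made explicit."""
    out, stack = [], []
    for token in TOKEN.findall(text):
        if token.startswith("<!--") or not token.startswith("<"):
            out.append(token)                     # comment or character data
        elif token.startswith("</"):
            name = token[2:-1]
            if name not in stack:
                raise ValueError(f"unexpected end-tag {token}")
            # Close everything still open inside the element being ended.
            while stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            stack.pop()
            out.append(token)
        else:
            name = token[1:-1]
            # Close open elements until the permitted parent is on top.
            while stack and stack[-1] != PARENT[name]:
                out.append(f"</{stack.pop()}>")
            stack.append(name)
            out.append(token)
    out.extend(f"</{name}>" for name in reversed(stack))
    return "".join(out)

MINIMIZED = ("<anthology><poem><title>The SICK ROSE"
             "<stanza><line>O Rose thou art sick.<line>The invisible worm,"
             "<stanza><line>Has found out thy bed"
             "<poem><title>Next</anthology>")
FULL = normalize(MINIMIZED)
```

In this sketch a new start-tag for a poem implicitly ends the previous poem, together with any stanza and line still open inside it.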
The ability to use rules stating which elements can be nested
within others to simplify markup is a very important
characteristic of SGML. Before considering these rules
further, you may wish to consider how text marked up in the
form above could be processed by a computer for very many
different purposes. A simple indexing program could extract
only the relevant text elements in order to make a list of
titles, or of words used in the poem text; a simple formatting
program could insert blank lines between stanzas, perhaps
indenting the first line of each, or inserting a stanza
number. Different parts of each poem could be typeset in
different ways. A more ambitious analytic program could
relate the use of punctuation marks to stanzaic and metrical divisions.
Note that this simple example has not addressed the problem of marking
elements such as sentences explicitly; the implications of this are
discussed below in section 2.5.2, Concurrent Structures.
Scholars wishing to see the implications of changing
the stanza or line divisions chosen by the editor of this poem
can do so simply by altering the position of the tags. And of
course, the text as presented above can be transported from
one computer to another and processed by any program (or
person) capable of making sense of the tags embedded within it
with no need for the sort of transformations and translations
needed to move word processor files around.
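The kinds of processing just described can be sketched in Python. The sketch rests on one assumption that should be stated plainly: because the fully tagged version of the example happens to be well-formed XML, Python's XML parser is used here as a stand-in for an SGML parser. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Fully tagged form of the example; parsed with an XML parser as a stand-in
# for an SGML parser (minimized markup would need a real SGML parser).
DOC = """<anthology>
<poem><title>The SICK ROSE</title>
<stanza>
<line>O Rose thou art sick.</line>
<line>The invisible worm,</line>
<line>That flies in the night</line>
<line>In the howling storm:</line>
</stanza>
<stanza>
<line>Has found out thy bed</line>
<line>Of crimson joy:</line>
<line>And his dark secret love</line>
<line>Does thy life destroy.</line>
</stanza>
</poem>
<!-- more poems go here -->
</anthology>"""

root = ET.fromstring(DOC)

def titles(root):
    """A simple index: the list of poem titles."""
    return [title.text for title in root.iter("title")]

def words(root):
    """A simple word list drawn from verse lines only, not from titles."""
    found = set()
    for line in root.iter("line"):
        found.update(w.strip(".,:;!?").lower() for w in line.text.split())
    return sorted(found)

def format_poem(poem):
    """A simple formatter: number each stanza and separate stanzas by a blank line."""
    blocks = []
    for number, stanza in enumerate(poem.findall("stanza"), start=1):
        lines = [line.text for line in stanza.findall("line")]
        blocks.append(f"[{number}]\n" + "\n".join(lines))
    return "\n\n".join(blocks)
```

Calling format_poem(root.find("poem")) yields the two stanzas, each preceded by its number; the indexer and the formatter read the same tagged text without altering it.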
At present, SGML is most widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against pre-defined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of tagging accurately elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.
    <!ELEMENT anthology - - (poem+)>
    <!ELEMENT poem      - - (title?, stanza+)>
    <!ELEMENT title     - O (#PCDATA) >
    <!ELEMENT stanza    - O (line+) >
    <!ELEMENT line      O O (#PCDATA) >
These five lines are examples of formal SGML element declarations. A declaration, like an element, is delimited by angle brackets; the first character following the opening bracket must be an exclamation mark, followed immediately by one of a small set of SGML-defined keywords, specifying the kind of object being declared. The five declarations above are all of the same type: each begins with an ELEMENT keyword, indicating that it declares an element, in the technical sense defined above. Each consists of three parts: a name or group of names, two characters specifying minimization rules, and a content model. Each of these parts is discussed further below. Components of the declaration are separated by white space, that is, one or more blanks, tabs or newlines.
The first part of each declaration above gives the generic identifier of the element which is being declared, for example poem, title, etc. It is possible to declare several elements in one statement, as discussed below.
<!ELEMENT couplet O O (line1, line2) >
The elements <line1> and <line2> (which are distinguished to enable studies of rhyme scheme, for example) have exactly the same content model as the existing <line> element. They can therefore share the same declaration. In this situation, it is convenient to supply a name group as the first component of a single element declaration, rather than give a series of declarations differing only in the names used. A name group is a list of GIs connected by any group connector and enclosed in parentheses, as follows:
    <!ELEMENT (line | line1 | line2) O O (#PCDATA) >
The declaration for the <poem> element can now be changed to include all three possibilities:
    <!ELEMENT poem - O (title?, (stanza+ | couplet+ | line+) ) >
That is, a poem consists of an optional title, followed by one or several stanzas, or one or several couplets, or one or several lines. Note the difference between this definition and the following:
    <!ELEMENT poem - O (title?, (stanza | couplet | line)+ ) >
The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow for a single poem to contain a mixture of stanzas, couplets or blank verse.
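The difference can be made concrete by analogy with regular expressions, which behave much like content models in this respect. The sketch below is only an analogy: it encodes each child element as a single letter (an assumption of the example) and leaves out the optional title.

```python
import re

# Encode each child element as one letter: s = stanza, c = couplet, l = line.
# The optional title is left out to keep the comparison focused.
HOMOGENEOUS = re.compile(r"s+|c+|l+")   # (stanza+ | couplet+ | line+)
MIXED = re.compile(r"(?:s|c|l)+")       # (stanza | couplet | line)+

def matches(model, children):
    """True if the whole sequence of children satisfies the model."""
    return model.fullmatch(children) is not None
```

A poem of three stanzas ("sss") satisfies both models; two stanzas followed by a couplet ("ssc") satisfies only the second.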
Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. A refrain may be composed of repetitions of the line element, or it may simply be text, not divided into verse lines. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a content model such as the following:
    <!ELEMENT refrain - - (#PCDATA | line+)>
    <!ELEMENT poem    - O (title?, ( (line+) | (refrain?, (stanza, refrain?)+ ) )) >
That is, a poem consists of an optional title, followed by either a sequence of lines, or an un-named group, which starts with an optional refrain, followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as ``refrain - stanza - stanza - refrain'' follows this pattern, as does the sequence ``stanza - refrain - stanza - refrain.'' The sequence ``refrain - refrain - stanza - stanza'' does not, however, and neither does the sequence ``stanza - refrain - refrain - stanza.'' Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a title and a stanza they must appear in that order.
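The four sequences can be checked mechanically. The following sketch transcribes the content model for the poem into a regular expression over one-letter codes; the codes and the function name are assumptions of the example, and a real SGML parser would of course work from the declaration itself.

```python
import re

# T = title, L = line, R = refrain, S = stanza.
# Mirrors (title?, ( (line+) | (refrain?, (stanza, refrain?)+ ) ))
POEM = re.compile(r"T?(?:L+|R?(?:SR?)+)")

CODE = {"title": "T", "line": "L", "refrain": "R", "stanza": "S"}

def conforms(children):
    """children is a list of element names in document order."""
    encoded = "".join(CODE[name] for name in children)
    return POEM.fullmatch(encoded) is not None
```

For instance, conforms(["stanza", "refrain", "stanza", "refrain"]) holds, while a sequence containing two adjacent refrains is rejected.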
To cope with this, SGML allows for any content model to be further modified by means of an exception list. There are two types of exception: inclusions, that is, additional elements that can be included at any point in the model group or any of its constituent elements; and exclusions, that is, elements that cannot be included within the current model.
To extend our declarations further to allow for annotations and variant readings, which we will assume can appear anywhere within the text of a poem, we first need to add declarations for these two elements:
    <!ELEMENT (note | variant) - - (#PCDATA)>
The note and variant elements must have both start- and end-tags, since they can appear anywhere. Rather than add them to the content model for each type of poem, we can add them in the form of an inclusion list to the poem element, which now reads:
    <!ELEMENT poem - O (title?, (stanza+ | couplet+ | line+) ) +(note | variant) >
The plus sign at the start of the (NOTE | VARIANT) name list indicates that this is an inclusion exception. With this addition, notes or variants can appear at any point in the content of a poem element---even within elements (such as <title>) for which we have defined a content model of #PCDATA. They can thus also appear within notes or variants!
If we wanted for some reason to prevent notes or variants appearing within titles, we could add an exclusion exception to the declaration for <title> above:
    <!ELEMENT title - O (#PCDATA) -(note | variant) >
The minus sign at the start of the (NOTE | VARIANT) name list indicates that this is an exclusion exception. With this addition, notes and variants will be prohibited from appearing within titles, notwithstanding their potential inclusion implied by the previous addition to the content model for <poem>.
In the same way, we could prevent notes and variants from nesting within notes and variants by modifying the definition above to read
    <!ELEMENT (note | variant) - - (#PCDATA) -(note | variant) >
The meticulous reader will note that this precludes both variants within notes and notes within variants. Inclusion and exclusion exceptions should be used with care as their ramifications may not be immediately apparent.
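The combined effect of these exceptions can be sketched as a small lookup. The sketch is a simplification under two assumptions: exceptions are taken to apply to an element and to everything nested within it, and an exclusion is taken to override an inclusion. It ignores the content models themselves, and the function name is illustrative.

```python
# Exception lists taken from the declarations above.
INCLUSIONS = {"poem": {"note", "variant"}}
EXCLUSIONS = {"title": {"note", "variant"},
              "note": {"note", "variant"},
              "variant": {"note", "variant"}}

def permitted_by_exceptions(name, open_elements):
    """open_elements lists the currently open elements, outermost first.

    Returns True if some open element includes `name` and no open
    element excludes it.
    """
    included = any(name in INCLUSIONS.get(e, ()) for e in open_elements)
    excluded = any(name in EXCLUSIONS.get(e, ()) for e in open_elements)
    return included and not excluded
```

Under these assumptions a note is permitted inside a line of a poem, but not inside a title and not inside another note or variant.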
                         |-------------------title
                         |
                         |              |----line1
                         |              |----line2
          |------POEM1---|----stanza1---|----line3
          |              |              |----line4
          |              |
          |              |              |----line5
          |              |----stanza2---|----line6
          |                             |----line7
          |                             |----line8
anthology-|
          |
          |              |-------------------title
          |              |
          |              |              |----line1
          |              |              |----line2
          |------POEM2---|----stanza1---|----line3
                                        |----line4
                                        |----line5
Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since no word crosses a line boundary. But equally clearly there are many other trees that might be drawn which do not fit within this tree. We might, for example, be interested in syntactic structures --- which rarely respect the formal boundaries of verse. Or, to take a simpler example, we might want to represent the pagination of different editions of the same text.
One way of doing this would be to group the lines and titles of our current model into pages. A declaration for such an element is simple enough:
    <!ELEMENT page - - ((title?, line+)+) >
That is, a page consists of one or more unnamed groups, each of which contains an optional title, followed by a sequence of lines. (Note, incidentally, that this model prohibits a title appearing on its own at the foot of a page.) However, simply inserting the element <page> into the hierarchy already defined is not as easy as it might seem. Some poems are longer than a single page, and other pages contain more than one poem. We cannot therefore insert the element <page> between <anthology> and <poem> in the hierarchy, nor can it go between <poem> and <stanza>, nor yet in both places at once! What is needed is the ability to create a separate hierarchy, with the same elements at the bottom (the stanzas, lines and titles), but combined into a different superstructure. This is the ability which the CONCUR feature of SGML gives.
A separate document type definition must be created for each hierarchic tree into which the text is to be structured. The definition we have so far built up for the anthology looks, in full, like this:
    <!DOCTYPE anthology [
      <!ELEMENT anthology      - - (poem+) >
      <!ELEMENT poem           - - (title?, stanza+) >
      <!ELEMENT stanza         - O (line+) >
      <!ELEMENT (title | line) - O (#PCDATA) >
    ]>
As this example shows, the name of a document type must always be the same as the name of the largest element in it, that is the element at the top of the hierarchy. The syntax used is discussed further below (see section 2.9.2, The DTD). Let us now add to this declaration a second definition for a concurrent document type, which we will call a paged anthology, or <p.anth> for short:
    <!DOCTYPE p.anth [
      <!ELEMENT p.anth       - - (page+) >
      <!ELEMENT page         - - ((title?, line+)+) >
      <!ELEMENT (title|line) - O (#PCDATA) >
    ]>
We have now defined two different ways of looking at the same basic text---the PCDATA components grouped by both these document type definitions into lines or titles. In one view, the lines are grouped into stanzas and poems; in the other they are grouped into pages only. Notice that it is exactly the same text which is visible in both views: the two hierarchies simply allow us to arrange it in two different ways.
To mark up the two views, it will be necessary to indicate which hierarchy each element belongs to. This is done by including the name of the document type (the view) within parentheses immediately before the identifier concerned, inside both start- and end-tags. Thus, pages (which are only visible in the <p.anth> document type) must be tagged with a <(p.anth)page> tag at their start and a </(p.anth)page> at their end. In the same way, as poems and stanzas appear only in the <anthology> document type, they must now be tagged using <(anthology)poem> and <(anthology)stanza> tags respectively. For the line and title elements, however, which appear in both hierarchies, no document type specification need be given: any tag containing only a name is assumed to mark an element present in every active document type.
As a simple example, let us assume that Blake's poem appears in some paged anthology, with the page break occurring half way through the first stanza. The poem might then be marked up as follows:
    <(anthology)anthology>
    <(p.anth)p.anth>
    <(p.anth)page>
    <!-- other titles and lines on this page here -->
    <(anthology)poem><title>The SICK ROSE
    <(anthology)stanza>
    <line>O Rose thou art sick.
    <line>The invisible worm,
    </(p.anth)page>
    <(p.anth)page>
    <line>That flies in the night
    <line>In the howling storm:
    <(anthology)stanza>
    <line>Has found out thy bed
    <line>Of crimson joy:
    <line>And his dark secret love
    <line>Does thy life destroy.
    </(anthology)poem>
    <!-- rest of material on this page here -->
    </(p.anth)page>
    </(p.anth)p.anth>
    </(anthology)anthology>
It is now possible to select only the elements concerned with a particular view from the text, even though both are represented in the tagging. A processor concerned only with the pagination will see only those elements whose tags include the P.ANTH specification, or which have no specification at all. A processor concerned only with the ANTHOLOGY view of things will not see the page breaks. And a processor concerned to inter-relate the two views can do so unambiguously.
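Selecting one view can be sketched with a single substitution. This is an illustration only, assuming tags of the forms <(doctype)name>, </(doctype)name>, <name> and </name> with no attributes; the function name view and the decision to strip the document type qualifier from the output are choices made for the example, not features of CONCUR.

```python
import re

# A tag, optionally qualified by a document type name in parentheses.
TAG = re.compile(r"<(?P<slash>/?)(?:\((?P<doctype>[\w.]+)\))?(?P<gi>[\w.]+)>")

def view(text, doctype):
    """Keep tags belonging to `doctype` or to every doctype; drop the rest."""
    def keep(match):
        owner = match.group("doctype")
        if owner is None or owner == doctype:
            return f"<{match.group('slash')}{match.group('gi')}>"
        return ""
    return TAG.sub(keep, text)

SAMPLE = ("<(p.anth)page><(anthology)poem><title>The SICK ROSE"
          "<(anthology)stanza><line>O Rose thou art sick.</(p.anth)page>"
          "<(p.anth)page><line>That flies in the night"
          "</(anthology)poem></(p.anth)page>")
```

Here view(SAMPLE, "p.anth") keeps the page tags together with the unqualified title and line tags, while view(SAMPLE, "anthology") keeps the poem and stanza tags and drops the page breaks.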
A note of caution is appropriate: CONCUR is an optional feature of SGML, and not all available SGML software systems support it, while those which do, do not always do so according to the letter of the standard. For that reason, if for no other, wherever these Guidelines have identified a potential application of CONCUR, they invariably suggest alternative methods as well. For fuller discussion of these issues, see chapter 31, Multiple Hierarchies.
Note also that we cannot introduce a new element, a page number for example, into the <p.anth> document type, since there is no existing data in the <anthology> document type which could be fitted into it. One way of adding that extra information is discussed in the next section.
Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an id attribute), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as attribute-value pairs inside the start-tag for the element occurrence. An end-tag may not contain an attribute-value specification, since it would be redundant.
For example:
    <poem id=P1 status="draft"> ... </poem>
The <poem> element has been defined as having two attributes: id and status. For the instance of a <poem> in this example, represented here by an ellipsis, the id attribute has the value P1 and the status attribute has the value draft. An SGML processor can use the values of the attributes in any way it chooses; for example, a formatter might print a poem element which has the status attribute set to draft in a different way from one with the same attribute set to revised; another processor might use the same attribute to determine whether or not poem elements are to be processed at all. The id attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which can be used for cross reference purposes, as discussed further below.
Like elements, attributes are declared in the SGML document type declaration, using rather similar syntax. As well as specifying its name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute and a default value.
The following declarations could be used to define the two attributes we have specified above for the <poem> element:
    <!ATTLIST poem
          id      ID                             #IMPLIED
          status  (draft | revised | published)  draft    >
The declaration begins with the symbol ATTLIST, which introduces an attribute list specification. The first part of this specifies the element (or elements) concerned. In our example, attributes have been declared only for the <poem> element. If several elements share the same attributes, they may all be defined in a single declaration; just as with element declarations, several names may be given in a parenthesized list. Following this name (or list of names), is a series of rows, one for each attribute being declared, each containing three parts. These specify the name of the attribute, the type of value it takes, and a default value respectively.
Attribute names (id and status in this example) are subject to the same restrictions as other names in SGML; they need not be unique across the whole DTD, however, but only within the list of attributes for a given element.
The second part of an attribute specification can take one of two forms, both illustrated above. The first case uses one of a number of special keywords to declare what kind of value an attribute may take. In the example above, the special keyword ID is used to indicate that the attribute id will be used to supply a unique identifying value for each poem instance (see further the discussion below). Among other possible SGML keywords are
CDATA
IDREF
NMTOKEN
NUMBER
In the example above, a list of the possible values for the status attribute has been supplied. This means that a parser can check that every <poem> has a status attribute whose value is one of draft, revised, or published. Alternatively, if the declared value had been either CDATA or NMTOKEN, a parser would have accepted almost any string of characters (status=awful or status=12345678 if it had been NMTOKEN; status="anything goes" or status = "well, ALMOST anything" if it were CDATA). Sometimes, of course, the set of possible values cannot be pre-defined. Where it can, as in this case, it is generally better to do so.
The last piece of information in each attribute definition specifies how a parser should interpret the absence of the attribute concerned. This can be done by supplying one of the special keywords listed below, or (as in this case) by supplying a specific value which is then regarded as the value for every element which does not supply a value for the attribute concerned. Using the example above, if a poem is simply tagged <poem>, the parser will treat it exactly as if it were tagged <poem status=draft>. Alternatively, one of the following keywords may be used to specify a default value for an attribute:
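The kind of checking a parser can perform on attribute values can be sketched as follows. The sketch makes several assumptions that should be read as such: name tokens are taken to consist of letters, digits, full stops and hyphens, numbers of digits only, and neither length limits nor case folding are modelled. The function name and the way declared values are represented are illustrative.

```python
import re

STATUS = ("draft", "revised", "published")   # the enumerated list from the text

def valid_value(value, declared):
    """declared is a keyword string or a tuple of enumerated values."""
    if isinstance(declared, tuple):
        return value in declared
    if declared == "CDATA":
        return True                           # any character data is accepted
    if declared == "NMTOKEN":
        # Assumed name characters: letters, digits, '.' and '-'.
        return re.fullmatch(r"[A-Za-z0-9.\-]+", value) is not None
    if declared == "NUMBER":
        return re.fullmatch(r"[0-9]+", value) is not None
    raise ValueError(f"unsupported declared value: {declared}")
```

With the enumerated list, status=awful is rejected; declared as NMTOKEN it would be accepted, which is why listing the permitted values gives the parser more to check.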
#REQUIRED
#IMPLIED
#CURRENT
For example, if the attribute definition above were rewritten as
    <!ATTLIST poem
          id      ID                             #IMPLIED
          status  (draft | revised | published)  #CURRENT >
then poems which appear in the anthology simply tagged <poem> would be treated as if they had the same status as the preceding poem. If the keyword were #REQUIRED rather than #CURRENT, the parser would report such poems as erroneously tagged, as it would if any value other than draft, published, or revised were supplied. The use of #CURRENT implies that whatever value is specified for this attribute on the first poem will apply to all subsequent poems, until altered by a new value. Only the status of the first poem need therefore be supplied, if all are the same.
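The different defaulting rules can be compared in a short sketch. It is an illustration, not a parser: each poem is represented by a dictionary of the attributes given in its start-tag, the function name is hypothetical, and treating a missing value on the first poem under #CURRENT as an error is this sketch's reading of the rule described above.

```python
def resolve_status(poems, default):
    """Return the effective status of each poem.

    poems:   list of dicts holding the attributes given in each start-tag.
    default: a literal value, "#REQUIRED", "#IMPLIED" or "#CURRENT".
    """
    resolved, current = [], None
    for number, attrs in enumerate(poems, start=1):
        if "status" in attrs:
            value = current = attrs["status"]
        elif default == "#REQUIRED":
            raise ValueError(f"poem {number}: status must be specified")
        elif default == "#IMPLIED":
            value = None                      # left for the application to decide
        elif default == "#CURRENT":
            if current is None:
                raise ValueError(f"poem {number}: no earlier status to inherit")
            value = current                   # most recently specified value
        else:
            value = default                   # literal default such as draft
        resolved.append(value)
    return resolved
```

With a literal default of draft, an untagged poem following a revised one is still treated as draft; with #CURRENT it is treated as revised.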
It is sometimes necessary to refer to an occurrence of one textual element from within another, an obvious example being phrases such as ``see note 6'' or ``as discussed in chapter 5.'' When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). SGML therefore provides a special mechanism by which any element occurrence may be given a special identifier, a kind of label, which may be used to refer to it from anywhere else within the same text. The cross-reference itself is regarded as an element occurrence of a specific kind, which must also be declared in the DTD. In each case, the identifying label (which may be arbitrary) is supplied as the value of a special attribute.
Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is done by defining an attribute for the <poem> element, as suggested above.
<!ATTLIST poem id ID #IMPLIED >
Here we define an attribute id, the value of which must be of type ID. It is not required that any attribute of type ID have the name id as well; it is however a useful convention almost universally observed. Note that not every poem need carry an id attribute and the parser may safely ignore the lack of one in those which do not. Only poems to which we intend to refer need use this attribute; for each such poem we should now include in its start-tag some unique identifier, for example:
    <poem id=Rose> Text of poem with identifier 'ROSE'
    <poem id=P40> Text of poem with identifier 'P40'
    <poem> This poem has no identifier
Next we need to define a new element for the cross reference itself. This will not have any content---it is only a pointer---but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following declarations:
<!ELEMENT poemref - O EMPTY > <!ATTLIST poemref target IDREF #REQUIRED >
The <poemref> element needs no end-tag because it has no content. It has a single attribute called target. The value of this attribute must be of type IDREF (the keyword used for cross reference pointers of this type) and it must be supplied.
With these declarations in force, we can now encode a reference to the poem with id Rose as follows:
    Blake's poem on the sick rose <poemref target=Rose> ...
When an SGML parser encounters this empty element it will simply check that an element exists with the identifier Rose. Different SGML processors could take any number of additional actions: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to. The purpose of the SGML markup is simply to indicate that a cross reference exists: it does not determine what the processor is to do with it.
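The check described here can be sketched in Python. The sketch assumes unquoted attribute values, looks only at poem and poemref tags, and folds identifiers to upper case before comparing them (an assumption reflecting the way the text treats Rose and ROSE as the same identifier). The function name is illustrative.

```python
import re
from collections import Counter

def check_references(text):
    """Return (duplicate identifiers, targets that match no identifier)."""
    ids = [v.upper() for v in re.findall(r"<poem\b[^>]*?\bid=(\w+)", text)]
    targets = [v.upper()
               for v in re.findall(r"<poemref\b[^>]*?\btarget=(\w+)", text)]
    duplicates = sorted(v for v, n in Counter(ids).items() if n > 1)
    known = set(ids)
    dangling = [t for t in targets if t not in known]
    return duplicates, dangling

SAMPLE = ("<poem id=Rose>...<poem id=P40>...<poem>..."
          "<note>See <poemref target=Rose> and <poemref target=P99></note>")
```

For SAMPLE the reference to Rose resolves, whereas P99 is reported because no poem carries that identifier. What a formatter or hypertext system then does with a resolved reference is outside the scope of the check.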
The declaration
<!ENTITY tei "Text Encoding Initiative">
defines an entity whose name is tei and whose value is the string ``Text Encoding Initiative.''
Similarly, the declaration
<!ENTITY ChapTwo SYSTEM "sgmlmkup.txt">
defines a system entity whose name is ChapTwo and whose value is the text associated with the system identifier --- in this case, the system identifier is the name of an operating system file and the replacement text of the entity is the contents of the file.
Once an entity has been declared, it may be referenced anywhere within a document. This is done by supplying its name prefixed with the ampersand character and followed by the semicolon. The semicolon may be omitted if the entity reference is followed by a space or record end.
When an SGML parser encounters such an entity reference, it immediately substitutes the value declared for the entity name. Thus, the passage ``The work of the &tei has only just begun'' will be interpreted by an SGML processor exactly as if it read ``The work of the Text Encoding Initiative has only just begun''. In the case of a system entity, it is, of course, the contents of the operating system file which are substituted, so that the passage ``The following text has been suppressed: &ChapTwo;'' will be expanded to include the whole of whatever the system finds in the file sgmlmkup.txt.
Strictly speaking, SGML does not require system entities to be files; they can in principle be any data source available to the SGML processor: files, results of database queries, results of calls to system functions --- anything at all. It is simpler, however, when first learning SGML, to think of system entities as referring to files, and this discussion therefore ignores the other possibilities. All existing SGML processors do support the use of system entities to refer to files; fewer support the other possible uses of system entities.
This obviously saves typing, and simplifies the task of maintaining consistency in a set of documents. If the printing of a complex document is to be done at many sites, the document body itself might use an entity reference, such as &site;, wherever the name of the site is required. Different entity declarations could then be added at different sites to supply the appropriate string to be substituted for this name, with no need to change the text of the document itself.
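The substitution mechanism can be imitated with a small Python sketch. The function name expand, the table ENTITIES, and the site names used here are assumptions made for the illustration; a real SGML processor builds its entity table from the declarations in the DTD.

```python
import re

# Hypothetical entity table standing in for a set of <!ENTITY ...> declarations.
ENTITIES = {
    "tei": "Text Encoding Initiative",
    "site": "Oxford",  # assumed site name, chosen only for this example
}

# An entity reference: "&", a name, and an optional ";".
# This toy pattern accepts a missing semicolon wherever the name ends,
# which is looser than the rule SGML itself applies.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")


def expand(text, entities):
    """Replace each reference to a declared entity with its value."""
    def substitute(match):
        name = match.group(1)
        # References to undeclared entities are left exactly as written.
        return entities.get(name, match.group(0))
    return REFERENCE.sub(substitute, text)


print(expand("The work of the &tei has only just begun", ENTITIES))
# The same text, processed with a different declaration at another site:
print(expand("Printed at &site;.", {"site": "Chicago"}))
```

The document text is identical in both calls that use &site;, and only the table passed to expand changes. This corresponds to supplying a different entity declaration at each site.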
This string substitution mechanism has many other applications. It can be used to circumvent the notorious inadequacies of many computer systems for representing the full range of graphic characters needed for the display of modern English (let alone the requirements of other modern scripts or of ancient languages). So-called special characters not directly accessible from the keyboard (or if accessible not correctly translated when transmitted) may be represented by an entity reference.
Suppose, for example, that we wish to encode the use of ligatures in early printed texts. The ligatured form of ct might be distinguished from the non-ligatured form by encoding it as &ctlig; rather than ct. Other special typographic features such as leafstops or rules could equally well be represented by mnemonic entity references in the text. When processing such texts, an entity declaration would be added giving the desired representation for such textual elements. If, for example, ligatured letters are of no interest, we would simply add a declaration such as
<!ENTITY ctlig "ct" >
and the distinction present in the source document would be removed. If, on the other hand, a formatting program capable of representing ligatured characters is to be used, we might replace the entity declaration to give whatever sequence of characters such a program requires as the expansion.
A list of entity declarations is known as an entity set. Standard entity sets are provided for use with most SGML processors, in which the names used will normally be taken from the lists of such names published as an annex to the SGML standard and elsewhere, as mentioned above.
The replacement values given in an entity declaration are, of course, highly system dependent. If the characters to be used in them cannot be typed in directly, SGML provides a mechanism to specify characters by their numeric values, known as character references. A character reference is distinguished from other characters in the replacement string by the fact that it begins with a special symbol, conventionally the sequence &#, and ends with the normal semicolon. For example, if the formatter to be used represents the ligatured form of ct by the characters c and t prefixed by the character with decimal value 102, the entity declaration would read:
<!ENTITY ctlig "&#102;ct" >
Note that character references will generally not make sense if transferred to another hardware or software environment: for this reason, their use is only recommended in situations like this.
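As a rough illustration, the resolution of decimal character references can be sketched in Python. The function name expand_char_refs is hypothetical, and the sketch assumes that numeric values are interpreted as Python interprets them (as Unicode code points, which coincide with ASCII for values below 128). On another system the same number may denote a different character, which is exactly the portability problem noted above.

```python
import re

# A character reference: "&#", decimal digits, and a closing ";".
CHAR_REFERENCE = re.compile(r"&#([0-9]+);")


def expand_char_refs(replacement):
    """Turn each decimal character reference into the character it names."""
    return CHAR_REFERENCE.sub(lambda m: chr(int(m.group(1))), replacement)


# With the assumption stated above, value 102 is the letter f.
print(expand_char_refs("&#102;ct"))
```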
Useful though the entity reference mechanism is for dealing with occasional departures from the expected character set, no one would consider using it to encode extended passages, such as quotations in Greek or Russian in an English text. In such situations, different mechanisms are appropriate. These are discussed elsewhere in these Guidelines (see chapter 4, Characters and Character Sets).
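A special form of entities, parameter entities, may be used within SGML markup declarations; these differ from the entities discussed above (which technically are known as general entities) in two ways: references to them are delimited by a percent sign and a semicolon rather than by an ampersand and a semicolon, and they may be referenced only within markup declarations. A parameter entity declaration places a percent sign, with white space on both sides, between the keyword ENTITY and the entity name:
<!ENTITY % TEI.prose 'INCLUDE'>
<!ENTITY % TEI.extensions.dtd SYSTEM 'mystuff.dtd'>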
The TEI document type definition makes extensive use of parameter entities to control the selection of different tag sets and to make it easier to modify the TEI DTD. Numerous examples of their use may thus be found in chapter 3, Structure of the TEI Document Type Definition.
SGML provides the marked section construct to handle such practical requirements of document production. In general, as the examples above are intended to suggest, it is more obviously useful in the production of new texts than in the encoding of pre-existing texts. Most users of the TEI encoding scheme will never need to use marked sections, and may wish to skip the remainder of this discussion. The TEI DTD makes extensive use of marked sections, however, and this section should be read and understood carefully by anyone wishing to follow in detail the discussions in chapter 3, Structure of the TEI Document Type Definition.
The special processing offered for marked sections in SGML can be of several types, each associated with one of the following keywords:
INCLUDE
IGNORE
CDATA
RCDATA
TEMP
When a marked section occurs in the text, it is preceded by a marked-section start string, which contains one or more keywords from the list above; its end is marked by a marked-section close string. The second and last lines of the following example are the start and close of a marked section to be ignored:
In such cases, the bank will reimburse the customer for all losses.
<![ IGNORE [
Liability is limited to $50,000.
]]>
Of the marked section keywords, the most important for understanding the TEI DTD are INCLUDE and IGNORE; these can be used to include and exclude portions of a document --- or a DTD --- selectively, so as to adjust it to relevant circumstances (e.g. to allow a user to select portions of the DTD relevant to the document in question).
The literal keywords INCLUDE and IGNORE are not, however, much use in adjusting a DTD or a document to a user's requirements. (To change the text above to include the excluded sentence, for example, a user would have to edit the text manually and change IGNORE to INCLUDE. It might be thought just as easy to add and delete the sentence manually.) But the keywords need not be given as literal values; they can be represented by a parameter entity reference. In a document with many sentences which should be included only in Maryland, for example, each such sentence can be included in a marked section whose keyword is represented by a reference to a parameter entity named Maryland. The earlier example would then be:
In such cases, the bank will reimburse the customer for all losses.
<![ %Maryland; [
Liability is limited to $50,000.
]]>
When the entity Maryland is defined as IGNORE, the marked sections so marked will all be excluded. If the definition is changed to the following, the marked sections will be included in the document:
<!ENTITY % Maryland 'INCLUDE'>
When parameter entities are used in this way to control marked sections in a DTD, the external DTD file normally contains a default declaration. A user who wishes to override the default (for example, to include the Maryland sections) need only add an appropriate declaration to the DTD subset.
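The effect of switching a parameter entity between IGNORE and INCLUDE can be sketched in Python. The function name resolve_marked_sections and the pattern are hypothetical, and the sketch is deliberately limited: it handles only the INCLUDE and IGNORE keywords, a single keyword per section, and no nested marked sections.

```python
import re

# A marked section with a literal keyword or a parameter entity reference:
#   <![ KEYWORD [ content ]]>    or    <![ %name; [ content ]]>
MARKED_SECTION = re.compile(
    r"<!\[\s*(%?)([A-Za-z][\w.-]*);?\s*\[(.*?)\]\]>", re.DOTALL)


def resolve_marked_sections(text, parameter_entities):
    """Keep the content of INCLUDE sections and drop IGNORE sections."""
    def substitute(match):
        is_reference, name, content = match.groups()
        # A reference such as %Maryland; is first replaced by its value;
        # an undeclared parameter entity raises KeyError in this sketch.
        keyword = parameter_entities[name] if is_reference else name
        keyword = keyword.upper()
        if keyword == "INCLUDE":
            return content
        if keyword == "IGNORE":
            return ""
        raise ValueError(f"unsupported marked-section keyword: {keyword}")
    return MARKED_SECTION.sub(substitute, text)


text = ("In such cases, the bank will reimburse the customer for all losses.\n"
        "<![ %Maryland; [\nLiability is limited to $50,000.\n]]>")

# The document text never changes; only the declaration does.
print(resolve_marked_sections(text, {"Maryland": "IGNORE"}))
print(resolve_marked_sections(text, {"Maryland": "INCLUDE"}))
```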
The examples of parameter entity declarations at the end of the preceding section can now be better understood. The declarations
<!ENTITY % TEI.prose 'INCLUDE'>
<!ENTITY % TEI.extensions.dtd SYSTEM 'mystuff.dtd'>
have the effect of including in the DTD all the sections marked as relevant to prose, since in the external DTD files such sections are all included in marked sections controlled by the parameter entity TEI.prose. They also override the default declaration of TEI.extensions.dtd (which declares this entity as an empty string), so as to include the file mystuff.dtd in the DTD.
An SGML document consists of an SGML prolog and a document instance. The prolog contains an SGML declaration (described below) and a document type definition, which contains element and entity declarations such as those described above. Different software systems may provide different ways of associating the document instance with the prolog; in some cases, for example, the prolog may be hard-wired into the software used, so that it is completely invisible to the user.
At its simplest the document type definition consists simply of a base document type definition (possibly also one or more concurrent document type definitions) which is prefixed to the document instance. For example:
<!DOCTYPE my.dtd [
  <!-- all declarations for MY.DTD go here -->
  ...
]>
<my.dtd>
This is an instance of a MY.DTD type document
</my.dtd>
More usually, the document type definition will be held in a separate file and invoked by reference, as follows:
<!DOCTYPE tei.2 system "tei2.dtd" [
]>
<tei.2>
This is an instance of an unmodified TEI type document
</tei.2>
Here, the text of the TEI.2 document type definition is not given explicitly, but the SGML processor is told that it may be read from the file with the system identifier given in quotation marks. The square brackets may still be supplied, as in this example, even though they enclose nothing.
The part enclosed by square brackets is known as the document type declaration subset or DTD subset. Its purpose is to specify any modification to be made to the DTD being invoked, thus:
<!DOCTYPE tei.2 SYSTEM "tei2.dtd" [
  <!ENTITY tla "Three Letter Acronym">
  <!ELEMENT my.tag - - (#PCDATA)>
  <!-- any other special-purpose declarations or re-definitions go in here -->
]>
<tei.2>
This is an instance of a modified TEI.2 type document, which may contain <my.tag>my special tags</my.tag> and references to my usual entities such as &tla;.
</tei.2>
In this case, the document type definition in force includes first the contents of the DTD subset, and then the contents of the file specified after the keyword SYSTEM. The order is important, because in SGML only the first declaration of an entity counts. In the above example, therefore, the declaration of the entity tla in the DTD subset would take precedence over any declaration of the same entity in the file tei2.dtd. It is perfectly legal SGML for entities to be declared twice; this is the usual method for allowing user modification of SGML DTDs. (Elements, by contrast, may not be declared more than once; if a declaration for <my.tag> were contained in file tei2.dtd, the SGML parser would signal an error.) Combining and extending the TEI document type definitions is discussed further in chapter 3, Structure of the TEI Document Type Definition.
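The precedence rule can be made concrete with a short Python sketch. The function names declare_entities and declare_elements are hypothetical, and the contents attributed to the external file here are invented for the illustration; they are not taken from tei2.dtd.

```python
def declare_entities(declarations):
    """Build an entity table in which only the first declaration counts."""
    table = {}
    for name, value in declarations:
        # setdefault keeps an existing value, so a later redeclaration
        # of the same name is silently ignored.
        table.setdefault(name, value)
    return table


def declare_elements(names):
    """Collect element names, rejecting any that is declared twice."""
    seen = set()
    for name in names:
        if name in seen:
            raise ValueError(f"element {name} declared more than once")
        seen.add(name)
    return seen


# The DTD subset is read first, then the external file.
dtd_subset = [("tla", "Three Letter Acronym")]
# Hypothetical declarations standing in for the external DTD file.
external_dtd = [("tla", "Three Letter Abbreviation"),
                ("tei", "Text Encoding Initiative")]

table = declare_entities(dtd_subset + external_dtd)
print(table["tla"])  # the value given in the DTD subset
```

Reversing the order of the two lists would let the external file's value win, which is why the subset must be processed first if user modifications are to take effect.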
<!DOCTYPE tei.2 system "tei2.dtd" [
  <!ENTITY chap1 system "chap1.txt">
  <!ENTITY chap2 system "chap2.txt">
  <!ENTITY chap3 "-- not yet written --">
]>
<tei.2>
<teiHeader> ... </teiHeader>
<text>
<front> ... </front>
<body>
&chap1;
&chap2;
&chap3;
...
</body>
</text>
</tei.2>
In this example, the DTD contained in file tei2.dtd has been extended by entity declarations for each chapter of the work. The first two are system entities referring to the files in which the text of particular chapters is to be found; the third is a dummy, indicating that the text does not yet exist (alternatively, an entity with a null value could be used). In the document instance, the entity references &chap1; etc. will be resolved by the parser to give the required contents. The chapter files themselves will not, of course, contain any element, attribute list, or entity declarations---just tagged text.
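How a document might be assembled from such chapter entities is sketched below in Python. The names assemble and read_file are hypothetical, system identifiers are assumed to be plain file names, and the chapter contents are invented stand-ins so that the sketch runs without real chapter files.

```python
import re
from pathlib import Path

REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def read_file(system_identifier):
    # Assumption: the system identifier is simply the name of a file.
    return Path(system_identifier).read_text()


def assemble(instance, internal, system, load=read_file):
    """Expand references to internal (string) and system (file) entities."""
    def substitute(match):
        name = match.group(1)
        if name in internal:
            return internal[name]
        if name in system:
            return load(system[name])
        return match.group(0)  # undeclared: leave the reference as written
    return REFERENCE.sub(substitute, instance)


internal = {"chap3": "-- not yet written --"}
system = {"chap1": "chap1.txt", "chap2": "chap2.txt"}

# Invented file contents, used in place of the file system for this demo.
fake_files = {"chap1.txt": "<div1>Chapter one</div1>",
              "chap2.txt": "<div1>Chapter two</div1>"}

body = assemble("<body>&chap1; &chap2; &chap3;</body>",
                internal, system, load=fake_files.__getitem__)
print(body)
```

Passing the loader as a parameter reflects the point made earlier that a system entity need not be a file: any function that maps a system identifier to text could be supplied instead.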
A structured editor is a kind of intelligent word-processor. It can use information extracted from a processed DTD to prompt the user with information about which elements are required at different points in a document as the document is being created. It can also greatly simplify the task of preparing a document, for example by inserting tags automatically.
A formatter operates on a tagged document instance to produce a printed form of it. Many typographic distinctions, such as the use of particular typefaces or sizes, are intimately related to structural distinctions, and formatters can thus usefully take advantage of descriptive markup. It is also possible to define the tagging structure expected by a formatting program in SGML terms, as a concurrent document structure.
Text-oriented database management systems typically use inverted file indexes to point into documents, or subdivisions of them. A search can be made for an occurrence of some word or word pattern within a document or within a subdivision of one. Meaningful subdivisions of input documents will of course be closely related to the subdivisions specified using descriptive markup. It is thus simple for textual database systems to take advantage of SGML-tagged documents. Much research work is also currently going into ways of extending the capabilities of existing (non-text) database systems to take advantage of the structuring information made explicit by SGML markup.
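As a minimal sketch of this idea, the following Python fragment builds an inverted index whose entries point at subdivisions identified by descriptive markup. The function name build_index, the pattern, and the sample poem texts are all assumptions made for the illustration; a production system would use a real SGML parser and a persistent index.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a <poem> element carrying an id attribute.
POEM = re.compile(r"<poem\s+id=(\w+)>(.*?)</poem>", re.DOTALL | re.IGNORECASE)


def build_index(document):
    """Map each lower-cased word to the ids of the poems containing it."""
    index = defaultdict(set)
    for poem_id, body in POEM.findall(document):
        for word in re.findall(r"[a-z]+", body.lower()):
            index[word].add(poem_id)
    return index


# Invented sample texts, not quotations.
document = """
<poem id=Rose>the sick rose</poem>
<poem id=P40>the rose and the worm</poem>
"""

index = build_index(document)
print(sorted(index["rose"]))  # both poems contain the word
print(sorted(index["worm"]))  # only the second does
```

Because the subdivision is taken directly from the tagging, a search result can name the element in which a word occurs and need not report a bare offset into the file.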
Hypertext systems improve on other methods of handling text by supporting associative links within and across documents. Again, the basic building block needed for such systems is also a basic building block of SGML markup: the ability to identify and to link together individual document elements comes free as a part of the SGML way of doing things. By tagging links explicitly, rather than using proprietary software, developers of hypertexts can be sure that the resources they create will continue to be useful. To load an SGML document into a hypertext system requires only a processor which can correctly interpret SGML tags such as those discussed in chapter 14, Linking, Segmentation, and Alignment.