sgmltrans

TTT: Text Tokenisation Tool

sgmltrans [-h] [-u base-url] [-d doctype] [-r rulefile] [-p] [inputs]

-u base-url

Use this URL as the base URL when resolving relative URLs. The value specified for this argument is passed to SFFopen or a similar stream creation function.

-d doctype

Use the doctype found in this file in preference to anything on the input stream. The file can be any of:

an XML file

an XML file with no body (i.e. just a doctype)

an NSG file

a .ddb file

-h

Print usage information for the program.

-r rulefile

Specifies the name of a file which describes a set of rules for processing the XML input.

-p

If specified, the program will merely print out the rules which are being used, and not process the input.

Description

sgmltrans is a program for translating XML files into some other format (which could be HTML or LaTeX or ...). It is loosely based on COST and other SGML programs, in that one specifies actions to do at SGML start tags, end tags and text content. In sgmltrans, these actions are restricted to printing some text to the output stream.

The sgmltrans rule file consists of an ordered list of rules. A rule consists of an LT XML query, which describes the elements to which the rule will apply, and a pair of format strings, which specify the strings that will be printed when we encounter (a) a start tag for a matching element, and (b) the corresponding end tag.

The format strings are printed as literal strings with the exception of the two special characters $ and \.

The character \ introduces an escape sequence, whose meaning depends on the character that follows it:

\n is replaced by a newline.

\t is replaced by a tab.

\\ is replaced by a single \.

For any other character X, \X is left unchanged as \X.
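The escape handling listed above can be sketched in Python as follows. This is only an illustration of the stated behaviour, not the sgmltrans implementation; the function name expand_escapes is hypothetical, and copying a lone trailing \ through unchanged is an assumption.

```python
def expand_escapes(fmt):
    """Expand \\n, \\t and \\\\; leave any other \\X unchanged as \\X."""
    known = {"n": "\n", "t": "\t", "\\": "\\"}
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "\\" and i + 1 < len(fmt):
            nxt = fmt[i + 1]
            # Known escapes are replaced; unknown ones are copied through as \X.
            out.append(known.get(nxt, "\\" + nxt))
            i += 2
        else:
            # Ordinary characters (and, by assumption, a lone trailing
            # backslash) are copied unchanged.
            out.append(ch)
            i += 1
    return "".join(out)
```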

The format strings may contain special variables denoting the name of the SGML element and the values of attributes. These are $gi and $attributeName, where attributeName is the name of an attribute defined for the element (if the input file is $notsgml; the attribute name should be upper case, because the normalisation process will upper-case the attribute names in the input). These variables will be replaced by either the element name or the value of the attribute for an SGML element which matches the rule. The lines containing format strings must start with a tab.
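As a rough sketch of the layout just described, the following Python reads start/end rules from rule-file text. It assumes details the text above does not confirm: that the query sits on an unindented line of its own, that each format string is a double-quoted string on its own tab-indented line, and that blank lines may be ignored. The name parse_rules is hypothetical, data rules are not handled, and escape sequences are left unexpanded.

```python
def parse_rules(text):
    """Return a list of (query, start_format, end_format) tuples."""
    rules = []
    query = None
    formats = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        if line.startswith("\t"):
            # A format-string line: strip the tab and the surrounding quotes.
            value = line.strip()
            if query is None or len(value) < 2 or value[0] != '"' or value[-1] != '"':
                raise ValueError(f"unexpected format line: {line!r}")
            formats.append(value[1:-1])
            if len(formats) == 2:
                rules.append((query, formats[0], formats[1]))
                query, formats = None, []
        else:
            # An unindented line starts a new rule and holds its query.
            if query is not None:
                raise ValueError(f"rule {query!r} is missing a format string")
            query = line.strip()
    if query is not None:
        raise ValueError(f"rule {query!r} is missing a format string")
    return rules
```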

For example, given the rule:

.*/W
        ""
        "/$TAG\n"

the input file:

          <W TAG="A">The</W>
          <W TAG="B">cat</W>

will be converted into

          The/A
          cat/B
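The conversion above can be reproduced with a small Python sketch of the variable substitution. It is not how sgmltrans itself is implemented. The names substitute and render_words are hypothetical, the pattern used to recognise a variable name and the treatment of a missing attribute are assumptions, and the <S> wrapper element is added only so that the fragment parses as a single document.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a variable is "$" followed by letters, digits or underscores.
VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def substitute(fmt, element):
    """Replace $gi with the element name and $NAME with attribute NAME."""
    def replace(match):
        name = match.group(1)
        if name == "gi":
            return element.tag
        # Assumption: an attribute missing from the element expands to "".
        return element.attrib.get(name, "")
    return VARIABLE.sub(replace, fmt)


def render_words(xml_text, start_fmt, end_fmt):
    """Render every W element as start string, text content, end string."""
    root = ET.fromstring(xml_text)
    pieces = []
    for word in root.iter("W"):
        pieces.append(substitute(start_fmt, word))
        pieces.append(word.text or "")
        pieces.append(substitute(end_fmt, word))
    return "".join(pieces)


# The <S> wrapper is not part of the original example input.
sample = '<S><W TAG="A">The</W><W TAG="B">cat</W></S>'
# "\n" below is a real newline, i.e. the escape is taken as already expanded.
print(render_words(sample, "", "/$TAG\n"), end="")
```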

For each element found in the input file, the rules are tried in their order in the rule file, until one is found whose query matches the element. Once a rule has matched, no more rules are applied to this element.

Every rule file should contain a default rule which matches all elements, which will be used for elements which do not match any earlier rule. The default rule

          .*
               ""
               ""

prints nothing for elements which match it. Since all other rules are tried before the default rule, this is often what is required.
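The ordered, first-match lookup with a catch-all default rule can be sketched in Python as below. As a loud simplification, queries are treated here as Python regular expressions matched against a slash-separated element path such as TEXT/S/W; real LT XML queries are a separate query language and may behave differently. The function name first_matching_rule and the example paths are hypothetical.

```python
import re


def first_matching_rule(rules, path):
    """Return the first (query, start_fmt, end_fmt) whose query matches path."""
    for query, start_fmt, end_fmt in rules:
        # Simplification: the query is used as a regular expression that
        # must match the whole element path.
        if re.fullmatch(query, path):
            return (query, start_fmt, end_fmt)
    # Only reached when the rule list has no default rule that matches.
    return None


rules = [
    (".*/W", "", "/$TAG\\n"),  # specific rule, tried first
    (".*", "", ""),            # default rule, matches everything else
]

print(first_matching_rule(rules, "TEXT/S/W")[0])
print(first_matching_rule(rules, "TEXT/S")[0])
```

Because the default rule is listed last, it is only consulted for elements that no earlier rule has claimed.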

Finally, rules can also be specified to apply particular transformations to the text bodies of elements. A rule query which ends in # matches text content. These rules are called data rules. Instead of a pair of start/end format strings, data rules contain a set of text transformations of the form

        "searchString" --> "replacementString"

Currently the search strings are just literal strings, but hopefully general regular expressions will also be supported in future.

Each transformation is applied globally to the text content before it is printed.
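A minimal Python sketch of global literal replacement is given below. It assumes that the transformations of a data rule are applied one after another in the order listed, which the text above does not state explicitly. The name apply_data_rule and the sample text are hypothetical.

```python
def apply_data_rule(text, transformations):
    """Apply each (search, replacement) pair to every occurrence in text."""
    for search, replacement in transformations:
        # str.replace substitutes all occurrences of the literal search string.
        text = text.replace(search, replacement)
    return text


# A single transformation that rewrites an escaped "<" for LaTeX math mode.
latex_rule = [("&lt;", "$<$")]
print(apply_data_rule("a &lt; b &lt; c", latex_rule))
```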

So for example:

.*/W/#
        "&lt;" --> "$<$"

could be useful if you were trying to produce LaTeX source from an XML file.

sgmltrans is still an experimental program. Thus it is not particularly efficient and its functionality is limited in a number of ways. We intend to improve it on the basis of experience.
