Chapter 10. The Program sgdelmarkup

In a number of our pipelines we use the program sgdelmarkup to remove various parts of the mark-up that our processes have added. For example, in the runmuc pipeline, one of the output files should be exactly the same as the input file except for the NUMEX and TIMEX elements that have been added. In this case, all the W and PHR mark-up which was needed during processing must be removed so that only the relevant mark-up remains.

sgdelmarkup uses the LT QUERY query language to specify the mark-up that is to be removed. Thus

   $TTT/bin/sgdelmarkup -q ".*/PHR"

will remove any PHR tags anywhere in the document, while

   $TTT/bin/sgdelmarkup -q ".*/P/PHR"

will remove only those PHR tags which occur as daughters of P elements. To find out about other options for sgdelmarkup, type $TTT/bin/sgdelmarkup -h.
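
The difference between the two queries can be illustrated with a short Python sketch. It is not how sgdelmarkup is implemented and it does not parse LT QUERY; it only mimics the described effect on a small in-memory document, treating "daughters" as direct children. The helper names unwrap and strip_tag, and the sample document, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach `text` immediately before child position `index` of `parent`."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def unwrap(parent, child):
    """Replace `child` by its contents (text and sub-elements) inside `parent`."""
    i = list(parent).index(child)
    kids = list(child)
    parent.remove(child)
    _append_text(parent, i, child.text)
    for offset, kid in enumerate(kids):
        parent.insert(i + offset, kid)
    # The text that followed the removed element must be kept as well.
    _append_text(parent, i + len(kids), child.tail)


def strip_tag(root, tag, parent_tag=None):
    """Remove `tag` mark-up everywhere (like ".*/PHR"), or only where the
    element is a direct child of `parent_tag` (like ".*/P/PHR").
    The root element itself is never removed in this sketch."""
    def visit(node):
        for child in list(node):  # snapshot, since the tree is modified
            visit(child)
            if child.tag == tag and (parent_tag is None or node.tag == parent_tag):
                unwrap(node, child)
    visit(root)
    return root


# Hypothetical sample input, not taken from any pipeline.
DOC = ("<TEXT><P>Pay <PHR><W>ten</W> <W>dollars</W></PHR> now</P>"
       "<S><PHR>later</PHR></S></TEXT>")

everywhere = ET.tostring(strip_tag(ET.fromstring(DOC), "PHR"), encoding="unicode")
under_p = ET.tostring(strip_tag(ET.fromstring(DOC), "PHR", "P"), encoding="unicode")
print(everywhere)
print(under_p)
```

In the first result no PHR tags remain, while in the second the PHR inside S survives because its parent is not a P element.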

By default, an XML element which matches the query is replaced by its contents, but the -p option can be used to construct more complex replacements. For example, the following command

   $TTT/bin/sgdelmarkup -q ".*/W" -p "{#}_{C}" 

removes W element mark-up and appends an underscore and the value of the C attribute to the contents of the element. Thus <W C='DT'>the</W> would become the_DT.
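
The effect of such a pattern can be sketched in Python as follows. This is only an approximation of the behaviour described above, under the assumption that {#} stands for the element's contents and {NAME} for the value of the attribute NAME; the function names render and replace_with_pattern are hypothetical, and any mark-up nested inside a matched element is flattened to plain text, which may differ from what sgdelmarkup does.

```python
import re
import xml.etree.ElementTree as ET


def render(template, elem):
    """Fill a "{#}_{C}"-style template from one element (assumed semantics)."""
    def fill(match):
        key = match.group(1)
        if key == "#":
            # Contents of the element, with nested mark-up flattened to text.
            return "".join(elem.itertext())
        # Any other key is looked up as an attribute; missing ones become "".
        return elem.get(key, "")
    return re.sub(r"\{([^{}]+)\}", fill, template)


def replace_with_pattern(root, tag, template):
    """Replace every `tag` element below `root` by its rendered template."""
    def visit(node):
        for child in list(node):  # snapshot, since the tree is modified
            visit(child)
            if child.tag != tag:
                continue
            i = list(node).index(child)
            text = render(template, child) + (child.tail or "")
            node.remove(child)
            if i == 0:
                node.text = (node.text or "") + text
            else:
                prev = node[i - 1]
                prev.tail = (prev.tail or "") + text
    visit(root)
    return root


# Hypothetical sample sentence with part-of-speech tags in the C attribute.
root = ET.fromstring("<S><W C='DT'>the</W> <W C='NN'>dog</W></S>")
result = ET.tostring(replace_with_pattern(root, "W", "{#}_{C}"), encoding="unicode")
print(result)  # <S>the_DT dog_NN</S>
```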

sgdelmarkup is a useful, general-purpose program for handling XML documents. Several other useful programs from the official LT XML distribution are also included in this release, specifically sgmltrans, sgmlperl and sggrep. We use sgmltrans in most of our pipelines to translate XML to HTML, and sgmlperl and sggrep are used in the citation/bibliography pipelines. The following appendix is taken from the official LT XML documentation for these programs (http://www.ltg.ed.ac.uk/software/xml/).