Readme for OCR19thSAC
=====================

The subdirectory images contains the raw scans (300 dpi) in the format
YEAR/YEAR-PAGE.SUFFIX, where SUFFIX is normally tif, but unfortunately
sometimes also png or jpg.

The subdirectory texts contains the recognized pages in plain text,
encoded in UTF-8 with Unix line endings (LF). The file format is
YEAR/YEAR-PAGE-LANG.SUFFIX, where SUFFIX is either crowd (the
crowd-corrected version) or ocr (the raw output of the OCR software).
LANG indicates the language(s) identified on each page; see below for
the values it can take. (A small filename-parsing sketch is given at
the end of this section.)

The texts are provided in four versions:

* unfiltered: The complete corpus, without any filtering, aligned
  page-wise with the scan images. Best suited for training an OCR
  engine (using the crowd-corrected ground truth). The LANG part of
  the filename can be: de, fr, en, it, ch-de, rm, or unknown.
  Multilingual pages carry multiple abbreviations, separated by a plus
  sign (+). The paragraphs are *not* aligned across the two snapshots
  (ocr/crowd).

* unfiltered.tsv: The same as the previous version, but with added
  image coordinates. Unlike the other versions, the text is given in
  verticalized form (one token per line), and soft hyphens (U+00AD)
  are retained. Each token is preceded by its coordinates, a quadruple
  indicating, in pixels, the rectangular region that the token
  occupies in the original image. Unfortunately, coordinates are
  missing for some tokens (indicated by "-"). Because of
  dehyphenation, some tokens consist of multiple rectangles, which are
  separated by a pipe character ("|"). If a dehyphenated token
  contains a soft hyphen, mapping the rectangles to the word parts is
  trivial; however, the soft hyphen is not always present. The token
  is separated from its coordinates by a TAB character. Paragraphs are
  separated by a blank line. (A parsing sketch is given after this
  list.)

* page-filtered: The corpus filtered at the page level, suited for
  training a post-correction system. Pages were removed from the
  corpus if they contained large tables or graphics, or more than one
  language. In addition, individual paragraphs were omitted if they
  occurred in only one of the two snapshots (crowd/ocr), i.e.
  paragraphs that were deleted or inserted during the correction
  phase. The paragraphs (separated by a newline) are aligned across
  snapshots (crowd/ocr) by position: e.g. the third paragraph in file
  X.ocr corresponds to the third paragraph in file X.crowd. The LANG
  part of the filename is either de or fr.

* par-filtered: Derived from page-filtered by additional filtering at
  the paragraph level; possibly better suited for training a
  post-correction system than the above. In this version, we filtered
  out paragraphs that showed an unusual change in length:
  specifically, we removed paragraphs that grow by 10% or more
  (measured in characters) when moving from ocr to crowd or vice
  versa. This reduces the number of badly aligned paragraphs, which
  were mainly caused by chunks of text being moved from one paragraph
  to another during the correction process. (A filtering sketch is
  given below.)

Please note that there was no yearbook in the year 1870.
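As an illustration, here is a minimal Python sketch that decomposes a
text filename following the YEAR-PAGE-LANG.SUFFIX scheme described
above. The exact shape of the page number is an assumption (any digit
string is accepted), so adapt the pattern to the actual files:

    import re
    from pathlib import Path

    # Pattern for YEAR-PAGE-LANG.SUFFIX; the page part (\d+) is an
    # assumption about the actual numbering.
    NAME = re.compile(
        r"(?P<year>\d{4})-(?P<page>\d+)-(?P<lang>[a-z+-]+)"
        r"\.(?P<suffix>ocr|crowd)$")

    def parse_name(path):
        m = NAME.search(Path(path).name)
        if m is None:
            return None
        info = m.groupdict()
        # Multilingual pages join several codes with "+", e.g. "de+fr".
        info["langs"] = info.pop("lang").split("+")
        return info

    # parse_name("texts/1886/1886-017-ch-de+fr.crowd") would yield
    # {'year': '1886', 'page': '017', 'suffix': 'crowd',
    #  'langs': ['ch-de', 'fr']}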
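For the unfiltered.tsv version, a reader might look as follows. This
is a sketch based only on the description above; since the delimiter
inside the coordinate quadruple is not restated here, each rectangle
is kept as a raw string:

    def read_tsv_page(path):
        """Read one unfiltered.tsv page as a list of paragraphs, each
        a list of (token, rectangles) pairs; rectangles is None when
        the coordinates are missing."""
        paragraphs, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    # A blank line marks a paragraph boundary.
                    if current:
                        paragraphs.append(current)
                        current = []
                    continue
                # Coordinates come first, separated from the token by
                # a TAB character.
                coords, token = line.split("\t", 1)
                # "-" marks missing coordinates; "|" separates the
                # several rectangles of a dehyphenated token.
                rects = None if coords == "-" else coords.split("|")
                current.append((token, rects))
        if current:
            paragraphs.append(current)
        return paragraphs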
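Finally, a sketch of how the positional alignment in page-filtered can
be consumed, together with the 10% length criterion that derives
par-filtered. That the four versions live in correspondingly named
subdirectories of texts is an assumption:

    def aligned_paragraphs(base):
        """Yield (ocr, crowd) paragraph pairs for one page-filtered
        page; base is the path without the suffix. Paragraphs are one
        per line and aligned by position."""
        with open(base + ".ocr", encoding="utf-8") as f_ocr, \
             open(base + ".crowd", encoding="utf-8") as f_crowd:
            for ocr, crowd in zip(f_ocr, f_crowd):
                yield ocr.rstrip("\n"), crowd.rstrip("\n")

    def keep_pair(ocr, crowd, threshold=0.10):
        """Apply the par-filtered criterion: drop a pair if its
        character length grows by 10% or more in either direction."""
        shorter, longer = sorted((len(ocr), len(crowd)))
        return shorter > 0 and (longer - shorter) / shorter < threshold

    # e.g. (the directory layout is an assumption):
    # pairs = [p for p in
    #          aligned_paragraphs("texts/page-filtered/1886/1886-017-de")
    #          if keep_pair(*p)]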
Please cite our LREC paper if you make use of the resources:

    @inproceedings{clematide-et-al:2016:LREC,
      author    = "Clematide, Simon and Furrer, Lenz and Volk, Martin",
      title     = "Crowdsourcing an {OCR} Gold Standard for a {German}
                   and {French} Heritage Corpus",
      booktitle = "Proceedings of the Tenth International Conference on
                   Language Resources and Evaluation (LREC 2016)",
      location  = "Portoro{\v z}, Slovenia",
      url       = "http://www.lrec-conf.org/proceedings/lrec2016/pdf/917_Paper.pdf",
      year      = "2016"
    }

For more information on the SAC corpus: http://textberg.ch

For licensing, see http://pub.cl.uzh.ch/purl/OCR19thSAC/license.html.

2016/12/12

Contact: simon.clematide@cl.uzh.ch

Simon Clematide
Lenz Furrer
Martin Volk

Institute of Computational Linguistics
Andreasstr. 15
CH-8050 Zürich