LRDE Document Binarization Dataset (LRDE DBD)

From TC11

Created: 2010-08-03
Last updated: 2013-05-30

Contact Author

Thierry Géraud – thierry.geraud@lrde.epita.fr
EPITA Research and Development Laboratory (LRDE)
14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France

Current Version

1.0

Keywords

Document binarization, Magazine, Scanned

Description

This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):

  • 125 digital "original documents" extracted from a PDF, with full OCR ground truth.
  • 125 digital "clean documents" created from the "original documents" by removing the images.
  • 125 "scanned documents" based on the "clean documents". They have been printed, scanned and registered to match the "clean documents".

Metadata

Text line localization information was produced by applying text line localization algorithms. Each text line is assigned a size category based on its x-height, according to the following rule: 0 < small <= 30 < medium <= 55 < large < +inf

  • 123 large text lines localization (clean).
  • 320 medium text lines localization (clean).
  • 9551 small text lines localization (clean).
  • 123 large text lines localization (original).
  • 320 medium text lines localization (original).
  • 9551 small text lines localization (original).
  • 123 large text lines localization (scanned).
  • 320 medium text lines localization (scanned).
  • 9551 small text lines localization (scanned).
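
The size rule given above can be sketched as a small function. This is only an illustration: the function name size_category is hypothetical, and the unit of the x-height (presumably pixels at 300 dpi) is an assumption, since the page does not state it.

```python
def size_category(x_height):
    """Map an x-height to a size category using the rule
    0 < small <= 30 < medium <= 55 < large < +inf.

    The unit of x_height is assumed to be pixels; the dataset
    page does not state it explicitly.
    """
    if x_height <= 0:
        raise ValueError("x-height must be strictly positive")
    if x_height <= 30:
        return "small"
    if x_height <= 55:
        return "medium"
    return "large"


if __name__ == "__main__":
    for value in (12, 30, 31, 55, 56):
        print(value, size_category(value))
```

Note that both boundaries are inclusive on the lower category: an x-height of exactly 30 counts as small, and exactly 55 counts as medium.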


Ground Truth Data

The following ground truth data is available: binarization and OCR ground truths for the LRDE DBD. Image ground truths have been produced using a semi-automatic process: a global thresholding followed by some manual adjustments.
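
As a rough illustration of the automatic step, global thresholding applies one threshold to every pixel of a grayscale image. The sketch below is not the authors' tool: the function name global_threshold, the default threshold of 128, and the convention that dark pixels (text) map to 0 and light pixels (background) to 255 are all assumptions, and the manual adjustment step is not modelled.

```python
def global_threshold(image, threshold=128):
    """Binarize a grayscale image with a single global threshold.

    image is a list of rows of gray levels in 0..255.
    Pixels strictly below the threshold become 0 (assumed text),
    all others become 255 (assumed background).
    The threshold value is a hypothetical default, not the one
    used to build the dataset.
    """
    return [
        [0 if pixel < threshold else 255 for pixel in row]
        for row in image
    ]


if __name__ == "__main__":
    sample = [
        [10, 200],
        [128, 127],
    ]
    for row in global_threshold(sample):
        print(row)
```

Because the same threshold is used everywhere, this approach suits the digitally generated "clean documents" far better than unevenly lit scans, which may be one reason manual adjustments were still needed.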