LRDE Document Binarization Dataset (LRDE DBD)
Contact Author
Thierry Géraud – thierry.geraud@lrde.epita.fr, EPITA Research and Development Laboratory (LRDE), 14-16 rue Voltaire, F-94276 Le Kremlin-Bicêtre, France
Current Version
1.0
Keywords
Document binarization, Magazine, Scanned
Description
This dataset is composed of document images extracted from a single issue of a French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.
The provided dataset is composed of 375 full-document images (A4 format, 300 dpi resolution):
- 125 digital "original documents" extracted from a PDF, with full OCR ground truth.
- 125 digital "clean documents" created from the "original documents" by removing the images.
- 125 "scanned documents" based on the "clean documents". They have been printed, scanned and registered to match the "clean documents".
Metadata
Text line localization information has been produced by applying text line localization algorithms. The size category of a text line depends on its x-height and is assigned according to the following rule: 0 < small <= 30 < medium <= 55 < large < +inf
- 123 large text lines localization (clean).
- 320 medium text lines localization (clean).
- 9551 small text lines localization (clean).
- 123 large text lines localization (original).
- 320 medium text lines localization (original).
- 9551 small text lines localization (original).
- 123 large text lines localization (scanned).
- 320 medium text lines localization (scanned).
- 9551 small text lines localization (scanned).
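The size rule stated above can be sketched as a small Python function. This is only an illustration of the thresholds (30 and 55) given in this page; the function name `size_category` is hypothetical, and the unit of the x-height is not stated here, so the value is assumed to be in the same unit as the thresholds.

```python
def size_category(x_height: float) -> str:
    """Map a text line's x-height to 'small', 'medium' or 'large'.

    Implements the rule: 0 < small <= 30 < medium <= 55 < large < +inf.
    The function name and error handling are illustrative assumptions.
    """
    if x_height <= 0:
        raise ValueError("x-height must be strictly positive")
    if x_height <= 30:
        return "small"
    if x_height <= 55:
        return "medium"
    return "large"


if __name__ == "__main__":
    for value in (12, 30, 31, 55, 56):
        print(value, size_category(value))
```

Note that both boundaries are inclusive on the lower category: an x-height of exactly 30 is small, and exactly 55 is medium.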
Ground Truth Data
The following ground truth data is available: binarization and OCR ground truths for the LRDE DBD. Image ground truths have been produced using a semi-automatic process: a global thresholding followed by some manual adjustments.
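As a rough illustration of what a global thresholding step does, the sketch below binarizes a grayscale image with a single threshold applied to every pixel. It is not the procedure used to build the ground truth: the threshold value of 128, the function name `global_threshold`, and the list-of-rows image representation are all assumptions made for the example, and the manual adjustments mentioned above are not modeled.

```python
def global_threshold(image, threshold=128):
    """Binarize a grayscale image with one threshold for all pixels.

    `image` is a list of rows of intensities in the range 0-255.
    Pixels strictly darker than `threshold` become 0 (text),
    all others become 255 (background).
    The default threshold of 128 is an arbitrary, hypothetical choice.
    """
    return [[0 if pixel < threshold else 255 for pixel in row] for row in image]


if __name__ == "__main__":
    sample = [
        [10, 200],
        [128, 127],
    ]
    for row in global_threshold(sample):
        print(row)
```

In practice the same operation is usually applied to an array type from an image-processing library; plain lists are used here only to keep the sketch self-contained.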