LRDE Document Binarization Dataset (LRDE DBD)

Revision as of 17:13, 30 May 2013

Created: 2013-05-30
Last updated: 2013-05-30

Contact Author

Thierry Géraud – thierry.geraud@lrde.epita.fr
EPITA Research and Development Laboratory (LRDE)
14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France

Current Version

1.0

Keywords

Document binarization, Magazine, Scanned

Description

This dataset is composed of document images extracted from a single issue of a French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):

  • 125 digital "original documents" extracted from a PDF, with full OCR ground truth.
  • 125 digital "clean documents" created from the "original documents" by removing the images.
  • 125 "scanned documents" based on the "clean documents". They have been printed, scanned and registered to match the "clean documents".

Purpose of the three document qualities:

  • Original: evaluate the binarization quality on perfect documents mixing text and images.
  • Clean: evaluate the binarization quality on perfect documents with text only.
  • Scanned: evaluate the binarization quality on slightly degraded documents with text only.

Ground Truth Data

Related Tasks

Software

  • A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run the benchmark. The script can also update the dataset when a new version is released.
  • A Python script is provided to launch the benchmark and compute scores.
  • C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.
  • Six binarization algorithms are provided with their C++ sources and compiled, so that the benchmark can be run on their results.

Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).
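
The requirements above can be checked by hand before running the setup script. The following is a minimal sketch, not part of the distributed software. It assumes Python 3 (separate from the Python 2.7 the benchmark itself needs), reads "5GB" as decimal gigabytes, and uses a hypothetical list of executable names (ASSUMED_TOOLS) that some of the listed packages are expected to provide. The helper names missing_tools and has_enough_space are illustrative only.

```python
import shutil

# "5GB of free space" from the minimum requirements, read here as
# decimal gigabytes (assumption).
REQUIRED_FREE_BYTES = 5 * 1000**3

# Hypothetical list: executable names assumed to come from some of the
# packages in the dependency list. Adjust it to your distribution.
ASSUMED_TOOLS = ["tesseract", "git", "g++", "automake", "autoconf"]


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


def has_enough_space(path=".", required=REQUIRED_FREE_BYTES,
                     disk_usage=shutil.disk_usage):
    """Return True if the filesystem holding `path` has at least
    `required` free bytes."""
    return disk_usage(path).free >= required


if __name__ == "__main__":
    absent = missing_tools(ASSUMED_TOOLS)
    print("missing tools:", ", ".join(absent) if absent else "none")
    print("enough free space:", has_enough_space())
```

The lookup and disk-usage functions are passed as parameters so that the checks can be exercised without touching the real system. The check only looks for executables on PATH, so it says nothing about the development libraries in the dependency list.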

References

  • G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition, 2013.

Submitted Files

Version 1.0

