LRDE Document Binarization Dataset (LRDE DBD)

Latest revision as of 01:22, 4 July 2013


Created: 2013-05-30
Last updated: 2013-07-04

Contact Author

Thierry Géraud – thierry.geraud@lrde.epita.fr
EPITA Research and Development Laboratory (LRDE)
14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France

Copyright

LRDE is the copyright holder of all the images included in the dataset except for the original documents subset, which is copyrighted by Le Nouvel Observateur. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

You are allowed to reuse these documents for research purposes, for evaluation and illustration. If you do, please include the following copyright notice: "Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur". You are not allowed to redistribute this dataset.

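If you use this dataset, please also cite the most appropriate paper from this list:

  • Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR
  • The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011. http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR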

This data set is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Current Version

1.0

Keywords

Document binarization, Magazine, Scanned

Description

The dataset is also available at http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD

This dataset is composed of document images extracted from a single issue of a French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.

The provided dataset is composed of 375 full-document images (A4 format, 300 dpi resolution):

  • 125 digital "original documents" extracted from a PDF, with full OCR ground truth.
  • 125 digital "clean documents" created from the "original documents" by removing the images.
  • 125 "scanned documents" based on the "clean documents". They have been printed, scanned and registered to match the "clean documents".
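
As a rough sanity check on these figures, the following sketch computes the total image count and the pixel size one would expect for an A4 page at 300 dpi. It assumes the ISO 216 definition of A4 (210 mm x 297 mm); the subset labels and the function name are illustrative only, and the actual image dimensions in the dataset may differ slightly.

```python
# Assumption: A4 is 210 mm x 297 mm (ISO 216). The real images in the
# dataset may deviate from this by a few pixels.
MM_PER_INCH = 25.4
A4_MM = (210, 297)

# Subset sizes as stated in the description; the keys are illustrative labels.
SUBSETS = {"original": 125, "clean": 125, "scanned": 125}


def a4_pixels(dpi=300):
    """Return the (width, height) in pixels of an A4 page at the given dpi."""
    width_mm, height_mm = A4_MM
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


total_images = sum(SUBSETS.values())

if __name__ == "__main__":
    print("total images:", total_images)
    print("expected A4 size at 300 dpi:", a4_pixels(300))
```

Under these assumptions, each page is roughly 2480 x 3508 pixels, which helps when budgeting memory for a full run over all three subsets.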

Purpose of the three document qualities:

  • Original: evaluate the binarization quality on perfect documents mixing text and images.
  • Clean: evaluate the binarization quality on perfect documents with text only.
  • Scanned: evaluate the binarization quality on slightly degraded documents with text only.

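Ground Truth Data

  • Ground Truth for LRDE DBD text line localization
  • Ground Truth for LRDE DBD binarization
  • Ground Truth for LRDE DBD OCR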

Related Tasks

  • Document Binarization Evaluation for LRDE DBD
  • OCR Evaluation for LRDE DBD

Software

  • A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.
  • A Python script is provided to launch the benchmark and compute scores.
  • C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.
  • Six binarization algorithms (and their respective C++ sources) are provided and compiled so that the benchmark can be run on their results.

Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically with the setup script on Ubuntu and Debian).
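
On systems where the setup script is not used, a quick pre-flight check can show which command-line tools are missing. The sketch below is not part of the provided software. It is written for Python 3, and the executable names are assumptions about what the listed packages install (for example, that tesseract-ocr provides a `tesseract` binary); library packages such as libqt4-dev are not covered.

```python
import shutil

# Assumed executable names for some of the packages listed above.
# These are illustrative and may differ between distributions.
REQUIRED_TOOLS = ["tesseract", "git", "g++", "automake", "autoconf"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH.

    `which` is injectable so the check can be tested without touching
    the real system.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All checked tools were found.")
```

The check only looks for executables on the PATH; it does not verify versions such as Python 2.7 or g++-4.6.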

References

  • G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR

Submitted Files

Version 1.0

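Please refer to http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD to download the files from the original dataset site.

  • Original images (213 Mb): http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_orig-1.0.zip
  • Clean Documents images (67 Mb): http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_clean-1.0.zip
  • Scanned Documents (583 Mb): http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_scanned-1.0.zip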



This page is editable only by TC11 Officers.