OCR Evaluation for LRDE DBD

From TC11
Revision as of 16:24, 3 July 2013 by Liwicki

Created: 2013-05-30
Last updated: 2013-07-03

Keywords

scanned, magazine, documents, OCR


Description

OCR evaluation: text lines are extracted from the binarization outputs, and OCR (Tesseract) is run on them so that the result can be compared with the OCR ground truth. The evaluation is performed on binarizations of the “clean”, “scanned” and “original” documents.

Purpose of the three document qualities:

  • Original: evaluate the binarization quality on perfect documents mixing text and images.
  • Clean: evaluate the binarization quality on perfect documents with text only.
  • Scanned: evaluate the binarization quality on slightly degraded documents with text only.
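
As an illustration of what comparing OCR output with the ground truth can look like, here is a minimal Python sketch of a character error rate based on edit distance. This page does not state which metric the benchmark tools compute, so the metric and the function names `edit_distance` and `character_error_rate` are assumptions made for this example only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, truth_text):
    """Edit distance divided by the ground-truth length (assumed metric)."""
    if not truth_text:
        # Convention chosen for this sketch: empty truth is an error
        # only if the OCR produced some text.
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, truth_text) / float(len(truth_text))
```

A lower value means the binarized line was easier for the OCR engine to read; averaging it per document quality (original, clean, scanned) would give one score per category.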

Lines for OCR evaluation are also grouped by size: small, medium and large (0 < small <= 30 < medium <= 55 < large < +inf). This grouping shows how robust a binarization algorithm is to objects of different sizes within a single document.
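
The size thresholds can be sketched in Python as follows. The function name `size_group` is hypothetical and not part of the provided tools; the unit of the size value is not stated on this page, so the sketch only assumes a positive number.

```python
SMALL_MAX = 30   # upper bound (inclusive) of the "small" group
MEDIUM_MAX = 55  # upper bound (inclusive) of the "medium" group


def size_group(line_size):
    """Map a line size to 'small', 'medium' or 'large'."""
    if line_size <= 0:
        raise ValueError("line size must be strictly positive")
    if line_size <= SMALL_MAX:
        return "small"
    if line_size <= MEDIUM_MAX:
        return "medium"
    return "large"
```

Both boundaries are inclusive on the lower group, so a size of 30 is small and a size of 55 is medium.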

Evaluation Protocol

Tools are provided to read and process all the data.

A setup script is provided to download and configure the benchmarking environment.

A Python script is provided to launch the benchmark and compute scores.

C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.

Six binarization algorithms are provided together with their C++ sources. They are compiled so that the benchmark can be run on their results.

A setup script is available to download and set up the benchmark system. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.

Minimum requirements: 5 GB of free space, Linux (Ubuntu, Debian, …)

Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically with the setup script on Ubuntu and Debian).
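
On other distributions, where the setup script may not install the dependencies, a quick check for some of the command-line tools could look like the Python sketch below. The executable names in `ASSUMED_TOOLS` are assumptions (the list above gives package names, not commands), the helper `missing_tools` is hypothetical, and `shutil.which` requires Python 3.3 or later, so this check runs separately from the Python 2.7 benchmark script. It does not cover the development libraries.

```python
import shutil

# Assumed executable names; adjust them to your distribution.
ASSUMED_TOOLS = ["tesseract", "git", "automake", "autoconf", "g++"]


def missing_tools(tools=ASSUMED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that `which` cannot locate on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools: " + (", ".join(missing) if missing else "none"))
```

The `which` parameter exists only so the lookup can be replaced in tests; normal use calls `missing_tools()` with no arguments.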


Related Dataset

Related Ground Truth Data

Submitted Files

Version 1.0

