<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://iapr-tc11.org/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Liwicki</id>
	<title>TC11 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://iapr-tc11.org/mediawiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Liwicki"/>
	<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php/Special:Contributions/Liwicki"/>
	<updated>2026-04-21T15:43:37Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.16</generator>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets&amp;diff=2274</id>
		<title>Datasets</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets&amp;diff=2274"/>
		<updated>2016-05-16T09:24:01Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: revision for new site&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Important Notice=&lt;br /&gt;
[[Image:ThisWayToTheDatasets.png|250px|right|link=http://datasets.iapr-tc11.org| Datasets List]]&lt;br /&gt;
&lt;br /&gt;
The datasets are maintained at http://datasets.iapr-tc11.org. The old dataset repository will remain accessible [[Datasets List | here]].&lt;br /&gt;
&lt;br /&gt;
=Overview – Message from TC-11=&lt;br /&gt;
[[Image:ThisWayToTheDatasets_OldRepo.png|200px|right|link=http://datasets.iapr-tc11.org| Datasets List]]&lt;br /&gt;
&lt;br /&gt;
It is extremely important for the Document Image Analysis and Recognition community to be able to cross check and reproduce results described in published papers in the field. In order to achieve this, any datasets used as the basis for publications should be publicly available, as is the norm in many other disciplines.&lt;br /&gt;
&lt;br /&gt;
Authors are actively encouraged to submit the datasets they used to train and / or evaluate their algorithms to TC-11, so that they can be published on the TC-11 Web site.&lt;br /&gt;
&lt;br /&gt;
This initiative is not restricted to datasets. At TC-11 we are interested in archiving online any piece of data (ground-truth data, software, etc.) that makes it easy to reproduce results, set new targets, foster healthy competition, encourage collaboration, and generally advance the DIAR field as a whole.&lt;br /&gt;
&lt;br /&gt;
A wealth of datasets and corresponding ground truth data are already available through the TC-11 [http://datasets.iapr-tc11.org Web portal].&lt;br /&gt;
&lt;br /&gt;
If you wish to contribute, please read below about the procedure to submit material to the TC-11 web-portal. The dataset curators will be notified as soon as the dataset is uploaded.&lt;br /&gt;
For any comments or suggestions, please contact Marcus Liwicki, the dataset curator, at marcus.liwicki (at) unifr.ch.&lt;br /&gt;
&lt;br /&gt;
=Submission Protocol=&lt;br /&gt;
To submit a dataset, please create an account on the TC11 datasets portal (http://datasets.iapr-tc11.org) and follow the online submission instructions. For any problems, please contact the dataset curators.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Copyright Note=&lt;br /&gt;
TC-11 provides dataset hosting services as a benefit to the international research community. If it is determined that copyrighted material is improperly included in a dataset submitted for inclusion on the TC-11 website, we will immediately remove the offending material upon notification by the copyright holder.&lt;br /&gt;
&lt;br /&gt;
By submitting a dataset for inclusion on the TC-11 Web site, the author certifies that he/she has the right to publish the dataset and any associated data in the public domain, and that doing so does not violate the intellectual property rights or copyrights of any third party.&lt;br /&gt;
&lt;br /&gt;
The TC-11 will provide a service through which the submitted dataset and any associated data will be made public to the Document Analysis community worldwide. Should any legal dispute arise in the future in relation to the publication of this dataset and associated data in the public domain, the author will hold TC-11 harmless and accept responsibility for the publication of these data.&lt;br /&gt;
&lt;br /&gt;
By submitting a dataset and associated data to the TC-11, you explicitly accept that any third party can independently submit additional information that relates to the original dataset (e.g. additional ground-truth data, software, etc).&lt;br /&gt;
&lt;br /&gt;
We strongly encourage the authors, where they own the copyrights of the submitted information, to consider offering it to the community under a [http://creativecommons.org/choose/ creative commons license]. See [http://wiki.creativecommons.org/Before_Licensing/ this link] for guidelines about how to choose a proper Creative Commons license.&lt;br /&gt;
&lt;br /&gt;
=Useful Definitions=&lt;br /&gt;
'''Dataset''': A collection of data along with metadata information, as required to use these data.&lt;br /&gt;
&lt;br /&gt;
'''Metadata''': Metadata is information specific to a particular dataset. It is usually tightly structured within the dataset itself (e.g. information encoded within the filenames of submitted images). Metadata can only be submitted at the time of submission of the dataset.&lt;br /&gt;
&lt;br /&gt;
'''Ground Truth Specification''': The definition of the required information that accurately describes a particular aspect of the data at a high level where agreement between different human observers can be established, as well as the definition of an appropriate structure (format) for storing this information.&lt;br /&gt;
&lt;br /&gt;
'''Ground Truth Data''': A set of data conforming to a particular ground truth specification and relating to a specific dataset. Ground Truth Data can be submitted at any time, while different Ground Truth Data (corresponding to different aspects of the data) can be associated with the same dataset.&lt;br /&gt;
&lt;br /&gt;
'''Task''': A well defined process to evaluate algorithms in the context of a specific scientific problem. A task would typically provide a specific evaluation protocol, and link to specific resources as required (a dataset, and usually related ground truth data). Tasks should correspond to open challenges in the field. If you undertake any of the tasks defined and you have published results or code available, we would really like to know!&lt;br /&gt;
&lt;br /&gt;
'''Resources''': Any other type of related resources that are not specifically covered by the above definitions. Examples would include software to browse and visualise a dataset, software to create ground truth data, algorithms to do performance evaluation, codecs, reports, publications, etc.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Systems_that_improve_with_use_(2016)&amp;diff=2273</id>
		<title>DAS-Discussion:Systems that improve with use (2016)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Systems_that_improve_with_use_(2016)&amp;diff=2273"/>
		<updated>2016-05-16T09:14:11Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Report by Ido&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Systems that improve with use ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Ido Kissos, Tel Aviv, Israel - Improving printed Arabic OCR with ML methods.&lt;br /&gt;
Participants:&lt;br /&gt;
* Marc-Peter Schambach, Siemens, Germany: Handwriting recognition, address recognition.&lt;br /&gt;
* Abdel Belaid, University of Lorraine, France. Administrative docs, info extraction, table detection, entity recognition, OCR and evaluation.&lt;br /&gt;
* Nicolas Ragaut, University of Tours, France. Medical image analysis, doc image analysis. Handwriting and OCR.&lt;br /&gt;
* Brian Davis, Utah, USA. Computer-assisted transcription (genealogical docs).&lt;br /&gt;
* Anh Le, TUAT, Japan - Handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
=== Problem Definition ===&lt;br /&gt;
* Data-centric systems break when the usage environment changes&lt;br /&gt;
* Goal: Handling changes over time in data-centric systems&lt;br /&gt;
** Data changes&lt;br /&gt;
** User need or preferences&lt;br /&gt;
** Training data is not representative&lt;br /&gt;
** Problem domain adaptation&lt;br /&gt;
* Method: Exploit usage-data to improve systems&lt;br /&gt;
** Explicit: explicit data labeling in workflow, dedicated configuration modules&lt;br /&gt;
** Implicit: behavior change, continuous evaluation, negative feedback&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
# Algorithmic – online training with the new data&lt;br /&gt;
# Architecture – how to close the loop&lt;br /&gt;
# Labeling and data correctness – interpret all kinds of user behavior&lt;br /&gt;
# Privacy – getting feedback is non-trivial&lt;br /&gt;
# Psychological – we do not want to watch systems get worse with new data&lt;br /&gt;
# Economic – it may have no business case&lt;br /&gt;
&lt;br /&gt;
=== Solutions - Things to keep in mind ===&lt;br /&gt;
* Prior analysis of possible drifts&lt;br /&gt;
* Implication of drifts: know the robustness of your model to new data, the gains of more training data &lt;br /&gt;
* “One Click” training&lt;br /&gt;
* Online evaluation&lt;br /&gt;
** Measure performance over time&lt;br /&gt;
** Update ground truth&lt;br /&gt;
** Boosting&lt;br /&gt;
** Independent evaluation model - having an adversarial classifier to evaluate performance online?&lt;br /&gt;
&lt;br /&gt;
=== Full Version ===&lt;br /&gt;
Until recently, data changes occurred over intervals longer than a system’s life-span, and systems did not rely mainly on vast amounts of diverse data. Data or user-need drifts were handled by new versions of the software, and that was also the software companies’ business model: sell new versions of the same product. Nowadays much software is becoming data-centric, relying on classifier modules as its core capability. In the “data era” these changes happen fast, and the features used for classification are sometimes hidden, so users do not realize their data has drifted until the system loses its reliability. Therefore, systems should adapt!&lt;br /&gt;
&lt;br /&gt;
Why don’t we develop such systems? Because our labs do not update with real data. Drifts happen over long time spans, and nobody looks for them: doing so is expensive and time-consuming. There may also be psychological issues; we do not want to watch performance get worse. In academia we are publication-oriented, not production-oriented.&lt;br /&gt;
&lt;br /&gt;
Do we need architectural support for automatic retraining? If systems get better by adding new data, we must support it.&lt;br /&gt;
A fully automatic system may be difficult; a semi-automatic one may suffice. Users will not be willing to provide a full new labeling as part of their daily use, but one can expect partial or implicit labeling. A system has to learn from its errors so that it can correct them in the future, and modelling errors can be a difficult task. A system also has to know how confident it is in its claims; a one-class classifier may be better at predicting its false-positive errors.&lt;br /&gt;
&lt;br /&gt;
What about online evaluation, i.e. evaluating after the benchmark, once the system is delivered? One could build an online adversarial system that evaluates the main system without ground truth after delivery time. Sometimes performance can be measured explicitly or implicitly from user activity. Feedback on the data need not come from the user; it can also come from the processing chain.&lt;br /&gt;
&lt;br /&gt;
Business problem: if I build a system that adapts to all changes, there is no incentive for the next project. If people pay only once, the economics work against the system; the answer is to move to subscription models rather than one-time purchases.&lt;br /&gt;
&lt;br /&gt;
Big data: from the day of delivery, the system is no longer adapted to your problem, and there is no real “ground truth”.&lt;br /&gt;
&lt;br /&gt;
Privacy: we are not allowed to get data back from the user. Governments are strict about this, and regulation is getting stricter. On the other hand, big corporations have all the data, or a large enough sample of it. Possible remedies: anonymization methods that allow research on top of the data, or easing regulation to make the data market fairer.&lt;br /&gt;
&lt;br /&gt;
Robustness of the training: what will be the future impact of an improvement? One could devise a score that tells how much additional training is needed for how much improvement, i.e. the expected gains. Perhaps such scores should be standardized.&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Systems_that_improve_with_use_(2016)&amp;diff=2272</id>
		<title>DAS-Discussion:Systems that improve with use (2016)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Systems_that_improve_with_use_(2016)&amp;diff=2272"/>
		<updated>2016-05-16T09:13:58Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Report by Ido&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Systems that improve with use ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Ido Kissos, Tel Aviv, Israel - Improving printed Arabic OCR with ML methods.&lt;br /&gt;
Participants:&lt;br /&gt;
* Marc-Peter Schambach, Siemens, Germany: Handwriting recognition, address recognition.&lt;br /&gt;
* Abdel Belaid, University of Lorraine, France. Administrative docs, info extraction, table detection, entity recognition, OCR and evaluation.&lt;br /&gt;
* Nicolas Ragaut, University of Tours, France. Medical image analysis, doc image analysis. Handwriting and OCR.&lt;br /&gt;
* Brian Davis, Utah, USA. Computer-assisted transcription (genealogical docs).&lt;br /&gt;
* Anh Le, TUAT, Japan - Handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
=== Problem Definition ===&lt;br /&gt;
* Data-centric systems break when the usage environment changes&lt;br /&gt;
* Goal: Handling changes over time in data-centric systems&lt;br /&gt;
** Data changes&lt;br /&gt;
** User need or preferences&lt;br /&gt;
** Training data is not representative&lt;br /&gt;
** Problem domain adaptation&lt;br /&gt;
* Method: Exploit usage-data to improve systems&lt;br /&gt;
** Explicit: explicit data labeling in workflow, dedicated configuration modules&lt;br /&gt;
** Implicit: behavior change, continuous evaluation, negative feedback&lt;br /&gt;
&lt;br /&gt;
=== Challenges ===&lt;br /&gt;
# Algorithmic – online training with the new data&lt;br /&gt;
# Architecture – how to close the loop&lt;br /&gt;
# Labeling and data correctness – interpret all kinds of user behavior&lt;br /&gt;
# Privacy – getting feedback is non-trivial&lt;br /&gt;
# Psychological – we do not want to watch systems get worse with new data&lt;br /&gt;
# Economic – it may have no business case&lt;br /&gt;
&lt;br /&gt;
=== Solutions - Things to keep in mind ===&lt;br /&gt;
* Prior analysis of possible drifts&lt;br /&gt;
* Implication of drifts: know the robustness of your model to new data, the gains of more training data &lt;br /&gt;
* “One Click” training&lt;br /&gt;
* Online evaluation&lt;br /&gt;
** Measure performance over time&lt;br /&gt;
** Update ground truth&lt;br /&gt;
** Boosting&lt;br /&gt;
** Independent evaluation model - having an adversarial classifier to evaluate performance online?&lt;br /&gt;
&lt;br /&gt;
=== Full Version ===&lt;br /&gt;
Until recently, data changes occurred over intervals longer than a system’s life-span, and systems did not rely mainly on vast amounts of diverse data. Data or user-need drifts were handled by new versions of the software, and that was also the software companies’ business model: sell new versions of the same product. Nowadays much software is becoming data-centric, relying on classifier modules as its core capability. In the “data era” these changes happen fast, and the features used for classification are sometimes hidden, so users do not realize their data has drifted until the system loses its reliability. Therefore, systems should adapt!&lt;br /&gt;
&lt;br /&gt;
Why don’t we develop such systems? Because our labs do not update with real data. Drifts happen over long time spans, and nobody looks for them: doing so is expensive and time-consuming. There may also be psychological issues; we do not want to watch performance get worse. In academia we are publication-oriented, not production-oriented.&lt;br /&gt;
&lt;br /&gt;
Do we need architectural support for automatic retraining? If systems get better by adding new data, we must support it.&lt;br /&gt;
A fully automatic system may be difficult; a semi-automatic one may suffice. Users will not be willing to provide a full new labeling as part of their daily use, but one can expect partial or implicit labeling. A system has to learn from its errors so that it can correct them in the future, and modelling errors can be a difficult task. A system also has to know how confident it is in its claims; a one-class classifier may be better at predicting its false-positive errors.&lt;br /&gt;
&lt;br /&gt;
What about online evaluation, i.e. evaluating after the benchmark, once the system is delivered? One could build an online adversarial system that evaluates the main system without ground truth after delivery time. Sometimes performance can be measured explicitly or implicitly from user activity. Feedback on the data need not come from the user; it can also come from the processing chain.&lt;br /&gt;
&lt;br /&gt;
Business problem: if I build a system that adapts to all changes, there is no incentive for the next project. If people pay only once, the economics work against the system; the answer is to move to subscription models rather than one-time purchases.&lt;br /&gt;
&lt;br /&gt;
Big data: from the day of delivery, the system is no longer adapted to your problem, and there is no real “ground truth”.&lt;br /&gt;
&lt;br /&gt;
Privacy: we are not allowed to get data back from the user. Governments are strict about this, and regulation is getting stricter. On the other hand, big corporations have all the data, or a large enough sample of it. Possible remedies: anonymization methods that allow research on top of the data, or easing regulation to make the data market fairer.&lt;br /&gt;
&lt;br /&gt;
Robustness of the training: what will be the future impact of an improvement? One could devise a score that tells how much additional training is needed for how much improvement, i.e. the expected gains. Perhaps such scores should be standardized.&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2271</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2271"/>
		<updated>2016-05-16T07:19:05Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish their summaries on this website. They mainly contain lists of (a) new sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;br /&gt;
=== 2014 ===&lt;br /&gt;
* [[DAS-Discussion:DAS 2024]] by Faisal Shafait&lt;br /&gt;
* [[DAS-Discussion:Information Extraction (2014)]] by Nibal Nayef&lt;br /&gt;
=== 2016 ===&lt;br /&gt;
* [[DAS-Discussion:Systems that improve with use (2016)]] by Ido Kissos&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Information_Extraction_(2014)&amp;diff=2054</id>
		<title>DAS-Discussion:Information Extraction (2014)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Information_Extraction_(2014)&amp;diff=2054"/>
		<updated>2015-01-02T13:54:26Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* DAS Working Subgroup Meeting: Information Extraction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Information Extraction ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Nibal Nayef&lt;br /&gt;
Participants:&lt;br /&gt;
* Yoshinori AKAO (Japanese police)&lt;br /&gt;
* Saddok KEBAIRI (Itesoft)&lt;br /&gt;
* Manaba OHTA&lt;br /&gt;
* Xin TAO&lt;br /&gt;
* Ronaldo MESSINA (a2ia)&lt;br /&gt;
* Nibal NAYEF (France)&lt;br /&gt;
* Bao&lt;br /&gt;
&lt;br /&gt;
=== Introduction ===&lt;br /&gt;
We have totally different views of information extraction&lt;br /&gt;
Different tasks:&lt;br /&gt;
* Entity spotting (numbers, words, ….)&lt;br /&gt;
* Graphics spotting (logos, symbols, tables etc.)&lt;br /&gt;
* Semantics after text recognition&lt;br /&gt;
* Logical structure&lt;br /&gt;
&lt;br /&gt;
=== What is a document ??!! ===&lt;br /&gt;
We have many types of documents [and increasing]:&lt;br /&gt;
* Digitally born documents&lt;br /&gt;
* Camera / mobile captured&lt;br /&gt;
* Scanned&lt;br /&gt;
..&lt;br /&gt;
&lt;br /&gt;
To extract any kind of information from any type of document, we need a sort of “prerequisite” module, so that IE modules can work on all document types&lt;br /&gt;
&lt;br /&gt;
=== Problems of IE ===&lt;br /&gt;
* What kind of semantic information should we extract?: Technical terms, ….&lt;br /&gt;
* Define the logical structure of a document&lt;br /&gt;
* Same information in different representations: Same name in different languages&lt;br /&gt;
* What are the ground truth data, size of training data?: Use human voting to build GT&lt;br /&gt;
* Ultimate goal: Automatic and complete understanding of document contents.&lt;br /&gt;
* Application: Enrich Data Mining&lt;br /&gt;
&lt;br /&gt;
=== Approaches ===&lt;br /&gt;
CRF, NLP, and all methods for word/graphic spotting&lt;br /&gt;
&lt;br /&gt;
=== Future Directions ===&lt;br /&gt;
Combine methods from different fields:&lt;br /&gt;
* Image processing&lt;br /&gt;
* Natural language processing&lt;br /&gt;
&lt;br /&gt;
Take into account that documents are drastically changing&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Information_Extraction_(2014)&amp;diff=2053</id>
		<title>DAS-Discussion:Information Extraction (2014)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Information_Extraction_(2014)&amp;diff=2053"/>
		<updated>2015-01-02T13:54:02Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Back to DAS-Discussion:Index {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |  {| |- | {{Last updated}} |}  |}  == DAS Working Subgroup Meeting: Information Extraction == Autho...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Information Extraction ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Nibal Nayef&lt;br /&gt;
Participants:&lt;br /&gt;
* Yoshinori AKAO (Japanese police)&lt;br /&gt;
* Saddok KEBAIRI (Itesoft)&lt;br /&gt;
* Manaba OHTA&lt;br /&gt;
* Xin TAO&lt;br /&gt;
* Ronaldo MESSINA (a2ia)&lt;br /&gt;
* Nibal NAYEF (me !)&lt;br /&gt;
* Bao&lt;br /&gt;
&lt;br /&gt;
=== Introduction ===&lt;br /&gt;
We have totally different views of information extraction&lt;br /&gt;
Different tasks:&lt;br /&gt;
* Entity spotting (numbers, words, ….)&lt;br /&gt;
* Graphics spotting (logos, symbols, tables etc.)&lt;br /&gt;
* Semantics after text recognition&lt;br /&gt;
* Logical structure&lt;br /&gt;
&lt;br /&gt;
=== What is a document ??!! ===&lt;br /&gt;
We have many types of documents [and increasing]:&lt;br /&gt;
* Digitally born documents&lt;br /&gt;
* Camera / mobile captured&lt;br /&gt;
* Scanned&lt;br /&gt;
..&lt;br /&gt;
&lt;br /&gt;
To extract any kind of information from any type of document, we need a sort of “prerequisite” module, so that IE modules can work on all document types&lt;br /&gt;
&lt;br /&gt;
=== Problems of IE ===&lt;br /&gt;
* What kind of semantic information should we extract?: Technical terms, ….&lt;br /&gt;
* Define the logical structure of a document&lt;br /&gt;
* Same information in different representations: Same name in different languages&lt;br /&gt;
* What are the ground truth data, size of training data?: Use human voting to build GT&lt;br /&gt;
* Ultimate goal: Automatic and complete understanding of document contents.&lt;br /&gt;
* Application: Enrich Data Mining&lt;br /&gt;
&lt;br /&gt;
=== Approaches ===&lt;br /&gt;
CRF, NLP, and all methods for word/graphic spotting&lt;br /&gt;
&lt;br /&gt;
=== Future Directions ===&lt;br /&gt;
Combine methods from different fields:&lt;br /&gt;
* Image processing&lt;br /&gt;
* Natural language processing&lt;br /&gt;
&lt;br /&gt;
Take into account that documents are drastically changing&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2052</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2052"/>
		<updated>2015-01-02T13:49:17Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish their summaries on this website. They mainly contain lists of (a) new sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;br /&gt;
=== 2014 ===&lt;br /&gt;
* [[DAS-Discussion:DAS 2024]] by Faisal Shafait&lt;br /&gt;
* [[DAS-Discussion:Information Extraction (2014)]] by Nibal Nayef&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:DAS_2024&amp;diff=2051</id>
		<title>DAS-Discussion:DAS 2024</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:DAS_2024&amp;diff=2051"/>
		<updated>2015-01-02T13:48:20Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: DAS 2024 ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Faisal Shafait&lt;br /&gt;
&lt;br /&gt;
=== Progress in the last 10 years (DAS 2004) ===&lt;br /&gt;
* Open source OCR – gOCR&lt;br /&gt;
* Should we start using cameras?&lt;br /&gt;
* Handwriting recognition – only in the hands of experts&lt;br /&gt;
* Keynote talks&lt;br /&gt;
* Thomas M. Breuel: The future of document imaging in the era of electronic documents [http://www.dsi.unifi.it/DAS04/Breuel-DAS04.pdf]&lt;br /&gt;
* Franco Lotti: Image quality issues in digitization projects of historical documents [http://www.dsi.unifi.it/DAS04/Lotti-DAS04.pdf]&lt;br /&gt;
&lt;br /&gt;
=== Progress in the next 10 years ===&lt;br /&gt;
* Paper will be used but less and less over time&lt;br /&gt;
* Reusable ink&lt;br /&gt;
* Scanners will disappear&lt;br /&gt;
* Mobile based document recognition will mature to a state that it matches the performance of scanned documents under (almost) all capture conditions&lt;br /&gt;
* Majority of “pre-processing” problems will be solved for historical document processing&lt;br /&gt;
* Will there be a DAS 2024? Yes :)&lt;br /&gt;
* What will be the topics we will be addressing?&lt;br /&gt;
&lt;br /&gt;
=== DAS 2014 keynote talks ===&lt;br /&gt;
* An Inside Look into ABBYY OCR Technology&lt;br /&gt;
* From Academia to Industry, the knowledge transfer in Document Analysis&lt;br /&gt;
* Document Evolution drives Document Analysis&lt;br /&gt;
&lt;br /&gt;
=== The Next Big Thing? ===&lt;br /&gt;
* Understanding “Born-Digital” documents&lt;br /&gt;
* Something “Magical” for knowledge extraction from documents – similar to LSTM for handwriting recognition&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2050</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2050"/>
		<updated>2015-01-02T13:47:52Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish the summaries on this website. They mainly contain lists of (a) new, sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;br /&gt;
=== 2014 ===&lt;br /&gt;
* [[DAS-Discussion:DAS 2024]] by Faisal Shafait&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Nibal Nayef&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:DAS_2024&amp;diff=2049</id>
		<title>DAS-Discussion:DAS 2024</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:DAS_2024&amp;diff=2049"/>
		<updated>2015-01-02T13:44:49Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Back to DAS-Discussion:Index {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |  {| |- | {{Last updated}} |}  |}  == DAS Working Subgroup Meeting: Datasets and Benchmarks == Auth...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Datasets and Benchmarks ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Faisal Shafait&lt;br /&gt;
&lt;br /&gt;
=== Progress in the last 10 years (DAS 2004) ===&lt;br /&gt;
* Open source OCR – gOCR&lt;br /&gt;
* Should we start using cameras?&lt;br /&gt;
* Handwriting recognition – only in the hands of experts&lt;br /&gt;
* Keynote talks&lt;br /&gt;
* Thomas M. Breuel: The future of document imaging in the era of electronic documents [http://www.dsi.unifi.it/DAS04/Breuel-DAS04.pdf]&lt;br /&gt;
* Franco Lotti: Image quality issues in digitization projects of historical documents [http://www.dsi.unifi.it/DAS04/Lotti-DAS04.pdf]&lt;br /&gt;
&lt;br /&gt;
=== Progress in the next 10 years ===&lt;br /&gt;
* Paper will still be used, but less and less over time&lt;br /&gt;
* Reusable ink&lt;br /&gt;
* Scanners will disappear&lt;br /&gt;
* Mobile-based document recognition will mature to a state where it matches the performance on scanned documents under (almost) all capture conditions&lt;br /&gt;
* The majority of “pre-processing” problems will be solved for historical document processing&lt;br /&gt;
* Will there be a DAS 2024? Yes :)&lt;br /&gt;
* What will be the topics we will be addressing?&lt;br /&gt;
&lt;br /&gt;
=== DAS 2014 keynote talks ===&lt;br /&gt;
* An Inside Look into ABBYY OCR Technology&lt;br /&gt;
* From Academia to Industry, the knowledge transfer in Document Analysis&lt;br /&gt;
* Document Evolution drives Document Analysis&lt;br /&gt;
&lt;br /&gt;
=== The Next Big Thing? ===&lt;br /&gt;
* Understanding “Born-Digital” documents&lt;br /&gt;
* Something “Magical” for knowledge extraction from documents – similar to LSTM for handwriting recognition&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2048</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2048"/>
		<updated>2015-01-02T13:38:00Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* 2014 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish the summaries on this website. They mainly contain lists of (a) new, sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;br /&gt;
=== 2014 ===&lt;br /&gt;
* [[DAS-Discussion:DAS 2024]] by Faisal Shafait&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2047</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2047"/>
		<updated>2015-01-02T13:37:22Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish the summaries on this website. They mainly contain lists of (a) new, sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;br /&gt;
=== 2014 ===&lt;br /&gt;
* [[DAS-Discussion:DAS 2024]]&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2046</id>
		<title>DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2046"/>
		<updated>2015-01-02T13:33:08Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to [[DAS-Discussion:Index]]&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== DAS Working Subgroup Meeting: Datasets and Benchmarks ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Bart Lamiroy (Secretary) – Université de Lorraine&lt;br /&gt;
&lt;br /&gt;
Further Participants:&lt;br /&gt;
* Elisa Barney Smith – Boise State University&lt;br /&gt;
* Abdel Belaïd – Université de Lorraine&lt;br /&gt;
* John Fletcher – Canon&lt;br /&gt;
* Liangcai Gao – Peking University&lt;br /&gt;
* Albert Gordo – CVC Barcelona&lt;br /&gt;
* Masakazu Iwamura – Osaka Prefecture University&lt;br /&gt;
* Dan Lopresti – Lehigh University&lt;br /&gt;
* Tomohsa Matsushita – Tokyo University of Agriculture and Technology&lt;br /&gt;
* Jean-Yves Ramel – Université de Tours&lt;br /&gt;
* Marc-Peter Schambach – Siemens&lt;br /&gt;
* Ray Smith (Moderator) – Google Inc.&lt;br /&gt;
&lt;br /&gt;
=== Context === &lt;br /&gt;
&lt;br /&gt;
The goal of this discussion group is to address the availability, use and dissemination of benchmarks, datasets and ground truth in order to promote objective and reproducible assessment of document analysis methods, as well as collaboration and exchange of research results in the document analysis domain. The main idea is that “what you measure is what improves”, and that it is currently difficult to obtain reliable measures expressing the global progress of the state-of-the-art. &lt;br /&gt;
&lt;br /&gt;
=== Topic Discussion History ===&lt;br /&gt;
As a brief reminder of the evolution of this topic as discussed during other DAS editions, we refer the interested reader to the TC-11 website. In 2010 the main focus of discussion related to making datasets and other reference material available to the community: how to provide centralized access to it, how to credit and value contributors, and how to maintain a level of control (data curation, availability over time, …) that would ensure that the data and algorithms remain usable and useful for as long a period of time as possible. The reported discussions were essentially concerned with the feasibility of these concepts, rather than their impact, and focused on the TC-11 data collection initiative and the DAE platform (http://dae.cse.lehigh.edu). &lt;br /&gt;
&lt;br /&gt;
=== Discussion Topics ===&lt;br /&gt;
&lt;br /&gt;
During the DAS 2012 edition, the following potential discussion topics were identified after a short brainstorming session, ranked by order of (subjectively) perceived importance:&lt;br /&gt;
1. When is a problem stated? Should CFPs be more specific about which topics to address and how they should (or could) be measured? How does this relate to hosting competitions? Interaction with whole or end-to-end evaluation systems.&lt;br /&gt;
2. What are the fundamental reasons for the perceived difficulties in sharing data sets? (public vs. copyright vs. privacy)&lt;br /&gt;
3. Would it be a good idea to more formally integrate the availability of data sets and reports of benchmarking into the acceptance criteria for publications?&lt;br /&gt;
4. Is there a risk of data sets directing research? Is this good or bad?&lt;br /&gt;
5. Open binaries/open source? &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== When is a Problem Stated?  ===&lt;br /&gt;
This question is considered by the discussion panel members as an essential preliminary to D. Lopresti and G. Nagy's paper “When is a Problem Solved” at ICDAR 2011, and relates to the initially identified issue concerning the difficulty of measuring the overall contribution of individual research results to the improvement of the global state-of-the-art. Stating a problem is related to measuring some level of achievement, and therefore directly correlated with expressing ground truth. One may conjecture that a problem is stated when there is consensus on the ground truth on the one hand, and a data set collection of statistically proven significance on the other. Measurement of advancement toward solving a stated problem would then consist of: &lt;br /&gt;
* track record of results over time, &lt;br /&gt;
* defined best practices by the community,&lt;br /&gt;
&lt;br /&gt;
This means that the evolution of the best practices (and the track record of the results) could give a more precise view of the improvement of the agreed-upon state-of-the-art. It also means that there is a need for the community to comment on and annotate the reference data sets, and that there may be a need to evaluate individual research results within the scope of broader criteria (e.g. contribution to an end-to-end application evaluation). The general consensus of the discussion panel is that there might be an interest in experimenting with a more formal approach to managing tracks in conferences and acceptance criteria for particular events or publications, by clearly stating (at the time of the CFP) the benchmark against which contributions need to be measured. This could consist of: &lt;br /&gt;
* specific problem statements, &lt;br /&gt;
* hosting competitions in direct relation with the track or conference and creating strong incentives for all submissions to compete, &lt;br /&gt;
* ensuring continuity of both data sets, ground truth, and algorithm availability year after year, &lt;br /&gt;
* requiring that reviewers have reasonable access to the data sets and have the means of checking the reported results.&lt;br /&gt;
&lt;br /&gt;
However, it is extremely important to stress that this should never be the sole criterion for acceptance and publication of papers, since there is a significant risk of limiting innovative non-mainstream approaches and the emergence of investigations into new (previously not considered, or considered uninteresting) problems. This is discussed in one of the items developed below.&lt;br /&gt;
&lt;br /&gt;
=== Difficulties in Sharing Data Sets ===&lt;br /&gt;
On this issue, the discussions have identified a number of open issues, without necessarily finding ways of solving them. The issues are: &lt;br /&gt;
* Making data sets available is not a technical issue but a cultural one, related not only to legal issues, but also to the need for acknowledgement by peers and to the ROI with respect to the effort/cost of creating data sets. &lt;br /&gt;
* Although data sets may be of significant interest, and not be limited by intellectual property or copyright, they may be restricted from publication because of privacy issues. In that case, anonymization processes may not necessarily be possible or appropriate, and are always very costly. Approaches that create synthetic data sets may yield solutions in some cases. &lt;br /&gt;
* DMCA protection is probably the most convenient framework for academia to reduce the risk of distributing data sets whose origin cannot be totally guaranteed free of copyright infringement. &lt;br /&gt;
* On the other hand, companies are often reluctant to release data, either because of overly concerned legal departments and zero-risk policies, or because of the significant competitive advantage particular data sets may yield. With respect to the issues mentioned above regarding open and verifiable access to reported results, it may be possible to conceive non-disclosure-bound access to datasets, while still giving reasonable possibilities to verify reported results (e.g. the open access to provenance data – and not the original data – in systems like the DAE platform). &lt;br /&gt;
&lt;br /&gt;
=== Changing Acceptance Criteria ===&lt;br /&gt;
Requiring that results be confronted with a previously agreed-upon benchmark prior to acceptance for publication may prove to be a double-edged sword. The discussions have tried to identify the pros and cons. &lt;br /&gt;
* As already hinted, this would require a shift in the way some events are publicized and organized, since the CFP would necessarily include all the required information (evaluation procedures, data sets, benchmark infrastructure, ...). &lt;br /&gt;
* Imposing stringent benchmarking criteria may not prove a good idea for smaller events, which are confronted with basic economics; it could affect the number of submissions and the acceptance rate too strongly. &lt;br /&gt;
* On the other hand, some mature topics should very strongly impose the use of standard benchmarks. &lt;br /&gt;
* This would also require a shift in the review/acceptance process: &lt;br /&gt;
  * In order to preserve the possibility to publish innovative, non-mainstream new research, evaluation should integrate some level of weighting, setting a cursor between “correctly benchmarked and conforming to criteria” and “out of scope with respect to criteria, but potentially groundbreaking new topic”, for instance. &lt;br /&gt;
  * The possibility to have conditional acceptance and a response phase after review. &lt;br /&gt;
  * The extra load on reviewers, and the requirement that they be able to correctly verify claimed results. &lt;br /&gt;
* It would be interesting to get the broader community's feeling about this. &lt;br /&gt;
&lt;br /&gt;
=== Data Sets Direct Research ===&lt;br /&gt;
Before data sets become commonly accepted and agreed-upon bases for benchmarking, they should undergo some community approval. This raises some potentially controversial issues: &lt;br /&gt;
* Data-driven research evaluation may have a very good impact if the data is good, but may be harmful if the data is bad. &lt;br /&gt;
* Data sets progressively get out of date as knowledge evolves (one might consider the problem underpinning a data set solved when the set is considered obsolete). &lt;br /&gt;
* Special interest groups can try to dominate or influence decisions.&lt;br /&gt;
* Some datasets may not be considered of interest in specific cases of third-party supported research.&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2045</id>
		<title>DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2045"/>
		<updated>2015-01-02T13:25:01Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== DAS Working Subgroup Meeting: Datasets and Benchmarks ==&lt;br /&gt;
Authors:&lt;br /&gt;
* Bart Lamiroy (Secretary) – Université de Lorraine&lt;br /&gt;
&lt;br /&gt;
Further Participants:&lt;br /&gt;
* Elisa Barney Smith – Boise State University&lt;br /&gt;
* Abdel Belaïd – Université de Lorraine&lt;br /&gt;
* John Fletcher – Canon&lt;br /&gt;
* Liangcai Gao – Peking University&lt;br /&gt;
* Albert Gordo – CVC Barcelona&lt;br /&gt;
* Masakazu Iwamura – Osaka Prefecture University&lt;br /&gt;
* Dan Lopresti – Lehigh University&lt;br /&gt;
* Tomohsa Matsushita – Tokyo University of Agriculture and Technology&lt;br /&gt;
* Jean-Yves Ramel – Université de Tours&lt;br /&gt;
* Marc-Peter Schambach – Siemens&lt;br /&gt;
* Ray Smith (Moderator) – Google Inc.&lt;br /&gt;
&lt;br /&gt;
=== Context === &lt;br /&gt;
&lt;br /&gt;
The goal of this discussion group is to address the availability, use and dissemination of benchmarks, datasets and ground truth in order to promote objective and reproducible assessment of document analysis methods, as well as collaboration and exchange of research results in the document analysis domain. The main idea is that “what you measure is what improves”, and that it is currently difficult to obtain reliable measures expressing the global progress of the state-of-the-art. &lt;br /&gt;
&lt;br /&gt;
=== Topic Discussion History ===&lt;br /&gt;
As a brief reminder of the evolution of this topic as discussed during other DAS editions, we refer the interested reader to the TC-11 website. In 2010 the main focus of discussion related to making datasets and other reference material available to the community: how to provide centralized access to it, how to credit and value contributors, and how to maintain a level of control (data curation, availability over time, …) that would ensure that the data and algorithms remain usable and useful for as long a period of time as possible. The reported discussions were essentially concerned with the feasibility of these concepts, rather than their impact, and focused on the TC-11 data collection initiative and the DAE platform (http://dae.cse.lehigh.edu). &lt;br /&gt;
&lt;br /&gt;
=== Discussion Topics ===&lt;br /&gt;
&lt;br /&gt;
During the DAS 2012 edition, the following potential discussion topics were identified after a short brainstorming session, ranked by order of (subjectively) perceived importance:&lt;br /&gt;
1. When is a problem stated? Should CFPs be more specific about which topics to address and how they should (or could) be measured? How does this relate to hosting competitions? Interaction with whole or end-to-end evaluation systems.&lt;br /&gt;
2. What are the fundamental reasons for the perceived difficulties in sharing data sets? (public vs. copyright vs. privacy)&lt;br /&gt;
3. Would it be a good idea to more formally integrate the availability of data sets and reports of benchmarking into the acceptance criteria for publications?&lt;br /&gt;
4. Is there a risk of data sets directing research? Is this good or bad?&lt;br /&gt;
5. Open binaries/open source? &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== When is a Problem Stated?  ===&lt;br /&gt;
This question is considered by the discussion panel members as an essential preliminary to D. Lopresti and G. Nagy's paper “When is a Problem Solved” at ICDAR 2011, and relates to the initially identified issue concerning the difficulty of measuring the overall contribution of individual research results to the improvement of the global state-of-the-art. Stating a problem is related to measuring some level of achievement, and therefore directly correlated with expressing ground truth. One may conjecture that a problem is stated when there is consensus on the ground truth on the one hand, and a data set collection of statistically proven significance on the other. Measurement of advancement toward solving a stated problem would then consist of: &lt;br /&gt;
* track record of results over time, &lt;br /&gt;
* defined best practices by the community,&lt;br /&gt;
&lt;br /&gt;
This means that the evolution of the best practices (and the track record of the results) could give a more precise view of the improvement of the agreed-upon state-of-the-art. It also means that there is a need for the community to comment on and annotate the reference data sets, and that there may be a need to evaluate individual research results within the scope of broader criteria (e.g. contribution to an end-to-end application evaluation). The general consensus of the discussion panel is that there might be an interest in experimenting with a more formal approach to managing tracks in conferences and acceptance criteria for particular events or publications, by clearly stating (at the time of the CFP) the benchmark against which contributions need to be measured. This could consist of: &lt;br /&gt;
* specific problem statements, &lt;br /&gt;
* hosting competitions in direct relation with the track or conference and creating strong incentives for all submissions to compete, &lt;br /&gt;
* ensuring continuity of both data sets, ground truth, and algorithm availability year after year, &lt;br /&gt;
* requiring that reviewers have reasonable access to the data sets and have the means of checking the reported results.&lt;br /&gt;
&lt;br /&gt;
However, it is extremely important to stress that this should never be the sole criterion for acceptance and publication of papers, since there is a significant risk of limiting innovative non-mainstream approaches and the emergence of investigations into new (previously not considered, or considered uninteresting) problems. This is discussed in one of the items developed below.&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2044</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2044"/>
		<updated>2015-01-02T13:10:16Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* 2012 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish the summaries on this website. They mainly contain lists of (a) new, sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Bart Lamiroy&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2043</id>
		<title>DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Datasets,_Benchmarks,_Competition,_and_Continuity_of_Research&amp;diff=2043"/>
		<updated>2015-01-02T13:09:06Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Participants&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Participants&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2042</id>
		<title>DAS-Discussion:Index</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=DAS-Discussion:Index&amp;diff=2042"/>
		<updated>2015-01-02T13:08:37Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* 2012 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
|}&lt;br /&gt;
== DAS Discussion Groups ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is a long-standing tradition to have discussion groups at the DAS workshops.&lt;br /&gt;
These are small working groups discussing topics of special interest to attendees.&lt;br /&gt;
Until 2008, Henry Baird coordinated the discussions. In 2010, Marcus Liwicki took over the organization.&lt;br /&gt;
&lt;br /&gt;
To make the highlights of the discussions available to the community, we publish the summaries on this website. They mainly contain lists of (a) new, sound &amp;amp; reliable methods, and (b) urgent open problems.&lt;br /&gt;
&lt;br /&gt;
== Reports of previous discussions ==&lt;br /&gt;
&lt;br /&gt;
=== 2010 ===&lt;br /&gt;
* [[DAS-Discussion:Information Extraction]] by Partha Pratim Roy and Prateek Sarkar&lt;br /&gt;
* [[DAS-Discussion:Camera-Based DIA]] by Masakazu Iwamura and Elisa H. Barney Smith&lt;br /&gt;
* [[DAS-Discussion:Datasets and Benchmarks]] by Dimos Karatzas and Bart Lamiroy&lt;br /&gt;
=== 2012 ===&lt;br /&gt;
* [[DAS-Discussion:Datasets, Benchmarks, Competition, and Continuity of Research]] by Dimos Karatzas and Bart Lamiroy&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1938</id>
		<title>Datasets List</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1938"/>
		<updated>2013-08-24T21:01:45Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Software and Tools */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
See the datasets [[Datasets per Journal / Conference|sorted according to the Journal / Conference]] they first appeared in.&lt;br /&gt;
&lt;br /&gt;
= Complex Text Containers =&lt;br /&gt;
== Scene Text ==&lt;br /&gt;
* [[MSRA Text Detection 500 Database (MSRA-TD500)]]&lt;br /&gt;
* [[The Street View Text Dataset]]&lt;br /&gt;
* [[The Street View House Numbers (SVHN) Dataset]]&lt;br /&gt;
* [[NEOCR: Natural Environment OCR Dataset]]&lt;br /&gt;
* [[KAIST Scene Text Database]]&lt;br /&gt;
* [[ICDAR 2003 Robust Reading Competitions]]&lt;br /&gt;
* [[ICDAR 2005 Robust Reading Competitions]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Born Digital Images ==&lt;br /&gt;
* [[ICDAR 2011 Robust Reading Competition - Challenge 1: &amp;quot;Reading Text in Born-Digital Images (Web and Email)&amp;quot;]]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Machine-printed Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Table Ground Truth for the UW3 and UNLV datasets]]&lt;br /&gt;
* [[The DocLab Dataset for Evaluating Table Interpretation Methods]]&lt;br /&gt;
* [http://www.digitisation.eu/data/ The IMPACT database] The dataset contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment.&lt;br /&gt;
* [http://dataset.primaresearch.org/ PRImA Layout Analysis Dataset]&lt;br /&gt;
* [http://www.dfki.uni-kl.de/~shafait/downloads.html DFKI Dewarping Contest Dataset (CBDAR 2007)] The dataset, which was used in the CBDAR 2007 Dewarping Contest, contains 102 camera-captured documents with their corresponding ASCII text ground truth. Additionally, text-line-level ground truth was prepared to benchmark curled text-line segmentation algorithms. Part of the dataset (76 out of 102 pages) was also scanned with a flat-bed scanner to create ground-truth images for image-based evaluation of page dewarping algorithms.&lt;br /&gt;
* [http://diuf.unifr.ch/diva/APTI/ APTI: Arabic Printed Text Image Database]&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]] This dataset is composed of document images extracted from a single French magazine, Le Nouvel Observateur, issue 2402, November 18th-24th, 2010. The dataset comprises 375 full-document images (A4 format, 300-dpi resolution).&lt;br /&gt;
* [http://ciir.cs.umass.edu/downloads/ocr-evaluation/ RETAS OCR Evaluation Dataset] The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) was created to evaluate the optical character recognition (OCR) accuracy of real scanned books. The dataset contains real OCR outputs for 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website. The corresponding ground-truth text for each scanned book is obtained from the Project Gutenberg database. The OCR output of each scanned book is aligned with its ground truth at the word and character level, and the alignment output is provided along with estimated OCR accuracies. The dataset is provided for research purposes.&lt;br /&gt;
&lt;br /&gt;
= Graphical Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Chem-Infty Dataset: A ground-truthed dataset of Chemical Structure Images]]&lt;br /&gt;
* [[Braille Dataset - Shiraz University]]&lt;br /&gt;
* [http://www.eurecom.fr/~huet/work.html TradeMarks Image Database] - By way of Benoit Huet, 999 trademark and logo images&lt;br /&gt;
&lt;br /&gt;
= Mixed Content Documents = &lt;br /&gt;
* [http://www.umiacs.umd.edu/~zhugy/Tobacco800.html Tobacco800 Document Image Database] - composed of 1290 document images collected and scanned using a wide variety of equipment over time.&lt;br /&gt;
&lt;br /&gt;
= Handwritten Documents =&lt;br /&gt;
== On-line and Off-line ==&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2009 Signature Verification Competition (SigComp2009)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2010 Signature Verification Competition (4NSigComp2010)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2011 Signature Verification Competition (SigComp2011)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2012 Signature Verification Competition (4NSigComp2012)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html CASIA Online and Offline Chinese Handwriting Databases] - The Chinese handwriting datasets were produced by 1,020 writers using Anoto pens on paper, so that both online and offline data were obtained. The online and the offline datasets each consist of three subsets for isolated characters (DB1.0–1.2, about 3.9 million samples of 7,356 classes) and three for handwritten texts (DB2.0–2.2, about 5,090 pages and 1.35 million characters). The datasets are free for academic research on handwritten document segmentation and retrieval, character and text-line recognition, and writer adaptation and identification.&lt;br /&gt;
&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]] This dataset contains 15 historical and old manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series to provide document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community. It is planned to provide more data and ground-truth information in the future.&lt;br /&gt;
&lt;br /&gt;
== On-line ==&lt;br /&gt;
* [[CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions]]&lt;br /&gt;
&lt;br /&gt;
* [[Devanagari Character Dataset]]&lt;br /&gt;
&lt;br /&gt;
* [[Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HIT-OR3C)]]&lt;br /&gt;
&lt;br /&gt;
* [[IAM Online Document Database (IAMonDo-database)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database IAM On-Line Handwriting Database]&lt;br /&gt;
&lt;br /&gt;
* [http://hwr.nici.kun.nl/unipen/ UNIPEN database] (Click on link 'CDROMs')&lt;br /&gt;
&lt;br /&gt;
* [http://www.tuat.ac.jp/~nakagawa/database/ Nakagawa Lab Online Handwriting Database]&lt;br /&gt;
&amp;lt;!--** Reference: [http://www.springerlink.com/content/kmhh7dg6h8cgr6a5/ Masaki Nakagawa and Kaoru Matsumoto: &amp;quot;Collection of on-line handwritten Japanese character pattern databases and their analysis,&amp;quot; International Journal on Document Analysis and Recognition, Vol. 7 No. 1, pp.69-81 (2004)].--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* [http://www.ai.rug.nl/~lambert/unipen/icdar-03-competition/ The Informal Competition of Recognizing On-line Words (ICROW)] by the Unipen Foundation&lt;br /&gt;
&lt;br /&gt;
== Off-line ==&lt;br /&gt;
* [http://www.rimes-database.fr/wiki/doku.php The Rimes Database] comprises 12,723 handwritten pages corresponding to 5,605 mails of two to three pages each. It was collected by asking volunteers to write a letter given one of nine predefined scenarios related to business/customer relations. The dataset has been used in numerous competitions at ICDAR and ICFHR. It is available for research purposes only, through the authors' Web site.&lt;br /&gt;
&lt;br /&gt;
* [[IBN SINA: A database for research on processing and understanding of Arabic manuscripts images]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.cedar.buffalo.edu/Databases/CDROM1/ CEDAR Off-line Handwriting CDROM1]&lt;br /&gt;
&lt;br /&gt;
* [[CVL-Database]] - An Off-line Database for Writer Retrieval, Writer Identification and  Word Spotting&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-handwriting-database IAM Database] - A full English sentence database for off-line handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/germana The GERMANA Dataset] - GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled &amp;quot;Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón&amp;quot;, written in 1891 by Vicent Salvador. It contains approximately 21K text lines manually marked and transcribed by palaeography experts.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/rodrigo The RODRIGO Dataset] - RODRIGO is the result of digitising and annotating a manuscript dated 1545. Digitisation was done at 300 dpi in color by the Spanish Culture Ministry. The original manuscript is an 853-page bound volume, entitled &amp;quot;Historia de España del arçobispo Don Rodrigo&amp;quot;, completely written in old Castilian (Spanish) by a single author. Annotation exists for text blocks, lines and transcriptions, resulting in approximately 20K lines and 231K running words from a lexicon of 17K words.&lt;br /&gt;
&lt;br /&gt;
* [http://marg.nlm.nih.gov/ MARG - Medical Article Records Groundtruth] - A freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its &amp;quot;ground truth&amp;quot;. Please contact Dr. George Thoma (thoma@lhc.nlm.nih.gov) at the National Library of Medicine for more information.&lt;br /&gt;
&lt;br /&gt;
* [http://kornai.com/Hindi/ Hindi font samples] by Andras Kornai, June 5 2003&lt;br /&gt;
&lt;br /&gt;
= Software and Tools =&lt;br /&gt;
* [http://www.digitisation.eu/tools/ Tools from the IMPACT Centre of Competence] The tools offered by the IMPACT Centre of Competence are software components developed by the different technical IMPACT partners during the IMPACT project (2008-2012). Generally, a &amp;quot;tool&amp;quot; is a piece of software which operates on image or text data, modifying the data or extracting information from it. Every IMPACT tool has a specific functionality related to OCR or to the pre- and post-processing stages. They include new approaches in areas such as image enhancement, segmentation, and document structuring, alongside existing and experimental OCR engines. &lt;br /&gt;
* [http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=53 GEDI: Groundtruthing Environment for Document Images] - A generic annotation tool for scanned text documents.&lt;br /&gt;
* [http://www2.parc.com/isl/groups/pda/pixlabeler/index.html PixLabeler] - a research tool for labeling elements in a document image at a pixel level.&lt;br /&gt;
* [http://code.google.com/p/ocropus/ OCRopus(tm)] - The OCRopus(tm) open source document analysis and OCR system&lt;br /&gt;
* [http://htk.eng.cam.ac.uk/ The Hidden Markov Model Toolkit (HTK)] - a portable toolkit for building and manipulating hidden Markov models&lt;br /&gt;
* [https://github.com/meierue/RNNLIB Bidirectional Long Short-Term Memory Networks] - Implementation of Bidirectional Long Short-Term Memory networks (BLSTM) combined with Connectionist Temporal Classification (CTC), including examples for Arabic recognition.&lt;br /&gt;
* [http://www.speech.sri.com/projects/srilm/ SRILM - The SRI Language Modeling Toolkit] - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.&lt;br /&gt;
* [http://torch5.sourceforge.net/ Torch 5] - a Matlab-like environment for state-of-the-art machine learning algorithms.&lt;br /&gt;
* [http://www.prtools.org/ PRTools] - a Matlab based toolbox for pattern recognition&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- [[Datasets (old page)| Old Page (contains broken links)]] --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1937</id>
		<title>Datasets List</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1937"/>
		<updated>2013-08-24T20:52:21Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Machine-printed Documents */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
See the datasets [[Datasets per Journal / Conference|sorted according to the Journal / Conference]] they first appeared in.&lt;br /&gt;
&lt;br /&gt;
= Complex Text Containers =&lt;br /&gt;
== Scene Text ==&lt;br /&gt;
* [[MSRA Text Detection 500 Database (MSRA-TD500)]]&lt;br /&gt;
* [[The Street View Text Dataset]]&lt;br /&gt;
* [[The Street View House Numbers (SVHN) Dataset]]&lt;br /&gt;
* [[NEOCR: Natural Environment OCR Dataset]]&lt;br /&gt;
* [[KAIST Scene Text Database]]&lt;br /&gt;
* [[ICDAR 2003 Robust Reading Competitions]]&lt;br /&gt;
* [[ICDAR 2005 Robust Reading Competitions]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Born Digital Images ==&lt;br /&gt;
* [[ICDAR 2011 Robust Reading Competition - Challenge 1: &amp;quot;Reading Text in Born-Digital Images (Web and Email)&amp;quot;]]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Machine-printed Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Table Ground Truth for the UW3 and UNLV datasets]]&lt;br /&gt;
* [[The DocLab Dataset for Evaluating Table Interpretation Methods]]&lt;br /&gt;
* [http://www.digitisation.eu/data/ The IMPACT database] The dataset contains more than half a million representative text-based images compiled by a number of major European libraries. Covering texts from as early as 1500, and containing material from newspapers, books, pamphlets and typewritten notes, the dataset is an invaluable resource for future research into imaging technology, OCR and language enrichment.&lt;br /&gt;
* [http://dataset.primaresearch.org/ PRImA Layout Analysis Dataset]&lt;br /&gt;
* [http://www.dfki.uni-kl.de/~shafait/downloads.html DFKI Dewarping Contest Dataset (CBDAR 2007)] The dataset, which was used in the CBDAR 2007 Dewarping Contest, contains 102 camera-captured documents with their corresponding ASCII text ground truth. Additionally, text-line-level ground truth was prepared to benchmark curled text-line segmentation algorithms. Part of the dataset (76 out of 102 pages) was also scanned with a flat-bed scanner to create ground-truth images for image-based evaluation of page dewarping algorithms.&lt;br /&gt;
* [http://diuf.unifr.ch/diva/APTI/ APTI: Arabic Printed Text Image Database]&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]] This dataset is composed of document images extracted from a single French magazine, Le Nouvel Observateur, issue 2402, November 18th-24th, 2010. The dataset comprises 375 full-document images (A4 format, 300-dpi resolution).&lt;br /&gt;
* [http://ciir.cs.umass.edu/downloads/ocr-evaluation/ RETAS OCR Evaluation Dataset] The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) was created to evaluate the optical character recognition (OCR) accuracy of real scanned books. The dataset contains real OCR outputs for 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website. The corresponding ground-truth text for each scanned book is obtained from the Project Gutenberg database. The OCR output of each scanned book is aligned with its ground truth at the word and character level, and the alignment output is provided along with estimated OCR accuracies. The dataset is provided for research purposes.&lt;br /&gt;
&lt;br /&gt;
= Graphical Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Chem-Infty Dataset: A ground-truthed dataset of Chemical Structure Images]]&lt;br /&gt;
* [[Braille Dataset - Shiraz University]]&lt;br /&gt;
* [http://www.eurecom.fr/~huet/work.html TradeMarks Image Database] - By way of Benoit Huet, 999 trademark and logo images&lt;br /&gt;
&lt;br /&gt;
= Mixed Content Documents = &lt;br /&gt;
* [http://www.umiacs.umd.edu/~zhugy/Tobacco800.html Tobacco800 Document Image Database] - composed of 1290 document images collected and scanned using a wide variety of equipment over time.&lt;br /&gt;
&lt;br /&gt;
= Handwritten Documents =&lt;br /&gt;
== On-line and Off-line ==&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2009 Signature Verification Competition (SigComp2009)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2010 Signature Verification Competition (4NSigComp2010)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2011 Signature Verification Competition (SigComp2011)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2012 Signature Verification Competition (4NSigComp2012)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html CASIA Online and Offline Chinese Handwriting Databases] - The Chinese handwriting datasets were produced by 1,020 writers using Anoto pens on paper, so that both online and offline data were obtained. The online and the offline datasets each consist of three subsets for isolated characters (DB1.0–1.2, about 3.9 million samples of 7,356 classes) and three for handwritten texts (DB2.0–2.2, about 5,090 pages and 1.35 million characters). The datasets are free for academic research on handwritten document segmentation and retrieval, character and text-line recognition, and writer adaptation and identification.&lt;br /&gt;
&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]] This dataset contains 15 historical and old manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series to provide document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community. It is planned to provide more data and ground-truth information in the future.&lt;br /&gt;
&lt;br /&gt;
== On-line ==&lt;br /&gt;
* [[CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions]]&lt;br /&gt;
&lt;br /&gt;
* [[Devanagari Character Dataset]]&lt;br /&gt;
&lt;br /&gt;
* [[Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HIT-OR3C)]]&lt;br /&gt;
&lt;br /&gt;
* [[IAM Online Document Database (IAMonDo-database)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database IAM On-Line Handwriting Database]&lt;br /&gt;
&lt;br /&gt;
* [http://hwr.nici.kun.nl/unipen/ UNIPEN database] (Click on link 'CDROMs')&lt;br /&gt;
&lt;br /&gt;
* [http://www.tuat.ac.jp/~nakagawa/database/ Nakagawa Lab Online Handwriting Database]&lt;br /&gt;
&amp;lt;!--** Reference: [http://www.springerlink.com/content/kmhh7dg6h8cgr6a5/ Masaki Nakagawa and Kaoru Matsumoto: &amp;quot;Collection of on-line handwritten Japanese character pattern databases and their analysis,&amp;quot; International Journal on Document Analysis and Recognition, Vol. 7 No. 1, pp.69-81 (2004)].--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* [http://www.ai.rug.nl/~lambert/unipen/icdar-03-competition/ The Informal Competition of Recognizing On-line Words (ICROW)] by the Unipen Foundation&lt;br /&gt;
&lt;br /&gt;
== Off-line ==&lt;br /&gt;
* [http://www.rimes-database.fr/wiki/doku.php The Rimes Database] comprises 12,723 handwritten pages corresponding to 5,605 mails of two to three pages each. It was collected by asking volunteers to write a letter given one of nine predefined scenarios related to business/customer relations. The dataset has been used in numerous competitions at ICDAR and ICFHR. It is available for research purposes only, through the authors' Web site.&lt;br /&gt;
&lt;br /&gt;
* [[IBN SINA: A database for research on processing and understanding of Arabic manuscripts images]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.cedar.buffalo.edu/Databases/CDROM1/ CEDAR Off-line Handwriting CDROM1]&lt;br /&gt;
&lt;br /&gt;
* [[CVL-Database]] - An Off-line Database for Writer Retrieval, Writer Identification and  Word Spotting&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-handwriting-database IAM Database] - A full English sentence database for off-line handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/germana The GERMANA Dataset] - GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled &amp;quot;Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón&amp;quot;, written in 1891 by Vicent Salvador. It contains approximately 21K text lines manually marked and transcribed by palaeography experts.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/rodrigo The RODRIGO Dataset] - RODRIGO is the result of digitising and annotating a manuscript dated 1545. Digitisation was done at 300 dpi in color by the Spanish Culture Ministry. The original manuscript is an 853-page bound volume, entitled &amp;quot;Historia de España del arçobispo Don Rodrigo&amp;quot;, completely written in old Castilian (Spanish) by a single author. Annotation exists for text blocks, lines and transcriptions, resulting in approximately 20K lines and 231K running words from a lexicon of 17K words.&lt;br /&gt;
&lt;br /&gt;
* [http://marg.nlm.nih.gov/ MARG - Medical Article Records Groundtruth] - A freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its &amp;quot;ground truth&amp;quot;. Please contact Dr. George Thoma (thoma@lhc.nlm.nih.gov) at the National Library of Medicine for more information.&lt;br /&gt;
&lt;br /&gt;
* [http://kornai.com/Hindi/ Hindi font samples] by Andras Kornai, June 5 2003&lt;br /&gt;
&lt;br /&gt;
= Software and Tools =&lt;br /&gt;
* [http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=53 GEDI: Groundtruthing Environment for Document Images] - A generic annotation tool for scanned text documents.&lt;br /&gt;
* [http://www2.parc.com/isl/groups/pda/pixlabeler/index.html PixLabeler] - a research tool for labeling elements in a document image at a pixel level.&lt;br /&gt;
* [http://code.google.com/p/ocropus/ OCRopus(tm)] - The OCRopus(tm) open source document analysis and OCR system&lt;br /&gt;
* [http://htk.eng.cam.ac.uk/ The Hidden Markov Model Toolkit (HTK)] - a portable toolkit for building and manipulating hidden Markov models&lt;br /&gt;
* [https://github.com/meierue/RNNLIB Bidirectional Long Short-Term Memory Networks] - Implementation of Bidirectional Long Short-Term Memory networks (BLSTM) combined with Connectionist Temporal Classification (CTC), including examples for Arabic recognition.&lt;br /&gt;
* [http://www.speech.sri.com/projects/srilm/ SRILM - The SRI Language Modeling Toolkit] - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.&lt;br /&gt;
* [http://torch5.sourceforge.net/ Torch 5] - a Matlab-like environment for state-of-the-art machine learning algorithms.&lt;br /&gt;
* [http://www.prtools.org/ PRTools] - a Matlab based toolbox for pattern recognition&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- [[Datasets (old page)| Old Page (contains broken links)]] --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Writer_Identification_and_Word_Spotting_for_the_CVL_Database&amp;diff=1936</id>
		<title>Writer Identification and Word Spotting for the CVL Database</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Writer_Identification_and_Word_Spotting_for_the_CVL_Database&amp;diff=1936"/>
		<updated>2013-08-24T12:01:26Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Description=  T…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
The CVL-database has 311 writers and was designed for writer retrieval and identification. The database consists of 7 different texts for 27 of the writers and 5 texts for the remaining 284 writers (101,069 words in total).&lt;br /&gt;
Additionally, each page is labelled with the coordinates of the bounding box of each word (punctuation is not annotated), encoded in an XML file. Thus, the CVL database can also be used for the evaluation of word-spotting methods. In contrast to the IAM database, the number of pages per writer is distributed more evenly.&lt;br /&gt;
&lt;br /&gt;
=Evaluation Protocol=&lt;br /&gt;
&lt;br /&gt;
Evaluation for writer identification: it is suggested to use the evaluation metrics from the Writer Identification Contest (ICDAR).&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[CVL-Database]]&lt;br /&gt;
&lt;br /&gt;
=Related Ground Truth Data=&lt;br /&gt;
* [[Bounding Boxes, IDs, and Transcription for the CVL Database]]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
Markus Diem, Stefan Fiel, Florian Kleber and Robert Sablatnig, CVL-Database: An Off-line Database for Writer Retrieval, Writer Identification and Word Spotting, In Proc. of the 12th Int. Conference on Document Analysis and Recognition (ICDAR) 2013, forthcoming. &lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Bounding_Boxes,_IDs,_and_Transcription_for_the_CVL_Database&amp;diff=1935</id>
		<title>Bounding Boxes, IDs, and Transcription for the CVL Database</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Bounding_Boxes,_IDs,_and_Transcription_for_the_CVL_Database&amp;diff=1935"/>
		<updated>2013-08-24T11:59:43Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords=  =Des…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
&lt;br /&gt;
The GT-Viewer allows viewing the ground truth. All words of the text are surrounded by a bounding box (punctuation is not considered), which has been automatically calculated and manually checked by two individuals. The information is stored in an XML file; for a description of the XML file, please refer to the referenced paper. An XML parser (C++) is available to read the GT data.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[CVL-Database]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Writer Identification and Word Spotting for the CVL Database]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
The ground truth is in the files of the database. Refer to [[CVL-Database]]&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/CVL/GtViewer.zip GT-Viewer] (7.75 MB)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=CVL-Database&amp;diff=1934</id>
		<title>CVL-Database</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=CVL-Database&amp;diff=1934"/>
		<updated>2013-08-24T11:55:37Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-07-22 |- | {{Last updated}} |}  |}  CVL-Database - A…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-07-22&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
CVL-Database - An Off-line Database for Writer Retrieval, Writer Identification and Word Spotting&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Markus Diem&lt;br /&gt;
 Stefan Fiel&lt;br /&gt;
 Florian Kleber&lt;br /&gt;
 Robert Sablatnig&lt;br /&gt;
 sab@caa.tuwien.ac.at&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
The CVL Database is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License [http://creativecommons.org/licenses/by-nc/3.0/].&lt;br /&gt;
&lt;br /&gt;
This database may be used for non-commercial research purposes only. If you publish material based on this database, we ask that you include a reference to the publication listed below.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Writer Identification, Word Spotting, Cursive Handwriting&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The CVL Database is a public database for writer retrieval, writer identification and word spotting. The database consists of 7 different handwritten texts (1 German and 6 English texts) and 309 different writers. For each text, an RGB color image (300 dpi) comprising the handwritten text and the printed sample text is available, as well as a cropped version (handwritten text only). A unique id identifies the writer, and the bounding boxes for each single word are stored in an XML file.&lt;br /&gt;
&lt;br /&gt;
The CVL-database consists of images with cursively handwritten German and English texts which were chosen from literary works. All pages have a unique writer id and the text number (separated by a dash) at the upper right corner, followed by the printed sample text. The text is placed between two horizontal separators. Beneath the printed text, individuals were asked to write the text using a ruled undersheet to prevent curled text lines. The layout follows the style of the [http://www.iam.unibe.ch/fki/databases/iam-handwriting-database IAM database]. &lt;br /&gt;
&lt;br /&gt;
Samples of the following texts have been used: &lt;br /&gt;
* Edwin A. Abbott - Flatland: A Romance of Many Dimensions (92 words).&lt;br /&gt;
* William Shakespeare - Macbeth (49 words).&lt;br /&gt;
* Wikipedia - Mailüfterl (73 words, under the CC Attribution-ShareAlike License).&lt;br /&gt;
* Charles Darwin - On the Origin of Species (52 words).&lt;br /&gt;
* Johann Wolfgang von Goethe - Faust. Eine Tragödie (50 words).&lt;br /&gt;
* Oscar Wilde - The Picture of Dorian Gray (66 words).&lt;br /&gt;
* Edgar Allan Poe - The Fall of the House of Usher (78 words).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Metadata and Technical Details=&lt;br /&gt;
All pages have a unique writer id and the text number (separated by a dash) at the upper right &lt;br /&gt;
&lt;br /&gt;
corner, followed by the printed sample text. The text is placed between two horizontal &lt;br /&gt;
&lt;br /&gt;
separators. The files are named according the unique writer id and the text number. In addition, &lt;br /&gt;
&lt;br /&gt;
text lines and words are extracted. Their filename convention is the same with the text line &lt;br /&gt;
&lt;br /&gt;
number and word number respectively added at the end. For word images, the GT entry is the last &lt;br /&gt;
&lt;br /&gt;
part of the filename. The Bounding Boxes for each word are stored in an XML file according to &lt;br /&gt;
&lt;br /&gt;
the unique id.&lt;br /&gt;
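As an illustration of the naming convention described above, a word-image filename can be decomposed programmatically. This is only a sketch under stated assumptions: it supposes filenames of the form writer-text-line-word-transcription plus an extension, and the helper name is hypothetical; check the released files for the exact separators.&lt;br /&gt;

```python
# Hypothetical helper illustrating the CVL filename convention described above.
# Assumed form: "writer-text-line-word-transcription.ext" (verify against the
# released files; the exact separators may differ).
import os

def parse_word_filename(filename):
    stem, _ext = os.path.splitext(filename)
    writer, text, line, word, transcription = stem.split("-", 4)
    return {
        "writer_id": writer,
        "text_number": int(text),
        "line_number": int(line),
        "word_number": int(word),
        "transcription": transcription,
    }

info = parse_word_filename("0052-3-7-2-the.tif")
```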
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Bounding Boxes, IDs, and Transcription for the CVL Database]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Writer Identification and Word Spotting for the CVL Database]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
Markus Diem, Stefan Fiel, Florian Kleber and Robert Sablatnig, CVL-Database: An Off-line Database &lt;br /&gt;
&lt;br /&gt;
for Writer Retrieval, Writer Identification and Word Spotting, In Proc. of the 12th Int. &lt;br /&gt;
&lt;br /&gt;
Conference on Document Analysis and Recognition (ICDAR) 2013, forthcoming. &lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
Please refer to [http://caa.tuwien.ac.at/cvl/research/cvl-database/index.html &lt;br /&gt;
&lt;br /&gt;
http://caa.tuwien.ac.at/cvl/research/cvl-database/index.html] for downloading the files from the &lt;br /&gt;
&lt;br /&gt;
original dataset site.&lt;br /&gt;
&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/CVL/cvl-database.zip Original images] (3.92 GB)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/CVL/cvl-database-cropped.zip Cropped images] (1.12 GB)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1933</id>
		<title>Datasets List</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1933"/>
		<updated>2013-08-24T11:50:13Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Off-line */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
See the datasets [[Datasets per Journal / Conference|sorted according to the Journal / Conference]] they first appeared in.&lt;br /&gt;
&lt;br /&gt;
= Complex Text Containers =&lt;br /&gt;
== Scene Text ==&lt;br /&gt;
* [[MSRA Text Detection 500 Database (MSRA-TD500)]]&lt;br /&gt;
* [[The Street View Text Dataset]]&lt;br /&gt;
* [[The Street View House Numbers (SVHN) Dataset]]&lt;br /&gt;
* [[NEOCR: Natural Environment OCR Dataset]]&lt;br /&gt;
* [[KAIST Scene Text Database]]&lt;br /&gt;
* [[ICDAR 2003 Robust Reading Competitions]]&lt;br /&gt;
* [[ICDAR 2005 Robust Reading Competitions]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Born Digital Images ==&lt;br /&gt;
* [[ICDAR 2011 Robust Reading Competition - Challenge 1: &amp;quot;Reading Text in Born-Digital Images (Web and Email)&amp;quot;]]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Machine-printed Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Table Ground Truth for the UW3 and UNLV datasets]]&lt;br /&gt;
* [[The DocLab Dataset for Evaluating Table Interpretation Methods]]&lt;br /&gt;
* [http://dataset.primaresearch.org/ PRImA Layout Analysis Dataset]&lt;br /&gt;
* [http://www.dfki.uni-kl.de/~shafait/downloads.html DFKI Dewarping Contest Dataset (CBDAR 2007)] The dataset, used in the CBDAR 2007 Dewarping Contest, contains 102 camera-captured documents with their corresponding ASCII text ground-truth. Additionally, text-line level ground-truth was prepared to benchmark curled text-line segmentation algorithms. Part of the dataset (76 out of 102 pages) was also scanned with a flat-bed scanner to create ground-truth images for image-based evaluation of page dewarping algorithms.&lt;br /&gt;
* [http://diuf.unifr.ch/diva/APTI/ APTI: Arabic Printed Text Image Database]&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]] This dataset is composed of document images extracted from the same French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010. The provided dataset is composed of 375 Full-Document Images (A4 format, 300-dpi resolution).&lt;br /&gt;
* [http://ciir.cs.umass.edu/downloads/ocr-evaluation/ RETAS OCR Evaluation Dataset] The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) is created to evaluate the optical character recognition (OCR) accuracy of real scanned books. The dataset contains real OCR outputs for 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website. The corresponding ground truth text for each scanned book is obtained from the Project Gutenberg database. The OCR output of each scanned book is aligned with its ground truth at the word and character level and the alignment output is provided along with estimated OCR accuracies. The dataset is provided for research purposes.&lt;br /&gt;
&lt;br /&gt;
= Graphical Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Chem-Infty Dataset: A ground-truthed dataset of Chemical Structure Images]]&lt;br /&gt;
* [[Braille Dataset - Shiraz University]]&lt;br /&gt;
* [http://www.eurecom.fr/~huet/work.html TradeMarks Image Database] - By way of Benoit Huet, 999 trademark and logo images&lt;br /&gt;
&lt;br /&gt;
= Mixed Content Documents = &lt;br /&gt;
* [http://www.umiacs.umd.edu/~zhugy/Tobacco800.html Tobacco800 Document Image Database] - composed of 1290 document images collected and scanned using a wide variety of equipment over time.&lt;br /&gt;
&lt;br /&gt;
= Handwritten Documents =&lt;br /&gt;
== On-line and Off-line ==&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2009 Signature Verification Competition (SigComp2009)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2010 Signature Verification Competition (4NSigComp2010)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2011 Signature Verification Competition (SigComp2011)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2012 Signature Verification Competition (4NSigComp2012)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html CASIA Online and Offline Chinese Handwriting Databases] - The Chinese handwriting datasets were produced by 1,020 writers using an Anoto pen on paper, such that both online and offline data were obtained. Both the online and the offline datasets consist of three subsets for isolated characters (DB1.0–1.2, about 3.9 million samples of 7,356 classes) and three for handwritten texts (DB2.0–2.2, about 5,090 pages and 1.35 million characters). The datasets are free for academic research on handwritten document segmentation and retrieval, character and text line recognition, and writer adaptation and identification.&lt;br /&gt;
&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]] This dataset contains 15 historical and old manuscript images collected from the historical records at the Documents and old manuscripts treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation including bleed-through, faded ink, and blur. The dataset is the first in a series to provide document images and their ground truth as a contribution to the Document Image Analysis and Recognition (DIAR) community. It is planned to provide more data and ground-truth information in the future.&lt;br /&gt;
&lt;br /&gt;
== On-line ==&lt;br /&gt;
* [[CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions]]&lt;br /&gt;
&lt;br /&gt;
* [[Devanagari Character Dataset]]&lt;br /&gt;
&lt;br /&gt;
* [[Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HIT-OR3C)]]&lt;br /&gt;
&lt;br /&gt;
* [[IAM Online Document Database (IAMonDo-database)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database IAM On-Line Handwriting Database]&lt;br /&gt;
&lt;br /&gt;
* [http://hwr.nici.kun.nl/unipen/ UNIPEN database] (Click on link 'CDROMs')&lt;br /&gt;
&lt;br /&gt;
* [http://www.tuat.ac.jp/~nakagawa/database/ Nakagawa Lab Online Handwriting Database]&lt;br /&gt;
&amp;lt;!--** Reference: [http://www.springerlink.com/content/kmhh7dg6h8cgr6a5/ Masaki Nakagawa and Kaoru Matsumoto: &amp;quot;Collection of on-line handwritten Japanese character pattern databases and their analysis,&amp;quot; International Journal on Document Analysis and Recognition, Vol. 7 No. 1, pp.69-81 (2004)].--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* [http://www.ai.rug.nl/~lambert/unipen/icdar-03-competition/ The Informal Competition of Recognizing On-line Words (ICROW)] by the Unipen Foundation&lt;br /&gt;
&lt;br /&gt;
== Off-line ==&lt;br /&gt;
* [http://www.rimes-database.fr/wiki/doku.php The Rimes Database] comprises 12,723 handwritten pages corresponding to 5605 mails of two to three pages. It was collected by asking volunteers to write a letter given one of nine predefined scenarios related to business/customer relations. The dataset has been used in numerous competitions in ICDAR and ICFHR. It is available for research purposes only, through the Web site of the authors.&lt;br /&gt;
&lt;br /&gt;
* [[IBN SINA: A database for research on processing and understanding of Arabic manuscripts images]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.cedar.buffalo.edu/Databases/CDROM1/ CEDAR Off-line Handwriting CDROM1]&lt;br /&gt;
&lt;br /&gt;
* [[CVL-Database]] - An Off-line Database for Writer Retrieval, Writer Identification and  Word Spotting&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-handwriting-database IAM Database] - A full English sentence database for off-line handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/germana The GERMANA Dataset] - GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled “Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón&amp;quot;, written in 1891 by Vicent Salvador. It contains approximately 21K text lines manually marked and transcribed by palaeography experts.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/rodrigo The RODRIGO Dataset] - RODRIGO is the result of digitising and annotating a manuscript dated 1545. Digitisation was done at 300dpi in color by the Spanish Culture Ministry. The original manuscript is a 853-page bound volume, entitled &amp;quot;Historia de España del arçobispo Don Rodrigo&amp;quot;, completely written in old Castilian (Spanish) by a single author. Annotation exists for text blocks, lines and transcriptions, resulting in approximately 20K lines and 231K running words from a lexicon of 17K words.&lt;br /&gt;
&lt;br /&gt;
* [http://marg.nlm.nih.gov/ MARG- Medical Article Records Groundtruth] - A freely-available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its &amp;quot;ground truth&amp;quot;. Please contact Dr. George Thoma (thoma@lhc.nlm.nih.gov) at the National Library of Medicine for more information.&lt;br /&gt;
&lt;br /&gt;
* [http://kornai.com/Hindi/ Hindi font samples] by Andras Kornai, June 5 2003&lt;br /&gt;
&lt;br /&gt;
= Software and Tools =&lt;br /&gt;
* [http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=53 GEDI: Groundtruthing Environment for Document Images] - A generic annotation tool for scanned text documents.&lt;br /&gt;
* [http://www2.parc.com/isl/groups/pda/pixlabeler/index.html PixLabeler] - a research tool for labeling elements in a document image at a pixel level.&lt;br /&gt;
* [http://code.google.com/p/ocropus/ OCRopus(tm)] - The OCRopus(tm) open source document analysis and OCR system&lt;br /&gt;
* [http://htk.eng.cam.ac.uk/ The Hidden Markov Model Toolkit (HTK)] - a portable toolkit for building and manipulating hidden Markov models&lt;br /&gt;
* [https://github.com/meierue/RNNLIB Bidirectional Long-Short Term Memory Networks] - Implementation of Bidirectional Long-Short Term Memory Networks (BLSTM) combined with Connectionist Temporal Classification (CTC) - including examples for Arabic recognition.&lt;br /&gt;
* [http://www.speech.sri.com/projects/srilm/ SRILM - The SRI Language Modeling Toolkit] - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.&lt;br /&gt;
* [http://torch5.sourceforge.net/ Torch 5] - a Matlab-like environment for state-of-the-art machine learning algorithms.&lt;br /&gt;
* [http://www.prtools.org/ PRTools] - a Matlab based toolbox for pattern recognition&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- [[Datasets (old page)| Old Page (contains broken links)]] --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1913</id>
		<title>Ground Truth for LRDE DBD binarization</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1913"/>
		<updated>2013-07-03T16:28:11Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Version 1.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, binarization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
125 binarized images for &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Image groundtruths have been produced using a semi-automatic process: a global thresholding followed by some manual adjustments. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_bin_gt-1.0.zip Binarization groundtruth] (0 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_OCR&amp;diff=1912</id>
		<title>Ground Truth for LRDE DBD OCR</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_OCR&amp;diff=1912"/>
		<updated>2013-07-03T16:27:49Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords= scann…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, binarization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
125 binarized images for &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Image groundtruths have been produced using a semi-automatic process: a global thresholding followed by some manual adjustments. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_bin_gt-1.0.zip Binarization groundtruth] (0 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=OCR_Evaluation_for_LRDE_DBD&amp;diff=1911</id>
		<title>OCR Evaluation for LRDE DBD</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=OCR_Evaluation_for_LRDE_DBD&amp;diff=1911"/>
		<updated>2013-07-03T16:24:54Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords= scann…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, OCR&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
OCR evaluation: lines are extracted from the binarization outputs and OCR (Tesseract) is run in order to compare against the OCR ground-truth. It is performed on the binarization outputs of “clean”, “scanned” and “original” documents.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
Lines for OCR evaluation are also grouped by size: small, medium and large (0 &amp;lt; small &amp;lt;= 30 &amp;lt; medium &amp;lt;= 55 &amp;lt; large). This shows how robust a binarization algorithm is to objects of different sizes in a single document.&lt;br /&gt;
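The grouping rule above can be sketched as follows; this is only an illustration, assuming line size is a height in pixels (the unit is not stated in the protocol) and using a function name of our own choosing.&lt;br /&gt;

```python
def line_size_group(height):
    # Thresholds from the protocol: small up to 30, medium up to 55, large above.
    if height > 55:
        return "large"
    if height > 30:
        return "medium"
    return "small"
```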
&lt;br /&gt;
=Evaluation Protocol=&lt;br /&gt;
&lt;br /&gt;
Tools are provided to read and process all the data.&lt;br /&gt;
 &lt;br /&gt;
A setup script is provided to download and configure the benchmarking environment.&lt;br /&gt;
&lt;br /&gt;
A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
&lt;br /&gt;
C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
&lt;br /&gt;
6 binarization algorithms (and their respective C++ sources) are provided and compiled to run this benchmark on their results.&lt;br /&gt;
&lt;br /&gt;
A setup script is available to download and set up the benchmark system. This is the recommended way to run this benchmark. Note that this script also includes features to update the dataset if a new version is released.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically with the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/lrde-dbd-tools-1.0.zip Tools for processing] (0.08 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1910</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1910"/>
		<updated>2013-07-03T16:22:10Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire, F-94276 Le Kremlin-Bicêtre, France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes, for evaluation and illustration. If so, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from the same French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 Full-Document Images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 numerical &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR groundtruth.&lt;br /&gt;
* 125 numerical &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; where images have been removed.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;. They have been printed, scanned and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
* [[OCR Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script also includes features to update the dataset if a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* 6 binarization algorithms (and their respective C++ sources) are provided and compiled to run this benchmark on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically with the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
Please refer to [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD] for downloading the files from the original dataset site.&lt;br /&gt;
&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_orig-1.0.zip Original images] (213 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_clean-1.0.zip Clean Documents images] (67 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_scanned-1.0.zip Scanned Documents] (583 Mb)&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Document_Binarization_Evaluation_for_LRDE_DBD&amp;diff=1909</id>
		<title>Document Binarization Evaluation for LRDE DBD</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Document_Binarization_Evaluation_for_LRDE_DBD&amp;diff=1909"/>
		<updated>2013-07-03T16:21:45Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords= scann…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, binarization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
Pixel-wise evaluation: binarization outputs are compared to the binarization ground-truth, pixel by pixel. It is performed on the binarization outputs of “clean documents”.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Evaluation Protocol=&lt;br /&gt;
Binarization algorithms should handle the PNG format for input and output images.&lt;br /&gt;
&lt;br /&gt;
Binarization outputs must set the background to False and objects to True.&lt;br /&gt;
&lt;br /&gt;
Tools are provided to read and process all the data.&lt;br /&gt;
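A minimal sketch of such a pixel-wise comparison, assuming boolean images that follow the convention above (True = object, False = background). The provided C++ tools are authoritative; this illustrative helper and its name are ours.&lt;br /&gt;

```python
# Illustrative pixel-wise scoring of a binarization output against ground truth.
# Both arguments are boolean NumPy arrays: True = object pixel, False = background,
# per the convention stated in the protocol above.
import numpy as np

def pixel_scores(output, gt):
    tp = np.logical_and(output, gt).sum()                  # object in both
    fp = np.logical_and(output, np.logical_not(gt)).sum()  # spurious object
    fn = np.logical_and(np.logical_not(output), gt).sum()  # missed object
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```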
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/lrde-dbd-tools-1.0.zip Tools for processing] (0.08 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1908</id>
		<title>Ground Truth for LRDE DBD binarization</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1908"/>
		<updated>2013-07-03T16:15:58Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, binarization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
125 binarized images for &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Image ground truths were produced using a semi-automatic process: global thresholding followed by manual adjustments.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_bin_gt-1.0.zip Binarization groundtruth] (21 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1907</id>
		<title>Ground Truth for LRDE DBD binarization</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_binarization&amp;diff=1907"/>
		<updated>2013-07-03T16:15:10Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords= scann…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, binarization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
125 binarized images for &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Image ground truths were produced using a semi-automatic process: global thresholding followed by manual adjustments.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* none&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_bin_gt-1.0.zip Binarization groundtruth] (21 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_text_line_localization&amp;diff=1906</id>
		<title>Ground Truth for LRDE DBD text line localization</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_text_line_localization&amp;diff=1906"/>
		<updated>2013-07-03T16:08:20Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, text line localization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
Text line localization information has been produced by applying text line localization algorithms. The size category of a text line depends on its x-height, following the rule: 0 &amp;lt; small &amp;lt;= 30 &amp;lt; medium &amp;lt;= 55 &amp;lt; large &amp;lt; +inf&lt;br /&gt;
&lt;br /&gt;
* 123 large text line localizations (clean).&lt;br /&gt;
* 320 medium text line localizations (clean).&lt;br /&gt;
* 9551 small text line localizations (clean).&lt;br /&gt;
* 123 large text line localizations (original).&lt;br /&gt;
* 320 medium text line localizations (original).&lt;br /&gt;
* 9551 small text line localizations (original).&lt;br /&gt;
* 123 large text line localizations (scanned).&lt;br /&gt;
* 320 medium text line localizations (scanned).&lt;br /&gt;
* 9551 small text line localizations (scanned).&lt;br /&gt;
&lt;br /&gt;
The text lines dataset covers only a subset of the full-document dataset. It is generated from the binarization of the full-document images.&lt;br /&gt;
Text line localizations are stored as bounding box coordinates in text files. &lt;br /&gt;
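The x-height rule described above can be sketched in Python; the function name and the assumption that x-height is measured in pixels are illustrative only.

```python
def size_category(x_height):
    """Map a text line's x-height to its size category, per the rule
    0 < small <= 30 < medium <= 55 < large < +inf."""
    if x_height <= 0:
        raise ValueError("x-height must be positive")
    if x_height <= 30:
        return "small"
    if x_height <= 55:
        return "medium"
    return "large"
```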
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* none&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_textlines-1.0.zip Text lines localization] (9.8 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1905</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1905"/>
		<updated>2013-07-03T16:04:03Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Version 1.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes (evaluation and illustration). If so, please specify the following copyright: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* 6 binarization algorithms (and their respective C++ sources) are provided and compiled to run this benchmark on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
Please refer to [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD] for downloading the files from the original dataset site.&lt;br /&gt;
&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_orig-1.0.zip Original images] (213 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_clean-1.0.zip Clean Documents images] (67 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_scanned-1.0.zip Scanned Documents] (583 Mb)&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1904</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1904"/>
		<updated>2013-07-03T16:03:34Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes (evaluation and illustration). If so, please specify the following copyright: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* 6 binarization algorithms (and their respective C++ sources) are provided and compiled to run this benchmark on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
Please refer to [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD] for downloading the files from the original dataset site.&lt;br /&gt;
&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_orig-1.0.zip Original images] (213 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_clean-1.0.zip Clean Documents images] (67 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/LRDE/nouvel_obs_2402_scanned-1.0.zip Scanned Documents] (583 Mb)&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Binarization_of_PHIBD_2012_dataset&amp;diff=1903</id>
		<title>Binarization of PHIBD 2012 dataset</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Binarization_of_PHIBD_2012_dataset&amp;diff=1903"/>
		<updated>2013-07-03T15:50:16Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Version 1.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
Binarization of handwritten document images.&lt;br /&gt;
&lt;br /&gt;
There are two tasks, depending on the nature of the binarization method used.&lt;br /&gt;
&lt;br /&gt;
#	For regular binarization methods, the task is to binarize all 15 document images.&lt;br /&gt;
#	For learning-based binarization methods, the task is to use images 1 to 5 for training, and then binarize images 6 to 15.&lt;br /&gt;
&lt;br /&gt;
A few baseline methods have been provided: PC (phase congruency) binarization method [Ziaei2012], and SGL/BGL binarization method [Farrahi2009, Farrahi2010]. The SGL/BGL method uses a rough binarization as its initialization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Evaluation Protocol=&lt;br /&gt;
&lt;br /&gt;
#	For regular methods, the performance of a binarization method is the average F-measure of its binarized images against the provided ground truth.&lt;br /&gt;
#	For learning-based methods, the performance is the average F-measure of binarized images 6 to 15 against the provided ground truth.&lt;br /&gt;
&lt;br /&gt;
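A minimal sketch of the pixel-wise F-measure used above; it assumes binary masks flattened to lists of booleans with True marking text pixels (the function name is illustrative, not part of the provided tools):

```python
def f_measure(predicted, ground_truth):
    """Pixel-wise F-measure of a binary prediction against ground truth.

    Both inputs are flat lists of booleans (True = text pixel).
    """
    tp = sum(p and g for p, g in zip(predicted, ground_truth))
    fp = sum(p and not g for p, g in zip(predicted, ground_truth))
    fn = sum(g and not p for p, g in zip(predicted, ground_truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The per-method score in the table below is the mean of this value over the images in each subset.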
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Method&lt;br /&gt;
!Whole set&lt;br /&gt;
!Training set&lt;br /&gt;
!Test set&lt;br /&gt;
|- &lt;br /&gt;
|Otsu (regular)&lt;br /&gt;
|82.09&lt;br /&gt;
|90.76&lt;br /&gt;
|77.75&lt;br /&gt;
|-&lt;br /&gt;
|PC (regular)&lt;br /&gt;
|90.91&lt;br /&gt;
|92.33&lt;br /&gt;
|90.20&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (upper bound; rough bin: GT)&lt;br /&gt;
|94.55&lt;br /&gt;
|95.37&lt;br /&gt;
|94.14&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (upper bound; rough bin: PC)&lt;br /&gt;
|91.79&lt;br /&gt;
|93.29&lt;br /&gt;
|91.04&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (learning) (rough bin: PC)&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|89.94&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Pseudocode for a learning-based binarization method based on the stroke gray level (SGL) and background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha). The optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
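As a rough illustration only: one plausible form of such an alpha-based threshold is a weighted mean of the SGL and BGL. The actual formula used by the method (see the provided metacode and [Cheriet2012]) may differ; everything below is an assumption made for illustration.

```python
def adaptive_threshold(sgl, bgl, alpha):
    """Hypothetical local threshold between the stroke gray level (SGL)
    and the background gray level (BGL), weighted by alpha in [0, 1].
    Larger alpha pulls the threshold toward the stroke gray level."""
    return alpha * sgl + (1.0 - alpha) * bgl
```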
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]]&lt;br /&gt;
&lt;br /&gt;
=Related Ground Truth Data=&lt;br /&gt;
* [[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, Number 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, Volume Accepted, p.– (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Training.txt  Train Set Meta Data] (0 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Test.txt   Test Set Meta Data] (0 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Otsu_PHIBD_2012.zip  Otsu_PHIBD_2012 baseline method] (0.39 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/PC_PHIBD_2012.zip PC_PHIBD_2012  baseline method] (0.33 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PHIBD_2012.zip SGLBGL_PHIBD_2012  baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PC_PHIBD_2012.zip   SGLBGL_PC_PHIBD_2012 baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PC_TrainTest_PHIBD_2012.zip   SGLBGL_PC_TrainTest_PHIBD_2012 baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_metacode.m  SGLBGL_metacode.m Sample Program] (0 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Binarization_of_PHIBD_2012_dataset&amp;diff=1902</id>
		<title>Binarization of PHIBD 2012 dataset</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Binarization_of_PHIBD_2012_dataset&amp;diff=1902"/>
		<updated>2013-07-03T15:48:52Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Description= Bi…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
Binarization of handwritten document images.&lt;br /&gt;
&lt;br /&gt;
There are two tasks, depending on the nature of the binarization method used.&lt;br /&gt;
&lt;br /&gt;
#	For regular binarization methods, the task is to binarize all 15 document images.&lt;br /&gt;
#	For learning-based binarization methods, the task is to use images 1 to 5 for training, and then binarize images 6 to 15.&lt;br /&gt;
&lt;br /&gt;
A few baseline methods have been provided: PC (phase congruency) binarization method [Ziaei2012], and SGL/BGL binarization method [Farrahi2009, Farrahi2010]. The SGL/BGL method uses a rough binarization as its initialization.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Evaluation Protocol=&lt;br /&gt;
&lt;br /&gt;
#	For regular methods, the performance of a binarization method is the average F-measure of its binarized images against the provided ground truth.&lt;br /&gt;
#	For learning-based methods, the performance is the average F-measure of binarized images 6 to 15 against the provided ground truth.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;5&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Method&lt;br /&gt;
!Whole set&lt;br /&gt;
!Training set&lt;br /&gt;
!Test set&lt;br /&gt;
|- &lt;br /&gt;
|Otsu (regular)&lt;br /&gt;
|82.09&lt;br /&gt;
|90.76&lt;br /&gt;
|77.75&lt;br /&gt;
|-&lt;br /&gt;
|PC (regular)&lt;br /&gt;
|90.91&lt;br /&gt;
|92.33&lt;br /&gt;
|90.20&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (upper bound; rough bin: GT)&lt;br /&gt;
|94.55&lt;br /&gt;
|95.37&lt;br /&gt;
|94.14&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (upper bound; rough bin: PC)&lt;br /&gt;
|91.79&lt;br /&gt;
|93.29&lt;br /&gt;
|91.04&lt;br /&gt;
|-&lt;br /&gt;
|SGL/BGL (learning) (rough bin: PC)&lt;br /&gt;
|N/A&lt;br /&gt;
|N/A&lt;br /&gt;
|89.94&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Pseudocode for a learning-based binarization method based on the stroke gray level (SGL) and background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha). The optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]]&lt;br /&gt;
&lt;br /&gt;
=Related Ground Truth Data=&lt;br /&gt;
* [[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, Number 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, Volume Accepted, p.– (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Training.txt  Train Set Meta Data] (0 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Test.txt   Test Set Meta Data] (0 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Otsu_PHIBD_2012.zip   baseline method] (0.39 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/PC_PHIBD_2012.zip   baseline method] (0.33 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PHIBD_2012.zip   baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PC_PHIBD_2012.zip   baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_PC_TrainTest_PHIBD_2012.zip   baseline method] (0.36 Mb)&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/SGLBGL_metacode.m   Sample Program] (0 Mb)&lt;br /&gt;
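One of the baseline archives above applies Otsu's classic global thresholding. As a rough sketch of that standard algorithm (not the code distributed in the archive):&lt;br /&gt;

```python
import numpy as np

def otsu_threshold(image):
    """Return Otsu's global threshold for an 8-bit grayscale image.

    Picks the gray level t that maximizes the between-class variance
    of the dark class (pixels <= t) versus the bright class (pixels > t).
    """
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w_dark = sum_dark = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_dark += hist[t]            # cumulative weight of the dark class
        if w_dark == 0:
            continue
        w_bright = total - w_dark    # weight of the bright class
        if w_bright == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_bright = (sum_all - sum_dark) / w_bright
        var_between = w_dark * w_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

For document images, pixels at or below the returned threshold would be labelled text (dark) and the rest background.&lt;br /&gt;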
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Binarized_images_for_PHIBD_2012_dataset&amp;diff=1901</id>
		<title>Binarized images for PHIBD 2012 dataset</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Binarized_images_for_PHIBD_2012_dataset&amp;diff=1901"/>
		<updated>2013-07-03T15:31:01Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Version 1.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Binarization, Text detection, Ground truth&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
For each image, a manually binarized and verified image is provided, in which the text pixels are marked in black and the remaining pixels in white. To speed up the process, an automatic binarization method was used for ground truthing prior to the manual verification of the binary images. The automatic binarization method will be described in a paper (to be submitted to ICDAR’13).&lt;br /&gt;
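Given this convention (text = black = 0, background = white = 255), a quick sanity check on a loaded ground-truth image might look like the following sketch; the function name is illustrative, not part of the dataset tools:&lt;br /&gt;

```python
import numpy as np

def text_pixel_fraction(gt):
    """Fraction of pixels labelled as text in a binary ground-truth image.

    gt: 2-D uint8 array where text pixels are 0 (black) and all
        remaining pixels are 255 (white), as in the PHIBD ground truth.
    """
    if not np.isin(gt, (0, 255)).all():
        raise ValueError("ground truth must contain only 0 and 255")
    return float((gt == 0).mean())
```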
&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/GT.zip Ground truth images] (0.28 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1900</id>
		<title>Persian Heritage Image Binarization Dataset (PHIBD 2012)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1900"/>
		<updated>2013-07-03T15:30:40Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Version 1.0 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet&lt;br /&gt;
 Synchromedia Laboratory&lt;br /&gt;
 ETS, Montreal, (Quebec) Canada&lt;br /&gt;
 H3C 1K3&lt;br /&gt;
 E-mail: mohamed.cheriet@etsmtl.ca&lt;br /&gt;
 Tel: +1(514)396-8972&lt;br /&gt;
 Fax: +1(514)396-8595&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document Image Binarization, Persian Heritage, Handwritten manuscripts&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset contains 15 historical manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series providing document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community.&lt;br /&gt;
&lt;br /&gt;
It is planned to extend the dataset and, in the near future, to create a companion dataset that also covers document-understanding tasks.&lt;br /&gt;
&lt;br /&gt;
=Metadata and Technical Details=&lt;br /&gt;
As metadata, the types of degradation in each document image are provided in two text files: one for images 1 to 5 and one for images 6 to 15. Images 1 to 5 are considered the training set, while images 6 to 15 are considered the test set for binarization methods based on a learning technique.&lt;br /&gt;
&lt;br /&gt;
Also, the estimated line height and stroke width for each image are provided in these files.&lt;br /&gt;
&lt;br /&gt;
The original document images total 4.9 MB, while their ground-truth images total 324 KB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
Metacode of a learning-based binarization method based on the stroke gray level (SGL) and the background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha); the optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
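As an illustration only (the actual metacode is in the submitted files), an SGL/BGL-based threshold could be sketched as below; the linear interpolation rule and the function name are assumptions, not the authors' exact formula:&lt;br /&gt;

```python
import numpy as np

def binarize_sgl_bgl(image, sgl, bgl, alpha=0.5):
    """Binarize a grayscale image with a threshold placed between the
    stroke gray level (SGL) and the background gray level (BGL).

    The linear rule below is a hypothetical stand-in for the method's
    actual threshold function; alpha is the learned parameter.
    """
    threshold = alpha * sgl + (1.0 - alpha) * bgl
    # Pixels at or below the threshold become text (0), the rest white (255).
    return np.where(image <= threshold, 0, 255).astype(np.uint8)
```

In the real method, learning the optimal alpha on the training images (1 to 5) would replace the fixed default used here.&lt;br /&gt;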
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, in press (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/Original.zip Original images] (5 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Binarized_images_for_PHIBD_2012_dataset&amp;diff=1899</id>
		<title>Binarized images for PHIBD 2012 dataset</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Binarized_images_for_PHIBD_2012_dataset&amp;diff=1899"/>
		<updated>2013-07-03T15:28:23Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Keywords= Binar…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Binarization, Text detection, Ground truth&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
For each image, a manually binarized and verified image is provided, in which the text pixels are marked in black and the remaining pixels in white. To speed up the process, an automatic binarization method was used for ground truthing prior to the manual verification of the binary images. The automatic binarization method will be described in a paper (to be submitted to ICDAR’13).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
* [http://www.iapr-tc11.org/dataset/PHIBD2012/GT.zip Ground truth images] (2.8 Mb)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1894</id>
		<title>Persian Heritage Image Binarization Dataset (PHIBD 2012)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1894"/>
		<updated>2013-05-30T18:25:05Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet&lt;br /&gt;
 Synchromedia Laboratory&lt;br /&gt;
 ETS, Montreal, (Quebec) Canada&lt;br /&gt;
 H3C 1K3&lt;br /&gt;
 E-mail: mohamed.cheriet@etsmtl.ca&lt;br /&gt;
 Tel: +1(514)396-8972&lt;br /&gt;
 Fax: +1(514)396-8595&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document Image Binarization, Persian Heritage, Handwritten manuscripts&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset contains 15 historical manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series providing document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community.&lt;br /&gt;
&lt;br /&gt;
It is planned to extend the dataset and, in the near future, to create a companion dataset that also covers document-understanding tasks.&lt;br /&gt;
&lt;br /&gt;
=Metadata and Technical Details=&lt;br /&gt;
As metadata, the types of degradation in each document image are provided in two text files: one for images 1 to 5 and one for images 6 to 15. Images 1 to 5 are considered the training set, while images 6 to 15 are considered the test set for binarization methods based on a learning technique.&lt;br /&gt;
&lt;br /&gt;
Also, the estimated line height and stroke width for each image are provided in these files.&lt;br /&gt;
&lt;br /&gt;
The original document images total 4.9 MB, while their ground-truth images total 324 KB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
Metacode of a learning-based binarization method based on the stroke gray level (SGL) and the background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha); the optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, in press (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1893</id>
		<title>Persian Heritage Image Binarization Dataset (PHIBD 2012)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1893"/>
		<updated>2013-05-30T18:21:09Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Copyright */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet&lt;br /&gt;
 Synchromedia Laboratory&lt;br /&gt;
 ETS, Montreal, (Quebec) Canada&lt;br /&gt;
 H3C 1K3&lt;br /&gt;
 E-mail: mohamed.cheriet@etsmtl.ca&lt;br /&gt;
 Tel: +1(514)396-8972&lt;br /&gt;
 Fax: +1(514)396-8595&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document Image Binarization, Persian Heritage, Handwritten manuscripts&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset contains 15 historical manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series providing document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community.&lt;br /&gt;
&lt;br /&gt;
It is planned to extend the dataset and, in the near future, to create a companion dataset that also covers document-understanding tasks.&lt;br /&gt;
&lt;br /&gt;
=Metadata and Technical Details=&lt;br /&gt;
As metadata, the types of degradation in each document image are provided in two text files: one for images 1 to 5 and one for images 6 to 15. Images 1 to 5 are considered the training set, while images 6 to 15 are considered the test set for binarization methods based on a learning technique.&lt;br /&gt;
&lt;br /&gt;
Also, the estimated line height and stroke width for each image are provided in these files.&lt;br /&gt;
&lt;br /&gt;
The original document images total 4.9 MB, while their ground-truth images total 324 KB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
[[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
[[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
Metacode of a learning-based binarization method based on the stroke gray level (SGL) and the background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha); the optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, in press (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1892</id>
		<title>Persian Heritage Image Binarization Dataset (PHIBD 2012)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Persian_Heritage_Image_Binarization_Dataset_(PHIBD_2012)&amp;diff=1892"/>
		<updated>2013-05-30T18:20:56Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Created page with &amp;quot;Datasets -&amp;gt; Datasets List -&amp;gt; Current Page  {| style=&amp;quot;width: 100%&amp;quot; |- | align=&amp;quot;right&amp;quot; |   {|  |- | '''Created: '''2013-05-30 |- | {{Last updated}} |}  |}  =Contact Author=…&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Hossein Ziaie Nafchi, Seyed Morteza Ayatollahi, Reza Farrahi Moghaddam, and Mohamed Cheriet&lt;br /&gt;
 Synchromedia Laboratory&lt;br /&gt;
 ETS, Montreal, (Quebec) Canada&lt;br /&gt;
 H3C 1K3&lt;br /&gt;
 E-mail: mohamed.cheriet@etsmtl.ca&lt;br /&gt;
 Tel: +1(514)396-8972&lt;br /&gt;
 Fax: +1(514)396-8595&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document Image Binarization, Persian Heritage, Handwritten manuscripts&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset contains 15 historical manuscript images collected from the historical records at the Documents and Old Manuscripts Treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series providing document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community.&lt;br /&gt;
&lt;br /&gt;
It is planned to extend the dataset and, in the near future, to create a companion dataset that also covers document-understanding tasks.&lt;br /&gt;
&lt;br /&gt;
=Metadata and Technical Details=&lt;br /&gt;
As metadata, the types of degradation in each document image are provided in two text files: one for images 1 to 5 and one for images 6 to 15. Images 1 to 5 are considered the training set, while images 6 to 15 are considered the test set for binarization methods based on a learning technique.&lt;br /&gt;
&lt;br /&gt;
Also, the estimated line height and stroke width for each image are provided in these files.&lt;br /&gt;
&lt;br /&gt;
The original document images total 4.9 MB, while their ground-truth images total 324 KB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
[[Binarized images for PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
[[Binarization of PHIBD 2012 dataset]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
Metacode of a learning-based binarization method based on the stroke gray level (SGL) and the background gray level (BGL) is provided. An executable of the method will be provided in the near future.&lt;br /&gt;
&lt;br /&gt;
The proposed learning-based binarization method uses the SGL and the BGL to determine a locally-adaptive threshold value based on a parameter (alpha); the optimal selection of this parameter is the learning part of the method.&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* [Ziaei2013] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam, and Mohamed Cheriet. Persian historical document dataset with introduction to PhaseGT: A ground truthing application, to be submitted to ICDAR’13.&lt;br /&gt;
* [Ziaei2012] Hossein Ziaei Nafchi, Reza Farrahi Moghaddam and Mohamed Cheriet, Historical Document Binarization Based on Phase Information of Images, in ACCV’12 Workshop on e-Heritage, Daejeon, South Korea, Nov 5-10, 2012.&lt;br /&gt;
* [Farrahi2009] Reza Farrahi Moghaddam, and Mohamed Cheriet, RSLDI: Restoration of single-sided low-quality document images, Pattern Recognition, Volume 42, Issue 12, p.3355–3364 (2009) DOI: 10.1016/j.patcog.2008.10.021&lt;br /&gt;
* [Farrahi2010] Reza Farrahi Moghaddam, and Mohamed Cheriet, A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition, Volume 43, Issue 6, p.2186–2198 (2010) DOI: 10.1016/j.patcog.2009.12.024&lt;br /&gt;
* [Cheriet2012] Mohamed Cheriet, Reza Farrahi Moghaddam, and Rachid Hedjam, A learning framework for the optimization and automation of document binarization methods, Computer Vision and Image Understanding, in press (2012) DOI: 10.1016/j.cviu.2012.11.003&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1891</id>
		<title>Datasets List</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets_List&amp;diff=1891"/>
		<updated>2013-05-30T18:20:36Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* On-line and Off-line */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]]&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
See the datasets [[Datasets per Journal / Conference|sorted according to the Journal / Conference]] they first appeared in.&lt;br /&gt;
&lt;br /&gt;
= Complex Text Containers =&lt;br /&gt;
== Scene Text ==&lt;br /&gt;
* [[MSRA Text Detection 500 Database (MSRA-TD500)]]&lt;br /&gt;
* [[The Street View Text Dataset]]&lt;br /&gt;
* [[The Street View House Numbers (SVHN) Dataset]]&lt;br /&gt;
* [[NEOCR: Natural Environment OCR Dataset]]&lt;br /&gt;
* [[KAIST Scene Text Database]]&lt;br /&gt;
* [[ICDAR 2003 Robust Reading Competitions]]&lt;br /&gt;
* [[ICDAR 2005 Robust Reading Competitions]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Born Digital Images ==&lt;br /&gt;
* [[ICDAR 2011 Robust Reading Competition - Challenge 1: &amp;quot;Reading Text in Born-Digital Images (Web and Email)&amp;quot;]]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Machine-printed Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Table Ground Truth for the UW3 and UNLV datasets]]&lt;br /&gt;
* [[The DocLab Dataset for Evaluating Table Interpretation Methods]]&lt;br /&gt;
* [http://dataset.primaresearch.org/ PRImA Layout Analysis Dataset]&lt;br /&gt;
* [http://www.dfki.uni-kl.de/~shafait/downloads.html DFKI Dewarping Contest Dataset (CBDAR 2007)] The dataset, which was used in the CBDAR 2007 Dewarping Contest, contains 102 camera-captured documents with their corresponding ASCII text ground truth. Additionally, text-line-level ground truth was prepared to benchmark curled text-line segmentation algorithms. Part of the dataset (76 of the 102 pages) was also scanned with a flat-bed scanner to create ground-truth images for image-based evaluation of page dewarping algorithms.&lt;br /&gt;
* [http://diuf.unifr.ch/diva/APTI/ APTI: Arabic Printed Text Image Database]&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]] This dataset is composed of document images extracted from a single French magazine, Le Nouvel Observateur, issue 2402, November 18th-24th, 2010. It comprises 375 full-document images (A4 format, 300-dpi resolution).&lt;br /&gt;
* [http://ciir.cs.umass.edu/downloads/ocr-evaluation/ RETAS OCR Evaluation Dataset] The RETAS dataset (used in the paper by Yalniz and Manmatha, ICDAR'11) was created to evaluate the optical character recognition (OCR) accuracy of real scanned books. It contains real OCR outputs for 160 scanned books (100 English, 20 French, 20 German, 20 Spanish) downloaded from the Internet Archive website. The corresponding ground-truth text for each scanned book is obtained from the Project Gutenberg database. The OCR output of each scanned book is aligned with its ground truth at the word and character level, and the alignment output is provided along with estimated OCR accuracies. The dataset is provided for research purposes.&lt;br /&gt;
&lt;br /&gt;
= Graphical Documents =&lt;br /&gt;
&lt;br /&gt;
* [[Chem-Infty Dataset: A ground-truthed dataset of Chemical Structure Images]]&lt;br /&gt;
* [[Braille Dataset - Shiraz University]]&lt;br /&gt;
* [http://www.eurecom.fr/~huet/work.html TradeMarks Image Database] - By way of Benoit Huet, 999 trademark and logo images&lt;br /&gt;
&lt;br /&gt;
= Mixed Content Documents = &lt;br /&gt;
* [http://www.umiacs.umd.edu/~zhugy/Tobacco800.html Tobacco800 Document Image Database] - composed of 1290 document images collected and scanned using a wide variety of equipment over time.&lt;br /&gt;
&lt;br /&gt;
= Handwritten Documents =&lt;br /&gt;
== On-line and Off-line ==&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2009 Signature Verification Competition (SigComp2009)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2010 Signature Verification Competition (4NSigComp2010)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICDAR 2011 Signature Verification Competition (SigComp2011)]]&lt;br /&gt;
&lt;br /&gt;
* [[ICFHR 2012 Signature Verification Competition (4NSigComp2012)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.nlpr.ia.ac.cn/databases/handwriting/Home.html CASIA Online and Offline Chinese Handwriting Databases] - These Chinese handwriting datasets were produced by 1,020 writers using an Anoto pen on paper, so that both online and offline data were obtained. Both the online and the offline datasets consist of three subsets of isolated characters (DB1.0–1.2, about 3.9 million samples of 7,356 classes) and three subsets of handwritten texts (DB2.0–2.2, about 5,090 pages and 1.35 million characters). The datasets are free for academic research on handwritten document segmentation and retrieval, character and text-line recognition, and writer adaptation and identification.&lt;br /&gt;
&lt;br /&gt;
* [[Persian Heritage Image Binarization Dataset (PHIBD 2012)]] This dataset contains 15 historical and old manuscript images collected from the historical records at the Documents and old manuscripts treasury of Mirza Mohammad Kazemaini (affiliated with Hazrate Emamzadeh Jafar), Yazd, Iran. The images suffer from various types of degradation, including bleed-through, faded ink, and blur. The dataset is the first in a series intended to provide document images and their ground truth as a contribution to the document image analysis and recognition (DIAR) community. More data and ground-truth information are planned for the future.&lt;br /&gt;
&lt;br /&gt;
== On-line ==&lt;br /&gt;
* [[CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions]]&lt;br /&gt;
&lt;br /&gt;
* [[Devanagari Character Dataset]]&lt;br /&gt;
&lt;br /&gt;
* [[Harbin Institute of Technology Opening Recognition Corpus for Chinese Characters (HIT-OR3C)]]&lt;br /&gt;
&lt;br /&gt;
* [[IAM Online Document Database (IAMonDo-database)]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database IAM On-Line Handwriting Database]&lt;br /&gt;
&lt;br /&gt;
* [http://hwr.nici.kun.nl/unipen/ UNIPEN database] (Click on link 'CDROMs')&lt;br /&gt;
&lt;br /&gt;
* [http://www.tuat.ac.jp/~nakagawa/database/ Nakagawa Lab Online Handwriting Database]&lt;br /&gt;
&amp;lt;!--** Reference: [http://www.springerlink.com/content/kmhh7dg6h8cgr6a5/ Masaki Nakagawa and Kaoru Matsumoto: &amp;quot;Collection of on-line handwritten Japanese character pattern databases and their analysis,&amp;quot; International Journal on Document Analysis and Recognition, Vol. 7 No. 1, pp.69-81 (2004)].--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* [http://www.ai.rug.nl/~lambert/unipen/icdar-03-competition/ The Informal Competition of Recognizing On-line Words (ICROW)] by the Unipen Foundation&lt;br /&gt;
&lt;br /&gt;
== Off-line ==&lt;br /&gt;
* [http://www.rimes-database.fr/wiki/doku.php The Rimes Database] comprises 12,723 handwritten pages corresponding to 5,605 letters of two to three pages each. It was collected by asking volunteers to write a letter based on one of nine predefined scenarios related to business/customer relations. The dataset has been used in numerous ICDAR and ICFHR competitions. It is available for research purposes only, through the authors' Web site.&lt;br /&gt;
&lt;br /&gt;
* [[IBN SINA: A database for research on processing and understanding of Arabic manuscripts images]]&lt;br /&gt;
&lt;br /&gt;
* [http://www.cedar.buffalo.edu/Databases/CDROM1/ CEDAR Off-line Handwriting CDROM1]&lt;br /&gt;
&lt;br /&gt;
* [http://www.iam.unibe.ch/fki/databases/iam-handwriting-database IAM Database] - A full English sentence database for off-line handwriting recognition.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/germana The GERMANA Dataset] - GERMANA is the result of digitising and annotating a 764-page Spanish manuscript entitled &amp;quot;Noticias y documentos relativos a Doña Germana de Foix, última Reina de Aragón&amp;quot;, written in 1891 by Vicent Salvador. It contains approximately 21K text lines manually marked and transcribed by palaeography experts.&lt;br /&gt;
&lt;br /&gt;
* [http://prhlt.iti.upv.es/page/projects/multimodal/idoc/rodrigo The RODRIGO Dataset] - RODRIGO is the result of digitising and annotating a manuscript dated 1545. Digitisation was done at 300 dpi in colour by the Spanish Culture Ministry. The original manuscript is an 853-page bound volume, entitled &amp;quot;Historia de España del arçobispo Don Rodrigo&amp;quot;, completely written in old Castilian (Spanish) by a single author. Annotation exists for text blocks, lines and transcriptions, resulting in approximately 20K lines and 231K running words from a lexicon of 17K words.&lt;br /&gt;
&lt;br /&gt;
* [http://marg.nlm.nih.gov/ MARG - Medical Article Records Groundtruth] - A freely available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its &amp;quot;ground truth&amp;quot;. Please contact Dr. George Thoma (thoma@lhc.nlm.nih.gov) at the National Library of Medicine for more information.&lt;br /&gt;
&lt;br /&gt;
* [http://kornai.com/Hindi/ Hindi font samples] by Andras Kornai, June 5 2003&lt;br /&gt;
&lt;br /&gt;
= Software and Tools =&lt;br /&gt;
* [http://lampsrv02.umiacs.umd.edu/projdb/project.php?id=53 GEDI: Groundtruthing Environment for Document Images] - A generic annotation tool for scanned text documents.&lt;br /&gt;
* [http://www2.parc.com/isl/groups/pda/pixlabeler/index.html PixLabeler] - a research tool for labeling elements in a document image at a pixel level.&lt;br /&gt;
* [http://code.google.com/p/ocropus/ OCRopus(tm)] - The OCRopus(tm) open source document analysis and OCR system&lt;br /&gt;
* [http://htk.eng.cam.ac.uk/ The Hidden Markov Model Toolkit (HTK)] - a portable toolkit for building and manipulating hidden Markov models&lt;br /&gt;
* [https://github.com/meierue/RNNLIB Bidirectional Long Short-Term Memory Networks] - An implementation of bidirectional Long Short-Term Memory (BLSTM) networks combined with Connectionist Temporal Classification (CTC), including examples for Arabic recognition.&lt;br /&gt;
* [http://www.speech.sri.com/projects/srilm/ SRILM - The SRI Language Modeling Toolkit] - SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.&lt;br /&gt;
* [http://torch5.sourceforge.net/ Torch 5] - a Matlab-like environment for state-of-the-art machine learning algorithms.&lt;br /&gt;
* [http://www.prtools.org/ PRTools] - a Matlab based toolbox for pattern recognition&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- [[Datasets (old page)| Old Page (contains broken links)]] --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1890</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1890"/>
		<updated>2013-05-30T17:37:27Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Submitted Files */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes, for evaluation and illustration. If so, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run the benchmark. The script can also update the dataset when a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and their sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* Six binarization algorithms (with their C++ sources) are provided and compiled, so that the benchmark can be run on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
To be available soon. Please refer to [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD] to download the files from the original dataset site.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1889</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1889"/>
		<updated>2013-05-30T17:36:47Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes, for evaluation and illustration. If so, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run the benchmark. The script can also update the dataset when a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and their sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* Six binarization algorithms (with their C++ sources) are provided and compiled, so that the benchmark can be run on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1888</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1888"/>
		<updated>2013-05-30T17:36:22Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire  F-94276 Le Kremlin-Bicetre  France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You are allowed to reuse these documents for research purposes, for evaluation and illustration. If so, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
The dataset is also available at [http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD#Setup_script_v1_0 http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/DatasetDBD#Setup_script_v1_0]&lt;br /&gt;
&lt;br /&gt;
This dataset is composed of document images extracted from a single French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset is composed of 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run the benchmark. The script can also update the dataset when a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and their sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* Six binarization algorithms (with their C++ sources) are provided and compiled, so that the benchmark can be run on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5GB of free space, Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Datasets&amp;diff=1887</id>
		<title>Datasets</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Datasets&amp;diff=1887"/>
		<updated>2013-05-30T17:31:28Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Overview – Message from TC-11 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Overview – Message from TC-11=&lt;br /&gt;
[[Image:ThisWayToTheDatasets.png|250px|right|link=Datasets List| Datasets List]]&lt;br /&gt;
&lt;br /&gt;
It is extremely important for the Document Image Analysis and Recognition community to be able to cross check and reproduce results described in published papers in the field. In order to achieve this, any datasets used as the basis for publications should be publicly available, as is the norm in many other disciplines.&lt;br /&gt;
&lt;br /&gt;
Authors are actively encouraged to submit the datasets they used to train and / or evaluate their algorithms to the TC-11 in order for them to be published on the TC-11 Web site.&lt;br /&gt;
&lt;br /&gt;
This initiative is not restricted to datasets. At TC-11 we are interested in archiving online any piece of data (ground-truth data, software, etc.) that makes it easy to reproduce results, set new targets, foster healthy competition, encourage collaboration, and generally advance the DIAR field as a whole.&lt;br /&gt;
&lt;br /&gt;
A wealth of datasets and corresponding ground truth data are already available through the TC-11 [[Datasets List]].&lt;br /&gt;
&lt;br /&gt;
If you wish to contribute, please read below about the procedure for submitting material to the TC-11. For any comments or suggestions, please contact Marcus Liwicki, the dataset curator, at liwicki (at) dfki.uni-kl.de.&lt;br /&gt;
&lt;br /&gt;
=Submission Protocol=&lt;br /&gt;
The process of submitting a dataset / ground truth to the TC-11 is the following:&lt;br /&gt;
&lt;br /&gt;
# Please fill in the form below, and send it by email to Dimosthenis Karatzas (dimos at cvc.uab.es), the TC-11 dataset curator.&lt;br /&gt;
# The TC-11 dataset curator will review the submission request, and ensure that all information is clear and complete, and any copyright issues are properly addressed.&lt;br /&gt;
# As soon as all information is in place, the TC-11 dataset curator will ask you to sign and return by fax the final submission form.&lt;br /&gt;
# The TC-11 dataset curator will work with you to upload the dataset to the TC-11 Web site. Depending on the nature of the dataset this might be as easy as sending a CD or uploading the required files.&lt;br /&gt;
&lt;br /&gt;
The TC-11 is actively working towards a more comprehensive way of dealing with datasets and associated information.&lt;br /&gt;
&lt;br /&gt;
=Copyright Note=&lt;br /&gt;
TC-11 provides dataset hosting services as a benefit to the international research community. If it is determined that copyrighted material is improperly included in a dataset submitted for inclusion on the TC-11 website, we will immediately remove the offending material upon notification by the copyright holder.&lt;br /&gt;
&lt;br /&gt;
By submitting a dataset for inclusion to the TC-11 Web site, the author certifies that he/she has the right to publish the dataset and any associated data in the public domain and the act of doing so does not violate intellectual property rights or copyrights of some third party.&lt;br /&gt;
&lt;br /&gt;
The TC-11 will provide a service through which the submitted dataset and any associated data will be made public to the Document Analysis community worldwide. In case any legal dispute arises in the future in relation to the publishing of this dataset and associated data in the public domain, the author will hold TC-11 free from any wrongdoing and accept responsibility for the publication of these data.&lt;br /&gt;
&lt;br /&gt;
By submitting a dataset and associated data to the TC-11, you explicitly accept that any third party can independently submit additional information that relates to the original dataset (e.g. additional ground-truth data, software, etc).&lt;br /&gt;
&lt;br /&gt;
We strongly encourage authors, where they own the copyrights of the submitted information, to consider offering it to the community under a [http://creativecommons.org/choose/ Creative Commons license]. See [http://wiki.creativecommons.org/Before_Licensing/ this link] for guidelines on choosing a proper Creative Commons license.&lt;br /&gt;
&lt;br /&gt;
=Useful Definitions=&lt;br /&gt;
'''Dataset''': A collection of data along with metadata information, as required to use these data.&lt;br /&gt;
&lt;br /&gt;
'''Metadata''': Information specific to a particular dataset, usually tightly structured within the dataset itself (e.g. information encoded in the filenames of submitted images). Metadata can only be submitted at the time of submission of the dataset.&lt;br /&gt;
&lt;br /&gt;
'''Ground Truth Specification''': The definition of the required information that accurately describes a particular aspect of the data at a high level where agreement between different human observers can be established, as well as the definition of an appropriate structure (format) for storing this information.&lt;br /&gt;
&lt;br /&gt;
'''Ground Truth Data''': A set of data conforming to a particular ground truth specification and relating to a specific dataset. Ground Truth Data can be submitted at any time, while different Ground Truth Data (corresponding to different aspects of the data) can be associated with the same dataset.&lt;br /&gt;
&lt;br /&gt;
'''Task''': A well defined process to evaluate algorithms in the context of a specific scientific problem. A task would typically provide a specific evaluation protocol, and link to specific resources as required (a dataset, and usually related ground truth data). Tasks should correspond to open challenges in the field. If you undertake any of the tasks defined and you have published results or code available, we would really like to know!&lt;br /&gt;
&lt;br /&gt;
'''Resources''': Any other type of related resources that are not specifically covered by the above definitions. Examples would include software to browse and visualise a dataset, software to create ground truth data, algorithms to do performance evaluation, codecs, reports, publications, etc.&lt;br /&gt;
&lt;br /&gt;
=Submission Form=&lt;br /&gt;
You can download the submission form here: [[media:TC11_Dataset_Submission_Form_v09.pdf| PDF]] or [[Media:TC11_Dataset_Submission_Form_v09.docx| Word 2007]]&lt;br /&gt;
&lt;br /&gt;
The submission form has six sections; please fill in only those applicable to your situation. It is helpful to look at already-published datasets to get an idea of the information needed. Feel free to use as much space as you need, but a couple of paragraphs are generally more than enough to describe each aspect of the submitted material.&lt;br /&gt;
&lt;br /&gt;
=The TC11 Datasets=&lt;br /&gt;
The current list of datasets, sorted by research topic, can be found [[Datasets List|here]].&lt;br /&gt;
&lt;br /&gt;
The list of datasets sorted by the journal or conference in which they first appeared can be found [[Datasets per Journal / Conference|here]].&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_text_line_localization&amp;diff=1886</id>
		<title>Ground Truth for LRDE DBD text line localization</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=Ground_Truth_for_LRDE_DBD_text_line_localization&amp;diff=1886"/>
		<updated>2013-05-30T17:30:32Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Submitted Files */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
scanned, magazine, documents, text line localization&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
&lt;br /&gt;
Text line localization information has been made available by applying text line localization algorithms. The size category of a text line depends on its x-height, according to the following rule: 0 &amp;lt; small &amp;lt;= 30 &amp;lt; medium &amp;lt;= 55 &amp;lt; large &amp;lt; +inf&lt;br /&gt;
&lt;br /&gt;
* 123 large text line localizations (clean)&lt;br /&gt;
* 320 medium text line localizations (clean)&lt;br /&gt;
* 9551 small text line localizations (clean)&lt;br /&gt;
* 123 large text line localizations (original)&lt;br /&gt;
* 320 medium text line localizations (original)&lt;br /&gt;
* 9551 small text line localizations (original)&lt;br /&gt;
* 123 large text line localizations (scanned)&lt;br /&gt;
* 320 medium text line localizations (scanned)&lt;br /&gt;
* 9551 small text line localizations (scanned)&lt;br /&gt;
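The size rule above can be sketched as a small function (a minimal illustration; the unit of the x-height, presumably pixels, is not stated on this page):&lt;br /&gt;

```python
def size_category(x_height):
    """Return the size category of a text line given its x-height.

    Thresholds follow the rule above: small up to 30,
    medium above 30 and up to 55, large above 55.
    """
    if x_height > 55:
        return "large"
    if x_height > 30:
        return "medium"
    return "small"
```

For example, a text line with an x-height of 42 falls into the medium category.&lt;br /&gt;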
&lt;br /&gt;
The text line dataset covers only a subset of the full-document dataset. It is generated from the binarization of the full-document images.&lt;br /&gt;
Text line localizations are stored as bounding-box coordinates in text files.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Related Dataset=&lt;br /&gt;
* [[LRDE Document Binarization Dataset (LRDE DBD)]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* none&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
* [http://www.iapr-tc11.org/mediawiki/images/LRDE_DBD_ouvel_obs_2402_textlines-1.0.zip http://www.iapr-tc11.org/mediawiki/images/LRDE_DBD_ouvel_obs_2402_textlines-1.0.zip]&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=File:LRDE_DBD_ouvel_obs_2402_textlines-1.0.zip&amp;diff=1885</id>
		<title>File:LRDE DBD ouvel obs 2402 textlines-1.0.zip</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=File:LRDE_DBD_ouvel_obs_2402_textlines-1.0.zip&amp;diff=1885"/>
		<updated>2013-05-30T17:27:39Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: Text line ground truth for http://www.iapr-tc11.org/mediawiki/index.php/Ground_Truth_for_LRDE_DBD_text_line_localization&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Text line ground truth for http://www.iapr-tc11.org/mediawiki/index.php/Ground_Truth_for_LRDE_DBD_text_line_localization&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1884</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1884"/>
		<updated>2013-05-30T17:23:42Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Contact Author */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire, F-94276 Le Kremlin-Bicêtre, France&lt;br /&gt;
&lt;br /&gt;
=Copyright=&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You may reuse these documents for research purposes, for evaluation and illustration. If you do, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset is composed of document images extracted from a single issue of a French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset comprises 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* 6 binarization algorithms are provided (with their respective C++ sources) and compiled, so that the benchmark can be run on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5 GB of free disk space; Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
	<entry>
		<id>http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1883</id>
		<title>LRDE Document Binarization Dataset (LRDE DBD)</title>
		<link rel="alternate" type="text/html" href="http://iapr-tc11.org/mediawiki/index.php?title=LRDE_Document_Binarization_Dataset_(LRDE_DBD)&amp;diff=1883"/>
		<updated>2013-05-30T17:22:58Z</updated>

		<summary type="html">&lt;p&gt;Liwicki: /* Contact Author */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Datasets]] -&amp;gt; [[Datasets List]] -&amp;gt; Current Page&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;width: 100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| align=&amp;quot;right&amp;quot; | &lt;br /&gt;
&lt;br /&gt;
{| &lt;br /&gt;
|-&lt;br /&gt;
| '''Created: '''2013-05-30&lt;br /&gt;
|-&lt;br /&gt;
| {{Last updated}}&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=Contact Author=&lt;br /&gt;
 Thierry Géraud – thierry.geraud@lrde.epita.fr&lt;br /&gt;
 EPITA Research and Development Laboratory (LRDE)&lt;br /&gt;
 14-16 rue Voltaire, F-94276 Le Kremlin-Bicêtre, France&lt;br /&gt;
&lt;br /&gt;
LRDE is the copyright holder of all the images included in the dataset, except for the original documents subset, which is copyrighted by [http://www.nouvelobs.com/ Le Nouvel Observateur]. This work is based on the French magazine Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
You may reuse these documents for research purposes, for evaluation and illustration. If you do, please include the following copyright notice: &amp;quot;Copyright (c) 2012. EPITA Research and Development Laboratory (LRDE) with permission from Le Nouvel Observateur&amp;quot;. You are not allowed to redistribute this dataset.&lt;br /&gt;
&lt;br /&gt;
If you use this dataset, please also cite the most appropriate paper from this list:&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013]&lt;br /&gt;
* [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201109-ICDAR The SCRIBO Module of the Olena Platform: a Free Software Framework for Document Image Analysis. In the proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR), 2011.]&lt;br /&gt;
&lt;br /&gt;
This data set is provided &amp;quot;as is&amp;quot; and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.&lt;br /&gt;
&lt;br /&gt;
=Current Version=&lt;br /&gt;
1.0&lt;br /&gt;
&lt;br /&gt;
=Keywords=&lt;br /&gt;
Document binarization, Magazine, Scanned &lt;br /&gt;
&lt;br /&gt;
=Description=&lt;br /&gt;
This dataset is composed of document images extracted from a single issue of a French magazine: Le Nouvel Observateur, issue 2402, November 18th-24th, 2010.&lt;br /&gt;
&lt;br /&gt;
The provided dataset comprises 375 full-document images (A4 format, 300-dpi resolution):&lt;br /&gt;
&lt;br /&gt;
* 125 digital &amp;quot;original documents&amp;quot; extracted from a PDF, with full OCR ground truth.&lt;br /&gt;
* 125 digital &amp;quot;clean documents&amp;quot; created from the &amp;quot;original documents&amp;quot; by removing the images.&lt;br /&gt;
* 125 &amp;quot;scanned documents&amp;quot; based on the &amp;quot;clean documents&amp;quot;: they have been printed, scanned, and registered to match the &amp;quot;clean documents&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Purpose of the three document qualities:&lt;br /&gt;
* Original: evaluate the binarization quality on perfect documents mixing text and images.&lt;br /&gt;
* Clean: evaluate the binarization quality on perfect documents with text only.&lt;br /&gt;
* Scanned: evaluate the binarization quality on slightly degraded documents with text only.&lt;br /&gt;
&lt;br /&gt;
=Ground Truth Data=&lt;br /&gt;
* [[Ground Truth for LRDE DBD text line localization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD binarization]]&lt;br /&gt;
* [[Ground Truth for LRDE DBD OCR]]&lt;br /&gt;
&lt;br /&gt;
=Related Tasks=&lt;br /&gt;
* [[Document Binarization Evaluation for LRDE DBD]]&lt;br /&gt;
&lt;br /&gt;
=Software=&lt;br /&gt;
* A setup script is provided to download and configure the benchmarking environment. This is the recommended way to run this benchmark. Note that this script can also update the dataset if a new version is released.&lt;br /&gt;
* A Python script is provided to launch the benchmark and compute scores.&lt;br /&gt;
* C++ programs (and sources) are provided for performing evaluations and reading ground-truth data.&lt;br /&gt;
* 6 binarization algorithms are provided (with their respective C++ sources) and compiled, so that the benchmark can be run on their results.&lt;br /&gt;
&lt;br /&gt;
Minimum requirements: 5 GB of free disk space; Linux (Ubuntu, Debian, …)&lt;br /&gt;
&lt;br /&gt;
Dependencies: Python 2.7, tesseract-ocr, tesseract-ocr-fra, git, libgraphicsmagick++1-dev, graphicsmagick-imagemagick-compat, graphicsmagick-libmagick-dev-compat, build-essential, libtool, automake, autoconf, g++-4.6, libqt4-dev (installed automatically by the setup script on Ubuntu and Debian).&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
* G. Lazzara, T. Géraud. Efficient Multiscale Sauvola's Binarization. In International Journal of Document Analysis and Recognition (IJDAR), 2013. [http://www.lrde.epita.fr/cgi-bin/twiki/view/Publications/201302-IJDAR]&lt;br /&gt;
&lt;br /&gt;
=Submitted Files=&lt;br /&gt;
==Version 1.0==&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]].&lt;/div&gt;</summary>
		<author><name>Liwicki</name></author>
		
	</entry>
</feed>