Text and Non-Text Distinction in Online Handwritten Documents

From TC11

Created: 2010-06-28
Last updated: 2010-06-28

Description

A system is to be developed, which may have to be trained on a set of documents referred to as the training set. The system must decide, for every stroke of a document, whether the stroke is part of a text or a non-text content element of the document.

The stroke accuracy of a system is the fraction of strokes that it classifies correctly. If the online documents are converted to images, the pixel accuracy of the system is the fraction of pixels that belong to correctly classified strokes.
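The two measures can be illustrated with a minimal sketch. It assumes that each stroke carries a ground-truth label and a predicted label, and that the number of pixels each stroke covers in the rendered image is supplied by the caller. The function names, the label strings, and the per-stroke pixel counts are assumptions made for illustration; the rendering step is not shown, and pixels shared by overlapping strokes are not de-duplicated.

```python
from typing import Sequence


def _check_lengths(*sequences: Sequence) -> None:
    """Raise if the sequences are empty or differ in length."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    if lengths == {0}:
        raise ValueError("at least one stroke is required")


def stroke_accuracy(true_labels: Sequence[str],
                    predicted_labels: Sequence[str]) -> float:
    """Fraction of strokes whose predicted label equals the ground truth."""
    _check_lengths(true_labels, predicted_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def pixel_accuracy(true_labels: Sequence[str],
                   predicted_labels: Sequence[str],
                   pixel_counts: Sequence[int]) -> float:
    """Fraction of pixels that belong to correctly classified strokes.

    pixel_counts[k] is the (assumed, caller-supplied) number of pixels
    that stroke k covers once the document is rendered as an image.
    """
    _check_lengths(true_labels, predicted_labels, pixel_counts)
    total = sum(pixel_counts)
    if total <= 0:
        raise ValueError("total pixel count must be positive")
    correct = sum(n for t, p, n
                  in zip(true_labels, predicted_labels, pixel_counts)
                  if t == p)
    return correct / total
```

Because long strokes cover more pixels than short ones, the two values can differ noticeably for the same set of predictions.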

In this task a stroke is considered to be a text stroke if, at any hierarchical level, it is annotated as formula or text line. Other strokes are labelled as non-text strokes. In this task marking elements are ignored.
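The labelling rule can be sketched as follows, assuming that the annotation types of all content elements containing a stroke, at every hierarchical level, are already available as a list of strings. The type names used here are hypothetical placeholders and need not match the identifiers in the ground-truth files. The handling of marking elements is deliberately left out, since the description above does not specify how they are to be ignored.

```python
from typing import Iterable

# Hypothetical annotation type names; the identifiers used in the actual
# ground-truth files may differ.
TEXT_ANNOTATION_TYPES = frozenset({"formula", "text line"})


def stroke_label(annotation_types: Iterable[str]) -> str:
    """Return "text" if any enclosing annotation is a formula or text line.

    annotation_types lists the types of all content elements that contain
    the stroke, at every hierarchical level. Any other stroke is labelled
    "non-text".
    """
    normalized = {t.strip().lower() for t in annotation_types}
    return "text" if normalized & TEXT_ANNOTATION_TYPES else "non-text"
```

For example, under these assumed names a stroke annotated as `["Table", "Formula"]` is a text stroke, because one of its enclosing elements is a formula.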

Evaluation Protocol

The dataset is split into 5 disjoint sets, each consisting of approximately 200 documents. No two documents from different sets were created by the same writer. The sets are indexed from 0 to 4 and are defined by 5 files listing the names of the documents they contain. The set files are named 0.set, 1.set, 2.set, 3.set, and 4.set.
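A minimal sketch for loading the five set files is given below. It assumes that each file is plain text with one document name per line, which the description does not state explicitly; the function names are likewise illustrative. The loader also checks that the sets are disjoint.

```python
from pathlib import Path
from typing import Dict, Set, Union

SET_INDICES = range(5)  # sets 0 to 4


def read_set_file(path: Union[str, Path]) -> Set[str]:
    """Read document names, assuming one name per line (blank lines skipped)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def load_sets(directory: Union[str, Path]) -> Dict[int, Set[str]]:
    """Load 0.set to 4.set from a directory and verify they are disjoint."""
    sets = {i: read_set_file(Path(directory) / f"{i}.set")
            for i in SET_INDICES}
    seen: Set[str] = set()
    for index, names in sets.items():
        overlap = seen & names
        if overlap:
            raise ValueError(
                f"set {index} shares documents with an earlier set: "
                f"{sorted(overlap)}")
        seen |= names
    return sets
```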

Two different approaches to conducting experiments have been defined for this dataset:

  1. Sets 0 and 1 are used for training, set 2 is used to validate system parameters, and set 3 is the test set.
  2. A 4-fold cross validation in which sets ((0 + i) mod 4) and ((1 + i) mod 4) are used for training, set ((2 + i) mod 4) for validation, and set ((3 + i) mod 4) for testing, for i = 0, ..., 3. The modulo is applied to the whole sum, so that only sets 0 to 3 take part in the rotation.

Set 4 serves as an independent test set and should be used only once per system.
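The experimental setups described in this section can be sketched as a small helper that returns the set indices for each fold. The sketch assumes that the modulo in the cross-validation scheme applies to the whole sum, so that set 4 never enters the rotation; the function and key names are illustrative only.

```python
from typing import Dict, List

NUM_ROTATING_SETS = 4      # sets 0 to 3 take part in the cross validation
INDEPENDENT_TEST_SET = 4   # held out; to be used only once


def fold(i: int) -> Dict[str, List[int]]:
    """Set indices for fold i of the 4-fold cross validation."""
    if not 0 <= i < NUM_ROTATING_SETS:
        raise ValueError("fold index must be in 0..3")
    n = NUM_ROTATING_SETS
    return {
        "train": [(0 + i) % n, (1 + i) % n],
        "validation": [(2 + i) % n],
        "test": [(3 + i) % n],
    }


def cross_validation_folds() -> List[Dict[str, List[int]]]:
    """All four folds; fold 0 coincides with the first, fixed setup."""
    return [fold(i) for i in range(NUM_ROTATING_SETS)]
```

Under this reading, fold 0 reproduces the fixed split (training on sets 0 and 1, validation on set 2, testing on set 3), and each of the sets 0 to 3 serves as the test set exactly once across the four folds.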

Related Dataset

Related Ground Truth Data

References

  1. E. Indermühle, H. Bunke, F. Shafait and T. Breuel. Text versus non-Text Distinction in Online Handwritten Documents. In Proc. 25th Symposium On Applied Computing, pages 3–7, 2010.
  2. E. Indermühle, M. Liwicki and H. Bunke. IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents. In Proc. of Int. Workshop on Document Analysis Systems, pages 97–104, 2010.


