IAM Online Document Database (IAMonDo-database)

From TC11


Created: 2010-06-28
Last updated: 2010-06-28

Contact Author

Emanuel Indermühle
Neubrückstrasse 10, 
3012 Bern, Switzerland
Email: eindermu@iam.unibe.ch

Current Version

1.0

Keywords

Online, handwriting, document, text, non-text, diagram, table, list, formula, drawings, layout, document zone

Description

The dataset contains 941 online handwritten documents by 189 writers. The documents consist of text blocks, lists, tables, formulas, diagrams, and drawings. These pieces of content have been placed at arbitrary positions on each document.

In the acquisition phase, the writers were asked to copy the content of a template onto a sheet of A4 paper. The writing was recorded online with a digital pen. The 1000 templates are composed of text from the Brown corpus and of drawings, diagrams, and formulas provided by Wikimedia Commons and Wikipedia, respectively. In total, 200 different diagrams, 200 different drawings, and 200 different formulas are used. The text for the text blocks, lists, and tables has been selected randomly from 11 categories of the Brown corpus.

After writing a document, the writer marked it up using a small set of marking elements: underlining text; marking text that spans one or more lines with an angle at the top left as a start mark and an angle at the bottom right as an end mark; enclosing text in square brackets; marking entire text lines with a vertical stroke on the right or left side of the text block; encircling; annotating these markings with small text labels; and drawing lines that connect the text labels to the markings. A subset of 839 documents contains such marking elements, 10 per document on average.

Some of the documents have landscape orientation, others have portrait orientation. On some documents, separate text parts have a different orientation.

The 941 documents contain 68841 words, 7616 text lines in 1478 text blocks, 2068 list items in 536 lists, 2550 table cells in 450 tables, 5698 labels and 917 drawings in 910 diagrams, 546 drawings without labels, and 489 formulas.
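To put these counts in proportion, the following minimal sketch derives a few averages from them. The dictionary keys and the `ratio` helper are names chosen for this illustration only; the numbers are the ones listed above.

```python
# Counts copied from the description above; the key names are illustrative.
COUNTS = {
    "documents": 941,
    "words": 68841,
    "text_lines": 7616,
    "text_blocks": 1478,
    "list_items": 2068,
    "lists": 536,
    "table_cells": 2550,
    "tables": 450,
}


def ratio(numerator: str, denominator: str) -> float:
    """Return the average number of `numerator` units per `denominator` unit."""
    return COUNTS[numerator] / COUNTS[denominator]


if __name__ == "__main__":
    pairs = [
        ("words", "documents"),
        ("text_lines", "text_blocks"),
        ("list_items", "lists"),
        ("table_cells", "tables"),
    ]
    for numerator, denominator in pairs:
        print(f"{numerator} per {denominator}: {ratio(numerator, denominator):.2f}")
```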

The file format is InkML, as defined in the 2006 W3C working draft. Each document is stored in its own InkML file. The handwriting is recorded by collecting a sample point every 13 to 14 milliseconds along the pen trajectory. Sample points lying on a straight line have been removed by the recording device. Each sample point consists of its coordinates on the sheet, a timestamp in milliseconds, and a pressure value from 0 to 255. The sample points recorded from the moment the pen touches the paper until the moment it is lifted again are grouped into a stroke, or trace.
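As an illustration of the stroke structure described above, here is a minimal Python sketch that reads the `<trace>` elements of an InkML string into lists of sample points. It rests on assumptions that should be checked against the actual files: each sample is taken to hold four absolute values in the order x, y, timestamp, pressure, with samples separated by commas. The names `SamplePoint` and `parse_traces`, and the sample data, are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class SamplePoint(NamedTuple):
    x: float         # coordinate on the sheet
    y: float         # coordinate on the sheet
    t: float         # timestamp in milliseconds
    pressure: float  # pressure value, 0 to 255


def parse_traces(inkml_text: str) -> List[List[SamplePoint]]:
    """Return one list of sample points per <trace> element (one per stroke)."""
    root = ET.fromstring(inkml_text)
    strokes = []
    for elem in root.iter():
        # Compare the local tag name so the element matches with or
        # without an XML namespace prefix.
        if elem.tag.rsplit("}", 1)[-1] != "trace" or not elem.text:
            continue
        points = []
        for chunk in elem.text.split(","):
            values = chunk.split()
            if len(values) != 4:
                continue  # skip samples that do not fit the assumed layout
            points.append(SamplePoint(*(float(v) for v in values)))
        strokes.append(points)
    return strokes


# Invented sample data in the assumed layout, not taken from the database.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 20 0 128, 11 22 13 130, 13 25 27 131</trace>
  <trace>40 40 500 90, 41 42 514 95</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
```

Each inner list corresponds to one pen-down to pen-up stroke, so `len(strokes)` gives the number of strokes in the parsed text.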

The ground truth annotation and the metadata are stored together with the raw writing information in the InkML file of each document.

Together, the documents occupy 126 MB of disk space.

Metadata

For each document, the following information about the writer has been collected: birthday (yyyy-mm-dd), gender (male, female), citizenship (country code), date (yyyy-mm-dd), profession (free text), educational degree (elementary school, high-school, apprenticeship, secondary school, bachelor, master, phd), and native language. This information is stored together with the rest of the data in the InkML documents.
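The fields listed above could be held in a small validated record once they have been read from a file. The sketch below is one possible in-memory representation, not the layout used inside the InkML files. The class name, field names, and the `age_at_recording` helper are hypothetical, and it assumes that the plain "date" field is the date on which the document was written.

```python
import datetime as dt
from dataclasses import dataclass

GENDERS = ("male", "female")
EDUCATION_LEVELS = (
    "elementary school", "high-school", "apprenticeship",
    "secondary school", "bachelor", "master", "phd",
)


@dataclass(frozen=True)
class WriterMetadata:
    birthday: dt.date
    gender: str
    citizenship: str         # country code
    recording_date: dt.date  # assumption: the "date" field is the writing date
    profession: str          # free text
    education: str
    native_language: str

    def __post_init__(self) -> None:
        # Reject values outside the enumerations given in the description.
        if self.gender not in GENDERS:
            raise ValueError(f"unknown gender: {self.gender!r}")
        if self.education not in EDUCATION_LEVELS:
            raise ValueError(f"unknown educational degree: {self.education!r}")
        if self.recording_date < self.birthday:
            raise ValueError("recording date precedes birthday")

    @classmethod
    def from_strings(cls, birthday, gender, citizenship, date,
                     profession, education, native_language):
        """Build a record from yyyy-mm-dd date strings and plain text fields."""
        return cls(dt.date.fromisoformat(birthday), gender, citizenship,
                   dt.date.fromisoformat(date), profession, education,
                   native_language)

    def age_at_recording(self) -> int:
        """Writer's age in whole years on the recording date."""
        had_birthday = (self.recording_date.month, self.recording_date.day) >= (
            self.birthday.month, self.birthday.day)
        return self.recording_date.year - self.birthday.year - (0 if had_birthday else 1)


# Invented example values, not taken from the database.
example = WriterMetadata.from_strings(
    "1980-05-17", "female", "CH", "2009-03-02", "teacher", "master", "German")
```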

The documents and the corresponding templates have been numbered from 000 to 999. These numbers also serve as the identification string (ID) of a document. If the same template has been copied several times, the IDs of the resulting documents have been extended with lowercase Latin letters to make them unique, for example 001a. The ID has been used as the name of each document's file.
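The ID scheme can be checked and split with a short helper. This is a sketch under the assumption that an ID is exactly three digits followed by zero or more lowercase letters; the function name `parse_document_id` is made up for this example.

```python
import re
from typing import Optional, Tuple

# Three digits (the template number), then optional lowercase letters
# that distinguish repeated copies of the same template.
_ID_PATTERN = re.compile(r"(\d{3})([a-z]*)")


def parse_document_id(doc_id: str) -> Tuple[int, Optional[str]]:
    """Split a document ID into (template number, copy suffix or None)."""
    match = _ID_PATTERN.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a valid document ID: {doc_id!r}")
    number, suffix = match.groups()
    return int(number), suffix or None
```

For instance, `parse_document_id("001a")` yields template number 1 with suffix `"a"`, while an ID without a suffix such as `"999"` yields `None` for the suffix.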


Related Ground Truth Data

Related Tasks

References

  1. E. Indermühle, H. Bunke, F. Shafait, and T. Breuel. Text versus non-Text Distinction in Online Handwritten Documents. In Proc. 25th Symposium on Applied Computing, pages 3–7, 2010.
  2. E. Indermühle, M. Liwicki, and H. Bunke. IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents. In Proc. of Int. Workshop on Document Analysis Systems, pages 97–104, 2010.

Submitted Files

Version 1.0

