Created: 2010-06-28

Last updated: 2010-06-28

Contact Author

Emanuel Indermühle
Neubrückstrasse 10, 
3012 Bern, Switzerland
Email: eindermu@iam.unibe.ch

Current Version

1.0

Keywords

Online, handwriting, document, text, non-text, diagram, table, list, formula, drawings, layout, document zone

Description

The dataset contains 941 online handwritten documents by 189 writers. The documents consist of text blocks, lists, tables, formulas, diagrams, and drawings. These pieces of content have been placed in arbitrary positions on each document.

In the acquisition phase, the writers were asked to copy the content of a template onto a sheet of A4 paper. The writing was recorded online with a digital pen. A total of 1000 templates are composed of text from the Brown corpus and of drawings, diagrams, and formulas provided by Wikimedia Commons and Wikipedia, respectively. In total, 200 different diagrams, 200 different drawings, and 200 different formulas are used. The text for the text blocks, lists, and tables has been selected randomly from 11 categories of the Brown corpus.

After a document had been written, the writer marked it using a small set of marking elements: underlining of text; marking of text on one or multiple lines by an angle at the top left as a start mark and an angle at the bottom right as an end mark; enclosing text in square brackets; marking entire text lines by a vertical stroke on the right or left side of the text block; encircling; annotating these markings with small text labels; and lines connecting these text labels with the markings. A subset of 839 documents contains an average of 10 such marking elements.

Some of the documents have landscape orientation, others have portrait orientation. On some documents, separate text parts have a different orientation.

The dataset contains:

941 documents
68841 words
7616 text lines in text blocks
1478 text blocks
2068 list items
536 lists
2550 table cells
450 tables
5698 labels in diagrams
917 drawings in diagrams
910 diagrams
546 drawings not part of diagrams
489 formulas
355097 strokes

The file format is InkML, as defined by the W3C in its 2006 working draft. Each document is stored in its own InkML file. The handwriting is recorded by collecting a sample point every 13 to 14 milliseconds along the pen trajectory. Sample points lying on a straight line have been removed by the recording device. Each sample point is given by its coordinates on the sheet, a timestamp in milliseconds, and a pressure value from 0 to 255. The sample points recorded from the moment the pen touches the paper to the moment it is lifted again are grouped together to form a stroke, or trace.
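The stroke structure described above can be illustrated with a minimal Python sketch. It is not the libinkml API and has not been checked against the actual dataset files. It assumes that each InkML trace element lists its sample points separated by commas, that every point carries four explicit values in the order x, y, timestamp, pressure, and that the files use the standard InkML namespace. The names SamplePoint, parse_trace, and read_strokes are hypothetical, and the embedded sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Standard InkML namespace; assumed to be the one used by the dataset files.
INKML_NS = "{http://www.w3.org/2003/InkML}"


@dataclass
class SamplePoint:
    x: float  # horizontal coordinate on the sheet
    y: float  # vertical coordinate on the sheet
    t: float  # timestamp in milliseconds
    p: float  # pressure value, expected in the range 0 to 255


def parse_trace(text: str) -> List[SamplePoint]:
    """Parse one trace body of the assumed form 'x y t p, x y t p, ...'."""
    points = []
    for chunk in text.strip().split(","):
        values = chunk.split()
        if not values:
            continue  # skip empty chunks, e.g. an empty trace
        if len(values) != 4:
            raise ValueError(f"expected 4 channel values, got {len(values)}")
        x, y, t, p = (float(v) for v in values)
        points.append(SamplePoint(x, y, t, p))
    return points


def read_strokes(inkml: str) -> List[List[SamplePoint]]:
    """Return one list of sample points per trace (stroke) in the document."""
    root = ET.fromstring(inkml)
    return [parse_trace(el.text or "") for el in root.iter(INKML_NS + "trace")]


# Invented example with two strokes; not taken from the dataset.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10.0 20.0 0 120, 10.5 20.4 13 130, 11.2 21.0 27 128</trace>
  <trace>30.0 5.0 400 90, 30.1 6.0 414 95</trace>
</ink>"""

strokes = read_strokes(SAMPLE)
```

InkML also allows traces to encode values as differences rather than explicit numbers. This sketch does not handle that case, so the library described in the Software section is the safer choice for reading the real files.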

The ground truth annotation and the metadata are stored together with the plain writing information in the InkML file of each document.

All the documents together use 126 MB of space on the hard disk.

Metadata

For each document, the following information about the writer has been collected: birthday (yyyy-mm-dd), gender (male, female), citizenship (country code), date (yyyy-mm-dd), profession (free text), educational degree (elementary school, high-school, apprenticeship, secondary school, bachelor, master, phd), and native language. These pieces of information are stored along with the rest of the information in the InkML documents.

The documents and the corresponding templates have been numbered from 000 to 999. These numbers also serve as the identification string (ID) of a document. If the same template has been copied several times, the IDs of these documents have been extended by lowercase Latin letters to make them unique, for example 001a. The ID is used as the name of the document's file.
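The naming scheme can be sketched as a small Python helper that splits a file name into the template number and the optional letter suffix. This is an illustration only: the function name parse_document_id and the ".inkml" file extension used in the example are assumptions, not something the dataset description specifies.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Three digits (template number 000 to 999) followed by optional
# lowercase Latin letters that distinguish repeated copies.
ID_PATTERN = re.compile(r"(\d{3})([a-z]*)")


def parse_document_id(filename: str) -> Optional[Tuple[int, str]]:
    """Return (template number, suffix) or None if the name does not match."""
    stem = Path(filename).stem  # drop directory and extension
    match = ID_PATTERN.fullmatch(stem)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Hypothetical file names; the extension is an assumption.
first_copy = parse_document_id("001a.inkml")
plain = parse_document_id("000.inkml")
other = parse_document_id("readme.txt")
```

Grouping documents by the returned template number collects all copies of the same template, which may be useful when splitting the data so that copies of one template do not end up in both training and test sets.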

Software

The documents of the database are stored using the InkML standard. The InkML standard is quite complex and may therefore be a barrier to accessing the documents' content. To resolve this issue, a software library (libinkml) has been released that implements the portion of InkML required to read the documents. This library can also be used outside the context of this database.

The software InkAnno, built on libinkml, was developed to simplify the handling of IAMonDo documents further. It implements the following functionality: displaying the documents, adding and editing annotations, exporting to PDF, images, or feature vectors, and rotating and mirroring the documents. It may serve as a basis for further functionality.

Related Ground Truth Data

IAMonDo - Hierarchical Layout and Full Transcription. The ground truth information is stored along with the digital ink in the dataset documents.

Related Tasks

Text and Non-Text Distinction in Online Handwritten Documents

References

E. Indermühle, H. Bunke, F. Shafait and T. Breuel. Text versus non-Text Distinction in Online Handwritten Documents. In Proc. 25th Symposium On Applied Computing, pages 3–7, 2010

E. Indermühle, M. Liwicki and H. Bunke. IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents. In Proc. of Int. Workshop on Document Analysis Systems, pages 97–104, 2010

Submitted Files

Version 1.0

(to be linked soon)

Dataset (29 MB)
libinkml library (also available from Google Code)
InkAnno Annotation and Visualisation software (also available from Google Code) (680 KB)


Difference between revisions of "IAM Online Document Database (IAMonDo-database)"