IAM Online Document Database (IAMonDo-database)
Revision as of 19:22, 28 June 2010
Contact Author
Emanuel Indermühle, Neubrückstrasse 10, 3012 Bern, Switzerland. Email: eindermu@iam.unibe.ch
Current Version
1.0
Keywords
Online, handwriting, document, text, non-text, diagram, table, list, formula, drawings, layout, document zone
Description
The dataset contains 941 online handwritten documents by 189 writers. The documents consist of text blocks, lists, tables, formulas, diagrams and drawings. These pieces of content have been placed at arbitrary positions on each document.
In the acquisition phase, the writers were asked to copy the content of a template onto a sheet of A4 paper. The writing was recorded online with a digital pen. The 1000 templates are composed of text from the Brown corpus and of drawings, diagrams, and formulas taken from Wikimedia Commons and Wikipedia. In total, 200 different diagrams, 200 different drawings and 200 different formulas are used. The text for the text blocks, lists, and tables has been selected randomly from 11 categories of the Brown corpus.
After writing a document, the writer marked it up using a small set of marking elements:
- underlining of text
- marking of text on one or multiple lines by an angle on the top left as a start mark and an angle on the bottom right as an end mark
- marking of text by enclosing it in square brackets
- marking of entire text lines by a vertical stroke on the right or left side of the text block
- encircling
- annotation of these markings by small text labels
- lines connecting these text labels with the markings
A subset of 839 documents contains such marking elements, 10 per document on average.
Some of the documents have landscape orientation, others have portrait orientation. On some documents, separate text parts have a different orientation.
The dataset contains:
- 941 documents
- 68841 words
- 7616 text lines in text blocks
- 1478 text blocks
- 2068 list items
- 536 lists
- 2550 table cells
- 450 tables
- 5698 labels in diagrams
- 917 drawings in diagrams
- 910 diagrams
- 546 drawings not part of diagrams
- 489 formulas
- 355,097 strokes
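The counts above allow a few rough per-document averages to be derived. The following is only an illustrative sketch: the figures are copied from the list above and from the writer count given earlier, while the dictionary keys and the helper name `ratio` are made up for this example and are not part of the dataset.

```python
# Counts copied from the dataset description; key names are illustrative only.
counts = {
    "documents": 941,
    "writers": 189,
    "words": 68841,
    "text_lines": 7616,
    "text_blocks": 1478,
    "strokes": 355097,
}


def ratio(numerator, denominator):
    """Return the average number of `numerator` items per `denominator` item."""
    return counts[numerator] / counts[denominator]


docs_per_writer = ratio("documents", "writers")
strokes_per_doc = ratio("strokes", "documents")
words_per_doc = ratio("words", "documents")
lines_per_block = ratio("text_lines", "text_blocks")

print(f"documents per writer: {docs_per_writer:.2f}")
print(f"strokes per document: {strokes_per_doc:.1f}")
print(f"words per document:   {words_per_doc:.1f}")
print(f"lines per text block: {lines_per_block:.2f}")
```

These are plain averages over the whole dataset; they say nothing about how the content is distributed across individual documents.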
The file format is InkML, as defined in a 2006 W3C working draft. Each document is stored in its own InkML file. The handwriting is recorded by collecting a sample point every 13 to 14 milliseconds along the pen trajectory. Sample points lying on a straight line have been removed by the recording device. Each sample point is given by its coordinates on the sheet, a timestamp in milliseconds, and a pressure value from 0 to 255. The sample points recorded from the moment the pen touches the paper until the moment it is lifted again are grouped together to form a stroke, or trace.
The ground truth annotation and the metadata are stored together with the plain writing information in the InkML file of each document.
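To make the stroke structure described above concrete, here is a minimal Python sketch that reads traces from a small, hand-written InkML fragment. It rests on several assumptions that are not confirmed by this page: that the files use the namespace URI of the W3C InkML specification, that each point lists its channels in the order X, Y, time, pressure, and that values are written explicitly. InkML also allows difference-encoded values, which this sketch does not handle, so real dataset files may need libinkml or a fuller parser. The sample data is invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (from the W3C InkML specification); dataset files may differ.
INKML_NS = "http://www.w3.org/2003/InkML"

# Invented sample: two strokes, each point given as "x y time_ms pressure".
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 20 0 128, 11 22 13 130, 13 25 27 131</trace>
  <trace>40 40 500 90, 42 41 514 95</trace>
</ink>"""


def parse_traces(inkml_text):
    """Return a list of strokes; each stroke is a list of (x, y, t, pressure)."""
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(f"{{{INKML_NS}}}trace"):
        points = []
        # Points are separated by commas, channel values by whitespace.
        for chunk in (trace.text or "").split(","):
            fields = chunk.split()
            if not fields:
                continue
            # Assumes at least four explicit numeric values per point.
            x, y, t, pressure = (float(value) for value in fields[:4])
            points.append((x, y, t, pressure))
        strokes.append(points)
    return strokes


strokes = parse_traces(SAMPLE)
for index, stroke in enumerate(strokes):
    duration_ms = stroke[-1][2] - stroke[0][2]
    print(f"stroke {index}: {len(stroke)} points, {duration_ms:.0f} ms")
```

Each `<trace>` element corresponds to one pen-down to pen-up stroke, so the length of the returned list is the stroke count of the document.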
All the documents together use 126 MB of space on the hard disk.
Metadata
For each document, the following information about the writer has been collected: birthday (yyyy-mm-dd), gender (male, female), citizenship (country code), date (yyyy-mm-dd), profession (free text), educational degree (elementary school, high-school, apprenticeship, secondary school, bachelor, master, phd), native language. This information is stored along with the rest of the data in the InkML documents.
The documents and the corresponding templates have been numbered from 000 to 999. These numbers also serve as the identification string (ID) of a document. If the same template has been copied several times, the IDs of these documents have been extended by lowercase Latin letters to make them unique, for example 001a. The ID is used as the name of the document's file.
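The ID scheme described above can be checked and split with a short regular expression. This is a sketch under assumptions: the function name `parse_document_id` is hypothetical, and the pattern accepts any number of trailing lowercase letters because the page does not say whether a suffix is always a single letter.

```python
import re

# Three digits (template number 000 to 999) plus an optional lowercase suffix.
# Allowing several suffix letters is an assumption, not stated in the source.
ID_PATTERN = re.compile(r"(\d{3})([a-z]*)")


def parse_document_id(doc_id):
    """Split an ID such as '001a' into (template_number, suffix)."""
    match = ID_PATTERN.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a valid document ID: {doc_id!r}")
    return int(match.group(1)), match.group(2)


print(parse_document_id("001a"))
print(parse_document_id("999"))
```

Documents that share a template number but differ in suffix are copies of the same template, so grouping by the first element of the returned tuple collects all copies of one template.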
Software
The database's documents are stored using the InkML standard. The InkML standard is quite complex and might therefore be a barrier to accessing the documents' content. To resolve this issue, a software library (libinkml) has been released which implements the portion of InkML required to read the documents. This software can also be used outside the context of this database.
The software InkAnno, built on libinkml, was developed to simplify the handling of IAMonDo documents even further. It implements the following functionality: displaying the documents, adding and editing annotations, exporting to PDF, images, or feature vectors, and rotating and mirroring the documents. It may serve as a basis for further functionality.
Related Ground Truth Data
- IAMonDo - Hierarchical Layout and Full Transcription. The ground truth information is stored along with the digital ink in the dataset documents.
Related Tasks
References
- E. Indermühle, H. Bunke, F. Shafait and T. Breuel. Text versus non-Text Distinction in Online Handwritten Documents. In Proc. 25th Symposium On Applied Computing, pages 3–7, 2010.
- E. Indermühle, M. Liwicki and H. Bunke. IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents. In Proc. of Int. Workshop on Document Analysis Systems, pages 97–104, 2010.
Submitted Files
Version 1.0
(to be linked soon)
- Dataset (29 MB)
- libinkml library (also available from Google Code)
- InkAnno annotation and visualisation software (also available from Google Code) (680 KB)