IAM Online Document Database (IAMonDo-database)

From TC11

Latest revision as of 18:05, 27 January 2011

Created: 2010-06-28
Last updated: 2011-01-27

Contact Author

Emanuel Indermühle
Neubrückstrasse 10, 
3012 Bern, Switzerland
Email: eindermu@iam.unibe.ch

Current Version

1.0

Keywords

Online, handwriting, document, text, non-text, diagram, table, list, formula, drawings, layout, document zone

Description

The dataset contains 941 online handwritten documents by 189 writers. The documents consist of text blocks, lists, tables, formulas, diagrams and drawings. These pieces of content have been placed in arbitrary positions on each document.

In the acquisition phase, the writers were asked to copy the content of a template onto a sheet of A4 paper. The writing was recorded online with a digital pen. The 1000 templates are composed of text from the Brown corpus and of drawings, diagrams, and formulas taken from Wikimedia Commons and Wikipedia. In total, 200 different diagrams, 200 different drawings and 200 different formulas are used. The text for the text blocks, lists, and tables has been selected randomly from 11 categories of the Brown corpus.

After a document had been written, the writer marked it up using a small set of marking elements:

  • underlining of text
  • marking of text on one or multiple lines by an angle on the top left as a start mark and an angle on the bottom right as an end mark
  • marking of text by enclosing it in square brackets
  • marking of entire text lines by a vertical stroke on the right or left side of the text block
  • encircling
  • annotation of these markings by small text labels
  • lines connecting these text labels with the markings

A subset of 839 documents contains an average of 10 such marking elements each.

Some of the documents have landscape orientation, others have portrait orientation. On some documents, separate text parts have a different orientation.

The dataset contains:

  • 941 documents
  • 68841 words
  • 7616 text lines in text blocks
  • 1478 text blocks
  • 2068 list items
  • 536 lists
  • 2550 table cells
  • 450 tables
  • 5698 labels in diagrams
  • 917 drawings in diagrams
  • 910 diagrams
  • 546 drawings not part of diagrams
  • 489 formulas
  • 355,097 strokes
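
The counts above can be turned into rough per-document averages. The following is a minimal sketch in Python; the dictionary keys and the helper `per` are names chosen here for illustration, and the number of writers (189) is taken from the first paragraph of the description rather than from the list.

```python
# Counts copied from the description and the list above.
counts = {
    "documents": 941,
    "writers": 189,       # from the first paragraph, not from the list
    "words": 68841,
    "text_lines": 7616,   # text lines in text blocks
    "text_blocks": 1478,
    "strokes": 355097,
}


def per(numerator: str, denominator: str) -> float:
    """Return the ratio of two entries of the counts table."""
    return counts[numerator] / counts[denominator]


averages = {
    "documents_per_writer": per("documents", "writers"),
    "words_per_document": per("words", "documents"),
    "strokes_per_document": per("strokes", "documents"),
    "text_lines_per_text_block": per("text_lines", "text_blocks"),
}

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

These are plain averages over the whole dataset; they say nothing about how the content is distributed across individual documents.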

The file format is InkML, as defined by the W3C in a 2006 working draft. Each document is stored in its own InkML file. The handwriting was recorded by collecting a sample point every 13 to 14 milliseconds along the pen trajectory. The recording device removed sample points lying on a straight line. Each sample point is given by its coordinates on the sheet, a timestamp in milliseconds, and a pressure value from 0 to 255. The sample points recorded from the moment the pen touches the paper until the moment it is lifted again are grouped together to form a stroke, or trace.
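
To illustrate the trace structure described above, here is a minimal Python sketch that reads strokes from an InkML-like string using only the standard library. It is not libinkml and not the dataset's own reader. The sample data is invented, and the channel order (x, y, timestamp, pressure) as well as the use of explicit, comma-separated sample points are assumptions; real IAMonDo files may declare a different trace format or use encodings this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Invented sample, NOT taken from the dataset. Assumed channel order:
# x, y, timestamp (ms), pressure (0 to 255).
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>100 200 0 120, 104 203 13 131, 109 207 27 140</trace>
  <trace>300 50 500 90, 302 58 514 95</trace>
</ink>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}trace' -> 'trace'."""
    return tag.rsplit("}", 1)[-1]


def read_traces(xml_text: str, channels=("x", "y", "t", "f")):
    """Return one list of sample points (dicts) per <trace> element."""
    root = ET.fromstring(xml_text)
    traces = []
    for element in root.iter():
        if local_name(element.tag) != "trace":
            continue
        points = []
        # Sample points are separated by commas, channel values by spaces.
        for chunk in (element.text or "").split(","):
            values = chunk.split()
            if values:
                points.append(dict(zip(channels, map(float, values))))
        traces.append(points)
    return traces


traces = read_traces(SAMPLE)
for index, trace in enumerate(traces):
    duration_ms = trace[-1]["t"] - trace[0]["t"]
    print(f"trace {index}: {len(trace)} points, {duration_ms:.0f} ms")
```

Each returned trace corresponds to one pen-down to pen-up stroke; the time difference between its first and last sample point gives the stroke duration.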

The ground truth annotation and the metadata are stored together with the plain writing information in the InkML file of each document.

All the documents together use 126 MB of space on the hard disk.

Metadata

For each document, the following information about the writer has been collected: birthday (yyyy-mm-dd), gender (male, female), citizenship (country code), date (yyyy-mm-dd), profession (free text), educational degree (elementary school, high school, apprenticeship, secondary school, bachelor, master, PhD), and native language. This information is stored along with the rest of the information in the InkML documents.

The documents and the corresponding templates have been numbered from 000 to 999. These numbers also serve as the identification string (ID) of a document. If the same template has been copied several times, the IDs of the resulting documents have been extended with lowercase Latin letters to make them unique, for example 001a. The ID is used as the file name of each document.
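
The ID scheme described above can be checked with a short regular expression. This is a sketch under assumptions: the function name `split_document_id` is hypothetical, the ID is taken without any file extension, and the letter suffix is allowed to be longer than one character, which the description does not state explicitly.

```python
import re

# Three digits for the template number, then an optional run of
# lowercase Latin letters that distinguishes copies of one template.
DOCUMENT_ID = re.compile(r"(\d{3})([a-z]*)")


def split_document_id(doc_id: str) -> tuple[int, str]:
    """Split an ID such as '001a' into (template number, copy suffix)."""
    match = DOCUMENT_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"not a valid document ID: {doc_id!r}")
    return int(match.group(1)), match.group(2)


for doc_id in ("000", "001a", "999"):
    template, suffix = split_document_id(doc_id)
    print(doc_id, "-> template", template, "suffix", repr(suffix))
```

Grouping documents by the returned template number yields all copies of the same template, which is useful when the same content written by different writers should be kept in the same data split.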

Software

The database's documents are stored using the InkML standard. The InkML standard is quite complex and therefore might be a barrier to accessing the documents' content. To resolve this issue, a software library (libinkml) has been released which implements the portion of InkML required to read the documents. This software can also be used outside the context of this database.

The software InkAnno, built on libinkml, was developed to simplify the handling of IAMonDo documents even further. It implements the following functionality: displaying the documents, adding and editing annotations, exporting to PDF, images, or feature vectors, and rotating and mirroring the documents. It may serve as a basis for further functionality.

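Related Ground Truth Data

  • IAMonDo - Hierarchical Layout and Full Transcription. The ground truth information is stored along with the digital ink in the dataset documents.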

Related Tasks

References

  1. E. Indermühle, H. Bunke, F. Shafait and T. Breuel. Text versus non-Text Distinction in Online Handwritten Documents. In Proc. 25th Symposium On Applied Computing, pages 3–7, 2010.
  2. E. Indermühle, M. Liwicki and H. Bunke. IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents. In Proc. of Int. Workshop on Document Analysis Systems, pages 97–104, 2010.

Submitted Files

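Version 1.0

  • Dataset (29 MB): http://www.iapr-tc11.org/dataset/IAMonDo/IAMonDo-db-1.0.tar.gz
  • InkAnno annotation and visualisation software (680 KB): http://www.iapr-tc11.org/dataset/IAMonDo/inkanno-1.0.tar.gz (also available from Google Code: http://code.google.com/p/inkanno/)
  • libinkml library: available from Google Code at http://code.google.com/p/libinkml/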

