Difference between revisions of "Datasets"

From TC11
(On-line: vectorial, (x_t, y_t))
(Changed the link associated with the "This way to the datasets" to point to the right place...)
 
(96 intermediate revisions by 4 users not shown)
= Optical Character Recognition (OCR) =
+
{| style="width: 100%"
 +
|-
 +
| align="right" |
  
== Machine-print OCR ==
+
{|
 +
|-
 +
| {{Last updated}}
 +
|}
  
* [http://www.loria.fr/~tombre/Misc/msg00000.html UW-I English Scientific and Technical Journal pages image database] (Broken Link)
+
|}
  
* [http://documents.cfar.umd.edu/resources/database/UWII.html UW-II English/Japanese Document Image Database] (Broken Link)
+
=Important Notice=
 +
[[Image:ThisWayToTheDatasets.png|250px|right|link=https://tc11.cvc.uab.es/datasets/| Datasets List]]
  
* [http://documents.cfar.umd.edu/resources/database/3UWCdRom.html UW-III English/Technical Document Image Database] (Broken Link)
+
The datasets are maintained at http://datasets.iapr-tc11.org. The old dataset repository remains accessible [[Datasets List | here]].
  
* [http://diuf.unifr.ch/diva/APTI/ APTI: Arabic Printed Text Image Database]
+
=Overview – Message from TC-11=
 +
[[Image:ThisWayToTheDatasets_OldRepo.png|200px|right|link=http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List| Datasets List]]
  
* [http://documents.cfar.umd.edu/resources/database/ERIM_Arabic_DB.html ERIM Database] (Broken Link)
+
It is extremely important for the Document Image Analysis and Recognition community to be able to cross check and reproduce results described in published papers in the field. In order to achieve this, any datasets used as the basis for publications should be publicly available, as is the norm in many other disciplines.
  
= Handwriting =
+
Authors are actively encouraged to submit the datasets they used to train and/or evaluate their algorithms to TC-11 for publication on the TC-11 Web site.
== On-line:  vectorial,  <math>(x_t, y_t)</math> ==
 
  
* [http://www.cedar.buffalo.edu/Linguistics/database.html CEDAR On-line Handwriting Database] (Broken Link)
+
This initiative is not restricted to datasets. At TC-11 we are interested in archiving online any piece of data (ground-truth data, software, etc.) that would make it easy to reproduce results, set new targets, foster healthy competition, encourage collaboration and generally advance the DIAR field as a whole.
  
* [http://hwr.nici.kun.nl/unipen/ UNIPEN database] (Click on link 'CDROMs')
+
A wealth of datasets and corresponding ground truth data are already available through the TC-11 [http://datasets.iapr-tc11.org Web portal].
  
* [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/pendigits/ Ethem Alpaydin's on-line digit db] (Needs Update)
+
If you wish to contribute, please read below about the procedure for submitting material to the TC-11 Web portal. The dataset curators will be notified as soon as the dataset is uploaded.
 +
For any comments or suggestions, please contact Joseph Chazalon, the dataset curator, at joseph(dot)chazalon+tc11(at)lrde.epita.fr.
  
* [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits/ Ethem Alpaydin's optical digit db] (Needs Update)
+
=Submission Protocol=
 +
To submit a dataset, please create an account on the TC-11 datasets portal (http://datasets.iapr-tc11.org) and follow the online submission instructions. If you encounter any problems, please contact the dataset curators.
  
* [http://www.tuat.ac.jp/~nakagawa/database/index.shtml Kuchibue & Nakayosi] (by Masaaki Nakagawa and Stefan Jaeger) (Needs Update)
 
** Together, these databases comprise more than 3 million Japanese characters from 283 writers.
 
  
* [http://www.ai.rug.nl/~lambert/unipen/icdar-03-competition/ The Informal Competition of Recognizing On-line Words (ICROW)] by the Unipen Foundation
+
=Copyright Note=
 +
TC-11 provides dataset hosting services as a benefit to the international research community. If copyrighted material is found to be improperly included in a dataset submitted for inclusion on the TC-11 website, we will remove the offending material immediately upon notification by the copyright holder.
  
== Off-line:  image,   <math>I(x,y)</math> ==
+
By submitting a dataset for inclusion on the TC-11 Web site, the author certifies that he/she has the right to publish the dataset and any associated data in the public domain and that doing so does not violate the intellectual property rights or copyrights of any third party.
  
* [http://www.cedar.buffalo.edu/Databases/CDROM1/ CEDAR Off-line Handwriting CDROM1]
+
TC-11 will provide a service through which the submitted dataset and any associated data will be made public to the Document Analysis community worldwide. Should any legal dispute arise in the future in relation to the publication of this dataset and associated data in the public domain, the author will hold TC-11 free from any wrongdoing and accept responsibility for the publication of these data.
  
* [ftp://sequoyah.ncsl.nist.gov/pub/databases/catalog.txt The NIST handwriting OCR databases Catalog]
+
By submitting a dataset and associated data to TC-11, you explicitly accept that any third party can independently submit additional information that relates to the original dataset (e.g. additional ground-truth data, software, etc.).
  
* [http://www.cenparmi.concordia.ca/ CENPARMI (four 3.5" diskettes)] 17000 binarized numbers from ZIP codes on envelopes by the US Postal Services
+
We strongly encourage authors who own the copyright of the submitted information to consider offering it to the community under a [http://creativecommons.org/choose/ Creative Commons license]. See [http://wiki.creativecommons.org/Before_Licensing/ this link] for guidelines on choosing a suitable Creative Commons license.
  
* [http://www.computer.org/proceedings/icdar/0318/03180705abs.htm Handwritten LOB-corpus sentences] by Horst Bunke, University of Bern, Switzerland
+
=Useful Definitions=
 +
'''Dataset''': A collection of data along with metadata information, as required to use these data.
  
* [http://www.iam.unibe.ch/~zimmerma/iamdb/iamdb.html IAM Database] - A full English sentence database for off-line handwriting recognition.
+
'''Metadata''': Information specific to a particular dataset. Metadata is usually tightly structured within the dataset itself (e.g. information encoded within the filenames of submitted images). Metadata can only be submitted at the time the dataset itself is submitted.
  
* MARG- Medical Article Records Groundtruth ([http://marg.nlm.nih.gov/]) is a freely-available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its "ground truth". Please contact Dr. George Thoma (thoma@lhc.nlm.nih.gov) at the National Library of Medicine for more information.
+
'''Ground Truth Specification''': The definition of the required information that accurately describes a particular aspect of the data at a high level where agreement between different human observers can be established, as well as the definition of an appropriate structure (format) for storing this information.
  
* [http://kornai.com/Hindi/ Hindi font samples] by Andras Kornai, June 5 2003
+
'''Ground Truth Data''': A set of data conforming to a particular ground truth specification and relating to a specific dataset. Ground Truth Data can be submitted at any time, and different sets of Ground Truth Data (corresponding to different aspects of the data) can be associated with the same dataset.
  
=== Miscellaneous Kanji handwritten OCR databases ===
+
'''Task''': A well-defined process for evaluating algorithms in the context of a specific scientific problem. A task typically provides a specific evaluation protocol and links to the resources required (a dataset, and usually related ground truth data). Tasks should correspond to open challenges in the field. If you undertake any of the defined tasks and have published results or code available, we would really like to know!
  
* [http://www.cse.salford.ac.uk/prima/TC11//iptp-cdrom2.html IPTP CD-ROM2]
+
'''Resources''': Any other related resources not specifically covered by the above definitions. Examples include software to browse and visualise a dataset, software to create ground truth data, performance evaluation algorithms, codecs, reports, publications, etc.
  
== Combined on-line/offline handwriting ==
 
  
[http://www.computer.org/proceedings/icdar/0318/03180455abs.htm IRONOFF] by the IRESTE group of the University of Nantes, France, is the first database of this type which is available. It combines handwritten samples of words in image (pixel) and in vector format.
+
----
: Contact: Christian VIARD-GAUDIN (cviard@ireste.fr)
+
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers]].

Latest revision as of 16:22, 9 November 2022

Last updated: 2022-11-09

Important Notice


The datasets are maintained at http://datasets.iapr-tc11.org. The old dataset repository remains accessible here.

Overview – Message from TC-11


It is extremely important for the Document Image Analysis and Recognition community to be able to cross check and reproduce results described in published papers in the field. In order to achieve this, any datasets used as the basis for publications should be publicly available, as is the norm in many other disciplines.

Authors are actively encouraged to submit the datasets they used to train and/or evaluate their algorithms to TC-11 for publication on the TC-11 Web site.

This initiative is not restricted to datasets. At TC-11 we are interested in archiving online any piece of data (ground-truth data, software, etc.) that would make it easy to reproduce results, set new targets, foster healthy competition, encourage collaboration and generally advance the DIAR field as a whole.

A wealth of datasets and corresponding ground truth data are already available through the TC-11 Web portal.

If you wish to contribute, please read below about the procedure for submitting material to the TC-11 Web portal. The dataset curators will be notified as soon as the dataset is uploaded. For any comments or suggestions, please contact Joseph Chazalon, the dataset curator, at joseph(dot)chazalon+tc11(at)lrde.epita.fr.

Submission Protocol

To submit a dataset, please create an account on the TC-11 datasets portal (http://datasets.iapr-tc11.org) and follow the online submission instructions. If you encounter any problems, please contact the dataset curators.


Copyright Note

TC-11 provides dataset hosting services as a benefit to the international research community. If copyrighted material is found to be improperly included in a dataset submitted for inclusion on the TC-11 website, we will remove the offending material immediately upon notification by the copyright holder.

By submitting a dataset for inclusion on the TC-11 Web site, the author certifies that he/she has the right to publish the dataset and any associated data in the public domain and that doing so does not violate the intellectual property rights or copyrights of any third party.

TC-11 will provide a service through which the submitted dataset and any associated data will be made public to the Document Analysis community worldwide. Should any legal dispute arise in the future in relation to the publication of this dataset and associated data in the public domain, the author will hold TC-11 free from any wrongdoing and accept responsibility for the publication of these data.

By submitting a dataset and associated data to TC-11, you explicitly accept that any third party can independently submit additional information that relates to the original dataset (e.g. additional ground-truth data, software, etc.).

We strongly encourage authors who own the copyright of the submitted information to consider offering it to the community under a Creative Commons license. See this link for guidelines on choosing a suitable Creative Commons license.

Useful Definitions

Dataset: A collection of data along with metadata information, as required to use these data.

Metadata: Information specific to a particular dataset. Metadata is usually tightly structured within the dataset itself (e.g. information encoded within the filenames of submitted images). Metadata can only be submitted at the time the dataset itself is submitted.

Ground Truth Specification: The definition of the required information that accurately describes a particular aspect of the data at a high level where agreement between different human observers can be established, as well as the definition of an appropriate structure (format) for storing this information.

Ground Truth Data: A set of data conforming to a particular ground truth specification and relating to a specific dataset. Ground Truth Data can be submitted at any time, and different sets of Ground Truth Data (corresponding to different aspects of the data) can be associated with the same dataset.

Task: A well-defined process for evaluating algorithms in the context of a specific scientific problem. A task typically provides a specific evaluation protocol and links to the resources required (a dataset, and usually related ground truth data). Tasks should correspond to open challenges in the field. If you undertake any of the defined tasks and have published results or code available, we would really like to know!

Resources: Any other related resources not specifically covered by the above definitions. Examples include software to browse and visualise a dataset, software to create ground truth data, performance evaluation algorithms, codecs, reports, publications, etc.
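
As an informal illustration of how the definitions above relate to one another, the following Python sketch models a dataset whose metadata is encoded in image filenames and to which several sets of ground truth data can be attached after submission. Everything in it is an assumption made for illustration: the filename convention (writer<ID>_page<N>.png), the class names and the field names are hypothetical and are not part of the TC-11 portal or its submission format.

```python
import re
from dataclasses import dataclass, field

# Hypothetical filename convention (an assumption, not a TC-11 rule):
# writer<ID>_page<N>.png, e.g. writer007_page2.png
FILENAME_PATTERN = re.compile(r"writer(?P<writer>\d+)_page(?P<page>\d+)\.png$")


def metadata_from_filename(name: str) -> dict:
    """Extract the metadata encoded in an image filename."""
    match = FILENAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognised filename: {name}")
    return {"writer": int(match["writer"]), "page": int(match["page"])}


@dataclass
class GroundTruthData:
    """Data conforming to one ground truth specification for one dataset."""
    specification: str  # name of the specification, e.g. "transcription"
    records: dict       # image filename -> annotation


@dataclass
class Dataset:
    """A collection of data plus the metadata needed to use it."""
    name: str
    images: list
    ground_truths: list = field(default_factory=list)

    def metadata(self) -> dict:
        # Metadata travels with the dataset itself (here: in the filenames).
        return {image: metadata_from_filename(image) for image in self.images}

    def add_ground_truth(self, ground_truth: GroundTruthData) -> None:
        # Ground truth can be added at any time, one set per described aspect.
        self.ground_truths.append(ground_truth)
```

A dataset built this way fixes its metadata when it is created, while calls to add_ground_truth can follow later, which mirrors the distinction drawn in the definitions between metadata and ground truth data.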



This page is editable only by TC11 Officers.