DAS-Discussion: Datasets and Benchmarks


Last updated: 2012-04-25

DAS Working Subgroup Meeting: Datasets and Benchmarks

Authors:

  • Dimos Karatzas (Moderator) – CVC Barcelona – IAPR TC-11 Data Curator
  • Bart Lamiroy (Secretary) – CSE Lehigh University – IAPR TC-10 Data Curator

Further Participants:

  • Henry Baird – CSE Lehigh University
  • Igor Filippov – SAIC-Frederick, Inc.
  • Albert Gordo – CVC Barcelona
  • Dan Lopresti – CSE Lehigh University – IAPR TC-11 Chair
  • Marcal Rossinyol – CVC Barcelona

Context

This discussion concerns the availability, use and dissemination of benchmarks, datasets and ground truth, in order to promote objective and reproducible assessment of document analysis methods, as well as collaboration and the exchange of research results in the document analysis domain. IAPR TC-11 has launched an initiative to strongly encourage participants at the DAS 2010 workshop to contribute and publish the datasets they used. It is generally acknowledged that globally and openly accessible benchmarking resources and datasets can be very beneficial to research communities, and that the document analysis domain is lagging behind other fields in this respect.

Challenges and Main Issues

Initiatives of this kind have existed before and seem to recur regularly over the decades (cf. the introductory keynote talk by G. Nagy). Not all are successful, some have a significant impact on the community, and all decline, disappear or become obsolete over the years. The main challenges related to the dissemination of benchmarks, datasets and general resources can be reduced to:

  1. guaranteeing sustainable access over time,
  2. keeping the “entry cost” as low as possible,
  3. keeping resources adapted to technology advances, application needs and shifts in scientific focus,
  4. avoiding lock-in syndromes, by using open and adaptable representation standards and by not restricting new and creative uses that were not initially intended.

Potential Community Benefits and Uses

1. Access to reference datasets allows for

  • fast prototyping and testing of one's own methods and algorithms.
  • coherent and objective peer review of presented results
  • reproduction and sharing of results

2. Access to reference algorithms allows for

  • open peer-based performance evaluation
  • self-assessment of one's own prototypes

Guaranteeing open, permanent and sustained access additionally makes it possible to run contests or performance evaluations continuously.

Pitfalls and Risks

The design of the above-mentioned repositories should remain in phase with the needs and usages of the community they target. The most common pitfall is a design that coerces or limits how the repository can be used, for instance by imposing formats, limits and constraints that exclude potential users from taking advantage of it or from sharing their results. On the other hand, if the repository is just a “bag of stuff”, it provides no significant added value to its users and will fail in its mission.

Report of Discussions

Given the previous context, this section describes the discussions and exchanges that took place during the dedicated DAS 2010 workshop sessions.

Dataset Coherence

  • What to do with the datasets, and how to make their access coherent?
  • Given the very wide variety of dataset contributions, how to guarantee their quality and their global usefulness to the community? Should contributions be peer-reviewed before being put on-line?
  • How should contributions be referenced, indexed or annotated so that potential users can rapidly and efficiently find what they need?
  • Should there be a specific submission protocol and form? Is the existing form sufficient?

What is Data?

Data that might prove useful for the document analysis community consists of

  • Document Images
  • Annex data (annotations, meta-data, acquisition conditions, task context ...)
  • Ground truth (which is a specific case of the next point)
  • Results from algorithms
  • Programs / algorithms
  • Papers

as well as all the links that can be expressed between them (essentially provenance and generation hierarchies of data produced by algorithms operating on data produced by other algorithms ...).
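
One way to make these links concrete is sketched below. This is a purely illustrative Python fragment, not an existing repository schema: the item kinds, field names and identifiers are our own assumptions. It records each piece of data as a node and each algorithm run as a provenance link from its inputs to its outputs.

 # Minimal illustrative sketch (not an existing schema): every piece of data is
 # a node, and provenance links record which program produced which outputs
 # from which inputs.
 from dataclasses import dataclass, field
 from typing import Dict, List

 @dataclass
 class Item:
     item_id: str
     kind: str                 # e.g. "image", "ground_truth", "result", "program", "paper"
     metadata: Dict[str, str] = field(default_factory=dict)

 @dataclass
 class ProvenanceLink:
     program_id: str           # the program/algorithm that was run
     input_ids: List[str]      # items it consumed
     output_ids: List[str]     # items it produced

 # Example: a layout-analysis step run on a scanned page produces a result,
 # which in turn feeds an OCR step (all identifiers are hypothetical).
 page = Item("img-001", "image", {"source": "scanner", "dpi": "300"})
 layout = Item("res-001", "result", {"format": "XML"})
 text = Item("res-002", "result", {"format": "plain text"})
 links = [
     ProvenanceLink("prog-layout-v1", ["img-001"], ["res-001"]),
     ProvenanceLink("prog-ocr-v2", ["img-001", "res-001"], ["res-002"]),
 ]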

Specific Uses and Needs

Besides providing a repository for datasets and resources, such a platform can also offer higher-level services to the document analysis community, such as hosting long-standing contests, providing formal task definitions for benchmarking and comparing methods and algorithms, and capitalizing on and encouraging creative new uses of common referencing and benchmarking resources. To that extent, it is important that the maintainers of such a platform make it as open as possible and allow for input, suggestions and adaptations from the user community, so that a wide range of services can be taken into account.

Incentives

There is a general consensus that repositories and community resources are only useful when a continuous process of contribution, correction and evaluation keeps them in line with the needs and expectations of their potential users. This means that contributors should have as strong an interest as possible in participating in this process. Ideas for achieving this include referencing datasets in a non-ambiguous way so that due credit is given (in the same way articles are cited, for instance), and providing statistics of “most used”, “most referenced” and “best regarded” contributions. This will require a stable repository, flexible reference annotation, a way of publicly ranking, rating or reviewing datasets, as well as a way of reviewing, ranking or evaluating the reviewers themselves.
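
The following fragment sketches what non-ambiguous dataset references and basic usage statistics could look like. It is a hypothetical illustration only: the identifier format and the counting of citations and ratings are our own assumptions, not an adopted TC-11 convention.

 # Hypothetical sketch: build citable, versioned dataset identifiers and keep
 # simple "most cited" / rating statistics. Format and mechanism are assumptions.
 from collections import Counter

 def dataset_id(repository: str, name: str, version: str) -> str:
     """Build an unambiguous, citable identifier, e.g. 'tc11:example-dataset/1.0'."""
     return f"{repository}:{name}/{version}"

 citations = Counter()   # identifier -> number of citing papers
 ratings = {}            # identifier -> list of community star ratings

 def cite(identifier: str) -> None:
     citations[identifier] += 1

 def rate(identifier: str, stars: int) -> None:
     ratings.setdefault(identifier, []).append(stars)

 cite(dataset_id("tc11", "example-dataset", "1.0"))
 print(citations.most_common(3))   # a "most cited" statistic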

Copyright, Intellectual Property, Sensitive and Personal Data

History has shown that copyright and intellectual property issues may significantly hamper and limit use and dissemination. Correctly assessing the copyright status of the data commonly used in the Document Analysis community may, however, be extremely complex and perhaps even impossible in certain cases. Experimental data can be classified into three main categories:

  • Open copyright or public domain data (either coming from clearly identifiable sources, or data of which the contributor is the copyright holder and releases the rights to the community). This includes Scientific Commons, copyleft, GPL or other similar licenses.

In order to avoid complex situations, it is probably a good idea to encourage contributors to release their data under a clearly specified license.

  • Third-party owned data, i.e. data that clearly cannot be distributed.
  • Most of the data is very likely to be of unclear origin, either because its provenance has been lost, because the copyright holders cannot be clearly identified, or because the data was simply deemed very unlikely to give rise to litigation.

The bottom line is, unfortunately, that a non-profit research community does not have the resources to legally establish even a reasonable guarantee of copyright origin, or a risk assessment of interesting and useful, yet possibly problematic, data contributions. It seems, however, that the Digital Millennium Copyright Act offers academia the possibility to host and distribute data in a legally safe way, as long as there is no clear intent to infringe copyright law, the contributors are knowingly responsible for its publication (and are identifiable), the platform has a mechanism allowing legitimate copyright holders to signal litigious material, and the publishers offer a means to efficiently and rapidly remove this material (cf. YouTube-like platforms). It is not clear, however, what happens with data that is distributable from a copyright point of view but that contains personal or sensitive information. While it is clear from an ethical point of view that such data should not be distributed, it is less obvious how to detect and handle situations where it finds its way into our repositories.

Data Persistence and Dependency

The need to take into account possible copyright litigation, and the subsequent need to remove data from the repository, calls for a very careful study of data persistence and dependency. Since successful datasets will be widely used by the community, they are likely to produce “derived products” in the form of ground truths, analysis results, etc. In case of copyright litigation it will be necessary to retrace the provenance of all derived or dependent data so as to correctly remove the concerned data without affecting legitimate data. This situation also extends to versioning and error correction of dynamic datasets. Interdependence of datasets can furthermore be considered with respect to remote or partial hosting of parent data.
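
As a purely illustrative continuation of the earlier provenance sketch (same hypothetical ProvenanceLink structure; nothing here is an existing repository feature), the following fragment shows how the removal of one item could be propagated to everything derived from it:

 # Illustrative sketch: given the hypothetical ProvenanceLink records from the
 # earlier fragment, find every item derived (directly or transitively) from an
 # item that has to be taken down, so that only the affected data is removed.
 from collections import deque

 def items_to_remove(root_id, links):
     """Breadth-first traversal of provenance links, starting from root_id."""
     affected = {root_id}
     queue = deque([root_id])
     while queue:
         current = queue.popleft()
         for link in links:
             if current in link.input_ids:
                 for out in link.output_ids:
                     if out not in affected:
                         affected.add(out)
                         queue.append(out)
     return affected

 # With the earlier example, items_to_remove("img-001", links) would flag
 # "img-001" itself plus the derived items "res-001" and "res-002".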

Operational Issues

  • Submission process
  • Maintenance burden
  • Noise management (spurious, silly or malevolent contributions to the repository)
  • Guaranteeing data quality:

      • reviewing/certifying
      • starring
      • reviewing or starring reviewers

About the Lehigh DAE Platform

The DAE platform currently being developed at the CSE department of Lehigh University (http://dae.cse.lehigh.edu) offers a framework that looks very promising and seems to potentially address all the issues that have been discussed. It uses an open, document-analysis-centered data model to represent data in a very flexible way, and allows non-static notions of datasets, browsing and querying of data interdependence and provenance, etc. We therefore suggest that the TC-10 and TC-11 data repository initiatives experiment with the DAE platform for hosting their datasets.

Conclusions and Concrete Actions

What specific actions can be sponsored, supported or carried out by TC-11? First and foremost, we feel that publicly available, quality datasets are needed and beneficial for the community. It is therefore our opinion that TC-11 (and the IAPR TCs in general) should sponsor and promote making these datasets available. Possible incentives are:

  • provide a way to reference and cite datasets with due credit to their authors
  • make publishing and contributing datasets a reasonably strict requirement for article review and acceptance. In order to receive TC-11 sponsorship, events could, for instance, require that authors publish their datasets to a public repository under an appropriate license, or clearly state why this is not possible in their context (e.g. sensitive personal data in medical document analysis)
  • reward the most cited or best reviewed datasets at major biennial international events.

Of course, providing resources like datasets raises quite a few legal questions. While it seems obvious that our community does not have the resources to carry out an exhaustive legal study of this question, it seems reasonably safe to assume that, under the Digital Millennium Copyright Act, data can be hosted when it is contributed by identified users who have acknowledged a copyright notice and/or provided the data under an appropriate license (e.g., but not exclusively, Creative Commons – http://creativecommons.org/). Help should be sought from volunteers who can share knowledge in this domain.

An open resource creates the risk of containing “noisy” data. TC-11 should therefore sponsor and encourage a posteriori review of these data. This can, for instance, be done by providing a “starring” mechanism, or a more extensive web 2.0 review mechanism similar to community websites, where reviewers are evaluated on the quality of their reviews and where reviewer ranking can provide recognition within the community.

Last but not least, it is our conviction that the TCs should play a leading role in promoting and encouraging the use of collectively shared resources, but that it is not the role of the TCs (and of TC-11 in particular) to decide how these resources are actually used. Experience shows that long-standing contests and benchmarks are generally very useful and successful, but require a high level of individual commitment from the organizers. We see the role of TC-11 more as that of a facilitator, counselor and service provider, lowering the burden on the organizers. This would probably take the form of providing advice, best practices and support, or even providing on-line back-office services for running contests (upload and storage, output format formalization and comparison ...).
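
As a purely hypothetical illustration of such a back-office service, the following fragment sketches one way output format formalization and comparison could work. The task, file format, function names and scoring rule are our own assumptions for the sake of the example, not a TC-11 specification.

 # Hypothetical sketch of a contest back-office comparison step: participants
 # upload results in an agreed, formalized format (here, one
 # "page_id<TAB>transcription" line per page) and the service scores the
 # submission against the ground truth. Format and metric are illustrative.
 def load_results(path):
     results = {}
     with open(path, encoding="utf-8") as f:
         for line in f:
             page_id, _, transcription = line.rstrip("\n").partition("\t")
             results[page_id] = transcription
     return results

 def exact_match_rate(ground_truth, submission):
     """Fraction of pages whose transcription matches the ground truth exactly."""
     pages = list(ground_truth)
     correct = sum(1 for p in pages if submission.get(p) == ground_truth[p])
     return correct / len(pages) if pages else 0.0

 # Example usage (file names are hypothetical):
 # gt = load_results("ground_truth.tsv")
 # sub = load_results("participant_042.tsv")
 # print(f"exact-match rate: {exact_match_rate(gt, sub):.3f}")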