The DocLab Dataset for Evaluating Table Interpretation Methods

From TC11
Revision as of 18:01, 27 January 2011 by Dimos (talk | contribs)


Created: 2010-08-03
Last updated: 2011-01-27

Contact Author

Raghav Krishna Padmanabhan
89, 14th Street, Apt #1
Troy, NY - 12180
Email: raghav.krishna[at]gmail.com
Tel: 979-571-5551
USA

Current Version

1.0

Keywords

Augmentations, Aggregates, Evaluation, Footnotes, Table interpretation

Description

[Image: DocLabTables.jpg]

The dataset is a collection of 165 files culled from 9 websites in the geopolitical domain. Each file is in one of three formats: HTML (77), Excel (67), or CSV (20). Each file contains at least one table, and the dataset contains 172 tables in total.

DATASET CONSTRUCTION: The files comprising the dataset were selected so that the tables they contain satisfy the following constraints:

  1. Tables with rectilinear structure only.
  2. Tables with text in English only.
  3. Tables that do not contain graphic symbols or figures.
  4. Non-recursive tables, i.e., no table with a table as one of its content cells.
  5. Non-concatenated tables (no tables formed by concatenating two or more tables).
  6. Tables that do not span more than one HTML page or Excel sheet.
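
Constraint 4 (no table nested inside another table's cell) is one of the few selection criteria that can be checked mechanically for the HTML files. The following is only a minimal sketch of such a check, not the procedure used to build the dataset: the class name NestedTableDetector and the function has_nested_table are hypothetical, and the check assumes reasonably well-formed markup in which every table element is explicitly closed.

```python
from html.parser import HTMLParser


class NestedTableDetector(HTMLParser):
    """Tracks <table> nesting depth while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open <table> elements
        self.max_depth = 0    # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def has_nested_table(html_text):
    """Return True if any <table> appears inside another <table>."""
    detector = NestedTableDetector()
    detector.feed(html_text)
    detector.close()
    return detector.max_depth > 1
```

A file for which has_nested_table returns True would violate constraint 4. Markup with unclosed table tags can produce false positives, so a flagged file would still need manual inspection.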

Metadata

Statistics for each table are provided in an Excel file. The recorded information is table size (number of rows and columns), augmentations (aggregates, footnotes, units), Wang dimensionality, and source website.
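
For readers who want to work with these statistics programmatically, one possible in-memory representation of a single table's record is sketched below. The column layout of the Excel file is not described here, so every field name and type is an assumption: augmentations are modeled as presence flags (the file may record counts instead), and Wang dimensionality is modeled as an integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical record for one table's metadata."""

    rows: int                  # table size: number of rows
    columns: int               # table size: number of columns
    has_aggregates: bool       # augmentation: aggregates present (assumed flag)
    has_footnotes: bool        # augmentation: footnotes present (assumed flag)
    has_units: bool            # augmentation: units present (assumed flag)
    wang_dimensionality: int   # assumed to be stored as an integer
    source_website: str        # website the file was collected from

    @property
    def cell_count(self) -> int:
        """Total number of cells in the rectilinear grid."""
        return self.rows * self.columns


# Illustrative values only; they are not taken from the dataset.
example = TableStats(
    rows=12,
    columns=5,
    has_aggregates=True,
    has_footnotes=False,
    has_units=True,
    wang_dimensionality=2,
    source_website="example.org",
)
```

Because all tables in the dataset are rectilinear, the product of rows and columns gives the grid size directly, which is why cell_count is derived rather than stored.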

Related Ground Truth Data

Related Tasks

References

  1. Padmanabhan, R., Jandhyala, R. C., Krishnamoorthy, M., Nagy, G., Seth, S., Silversmith, W.: Interactive Conversion of Web Tables. In: Procs. Eighth IAPR International Workshop on Graphics Recognition (GREC 2009), City University of La Rochelle, France, Lecture Notes in Computer Science, 6020, Springer, Heidelberg (In Press) (2010)
  2. Seth, S., Jandhyala, R. C., Krishnamoorthy, M., Nagy, G.: Analysis and Taxonomy of Column Header Categories for Web Tables (Oral Presentation). In: Procs. Ninth IAPR International Workshop on Document Analysis Systems, Boston, Massachusetts (2010), ID: 73
  3. Nagy, G., Padmanabhan, R., Jandhyala, R. C., Silversmith, W., Krishnamoorthy, M.: Table Metadata: Headers, Augmentations and Aggregates. In: Procs. Ninth IAPR International Workshop on Document Analysis Systems, Boston, Massachusetts (2010), ID: 77
  4. Padmanabhan, R.: Table Abstraction Tool, Master’s Thesis, Rensselaer Polytechnic Institute, May 2009.

Submitted Files

Version 1.0

