IBN SINA: A database for research on processing and understanding of Arabic manuscripts images



Created: 2010-05-13
Last updated: 2010-05-13

Contact Author

Prof Mohamed Cheriet
Synchromedia Laboratory
ETS, Montréal, (QC) Canada
H3C 1K3
mohamed.cheriet@etsmtl.ca
Tel: +1(514)396-8972
Fax: +1(514)396-8595

Keywords

Arabic language, shape recognition, skeletonization

Description

Figure: Example of images in the dataset
Figure: Example of the ground truth data

The database is built on a manuscript image provided by the Institute of Islamic Studies (IIS), McGill University, Montreal. The author of the manuscript was Sayf al-Din Abu al-Hasan Ali ibn Abi Ali ibn Muhammad al-Amidi (d. 1243 A.D.). The title of the manuscript is Kitab Kashf al-tamwihat fi sharh al-Tanbihat (i.e., Commentary on Ibn Sina's al-Isharat wa-al-tanbihat). Attention to Ibn Sina's work was particularly intense between the late twelfth century and the first half of the fourteenth century, when more than a dozen comprehensive commentaries were composed. Kashf al-tamwihat fi sharh al-Tanbihat, one of the early commentaries on al-Isharat wa-al-tanbihat, remains unpublished and still awaits a critical edition by scholars.

The selected dataset is obtained from 51 folios and consists of the feature vectors of 20722 shapes (connected components (CCs), or blobs) of Arabic script extracted from the historical manuscript described above.

The shapes are extracted from the enhanced and restored document images of the manuscript [1, 2] and represent blobs of ink in the original manuscript. To extract the shapes, a complete binarization process is applied to the input images. It consists of an enhancement/restoration step followed by the binarization step. For the enhancement step, multi-level classifiers, including the Stroke Map, the Edge Profile, and the Estimated Background, have been used. The definitions of these classifiers can be found in reference [3]. The skeleton of each shape is obtained by a thinning process, followed by a correction step to recover missed branch points.
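To make the notion of a shape concrete, the following is a minimal sketch of connected-component extraction from an already binarized image. It is not the authors' pipeline: the enhancement, binarization, and skeletonization steps of [1-3] are not reproduced here, and the function name `connected_components`, the list-of-rows image representation, and the default 8-connectivity are assumptions made for illustration only.

```python
from collections import deque


def connected_components(binary, connectivity=8):
    """Group ink pixels of a binary image into connected components.

    `binary` is a list of rows holding 0 (background) or 1 (ink).
    Returns a list of components, each a list of (row, col) pixels.
    The choice of 8-connectivity as default is an assumption.
    """
    if connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    elif connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        raise ValueError("connectivity must be 4 or 8")

    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []

    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr, dc in offsets:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and binary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(pixels)
    return components
```

Each returned pixel list corresponds to one blob of ink. In the dataset itself, every such blob is represented by a feature vector rather than by its pixels.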

The feature vector consists of 92 features. The features can be divided into two parts: (1) 8 global features and (2) 84 skeleton-based features. The second part can in turn be divided into two sub-parts: (a) topological features based on the relation to the branch/end/singular points on the skeleton and (b) geometrical features related to the orientation and position of the sub-strokes that make up the connected component or shape under study. The feature vector is regularized in terms of its length. The details can be found in Section 4 of the paper [1].
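The two-part layout described above can be sketched as a simple split of a 92-element vector. The column order is an assumption: this page does not state that the 8 global features precede the 84 skeleton-based ones, so the order should be checked against Section 4 of [1] before relying on it. The names `split_features`, `N_GLOBAL`, and `N_SKELETON` are hypothetical.

```python
N_GLOBAL = 8      # global features (count stated in the description)
N_SKELETON = 84   # skeleton-based features (count stated in the description)
N_FEATURES = N_GLOBAL + N_SKELETON  # 92 in total


def split_features(vector):
    """Split one feature vector into (global, skeleton) parts.

    ASSUMPTION: global features occupy the first 8 positions.
    The source page does not specify the column order.
    """
    if len(vector) != N_FEATURES:
        raise ValueError(
            f"expected {N_FEATURES} features, got {len(vector)}")
    return vector[:N_GLOBAL], vector[N_GLOBAL:]
```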

All non-normalised feature vectors are provided in a single space-delimited text file, in which each row is the feature vector of a single CC.
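A minimal sketch of reading such a file with the Python standard library follows. The file name in the usage note is hypothetical, as are the function names. Because the vectors are distributed non-normalised, a per-column z-score helper is included as one possible preprocessing choice; it is not a step prescribed by the dataset authors.

```python
import math


def parse_feature_rows(lines, n_features=92):
    """Parse whitespace-delimited rows into lists of floats.

    Blank lines are skipped; a row with the wrong number of
    fields raises ValueError with its line number.
    """
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != n_features:
            raise ValueError(
                f"line {lineno}: expected {n_features} values, "
                f"got {len(fields)}")
        rows.append([float(x) for x in fields])
    return rows


def load_features(path, n_features=92):
    """Read the feature file at `path` (one CC per row)."""
    with open(path) as handle:
        return parse_feature_rows(handle, n_features)


def standardize(rows):
    """Z-score each column; constant columns are mapped to 0.0.

    This is an illustrative normalisation choice, not the
    procedure used by the dataset authors.
    """
    n = len(rows)
    if n == 0:
        return []
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]
```

With a hypothetical file name, `rows = load_features("ibn_sina_features.txt")` should yield one list per connected component, so `len(rows)` can be compared against the 20722 shapes stated above as a sanity check.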

Related Ground Truth Data

Related Tasks

References

  1. Reza Farrahi Moghaddam, Mohamed Cheriet, Mathias M. Adankon, Kostyantyn Filonenko, and Robert Wisnovsky, “IBN SINA: A database for research on processing and understanding of Arabic manuscripts images”, Proceedings of DAS’10, June 9-11, 2010, Boston, MA, USA.
  2. Reza Farrahi Moghaddam and Mohamed Cheriet, “Application of Multi-level Classifiers and Clustering for Automatic Word-spotting in Historical Document Images”, ICDAR’09, pp. 511-515, July 26-29, 2009, Barcelona, Spain.
  3. Reza Farrahi Moghaddam and Mohamed Cheriet, “RSLDI: Restoration of single-sided low-quality document images”, Pattern Recognition, 42, pp. 3355-3364, 2009.

Submitted Files


