Difference between revisions of "IBN SINA: A database for research on processing and understanding of Arabic manuscripts images"
(→Submitted Files) |
|||
Line 33: | Line 33: | ||
=Keywords= | =Keywords= | ||
− | Arabic language, shape recognition, skeletonization | + | Arabic language, shape recognition, skeletonization, character recognition |
=Description= | =Description= | ||
Line 41: | Line 41: | ||
The database is built on a manuscript-image provided by the Institute of Islamic Studies (IIS), McGill University, Montreal. The author of the manuscript was Sayf al-Din Abu al-Hasan Ali ibn Abi Ali ibn Muhammad al-Amidi (d. 1243A.D.). The title of the manuscript is Kitab Kashf al-tamwihat sharh al-Tanbihat (i.e., Commentary on Ibn Sina's al-Isharat fi wa-al-tanbihat). The attention paid to this work was particularly intense in the period between the late twelfth century to the first half of the fourteenth century, when more than a dozen comprehensive commentaries were composed. Kashf al-tamwihat fi sharh al-Tanbihat, one of the early commentaries written on al-Isharat wa-al-tanbihat, is an unpublished commentary which still awaits critical edition by scholars. | The database is built on a manuscript-image provided by the Institute of Islamic Studies (IIS), McGill University, Montreal. The author of the manuscript was Sayf al-Din Abu al-Hasan Ali ibn Abi Ali ibn Muhammad al-Amidi (d. 1243A.D.). The title of the manuscript is Kitab Kashf al-tamwihat sharh al-Tanbihat (i.e., Commentary on Ibn Sina's al-Isharat fi wa-al-tanbihat). The attention paid to this work was particularly intense in the period between the late twelfth century to the first half of the fourteenth century, when more than a dozen comprehensive commentaries were composed. Kashf al-tamwihat fi sharh al-Tanbihat, one of the early commentaries written on al-Isharat wa-al-tanbihat, is an unpublished commentary which still awaits critical edition by scholars. | ||
− | The selected dataset is obtained form 51 folios and corresponds to | + | The selected dataset is obtained form 51 folios and corresponds to 20722 extracted shapes (connected components) of Arabic script extracted from the historical manuscript. |
The shapes are extracted from the enhanced and restored document images of the manuscript [[#References|[1, 2]]] and represent blobs of ink in the original manuscript. In order to extract the shapes, a complete binarization process is applied to the input images. It consists of an enhancement / restoration step followed by the binarization step. For the enhancement step multi-level classifiers including the Stroke Map, the Edge Profile and the Estimated background have been used. The definition of these classifiers can be found in reference [[#References|[3]]]. For the skeletonization, a thinning process has been used, followed by a correction step to discover missed branch points. | The shapes are extracted from the enhanced and restored document images of the manuscript [[#References|[1, 2]]] and represent blobs of ink in the original manuscript. In order to extract the shapes, a complete binarization process is applied to the input images. It consists of an enhancement / restoration step followed by the binarization step. For the enhancement step multi-level classifiers including the Stroke Map, the Edge Profile and the Estimated background have been used. The definition of these classifiers can be found in reference [[#References|[3]]]. For the skeletonization, a thinning process has been used, followed by a correction step to discover missed branch points. | ||
− | The feature vector consists of 92 features. The features can be divided in two parts: (1) 8 global features and (2) 84 skeleton-based features. The second part also can be divided in two sub-parts: (a) topological features based on the relation to the branch/end/singular points on the skeleton and (b) geometrical features related to the orientation and position of sub-strokes that comprise the connected component or shape under study. The feature vector is regularized in terms of its length. The details can be found in section 4 of the paper [[#References|[1]]]. | + | ==Version 1.0== |
+ | The first version of the dataset comprises the feature vectors of the 20722 shapes (sub-words extracted as connected components). The feature vector consists of 92 features. The features can be divided in two parts: (1) 8 global features and (2) 84 skeleton-based features. The second part also can be divided in two sub-parts: (a) topological features based on the relation to the branch/end/singular points on the skeleton and (b) geometrical features related to the orientation and position of sub-strokes that comprise the connected component or shape under study. The feature vector is regularized in terms of its length. The details can be found in section 4 of the paper [[#References|[1]]]. | ||
− | All non-normalised feature vectors are provided in a single space delimited text file, where each row corresponds to the feature vector of a single | + | All non-normalised feature vectors are provided in a single space delimited text file, where each row corresponds to the feature vector of a single connected component. |
+ | |||
+ | ==Version 2.0== | ||
+ | The extended version of the dataset includes the image data. A MatLAB file is provided for each folio, which contains a structure with all the information, icluding the images and features of each sub-word. Information about the structure of the files is provided in the guide. | ||
+ | |||
+ | Two versions of the dataset are provided for greater flexibility: one containing just the binarized images of the shapes (small file) and another containing both the original color images and the binarized ones (large file). The rest of the information in the both files is identical. | ||
=Related Ground Truth Data= | =Related Ground Truth Data= | ||
− | + | ==Version 1.0== | |
+ | For each shape in the dataset, 15 binary labels that correspond to 15 problems are provided. For each problem, the label specifies whether the string of that shape contains the associate Arabic letter of that problem or not. For example, in Latin language for the sake of font simplicity, if the string is “ab”, for Problem of letter “a”, the label will be 1, and for problem of letter “c” the label will be -1. | ||
+ | |||
+ | Only 15 letters are considered here that have at least 1000 positive samples each in the dataset. The 15 letters are: | ||
+ | |||
+ | Ein: abbreviated as EU, | ||
+ | Aleph: abbreviated as aL, | ||
+ | Be: Abbreviated as bL, | ||
+ | Dal: Abbreviated as dL, | ||
+ | Fa: Abbreviated as fL, | ||
+ | Ha: Abbreviated as hL, | ||
+ | Kaf: Abbreviated as kL, | ||
+ | Lam: Abbreviated as lL, | ||
+ | Mim: Abbreviated as mL, | ||
+ | Nun: Abbreviated as nL, | ||
+ | Ghain: Abbreviated as qL, | ||
+ | Ra: Abbreviated as rL, | ||
+ | Ta: Abbreviated as tL, | ||
+ | Waw: Abbreviated as vL, | ||
+ | Ya: Abbreviated as yL | ||
+ | |||
+ | For abbreviation of the problem names, Fingilish encoding is used which corresponds each Arabic letter to an ASCII character. Please see the report for the details. Ground Truth Specification The ground truth data are specified as a single 20722 X 15 matrix containing the labels of the 15 problems for all shapes in the dataset. They are provided as a single space delimited text file. | ||
+ | |||
+ | ==Version 2.0== | ||
+ | [[Image:IBN SINA fingilish-table.png|400px|thumb|right|The latin labeling system used.]] | ||
+ | |||
+ | Labelling information is provided as a field within the connected component structure (MatLAB file). The labels are given in Latin transliteration, as shown in the figure. | ||
=Related Tasks= | =Related Tasks= | ||
Line 63: | Line 95: | ||
==Version 1.0== | ==Version 1.0== | ||
* [http://www.iapr-tc11.org/dataset/IBN_SINA/labeling_dataset-just_features Feature Vectors of all 20722 shapes] (4 Mb). | * [http://www.iapr-tc11.org/dataset/IBN_SINA/labeling_dataset-just_features Feature Vectors of all 20722 shapes] (4 Mb). | ||
+ | * [http://www.iapr-tc11.org/dataset/IBN_SINA/labeling_dataset-group_reduced-labels Shape Labels for the IBN SINA Dataset] (1 Mb). | ||
+ | ==Version 2.0== | ||
+ | * [http://www.iapr-tc11.org/dataset/IBN_SINA_2/IBN_SINA_Ext.zip The full dataset including just binarized images] (17 Mb). | ||
+ | * [http://www.iapr-tc11.org/dataset/IBN_SINA_2/IBN_SINA_Ext_color.zip The full dataset including both binarized and color images] (170 Mb). | ||
+ | * [http://www.iapr-tc11.org/dataset/IBN_SINA_2/IBN_SINA_Ext_Guide.pdf A guide to the Dataset] (0.5 Mb). | ||
---- | ---- | ||
This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]]. | This page is editable only by [[IAPR-TC11:Reading_Systems#TC11_Officers|TC11 Officers ]]. |
Revision as of 15:48, 2 October 2011
Datasets -> Datasets List -> Current Page
|
Contents
Contact Author
Prof Mohamed Cheriet Synchromedia Laboratory ETS, Montréal, (QC) Canada H3C 1K3 E-mail: mohamed.cheriet@etsmtl.ca Tel: +1(514)396-8972 Fax: +1(514)396-8595 |
Keywords
Arabic language, shape recognition, skeletonization, character recognition
Description
The database is built on a manuscript-image provided by the Institute of Islamic Studies (IIS), McGill University, Montreal. The author of the manuscript was Sayf al-Din Abu al-Hasan Ali ibn Abi Ali ibn Muhammad al-Amidi (d. 1243A.D.). The title of the manuscript is Kitab Kashf al-tamwihat sharh al-Tanbihat (i.e., Commentary on Ibn Sina's al-Isharat fi wa-al-tanbihat). The attention paid to this work was particularly intense in the period between the late twelfth century to the first half of the fourteenth century, when more than a dozen comprehensive commentaries were composed. Kashf al-tamwihat fi sharh al-Tanbihat, one of the early commentaries written on al-Isharat wa-al-tanbihat, is an unpublished commentary which still awaits critical edition by scholars.
The selected dataset is obtained form 51 folios and corresponds to 20722 extracted shapes (connected components) of Arabic script extracted from the historical manuscript.
The shapes are extracted from the enhanced and restored document images of the manuscript [1, 2] and represent blobs of ink in the original manuscript. In order to extract the shapes, a complete binarization process is applied to the input images. It consists of an enhancement / restoration step followed by the binarization step. For the enhancement step multi-level classifiers including the Stroke Map, the Edge Profile and the Estimated background have been used. The definition of these classifiers can be found in reference [3]. For the skeletonization, a thinning process has been used, followed by a correction step to discover missed branch points.
Version 1.0
The first version of the dataset comprises the feature vectors of the 20722 shapes (sub-words extracted as connected components). The feature vector consists of 92 features. The features can be divided in two parts: (1) 8 global features and (2) 84 skeleton-based features. The second part also can be divided in two sub-parts: (a) topological features based on the relation to the branch/end/singular points on the skeleton and (b) geometrical features related to the orientation and position of sub-strokes that comprise the connected component or shape under study. The feature vector is regularized in terms of its length. The details can be found in section 4 of the paper [1].
All non-normalised feature vectors are provided in a single space delimited text file, where each row corresponds to the feature vector of a single connected component.
Version 2.0
The extended version of the dataset includes the image data. A MatLAB file is provided for each folio, which contains a structure with all the information, icluding the images and features of each sub-word. Information about the structure of the files is provided in the guide.
Two versions of the dataset are provided for greater flexibility: one containing just the binarized images of the shapes (small file) and another containing both the original color images and the binarized ones (large file). The rest of the information in the both files is identical.
Related Ground Truth Data
Version 1.0
For each shape in the dataset, 15 binary labels that correspond to 15 problems are provided. For each problem, the label specifies whether the string of that shape contains the associate Arabic letter of that problem or not. For example, in Latin language for the sake of font simplicity, if the string is “ab”, for Problem of letter “a”, the label will be 1, and for problem of letter “c” the label will be -1.
Only 15 letters are considered here that have at least 1000 positive samples each in the dataset. The 15 letters are:
Ein: abbreviated as EU, Aleph: abbreviated as aL, Be: Abbreviated as bL, Dal: Abbreviated as dL, Fa: Abbreviated as fL, Ha: Abbreviated as hL, Kaf: Abbreviated as kL, Lam: Abbreviated as lL, Mim: Abbreviated as mL, Nun: Abbreviated as nL, Ghain: Abbreviated as qL, Ra: Abbreviated as rL, Ta: Abbreviated as tL, Waw: Abbreviated as vL, Ya: Abbreviated as yL
For abbreviation of the problem names, Fingilish encoding is used which corresponds each Arabic letter to an ASCII character. Please see the report for the details. Ground Truth Specification The ground truth data are specified as a single 20722 X 15 matrix containing the labels of the 15 problems for all shapes in the dataset. They are provided as a single space delimited text file.
Version 2.0
Labelling information is provided as a field within the connected component structure (MatLAB file). The labels are given in Latin transliteration, as shown in the figure.
Related Tasks
References
- Reza Farrahi Moghaddam, Mohamed Cheriet, Mathias M. Adankon, Kostyantyn Filonenko, and Robert Wisnovsky, “IBN SINA: A database for research on processing and understanding of Arabic manuscripts images”, Proceedings of DAS’10, June 9-11, 2010, Boston, MA, USA
- Reza Farrahi Moghaddam and Mohamed Cheriet, “Application of Multi-level Classifiers and Clustering for Automatic Word-spotting in Historical Document Images”, ICDAR’09, pp 511-515, July 26-29, Barcelona, Spain.
- Reza Farrahi Moghaddam and Mohamed Cheriet, “RSLDI: Restoration of single-sided low-quality document images”, Pattern Recognition, 42, 3355-3364, 2009.
Submitted Files
Version 1.0
Version 2.0
- The full dataset including just binarized images (17 Mb).
- The full dataset including both binarized and color images (170 Mb).
- A guide to the Dataset (0.5 Mb).
This page is editable only by TC11 Officers .