Chem-Infty Dataset: A ground-truthed dataset of Chemical Structure Images
Koji Nakagawa(kn[at]kyudai.jp), Faculty of Mathematics, Kyushu University, JAPAN
Akio Fujiyoshi(fujiyosi[at]mx.ibaraki.ac.jp), Department of Computer and Information Sciences, Ibaraki University, JAPAN
Masakazu Suzuki(suzuki[at]math.kyushu-u.ac.jp), Faculty of Mathematics, Kyushu University, JAPAN
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.1 Japan License
Optical Chemical Structure Recognition, Graphical Documents, Symbols
This dataset consists of chemical structure images and their chemical meaning (see the ground truth section). The 5727 chemical images were randomly collected from Japanese patent applications published in 2008.
- Number of samples in the dataset: 869
- File format: TIFF images, both binary and greyscale.
- File Name Convention: The image files and their meta data files follow this naming convention:
- 2008XXXXXX_N_chem.tif: a TIFF file
- 2008XXXXXX_N_chem.sdf: the meta data of 2008XXXXXX_N_chem.tif
- The string '2008XXXXXX' is the patent ID, and 'N' means the N-th element of the multi-TIFF file (see References).
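The naming convention above can be used to pair each image with its meta data file. The following Python code is a minimal sketch, not part of the dataset distribution. It assumes that each 'X' in the patent ID is a decimal digit and that 'N' is a non-negative integer written without a fixed width; the names `NAME_RE`, `parse_name`, and `pair_files` are hypothetical helpers introduced only for illustration.

```python
import re
from pathlib import Path

# Assumed pattern for 2008XXXXXX_N_chem.tif / 2008XXXXXX_N_chem.sdf.
# Assumption: each 'X' is a digit and 'N' is one or more digits.
NAME_RE = re.compile(r"^(2008\d{6})_(\d+)_chem\.(tif|sdf)$")


def parse_name(filename):
    """Return (patent_id, element_index, extension), or None if the
    file name does not follow the convention."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    patent_id, index, ext = match.groups()
    return patent_id, int(index), ext


def pair_files(filenames):
    """Group file names by (patent_id, element_index).

    The result maps each key to a dict such as
    {"tif": "<image name>", "sdf": "<meta data name>"}.
    Names that do not follow the convention are skipped.
    """
    pairs = {}
    for name in filenames:
        parsed = parse_name(Path(name).name)
        if parsed is None:
            continue
        patent_id, index, ext = parsed
        pairs.setdefault((patent_id, index), {})[ext] = name
    return pairs
```

An entry that has a "tif" key but no "sdf" key (or the reverse) indicates an image without meta data, which may be worth checking before use.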
When you use or distribute this dataset, please send the authors your contact information (name, affiliation, e-mail address).
Disclaimer: Although the authors tried their best to provide an error-free dataset, there might be some incorrect data. If you encounter any such errors, please report them back to the authors so that the data can be updated.
- CLiDE (Chemical Literature Data Extraction) Validation Set.
- OSRA: Optical Structure Recognition. Validation data of US Patent.
Related Ground Truth Data
- Koji Nakagawa, Akio Fujiyoshi, and Masakazu Suzuki. Ground-Truthed Dataset of Chemical Structure Images in Japanese Published Patent Applications. In Proceedings of the 9th International Workshop on Document Analysis Systems (DAS'2010), pp. 455-462, June 9-11, 2010, Boston, MA, USA.
- Akio Fujiyoshi, Koji Nakagawa, and Masakazu Suzuki. Robust Recognition Method of Chemical Structure Images for Japanese Published Patent Applications. Available as a short paper on the web page of the 9th International Workshop on Document Analysis Systems (DAS'2010), June 9-11, 2010, Boston, MA, USA.
- CTfile Formats Specification
- ChemInfty Dataset (69MB)