DAS-Discussion: Information Extraction (2014)

Last updated: 2015-01-02

DAS Working Subgroup Meeting: Information Extraction

Authors:

Nibal Nayef

Participants:

Yoshinori AKAO (Japanese police)
Saddok KEBAIRI (Itesoft)
Manaba OHTA
Xin TAO
Ronaldo MESSINA (a2ia)
Nibal NAYEF (France)
Bao

Introduction

We have very different views of information extraction. It covers different tasks:

Entity spotting (numbers, words, ...)
Graphics spotting (logos, symbols, tables etc.)
Semantics after text recognition
Logical structure
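
As a minimal illustration of the first task, entity spotting on already recognized text can be sketched with regular expressions. The number pattern, the keyword list, and the function name `spot_entities` are assumptions made for this example; they were not part of the discussion, and spotting directly on images would need different techniques.

```python
import re

# Hypothetical pattern: integers and decimals such as "1234" or "56.70".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def spot_entities(text, keywords):
    """Return (label, matched_text, start_offset) tuples, ordered by position.

    `keywords` is a hypothetical list of query words to spot (case-insensitive).
    """
    hits = [("number", m.group(), m.start()) for m in NUMBER.finditer(text)]
    for word in keywords:
        pattern = r"\b" + re.escape(word) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append(("word", m.group(), m.start()))
    return sorted(hits, key=lambda h: h[2])


if __name__ == "__main__":
    for hit in spot_entities("Invoice 1234 total 56.70 EUR", ["invoice", "total"]):
        print(hit)
```

The sketch assumes text recognition has already produced a clean string, which is itself one of the open issues noted on this page.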

What is a document?

There are many types of documents, and their number is increasing:

Digitally born documents
Camera- or mobile-captured documents
Scanned documents
...

To extract any kind of information from any type of document, we need some sort of “prerequisite” module, so that information extraction (IE) modules can work on all document types.
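
One possible reading of the “prerequisite” module is a normalization layer that maps every document type to a shared representation, so that IE modules depend only on that representation. The sketch below is built on that assumption. The class names `Region` and `NormalizedDocument`, the source labels, and the `normalize` function are all hypothetical, and the real preprocessing work (rectification, text recognition, layout analysis) is deliberately left out.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the document types listed above.
SOURCES = {"born_digital", "camera", "scanned"}


@dataclass
class Region:
    kind: str    # "text" or "graphic"
    bbox: tuple  # (x, y, width, height) in pixels
    text: str = ""


@dataclass
class NormalizedDocument:
    source: str
    regions: list = field(default_factory=list)


def normalize(source, raw_regions):
    """Map source-specific (kind, bbox, text) triples to the shared form."""
    if source not in SOURCES:
        raise ValueError(f"unknown source type: {source}")
    doc = NormalizedDocument(source=source)
    for kind, bbox, text in raw_regions:
        doc.regions.append(Region(kind=kind, bbox=tuple(bbox), text=text.strip()))
    return doc


def extract_text(doc):
    """Example IE module: it only sees the shared representation."""
    return [r.text for r in doc.regions if r.kind == "text" and r.text]
```

With this design, supporting a new document type would mean adding a new front end that produces the shared form, while the IE modules stay unchanged.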

Problems of IE

What kind of semantic information should we extract? For example, technical terms, ...
How should the logical structure of a document be defined?
The same information appears in different representations, for example the same name in different languages
What are the ground truth (GT) data, and how large should the training data be? One option: use human voting to build the GT
Ultimate goal: automatic and complete understanding of document contents
Application: enrich data mining
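
The idea of using human voting to build ground truth can be sketched as a simple majority vote over the labels that several annotators give to the same item. The function name `vote_label` and the `min_agreement` threshold are assumptions for illustration; the meeting notes do not specify a voting scheme.

```python
from collections import Counter


def vote_label(annotations, min_agreement=0.5):
    """Return the majority label, or None when agreement is too low.

    `annotations` is a list of labels from different annotators.
    A label wins only if its share is strictly above `min_agreement`.
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    if count / len(annotations) > min_agreement:
        return label
    return None


if __name__ == "__main__":
    print(vote_label(["name", "name", "date"]))
    print(vote_label(["name", "date"]))
```

Items for which the vote returns None could be sent back for further annotation or left out of the ground truth.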

Approaches

Conditional random fields (CRF), natural language processing (NLP), and all methods for word/graphic spotting

Future Directions

Combine methods from different fields:

Image processing
Natural language processing

Take into account that documents are changing drastically
