DAS-Discussion: Information Extraction (2014)

From TC11

Last updated: 2015-01-02

DAS Working Subgroup Meeting: Information Extraction

Authors:

  • Nibal Nayef

Participants:

  • Yoshinori AKAO (Japanese police)
  • Saddok KEBAIRI (Itesoft)
  • Manaba OHTA
  • Xin TAO
  • Ronaldo MESSINA (a2ia)
  • Nibal NAYEF (France)
  • Bao

Introduction

We have totally different views of information extraction. Different tasks:

  • Entity spotting (numbers, words, ...)
  • Graphics spotting (logos, symbols, tables etc.)
  • Semantics after text recognition
  • Logical structure
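
As a small illustration of the first task, entity spotting on already-recognized text could be sketched as below. This is only a sketch under stated assumptions: the function name spot_entities, the number pattern, and the keyword list are hypothetical and were not discussed in the meeting.

```python
import re

# Hypothetical pattern for integers and simple decimals (illustration only).
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")

def spot_entities(text, keywords):
    """Return (kind, matched_text, start_offset) tuples, ordered by position."""
    hits = [("number", m.group(), m.start()) for m in NUMBER.finditer(text)]
    if keywords:
        # Match any of the given keywords as whole words, ignoring case.
        word = re.compile(
            r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
            re.IGNORECASE,
        )
        hits += [("word", m.group(), m.start()) for m in word.finditer(text)]
    return sorted(hits, key=lambda hit: hit[2])
```

Spotting directly in document images, without prior text recognition, needs image-based methods and is not covered by this sketch.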

What is a document?

We have many types of documents, and the number is increasing:

  • Digitally born documents
  • Camera / mobile captured
  • Scanned

..

To extract any kind of information from any type of document, we need a sort of “prerequisite” module, so that IE modules can work on all document types.
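
One possible reading of such a “prerequisite” module is a set of per-type adapters that map every document into a common representation. The sketch below assumes this reading; the names Region, NormalizedDocument, register, and normalize are hypothetical, and the adapters are placeholders that take already-extracted text lines as input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Region:
    kind: str      # e.g. "text" or "graphic"
    content: str

@dataclass
class NormalizedDocument:
    source_type: str
    regions: List[Region] = field(default_factory=list)

# Registry of adapters, one per document type (hypothetical design).
ADAPTERS: Dict[str, Callable[[List[str]], NormalizedDocument]] = {}

def register(source_type):
    """Decorator that registers an adapter for one document type."""
    def wrap(fn):
        ADAPTERS[source_type] = fn
        return fn
    return wrap

@register("born_digital")
def from_born_digital(lines):
    # Assumption: text is directly available for digitally born documents.
    return NormalizedDocument("born_digital", [Region("text", t) for t in lines])

@register("scanned")
def from_scanned(lines):
    # Placeholder: a real adapter would run layout analysis and OCR first.
    return NormalizedDocument("scanned", [Region("text", t.strip()) for t in lines])

def normalize(source_type, lines):
    """Convert any supported document type into the common representation."""
    if source_type not in ADAPTERS:
        raise ValueError(f"no adapter for document type: {source_type}")
    return ADAPTERS[source_type](lines)
```

With this layout, IE modules only ever see NormalizedDocument objects, and supporting a new document type means adding one adapter.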

Problems of IE

  • What kind of semantic information should we extract? Technical terms, ...
  • How to define the logical structure of a document
  • The same information appears in different representations, e.g. the same name in different languages
  • What is the ground truth data, and how large should the training data be? Use human voting to build the ground truth (GT)
  • Ultimate goal: Automatic and complete understanding of document contents.
  • Application: Enrich Data Mining
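
The idea of using human voting to build ground truth can be illustrated with a simple majority vote over the labels that several annotators give to the same item. This is a minimal sketch; the function name vote_label and the min_agreement threshold are assumptions, not something fixed in the discussion.

```python
from collections import Counter

def vote_label(annotations, min_agreement=0.5):
    """Return the majority label, or None if agreement is not above the threshold."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    # Accept the label only when strictly more than min_agreement of the votes agree.
    return label if count / len(annotations) > min_agreement else None
```

Items for which the function returns None could be sent back for further annotation instead of entering the ground truth.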

Approaches

Conditional random fields (CRF), natural language processing (NLP), and all methods for word/graphics spotting.

Future Directions

Combine methods from different fields:

  • Image processing
  • Natural language processing

Take into account that documents are changing drastically.