DAS-Discussion: Information Extraction (2014)
From TC11
DAS Working Subgroup Meeting: Information Extraction
Authors:
- Nibal Nayef
Participants:
- Yoshinori AKAO (Japanese police)
- Saddok KEBAIRI (Itesoft)
- Manaba OHTA
- Xin TAO
- Ronaldo MESSINA (a2ia)
- Nibal NAYEF (France)
- Bao
Introduction
We have totally different views of information extraction (IE). Different tasks:
- Entity spotting (numbers, words, ...)
- Graphics spotting (logos, symbols, tables etc.)
- Semantics after text recognition
- Logical structure
What is a document?
We have many types of documents, and the number of types is increasing:
- Digitally born documents
- Camera / mobile captured
- Scanned
- ...
To extract any kind of information from any type of document, we need some sort of “prerequisite” module, so that the IE modules can work on all document types.
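One possible reading of the “prerequisite” module is an adapter layer that converts every document type into one common representation before any IE module runs. The sketch below is only an illustration under that assumption: the names `TextLine`, `CommonDocument`, `from_born_digital`, `from_ocr_result` and `spot_numbers` are hypothetical and were not proposed in the meeting, and the OCR output format is assumed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextLine:
    """One line of text with an optional bounding box (x, y, width, height)."""
    text: str
    bbox: tuple = (0, 0, 0, 0)


@dataclass
class CommonDocument:
    """Hypothetical common representation handed to every IE module."""
    source_type: str
    lines: List[TextLine] = field(default_factory=list)


def from_born_digital(raw: str) -> CommonDocument:
    # Digitally born text is already available; split it into non-empty lines.
    lines = [TextLine(t) for t in raw.splitlines() if t.strip()]
    return CommonDocument("born_digital", lines)


def from_ocr_result(ocr_lines: List[dict]) -> CommonDocument:
    # Scanned or camera-captured pages are assumed to have gone through OCR,
    # yielding dicts such as {"text": ..., "bbox": (x, y, w, h)}.
    lines = [TextLine(d["text"], tuple(d["bbox"])) for d in ocr_lines]
    return CommonDocument("ocr", lines)


def spot_numbers(doc: CommonDocument) -> List[str]:
    """Toy IE module: it sees only CommonDocument, never the raw input."""
    found = []
    for line in doc.lines:
        found.extend(tok for tok in line.text.split() if tok.isdigit())
    return found
```

With this layout, supporting a new document type means writing one new adapter, while existing IE modules such as the toy `spot_numbers` stay unchanged.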
Problems of IE
- What kind of semantic information should we extract? For example, technical terms, ...
- How to define the logical structure of a document
- The same information can appear in different representations, e.g. the same name in different languages
- What is the ground truth (GT) data, and how much training data is needed? One suggestion: use human voting to build the GT
- Ultimate goal: Automatic and complete understanding of document contents.
- Application: Enrich Data Mining
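The human voting idea for building ground truth, mentioned in the list above, could look roughly like the following. This is a minimal sketch, assuming each annotator gives one label per item and that a label is accepted only with a strict majority; the function name `vote_label` and the `min_agreement` threshold are assumptions, not something fixed in the discussion.

```python
from collections import Counter
from typing import List, Optional


def vote_label(labels: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the most frequent label if its share exceeds min_agreement, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require strictly more than the threshold so that ties are rejected.
    return label if count / len(labels) > min_agreement else None
```

Items for which `vote_label` returns `None` would need another round of annotation or a manual decision.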
Approaches
Conditional random fields (CRF), natural language processing (NLP), and all methods for word/graphic spotting.
Future Directions
Combine methods from different fields:
- Image processing
- Natural language processing
Take into account that documents are changing drastically.