DAS-Discussion: Information Extraction

From TC11


Last updated: 2012-04-25

DAS Working Subgroup Meeting: Information Extraction


  • Partha Pratim Roy (CVC, Spain, partha@cvc.uab.es)
  • Prateek Sarkar (PARC, USA, psarkar@parc.com)

The general objective of the Information Extraction (IE) meeting at the DAS’2010 workshop was to discuss open problems and existing methodologies in order to record recent advances in the IE field. The two-day meeting was devoted to discussing the importance of information, the pros and cons of different extraction methodologies, and the experimental systems developed for IE. This report covers the details of the meetings, including document datasets, meta-data collection, entity interaction and learning methods.


Information Extraction (IE) is a technique used to detect relevant information in larger documents and present it in a structured format. Generally, IE is applied to electronic documents whose content is available as ASCII text. When the document is instead an image, i.e. black pixels on a white page, its content cannot be processed directly. Such documents must first be converted into a machine-readable version by Optical Character Recognition (OCR) techniques. This conversion requires considerable cost and time if it is to guarantee a high quality standard, i.e. no loss of relevant information. No fully automatic conversion into the target document format is available. The compromise is either high-cost, human-controlled input or semiautomatic input using document imaging, i.e. the original scanned image is stored.

A total of 12 members from research labs across the world, including the USA, Spain, France, Germany, and Japan, participated. These labs work mainly on documents such as articles, invoices, patent documents and administrative documents. At present, IE systems are commonly based on pattern matching. Each pattern consists of a regular expression and an associated mapping from syntactic to logical form. The input to information extraction is a set of text documents and the output is a set of filled slots. The set of filled slots may represent an entity with its attributes, a relationship between two or more entities, or an event with various entities playing roles and/or being in certain relationships.
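
The pattern-matching idea described above can be illustrated with a minimal sketch. It is not taken from any system presented at the meeting: the invoice-like text, the two patterns, the slot names and the function `extract_slots` are all hypothetical and chosen only for illustration. Each pattern pairs a regular expression (syntactic form) with a mapping to named slots (logical form).

```python
import re

# Hypothetical patterns for an invoice-like text. Each entry pairs a regular
# expression with a function that maps the match to one or more filled slots.
PATTERNS = [
    (re.compile(r"Invoice\s+No\.?\s*:?\s*(?P<number>[A-Z0-9-]+)", re.I),
     lambda m: {"invoice_number": m.group("number")}),
    (re.compile(r"Total\s*:?\s*(?P<currency>EUR|USD)\s*(?P<amount>\d+(?:\.\d{2})?)", re.I),
     lambda m: {"currency": m.group("currency").upper(),
                "total": float(m.group("amount"))}),
]


def extract_slots(text):
    """Apply every pattern to the text and collect the slots it fills."""
    slots = {}
    for regex, to_slots in PATTERNS:
        match = regex.search(text)
        if match:
            slots.update(to_slots(match))
    return slots


# Made-up sample document.
sample = "Invoice No: INV-2010-17\nDate: 2010-06-09\nTotal: EUR 125.50"
print(extract_slots(sample))
```

A text that matches none of the patterns simply yields an empty set of slots, which is one reason such systems are restricted to document layouts anticipated by the pattern author.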

What Information to Extract

Significant effort has been focused on the problem of extracting structured information (e.g., researchers, publications, co-author and advising relationships) from documents. Some of the kinds of information discussed are given below.

  • Functional Roles: To qualify as a descriptor, a noun phrase must refer either to the person or to the person's professional role (or other functional role). Some references to a person's role are made indirectly, e.g., by the use of "as," "work as," "job of".
  • Named Entities: The Named Entity task consists of three subtasks (entity names, temporal expressions, number expressions). The expressions to be annotated are "unique identifiers" of entities (organizations, persons, locations), times (dates, times), and quantities (monetary values, percentages).
  • Named relations: The set of filled slots may represent an entity with its attributes, a relationship between two or more entities, or an event with various entities playing roles and/or being in certain relationships.
  • Keywords, topics (search terms): The extracted entities and relations can be leveraged to provide user services such as browsing, keyword search, and structured querying. A keyword query is interpreted as a set of precise queries in the context of information extracted from text.
  • Citation matching: Citation matching is the problem of extracting bibliographic records from citation lists in technical papers, and merging records that represent the same publication.
  • Linking entities: Some citation websites, such as CiteSeer.org, extract citation information from academic research papers, including the paper’s title, authors, publication venue, year, etc. They also de-duplicate citation entries from papers’ reference sections, so one can easily find all the papers that cite a certain paper. The resulting “citation graph” can be analyzed to automatically find the seminal papers in a subfield.
  • Chemical structure formula: To search for chemical structures in research articles, diagrams or text representing molecules need to be translated to a standard chemical file format compatible with search engines.
  • Information for document scheduling: A stream of incoming documents is handled one at a time to determine where each should be directed. A list of profiles identifies which users should receive a document; if a document matches a profile, the user is notified that a relevant document exists.
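
The merging step of citation matching, mentioned in the list above, can be sketched as follows. This is only an illustration under strong assumptions: the records are made up, they are assumed to be already parsed into `authors`, `year` and `title` fields, and the exact-key rule in the hypothetical helper `citation_key` stands in for the fuzzier matching a real system would need.

```python
import re
from collections import defaultdict


def citation_key(record):
    """Build a crude matching key: first author's surname, year, normalized title."""
    surname = record["authors"][0].split()[-1].lower()
    # Lowercase the title and collapse punctuation and spacing differences.
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (surname, record["year"], title)


def merge_citations(records):
    """Group records that share a key, i.e. that appear to cite the same publication."""
    groups = defaultdict(list)
    for record in records:
        groups[citation_key(record)].append(record)
    return list(groups.values())


# Made-up records, as they might come out of two different reference sections.
records = [
    {"authors": ["J. Smith"], "year": 2008, "title": "Table Detection in Invoices"},
    {"authors": ["John Smith"], "year": 2008, "title": "Table detection in invoices."},
    {"authors": ["A. Jones"], "year": 2009, "title": "Chemical Structure Recognition"},
]
for group in merge_citations(records):
    print(len(group), group[0]["title"])
```

The first two records differ in author initials, capitalization and trailing punctuation, yet fall into the same group; the third stays separate. Typos or OCR errors in a title would defeat this exact-key rule.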


Numerous IE systems have been designed and implemented. In principle, the approaches used can be categorized into three groups: (1) expert-designed rules, (2) statistical methods and (3) manual interaction.

1) The expert-designed rules approach requires a system developer who is familiar with both the requirements of the application domain and the workings of the IE system. The developer defines the rules used to extract the relevant information, based on known problems. A corpus of domain-relevant texts must therefore be available for this task. Furthermore, the developer is free to apply any general knowledge or intuition in designing the rules. Thus, the performance of the IE system depends on the skill of the knowledge engineer. Some advantages and disadvantages of this approach are listed below.


Advantages

  • Often better performance without much training data.

Disadvantages

  • Difficult to maintain and modify over time.
  • Risk of over-specification.
  • Restricted to known problems.

2) Systems or modules using statistical methods need an annotated corpus of domain-relevant texts, so system expertise is not required. The task of the annotator is to annotate the texts appropriately. The annotated texts are the input to the system or module, which runs a training algorithm on them. The system thus obtains knowledge from the annotated texts and can use it to extract the desired information from new texts of the same domain. If the required training data is available or easily obtainable, the learning approach should be chosen. For complex domain-level tasks, where annotation is much slower, more difficult, more expensive, or requires extensive domain expertise from annotators, the rule-based approach may be favoured. If specifications change, it is often easier to make minor changes to a set of rules than to re-annotate and retrain the system. However, other changes in specifications may be easier to accomplish with a trainable system. Some advantages and disadvantages of this approach are listed below.


Advantages

  • Automatic training possible.
  • Auto-adaptation can be realized easily.

Disadvantages

  • Risk of unexpected and unexplainable results.
  • Generally needs more training data.
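
As a toy illustration of the trainable approach in 2), the sketch below learns a labelling rule from annotated tokens instead of having it written by hand. It is a deliberately simple stand-in for the statistical methods discussed, not a method reported at the meeting: the single feature (a coarse token "shape"), the labels `DATE`, `AMOUNT` and `O`, and the training examples are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def shape(token):
    """Map a token to a coarse shape, e.g. 'Total' -> 'Aa', '125.50' -> '9.9'."""
    s = re.sub(r"[A-Z]+", "A", token)
    s = re.sub(r"[a-z]+", "a", s)
    return re.sub(r"\d+", "9", s)


def train(annotated):
    """Learn, for each token shape, the label it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for token, label in annotated:
        counts[shape(token)][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def predict(model, tokens, default="O"):
    """Label new tokens; shapes never seen in training get the default label."""
    return [(t, model.get(shape(t), default)) for t in tokens]


# Made-up annotated corpus: (token, label) pairs, with "O" meaning "not of interest".
annotated = [
    ("2010-06-09", "DATE"), ("2009-12-01", "DATE"),
    ("125.50", "AMOUNT"), ("99.00", "AMOUNT"),
    ("Total", "O"), ("Invoice", "O"),
]
model = train(annotated)
print(predict(model, ["Paid", "2012-04-25", "18.75"]))
```

Adapting the extractor to a new field then means annotating more examples and retraining rather than editing rules, which also shows where the need for training data comes from.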

3) Other systems require manual interaction as part of the extraction process, but work with relatively unstructured input texts. Still others take a middle road, with a medium level of automation on a reasonable range of documents. Some advantages and disadvantages of this approach are listed below.


Disadvantages

  • Does not scale well.
  • No universally applicable off-the-shelf tools.


Several problems in extracting information remain open. One of the main ones is the combination of document categorization and data extraction: the two can benefit from each other, but there is no generic method for combining them.

Another problem lies in data format standards. UIMA (Unstructured Information Management Architecture) is quite widely used but is not universally preferred. There are also no standard rules for representing uncertainty and multiple hypotheses. The research community also lacks standard ground-truth formats and data sets.

Proven Methods

Much research has been done on obtaining information using IE techniques, and some of this work has already been published at conferences. Different kinds of documents have been processed with good recognition accuracy, notably in bibliographic data extraction, invoice data retrieval and chemical structure recognition. On restricted invoice data, a 95% recognition rate has been obtained. Chemical structure recognition on US patent data has been performed with 75% recognition accuracy.


In this report we have presented the discussion held in the Information Extraction subgroup meeting at the DAS’2010 workshop. Extraction of information from full text looks promising, but context must be taken into account. Part of this context is given by the position of the text under analysis within the document. Different extraction methodologies, along with their pros and cons, were discussed in the meeting.
