Chemical Data Analysis in the Large, May 22nd - 26th 2000, Bozen, Italy |
INFORMATION EXTRACTION FROM BIOLOGICAL SCIENCE JOURNAL ARTICLES: ENZYME INTERACTIONS AND PROTEIN STRUCTURES
ROBERT GAIZAUSKAS,* KEVIN HUMPHREYS AND GEORGE DEMETRIOU
Department of Computer Science, University of Sheffield, Regent Court, Portobello Street, Sheffield, S1 4DP, UK
ABSTRACT
With the explosive growth of scientific literature in the area of molecular biology, the need to process and extract information automatically from on-line text sources has become increasingly important. Information extraction technology, as defined and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from news-wire texts and primarily in domains concerned with human activity. In this paper we consider the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology. We describe how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. The progress so far provides convincing grounds for believing that IE techniques will deliver novel and effective ways for the extraction of information from unstructured text sources in the pursuit of knowledge in the biological domain.
INTRODUCTION
Information Extraction (IE) may be defined as the activity of extracting details of predefined classes of entities and relationships from natural language texts and placing this information into a structured representation called a template. [1, 2] The prototypical IE tasks are those defined by the U.S. DARPA-sponsored Message Understanding Conferences (MUCs), requiring the filling of a complex template from newswire texts on subjects such as joint venture announcements, management succession events, or rocket launchings. [3, 4] While the performance of current technology is not yet at human levels overall, it is approaching human levels for some component tasks (e.g. the recognition and classification of named entities in text) and is at a level at which comparable technologies, such as information retrieval and machine translation, have found useful application. IE is particularly relevant where large volumes of text make human analysis infeasible, where template-oriented information seeking is appropriate (i.e. where there is a relatively stable information need and a set of texts in a relatively narrow domain), where conventional information retrieval technology is inadequate, and where some error can be tolerated. |
INFORMATION EXTRACTION TECHNOLOGY
The most recent MUC evaluation (MUC-7, [4]) specified five separate component tasks, which illustrate the main functional capabilities of current IE systems:
Systems are evaluated on each of these tasks as follows. Each task is precisely specified by means of a task definition document. Human annotators are then given these definitions and use them to produce by hand the 'correct' results for each of the tasks: filled templates or texts tagged with name classes or coreference relations (these results are called answer keys). The participating systems are then run and their results, called system responses, are automatically scored against the answer keys. The chief metrics are precision, the percentage of the system's output that is correct (i.e. occurs in the answer key), and recall, the percentage of the correct answers that occur in the system's output.
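The two metrics can be illustrated with a minimal sketch that treats the answer key and the system response as sets of items. The function name and the example annotations are hypothetical, and the scoring software used in the MUC evaluations is more elaborate (it has to align responses with keys, for instance); this shows only the basic definitions.

```python
def precision_recall(system_responses, answer_key):
    """Return (precision, recall) as percentages for two collections of items."""
    responses = set(system_responses)
    key = set(answer_key)
    correct = responses & key  # items the system got right
    precision = 100.0 * len(correct) / len(responses) if responses else 0.0
    recall = 100.0 * len(correct) / len(key) if key else 0.0
    return precision, recall


# Hypothetical annotations as (text, class) pairs, for illustration only.
answer_key = {
    ("Pseudomonas cepacia", "species"),
    ("Ser87", "residue"),
    ("His286", "residue"),
    ("Asp264", "residue"),
}
system_responses = {
    ("Pseudomonas cepacia", "species"),
    ("Ser87", "residue"),
    ("lipase", "residue"),  # wrong: not in the answer key
}

precision, recall = precision_recall(system_responses, answer_key)
print(f"precision={precision:.0f}% recall={recall:.0f}%")
```

In this toy case two of the three responses are correct, while only two of the four key items are found, so precision is higher than recall.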
TWO BIOINFORMATICS APPLICATIONS OF INFORMATION EXTRACTION
We are currently investigating the use of IE for two separate bioinformatics research projects. The Enzyme and Metabolic Pathways Information Extraction (EMPathIE) project aims to extract details of enzyme reactions from articles in the journals Biochimica et Biophysica Acta and FEMS Microbiology Letters. The utility for biological researchers of a database of enzyme reactions lies in the ability to search for potential sequences of reactions, where the products of one reaction match the requirements of another. Such sequences form metabolic pathways, the identification of which can suggest potential sites for the application of drugs to affect a particular end result. Typically, journal articles in this domain describe details of a single enzyme reaction, often with little indication of related reactions and which pathways the reaction may be part of. Only by combining details from several articles can potential pathways be identified. |
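The idea of searching a reaction database for sequences in which the products of one reaction match the requirements of another can be sketched as follows. The record layout (`Reaction`), the function `find_pathways` and the three example records are assumptions made for illustration; they are not EMPathIE output or the EMP database schema.

```python
from collections import namedtuple

# Hypothetical record layout for an extracted enzyme reaction.
Reaction = namedtuple("Reaction", ["enzyme", "substrates", "products"])


def find_pathways(reactions, max_length=3):
    """Enumerate chains in which a product of one reaction is a substrate of the next."""
    pathways = []

    def extend(chain):
        if len(chain) > 1:
            pathways.append([r.enzyme for r in chain])
        if len(chain) == max_length:
            return
        last = chain[-1]
        for candidate in reactions:
            # Link two reactions when they share at least one compound.
            if candidate not in chain and last.products & candidate.substrates:
                extend(chain + [candidate])

    for reaction in reactions:
        extend([reaction])
    return pathways


# Illustrative records, as if each had been extracted from a different article.
reactions = [
    Reaction("hexokinase",
             frozenset({"glucose", "ATP"}),
             frozenset({"glucose 6-phosphate", "ADP"})),
    Reaction("glucose-6-phosphate isomerase",
             frozenset({"glucose 6-phosphate"}),
             frozenset({"fructose 6-phosphate"})),
    Reaction("6-phosphofructokinase",
             frozenset({"fructose 6-phosphate", "ATP"}),
             frozenset({"fructose 1,6-bisphosphate", "ADP"})),
]

pathways = find_pathways(reactions)
for pathway in pathways:
    print(" -> ".join(pathway))
```

A real search would also need to normalise compound names before matching, which is one reason consistent terminology recognition matters for this application.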
Results: We have determined the crystal structure of a triacylglycerol lipase from Pseudomonas cepacia (PcL) in the absence of a bound inhibitor using X-ray crystallography. The structure shows the lipase to contain an alpha/beta-hydrolase fold and a catalytic triad comprising of residues Ser87, His286 and Asp264. The enzyme shares several structural features with homologous lipases from Pseudomonas glumae (PgL) and Chromobacterium viscosum (CvL), including a calcium-binding site. The present structure of PcL reveals a highly open conformation with a solvent-accessible active site. This is in contrast to the structures of PgL and PcL in which the active site is buried under a closed or partially opened 'lid', respectively.
| Figure 1. Sample text fragment from a scientific paper in Molecular Biology. |
EMPathIE
One of the inspirations for the Enzyme and Metabolic Pathways application was the existence of a manually constructed database for the same application. The EMP database (Selkov et al., 1996) contains over 20,000 records of enzyme reactions, collected from journal articles published since 1964. That such a database has been constructed and is widely used demonstrates the utility of the application. EMPathIE aims to extract only a key subset of the fields found in the EMP database records.
PASTA
The entities to be identified for the PASTA task include proteins, amino acid residues, species, types of structural characteristics (secondary structure, quaternary structure), active sites, other (probably less important) regions, chains and interactions (hydrogen bonds, disulphide bonds, etc.). In collaboration with molecular biologists we have designed a template to capture protein structure information, a fragment of which, filled with information extracted from the text in Figure 1, is shown below:
THE EMPATHIE AND PASTA SYSTEMS
The IE systems developed to carry out the EMPathIE and PASTA tasks are both derived from the Large Scale Information Extraction (LaSIE) system, a general purpose IE system under development at Sheffield since 1994. [10,11] One of several dozen systems designed to take part in the MUC evaluations over the years, the LaSIE system more or less fits the description of a generic IE system. [12]
| Figure 2. EMPathIE system modules within GATE. |
[15] The PASTA system is similar and reuses several modules within the same environment. The architecture of the original LaSIE system has been substantially rearranged for its use in the biochemical domain, mainly to allow the reuse of general English processing modules, such as the part-of-speech tagger and the phrasal parser, without special retraining or adaptation to allow for the domain-specific terminology. This has resulted in an independent terminology identification subsystem, postponing general syntactic analysis until an attempt to identify terms has been made. In general, the original LaSIE system modules, developed for news-wire applications, have been reused, but with various modifications resulting from specific features of the texts, as described in the following. Both systems have a pipeline architecture consisting of four principal stages, described in the following sections: text preprocessing (SGML/structure analysis, tokenization), lexical and terminological processing (terminology lexicons, morphological analysis, terminology grammars), parsing and semantic interpretation (sentence boundary detection, part-of-speech tagging, phrasal grammars, semantic interpretation), and discourse interpretation (coreference resolution, domain modeling).
Text Preprocessing
Scientific articles typically have a rigid structure, including abstract, introduction, method and materials, results and discussion sections, and for particular applications certain sections can be targeted for detailed analysis while others can be skipped completely. Where articles are available in SGML with a DTD, an initial module is used to identify particular markup, specified in a configuration file, for use by subsequent modules. Where articles are in plain text, an initial module called 'sectionizer' is used to identify and classify significant sections using sets of regular expressions. Both the SGML and sectionizer modules may specify that certain text regions are to be excluded from any subsequent processing, avoiding detailed processing of apparently irrelevant text, especially within the discourse interpretation stage where coreference resolution is a relatively expensive operation.
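A regular-expression sectionizer of the kind described for plain-text articles might look roughly like the sketch below. The heading patterns, the section labels and the choice to exclude the reference list are all assumptions made for illustration; the rules used by the actual 'sectionizer' module are not given here.

```python
import re

# Hypothetical heading patterns, one per section class.
SECTION_PATTERNS = [
    ("abstract", re.compile(r"^\s*abstract\s*$", re.IGNORECASE)),
    ("introduction", re.compile(r"^\s*introduction\s*$", re.IGNORECASE)),
    ("methods", re.compile(
        r"^\s*(materials\s+and\s+methods|methods?\s+and\s+materials|methods)\s*$",
        re.IGNORECASE)),
    ("results", re.compile(r"^\s*results(\s+and\s+discussion)?\s*$", re.IGNORECASE)),
    ("discussion", re.compile(r"^\s*discussion\s*$", re.IGNORECASE)),
    ("references", re.compile(r"^\s*references\s*$", re.IGNORECASE)),
]

# Sections assumed to be irrelevant for later (expensive) processing stages.
EXCLUDED = {"references"}


def sectionize(text, excluded=EXCLUDED):
    """Split plain text into (label, body) pairs, dropping excluded sections."""
    sections = []
    label, lines = "front_matter", []
    for line in text.splitlines():
        matched = next(
            (name for name, pattern in SECTION_PATTERNS if pattern.match(line)), None)
        if matched is not None:
            # A heading closes the current section and opens a new one.
            sections.append((label, "\n".join(lines).strip()))
            label, lines = matched, []
        else:
            lines.append(line)
    sections.append((label, "\n".join(lines).strip()))
    return [(name, body) for name, body in sections if name not in excluded and body]


sample = (
    "A study of lipases\n"
    "Abstract\n"
    "We report a structure.\n"
    "Results\n"
    "The fold is conserved.\n"
    "References\n"
    "[1] Someone."
)
sections = sectionize(sample)
for name, body in sections:
    print(f"{name}: {body}")
```

Dropping sections at this point means that later modules never see the excluded text, which is how the cost of coreference resolution is kept down.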
Lexical and Terminological Processing
The main information sources used for terminology identification in the biochemical domain are: case-insensitive terminology lexicons, listing component terms of various categories; morphological cues, mainly standard biochemical suffixes; and hand-constructed grammar rules for each terminology class. For example, the enzyme name mannitol-1-phosphate 5-dehydrogenase would be recognised firstly by the classification of mannitol as a potential compound modifier, and phosphate as a compound, both by being matched in the terminology lexicon. Morphological analysis would then suggest dehydrogenase as a potential enzyme head, due to its suffix -ase, and then grammar rules would apply to combine the enzyme head with a known compound and modifier that can play the role of enzyme modifier.
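The three information sources can be illustrated on the enzyme example with a toy sketch. The two-entry lexicon, the tag names (including `locant` for position numbers) and the single grammar rule are assumptions for illustration only; the actual lexicons and terminology grammars are much larger and are not reproduced here.

```python
import re

# Hypothetical miniature lexicon, looked up case-insensitively.
LEXICON = {"mannitol": "compound_modifier", "phosphate": "compound"}

# Morphological cue: the suffix -ase suggests a potential enzyme head.
ENZYME_SUFFIX = re.compile(r"ase$", re.IGNORECASE)

MODIFIER_TAGS = {"compound_modifier", "compound", "locant"}


def classify(token):
    """Lexicon lookup first, then morphological cues."""
    category = LEXICON.get(token.lower())
    if category:
        return category
    if ENZYME_SUFFIX.search(token):
        return "enzyme_head"
    if token.isdigit():
        return "locant"  # position number such as the 1 or 5 in the example
    return "unknown"


def find_enzymes(text):
    """Toy grammar rule: (modifier | compound | locant)+ enzyme_head -> enzyme."""
    matches = list(re.finditer(r"[A-Za-z]+|\d+", text))
    tags = [classify(m.group()) for m in matches]
    enzymes = []
    for i, tag in enumerate(tags):
        if tag != "enzyme_head":
            continue
        start = i
        # Extend leftwards over known compounds, modifiers and locants.
        while start > 0 and tags[start - 1] in MODIFIER_TAGS:
            start -= 1
        # A name should not begin with a bare locant.
        while start < i and tags[start] == "locant":
            start += 1
        # Require a known compound or modifier, so that ordinary words
        # ending in -ase (e.g. "increase") are not reported on their own.
        if any(tags[j] in ("compound", "compound_modifier") for j in range(start, i)):
            enzymes.append(text[matches[start].start():matches[i].end()])
    return enzymes


enzymes = find_enzymes("Activity of mannitol-1-phosphate 5-dehydrogenase was measured.")
print(enzymes)
```

The suffix alone over-generates, which is why the sketch only accepts an enzyme head when the rule can attach it to material already matched in the lexicon.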
Parsing and Semantic Interpretation
The syntactic processing modules treat any terms recognized in the previous stage as non-decomposable units, with a syntactic role of proper noun. The sentence splitting module cannot therefore propose sentence boundaries within a preclassified term. Similarly, the part-of-speech tagger only attempts to assign tags to tokens which are not part of proposed terms, and the phrasal parser treats terms as preparsed noun phrases. Of course, this approach does not necessarily assume the terminology recognition subsystem to be fully complete and correct, and subsequent syntactic or semantic context can still be used to reclassify or remove proposed terms. In particular, tokens which are constituents of terms proposed but not classified by the NE subsystem, i.e. potential but unknown NEs, are passed to the tagger and phrasal parser as normal, but the potential term is passed to the parser in addition, as a proper name, to allow the phrasal grammar to determine the best analysis. If the unclassified NE is retained after phrasal parsing, it may be classified within the discourse interpreter using its semantic context or as a result of being coreferred with an entity of a known class. |
Discourse Interpretation
The discourse interpreter adds the semantic representation of each sentence to a predefined domain model, made up of an ontology, or concept hierarchy, plus inheritable properties and inference rules associated with concepts. The domain model is gradually populated with instances of concepts from the text to become a discourse model. A powerful coreference mechanism attempts to merge each newly introduced instance with an existing one, subject to various syntactic and semantic constraints. Inference rules of particular instance types may then fire to hypothesize the existence of instances required to fill a template (e.g. an organism with a source relation to an enzyme), and the coreference mechanism will then attempt to resolve the hypothesized instances with actual instances from the text. |
RESULTS AND EVALUATION
Evaluation
So far, a complete EMPathIE system exists which has been developed by concentrating on the full texts of six journal papers (the development corpus) and evaluated against a corpus of a further seven journal papers (the evaluation corpus). Filled templates for all thirteen of these journal papers were produced by trained biochemists highlighting key entities on paper copies of the texts and adding marginal notes where necessary to specify compound roles in interactions and any additional slot values such as concentration, temperature, etc. The annotations were translated to template format by the system developer (with the system frozen before evaluation texts were seen), but some degree of subjective interpretation was required in this process. The annotation would therefore probably be difficult to reproduce without a detailed task specification document, which would be aided by inter-annotator agreement studies to highlight areas of ambiguity in the task definition. However, the current templates at least have the advantage of being produced with some degree of consistency by the developer alone, and so do allow a useful measure of the system's accuracy. |
| Table 1. Initial Template results for EMPathIE |
| Test Set | COR | INC | MIS | SPU | REC | PRE |
| Dev | 150 | 121 | 330 | 61 | 25 | 45 |
| Eval | 213 | 193 | 518 | 93 | 23 | 43 |
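The REC and PRE columns can be reproduced from the counts under one reading of the column headings. The sketch below assumes the usual MUC scorer abbreviations (COR correct, INC incorrect, MIS missing, SPU spurious) and ignores partial credit; the function name `muc_scores` is hypothetical.

```python
def muc_scores(cor, inc, mis, spu):
    """Return (recall, precision) as whole percentages from slot counts."""
    possible = cor + inc + mis  # slot fills present in the answer key
    actual = cor + inc + spu    # slot fills produced by the system
    recall = round(100 * cor / possible) if possible else 0
    precision = round(100 * cor / actual) if actual else 0
    return recall, precision


# Counts copied from Table 1: (COR, INC, MIS, SPU).
table1 = {"Dev": (150, 121, 330, 61), "Eval": (213, 193, 518, 93)}

scores = {name: muc_scores(*counts) for name, counts in table1.items()}
for name, (rec, pre) in scores.items():
    print(f"{name}: REC={rec} PRE={pre}")
```

Under this assumption the low recall is driven mainly by the MIS column: most template slots in the answer keys are simply not filled by the system.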
In addition to evaluating the template filling capabilities of the prototype we have evaluated its performance at correctly identifying and classifying term classes in the texts (this corresponds to the MUC named entity task). To do this, six of the seven evaluation corpus articles were manually annotated for eleven terminology or named entity classes. The results are shown in Table 2. [16]
| Table 2. Initial Named Entity results for EMPathIE |
| Name_Type | COR | INC | MIS | SPU | REC | PRE |
| compound | 538 | 27 | 553 | 39 | 48 | 89 |
| element | 24 | 0 | 19 | 14 | 56 | 63 |
| enzyme | 612 | 0 | 12 | 23 | 98 | 96 |
| genus | 15 | 0 | 18 | 11 | 45 | 58 |
| location | 33 | 1 | 15 | 24 | 67 | 57 |
| measure | 566 | 0 | 120 | 81 | 83 | 87 |
| organism | 188 | 9 | 53 | 64 | 75 | 72 |
| organization | 35 | 6 | 31 | 8 | 49 | 71 |
| pathway | 0 | 0 | 15 | 4 | 0 | 0 |
| person | 17 | 1 | 58 | 9 | 22 | 63 |
| TOTALS | 2028 | 44 | 894 | 277 | 68 | 86 |
The development of the PASTA system has reached the stage where a prototype system exists which can produce templates as described above. A corpus of 52 abstracts of journal articles has been manually annotated with terminology classes, by the system developer with the assistance of a molecular biologist, to allow an automatic evaluation of the PASTA terminology system using the MUC scoring software. Table 3 shows some preliminary results for the main terminology classes.
| Table 3. Initial Named Entity results for PASTA |
| Name_Type | COR | INC | MIS | SPU | REC | PRE |
| protein | 358 | 0 | 52 | 12 | 87 | 97 |
| species | 111 | 0 | 22 | 3 | 83 | 97 |
| residue | 175 | 0 | 4 | 13 | 98 | 93 |
| site | 53 | 0 | 34 | 10 | 61 | 84 |
| region | 19 | 0 | 24 | 0 | 44 | 100 |
| 2_struct | 78 | 0 | 1 | 1 | 99 | 99 |
| sup_struct | 84 | 0 | 0 | 5 | 100 | 94 |
| 4_struct | 115 | 0 | 5 | 3 | 96 | 97 |
| chain | 27 | 0 | 12 | 0 | 69 | 100 |
| base | 38 | 0 | 0 | 1 | 100 | 97 |
| atom | 42 | 0 | 2 | 10 | 95 | 81 |
| nonprotein | 107 | 0 | 0 | 21 | 100 | 84 |
| interaction | 10 | 0 | 3 | 1 | 77 | 91 |
| TOTALS | 1217 | 0 | 159 | 80 | 88 | 94 |
Discussion
It should be stressed that these evaluation results are very preliminary, and we would expect them to improve substantially with further development. |
CONCLUSION
Between these two projects much of the low-level work of moving IE systems into the new domain of molecular biology and the new text genre of journal papers has been carried out. We have generalized our software to cope with longer, multi-sectioned articles with embedded SGML; we have generalized tokenization routines to cope with scientific nomenclature and terminology recognition procedures to deal with a broad range of molecular biological terminology. All of this work is reusable by any IE application in the area of molecular biology. |
ACKNOWLEDGMENTS
EMPathIE is a 1.5 year research project in collaboration with, and funded by, GlaxoWellcome plc and Elsevier Science. The authors would like to thank Dr. Charlie Hodgman of GlaxoWellcome for supplying domain expertise and Elsevier for supplying electronic copy of relevant journals. PASTA is funded under the UK BBSRC-EPSRC Bioinformatics Programme (BIF08754) and is a collaboration between the Departments of Computer Science, Information Studies and Molecular Biology and Biotechnology at the University of Sheffield. The authors would like to thank Dr. Peter Artymiuk and Prof. Peter Willett of the University of Sheffield for supplying their expertise in the biochemistry domain.
REFERENCES AND NOTES
[1] Cowie, J.; Lehnert, W. Communications of the ACM, 1996, 39, 80. |
Published in "Chemical Data Analysis in the Large: The Challenge of the Automation Age", Martin G. Hicks (Ed.), Proceedings of the Beilstein-Institut Workshop, May 22nd - 26th, 2000, Bozen, Italy http://www.beilstein-institut.de/bozen2000/proceedings/ |