Projects per year
Science relies on data in all its different forms. In molecular biology and bioinformatics in particular large scale data generation has taken centre stage in the form of high-throughput experiments. In line with this exponential increase of experimental data has been the near exponential growth of scientific publications. Yet where classical data mining techniques are still capable of coping with this deluge in structured data (Chapter 2), access of information found in scientific literature is still limited to search engines allowing searches on the level keywords, titles and abstracts. However, large amounts of knowledge about biological entities and their relations are held within the body of articles. When extracted, this data can be used as evidence for existing knowledge or hypothesis generation making scientific literature a valuable scientific resource. To unlock the information inside the articles requires a dedicated set of techniques and approaches tailored to the unstructured nature of free text. Analogous to the field of data mining for the analysis of structured data, the field of text mining has emerged for unstructured text and a number of applications has been developed in that field.
This thesis is about text mining in the field of metabolomics. The work focusses on strategies for accessing large collections of scientific text and on the text mining steps required to extract metabolic reactions and their constituents, enzymes and metabolites, from scientific text. Metabolic reactions are important for our understanding of metabolic processes within cells and that information provides an important link between genotype phenotype. Furthermore information about metabolic reactions stored in databases is far from complete making it an excellent target for our text mining application.
In order to access the scientific publications for further analysis they can be used as flat text or loaded into database systems. In Chapter 2we assessed and discussed the capabilities and performance of XML-type database systems to store and access very large collections of XML-type documents in the form of the Medline corpus, a collection of more than 20 million of scientific abstracts. XML data formats are common in the field of bioinformatics and are also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. The database systems were evaluated on a number of aspects broadly ranging from technical requirements to ease-of-use and performance. The performance of the different XML-type database systems was measured Medline abstract collections of increasing size and with a number of different queries. One of the queries assessed the capabilities of each database system to search the full-text of each abstract, which would allow access to the information within the text without further text analysis. The results show that all database systems cope well with the small and medium dataset, but that the full dataset remains a challenge. Also the query possibilities varied greatly across all studied databases. This led us to conclude that the performances and possibilities of the different database types vary greatly, also depending on the type of research question. There is no single system that outperforms the others; instead different circumstances can lead to a different optimal solution. Some of these scenarios are presented in the chapter.
Among the conclusions of Chapter 2is that conventional data mining techniques do not work for the natural language part of a publication beyond simple retrieval queries based on pattern matching. The natural language used in written text is too unstructured for that purpose and requires dedicated text mining approaches, the main research topic of this thesis. Two major tasks of text mining are named entity recognition, the identification of relevant entities in the text, and relation extraction, the identification of relations between those named entities. For both text mining tasks many different techniques and approaches have been developed. For the named entity recognition of enzymes and metabolites we used a dictionary-based approach (Chapter 3) and for metabolic reaction extraction a full grammar approach (Chapter 4).
In Chapter 3we describe the creation of two thesauri, one for enzymes and one for metabolites with the specific goal of allowing named entity identification, the mapping of identified synonyms to a common identifier, for metabolic reaction extraction. In the case of the enzyme thesaurus these identifiers are Enzyme Nomenclature numbers (EC number), in the case of the metabolite thesaurus KEGG metabolite identifiers. These thesauri are applied to the identification of enzymes and metabolites in the text mining approach of Chapter 4. Both were created from existing data sources by a series of automated steps followed by manual curation. Compared to a previously published chemical thesaurus, created entirely with automated steps, our much smaller metabolite thesaurus performed on the same level for F-measure with a slightly higher precision. The enzyme thesaurus produced results equal to our metabolite thesaurus. The compactness of our thesauri permits the manual curation step important in guaranteeing accuracy of the thesaurus contents, whereas creation from existing resources by automated means limits the effort required for creation. We concluded that our thesauri are compact and of high quality, and that this compactness does not greatly impact recall.
In Chapter 4we studied the applicability and performance of a full parsing approach using the two thesauri described in Chapter 3 for the extraction of metabolic reactions from scientific full-text articles. For this we developed a text mining pipeline built around a modified dependency parser from the AGFL grammar lab using a pattern-based approach to extract metabolic reactions from the parsing output. Results of a comparison to a modified rule-based approach by Czarnecki et al.using three previously described metabolic pathways from the EcoCyc database show a slightly lower recall compared to the rule-based approach, but higher precision. We concluded that despite its current recall our full parsing approach to metabolic reaction extraction has high precision and potential to be used to (re-)construct metabolic pathways in an automated setting. Future improvements to the grammar and relation extraction rules should allow reactions to be extracted with even higher specificity.
To identify potential improvements to the recall, the effect of a number of text pre-processing steps on the performance was tested in a number of experiments. The one experiment that had the most effect on performance was the conversion of schematic chemical formulas to syntactic complete sentences allowing them to be analysed by the parser. In addition to the improvements to the text mining approach described in Chapter 4I make suggestions in Chapter 5 for potential improvements and extensions to our full parsing approach for metabolic reaction extraction. Core focus here is the increase of recall by optimising each of the steps required for the final goal of extracting metabolic reactions from the text. Some of the discussed improvements are to increase the coverage of the used thesauri, possibly with specialist thesauri depending on the analysed literature. Another potential target is the grammar, where there is still room to increase parsing success by taking into account the characteristics of biomedical language. On a different level are suggestions to include some form of anaphora resolution and across sentence boundary search to increase the amount of information extracted from literature.
In the second part of Chapter 5I make suggestions as to how to maximise the information gained from the text mining results. One of the first steps should be integration with other biomedical databases to allow integration with existing knowledge about metabolic reactions and other biological entities. Another aspect is some form of ranking or weighting of the results to be able to distinguish between high quality results useful for automated analyses and lower quality results still useful for manual approaches. Furthermore I provide a perspective on the necessity of computational literature analysis in the form of text mining. The main reasoning here is that human annotators cannot keep up with the amount of publications so that some form of automated analysis is unavoidable. Lastly I discuss the role of text mining in bioinformatics and with that also the accessibility of both text mining results and the literature resources necessary to create them. An important requirement for the future of text mining is that the barriers around high-throughput access to literature for text mining applications have to be removed. With regards to accessing text mining results, there is a long way to go for many applications, including ours, before they can be used directly by biologists. A major factor is that these applications rarely feature a suitable user interface and easy to use setup.
To conclude, I see the main role of a text mining system like ours mainly in gathering evidence for existing knowledge and giving insights into the nuances of the research landscape of a given topic. When using the results of our reaction extraction system for the identification of ‘new’ reactions it is important to go back to the actual evidence presented for extra validations and to cross-validate the predictions with other resources or experiments. Ideally text mining will be used for generation of hypotheses, in which the researcher uses text mining findings to get ideas on, in our case, new connections between metabolites and enzymes; subsequently the researcher needs to go back to the original texts for further study. In this role text mining is an essential tool on the workbench of the molecular biologist.
|Qualification||Doctor of Philosophy|
|Award date||7 Apr 2014|
|Place of Publication||Wageningen|
|Publication status||Published - 2014|
- data analysis
- text mining
- scientific research
- molecular biology
FingerprintDive into the research topics of 'Text mining for metabolic reaction extraction from scientific literature'. Together they form a unique fingerprint.
- 1 Finished