Lars Meyer: GEOROC Information Extraction Pipeline (Lab Rotation Project, 2024)
Student: Lars Meyer
Year: 2024
Duration: 346h / 12 Credits
Lab Rotation: Learning outcome, core skills: After successful completion of the module, students are able to plan and conduct a research project, and present its results; they acquire project management skills and learn to work collaboratively in a data science team.
Description: Automating information extraction from scientific literature with large language models (LLMs) faces several key challenges because scientific texts are highly complex. LLMs often misinterpret numerical data paired with units, such as “87 ppm of Sr,” or struggle with context, for instance when determining whether a described method applies to a specific experiment or is part of a comparison. Critical information in figures and tables, such as elemental concentrations, remains inaccessible and is often only partly described in the text. Additionally, LLMs may extract isolated details while missing relationships across sections, such as the link between a measurement technique in the methods section and its results. This project explores the potential of automating information extraction from scientific literature, specifically for the GEOROC use case. The goal is to begin developing a framework to test and evaluate the capabilities of LLMs in extracting geochemical data from scientific papers. To this end, solutions for automating text extraction, dataset creation, LLM querying, and result evaluation are implemented in the Python scripts within this directory. For the initial exploration, the scope of this project is limited to a small-scale experiment focused on extracting the geochemical methods used in papers.
Results: https://gitlab.gwdg.de/lars.meyer04/information_extraction_pipeline_meyer
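Illustration: The result evaluation step named in the description could, for example, compare the geochemical methods an LLM extracts from a paper against a hand-labelled list. The following is a minimal sketch under that assumption and is not taken from the project repository; the function names `normalize` and `score_methods` and the example method labels are hypothetical.

```python
from typing import Dict, Iterable


def normalize(method: str) -> str:
    """Lowercase and keep only letters and digits, so 'ICP-MS' and 'icp ms' compare equal."""
    return "".join(ch for ch in method.lower() if ch.isalnum())


def score_methods(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Set-based precision, recall and F1 for the methods extracted from one paper.

    `predicted` holds the method names returned by the LLM (assumed input format),
    `gold` holds the manually annotated method names for the same paper.
    """
    pred = {normalize(m) for m in predicted}
    ref = {normalize(m) for m in gold}
    true_positives = len(pred & ref)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical example: the model reports one method that is not in the annotation.
    scores = score_methods(["ICP-MS", "XRF", "TIMS"], ["icp ms", "XRF"])
    print(scores)
```

Exact matching after normalization is a deliberately simple choice; it does not recognise synonyms or spelled-out names (for example "X-ray fluorescence" versus "XRF"), which a real evaluation would likely need to handle.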