Biomedical Information Extraction
How to extract information from unstructured biomedical data and text.
Code Libraries
Python tools intended primarily for bioinformatics and computational molecular biology, but also a convenient way to obtain data, including documents and abstracts from PubMed (see Chapter 9 of the documentation).
A system for building predictive medical natural language processing models. Built on the spaCy framework.
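The PubMed retrieval mentioned for the first library above can be illustrated without any third-party package. The following is a minimal sketch, not that library's own interface (which is covered in its documentation): it assumes direct use of the public NCBI E-utilities EFetch endpoint through the Python standard library. The function names `build_efetch_url` and `fetch_abstracts` are hypothetical and exist only for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, email=None):
    """Build an EFetch URL requesting plain-text abstracts for the given PMIDs.

    Hypothetical helper for illustration; it only assembles the query string.
    """
    params = {
        "db": "pubmed",                          # search the PubMed database
        "id": ",".join(str(p) for p in pmids),   # comma-separated PMIDs
        "rettype": "abstract",                   # return abstracts
        "retmode": "text",                       # as plain text
    }
    if email:
        # Optional contact address passed along with the request.
        params["email"] = email
    return EFETCH_URL + "?" + urlencode(params)


def fetch_abstracts(pmids, email=None, timeout=30):
    """Download the abstracts as a single text blob (performs a network request)."""
    with urlopen(build_efetch_url(pmids, email), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_abstracts` with a list of PubMed identifiers returns the abstracts as one string, ready to pass to a text mining pipeline. Building the URL in a separate function keeps the request parameters easy to inspect without touching the network.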
Tools, Platforms, and Services
A system for processing the text in electronic medical records. Widely used and open source.
A system for processing documents describing cancer presentations. Based on cTAKES (see above).
A framework for running text mining tools on the newest set(s) of documents from PubMed.
BERT models
GPT-2 models
Text Embeddings
Datasets
A corpus of scholarly manuscripts concerning COVID-19. Articles come primarily from PubMed Central and preprint servers, though the set also includes metadata on papers whose full text is not available.
paper - 120 full-text articles annotated for protein-protein interactions (PPI) and genetic interactions. Used in the BioCreative V BioC task.
paper - 77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions, which are closely related to PPI annotations. Additional information is here.
The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic images and accompanying free-text radiology reports. As with MIMIC-III, requires acceptance of a data use agreement.
Deidentified health data from ~60,000 intensive care unit admissions. Requires completion of an online training course (CITI training) and acceptance of a data use agreement prior to use.