Awesome Biomedical Information Extraction

Code Libraries

Biopython

Python tools intended primarily for bioinformatics and computational molecular biology, but also a convenient way to obtain data, including documents and abstracts from PubMed (see Chapter 9 of the documentation).
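As a rough illustration of retrieving abstracts from PubMed, the sketch below assumes the Biopython package is installed and uses its `Bio.Entrez` module, which wraps NCBI's E-utilities. The helper names `build_efetch_params` and `fetch_pubmed_abstracts` are hypothetical, chosen for this example only. NCBI asks callers to identify themselves with an email address, so one is passed in explicitly.

```python
def build_efetch_params(pmids):
    """Build keyword arguments for an efetch request against PubMed.

    Hypothetical helper for this example; it only assembles parameters
    and performs no network access.
    """
    ids = [str(p).strip() for p in pmids if str(p).strip()]
    if not ids:
        raise ValueError("at least one PubMed ID is required")
    return {
        "db": "pubmed",          # query the PubMed database
        "id": ",".join(ids),     # comma-separated list of PubMed IDs
        "rettype": "abstract",   # return abstracts rather than full records
        "retmode": "text",       # plain text instead of XML
    }


def fetch_pubmed_abstracts(pmids, email):
    """Fetch plain-text abstracts for the given PubMed IDs.

    Requires Biopython and network access; the import is deferred so the
    parameter helper above can be used without Biopython installed.
    """
    from Bio import Entrez

    Entrez.email = email  # contact address sent along with each request
    handle = Entrez.efetch(**build_efetch_params(pmids))
    try:
        return handle.read()
    finally:
        handle.close()
```

Calling `fetch_pubmed_abstracts(["<PubMed ID>"], "you@example.org")` with real identifiers and your own address should return the abstracts as one text string.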

Bio-SCoRes

A framework for biomedical coreference resolution.

medaCy

A system for building predictive medical natural language processing models. Built on the spaCy framework.

ScispaCy

A version of the spaCy framework for scientific and biomedical documents.

rentrez

R utilities for accessing NCBI resources, including PubMed.

code

A Python package and model (for use with spaCy) for named entity recognition (NER) of medication-related concepts.

Repos for Specific Datasets

mimic-code

Code associated with the MIMIC-III dataset (see below). Includes some helpful tutorials.

Tools, Platforms, and Services

cTAKES

A system for processing the text in electronic medical records. Widely used and open source.

DeepPhe

A system for processing documents describing cancer presentations. Based on cTAKES (see above).

Pubrunner

A framework for running text mining tools on the newest set(s) of documents from PubMed.

SemEHR (archived)

An information extraction (IE) infrastructure for electronic health records (EHR). Built on the CogStack project.

TabInOut

A framework for information extraction (IE) from tables in the literature.

Annotation Tools

Anafora

An annotation tool with adjudication and progress tracking features.

brat

The brat rapid annotation tool. Supports producing text annotations visually, through the browser. Not subject specific; appropriate for many annotation projects. Visualization is based on that of the stav tool.
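Annotations produced in brat are stored in a standoff `.ann` file alongside the source text, where each text-bound annotation is a tab-separated line of the form `ID<TAB>TYPE START END<TAB>TEXT`. The sketch below reads those lines using only the standard library. The names `TextBound` and `parse_text_bound` are hypothetical, the sample annotations are invented for illustration, and discontinuous spans (offsets joined with `;`) are skipped for simplicity.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    """One contiguous text-bound annotation from a brat .ann file."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_text_bound(ann_text: str) -> List[TextBound]:
    """Extract contiguous text-bound (T) annotations from .ann content."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes, notes, etc.
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # malformed line
        ann_id, type_and_offsets, text = parts
        if ";" in type_and_offsets:
            continue  # discontinuous span; not handled in this sketch
        fields = type_and_offsets.split()
        if len(fields) != 3:
            continue
        label, start, end = fields
        spans.append(TextBound(ann_id, label, int(start), int(end), text))
    return spans


# Invented example document and annotations, for illustration only.
document = "diabetes treated with metformin"
sample_ann = (
    "T1\tDisease 0 8\tdiabetes\n"
    "T2\tDrug 22 31\tmetformin\n"
    "R1\tTreats Arg1:T2 Arg2:T1\n"
)
spans = parse_text_bound(sample_ann)
```

Because the offsets index into the source text, `document[span.start:span.end]` should equal `span.text` for each parsed span, which makes a convenient sanity check when loading a corpus.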

MedTator

An annotation tool designed to have minimal dependencies.

BERT models

BioBERT

A PubMed and PubMed Central-trained version of the BERT language model.

SciBERT

A BERT model trained on >1M papers from the Semantic Scholar database.

BlueBERT

A BERT model pre-trained on PubMed text and MIMIC-III notes.

ClinicalBERT

Alsentzer et al. Clinical BERT

paper

Huang et al. ClinicalBERT

paper

GPT-2 models

BioGPT

A GPT-2 model pre-trained on 15 million PubMed abstracts, along with fine-tuned versions for several biomedical tasks.

Text Embeddings

BioWordVec

Word embeddings derived from biomedical text (more than 27 million PubMed titles and abstracts), including a subword embedding model based on MeSH.

Datasets

CORD-19

A corpus of scholarly manuscripts concerning COVID-19. Articles are primarily from PubMed Central and preprint servers, though the set also includes metadata on papers without full-text availability.

BioC-BioGRID

paper - 120 full-text articles annotated for protein-protein interactions (PPI) and genetic interactions. Used in the BioCreative V BioC task.

LLL

paper - 77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions (so, fairly close to PPI annotations). Additional information is here.

MIMIC-CXR

The MIMIC Chest X-Ray database. Contains more than 377,000 radiographic images and accompanying free-text radiology reports. As with MIMIC-III, requires acceptance of a data use agreement.

MIMIC-III

Deidentified health data from ~60,000 intensive care unit admissions. Requires completion of an online training course (CITI training) and acceptance of a data use agreement prior to use.

MIMIC-IV

An update to MIMIC-III's multimodal patient data, now covering more recent years of admissions, plus a new data structure, emergency department records, and links to MIMIC-CXR images.

Annotated Text Data

CRAFT

67 full-text biomedical articles annotated in a variety of ways, including for concepts and coreferences. Now on version 5, including annotations linking concepts to the MONDO disease ontology.

Other Datasets

eICU Collaborative Research Database

A database of observations from more than 200,000 intensive care unit admissions, with a consistent structure. Requires registration, completion of a training course, and acceptance of a data use agreement.

Ontologies and Controlled Vocabularies

Disease Ontology

An ontology of human diseases. Has cross-links to MeSH, ICD, NCI Thesaurus, SNOMED, and OMIM. Public domain. Available on GitHub and on the OBO Foundry.

Data Models

Biolink

A data model of biological entities. Provided as a YAML file.

OMOP Common Data Model

A standard for observational healthcare data.

Other models

Flair embeddings from PubMed

A language model available through the Flair framework and embedding method. Trained on a 5% sample of PubMed abstracts published through 2015 (more than 1.2 million abstracts in total).
