Computational Biology

Computational approaches applied to problems in biology.

Databases

CZ CELLxGENE

Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.

Gene Expression Omnibus

Public functional genomics database.

Single Cell Portal

Broad Institute portal for sharing and exploring single-cell RNA-seq studies.

Single Cell Expression Atlas

EMBL-EBI resource for exploring single-cell RNA-seq gene expression across species.

Drug Repurposing Hub

Collection of drug repurposing data (drugs, mechanisms of action, targets, etc.).

PathwayCommons

Database of pathways and interactions.

WikiPathways 9 updated 5y ago

Database of biological pathways.

Reactome

Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.

BioCyc

Collection of pathway/genome databases across thousands of organisms.

SIGNOR

Database of causal signaling interactions and pathways.

MSigDB (Molecular Signatures Database)

Curated gene sets derived from pathways and biological processes.

MassBank

Open source databases and tools for mass spectrometry reference spectra.

MoNA MassBank of North America

Meta-database of metabolite mass spectra, metadata, and associated compounds.

The Human Protein Atlas

Comprehensive human protein database (cells, tissues, organs).

Protein Data Bank (PDB)

3D structures of proteins, nucleic acids, complexes.

UniProt

Functional information on proteins.

AlphaFold Protein Structure Database 14.4k updated 4mo ago

3D protein structure predictions.

RCSB Protein Data Bank

Repository for structural data of biological molecules.

Critical Assessment of Structure Prediction (CASP)

Assessing methods for protein structure prediction.

Uniclust

Clustered protein sequence databases.

CATH database

Hierarchical classification of protein domain structures.

SAbDab

Structural Antibody Database containing all antibody structures in the PDB.

OAS (Observed Antibody Space)

Database of antibody sequences from immune repertoire sequencing.

NeXtProt

Expert knowledge base on human proteins with deep functional annotation, complementary to UniProt.

UniRef

Non-redundant sequence database clustering UniProtKB entries at multiple identity thresholds.

Pfam

Database of protein families described by multiple sequence alignments and hidden Markov models.

DISEASES

Gene–disease association database integrating evidence from text mining, curated databases, and experimental data.
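
Gene-set collections such as MSigDB are commonly distributed as tab-delimited GMT files, where each row holds a set name, a description field, and then the member genes. The following is a minimal sketch of reading that layout and checking a query gene list against it. It assumes plain-text input; the set names and genes used with it are toy values, and `parse_gmt` and `overlap` are hypothetical helper names, not part of any tool listed here.

```python
from typing import Dict, Iterable, List


def parse_gmt(text: str) -> Dict[str, List[str]]:
    """Parse GMT-formatted text into {gene_set_name: [gene, ...]}."""
    gene_sets: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need a name, a description, and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def overlap(gene_sets: Dict[str, List[str]], query: Iterable[str]) -> Dict[str, List[str]]:
    """Return, for each gene set, the sorted query genes it contains."""
    query_set = set(query)
    hits: Dict[str, List[str]] = {}
    for name, genes in gene_sets.items():
        shared = sorted(query_set.intersection(genes))
        if shared:
            hits[name] = shared
    return hits


if __name__ == "__main__":
    # Toy GMT content (hypothetical set names, for illustration only).
    toy = "TOY_SET_A\tna\tTP53\tMDM2\tCDKN1A\nTOY_SET_B\tna\tEGFR\tKRAS\n"
    sets = parse_gmt(toy)
    print(overlap(sets, ["TP53", "KRAS", "ACTB"]))
```

A real enrichment analysis would add a statistical test on top of this overlap; the sketch only covers the file layout and set lookup.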

Knowledge Graph

Drug Mechanism Database (DrugMechDB) 71 updated 8mo ago

Mechanisms of action from drug to disease.

DRKG 675 updated 4y ago

Large-scale biological knowledge graph for drug discovery.

Hetionet 347 updated 3y ago

Heterogeneous network integrating genes, diseases, drugs, pathways, and more.

PrimeKG 746 updated 2y ago

Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

Disease

ML-Samples

A collection of samples and tutorials for Microsoft Machine Learning.

Benchmarks & Datasets

FLIP (Fitness Landscape Inference for Proteins) 117 updated 4mo ago

Benchmark collection of protein fitness landscape datasets for evaluating protein ML models.

GuacaMol 508 updated 2y ago

Benchmark suite for generative molecular design models.

MOSES 966 updated 2y ago

Benchmarking platform for molecular generation models.

OpenBioLink 158 updated 2y ago

Benchmark datasets for biological knowledge graph completion.

ProteinGym 403 updated 6mo ago

Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.

scIB (Single-cell Integration Benchmarks) 408 updated 10mo ago

Comprehensive benchmarking framework for single-cell data integration methods.

TAPE (Tasks Assessing Protein Embeddings) 734 updated 3y ago

Benchmark suite of five biologically meaningful semi-supervised learning tasks for evaluating protein representations.

JUMP Cell Painting Datasets 184 updated 3mo ago

Consortium-scale cell imaging perturbation datasets (chemical and genetic) for phenotypic profiling and drug discovery research.

scPerturb 173 updated 1y ago

Curated and continuously updated single-cell perturbation data resource spanning CRISPR and drug perturbation studies.

Preprocessing Tools

Chemistry Development Kit 581 updated 3mo ago

Cheminformatics software & machine learning tools.

FlashDeconv 14 updated 4mo ago

High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).

RDKit 3.4k updated 4mo ago

Cheminformatics software & machine learning toolkit.

DeepChem 6.6k updated 4mo ago

Deep learning library for drug discovery, quantum chemistry, and materials science.

ChatSpatial 33 updated 2mo ago

MCP server for spatial transcriptomics analysis via natural language.

CellTypist 461 updated 4mo ago

Automated cell type annotation for scRNA-seq.

scVelo 495 updated 5mo ago

RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.

STAR 2.2k updated 1y ago

Ultrafast universal RNA-seq aligner with support for spliced alignment and single-cell quantification via STARsolo.

Harmony 643 updated 3mo ago

Fast and scalable integration of single-cell data across datasets, conditions, technologies, and species.

CellChat 780 (archived)

Inference and analysis of cell-cell communication ligand-receptor networks from single-cell transcriptomics data.

SCENIC 482 updated 2y ago

Single-cell regulatory network inference and clustering linking transcription factors to co-expressed gene modules.

DoubletFinder 545 updated 1y ago

Machine learning approach for detecting multiplet (doublet) artifacts in single-cell RNA-seq data.

Numbat 214 updated 5mo ago

Haplotype-aware copy number variation inference from single-cell RNA-seq using hidden Markov models.

CaSpER 87 updated 5y ago

CNV identification and visualization by integrative analysis of single-cell or bulk RNA-seq data.

CellCharter 172 updated 3mo ago

Identification and characterization of spatial cell niches from spatial transcriptomics using VAEs and Gaussian mixture models.

STAGATE 51 updated 3y ago

Adaptive graph attention auto-encoder for spatial domain identification in spatial transcriptomics.

NCEM 116 updated 2y ago

GNN-based model for learning intercellular communication from spatial graphs of cells.

DeepTalk 29 updated 1y ago

Graph attention network for deciphering cell-cell communication from spatial transcriptomics data.

COMMOT 137 updated 2y ago

Optimal transport-based framework for screening cell-cell communication in spatial transcriptomics.

TIGON 58 updated 1y ago

Neural optimal transport method for reconstructing growth and dynamic trajectories from single-cell transcriptomics.

LINGER 127 updated 4mo ago

Neural network for gene regulatory network inference from single-cell multiome (RNA+ATAC-seq) data with bulk data pretraining.

sciPENN 18 updated 4y ago

RNN-based method for simultaneous protein expression prediction, uncertainty estimation, and cell-type label transfer from CITE-seq and scRNA-seq data.

MOGONET 179 updated 5y ago

Multi-omics graph convolutional network framework for patient classification and biomarker identification.

Machine Learning Tasks and Models

Drug Response Prediction

drGAT 1 updated 4mo ago

Attention-based model for drug response prediction with gene explainability.

MOFGCN 7 updated 3y ago

Graph convolutional network over a heterogeneous cell line and drug network for drug response prediction.

RECOVER 24 updated 1y ago

Machine learning framework for predicting synergistic drug combination responses across cell lines.

DGDRP updated 2y ago

Multi-view embedding neural network.

DeepAEG 3 updated 2y ago

Drug response prediction combining GNN-based embeddings with an attention mechanism.

TGSA 23 updated 4y ago

Tumor gene set and attention-based model leveraging biological pathway knowledge for drug response prediction.

HiDRA

Hierarchical network model incorporating gene and pathway-level information for cancer drug response prediction.

PRNet 78 updated 1y ago

Deep generative model for predicting transcriptional responses to novel chemical perturbations for drug discovery.

chemCPA 149 updated 1y ago

Compositional perturbation autoencoder for predicting single-cell transcriptional responses to unseen drug perturbations and dose combinations.

cycleCDR 3 updated 2y ago

Interpretable cycle-consistency framework for modeling cellular responses to drug perturbations.

DRUML 11 updated 4y ago

Ensemble machine learning framework combining standard ML with deep learning to systematically rank anti-cancer drugs from proteomics and RNA-seq data.

Drug Repurposing

DeepPurpose 1.2k updated 2y ago

Deep learning library for drug repurposing.

TranSiGen 35 updated 1y ago

Dual-VAE architecture for ligand-based virtual screening, drug response prediction, and drug repurposing using chemical-induced transcriptional profiles.

Drug Target Interaction

NeoDTI 77 updated 5y ago

Library for drug-target interaction prediction.

DTINet 188 updated 3y ago

Network-based framework integrating heterogeneous biological data for DTI prediction.

DeepDTA 296 updated 2y ago

Deep learning model using CNNs on protein sequences and drug SMILES.

GraphDTA 296 updated 5y ago

Graph neural network–based DTI prediction using molecular graphs.

MolTrans 226 updated 4y ago

Transformer-based DTI model leveraging molecular substructures.

DrugBAN 145 updated 3y ago

Bilinear attention network for interpretable DTI prediction.
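
Sequence-based models in this section, such as DeepDTA, take a drug SMILES string and a protein sequence as input, which first have to be turned into fixed-length integer arrays before a CNN or Transformer can consume them. The following is a rough sketch of that preprocessing step under stated assumptions: character-level tokens, a vocabulary built from the data at hand, and arbitrary maximum lengths. The helper names `build_vocab` and `encode` are hypothetical and are not taken from any of the listed repositories.

```python
from typing import Dict, Iterable, List

PAD = 0  # index reserved for padding (assumption of this sketch)


def build_vocab(sequences: Iterable[str]) -> Dict[str, int]:
    """Map every character seen in the sequences to an integer >= 1."""
    chars = sorted({ch for seq in sequences for ch in seq})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(seq: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or right-pad a sequence to max_len integer tokens.

    Characters missing from the vocabulary fall back to PAD in this toy version.
    """
    ids = [vocab.get(ch, PAD) for ch in seq[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    smiles = ["CCO", "CC(=O)O"]        # ethanol, acetic acid
    proteins = ["MKTAYIAK", "MKV"]     # toy sequence fragments
    smiles_vocab = build_vocab(smiles)
    protein_vocab = build_vocab(proteins)
    print(encode("CCO", smiles_vocab, max_len=8))
    print(encode("MKV", protein_vocab, max_len=8))
```

Character-level encoding splits two-letter atoms such as Cl or Br into separate tokens; substructure-based models like MolTrans use a different tokenization, so this sketch only illustrates the simplest case.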

Compound-Protein Interaction

MCPINN 3 updated 2y ago

Drug discovery via compound-protein interaction and machine learning.

TransformerCPI 154 updated 4y ago

Compound-protein interaction (CPI) prediction using a Transformer.

Molecular Generation

REINVENT 372 (archived)

Reinforcement learning for de novo drug design.

MolGPT 169 updated 3y ago

Transformer-based model for molecular generation.

Molecular Transformer 419 updated 4y ago

Sequence-to-sequence Transformer for predicting chemical reaction products from reactants and reagents.

TargetDiff 325 updated 2y ago

3D equivariant diffusion model for structure-based drug design.

DiffDock 1.5k updated 1y ago

Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.

JTVAE 557 updated 3y ago

Junction tree variational autoencoder for molecular graph generation that guarantees chemical validity via a hierarchical tree decomposition.

DiffSBDD 503 updated 1y ago

Equivariant diffusion model for structure-based drug design that generates molecules and binding conformations for protein targets.

ReLeaSE 368 updated 4y ago

Deep reinforcement learning framework for de novo drug design combining a generative and predictive model.

PaccMannRL 10 updated 2y ago

Reinforcement learning-based generative model for de novo hit-like anticancer molecule design from transcriptomic data.

LLM for Biology

BioGPT 4.5k updated 2y ago

LLM for biomedical text generation.

GeneGPT 423 updated 1y ago

LLM for biomedical information, integrated with various APIs.

GenePT 314 updated 2y ago

Gene and cell embeddings for single-cell data built from LLM text embeddings of gene descriptions.

scPRINT 143 updated 5mo ago

Pretrained on 50M cells for scRNA-seq denoising & zero imputation.

ClawBio 771 updated 2mo ago

Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.

MolT5 194 updated 2y ago

Language model for molecular tasks bridging text and SMILES, enabling molecule captioning and text-driven molecule generation.

ChatDrug 160 updated 2y ago

LLM-based conversational pipeline for drug discovery, using natural language prompts for iterative drug editing and optimization.

Transcriptomics Foundation Models

scFoundation 398 updated 8mo ago

Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.

scGPT 1.5k updated 8mo ago

Transformer-based foundation model pretrained on millions of single-cell profiles.

BulkFormer 50 updated 7mo ago

Foundation model for bulk RNA-seq data; learns general transcriptomic representations.

scBERT 351 updated 2y ago

BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.

CellPLM 101 updated 2y ago

Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.

UCE 248 updated 5mo ago

Universal Cell Embeddings: zero-shot single-cell embedding model trained on 36M cells across species, tissues, and assays without fine-tuning.

GEARS 347 updated 1y ago

Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.

Geneformer

Context-aware, attention-based deep learning model pretrained on a large corpus of single-cell transcriptomes.

SATURN 164 updated 2y ago

Transformer-based model integrating gene expression and protein sequences via a protein language model to learn unified multi-species cell embeddings.

CancerFoundation 30 updated 10mo ago

Single-cell RNA-seq foundation model trained exclusively on a curated dataset of malignant cells to learn cancer-specific embeddings.

Spatial Foundation Models

GigaPath 584 updated 1y ago

Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.

UNI 697 updated 1y ago

General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.

CONCH 487 updated 1y ago

Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.

Phikon

ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.

Nicheformer 161 updated 8mo ago

Foundation model for single-cell and spatial omics using a transformer architecture with positional embeddings to encode spatial cell information.

scGPT-spatial 135 updated 1y ago

Extension of scGPT for spatial transcriptomics with continual pretraining and a mixture-of-experts decoder for spatial gene expression analysis.

Multi-Omics Foundation Models

scMulan 62 updated 2y ago

Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.

MultiVI 1.6k updated 4mo ago

Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.

MIRA 68 updated 1y ago

Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.

GLUE 464 updated 5mo ago

Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.

BABEL 47 updated 3y ago

Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.

Multigrate 32 updated 4mo ago

Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.

MOFA+ 388 updated 5mo ago

Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.

GeneCompass 111 updated 5mo ago

Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.

UnitedNet 52 updated 2y ago

Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.

SpatialGlue

Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.

MIDAS 66 updated 2mo ago

Mosaic integration and knowledge transfer model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.

Concerto 40 updated 3y ago

Contrastive self-supervised learning framework for single-cell multimodal data integration, batch correction, and reference-query mapping.

scButterfly 28 updated 2y ago

Dual-aligned variational autoencoder for single-cell cross-modality translation between paired and unpaired multiomics data.

JAMIE 16 updated 10mo ago

Joint variational autoencoder for multimodal single-cell data imputation and embedding.

scPair 10 updated 10mo ago

Bidirectional feedforward network for single-cell multimodal analysis with cross-modality prediction leveraging single-cell atlases.

Domain Alignment

scArches 400 updated 4mo ago

Transfer learning framework for mapping new single-cell datasets onto pre-trained reference atlases across batches, conditions, and modalities.

TOSICA

Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.

Protein Structure Prediction and Design

ESMFold 4.0k (archived)

Fast protein structure prediction using language model embeddings.

AlphaFold3 7.8k updated 4mo ago

Predicts structures of proteins, nucleic acids, small molecules, and their complexes.

Boltz-1 3.9k updated 4mo ago

Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes, with reported accuracy approaching that of AlphaFold3.

Chai-1 1.9k updated 4mo ago

Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.

ESM3 2.3k updated 4mo ago

Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.

RFdiffusion 2.8k updated 8mo ago

Generative model for protein backbone design using diffusion.

ProteinMPNN 1.7k updated 1y ago

Deep learning model for protein sequence design given backbone structure.

OmegaFold 615 updated 3y ago

High-resolution de novo protein structure prediction from sequence.

RoseTTAFold 2.2k updated 2y ago

Three-track neural network for protein structure prediction.

OpenFold 3.3k updated 7mo ago

Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.

SaProt

Structure-aware protein language model using structure-aware tokens that encode both sequence and backbone geometry for improved function prediction.

EvoDiff 668 updated 6mo ago

Discrete diffusion framework for protein sequence generation trained on evolutionary-scale data, supporting unconditional generation, disordered region design, and functional motif scaffolding.

Compound Embedding

ChemBERTa-2 488 updated 1y ago

RoBERTa-based molecular language model pretrained on SMILES for small-molecule representation learning.

GROVER 387 updated 5mo ago

Self-supervised graph transformer for large-scale molecular representation learning from unlabeled compounds.

Mol2Vec 288 (archived)

Unsupervised molecular embedding method inspired by Word2Vec for learning vector representations of chemical substructures.

MolFormer 389 updated 10mo ago

Linear attention transformer pretrained on millions of SMILES strings for efficient molecular embeddings.

Uni-Mol 1.1k updated 1y ago

3D molecular pretraining framework for universal representation learning on molecules and protein pockets.

CHIEF 706 updated 6mo ago

Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.

BiomedCLIP

CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.

PORPOISE 243 updated 3y ago

Pan-cancer integrative histology-genomic analysis framework using multimodal deep learning for patient stratification.

PathomicFusion 330 updated 3y ago

Integrated framework fusing histopathology and genomic features via CNN, GNN, and attention gating for cancer diagnosis and prognosis.

Virchow

Million-slide digital pathology foundation model using a vision transformer and self-supervised distillation for tile-level pathology image representation.

TOAD 182 updated 4y ago

Tumor Origin Assessment via Deep-learning; weakly-supervised multi-task model predicting cancer primary origin from H&E whole-slide images.

PLIP 377 updated 2y ago

Vision-language foundation model for pathology trained with contrastive learning on pathology image–text pairs for image classification and text-to-image retrieval.

MUSK 230 updated 9mo ago

Vision-language foundation model for precision oncology analyzing multimodal paired text and pathology image data for biomarker prediction and retrieval.