Computational Biology
Computational approaches applied to problems in biology.
Contents
- Drug Response Prediction
- Drug Repurposing
- Drug Target Interaction
- Compound-Protein Interaction
- Molecular Generation
- LLM for Biology
- Transcriptomics Foundation Models
- Spatial Foundation Models
- Multi-Omics Foundation Models
- Domain Alignment
- Protein Structure Prediction and Design
- Compound Embedding
- Multi-Modal Foundation Models
- Genomics Foundation Models
- Pre-trained Embedding
Databases
Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
Curated gene sets derived from pathways and biological processes.
Meta-database of metabolite mass spectra, metadata, and associated compounds.
Community-wide experiment for assessing protein structure prediction methods.
Database of antibody sequences from immune repertoire sequencing.
Expert knowledge base on human proteins with deep functional annotation, complementary to UniProt.
Non-redundant sequence database clustering UniProtKB entries at multiple identity thresholds.
Database of protein families described by multiple sequence alignments and hidden Markov models.
Benchmarks & Datasets
Benchmark collection of protein fitness landscape datasets for evaluating protein ML models.
Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.
Comprehensive benchmarking framework for single-cell data integration methods.
Benchmark suite of five biologically meaningful semi-supervised learning tasks for evaluating protein representations.
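Fitness-landscape benchmarks like those above are commonly scored by rank correlation between a model's predicted scores and the measured assay values. The following is a minimal sketch of Spearman rank correlation in plain Python, not the scoring code of any listed benchmark. The function names `average_ranks` and `spearman` and the example numbers are hypothetical and chosen for illustration only.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical model scores and assay measurements for four variants.
    predicted = [0.1, 0.4, 0.35, 0.9]
    measured = [1.2, 2.0, 1.9, 3.1]
    print(spearman(predicted, measured))
```

Rank correlation is a natural fit here because predicted scores and assay readouts usually live on different scales, so only the ordering of variants is comparable.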
Preprocessing Tools
High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
Deep learning library for drug discovery, quantum chemistry, and materials science.
RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.
Ultrafast universal RNA-seq aligner with support for spliced alignment and single-cell quantification via STARsolo.
Fast and scalable integration of single-cell data across datasets, conditions, technologies, and species.
Inference and analysis of cell-cell communication ligand-receptor networks from single-cell transcriptomics data.
Single-cell regulatory network inference and clustering linking transcription factors to co-expressed gene modules.
Machine learning approach for detecting multiplet (doublet) artifacts in single-cell RNA-seq data.
Haplotype-aware copy number variation inference from single-cell RNA-seq using hidden Markov models.
CNV identification and visualization by integrative analysis of single-cell or bulk RNA-seq data.
Identification and characterization of spatial cell niches from spatial transcriptomics using VAEs and Gaussian mixture models.
Adaptive graph attention auto-encoder for spatial domain identification in spatial transcriptomics.
GNN-based model for learning intercellular communication from spatial graphs of cells.
Graph attention network for deciphering cell-cell communication from spatial transcriptomics data.
Optimal transport-based framework for screening cell-cell communication in spatial transcriptomics.
Neural optimal transport method for reconstructing growth and dynamic trajectories from single-cell transcriptomics.
Neural network for gene regulatory network inference from single-cell multiome (RNA+ATAC-seq) data with bulk data pretraining.
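Several of the tools above are described as building on optimal transport. As a rough illustration of the underlying idea only, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations in plain Python. It is a generic textbook formulation, not the algorithm of any listed tool; the function name `sinkhorn_plan`, the regularization strength, and the toy cost matrix are assumptions made for this example.

```python
import math
from typing import Sequence


def sinkhorn_plan(
    cost: Sequence[Sequence[float]],
    source: Sequence[float],
    target: Sequence[float],
    epsilon: float = 0.1,
    n_iters: int = 200,
) -> list[list[float]]:
    """Entropic optimal transport plan between two discrete distributions.

    cost[i][j] is the cost of moving mass from source bin i to target bin j;
    source and target are non-negative weights that each sum to 1.
    """
    n, m = len(source), len(target)
    # Gibbs kernel: low-cost pairs receive large kernel values.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [source[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [target[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Toy example: two "sender" and two "receiver" groups with a made-up cost.
    cost = [[0.0, 1.0], [1.0, 0.0]]
    plan = sinkhorn_plan(cost, [0.5, 0.5], [0.5, 0.5])
    for row in plan:
        print([round(x, 4) for x in row])
```

Smaller values of `epsilon` push the plan toward the unregularized optimum but can underflow in the exponential; production code typically works in the log domain or relies on a dedicated library.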
Machine Learning Tasks and Models
Drug Response Prediction
Attention-based model for drug response prediction with gene explainability.
Machine learning framework for predicting synergistic drug combination responses across cell lines.
Attention-based model over tumor gene sets that leverages biological pathway knowledge for drug response prediction.
Hierarchical network model incorporating gene and pathway-level information for cancer drug response prediction.
Deep generative model for predicting transcriptional responses to novel chemical perturbations for drug discovery.
Compositional perturbation autoencoder for predicting single-cell transcriptional responses to unseen drug perturbations and dose combinations.
Drug Repurposing
Drug Target Interaction
Compound-Protein Interaction
Molecular Generation
Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.
Junction tree variational autoencoder for molecular graph generation that guarantees chemical validity via a hierarchical tree decomposition.
Equivariant diffusion model for structure-based drug design that generates molecules and binding conformations for protein targets.
LLM for Biology
Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.
Transcriptomics Foundation Models
Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
Transformer-based foundation model pretrained on millions of single-cell profiles.
Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.
Universal Cell Embeddings: single-cell embedding model trained on 36M cells across species, tissues, and assays that embeds new datasets zero-shot, without fine-tuning.
Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.
Context-aware, attention-based deep learning model pretrained on a large corpus of single-cell transcriptomes.
Spatial Foundation Models
Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.
Multi-Omics Foundation Models
Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.
Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
Mosaic integration and differential accessibility model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.
Contrastive self-supervised learning framework for single-cell multimodal data integration, batch correction, and reference-query mapping.
Dual-aligned variational autoencoder for single-cell cross-modality translation between paired and unpaired multiomics data.
Domain Alignment
Protein Structure Prediction and Design
Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes, with reported accuracy approaching that of AlphaFold3.
Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
Deep learning model for protein sequence design given backbone structure.
Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.
Compound Embedding
RoBERTa-based molecular language model pretrained on SMILES for small-molecule representation learning.
Self-supervised graph transformer for large-scale molecular representation learning from unlabeled compounds.
Unsupervised molecular embedding method inspired by Word2Vec for learning vector representations of chemical substructures.
Multi-Modal Foundation Models
Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.
CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.
Pan-cancer integrative histology-genomic analysis framework using multimodal deep learning for patient stratification.
Integrated framework fusing histopathology and genomic features via CNN, GNN, and attention gating for cancer diagnosis and prognosis.
Million-slide digital pathology foundation model using a vision transformer and self-supervised distillation for tile-level pathology image representation.
Tumor Origin Assessment via Deep-learning; weakly-supervised multi-task model predicting cancer primary origin from H&E whole-slide images.
Genomics Foundation Models
Foundation model for genomic sequences across multiple species.
Bidirectional equivariant long-range DNA sequence model based on Mamba.
Long-range genomic foundation model handling sequences of up to 1M tokens using a sub-quadratic alternative to attention.
Extended successor to Enformer for predicting RNA-seq coverage from long genomic sequence windows (524 kb) with improved resolution.