Computational Biology
Computational approaches applied to problems in biology.
Databases
Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.
Public functional genomics database.
Public database for single-cell RNA-seq data.
Public database for single-cell RNA-seq data.
Collection of drug repurposing data (drugs, mechanisms of action, targets, etc.).
Database of pathways and interactions.
Database of biological pathways.
Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.
Collection of pathway/genome databases across thousands of organisms.
Database of causal signaling interactions and pathways.
Curated gene sets derived from pathways and biological processes.
Open source databases and tools for mass spectrometry reference spectra.
Meta-database of metabolite mass spectra, metadata, and associated compounds.
Comprehensive human protein database covering cells, tissues, and organs.
3D structures of proteins, nucleic acids, and their complexes.
Functional information on proteins.
3D protein structure predictions.
Repository for structural data of biological molecules.
Assessment of methods for protein structure prediction.
Clustered protein sequence databases.
Hierarchical classification of protein domain structures.
Structural Antibody Database containing all antibody structures in the PDB.
Database of antibody sequences from immune repertoire sequencing.
Knowledge Graph
Mechanisms of action from drug to disease.
Large-scale biological knowledge graph for drug discovery.
Heterogeneous network integrating genes, diseases, drugs, pathways, and more.
Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.
Benchmarks & Datasets
Benchmark collection of protein fitness landscape datasets for evaluating protein ML models.
Benchmark suite for generative molecular design models.
Benchmarking platform for molecular generation models.
Benchmark datasets for biological knowledge graph completion.
Large-scale benchmark of deep mutational scanning assays for evaluating protein fitness landscape models.
Comprehensive benchmarking framework for single-cell data integration methods.
Benchmark suite of five biologically meaningful semi-supervised learning tasks for evaluating protein representations.
Preprocessing Tools
Cheminformatics software & machine learning tools.
High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).
Cheminformatics software & machine learning toolkit.
Deep learning library for drug discovery, quantum chemistry, and materials science.
MCP server for spatial transcriptomics analysis via natural language.
Automated cell type annotation for scRNA-seq.
RNA velocity estimation for single-cell transcriptomics, inferring the direction and speed of cell differentiation.
Machine Learning Tasks and Models
Attention-based model for drug response prediction with gene explainability.
Graph convolutional network (GCN) model over a heterogeneous network.
Machine learning framework for predicting synergistic drug combination responses across cell lines.
Deep learning library for drug repurposing.
Library for drug-target interaction prediction.
Network-based framework integrating heterogeneous biological data for DTI prediction.
Deep learning model using CNNs on protein sequences and drug SMILES.
Graph neural network–based DTI prediction using molecular graphs.
Transformer-based DTI model leveraging molecular substructures.
Bilinear attention network for interpretable DTI prediction.
Drug discovery via compound-protein interaction and machine learning.
Compound-protein interaction (CPI) prediction using a Transformer.
Reinforcement learning for de novo drug design.
Transformer-based model for molecular generation.
Sequence-to-sequence model for retrosynthesis prediction.
3D equivariant diffusion model for structure-based drug design.
Diffusion generative model for molecular docking, predicting the binding pose of small molecules to protein targets.
LLM for Biology
LLM for biomedical text generation.
LLM for biomedical information, integrated with various APIs.
Foundation LLM for single-cell data.
Model pretrained on 50M cells for scRNA-seq denoising and zero imputation.
Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.
Single-cell Foundation Models
Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.
Transformer-based foundation model pretrained on millions of single-cell profiles.
Foundation model for bulk RNA-seq data; learns general transcriptomic representations.
BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.
Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.
Universal Cell Embeddings: single-cell embedding model trained on 36M cells across species, tissues, and assays; embeds new datasets zero-shot, without fine-tuning.
Graph-based model for predicting transcriptional responses to single and combinatorial genetic perturbations using biological priors.
Spatial Foundation Models
Slide-level digital pathology foundation model pretrained on 1.3 billion pathology image tiles from whole-slide images.
General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.
Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.
Multi-Omics Foundation Models
Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.
Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.
Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.
Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.
Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.
Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.
Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.
Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.
Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.
Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.
Mosaic integration and differential accessibility model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.
Domain Alignment
Transfer learning framework for mapping new single-cell datasets onto pre-trained reference atlases across batches, conditions, and modalities.
Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.
Protein Foundation Models
Fast protein structure prediction using language model embeddings.
Chemical embeddings & prediction.
Predicts structures of proteins, nucleic acids, small molecules, and their complexes.
Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes, reported to reach AlphaFold3-level accuracy.
Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.
Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.
Generative model for protein backbone design using diffusion.
Deep learning model for protein sequence design given backbone structure.
High-resolution de novo protein structure prediction from sequence.
Three-track neural network for protein structure prediction.
Trainable, memory-efficient open-source reproduction of AlphaFold2 enabling custom protein structure prediction workflows.
Structure-aware protein language model whose tokens encode both sequence and backbone geometry for improved function prediction.
Discrete diffusion framework for protein sequence generation trained on evolutionary-scale data, supporting unconditional generation, disordered region design, and functional motif scaffolding.
Multi-Modal Foundation Models
Genomics Foundation Models
Foundation model for genomic sequences across multiple species.
Pre-trained bidirectional encoder for DNA sequence analysis.
Improved genome foundation model with efficient tokenization.
Transformer model predicting gene expression from DNA sequence.
Sequential regulatory activity prediction from DNA sequences.
Bidirectional equivariant long-range DNA sequence model based on Mamba.
Long-context genomic foundation model (up to 1M tokens).
Long-range genomic foundation model handling sequences up to 1M tokens with sub-quadratic attention.
Extended successor to Enformer for predicting RNA-seq coverage from long genomic sequence windows (524 kb) with improved resolution.