Masked language model for DNA sequences enabling zero-shot variant effect prediction without requiring functional annotations.

GPN (Genomic Pre-trained Network)

Code and resources for genomic language models GPN, GPN-MSA, PhyloGPN and GPN-Star.

Installation

Install directly from GitHub:

pip install git+https://github.com/songlab-cal/gpn.git

For development (editable install):

git clone https://github.com/songlab-cal/gpn.git
cd gpn
pip install -e .

Modeling frameworks

| Model | Paper | Notes |
|---|---|---|
| GPN | Benegas et al. 2023 | Requires unaligned genomes |
| GPN-MSA | Benegas et al. 2025 | Requires aligned genomes for both training and inference [deprecated in favor of GPN-Star] |
| PhyloGPN | Albors et al. 2025 | Uses an alignment during training, but does not require it for inference or fine-tuning |
| GPN-Star | Ye et al. 2025 | Requires aligned genomes for both training and inference |

GPN

A single-sequence genomic language model trained on unaligned genomes. Also known as GPN-SS.

Quick start

import gpn.model  # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-brassicales")

Papers

Benegas, Batra and Song "DNA language models are powerful predictors of genome-wide variant effects" PNAS (2023)

Benegas, Eraslan and Song "Benchmarking DNA sequence models for causal regulatory variant prediction in human genetics" bioRxiv (2025)

Sorghum gene expression prediction (unpublished)

Training on your own data

1. Create a dataset

Use the Snakemake workflow to create a dataset:

  • The workflow can automatically download data from NCBI given a list of accessions, or use your own FASTA files
  • Navigate to workflow/make_dataset/, configure config/config.yaml and config/assemblies.tsv, then run:
    snakemake --cores all
    
2. Train the model

Training features:

  • Automatically detects all available GPUs
  • Track metrics on Weights & Biases
  • Implemented encoders: convnet (default), roformer (Transformer), bytenet
  • Specify config overrides: e.g. --config_overrides encoder=bytenet,num_hidden_layers=30
  • The number of steps you can train for without overfitting depends on the size and diversity of your dataset
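
The override string is a comma-separated list of key=value pairs. The sketch below shows one way such a string could be turned into typed values. `parse_overrides` is a hypothetical helper written for illustration, not part of gpn, and the parsing gpn actually performs may differ. Hugging Face's `PretrainedConfig.update_from_string` accepts the same general format.

```python
def parse_overrides(spec: str) -> dict:
    """Parse 'key=value,key=value' into a dict, guessing bool/int/float types.

    Hypothetical helper for illustration only; not part of gpn.
    """
    overrides = {}
    for item in spec.split(","):
        if not item.strip():
            continue  # tolerate trailing commas
        key, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        key, raw = key.strip(), raw.strip()
        if raw.lower() in ("true", "false"):
            value = raw.lower() == "true"
        else:
            # Try int first, then float, and fall back to the raw string.
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw
        overrides[key] = value
    return overrides


overrides = parse_overrides("encoder=bytenet,num_hidden_layers=30")
print(overrides)  # {'encoder': 'bytenet', 'num_hidden_layers': 30}
```

Note that num_hidden_layers comes back as an integer while encoder stays a string, which is why typed parsing matters when overrides are applied to a model config.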

Example command:

WANDB_PROJECT=your_project torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_mlm --do_train --do_eval \
    --report_to wandb --prediction_loss_only True --remove_unused_columns False \
    --dataset_name results/dataset --tokenizer_name gonzalobenegas/tokenizer-dna-mlm \
    --soft_masked_loss_weight_train 0.1 --soft_masked_loss_weight_evaluation 0.0 \
    --weight_decay 0.01 --optim adamw_torch \
    --dataloader_num_workers 16 --seed 42 \
    --save_strategy steps --save_steps 10000 --evaluation_strategy steps \
    --eval_steps 10000 --logging_steps 10000 --max_steps 120000 --warmup_steps 1000 \
    --learning_rate 1e-3 --lr_scheduler_type constant_with_warmup \
    --run_name your_run --output_dir your_output_dir --model_type GPN \
    --per_device_train_batch_size 512 --per_device_eval_batch_size 512 --gradient_accumulation_steps 1 --total_batch_size 2048 \
    --torch_compile \
    --ddp_find_unused_parameters False \
    --bf16 --bf16_full_eval
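
The `--nproc_per_node` expression in the command counts the comma-separated entries in CUDA_VISIBLE_DEVICES. The sketch below mirrors that count in Python and checks the batch-size flags against each other. It assumes the usual relationship, effective batch = per-device batch × number of devices × gradient accumulation steps. How gpn uses `--total_batch_size` internally is not stated here, so treat the check as a sanity aid rather than a description of the training script. Both function names are hypothetical.

```python
def count_visible_devices(cuda_visible_devices: str) -> int:
    """Count comma-separated entries, like `awk -F',' '{print NF}'`."""
    if not cuda_visible_devices:
        return 0  # awk reports zero fields for an empty line
    return len(cuda_visible_devices.split(","))


def effective_batch_size(per_device: int, n_devices: int, grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step (assumed relationship)."""
    return per_device * n_devices * grad_accum_steps


n_devices = count_visible_devices("0,1,2,3")
total = effective_batch_size(per_device=512, n_devices=n_devices, grad_accum_steps=1)
print(n_devices, total)  # 4 2048
```

Under this assumption, the flags in the example command (512 per device, 1 accumulation step, 2048 total) are consistent with a 4-GPU node.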

3. Extract embeddings

The input file must contain chrom, start and end columns.

Example command:

torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.get_embeddings \
    windows.parquet genome.fa.gz 100 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
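
A windows table with the three required columns can be built in a few lines. The sketch below is only an illustration: `make_windows` and `write_parquet` are hypothetical helpers, the 0-based end-exclusive coordinates are an assumption, and the window length of 100 is chosen on the assumption that the positional `100` in the command is the window size. Chromosome names presumably need to match the headers in the FASTA file.

```python
def make_windows(chrom, chrom_length, window_size, step):
    """Return (chrom, start, end) tuples tiling one chromosome.

    Coordinates are 0-based and end-exclusive (an assumption, BED-style).
    """
    windows = []
    for start in range(0, chrom_length - window_size + 1, step):
        windows.append((chrom, start, start + window_size))
    return windows


def write_parquet(windows, path):
    """Write the windows with the column names the input file requires."""
    import pandas as pd  # needs pandas plus a Parquet engine such as pyarrow

    df = pd.DataFrame(windows, columns=["chrom", "start", "end"])
    df.to_parquet(path, index=False)


# Toy example: a 1,000 bp chromosome, 100 bp windows, 50 bp step.
windows = make_windows("1", chrom_length=1000, window_size=100, step=50)
print(len(windows), windows[0], windows[-1])
```

Calling `write_parquet(windows, "windows.parquet")` would then produce a file with the column layout expected by the command above.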

4. Variant effect prediction

The input file must contain chrom, pos, ref and alt columns.

Example command:

torchrun --nproc_per_node=$(echo $CUDA_VISIBLE_DEVICES | awk -F',' '{print NF}') -m gpn.ss.run_vep \
    variants.parquet genome.fa.gz 512 your_output_dir results.parquet \
    --per_device_batch_size 4000 --is_file --dataloader_num_workers 16
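
A common zero-shot score for masked DNA language models is the log-likelihood ratio between the alternate and reference alleles at the masked variant position. The sketch below computes that ratio from a vector of logits over A, C, G and T. The logits are made up for illustration, the function names are hypothetical, and the code does not reproduce what `gpn.ss.run_vep` does internally.

```python
import math

NUCLEOTIDES = "ACGT"


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def llr_score(probs, ref, alt):
    """log p(alt) - log p(ref) at the masked variant position."""
    p = dict(zip(NUCLEOTIDES, probs))
    return math.log(p[alt]) - math.log(p[ref])


# Made-up logits over A, C, G, T at one masked position (illustrative only).
probs = softmax([2.0, 0.0, -1.0, 0.5])
score = llr_score(probs, ref="A", alt="G")
print(f"{score:.3f}")  # -3.000
```

A negative score means the model considers the alternate allele less likely than the reference in that context. Because the softmax normalizer cancels, the score equals the difference between the two logits.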

GPN-MSA

A genomic language model trained on whole-genome alignments across multiple species.

Quick start

import gpn.model  # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-msa-sapiens")

Papers

Benegas, Albors, Aw, Ye and Song "A DNA language model based on multispecies alignment predicts the effects of genome-wide variants" Nature Biotechnology (2025)

Training on other species (e.g. other vertebrates, plants)

PhyloGPN

A phylogenetic genomic language model that uses an alignment during training but does not require it for inference or fine-tuning. PhyloGPN is a convolutional neural network that outputs rate matrix parameters for Felsenstein's F81 substitution model, trained on the Zoonomia alignment. It can be used for transfer learning and zero-shot variant deleteriousness prediction, especially useful for sequences not in reference genomes.
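
For readers unfamiliar with the F81 model, the sketch below computes its transition probabilities from a vector of stationary nucleotide frequencies. Under F81 the rate of substituting base i with base j is proportional to the stationary frequency of j, which gives the closed form P_ij(t) = e^(-μt) δ_ij + (1 - e^(-μt)) π_j. The code illustrates the substitution model only. How PhyloGPN parameterizes its outputs is not described here, and the function name and example frequencies are hypothetical.

```python
import math


def f81_transition_probs(pi, t, normalize=True):
    """Return P where P[i][j] = Pr(base j after time t | base i at time 0).

    pi: stationary frequencies over (A, C, G, T), summing to 1.
    normalize: scale the rate so t is in expected substitutions per site.
    """
    if abs(sum(pi) - 1.0) > 1e-9:
        raise ValueError("stationary frequencies must sum to 1")
    mu = 1.0 / (1.0 - sum(p * p for p in pi)) if normalize else 1.0
    decay = math.exp(-mu * t)  # probability that no substitution event occurred
    n = len(pi)
    return [
        [decay * (1.0 if i == j else 0.0) + (1.0 - decay) * pi[j] for j in range(n)]
        for i in range(n)
    ]


pi = [0.4, 0.1, 0.1, 0.4]  # made-up frequencies for A, C, G, T
P = f81_transition_probs(pi, t=0.5)
for row in P:
    print([round(x, 3) for x in row])
```

Each row sums to 1. As t grows, every row converges to pi, and at t = 0 the matrix is the identity.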

Quick start

from transformers import AutoModel

model = AutoModel.from_pretrained("songlab/PhyloGPN", trust_remote_code=True)

Papers

Albors, Li, Benegas, Ye and Song "A Phylogenetic Approach to Genomic Language Modeling" RECOMB (2025)

GPN-Star

A phylogeny-aware genomic language model trained on whole-genome alignments across multiple evolutionary timescales.

Quick start

import gpn.star.model  # registers architecture for AutoModel
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("songlab/gpn-star-hg38-p243-200m")

Papers

Ye, Benegas, Albors, Li, Prillo, Fields, Clarke and Song "Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models" bioRxiv (2025)

Getting help

  • Questions? Open a Discussion for usage questions, ideas, or general help
  • Issues? Report bugs or request features via Issues

Citation

GPN:

@article{benegas2023dna,
  title={DNA language models are powerful predictors of genome-wide variant effects},
  author={Benegas, Gonzalo and Batra, Sanjit Singh and Song, Yun S},
  journal={Proceedings of the National Academy of Sciences},
  volume={120},
  number={44},
  pages={e2311219120},
  year={2023},
  publisher={National Acad Sciences}
}

GPN-MSA:

@article{benegas2025dna,
  title={A DNA language model based on multispecies alignment predicts the effects of genome-wide variants},
  author={Benegas, Gonzalo and Albors, Carlos and Aw, Alan J and Ye, Chengzhong and Song, Yun S},
  journal={Nature Biotechnology},
  pages={1--6},
  year={2025},
  publisher={Nature Publishing Group US New York}
}

PhyloGPN:

@inproceedings{albors2025phylogenetic,
  title={A Phylogenetic Approach to Genomic Language Modeling},
  author={Albors, Carlos and Li, Jianan Canal and Benegas, Gonzalo and Ye, Chengzhong and Song, Yun S},
  booktitle={International Conference on Research in Computational Molecular Biology},
  pages={99--117},
  year={2025},
  organization={Springer}
}

GPN-Star:

@article{ye2025predicting,
  title={Predicting functional constraints across evolutionary timescales with phylogeny-informed genomic language models},
  author={Ye, Chengzhong and Benegas, Gonzalo and Albors, Carlos and Li, Jianan Canal and Prillo, Sebastian and Fields, Peter D and Clarke, Brian and Song, Yun S},
  journal={bioRxiv},
  pages={2025--09},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}