Bioinformatics
Contents
Package suites
International association of users & developers of open source Perl tools for bioinformatics, genomics and life sciences.
Freely available tools for biological computing in Python, with included cookbook, packaging and thorough documentation. Part of the Open Bioinformatics Foundation. Contains the very useful Entrez package for API access to the NCBI databases.
Rust implementations of algorithms and data structures useful for bioinformatics.
The modern C++ library for sequence analysis.
A Go library and command line utility for engineering organisms.
Biocaml aims to be a high-performance user-friendly library for Bioinformatics.
Java framework for processing biological data.
Data Tools
Downloading
Data Processing
Command Line Utilities
Git repo of useful single line commands.
Modular and universal bioinformatics, Bionode provides pipeable UNIX command line tools and JavaScript APIs for bioinformatics analysis workflows. [web ]
Syntax Highlighting for Computational Biology file formats (SAM, VCF, GTF, FASTA, PDB, etc.) in vim/less/gedit/sublime. [paper-2018 | web ]
Utilities for working with CSV/Tab-delimited files. [web ]
Another cross-platform, efficient, practical and pretty CSV/TSV toolkit. [web ]
Easily submit PBS jobs with a script template. Supports multiple input files.
A wee tool for random access into BGZF files.
Fast FASTQ filtering by matching reads against one or more regex patterns.
Sort genomic files according to a specified order.
Table file index. [paper-2011 ]
Write-once-read-many table for large datasets.
Create an index on a compressed text file.
Next Generation Sequencing
Workflow Managers
A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [paper-2014 | web ]
A small language for defining pipeline stages and linking them together to make pipelines. [web ]
A specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [web ]
A Workflow Management System geared towards scientific workflows. [web ]
(recommended) - A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [paper-2018 | web ]
A python-based workflow manager.
Computation Pipeline library for python widely used in science and bioinformatics. [paper-2010 | web ]
Workflow library embedded in the Go programming language, focusing on supporting complex workflow constructs, compiling to a single binary, providing powerful file naming and comprehensive audit reports for every output. [paper-2019 | web ]
Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [paper-2010 | web ]
Workflow standard developed by the Broad. [web ]
Pipelines
A list of pipeline resources.
A flexible pipeline, built with Nextflow, for the complete analysis of bacterial genomes. [web ]
A generic but comprehensive bacterial annotation pipeline, built with Nextflow, with nice graphical options for investigating results. [web ]
Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction. [web ]
Customizable pipeline for differential expression analysis with an intuitive GUI. [web ]
A pipeline for preprocessing short and long sequencing reads, built with Nextflow. [web ]
Sequence Processing
Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data. [paper-2017 ]
A quality control tool for high throughput sequence data. [web ]
FASTQ and SAM quality control using Python.
FASTQ/A short-reads pre-processing tools: Demultiplexing, trimming, clipping, quality filtering, and masking utilities. [web ]
Aggregate results from bioinformatics analyses across many samples into a single report. [paper-2016 | web ]
Sequence manipulation toolkit for FASTA/FASTQ files written in Nim. [paper-2021 | web ]
A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang. [paper-2016 | web ]
Convenient file format conversion using Biopython. [web ]
Toolkit for processing sequences in FASTA/Q formats.
UNIX-style FASTA manipulation tools.
Data Analysis
Sequence Alignment
An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. [paper-2012 | web ]
Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.
BWA-MEM drop-in replacement: 2-3x faster, 2-5x cheaper, 100% identical output on standard CPUs. [paper-2026 ]
The wavefront alignment algorithm (WFA), which exploits sequence similarity to speed up alignment. [paper-2020 ]
SIMD C library for global, semi-global, and local pairwise sequence alignments. [paper-2016 ]
A system for rapidly aligning entire genomes, whether in complete or draft form. [paper-1999 | paper-2002 | paper-2004 | web ]
An ultrafast protein aligner for blastp- and blastx-like searches. [paper-2021 ]
Partial-Order Alignment for fast alignment and consensus of multiple homologous sequences. [paper-2002 ]
Ultra-fast, sensitive search and clustering suite for protein and nucleotide sequence sets. [paper-2017 | paper-2018 ]
Quantification
Variant Calling
Deep learning-based variant caller.
Bayesian haplotype-based polymorphism discovery and genotyping.
Variant Discovery in High-Throughput Sequencing Data.
A polymorphic bayesian genotyping model with wide applicability.
Structural variant discovery by integrated paired-end and split-read analysis.
lumpy: a general probabilistic framework for structural variant discovery.
Structural variant and indel caller for mapped sequencing data.
GRIDSS: the Genomic Rearrangement IDentification Software Suite.
Structural variant calling and genotyping with existing tools, but, smoothly.
VCF File Utilities
Set of tools for manipulating VCF files.
Annotate a VCF with other VCFs/BEDs/tabixed files.
A C++ library for parsing and manipulating VCF files.
VCF manipulation and statistics (e.g. linkage disequilibrium, allele frequency, Fst).
BAM File Utilities
Collection of tools for working with BAM files.
MtDNA:Nuclear Coverage; BAM Toolbox can output the ratio of MtDNA:nuclear coverage, a proxy for mitochondrial content.
Automate common SAM & BAM conversions.
Fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing.
Displaying sequence statistics for next-generation sequencing.
Fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs.
Telseq is a tool for estimating telomere length from whole genome sequence data.
GFF BED File Utilities
Variant Simulation
Variant Prediction/Annotation
Data
Tools
A port of pyVCF using Cython for speed.
Cython + HTSlib == fast VCF parsing; even faster parsing than pyVCF. [paper-2017 | web ]
Python wrapper for bedtools. [paper-2011 | web ]
Pythonic access to FASTA files.
Python wrapper for samtools. [web ]
A VCF Parser for Python. [web ]
Assembly
SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines and the de-facto standard for prokaryotic genome assemblies.
SKESA is a de-novo sequence read assembler for microbial genomes. It uses conservative heuristics and is designed to create breaks at repeat regions in the genome. This leads to excellent sequence quality without significantly compromising contiguity.
Minimap2 is a pairwise aligner for genomic and spliced nucleotide sequences. It can perform assembly-to-assembly alignment and works with gzip'd FASTQ and FASTA files. It also finds overlaps between long reads.
Annotation
Prokka: rapid prokaryotic genome annotation. Prokka is one of the most cited annotation command line tools for microbial genome annotations.
Bakta is a tool for the rapid & standardized annotation of bacterial genomes & plasmids. It provides dbxref-rich and sORF-including annotations in machine-readable JSON & bioinformatics standard file formats for automatic downstream analysis.
Long-read sequencing
Long-read Assembly
A single molecule sequence assembler for genomes large and small.
De novo assembler for single molecule sequencing reads using repeat graphs.
A haplotype-resolved assembler for accurate HiFi reads.
A fuzzy Bruijn graph approach to long noisy reads assembly.
Visualization
Genome Browsers / Gene Diagrams
Easy-to-use DNA sequence visualization tool that turns FASTA files into browser-based visualizations.
Embeddable genome viewer. Integrates data from a wide variety of sources, and can load data directly from popular genomics file formats including bigWig, BAM, and VCF.
BioJS is a library of over a hundred JavaScript components enabling you to visualize and process data using current web technologies.
Flexible circular visualization of genome-associated data with BioPerl and SVG.
Horizon chart D3-based JavaScript library for DNA data.
Java-based browser. Fast, efficient, scalable visualization tool for genomics data and annotations. Handles a large variety of formats.
D3 JavaScript based genome viewer. Constructs SVGs.
JavaScript genome browser that is highly customizable via plugins and track customizations.
Point and click, cross platform suite for analysing and visualizing next-generation sequencing datasets.
JavaScript library that can be used to generate interactive and highly customizable web-based genome browsers.
JavaScript library for drawing canvas-based gene diagrams.