We provide here a non-exhaustive list of implementations of the algorithms and data structures described in the book. We mainly list prototypes that are useful for pedagogical purposes. Sometimes we also include high-performance, mature codes that are ready to be used in genome-scale applications.
If you or your students implement some algorithms from the book, please let us know via email
and we will link your code from this page. We do our best to assert that it is shared in good trust, but we do not fully check its efficiency or correctness.
Part I – Preliminaries
3. Data structures
3.1 Dynamic range minimum queries
- File rbt_max.cpp inside the NN50-calculator implements fully dynamic range maximum queries using a red-black tree.
- The libmaus2 library implements static range minimum queries using lookup tables, dynamic programming and tree encodings.
3.2 Bitvector rank and select operations
3.3 Wavelet tree
- See the wt* files in the SDSL library.
- The libmaus2 library implements static and dynamic wavelet trees, as well as external-memory and parallel construction algorithms.
- The libcds2 library implements the wavelet tree and the wavelet matrix.
4.1 Directed acyclic graphs (DAGs)
4.1.1 Topological ordering
4.2 Arbitrary directed graphs
4.2.1 Eulerian paths
4.2.2 Shortest paths and the Bellman-Ford method
5. Network flows
5.2 Minimum-cost flows and circulations
5.2.2 A pseudo-polynomial algorithm
5.3 Bipartite matching problems
5.3.1 Perfect matching
5.4 Covering problems
5.4.2 Minimum path cover in a DAG
- The reduction to a minimum-cost problem presented in the book is part of the MPC-Solver.
Part II – Fundamentals of Biological Sequence Analysis
- The libmaus2 library implements a number of algorithms for local, global and sparse sequence alignment, overlap alignment, and LCS computation.
6.1 Edit distance
6.1.1 Edit distance computation
- The example in the book was generated with the script dp2tikz.py.
* 6.1.3 Myers bitparallel algorithm
6.2 Longest common subsequence
6.2.1 Sparse dynamic programming
- The algorithm can be found in this prototype targeted to a music retrieval application.
6.3 Approximate string matching
6.4 Biological sequence alignment
6.4.1 Global alignment
6.4.2 Local alignment
6.4.4 Affine gap scores
6.5 Gene alignment
6.6 Multiple alignment
6.6.2 Dynamic programming
6.6.4 Progressive multiple alignment
6.6.5 DAG alignment
- PAGAN implements DAG-based progressive alignment.
6.6.6 Jumping alignment
7. Hidden Markov models (HMMs)
7.2 The Viterbi algorithm
7.3 The forward and backward algorithms
7.4 Estimating HMM parameters
Part III – Genome-Scale Index Structures
8. Classical indexes
8.1 k-mer index
- The libmaus2 library implements a gamma encoder and decoder, and some applications of gamma-coding like a run-length encoder.
8.2 Suffix array
8.2.1 Suffix and string sorting
- The sais library provides a number of space- and time-efficient implementations.
8.3 Suffix tree
8.3.2 Construction of the suffix tree
- The libmaus2 library implements algorithms for building the LCP array from the suffix array, and from the BWT represented as a wavelet tree.
8.4 Applications of the suffix tree
8.4.1 Maximal repeats
8.4.2 Maximal unique matches
8.4.3 Document counting
9. Burrows-Wheeler indexes
9.2 BWT index
9.2.3 Succinct suffix array
- Many variants of succinct suffix arrays, FM-indexes, and compressed suffix arrays are implemented in the Pizza&Chili corpus.
- A more generic interface can be found inside the SDSL library.
- The libmaus2 library implements the FM-index, a bidirectional index, and a succinct suffix array.
9.3 Space-efficient construction of the BWT
- See the csalib library for a space-efficient implementation in C of a different algorithm from what we describe.
- See the construct_bwt package for an even more space-efficient implementation in C++.
9.4 Bidirectional BWT index
- See the BD_BWT_index for an implementation of the bidirectional BWT index.
- Section 9.4 and 9.4.1 describe general iterators of the internal nodes of a suffix tree. See VSTree Iterator for an example interface of such an iterator.
9.5 BWT index for labeled trees
9.6 BWT index for labeled DAGs
9.7 BWT indexes for de Bruijn graphs
9.7.1 Frequency-oblivious representation
- An inefficient, pedagogical implementation in Python is available at debby.py.
- See dbgfm for an implementation of a different data structure based on the Burrows-Wheeler transform.
Part IV – Genome-Scale Algorithms
10. Read alignment
10.1 Pattern partitioning
- The extension of the pigeonhole principle considered in Exercise 10.2 is implemented in ERNE.
10.2 Dynamic programming along suffix tree paths
- Several prototype implementations of pattern partitioning, combined with the bitparallel computation along suffix tree paths, can be found in this package.
10.3 Backtracking on BWT indexes
10.3.1 Prefix pruning
- An almost verbatim implementation can be found in readaligner.
- The original implementation of prefix pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in BWA.
- The idea of pruning by hashing combined with the BWT index, described in Insight 10.3, is implemented in ERNE.
10.3.2 Case analysis pruning with the bidirectional BWT index
- An almost verbatim implementation can be found in readaligner.
- The original implementation of case analysis pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in Bowtie.
- The original implementation of the enhanced case analysis pruning exploiting the bidirectional BWT index, together with some optimizations for DNA alphabet and seed-heuristics, can be found in SOAP2.
10.4 Suffix filtering for approximate overlaps
10.6 Split alignment of reads
- One of the book Insights is implemented in TopHat.
10.7 Alignment of reads to a pan-genome
10.7.1 Indexing a set of individual genomes
- An implementation of the hybrid index is available at hybrid.
10.7.2 Indexing a reference genome and a set of variations
11. Genome analysis and comparison
11.1 Space-efficient genome analysis
11.1.1 Maximal repeats
11.1.2 Maximal unique matches
11.1.3 Maximal exact matches
11.2 Comparing genomes without alignment
11.2.1 Substring and k-mer kernels
- For pedagogical implementations, see the BW4SA library.
- CVTree (source code) is a webserver implementation of k-mer kernels that is not designed for space-efficiency.
- The data-driven approach for choosing k detailed in an insight is also implemented in the FFP package.
- Some examples of k-mer extractors are: Jellyfish, DSK, KMC2, BF Counter, Tallymer.
* 11.2.2 Substring kernels with Markovian correction
- A proof-of-concept implementation based on truncated suffix trees is available in the Composerv package.
11.2.3 Substring kernels and matching statistics
- The SASK package implements substring kernels using matching statistics and suffix arrays.
- The backwardSK package implements substring kernels using matching statistics, the BWT, a balanced parentheses data structure, and the LCP array.
- kmacs implements an inexact variant of the average common substring approach described in the insight, but using a simple heuristic.
11.2.4 Mismatch kernels
- A variant of the mismatch kernel described in this section is implemented by the SEQAM lab.
- A collection of other inexact kernels is provided by Christina Leslie's lab. Such implementations are not designed for space-efficiency.
11.2.5 Compression distance
The CompLearn suite implements the Normalized Compression Distance and uses it for clustering.
12. Genome compression
12.2 Bit-optimal Lempel-Ziv compression
- An implementation of Relative Lempel-Ziv is implemented in the RLZ package.
- An alternative approach is implemented in the GDC2 package.
13. Fragment assembly
13.2 Contig assembly
- GATB (Genome Assembly & Analysis Tool Box) implements de Bruijn graphs based on a Bloom filter, and supports reverse complements.
- This script is based on GATB and reports all unitigs in a de Bruijn graph.
13.2.1 Read error correction
- The algorithm given in the book is largely simplified from what actual tools use. Perhaps the most similar approach is implemented in LoRDEC.
13.2.3 Irreducible overlap graphs
- The algorithm described in the Insight is implemented in SGA.
13.4 Gap filling
- The algorithm we describe is implemented in Gap2Seq.
Part V – Applications
14.1 Variation calling
14.1.1 Calling small variants
- Popular workflows for variant calling are implemented in GATK and SAMtools.
14.1.2 Calling large variants
14.2 Variation calling over pan-genomes
14.2.2 Alignments on the labeled DAG of a population
14.2.3 Evaluation of variation calling results
14.3 Haplotype assembly and phasing
- The algorithm described in the book was implemented by WhatsHap.
15.1 Estimating the expression of annotated transcripts
- See the Least Squares solvers implemented in MATLAB (in particular, functions lsqnonneg or lsqlin for the problem as considered in the book).
15.2 Transcript assembly
15.2.1 Short reads
- A minimum path cover approach based on read overlaps that resembles the one described in this section is implemented in Cufflinks.
15.2.2 Long reads
- See Traphlor.
- An alternative approach for exploiting long reads or partially-assembled transcripts, and also based on a minimum path cover formulation, is implemented in BRANCH.
15.2.3 Paired-end reads
- See CLASS for an alternative approach that takes into account paired-end read information.
15.3 Simultaneous assembly and expression estimation
- The minimum-cost flow solutions for Problem 15.6, and the algorithm asked for in Exercise 15.12 are implemented in Traph.
- Traph also implements Problem 15.6 using minimum-cost flows and the reduction from Insight 5.2 from convex costs to linear costs.
- The addition of a regularization term to the objective function mentioned in Insight 15.2, also solved with a minimum-cost flow problem, is implemented in flipflop.
15.4 Transcript alignment with co-linear chaining
16.1 Species estimation
16.1.1 Single-read methods
- The lowest common ancestor method for classifying reads is implemented in the MEGAN package.
16.1.2 Multi-read and coverage-sensitive methods
- Variants of the lowest common ancestor method, applied to contigs and read clusters, are implemented in MetaCluster 4.
- The set-cover heuristic for read classification is implemented in the MTR package.
- Taxonomic markers for read classification are implemented in MetaPhyler and in
MetaPhlAn. MetaRef is a database of markers, cores and crowns.
16.1.2 Multi-read and coverage-sensitive methods
- A procatical implementation of Problem 16.2 is by the tool MetaFlow.
16.2 Read clustering
- The pipeline for read clustering described in this section is originally implemented in MetaCluster 5. This implementation includes disjoint-set forests (Insight 16.1) and k-means clustering (Insight 16.2), but it is not designed for space-efficiency.
- The bwtCluster package implements the pipeline space-efficiently.
- An alternative clustering criterion based on rare k-mers is implemented in TOSS.
- CONCOCT is a tool for the simultaneous clustering of contigs from multiple metagenomic samples.
16.3 Comparing metagenomic samples
- Meta-Storms compares metagenomic samples using their estimated taxonomic composition.
16.3.1 Sequence-based methods
- d2-tools implements a number of dissimilarity measures on k-mers which have been applied to metagenomes and metatranscriptomes.
16.3.2 Read-based methods
- commet compares metagenomic samples using a read similarity criterion.
16.3.3 Multi-sample methods