This 2nd edition contains many new sections, sections that have gone through significant modifications, and some new or rearranged chapters. These are highlighted. Sections that describe advanced, often technical concepts that are not central to the main flow of the book are marked with a star. Use the Map of HTS to see how the main computational steps in high-throughput sequencing map to book chapters.

Part I – Preliminaries

1. Molecular biology and
high-throughput sequencing

1.1 DNA, RNA, proteins

1.2 Genetic variations

1.3 High-throughput sequencing

2. Algorithm design

2.1 Complexity analysis

2.2 Data representations

2.3 Reductions

2.4 Literature

3. Data structures

3.1 Dynamic range minimum queries

3.2 Bitvector rank and select operations

3.3 Wavelet tree

3.3.1 Balanced representation

3.3.2 Range queries

3.4 Static range minimum queries

3.4.1 From RMQs to LCAs through Cartesian tree

3.4.2 From LCAs to special RMQs

3.4.3 Solving short special RMQs

3.4.4 From large special RMQs to shorter general RMQs

3.4.5 Solving general RMQs

3.5 Hashing

3.5.1 Universal hashing

3.5.1 Approximate membership query

3.5.1 Rolling hash

3.5.1 Minimizers

3.6 Literature

4. Graphs

4.1 Directed acyclic graphs (DAGs)

4.1.1 Topological ordering

4.1.2 Shortest paths

4.2 Arbitrary directed graphs

4.2.1 Eulerian paths

4.2.2 Shortest paths and the Bellman-Ford method

4.3 Literature

5. Network flows

5.1 Flows and their decompositions

5.2 Minimum-cost flows and circulations

5.2.1 The residual graph

5.2.2 A pseudo-polynomial algorithm

5.3 Bipartite matching problems

5.3.1 Perfect matching

5.3.2 Matching with capacity constraints

5.3.3 Matching with residual constraints

5.4 Covering problems

5.4.1 Disjoint cycle cover

5.4.2 Minimum path cover in a DAG

5.5 Literature

Part II – Fundamentals of Biological Sequence Analysis

6. Alignments

6.1 Edit distance

6.1.1 Edit distance computation

6.1.2 Shortest detour

6.1.3 Myers' bitparallel algorithm

6.2 Longest common subsequence

6.2.1 Sparse dynamic programming

6.3 Approximate string matching

6.4 Biological sequence alignment

6.4.1 Global alignment

6.4.2 Local alignment

6.4.3 Overlap alignment

6.4.4 Affine gap scores

6.4.5 The invariant technique

6.5 Gene alignment

6.6 Multiple alignment

6.6.1 Scoring schemes

6.6.2 Dynamic programming

6.6.3 Hardness

6.6.4 Progressive multiple alignment

6.7 DAG alignment

6.8 Alignment on cyclic graphs

6.9 Jumping alignment

6.10 Literature

7. Hidden Markov models (HMMs)

7.1 Definition and basic problems

7.2 The Viterbi algorithm

7.3 The forward and backward algorithms

7.4 Estimating HMM parameters

7.5 Literature

Part III – Genome-Scale Index Structures

8. Classical indexes

8.1 k-mer index

8.2 Suffix array

8.2.1 Suffix and string sorting

8.3 Suffix tree

8.3.1 Properties of the suffix tree

8.3.2 Construction of the suffix tree

8.4 Applications of the suffix tree

8.4.1 Maximal repeats

8.4.2 Maximal unique matches

8.4.3 Document counting

8.4.4 Suffix-prefix overlaps

8.4.5 Approximate string matching in O(kn) time

8.5 Literature

9. Burrows–Wheeler indexes

9.1 Burrows–Wheeler transform (BWT)

9.2 BWT index

9.2.1 Succinct LF-mapping

9.2.2 Backward search

9.2.3 Succinct suffix array

9.2.4 Batched locate queries

9.3 Space-efficient construction of the BWT

9.4 Bidirectional BWT index

9.4.1 Visiting all nodes of the suffix tree with just one BWT

9.5 BWT index for labeled trees

9.5.1 Moving top-down

9.5.2 Moving bottom-up

9.5.3 Construction and space complexity

9.6 BWT index for labeled DAGs

9.6.1 Moving backward

9.6.2 Moving forward

9.6.3 Construction

9.7 BWT indexes for de Bruijn graphs

9.7.1 Frequency-oblivious representation

9.7.2 Frequency-aware representation

9.7.3 Space-efficient construction

9.8 Literature

Part IV – Genome-Scale Algorithms

10. Alignment-based genome analysis

10.1 Variation calling

10.1.1 Calling small variants

10.1.2 Calling large variants

10.3 Pattern partitioning

10.3 Dynamic programming along suffix tree paths

10.4 Backtracking on BWT indexes

10.4.1 Prefix pruning

10.4.2 Case analysis pruning with the bidirectional BWT index

10.5 Suffix filtering for approximate overlaps

10.6 Paired-end and mate pair reads

10.7 Algorithmic approach to variant selection

10.8 Literature

11. Alignment-free genome analysis and comparison

11.1 De novo variation calling

11.2 Space-efficient genome analysis

11.2.1 Maximal repeats

11.2.2 Maximal unique matches

11.2.3 Maximal exact matches

11.3 Comparing genomes without alignment

11.3.1 Substring and k-mer kernels

11.3.2 Substring kernels with Markovian correction

11.3.3 Substring kernels and matching statistics

11.3.4 Mismatch kernels

11.3.5 Compression distance

11.3.6 Approximating Jaccard similarity with min-hash

11.4 Literature

12. Compression of genome collections

12.1 Lempel–Ziv parsing

12.1.1 Basic algorithm for Lempel–Ziv parsing

12.1.2 Space-efficient Lempel–Ziv parsing

12.1.3 Space- and time-efficient Lempel–Ziv parsing

12.2 Bit-optimal Lempel–Ziv compression

12.3 Prefix-free parsing and run-length encoded BWT

12.3.1 Prefix-free parsing

12.3.2 Suffix sorting

12.3.3 Construction of the run-length BWT

12.4 Literature

13. Fragment assembly

13.1 Sequencing by hybridization

13.2 Contig assembly

13.2.1 Read error correction

13.2.2 Reverse complements

13.2.3 Irreducible overlap graphs

13.3 Scaffolding

13.4 Gap filling

13.5 Literature

Part V – Applications

14. Haplotype analysis

14.1 Haplotype assembly and phasing

14.1.1 Minimum error correction

14.1.2 Hardness

14.1.3 Dynamic programming

14.2 Haplotype matching

14.2.1 Haplotype matching in linear time

14.2.2 Positional Burrows–Wheeler transform

14.3 Literature

15. Pangenomics

15.1 Overview of pangenome representations

15.1.1 Colored de Bruijn graphs

15.1.2 Founder graphs and founder sequences

15.2 Aligning reads to pangenome

15.2.1 Indexing a set of individual genomes with a hybrid scheme

15.2.2 Indexing a set of individual genomes with the r-index

15.2.3 Indexing a reference genome and a set of variants

15.3 Variation calling over pangenomes

15.3.1 Analysis of alignments on a set of individual genomes

15.3.2 Analysis of alignment on the labeled DAG of a population

15.3.3 Evaluation of variation calling results

15.4 Literature

16. Transcriptomics

16.1 Split alignment of reads

16.2 Estimating the expression of annotated transcripts

16.3 Transcript assembly

16.3.1 Short reads

16.3.2 Long reads

16.3.3 Paired-end reads

16.4 Simultaneous assembly and expression estimation

16.5 Transcript alignment with minimizers and co-linear chaining

16.6 Literature

17. Metagenomics

17.1 Species estimation

17.1.1 Single-read methods

17.1.2 Multi-read and coverage-sensitive methods

17.2 Read clustering

17.2.1 Filtering reads from low-frequency species

17.2.2 Initializing clusters

17.2.3 Growing clusters

17.3 Comparing metagenomic samples

17.3.1 Sequence-based methods

17.3.2 Read-based methods

17.3.3 Multi-sample methods

17.4 Literature