Sections that describe advanced, often technical concepts that are not central to the main flow of the book are marked with a star. See the Teaching page for a list of logical dependencies among chapters. Use the Map of HTS to see how the main computational steps in high-throughput sequencing map to book chapters.

Part I – Preliminaries

1. Molecular biology and
high-throughput sequencing

1.1 DNA, RNA, proteins

1.2 Genetic variations

1.3 High-throughput sequencing

2. Algorithm design

2.1 Complexity analysis

2.2 Data representations

2.3 Reductions

2.4 Literature

3. Data structures

3.1 Dynamic range minimum queries

3.2 Bitvector rank and select operations

3.3 Wavelet tree

3.3.1 Balanced representation

3.3.2 Range queries

3.4 Literature

4. Graphs

4.1 Directed acyclic graphs (DAGs)

4.1.1 Topological ordering

4.1.2 Shortest paths

4.2 Arbitrary directed graphs

4.2.1 Eulerian paths

4.2.2 Shortest paths and the Bellman-Ford method

4.3 Literature

5. Network flows

5.1 Flows and their decompositions

5.2 Minimum-cost flows and circulations

5.2.1 The residual graph

5.2.2 A pseudo-polynomial algorithm

5.3 Bipartite matching problems

5.3.1 Perfect matching

5.3.2 Matching with capacity constraints

5.3.3 Matching with residual constraints

5.4 Covering problems

5.4.1 Disjoint cycle cover

5.4.2 Minimum path cover in a DAG

5.5 Literature

Part II – Fundamentals of Biological Sequence Analysis

6. Alignments

6.1 Edit distance

6.1.1 Edit distance computation

6.1.2 Shortest detour

6.1.3 Myers' bitparallel algorithm

6.2 Longest common subsequence

6.2.1 Sparse dynamic programming

6.3 Approximate string matching

6.4 Biological sequence alignment

6.4.1 Global alignment

6.4.2 Local alignment

6.4.3 Overlap alignment

6.4.4 Affine gap scores

6.4.5 The invariant technique

6.5 Gene alignment

6.6 Multiple alignment

6.6.1 Scoring schemes

6.6.2 Dynamic programming

6.6.3 Hardness

6.6.4 Progressive multiple alignment

6.6.5 DAG alignment

6.6.6 Jumping alignment

6.7 Literature

7. Hidden Markov models (HMMs)

7.1 Definition and basic problems

7.2 The Viterbi algorithm

7.3 The forward and backward algorithms

7.4 Estimating HMM parameters

7.5 Literature

Part III – Genome-Scale Index Structures

8. Classical indexes

8.1 k-mer index

8.2 Suffix array

8.2.1 Suffix and string sorting

8.3 Suffix tree

8.3.1 Properties of the suffix tree

8.3.2 Construction of the suffix tree

8.4 Applications of the suffix tree

8.4.1 Maximal repeats

8.4.2 Maximal unique matches

8.4.3 Document counting

8.4.4 Suffix-prefix overlaps

8.5 Literature

9. Burrows-Wheeler indexes

9.1 Burrows-Wheeler transform (BWT)

9.2 BWT index

9.2.1 Succinct LF-mapping

9.2.2 Backward search

9.2.3 Succinct suffix array

9.2.4 Batched locate queries

9.3 Space-efficient construction of the BWT

9.4 Bidirectional BWT index

9.4.1 Visiting all nodes of the suffix tree with just one BWT

9.5 BWT index for labeled trees

9.5.1 Moving top-down

9.5.2 Moving bottom-up

9.5.3 Construction and space complexity

9.6 BWT index for labeled DAGs

9.6.1 Moving backward

9.6.2 Moving forward

9.6.3 Construction

9.7 BWT indexes for de Bruijn graphs

9.7.1 Frequency-oblivious representation

9.7.2 Frequency-aware representation

9.7.3 Space-efficient construction

9.8 Literature

Part IV – Genome-Scale Algorithms

10. Read alignment

10.1 Pattern partitioning

10.2 Dynamic programming along suffix tree paths

10.3 Backtracking on BWT indexes

10.3.1 Prefix pruning

10.3.2 Case analysis pruning with the bidirectional BWT index

10.4 Suffix filtering for approximate overlaps

10.5 Paired-end and mate pair reads

10.6 Split alignment of reads

10.7 Alignment of reads to a pan-genome

10.7.1 Indexing a set of individual genomes

10.7.2 Indexing a reference genome and a set of variations

10.8 Literature

11. Genome analysis and comparison

11.1 Space-efficient genome analysis

11.1.1 Maximal repeats

11.1.2 Maximal unique matches

11.1.3 Maximal exact matches

11.2 Comparing genomes without alignment

11.2.1 Substring and k-mer kernels

11.2.2 Substring kernels with Markovian correction

11.2.3 Substring kernels and matching statistics

11.2.4 Mismatch kernels

11.2.5 Compression distance

11.3 Literature

12. Genome compression

12.1 Lempel-Ziv parsing

12.1.1 Basic algorithm for Lempel-Ziv parsing

12.1.2 Space-efficient Lempel-Ziv parsing

12.1.3 Space- and time-efficient Lempel-Ziv parsing

12.2 Bit-optimal Lempel-Ziv compression

12.3 Literature

13. Fragment assembly

13.1 Sequencing by hybridization

13.2 Contig assembly

13.2.1 Read error correction

13.2.2 Reverse complements

13.2.3 Irreducible overlap graphs

13.3 Scaffolding

13.4 Gap filling

13.5 Literature

Part V – Applications

14. Genomics

14.1 Variation calling

14.1.1 Calling small variants

14.1.2 Calling large variants

14.2 Variation calling over pan-genomes

14.2.1 Alignments on a set of individual genomes

14.2.2 Alignments on the labeled DAG of a population

14.2.3 Evaluation of variation calling results

14.3 Haplotype assembly and phasing

14.4 Literature

15. Transcriptomics

15.1 Estimating the expression of annotated transcripts

15.2 Transcript assembly

15.2.1 Short reads

15.2.2 Long reads

15.2.3 Paired-end reads

15.3 Simultaneous assembly and expression estimation

15.4 Transcript alignment with co-linear chaining

15.5 Literature

16. Metagenomics

16.1 Species estimation

16.1.1 Single-read methods

16.1.2 Multi-read and coverage-sensitive methods

16.2 Read clustering

16.2.1 Filtering reads from low-frequency species

16.2.2 Initializing clusters

16.2.3 Growing clusters

16.3 Comparing metagenomic samples

16.3.1 Sequence-based methods

16.3.2 Read-based methods

16.3.3 Multi-sample methods

16.4 Literature