This DNA Sequencer Costs Less Than an iPhone. Conclusion: I Need to Learn Bioinformatics.

Plus the “cover band learning strategy”, 39 bioinformatics tools and two The Simpsons references.

Kahlil Corazo
Oct 29, 2019

Do you remember this episode in The Simpsons?

I predict that within 10 years, computers will be twice as powerful, ten thousand times larger, and so expensive that only the 5 richest kings of Europe will own them. — Dr. Frink

When this aired, many of us had our personal computers at home. We found this funny because we were living in the future, where Moore’s law had marched relentlessly — and continues to do so up to this day.

In the world of biology, the same joke could have been made about DNA sequencing. When the human genome was first sequenced in the early 2000s, it cost $2.7 billion. Even if Dr. Frink had predicted that the cost of sequencing would fall in step with Moore’s law, the logic of the joke would still work (though perhaps not its punch), because the price per megabase of sequenced DNA has fallen much faster than the price of computing power.

https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Despite the falling cost of DNA sequencing, the capital cost of a sequencer was still in the hundreds of thousands of US dollars until a few years ago, when the Oxford Nanopore MinION came out. It is the size of a mobile phone, and you plug it into your computer.

Image sources: https://www.genomeweb.com/sequencing-technology/oxford-nanopore-releases-pricing-minion-flow-cells-users-publish-new-data and https://www.evaluate.com/vantage/articles/analysis/spotlight/oxford-nanopore-disruptive-unicorn-gunning-illumina

At the time of writing (2019), the starter package, which includes a sequencer, flow cells and reagents, costs only $1,000. Once these are used up, you need to buy more flow cells and reagents. Flow cells cost $500 to $900 each (cheaper at volume) and can produce up to 30 gigabases of sequenced DNA per flow cell. Reagents cost around $100 per sequencing run.
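As a rough check on what those prices imply per gigabase, here is a back-of-the-envelope sketch in Python. It assumes one set of reagents per flow cell and the best-case yield of 30 Gb quoted above. Real runs may yield less, so treat the results as a floor rather than a typical figure. The function name is my own.

```python
def cost_per_gigabase(flow_cell_usd, reagents_usd, yield_gb):
    """Consumables cost (USD) per gigabase for one sequencing run."""
    return (flow_cell_usd + reagents_usd) / yield_gb

REAGENTS_USD = 100        # approximate reagent cost per run, from the text
BEST_CASE_YIELD_GB = 30   # "up to 30 gigabases" per flow cell (best case)

for flow_cell_usd in (500, 900):
    usd_per_gb = cost_per_gigabase(flow_cell_usd, REAGENTS_USD, BEST_CASE_YIELD_GB)
    print(f"${flow_cell_usd} flow cell: about ${usd_per_gb:.0f} per Gb")
```

At full yield this works out to roughly $20 to $33 per Gb. Plug in a lower yield to see how quickly the per-gigabase cost climbs.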

From Li, Changsheng & Lin, Feng & An, Dong & Wang, Wenqin & Huang, Ruidong. (2017). Genome Sequencing and Assembly by Long Reads in Plants. Genes. 9. 6. 10.3390/genes9010006. https://www.ncbi.nlm.nih.gov/pubmed/29283420 (note that the price/Gb for Nanopore is probably wrong — see below)
Li et al above say $750 per Gb for Nanopore, but the Nanopore website says it can do $15 to $60 per Gb. https://nanoporetech.com/products/comparison#output
Bleidorn, Christoph (2016). “Third generation sequencing: technology and its potential impact on evolutionary biodiversity research”. Systematics and Biodiversity. 14 (1): 1–8. doi:10.1080/14772000.2015.1099575. ISSN 1477–2000. https://www.tandfonline.com/doi/abs/10.1080/14772000.2015.1099575?journalCode=tsab20

Not yet something everyone will have in their homes, but affordable enough for an MS Biology student like me to consider for a thesis. I read a couple of papers to see how this could be done for my organism of interest. This led me to conclude that I would need to learn bioinformatics. The numbers below might also convince you to do the same. I also share my learning plan. Would love to hear your thoughts.

Here are the two papers. As you can see, my organism of interest is cacao. The first paper used a PacBio sequencer and the second one used the MinION.

At the bottom of this post, I share the 39 bioinformatics tools and databases I gleaned from these two papers. For comparison, below is the part of the materials and methods section of the second paper (Morrissey et al) that describes the library preparation. Judging just by the lengths of the sections, bioinformatics appears to make up the majority of the work in this paper. If you happen to have actual experience in this kind of research, would you agree with this observation?

Furthermore, check out this graph of DNA sequencing data produced over time. The blue line is Moore’s law: doubling every 18 months. The red line is the historical growth rate: doubling every 7 months. By mere volume, it is looking like there is a lot of value in learning bioinformatics.

Bleidorn, Christoph (2016). “Third generation sequencing: technology and its potential impact on evolutionary biodiversity research”. Systematics and Biodiversity. 14 (1): 1–8. doi:10.1080/14772000.2015.1099575. ISSN 1477–2000. https://www.tandfonline.com/doi/abs/10.1080/14772000.2015.1099575?journalCode=tsab20
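To get a feel for how different those two doubling times are, here is a small Python sketch that compounds each one over the same period. The ten-year horizon is my own choice for illustration, not a number taken from the graph.

```python
def growth_factor(years, doubling_months):
    """How many times larger a quantity gets if it doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

YEARS = 10  # illustrative horizon (an assumption, not from the source)

moore = growth_factor(YEARS, 18)      # Moore's law: doubling every 18 months
sequencing = growth_factor(YEARS, 7)  # sequencing data: doubling every 7 months

print(f"Moore's law over {YEARS} years: ~{moore:,.0f}x")
print(f"Sequencing data over {YEARS} years: ~{sequencing:,.0f}x")
```

Over a decade, the 18-month doubling gives about a hundredfold increase, while the 7-month doubling gives well over a hundred thousandfold.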

The Cover Band Learning Strategy

How do I learn bioinformatics? Well, there’s this specialization track in Coursera:

I plan to start this next month, as I have an online course I’m currently finishing. On top of this, I think the tried-and-tested learning pathway for bands might also work for would-be bioinformaticians: do covers first before writing your own songs.

Simpsons reference #2: Episode 8, “Covercraft”

Morrissey et al, the authors of the paper on sequencing cacao using the MinION, shared their sequencing data at NCBI: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA421343

So in theory, I could recreate the journey through their bioinformatics pipeline. Interested in collaborating to make a Morrissey et al cover band? Email me at bio@corazo.org

Turns out there’s an actual Morrissey / The Smiths cover band.

39 Bioinformatics Tools and Databases

To stretch the analogy, here’s our playlist. I culled them from the two cacao sequencing papers that use long reads, Argout et al (2017), which uses PacBio, and Morrissey et al (2019), which uses the MinION. For someone without a bioinformatics background, I found learning about each tool gave me a clearer idea of what work this kind of project may entail. Interestingly, not one of their tools overlapped.

BIOINFORMATICS SOFTWARE AND DATABASES IN ARGOUT ET AL (2017)

  1. Newbler Assembler — the software package for de novo DNA sequence assembly that came with the discontinued Roche/454 sequencer. Argout et al used the Roche/454 sequencer and Newbler for their 2011 sequencing of the cacao genome (first ever!) https://omictools.com/newbler-tool
  2. Cacao Genome Hub — repository of genomics data from the whole genome sequenced T. cacao, both V1 (Argout et al, 2011) and V2 (Argout et al, 2017). https://cocoa-genome-hub.southgreen.fr
  3. Cutadapt — software that finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads. Argout et al (2017) used it for cleaning of short reads. https://cutadapt.readthedocs.io/en/stable/
  4. fastx_clean — a command in fastxtend, an extension of the FASTX-Toolkit package, a collection of command-line tools for preprocessing short-read FASTA/FASTQ files. The fastx_clean command cleans the reads in FASTQ files (adapters, Ns, quality). FASTQ is a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores. Argout et al (2017) used this command for the trimming steps on the output of their new sequencing of T. cacao on an Illumina HiSeq 2000. http://www.genoscope.cns.fr/externe/fastxtend/
  5. bowtie2 — a tool from Johns Hopkins University for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and at aligning to relatively long (e.g. mammalian) genomes. Argout et al used bowtie2 for trimming the contigs of cacao genome V1. http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
  6. BLAST — The Basic Local Alignment Search Tool (hosted by the National Center for Biotechnology Information (NCBI) of the US National Library of Medicine) finds regions of similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance. Argout et al used BLAST to identify chloroplast and mitochondria contigs. https://blast.ncbi.nlm.nih.gov/Blast.cgi
  7. SSPACE — A stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data. SSPACE was the tool used by Argout et al for scaffolding. https://omictools.com/sspace-tool
  8. GMcloser — Closes gaps with a preassembled contig set or a long read set (i.e., error-corrected PacBio reads). Argout et al used GMcloser as the first step in closing gaps in scaffolds. https://omictools.com/gmcloser-tool
  9. GapCloser — This was the second step in closing gaps in scaffolds. GapCloser is designed to close the gaps that emerge during scaffolding by SOAPdenovo or another assembler, using the abundant pair relationships of short reads. https://www.westgrid.ca/support/software/gapcloser
  10. Tassel 5 GBS v2 pipeline — TASSEL is a Java program used to evaluate traits associations, evolutionary patterns, and linkage disequilibrium. The GBSv2 analysis pipeline is an extension of TASSEL. Argout et al used this software to analyze sequencing fragments. https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline
  11. VCFtools — a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files (e.g., filter out specific variants, compare files, summarize variants, convert to different file types, validate and merge files, create intersections and subsets of variants). Argout et al used VCFtools to filter the variant call data after SNPs were called. https://vcftools.github.io/index.html
  12. JoinMap — Software for the calculation of genetic linkage maps in experimental populations of diploid species. Argout et al used JoinMap for calculating pairwise linkage recombination frequencies. https://www.kyazma.nl/index.php/JoinMap/
  13. Blastn — Blastn (also part of the NCBI’s BLAST), searches nucleotide databases using a nucleotide query. Argout et al used Blastn to transfer the structural annotations from the previously annotated reference genome to the new assembly. https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch
  14. Exonerate — a generic tool for pairwise sequence comparison. It allows you to align sequences using many alignment models, with either exhaustive dynamic programming or a variety of heuristics. Argout et al used Exonerate for quality checks on the genes that were not transferred from V1 to V2. https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate
  15. NCBI Eukaryotic Genome Annotation Pipeline — provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene and the Genome Data Viewer genome browser. Argout et al used the NCBI Eukaryotic Genome Annotation Pipeline to perform new de novo RefSeq structural annotation. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
  16. Blastp — Blastp (also part of the NCBI’s BLAST) searches protein databases using a protein query. Argout et al used Blastp to perform functional annotation for each predicted coding sequence against UniProtKB/ Swiss-Prot and UniProtKB/TrEMBL databases. https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
  17. Uniprot — a comprehensive and freely accessible resource of protein sequence and functional information. This is where Argout et al sourced the protein databases they used for functional annotation (see Blastp entry). https://www.uniprot.org/
  18. InterProScan — allows you to scan your sequence for matches against the InterPro protein signature databases. Argout et al used InterProScan to compare sequences with the InterPro database to obtain additional protein signature information. http://www.ebi.ac.uk/interpro/search/sequence/
  19. BlastKOALA — KOALA (KEGG Orthology And Links Annotation) is Kyoto Encyclopedia of Genes and Genomes (KEGG)’s internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq and GenBank. SSearch uses William Pearson’s implementation of the method of Smith and Waterman (Advances in Applied Mathematics 2; 482–489 (1981)) to search for similarities between one sequence (the query) and any group of sequences of the same type (nucleic acid or protein) as the query sequence. Argout et al used BlastKOALA to reconstruct KEGG pathways. https://www.kegg.jp/blastkoala/
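
Several entries above (Cutadapt, fastx_clean) work on FASTQ files. To make the format concrete, here is a minimal Python sketch that parses four-line FASTQ records and trims low-quality bases from the 3' end. It assumes Phred+33 quality encoding and unwrapped four-line records. The function names, the threshold of 20 and the sample read are made up for illustration, and this is not how Cutadapt or fastx_clean actually implement trimming.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return                      # end of file (or blank line): stop
        seq = handle.readline().rstrip()
        handle.readline()               # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual     # drop the leading '@' from the name

def phred_scores(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]

def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Hypothetical read: six high-quality bases followed by two poor ones.
sample = io.StringIO("@read1\nACGTACGT\n+\nIIIIII#!\n")
for name, seq, qual in read_fastq(sample):
    print(name, *trim_3prime(seq, qual))
```

In the sample read, `I` decodes to a score of 40 while `#` and `!` decode to 2 and 0, so the last two bases are trimmed.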

BIOINFORMATICS SOFTWARE AND DATABASES IN MORRISSEY ET AL (2019)

  1. minimap and miniasm — minimap is a mapper and miniasm is a de novo assembler. Both are designed for noisy long reads, such as those from PacBio and Oxford Nanopore. Morrissey et al used minimap and miniasm as one of their assembly strategies. https://arxiv.org/abs/1512.01801
  2. Canu — a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION). Morrissey et al used Canu as another assembler. https://canu.readthedocs.io/en/latest/
  3. SMARTdenovo — another de novo assembler designed for Oxford Nanopore and PacBio long reads. This is another assembly strategy employed by Morrissey et al. https://omictools.com/smartdenovo-tool
  4. BUSCO v3 — provides quantitative measures for the assessment of genome assembly, gene set, and transcriptome completeness, based on evolutionarily-informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB v9. Morrissey et al used BUSCO v3 for evaluating the contigs of their assembly. https://busco.ezlab.org/
  5. Pilon — a software tool which can be used to automatically improve draft assemblies, as well as find variation among strains, including large event detection. Morrissey et al used Pilon to polish and improve their BUSCO statistics. http://software.broadinstitute.org/software/pilon/
  6. Albacore — a basecaller from Oxford Nanopore. Morrissey et al used Albacore for basecalling of their raw sequencing data. https://omictools.com/albacore-tool
  7. LepMap2 — linkage mapping software that can account for achiasmatic meiosis to gain linkage map accuracy. It includes features for creating ultra-high-density linkage maps, and has been used to construct two high-density linkage maps for nine-spined sticklebacks based on large single nucleotide polymorphism (SNP) panels. Morrissey et al used LepMap2 to construct a maternal genetic linkage map from the SNP genotypes. https://omictools.com/lep-map2-tool
  8. Racon — Generates high-quality consensus sequences using single instruction, multiple data (SIMD) acceleration. Racon has been tested on PacBio and Oxford Nanopore datasets. Morrissey et al used Racon in conjunction with minimap and miniasm for their first de novo assembly strategy. https://omictools.com/racon-tool
  9. nucmer — NUCleotide MUMmer is a part of the MUMmer package, for the rapid alignment of very large DNA and amino acid sequences. Morrissey et al used nucmer to identify contigs representing the chloroplast and mitochondrial genomes, by comparing post-refinement contigs from the minimap assembly to the published cacao organelle genomes. http://nebc.nox.ac.uk/bioinformatics/docs/nucmer.html
  10. Circlator — A tool to circularize genome assemblies. Morrissey et al used Circlator to reassemble contigs representing chloroplast and mitochondrial genomes that they identified using nucmer. https://sanger-pathogens.github.io/circlator/
  11. Nanopolish — The purpose of nanopolish is to improve the consensus accuracy of an assembly of Oxford Nanopore Technology sequencing reads. Morrissey et al used Nanopolish to correct coverages for organelle genomes. https://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html
  12. GraphMap — A mapper for nanopore sequencing reads. GraphMap progressively refines candidate alignments to robustly handle potentially high error rates, and uses a fast graph traversal to align long reads with speed and high precision (>95%). Morrissey et al used GraphMap to align the MinION reads from the chloroplast and mitochondrial genomes to the circularized assemblies. https://omictools.com/graphmap-tool
  13. BWA MEM — one of the algorithms under BWA, a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. Morrissey et al used BWA MEM to remap Illumina short reads to the contigs, as one of the steps in creating a final set of polished contigs. http://bio-bwa.sourceforge.net/
  14. ALLMAPS — A method capable of computing a scaffold ordering that maximizes colinearity across a collection of maps. Morrissey et al used ALLMAPS to compare nonredundant contigs with six published SNP-based linkage maps. https://omictools.com/allmaps-tool
  15. BLAT — part of the UCSC Genome Browser tools, BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. Morrissey et al used BLAT and other UCSC tools to lift over the SNP loci from Matina v1.1 by Argout et al (2010), previously genotyped with an Illumina Infinium array, to the Pound 7 contigs. https://genome.ucsc.edu/
  16. Augustus v3.3.1 — a program that predicts genes in eukaryotic genomic sequences. Morrissey et al used Augustus to predict gene models ab initio on the final draft contigs. http://bioinf.uni-greifswald.de/augustus/
  17. RepeatMasker — a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. Morrissey et al used RepeatMasker to identify repetitive elements using the Viridiplantae library. http://www.repeatmasker.org/
  18. Gepard — allows the calculation of dotplots even for large sequences like chromosomes or bacterial genomes. Morrissey et al used Gepard to visually compare Pound 7 contigs from the MinION assembly to both of the unpublished BAC-based haplotypes. http://cube.univie.ac.at/gepard
  19. Mauve — a system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Morrissey et al used Mauve to align the Pound 7 contigs to their corresponding BAC haplotype. http://darlinglab.org/mauve/mauve.html
  20. Biostrings package in R — allows manipulation of strings in R. Morrissey et al used in-house scripts based on the Biostrings package in R to summarize alignments created with Mauve. https://www.rdocumentation.org/packages/Biostrings/versions/2.40.2
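
A dot plot, the kind of picture Gepard draws, is simple enough to sketch in a few lines. The toy Python version below marks every position where a short word (k-mer) from one sequence also occurs in the other, so a shared stretch shows up as a diagonal. The sequences, the word size of 3 and the function name are invented for illustration, and real tools use far more efficient methods to cope with chromosome-sized inputs.

```python
def dotplot(seq_a, seq_b, k=3):
    """Return a grid where grid[i][j] is True if the k-mers starting at
    seq_a[i] and seq_b[j] are identical."""
    rows = len(seq_a) - k + 1
    cols = len(seq_b) - k + 1
    return [[seq_a[i:i + k] == seq_b[j:j + k] for j in range(cols)]
            for i in range(rows)]

# Hypothetical sequences: b is a substring of a, starting at position 2.
a = "GATTACAGATT"
b = "TTACAGA"
grid = dotplot(a, b)
for row in grid:
    print("".join("*" if hit else "." for hit in row))
```

Because `b` sits inside `a` starting at position 2, the printed grid shows a diagonal of `*` beginning on the third row.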

Join the Morrissey et al cover band! Email me at bio@corazo.org
