Shasta: A De Novo Genome Assembler for Long-Read DNA Sequencing Technology

An abstract representation of DNA.

Traditionally, genomics research has relied exclusively on a reference genome drawn from a small group of individuals to represent an entire species. In 2017, researchers at the University of California, Santa Cruz (UCSC) demonstrated that long-read human genome assembly using nanopore technology was possible without a reference genome, but the assembly took hundreds of thousands of compute hours to complete. A year later, the group reached a then-unprecedented milestone: reference-free (de novo) sequencing of 11 human genomes in nine days.

To help advance and scale nanopore sequencing, an extensive team of researchers and developers led by Paolo Carnevali at CZI and Benedict Paten at UCSC built Shasta, an assembler driven by in-memory computing that can complete a de novo human genome assembly (one built from scratch, without a prior reference genome) in just a few hours.

Developed in partnership with researchers and developers from the UC Santa Cruz Genomics Institute, Shasta gives researchers vital insights into the human genome in a fraction of the time and cost of traditional methods. This paper in Nature Biotechnology details how Shasta not only yields accuracy comparable to or better than that of similar assemblers, but also produces the lowest number of misassemblies.

While the human genome is Shasta’s primary focus, external scientific groups have used Shasta to assemble the genomes of a wide range of species, including human, plant and animal cultivars, and rare and endangered species. This exploratory co-development collaboration also resulted in 10 software releases and several scientific papers. The Shasta source code is public and can be forked by other teams that wish to continue active development, and past software releases remain available on GitHub.

Oct 26, 2020