Improving Computational Methods for High-throughput Sequence Data Analysis
Heng Li (Dana-Farber Cancer Institute)
To maintain and improve the three proposed software projects: minimap2, BWA and hifiasm, and extend them to new architectures and new data types.
minimap2 is the dominant sequence aligner for long reads. It was optimized for long reads of 85–95 percent accuracy. Although minimap2 works with the long accurate reads produced today, it does not take full advantage of modern data. This work will improve performance and accuracy for long accurate reads and for long sequence assemblies, in particular around long segmental duplications and in long repetitive regions.
With the release of BWA-MEM2, this work will continue to maintain and improve BWA, and add ARM64 support for Apple M1 and recent ARM-based servers. In collaboration with the Intel research lab, the team will explore faster indexing algorithms to replace the current one in BWA-MEM2.
hifiasm has been rapidly adopted in the community and will likely have a significant impact in the next few years. At present, hifiasm only works with PacBio’s High-Fidelity (HiFi) long reads. With Oxford Nanopore’s new chemistry and new base-calling algorithm, which can bring the average base accuracy to 99 percent, the team plans to adapt the hifiasm algorithm to Nanopore data. In addition to supporting Nanopore data, the team plans to integrate Hi-C sequence data into the assembly process, which can be achieved by mapping Hi-C reads to the hifiasm assembly graph and phasing unitigs using Hi-C’s long-range information.