SPAdes and QUAST Toolkits For Genome Sequence Assembly and Analysis
Anton Korobeynikov (Saint Petersburg State University)
To turn the SPAdes and QUAST codebases into scalable, modular, extensible and user-friendly frameworks that will streamline future research and development in genome assembly, analysis and quality assessment.
SPAdes is a versatile de novo genome assembler that supports different input formats and modes of operation. To date, SPAdes incorporates more than 70 man-years of development effort, including implementations of various methods and algorithms for sequence assembly and analysis. Many of these methods can be readily used outside of SPAdes. This proposal aims to improve the current SPAdes and QUAST codebases by refactoring, cleaning and resolving issues accumulated during 10 years of paper-driven research and development, leading to a modular and modern codebase of significantly higher quality that will simplify code reuse and external contributions. Additionally, the team will revamp the user documentation and develop new tutorials to improve the user experience and to transfer knowledge from the toolkit developers to the community.
QUAST is a popular toolkit for genome assembly evaluation and analysis; the tool is organized as a multi-step pipeline. In this project, we propose to rebuild the QUAST codebase on top of a modern pipeline engine, such as Snakemake or Nextflow. We plan to split the computational process into multiple subtasks and let the workflow engine orchestrate these tasks, including resuming terminated steps, parallel execution and cluster processing. The codebase will also be cleaned up and upgraded to Python 3 to eliminate unnecessary legacy code and facilitate community-driven toolkit development.
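To illustrate the kind of orchestration described above, the following is a minimal sketch in plain Python rather than in Snakemake or Nextflow. It is not QUAST code: the step names, the `.done` marker files, and the `run_step` and `run_pipeline` helpers are all hypothetical and chosen only for illustration. It shows subtasks with declared dependencies, concurrent execution of independent subtasks, and resumption that skips subtasks already completed in an earlier run.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical step names mapped to their dependencies.
# They do not correspond to actual QUAST modules.
STEPS = {
    "align_contigs": [],
    "basic_stats": [],
    "find_misassemblies": ["align_contigs"],
    "report": ["basic_stats", "find_misassemblies"],
}


def run_step(name, workdir):
    """Placeholder for real work; writes a marker file on success."""
    (workdir / f"{name}.done").write_text("ok\n")
    return name


def run_pipeline(steps, workdir, max_workers=2):
    """Run steps in dependency order, skipping those already completed.

    Returns the list of steps actually executed in this invocation.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    # Resume support: any step with an existing marker counts as done.
    done = {s for s in steps if (workdir / f"{s}.done").exists()}
    executed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(steps):
            # A step is ready once all of its dependencies are done.
            ready = sorted(
                s for s, deps in steps.items()
                if s not in done and all(d in done for d in deps)
            )
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Steps in the same wave are independent and run concurrently.
            for finished in pool.map(lambda s: run_step(s, workdir), ready):
                done.add(finished)
                executed.append(finished)
    return executed
```

Calling `run_pipeline(STEPS, Path("quast_run"))` twice executes all four steps the first time and none the second; deleting a single marker file causes only that step to be rerun. A workflow engine provides the same behavior declaratively, by tracking each rule's inputs and outputs, and adds cluster submission, which this sketch does not attempt.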