Harmony Powers Robust and Scalable Single-Cell Integration
Open-source single-cell data now represents a broad survey of cellular phenotypes across donors, tissues, and diseases. Single-cell integration is the best strategy for turning these increasingly large and complex datasets into novel biological insights. To do so, integration algorithms must scale to billions of cells, produce robust results across a range of biological systems, and simultaneously model multiple sources of technical and biological variation. Harmony is a popular and well-benchmarked integration algorithm that already scales to one million cells on a standard laptop. Unlike other methods, which are limited to integration over a single dimension, Harmony can model multiple sources of variation simultaneously.
This project outlines three independent strategies for improving Harmony to meet the challenges of modern data. First, the team will enable Harmony to scale to one billion cells and 10,000 datasets, making it useful for both small pilot projects and massive atlas-sized reference building. Second, the team will automate the selection of model hyperparameters, making Harmony more reproducible and robust to diverse study conditions. Finally, the project will develop a curated compendium of representative Harmony integration analyses that demonstrate how to compare important biological states, such as disease and tissue association, accurately and robustly. These templates will also serve as educational tools and establish the first set of multivariate single-cell integration benchmarks. Given Harmony's widespread adoption, these improvements will have an immediate impact on the single-cell genomics community and enable larger and more complex analyses of open-source single-cell data.