Improving Product-space Forms for Single-Cell Data Representation and Understanding
Project Summary
Single-cell genomic technologies have enabled researchers to probe fundamental properties of cellular states, trajectories, functions, repair, and response mechanisms. For example, recent studies have used scRNA-seq and scATAC-seq to measure gene expression levels and chromatin accessibility, both of which are important in comparative genomics, disease diagnostics, and, more broadly, personalized medicine. Multiomics single-cell data have highly heterogeneous features and high dimensionality, and are therefore hard to impute, aggregate (fuse), and process accurately. Data from scRNA-seq, scATAC-seq, ChIA-Drop, and many other single-cell assays can each be merged and embedded with small distortion into low-dimensional “curved spaces,” which include traditional Euclidean (flat) spaces as well as spherical and hyperbolic spaces. Spherical spaces are the spaces of choice for periodic or cyclical components of data, while hyperbolic spaces are used for embedding tree-like and hierarchical data structures. Embeddings in products of different curved spaces allow for natural aggregation and dimensionality reduction strategies informed by the geometry of the data, and they significantly improve the performance and scalability of many basic and advanced learning methods, such as classification, clustering, and regression.
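As an illustration of what a product of curved spaces means in practice, the following minimal sketch computes the distance between two points in a product of one Euclidean, one spherical, and one hyperbolic component. It is not the project's method: the function names are hypothetical, the curvatures are fixed at +1 (unit sphere) and -1 (Poincaré ball model) rather than learned, and the component distances are combined by the standard root-sum-of-squares rule for product metric spaces.

```python
import math


def euclidean_dist(x, y):
    """Straight-line distance in flat (zero-curvature) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def spherical_dist(x, y):
    """Geodesic (great-circle) distance on the unit sphere.

    Assumes x and y are unit-norm vectors (curvature +1).
    """
    dot = sum(a * b for a, b in zip(x, y))
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, dot)))


def poincare_dist(x, y):
    """Geodesic distance in the Poincaré ball model (curvature -1).

    Assumes x and y have Euclidean norm strictly less than 1.
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def product_dist(p, q):
    """Distance in the product space E x S x H.

    p and q are triples (euclidean_part, spherical_part, hyperbolic_part).
    The squared product distance is the sum of squared component distances.
    """
    components = (
        euclidean_dist(p[0], q[0]),
        spherical_dist(p[1], q[1]),
        poincare_dist(p[2], q[2]),
    )
    return math.sqrt(sum(d * d for d in components))


# Hypothetical example points: a 2-D flat part, a point on the unit
# circle, and a point inside the unit disk.
p = ((0.0, 0.0), (1.0, 0.0), (0.0, 0.0))
q = ((3.0, 4.0), (0.0, 1.0), (0.5, 0.0))
d = product_dist(p, q)
```

In this sketch, a cyclical signal would be placed in the spherical component and a hierarchical one in the hyperbolic component, so each contributes to the overall distance according to its own geometry.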
This project aims to develop new variational autoencoders for embedding heterogeneous single-cell data into product spaces of appropriate mixed curvatures and dimensions, both of which are learned during embedding; design specialized imputation algorithms that follow the geometry of the data; implement new product-space learning algorithms; and test the software on single-cell multiomics datasets.