Gene Set Enrichment Analysis for Single-Cell Data (scGSEA)
Project Summary
Given the utility of Gene Set Enrichment Analysis (GSEA) in profiling pathway and process activation in gene expression data from bulk microarray and RNA-sequencing assays, there is strong interest in assessing the degree of pathway and process activation in individual cells from single-cell RNA-seq (scRNA-seq) data. Potential applications include the identification of novel cellular subtypes based on the activity of specific molecular pathways, and the process-level characterization of complex cellular relationships such as those in the tumor microenvironment. The single-sample version of GSEA (ssGSEA) can often be used to good effect for this purpose, but the sparsity of scRNA-seq datasets, in which many genes have zero counts in any given cell, may introduce uncertainty in ssGSEA scores.
This project aims to develop and distribute scGSEA, a new version of ssGSEA specifically tailored for use with single-cell data. The work will begin with a benchmarking phase to assess the performance and stability of the approach on multiple existing single-cell datasets while varying the scRNA-seq expression normalization method, the gene set size, the expression distribution of the data, and the enrichment scoring statistic. With this insight, the new scGSEA approach will be implemented and distributed to investigators worldwide as Python and R packages, as well as in the GenePattern and GenePattern Notebook environments. The scGSEA code will optimize performance via multithreading and will support sparse matrix representations. All versions will provide documentation, including guidance for investigators on best practices for its use.
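To make the starting point concrete, the following is a minimal Python sketch of a simplified ssGSEA-style score computed per cell from a sparse cells-by-genes matrix. It is an illustration only, not the scGSEA method, which this summary does not specify. The function names ssgsea_style_score and score_cells are hypothetical, and the rank-weight exponent alpha = 0.25 is an assumption modeled on the value commonly used with ssGSEA. Multithreading is omitted.

```python
import numpy as np
from scipy import sparse


def ssgsea_style_score(expr_row, in_set, alpha=0.25):
    """Simplified ssGSEA-style score for one cell (illustrative only).

    expr_row: 1-D dense array of expression values, one per gene.
    in_set:   boolean mask, True for genes that belong to the gene set.
    alpha:    exponent applied to the rank-based weights (assumed value).
    """
    n_genes = expr_row.size
    # Order genes from highest to lowest expression. A stable sort keeps
    # tied genes (such as the many zeros in scRNA-seq data) in input order,
    # so the ranking among ties is arbitrary with respect to biology.
    order = np.argsort(-expr_row, kind="stable")
    # Rank-based weight: the top gene gets n_genes, the bottom gene gets 1.
    weights = np.arange(n_genes, 0, -1, dtype=float) ** alpha
    hits = in_set[order]
    if not hits.any() or hits.all():
        return np.nan  # undefined: empty set, or set covers every gene
    hit_weights = np.where(hits, weights, 0.0)
    # Weighted cumulative distribution of set genes down the ranked list.
    p_hit = np.cumsum(hit_weights) / hit_weights.sum()
    # Unweighted cumulative distribution of genes outside the set.
    p_miss = np.cumsum(~hits) / (~hits).sum()
    # Sum of the differences between the two distributions over all ranks.
    return float(np.sum(p_hit - p_miss))


def score_cells(matrix, in_set, alpha=0.25):
    """Score every cell (row) of a sparse cells-by-genes matrix."""
    matrix = sparse.csr_matrix(matrix)
    scores = np.empty(matrix.shape[0])
    for i in range(matrix.shape[0]):
        # Densify one row at a time so the full matrix stays sparse.
        row = matrix.getrow(i).toarray().ravel()
        scores[i] = ssgsea_style_score(row, in_set, alpha)
    return scores
```

In this sketch, a cell whose gene set members sit near the top of its expression ranking receives a positive score, and a cell whose members sit near the bottom receives a negative one. Because zero-count genes are tied, their relative order is arbitrary, which is one way sparsity can make such scores unstable.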