A data coordination plan, developed jointly by the HCA Consortium and partner institutions, will determine how data is ingested, processed, and shared. Complete details will be provided in an official data coordination document; key points are emphasized below.
All raw experimental data (e.g. FASTQ files for sequencing data) and comprehensive metadata must be submitted in a timely fashion, and prior to publication, to a common authenticated data ingestion service, either directly through the official Human Cell Atlas web portal or through a registered data broker. Upload will be made possible through a variety of mechanisms, including web-based UIs, spreadsheet templates, or direct posting via APIs (for bulk uploads or integration with lab systems). Data should be deposited at least quarterly, and ideally no later than one month after acquisition.
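For API-driven bulk uploads, a submission would typically pair each raw data file with its metadata and an integrity checksum. The sketch below shows what assembling such a submission manifest might look like; the field names and required-metadata set are illustrative assumptions, not the HCA's actual submission schema.

```python
import hashlib
import json
from pathlib import Path

# Illustrative required fields; the real HCA metadata schema is richer.
REQUIRED_FIELDS = {"project", "sample_id", "assay_type", "species"}

def build_submission_manifest(files, metadata):
    """Assemble a manifest pairing raw data files (e.g. FASTQs) with their
    metadata, ready for posting to an ingestion API. Raises if required
    metadata fields are missing."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        raise ValueError(f"missing required metadata: {sorted(missing)}")
    entries = []
    for path in files:
        data = Path(path).read_bytes()
        entries.append({
            "filename": Path(path).name,
            "size_bytes": len(data),
            # A checksum lets the ingestion service verify upload integrity.
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"metadata": metadata, "files": entries}
```

A lab system could serialize this manifest with `json.dumps` and post it, together with the files, to the authenticated ingestion endpoint.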
The HCA Consortium will govern format specifications and validation, including basic quality assurance. These requirements will be described in publicly accessible documents, and the relevant code will be available in public repositories (e.g. on GitHub). All validation required by the data ingestion service will be made available in the form of tools that can be run locally prior to data upload.
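A locally runnable validator of the kind described above might, at minimum, check the structural integrity of a FASTQ file before upload. The following is a minimal sketch of such a check, not the consortium's actual validation code:

```python
def validate_fastq(lines):
    """Minimal structural check for FASTQ records (four lines per record:
    '@' header, sequence, '+' separator, quality string of matching length).
    Returns a list of error messages; an empty list means the input passed."""
    errors = []
    if len(lines) % 4 != 0:
        errors.append("line total is not a multiple of 4")
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            errors.append(f"line {i + 1}: header must start with '@'")
        if not plus.startswith("+"):
            errors.append(f"line {i + 3}: separator must start with '+'")
        if len(seq) != len(qual):
            errors.append(f"line {i + 2}: sequence and quality lengths differ")
    return errors
```

Running the same checks locally that the ingestion service runs server-side lets labs catch malformed files before committing to a large upload.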
After upload, data will be deposited into a primary data store and synced across multiple public cloud repositories. One or more secondary analysis pipelines (e.g. alignment for sequencing data) will be run on all submitted data, generating intermediate results (e.g. BAM files, gene-cell tables) and quality control metrics, which will be deposited into the same data store with appropriate metadata and analysis provenance. Where necessary, different pipelines will be made available for different data types and sequencing approaches. All code for these secondary analysis pipelines will be open-source (e.g. MIT licensed), available in public code repositories, and provided in a form that is easy for labs to reproduce locally or in the cloud (e.g. with containerization).
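Attaching analysis provenance to pipeline outputs might look like the sketch below: each step records its name, pipeline version, and pinned container image alongside the inputs it consumed. The record structure and field names are assumptions for illustration, not a fixed HCA schema.

```python
import datetime

def run_with_provenance(step_name, version, container_image, inputs, step_fn):
    """Run one pipeline step and attach a provenance record to its outputs,
    so derived results (e.g. BAMs, gene-cell tables) remain traceable to
    the exact code and container that produced them."""
    started = datetime.datetime.now(datetime.timezone.utc).isoformat()
    outputs = step_fn(inputs)
    return {
        "outputs": outputs,
        "provenance": {
            "step": step_name,
            "pipeline_version": version,
            # Pinning the container image is what makes the run reproducible
            # both locally and in the cloud.
            "container_image": container_image,
            "inputs": list(inputs),
            "started_utc": started,
        },
    }
```

Because the container image and pipeline version are captured with every derived file, any lab can re-run the identical analysis from the public repositories.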
Raw data, metadata, and derived results in the data store will be freely and openly available to all downstream users. This architecture will support the development of a rich ecosystem of linked tools for analysis, visualization, and complex queries, which will in turn be available to all researchers.
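The kind of programmatic query such an ecosystem would build on can be sketched as a simple metadata filter over data-store records; this toy in-memory version stands in for the richer query APIs the actual tools would expose, and the record fields are hypothetical.

```python
def query_store(records, **filters):
    """Return the data-store records whose metadata matches every
    requested field exactly. A minimal stand-in for a real query API."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    ]

# Hypothetical records with illustrative metadata fields.
records = [
    {"file": "s1.bam", "organ": "lung", "assay_type": "scRNA-seq"},
    {"file": "s2.bam", "organ": "liver", "assay_type": "scRNA-seq"},
    {"file": "s3.bam", "organ": "lung", "assay_type": "snATAC-seq"},
]
```

Downstream tools would layer visualization and analysis on top of exactly this sort of open, filterable access.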