Zarr: A Common Backbone for the Scalable Storage of Annotated Tensor Data
Proposal Summary
To establish Zarr as a common, cross-community mechanism for storing collections of annotated tensors with consistent access for both local and large-scale cloud data.
Project
A key feature of the Python data ecosystem is its reliance on simple but efficient primitives whose well-defined interfaces combine into seamless user workflows. NumPy provides an in-memory representation for tensors. Dask provides parallelization of tensor access. Xarray provides metadata linking tensor dimensions. Zarr provides a missing feature: scalable, persistent storage for annotated hierarchies of tensors. To date, developers have organically integrated Zarr into their own APIs as a natural fit for this missing feature. An investment now can greatly improve the usability of data workflows, unify the APIs across the various projects, and, most importantly, build trust in the format as a long-term storage mechanism through format standardization, compatibility tests, and data validation.
Beyond the Python data ecosystem, the sharing of n-dimensional data, whether in the cloud or locally, can demonstrably be made easier, faster, and more scalable to the benefit of all science. The Zarr team’s goal is for programmers, publishers, and even cloud platforms to adopt Zarr as a consistent data access backbone, removing the ubiquitous data format question and thereby lowering the burden on end users. The simplicity of the Zarr protocol makes it both friendly and future-proof, but it carries the risk of multiplying variants unless there is a strong and open dialogue in the community.