Data Science Geospatial Open Source Matt Hanson

Deconstructing Analysis-Ready Data

01.08.2022

The 4th Analysis Ready Data (ARD) Workshop was held virtually in October 2021. Consisting of ten-minute lightning talks packed into one-hour sessions, the workshop brought together participants from government and industry to present developments in “Analysis Ready Data”. Each day followed a specific theme, such as calibration and validation, standards, time series, optical, SAR, or vegetation.

The problem? There is little consensus, much less a standard, on what Analysis Ready Data even is. While the last 4 years of ARD workshops have been great in showcasing progress that governmental, research, and industry groups have been making, it’s as clear as ever that community work is needed to define a standard nomenclature and specification on how to describe data quality, validation, and provenance. It is only through a good characterization of the data that users can determine how ready the data is for their analytic needs, and how any uncertainty might affect results.

This week the 20th Annual Joint Agency Commercial Imagery Evaluation (JACIE) workshop will be held, where we’ll be continuing many of the same conversations and themes. In advance of JACIE, this post continues the theme I presented at the ARD workshop (“What is ARD?”), introduces the idea of Analysis Ready Metadata, and recommends next steps.

Landsat—the original ARD

The first use of the term “Analysis-Ready Data” for geospatial imagery was the Landsat ARD product, released in 2017. This product was an attempt to harmonize data across all Landsat missions, atmospherically corrected to surface reflectance, and tiled to a common pixel-aligned grid.

The ARD are available to significantly reduce the burden of pre-processing on users of Landsat data.
—Landsat ARD

This facilitated the creation of time series without any additional processing, since a single pixel was a fixed location and size through the entire Landsat archive, allowing users to treat all the Landsat satellites as a single “virtual” constellation. As opposed to real constellations where multiple satellites carrying equivalent instruments are operated together as part of a single mission (e.g., Aqua and Terra MODIS instruments, Planet Doves), the Landsat ARD product is an effort to harmonize data from across similar types of instruments.
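To make the value of a pixel-aligned grid concrete, here is a minimal sketch of how a shared grid turns time series construction into a simple lookup. The grid origin, the 30 m pixel size, and the coordinates are invented for illustration and are not the actual Landsat ARD tiling parameters.

```python
import math


def snap_to_grid(x, y, origin_x, origin_y, pixel_size):
    """Return (col, row) of the grid cell containing map coordinate (x, y).

    The origin is the grid's upper-left corner; rows increase southward.
    """
    col = math.floor((x - origin_x) / pixel_size)
    row = math.floor((origin_y - y) / pixel_size)
    return col, row


# Hypothetical grid definition (assumption, not the real Landsat ARD grid).
ORIGIN_X, ORIGIN_Y, PIXEL = 0.0, 3_000_000.0, 30.0

# Two acquisitions from different dates whose samples fall a few metres apart.
cell_a = snap_to_grid(15_004.0, 2_999_950.0, ORIGIN_X, ORIGIN_Y, PIXEL)
cell_b = snap_to_grid(15_010.0, 2_999_945.0, ORIGIN_X, ORIGIN_Y, PIXEL)

# Both land in the same cell, so their values can be stacked directly.
time_series = {}
time_series.setdefault(cell_a, []).append(("2020-06-01", 0.21))
time_series.setdefault(cell_b, []).append(("2020-06-17", 0.23))
print(cell_a, cell_b, time_series[cell_a])
```

Because every scene is resampled to the same cells ahead of time, the user never has to co-register pixels before comparing dates.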

CARD4L—One ARD specification

While these workshops were taking place and the Landsat product was being released, the Committee on Earth Observation Satellites (CEOS) was working on the CEOS ARD for Land (CARD4L) specifications, of which there are several: Surface Reflectance, Surface Temperature, and two for Radar.

CEOS Analysis Ready Data for Land (CARD4L) are satellite data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets.
—CEOS ARD

CARD4L is an important step toward defining minimum standards for a specific class of workflows (i.e., algorithms): those that require data processed into equivalent ground measurements (e.g., surface reflectance), especially for time series. CARD4L specifies a minimum level of processing and accuracy, as well as what metadata must be made available. However, it does not specify how metadata should be included or referenced. As a result, the assessment process is currently largely manual and time-consuming, since the lack of a metadata standard prevents automated validation.

Partly due to this difficulty, the only datasets with a successful CARD4L assessment are currently Landsat Collection 2 Surface Reflectance and Surface Temperature. They meet the CARD4L ‘threshold’ (minimum) requirements, but not all of the ‘target’ (desired) requirements. The lack of approved datasets limits the overall usefulness of CARD4L as an interoperability metric, and the specification would benefit greatly from metadata standardization.
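The ‘threshold’ versus ‘target’ distinction can be sketched as a simple roll-up of per-requirement results. This is only an illustration under the assumption that a dataset must meet every threshold requirement to pass and every target requirement to reach the higher level; the requirement names and results below are invented and do not reproduce an actual CARD4L assessment.

```python
def assessment_level(results):
    """Summarize per-requirement results as 'target', 'threshold', or None.

    `results` maps a requirement name to a (meets_threshold, meets_target) pair.
    """
    if not all(meets_threshold for meets_threshold, _ in results.values()):
        return None  # at least one minimum requirement is not met
    if all(meets_target for _, meets_target in results.values()):
        return "target"
    return "threshold"


# Hypothetical results for an imaginary dataset.
example = {
    "atmospheric_correction": (True, True),
    "geometric_accuracy": (True, False),
    "brdf_correction": (True, False),
}
print(assessment_level(example))
```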

Of particular note is that CARD4L does not include any requirements on file format, tiling scheme, pixel alignment or anything related to distribution and access. More on this later.

What does ARD mean?

By 2018 this powerful concept was gaining momentum. Of course users want “analysis ready” data: it would mean less data wrangling to get it into an immediately usable form. Chris Holmes wrote a blog post, “ARD Defined”, largely informed by the Landsat ARD product and community consensus at the time.

The following table summarizes those requirements, alongside the Landsat ARD product and CARD4L specifications. The requirements are grouped here into thematic categories: ‘processing’, ‘metadata’ and ‘formatting’. Note that these current definitions are focused on optical data.

	ARD Defined	Landsat ARD	CARD4L
Processing	– Atmospherically corrected – Georegistered – BRDF corrected – Sensor alignment	– Atmospherically corrected – Georegistered	– Atmospherically corrected – Georegistered – BRDF corrected (target) – Sub-pixel relative (threshold) or absolute (target)
Metadata	– Cloud mask	– Cloud mask – Sun angles – and more	– Cloud mask – Sun angles – Provenance – Bandpasses – and more
Formatting	– Tiled and pixel-aligned grid – Common projection	– Tiled and pixel-aligned grid – Common projection	– None

There is agreement on several points: ARD is atmospherically corrected, it has been georegistered accurately enough that pixels line up for time series, and it contains metadata for masking out clouds.

But the table also highlights some interesting differences. Both the Holmes definition and the CARD4L target spec require a BRDF correction, while that was not included in the Landsat ARD product. CARD4L requires a variety of metadata to be included for the dataset (e.g., links to documentation, spectral bandpasses), some of which certainly seem useful to users. CARD4L also puts no requirements on how the data should be formatted for distribution to users.

What should ARD mean?

The ARD definitions described above have certain limitations, and may reflect an overly narrow view of what ARD should be. By requiring cloud masks and atmospheric correction, we risk ignoring a plethora of use cases that have different requirements. Water analysts may not want atmospheric correction applied because it risks removing the water signal of interest. An algorithm that generates metrics and measurements of clouds, or creates a novel cloud mask, doesn’t need an existing cloud mask. Feature extraction algorithms, such as those deriving roads or buildings from single images, have no need for pixel-aligned grids, may perform fine without atmospheric correction, and may be resistant to the presence of clouds.

Let’s take a look back at the Landsat and CARD4L definitions for ARD and focus on the similar parts of the definition:

The ARD are available to significantly reduce the burden ~~of pre-processing~~ on users ~~of Landsat data~~.
—~~Landsat~~ ARD

~~CEOS~~ Analysis Ready Data ~~for Land (CARD4L)~~ are satellite data that ~~have been processed to a minimum set of requirements and organized into a form that~~ allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets.
—~~CEOS~~ ARD

We can see the overall sentiment: ARD should minimize the burden of repetitive and boilerplate tasks on the users of the data, saving time and compute resources. The term ARD so far has really been used as an umbrella term to describe any efforts in improving the quality, characterization, use and interoperability of measured geospatial data. And by interoperability, we mean that different datasets can be used in the same processing workflow without adding specific handling to the workflow.

ARD means easy discovery, access, and exploitation of measured geospatial data.

Taking this one step further, I’d point out that ARD is really just another way to describe a cloud-native geospatial ecosystem of tools and data. If data is Analysis Ready, it means we are able to find it, access it in a scalable way, and exploit it using generalized algorithms as autonomously as possible.

Deconstructing ARD

However, users currently have no guarantee that an “ARD” dataset has been processed in a particular way. In his talk, “Are Analysis Ready Data Really Ready?”, Mark Friedl demonstrated that ARD comes with no guarantee of interoperability across different workflows. It may be BRDF corrected or not, or may be topographically corrected…or not. To confuse things even more, the previous definitions of ARD conflate components that really need to be treated separately: ‘processing’, ‘metadata’, and ‘formatting’.

Processing Requirements

The predominant view of ARD is that it meets some sort of minimum processing “level” (e.g., Level 2 surface reflectance). However, as noted above, the appropriate level depends on the workflow the data will be used in. Some workflows may be flexible in what they need, while others may require specific processing steps or have a maximum tolerance of uncertainty in geometric or measurement accuracy.

It is the workflows that should determine what processing is required to produce a valid output.
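One way to picture this is to let each workflow declare what it needs and then test a dataset's description against that declaration. The sketch below is hypothetical throughout: the `WorkflowRequirements` class, the step names, and the metadata keys (`processing_steps`, `geolocation_error_m`) are invented for illustration and do not come from any existing specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkflowRequirements:
    """What a workflow needs from its input data (hypothetical structure)."""
    name: str
    required_steps: frozenset = field(default_factory=frozenset)
    max_geolocation_error_m: Optional[float] = None  # None means no limit

    def unmet(self, metadata):
        """Return human-readable reasons the data does not suit this workflow."""
        problems = []
        applied = set(metadata.get("processing_steps", []))
        for step in sorted(self.required_steps - applied):
            problems.append(f"missing processing step: {step}")
        if self.max_geolocation_error_m is not None:
            error = metadata.get("geolocation_error_m")
            if error is None:
                problems.append("geolocation error not reported")
            elif error > self.max_geolocation_error_m:
                problems.append(
                    f"geolocation error {error} m exceeds "
                    f"{self.max_geolocation_error_m} m"
                )
        return problems


# Invented description of a scene that has only been georegistered.
scene = {"processing_steps": ["georegistration"], "geolocation_error_m": 8.0}

time_series = WorkflowRequirements(
    name="time series",
    required_steps=frozenset({"georegistration", "atmospheric_correction"}),
    max_geolocation_error_m=12.0,
)
building_extraction = WorkflowRequirements(name="building extraction")

print(time_series.unmet(scene))
print(building_extraction.unmet(scene))
```

The same scene is rejected by the time series workflow and accepted by the feature extraction workflow, which is the point: readiness is a property of the data and the workflow together.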

Metadata Requirements

Implied in having processing requirements is that the processing is described in metadata. ARD must be well-characterized, preferably including both what has been done and the uncertainties in measurement and geolocation accuracy. Including lineage information and source data links is also important to support reproducible workflows. Indeed, if the community is ever to exploit ARD, it must first determine how to describe processes and provenance for what has been done to the data. This allows workflows to assess whether input data meets the necessary requirements.

The SpatioTemporal Asset Catalog (STAC) specification, now at version 1.0.0, serves as a great vehicle for describing ARD-relevant metadata. In fact, there is a STAC CARD4L extension in development showing mappings between CARD4L requirements and fields in the STAC core spec and extensions. However, since CARD4L is just one set of requirements, users would be better served if there were a more generic metadata specification for characterizing ARD. This would enable any workflow or specification, CARD4L or otherwise, to provide a validator to perform automatic assessments.
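As a rough sketch of what such a validator might look like, the example below checks a trimmed STAC Item for the fields a given profile expects. The Item and its values are invented. The field names are taken from the STAC `eo`, `view`, and `processing` extensions, but the two profiles are hypothetical and are not defined by STAC or CARD4L.

```python
import json

# A trimmed, invented STAC Item; the values are illustrative only.
ITEM = json.loads("""
{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "example-scene",
  "properties": {
    "datetime": "2021-10-05T10:30:00Z",
    "eo:cloud_cover": 12.5,
    "view:sun_azimuth": 150.2,
    "view:sun_elevation": 45.1,
    "processing:level": "L2"
  }
}
""")

# Hypothetical requirement profiles (assumption: not part of any spec).
PROFILES = {
    "surface-reflectance-time-series": [
        "datetime", "eo:cloud_cover", "view:sun_azimuth",
        "view:sun_elevation", "processing:level", "processing:lineage",
    ],
    "single-image-feature-extraction": ["datetime"],
}


def missing_fields(item, profile):
    """List the property names a profile expects that the Item lacks."""
    properties = item.get("properties", {})
    return [name for name in PROFILES[profile] if name not in properties]


for profile in PROFILES:
    print(profile, "->", missing_fields(ITEM, profile) or "ok")
```

A real validator would also check values and units rather than mere presence, but even this level of automation is impossible while the metadata has no agreed machine-readable form.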

Formatting Requirements

If one of the main goals of ARD is to provide an easy path for data exploitation, then any ARD should support the use of cloud-native processing. File assets should be made available in cloud-friendly file formats, such as Cloud-Optimized GeoTIFF (COG) or Zarr. Beyond that, there should be no specific projection, tiling, or grid requirements, as these require making assumptions about users. After all, as soon as a data provider decides on a suitable projection for ARD, it will be the wrong one for someone. The best projection is probably the one that minimizes resampling errors in the process.

It is no oversight that the CARD4L specification does not include any formatting requirements. Geoscience Australia, one of the main champions of CARD4L, has long been utilizing methods to dynamically create datacubes (over space and time) via the OpenDataCube project. By enabling users to dynamically define projections and resolutions, and to mix together data in multiple projections and resolutions, OpenDataCube makes the requirement that data be available as pixel-aligned tiles a non-issue: these become implementation details that are transparent to the end user.

Analysis-Ready Metadata

A necessary first step to make ARD an exploitable concept for users is to create an Analysis Ready Metadata specification to define how the data should be characterized with regard to its input source, processing applied, and uncertainties associated with the measurements.

Analysis Ready Metadata describes Analysis Ready Data for use in automated workflows.

The development of an Analysis Ready Metadata (ARM) specification underscores that data analysis should start with a thorough description of the data through metadata. The CARD4L STAC extension is a good starting point, but does not cover all use cases.

This week, JACIE 2022 will provide an opportunity for sharing ideas on how to coordinate efforts over the next year. My next blog post will dive into more detail and propose a roadmap.

Interested in helping? An ARM specification will require stakeholders from across the industry to help define the spec, contribute workflow requirements, and create implementations of workflows and validators. Let us know on Twitter: @GeoSkeptic and @Element84.

Special thanks to Erin Robinson, Chris Holmes, Arthur Elmes, Ariel Walcutt, Jeff Siarto, and Dan Pilone for helping put this post together.

Header image credit: Kjer Glacier; NASA Earth Observatory