Starting this thread to keep track of preblock implementation as part of the refactored datasets and preprocessing pipeline. The general idea with preblocks is to have PyTorch layers that handle data transforms and scaling and can be combined into one exportable workflow.
The original Dataset implementation in CREDIT folds loading from disk, combining different types of variables, and normalization into one class that performs all of these tasks through a confusing web of function calls. See ERA5_Multistep_Batcher for an example. In the new data processing paradigm, we want to split the functionality into the following steps:
- Source Datasets: each of these uses a PyTorch Dataset class to load data from a particular source (e.g., ERA5, GOES, MRMS, CAM, MOM6, etc.) from disk and organizes it into a dictionary of tensors for each type of field (prognostic, diagnostic, dynamic forcing, static) and potentially for each variable.
- Transforms of individual variables: apply transforms such as log transforms to fields like Q or surface pressure.
- Calculate derived variables: e.g., calculating TOA radiation or solar zenith angle on the fly rather than loading from disk. Could be called from the dataset potentially.
- Regridding: convert lat-lon input data to HEALPix, regrid regional data to or from Lambert Conformal or Albers Equal Area, or, when different data sources are on different grids, regrid them to a common grid. The regridding weights would be calculated offline.
- Vertical interpolation: Aggregate 137 model levels to a smaller subset either through spline fitting or averaging.
- Normalizing: Apply bridgescaler standard, min-max, or quantile scalers to the data. Scaler values are calculated with bridgescaler offline.
- Concatenating: Combine all the processed data into one datacube to go into the model.
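The steps above could be sketched as small composable `nn.Module` preblocks chained into one exportable workflow. All class names here are hypothetical, and the scaler stats stand in for values that bridgescaler would compute offline:

```python
import torch
import torch.nn as nn

class LogTransform(nn.Module):
    """Hypothetical preblock: log1p-transform a positive field like Q."""
    def forward(self, x):
        return torch.log1p(x)

class StandardScaler(nn.Module):
    """Hypothetical preblock: normalize with offline-computed stats."""
    def __init__(self, mean, std):
        super().__init__()
        # register stats as buffers so they travel with the module
        # (state_dict, .to(device), export)
        self.register_buffer("mean", torch.as_tensor(mean))
        self.register_buffer("std", torch.as_tensor(std))

    def forward(self, x):
        return (x - self.mean) / self.std

# Preblocks compose into a single exportable preprocessing workflow
preblock = nn.Sequential(LogTransform(), StandardScaler(mean=0.5, std=0.1))
x = torch.rand(4, 1, 32, 64)  # (batch, variable, lat, lon)
y = preblock(x)
print(y.shape)  # same shape as the input
```

Because each step is a plain `nn.Module`, the whole chain can be saved, moved between devices, and traced/exported like any other model component.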
@charlie-becker, @kevinyang-cky, and I had a meeting today to discuss implementation details for the preblocks.
- Adding metadata to tensors: Since many preprocessing operations are conditional on what source and type of variable is being used, we need to include some metadata along with the tensors describing the source, variable, level, and potentially type of variable. We want to make the syntax as clear as possible but also avoid naming collisions across datasets. The metadata schema/syntax is going to be in flux while we figure out the best option here.
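One way to attach that metadata (the actual schema is still in flux, and all names here are hypothetical) is a lightweight record carried alongside each tensor, with namespaced dictionary keys to avoid collisions across datasets:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class FieldMeta:
    """Hypothetical metadata record; the real schema is still being decided."""
    source: str           # e.g. "era5"
    variable: str         # e.g. "Q"
    level: Optional[int]  # model/pressure level, None for surface fields
    kind: str             # "prognostic" | "diagnostic" | "dynamic_forcing" | "static"

# "source.variable" keys avoid naming collisions across datasets
batch = {
    "era5.Q":  (torch.rand(4, 137, 32, 64), FieldMeta("era5", "Q", 137, "prognostic")),
    "era5.SP": (torch.rand(4, 32, 64),      FieldMeta("era5", "SP", None, "prognostic")),
}

# A preblock can then condition on metadata, e.g. only transform moisture fields
for key, (tensor, meta) in batch.items():
    if meta.variable == "Q":
        batch[key] = (torch.log1p(tensor), meta)
```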
- When to concatenate into a combined tensor: Charlie expressed a preference for doing this early on and using metadata to distinguish which parts of the tensor to grab for preblocks. One potential issue with this is that PyTorch autograd doesn't work with in-place operations on tensors and can throw errors, or silently produce incorrect gradients, if you try to backprop through one of these operations. I think it would be easier to concatenate near the end rather than at the beginning, but we will need to test this in more detail.
- One bridgescaler scaler, or support for multiple scalers?
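If we go with multiple scalers, one option is holding a per-field scaler in an `nn.ModuleDict`, which still exports as a single module. This is a sketch under that assumption; the class name and the stats are placeholders for offline bridgescaler output:

```python
import torch
import torch.nn as nn

class StandardScaler(nn.Module):
    """Hypothetical layer applying offline-computed bridgescaler stats."""
    def __init__(self, mean, std):
        super().__init__()
        self.register_buffer("mean", torch.as_tensor(float(mean)))
        self.register_buffer("std", torch.as_tensor(float(std)))

    def forward(self, x):
        return (x - self.mean) / self.std

# One scaler per field, all registered under a single exportable module
# (ModuleDict keys cannot contain ".", hence the underscore naming)
scalers = nn.ModuleDict({
    "era5_Q": StandardScaler(mean=0.003, std=0.002),
    "era5_T": StandardScaler(mean=280.0, std=15.0),
})

fields = {
    "era5_Q": torch.rand(4, 32, 64),
    "era5_T": 280.0 + torch.randn(4, 32, 64),
}
scaled = {name: scalers[name](x) for name, x in fields.items()}
```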