Caching outputs from compass steps

Date: 2021/07/30

Contributors: Xylar Asay-Davis

Summary

We would like to have a way to download output files for compass steps from an online cache instead of generating them each time the step runs. The primary motivation for this is to optionally avoid time-consuming steps for generating meshes and initial conditions for faster regression testing with MPAS components in “forward” mode. Potential other uses could include cached results as baselines for validation. A challenge for this capability is providing an easy way for both developers and users to control which steps in a test case or suite are cached and which are run as normal.

Requirements

Requirement: cached outputs

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

Each compass step defines its output files in the compass.Step.outputs attribute. For selected steps (see Requirement: selecting whether to use cached outputs), we require a mechanism to download cached files for each of these outputs and to use these cached files for the outputs of the step instead of computing them.

Requirement: selecting whether to use cached outputs

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There needs to be a mechanism for developers and users to select which steps are run as normal and which use cached outputs. For this mechanism to be practical, it should not be overly tedious or manual (e.g. manually setting a flag for each step).

Requirement: updating cached outputs

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There should be a documented process for creating cached outputs for steps and uploading them.

Requirement: unique identifier for cached outputs

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There should be a mechanism for giving each cached output file a unique identifier (such as a date stamp). A given version (git hash or release) of compass should know which cached files to download. Older cached files should be retained so that older versions of compass can still be used with these cached files.

Note

It may be worthwhile to include a process for deprecating and then deleting old cache files.

Requirement: either “normal” or “cached” versions of a step

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

We do not require the ability to set up a “normal” and a “cached” version of the same step within a compass test case or suite. (If this is not the case, it would place important constraints on the design solution.)

Design

Design: cached outputs

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

compass supports “databases” of input data files on the E3SM LCRC server. Files will be stored in a new compass_cache database within each MPAS core’s space on that server. If the “cached” version of a step is selected (see Design: selecting whether to use cached outputs), an appropriate “input” file will be added to the test case where the “target” is the file on the LCRC server to be cached locally for future use and the “filename” is the output file. compass will know which files on the server correspond to which output files via a python dictionary, as described in Design: unique identifier for cached outputs.

Design: selecting whether to use cached outputs

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

A compass suite can indicate cached steps in two ways. If all steps in a test case should have cached output, the following notation is used:

ocean/global_ocean/QU240/mesh
    cached
ocean/global_ocean/QU240/PHC/init
    cached

If only some steps in a test case should have cached output, they need to be listed explicitly, as follows:

ocean/global_ocean/QU240/mesh
    cached: mesh
ocean/global_ocean/QU240/PHC/init
    cached: initial_state

Similarly, a user setting up test cases has two mechanisms for specifying which test cases and steps should have cached outputs. If all steps in a test case should have cached outputs, the suffix c can be added to the test number:

compass setup -n 90c 91c 92 ...

This approach is efficient but does not provide any control of which steps use cached outputs and which do not.
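The c suffix convention can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in compass: the function name parse_test_numbers and its return format are hypothetical.

```python
def parse_test_numbers(entries):
    """Split entries such as '90c' into (test number, use cached outputs).

    Hypothetical helper: a trailing 'c' means every step in that test case
    should use cached outputs.
    """
    parsed = []
    for entry in entries:
        use_cache = entry.endswith('c')
        # Drop the suffix (if any) before converting to an integer
        number = int(entry[:-1] if use_cache else entry)
        parsed.append((number, use_cache))
    return parsed


if __name__ == '__main__':
    # The arguments from the example above: compass setup -n 90c 91c 92
    print(parse_test_numbers(['90c', '91c', '92']))
```

Under these assumptions, test cases 90 and 91 would be flagged for cached outputs and test case 92 would be set up as normal.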

A much more verbose approach is required if some steps use cached outputs and others do not within a given test case. Each test case must be set up on its own with the -t and --cached flags as follows:

compass setup -t ocean/global_ocean/QU240/mesh --cached mesh ...
compass setup -t ocean/global_ocean/QU240/PHC/init --cached initial_state ...
...

These approaches assume that we always have either the “normal” or the “cached” version of a step within a test case or test suite (see Design: either “normal” or “cached” versions of a step) and developers or users are free to choose between them, as long as cache files have been stored on the LCRC server and added to the cached_files.json database.

Design: updating cached outputs

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

A new compass cache command-line tool will be added. This will only be available on Chrysalis and Anvil, the machines where files can be placed on the LCRC server. This command can be run on a work directory to copy the outputs from selected steps into the appropriate directory on the LCRC server, and to create or update a python dictionary in a file cached_files.json (see Design: unique identifier for cached outputs) that maps between output files in the work directory and those on the LCRC server. For example:

compass cache -i \
    ocean/global_ocean/QU240/mesh/mesh \
    ocean/global_ocean/QU240/PHC/init/initial_state

Design: unique identifier for cached outputs

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

Each cached file on the LCRC server will include a date stamp in the file name. For example, culled_mesh.nc will become culled_mesh.20210730.nc on the server. When compass cache is called (see Design: updating cached outputs), the date stamp will default to the date that the call is being made but can be overridden with a flag (e.g. --date 20210730).

Each MPAS core in compass will optionally include a file cached_files.json that contains a python dictionary mapping between the names of output files in the work directory and those in the compass_cache database for that MPAS core on the LCRC server. For example:

{
    "ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc": "global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc",
    "ocean/global_ocean/QU240/mesh/mesh/culled_graph.info": "global_ocean/QU240/mesh/mesh/culled_graph.210803.info",
    "ocean/global_ocean/QU240/mesh/mesh/critical_passages_mask_final.nc": "global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210803.nc",
    "ocean/global_ocean/QU240/PHC/init/initial_state/initial_state.nc": "global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc",
    "ocean/global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.nc": "global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210803.nc"
}

Design: either “normal” or “cached” versions of a step

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

A prototype implementation of output caching had separate versions of test cases that included cached outputs or depended on earlier test cases with cached outputs. This approach turned out to be very cumbersome. It added many “new” test cases with unique subdirectories in the work directory and required predetermining which steps should allow caching. But this approach did allow a test suite to include a “normal” version of a step and a “cached” version of that same step in the same work directory (and therefore in the same test suite).

The proposed design, described in the previous sections, would allow far more flexibility about which steps are cached and which are not. It is not clear to me how we achieve this flexibility without requiring that a given step either be set up as “normal” or “cached”, and not both in the same work directory.

Implementation

The implementation is on this branch.

Implementation: cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

Each step has a boolean attribute cached that defaults to False but can be set to True by the process described in Implementation: selecting whether to use cached outputs. If cached == True, then when inputs and outputs are being processed, the usual inputs are ignored and the outputs are added as inputs instead. Targets in the compass_cache database are selected using the dictionary stored in the MPAS core’s cached_files.json. Namelists and streams files are also not generated.
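As a rough sketch of the substitution described above, the following Python function turns the outputs of a cached step into input entries that point at the compass_cache database. The function name cached_inputs, the dictionary layout of each entry, and the assumption that outputs are given relative to the step directory are all illustrative and are not taken from the compass code.

```python
import posixpath


def cached_inputs(step_path, outputs, cached_files):
    """Build input entries that stand in for the outputs of a cached step.

    step_path    -- step path in the work directory, e.g.
                    'ocean/global_ocean/QU240/mesh/mesh'
    outputs      -- output file names, assumed relative to the step directory
    cached_files -- dictionary loaded from the MPAS core's cached_files.json
    """
    inputs = []
    for filename in outputs:
        # Keys in cached_files.json are work-directory paths of the outputs
        key = posixpath.join(step_path, filename)
        if key not in cached_files:
            raise ValueError(f'No cached file found for output: {key}')
        inputs.append({'filename': filename,
                       'target': cached_files[key],
                       'database': 'compass_cache'})
    return inputs


if __name__ == '__main__':
    database = {
        'ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc':
            'global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc'}
    print(cached_inputs('ocean/global_ocean/QU240/mesh/mesh',
                        ['culled_mesh.nc'], database))
```

Raising an error for a missing entry is one possible design choice: it makes it obvious at setup time that a step was requested as cached before its files were added to the database.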

Implementation: selecting whether to use cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation includes the two mechanisms for selecting cached outputs described in Design: selecting whether to use cached outputs.

When setting up a test suite, a new list of lists called cached is created along with the list of test-case paths. By default, each test case has an empty list of steps with cached outputs. Any line in a test suite file that consists only of the word cached (once white space is stripped away) indicates that all steps in that test case should use cached outputs. This is accomplished by adding a special “step” named _all as the first step in the list for the given test case. If a line of the test suite file starts with cached: (after stripping away white space), the remainder of the line is a space-separated list of step names that should be set up with cached outputs. These steps are appended to the list of cached steps for the test case. If a test case has many steps with cached outputs, it may be convenient to have multiple lines starting with cached:, as in this example.

ocean/global_convergence/cosine_bell
  cached: QU60_mesh QU60_init QU90_mesh QU90_init QU120_mesh QU120_init
  cached: QU150_mesh QU150_init QU180_mesh QU180_init QU210_mesh QU210_init
  cached: QU240_mesh QU240_init
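The parsing rules described above can be sketched as follows. This is a simplified illustration and not the actual compass code: the function name parse_suite is hypothetical, and blank lines are simply skipped.

```python
def parse_suite(text):
    """Parse the contents of a test suite file.

    Returns the list of test-case paths and a parallel list of lists with
    the names of the steps that should use cached outputs.
    """
    tests = []
    cached = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == 'cached' or line.startswith('cached:'):
            if not tests:
                raise ValueError('"cached" line found before any test case')
            if line == 'cached':
                # The special "step" _all marks every step as cached
                cached[-1].insert(0, '_all')
            else:
                # The rest of the line is a space-separated list of steps
                cached[-1].extend(line[len('cached:'):].split())
        else:
            # Any other line is the path of a new test case
            tests.append(line)
            cached.append([])
    return tests, cached


if __name__ == '__main__':
    suite = """
ocean/global_ocean/QU240/mesh
    cached
ocean/global_ocean/QU240/PHC/init
    cached: initial_state
"""
    print(parse_suite(suite))
```

Because steps from each cached: line are appended to the same list, several such lines under one test case accumulate, as in the Cosine Bell example.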

If a user is setting up individual test cases, they can indicate that all the steps in a test case should have cached outputs by adding the suffix c after the test number. There is also a flag --cached that can be used to list the steps of a single test case that should use cached outputs, but this feature is likely to be too cumbersome to be broadly useful. Instead, developers should probably create a test suite for test cases where users are likely to want some steps with and others without cached outputs, as in the Cosine Bell example above.

Implementation: updating cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The new compass cache command has been added and is defined in the compass.cache module. It takes a list of step paths as input and optional flags --dry_run (which doesn’t copy the files to the directory on the LCRC server) and --date_string, which lets a user supply a date stamp (YYMMDD) other than today’s date.

As stated in the design, the command is only available on Chrysalis and Anvil and should be run on a work directory. To support caching files from multiple MPAS cores at the same time, compass cache produces an updated database file <mpas_core>_cached_files.json in the base of the work directory where the command is run. If this file already exists before compass cache is run, the information for the specified steps will be added if it is not yet in the database or will be updated, e.g. with new date stamps, if it does exist. If no <mpas_core>_cached_files.json exists, the file cached_files.json from the python module compass.<mpas_core> is used as the starting point instead. If this file also doesn’t exist, we start with an empty dictionary.

As an example, when I made the following call yesterday (2021/08/03):

for mesh in QU60 QU90 QU120 QU150 QU180 QU210 QU240
do
  for step in mesh init
  do
    compass cache -i ocean/global_convergence/cosine_bell/${mesh}/${step}
  done
done

the result was a cache file ocean_cached_files.json like this:

{
    "ocean/global_convergence/cosine_bell/QU60/mesh/mesh.nc": "global_convergence/cosine_bell/QU60/mesh/mesh.210803.nc",
    "ocean/global_convergence/cosine_bell/QU60/mesh/graph.info": "global_convergence/cosine_bell/QU60/mesh/graph.210803.info",
    "ocean/global_convergence/cosine_bell/QU60/init/namelist.ocean": "global_convergence/cosine_bell/QU60/init/namelist.210803.ocean",
    "ocean/global_convergence/cosine_bell/QU60/init/initial_state.nc": "global_convergence/cosine_bell/QU60/init/initial_state.210803.nc",
    "ocean/global_convergence/cosine_bell/QU90/mesh/mesh.nc": "global_convergence/cosine_bell/QU90/mesh/mesh.210803.nc",
    "ocean/global_convergence/cosine_bell/QU90/mesh/graph.info": "global_convergence/cosine_bell/QU90/mesh/graph.210803.info",
    "ocean/global_convergence/cosine_bell/QU90/init/namelist.ocean": "global_convergence/cosine_bell/QU90/init/namelist.210803.ocean",
    "ocean/global_convergence/cosine_bell/QU90/init/initial_state.nc": "global_convergence/cosine_bell/QU90/init/initial_state.210803.nc",
    ...
}

This file should be copied back to compass/ocean/cached_files.json in a branch of the compass repo, committed to the branch, and updated on master with a pull request as normal.
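The fallback order for the starting database (the file in the work directory, then the one from the python module, then an empty dictionary) can be sketched as below. The function names and arguments are hypothetical, and the copying of files to the LCRC server is left out entirely.

```python
import json
import os


def load_cache_database(work_db, package_db):
    """Return the first database file that exists, or an empty dictionary.

    work_db    -- e.g. <mpas_core>_cached_files.json in the work directory
    package_db -- e.g. cached_files.json from the compass.<mpas_core> module
    """
    for path in (work_db, package_db):
        if path is not None and os.path.exists(path):
            with open(path) as handle:
                return json.load(handle)
    return {}


def update_cache_database(work_db, package_db, new_entries):
    """Add or update entries and write the result to the work directory."""
    database = load_cache_database(work_db, package_db)
    # New steps are added; existing steps get their new date stamps
    database.update(new_entries)
    with open(work_db, 'w') as handle:
        json.dump(database, handle, indent=4)
    return database
```

With this structure, repeated calls keep accumulating entries in the work-directory file, which matches the behavior described above for multiple runs of compass cache.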

Implementation: unique identifier for cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

A date string is appended to the end of files in the compass_cache database on LCRC and stored in cached_files.json. The date string defaults to the date the compass cache command is run but can be specified manually with the --date_string flag if desired.
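A minimal sketch of how the date string could be inserted before the file extension is shown below. The function name add_date_stamp is hypothetical, and the default is assumed to be today's date in YYMMDD form, consistent with the examples in this document.

```python
import os
from datetime import date


def add_date_stamp(filename, date_string=None):
    """Insert a date stamp between the base name and the extension."""
    if date_string is None:
        # Default to today's date as YYMMDD
        date_string = date.today().strftime('%y%m%d')
    base, ext = os.path.splitext(filename)
    return f'{base}.{date_string}{ext}'


if __name__ == '__main__':
    for name in ['culled_mesh.nc', 'graph.info', 'namelist.ocean']:
        print(add_date_stamp(name, '210803'))
```

With the date string 210803, the three file names become culled_mesh.210803.nc, graph.210803.info and namelist.210803.ocean, matching the pattern in the example database above.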

Implementation: either “normal” or “cached” versions of a step

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation leans heavily on the assumption that a given step will either be run with cached outputs or as normal, so that both versions are not available in the same work directory or as part of the same test suite.

Nevertheless, if a separate “cached” version of a step were desired, it would be necessary to make symlinks from the cached files in the location of the “uncached” version of the step to the location of the “cached” version. For example, if the “uncached” step is

ocean/global_ocean/QU240/mesh/mesh

and the “cached” version of the step is

ocean/global_ocean/QU240/cached/mesh/mesh

symlinks could be created on the LCRC server, e.g.

/lcrc/group/e3sm/public_html/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/cached/mesh/mesh/culled_mesh.210803.nc
  -> /lcrc/group/e3sm/public_html/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc

and the cached attribute could be set to True in the constructor of the cached version of the step.

Testing

Testing: cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

I have constructed cached versions of the following steps on the LCRC server, using test-case runs on Chrysalis.

ocean/global_ocean/QU240/mesh/mesh/
ocean/global_ocean/QU240/PHC/init/initial_state/
ocean/global_ocean/QUwISC240/mesh/mesh/
ocean/global_ocean/QUwISC240/PHC/init/initial_state/
ocean/global_ocean/QUwISC240/PHC/init/ssh_adjustment/
ocean/global_ocean/EC30to60/mesh/mesh/
ocean/global_ocean/EC30to60/PHC/init/initial_state/
ocean/global_ocean/WC14/mesh/mesh/
ocean/global_ocean/WC14/PHC/init/initial_state/
ocean/global_ocean/ECwISC30to60/mesh/mesh/
ocean/global_ocean/ECwISC30to60/PHC/init/initial_state/
ocean/global_ocean/ECwISC30to60/PHC/init/ssh_adjustment/
ocean/global_ocean/SOwISC12to60/mesh/mesh/
ocean/global_ocean/SOwISC12to60/PHC/init/initial_state/
ocean/global_ocean/SOwISC12to60/PHC/init/ssh_adjustment/
ocean/global_convergence/cosine_bell/QU60/mesh/
ocean/global_convergence/cosine_bell/QU60/init/
ocean/global_convergence/cosine_bell/QU90/mesh/
ocean/global_convergence/cosine_bell/QU90/init/
ocean/global_convergence/cosine_bell/QU120/mesh/
ocean/global_convergence/cosine_bell/QU120/init/
ocean/global_convergence/cosine_bell/QU180/mesh/
ocean/global_convergence/cosine_bell/QU180/init/
ocean/global_convergence/cosine_bell/QU210/mesh/
ocean/global_convergence/cosine_bell/QU210/init/
ocean/global_convergence/cosine_bell/QU240/mesh/
ocean/global_convergence/cosine_bell/QU240/init/
ocean/global_convergence/cosine_bell/QU150/mesh/
ocean/global_convergence/cosine_bell/QU150/init/

I have set up and run versions of all these steps with cached outputs, together with forward runs (performance_test in the global ocean test group, and forward steps in the cosine_bell test case) that make use of the cached outputs as inputs. All tests ran successfully and were bit-for-bit with a baseline that was used to produce the cached outputs.

Testing: selecting whether to use cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

I added QUwISC240 test cases to the ocean nightly test suite, using cached outputs for the mesh and init test cases:

ocean/global_ocean/QUwISC240/mesh
  cached
ocean/global_ocean/QUwISC240/PHC/init
  cached
ocean/global_ocean/QUwISC240/PHC/performance_test

I created a new test suite, cosine_bell_cached_init, for the cosine_bell test case that uses cached outputs for the mesh and init steps at each default mesh resolution:

ocean/global_convergence/cosine_bell
  cached: QU60_mesh QU60_init QU90_mesh QU90_init QU120_mesh QU120_init
  cached: QU150_mesh QU150_init QU180_mesh QU180_init QU210_mesh QU210_init
  cached: QU240_mesh QU240_init

I set up the remaining steps with cached outputs mentioned in Testing: cached outputs as follows:

compass list

compass setup -n 40c 41c 42 60c 61c 62 80c 81c 82 85c 86c 87 90c 91c 92 \
    95c 96c 97 ...

Results were bit-for-bit with the same test cases run without cached outputs.

Testing: updating cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

All cached files used in the testing above were created with compass cache on Chrysalis. Multiple runs of this command created and then updated the local ocean_cached_files.json, as expected. The files ended up in the expected directories on the LCRC server with the expected date strings appended to the file basename (before the extension).

The --dry_run feature also worked as expected, updating the ocean_cached_files.json without copying files. The --date_string flag could be used to specify an alternative suffix, as expected.

Testing: unique identifier for cached outputs

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

All files in the compass_cache database have date strings appended to them to make them unique. No testing has been performed yet to ensure that new cached files with new dates can be added, but I don’t foresee any problems.

Testing: either “normal” or “cached” versions of a step

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation that I tested is based on this requirement. However, the requirement could be relaxed in the future, if need be, using the approach I outlined in Implementation: either “normal” or “cached” versions of a step.