.. _design_doc_cached_outputs:

Caching outputs from compass steps
==================================

Date: 2021/07/30

Contributors: Xylar Asay-Davis

Summary
-------

We would like to have a way to download output files for ``compass`` steps
from an online cache instead of generating them each time the step runs. The
primary motivation for this is to optionally avoid time-consuming steps for
generating meshes and initial conditions for faster regression testing with
MPAS components in "forward" mode. Potential other uses could include cached
results as baselines for validation. A challenge for this capability is
providing an easy way for both developers and users to control which steps in
a test case or suite are cached and which are run as normal.

Requirements
------------

.. _req_cached:

Requirement: cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

Each ``compass`` step defines its output files in the
``compass.Step.outputs`` attribute. For selected steps (see
:ref:`req_select`), we require a mechanism to download cached files for each
of these outputs and to use these cached files for the outputs of the step
instead of computing them.

.. _req_select:

Requirement: selecting whether to use cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There needs to be a mechanism for developers and users to select which steps
are run as normal and which use cached outputs. For this mechanism to be
practical, it should not be overly tedious or manual (e.g. manually setting a
flag for each step).

.. _req_update:

Requirement: updating cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There should be a documented process for creating cached outputs for steps and
uploading them.

..
.. _req_unique:

Requirement: unique identifier for cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

There should be a mechanism for giving each cached output file a unique
identifier (such as a date stamp). A given version (git hash or release) of
``compass`` should know which cached files to download. Older cached files
should be retained so that older versions of ``compass`` can still be used
with these cached files.

.. note::

    It may be worthwhile to include a process for deprecating and then
    deleting old cache files.

.. _req_normal_or_cached:

Requirement: either "normal" or "cached" versions of a step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

We **do not** require the ability to set up a "normal" and a "cached" version
of the same step within a ``compass`` test case or suite. (If this is not the
case, it would place important constraints on the design solution.)

Design
------

.. _des_cached:

Design: cached outputs
^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

``compass`` supports "databases" of input data files on the E3SM
`LCRC server `_. Files will be stored in a new ``compass_cache`` database
within each MPAS core's space on that server. If the "cached" version of a
step is selected (see :ref:`des_select`), an appropriate "input" file will be
added to the test case where the "target" is the file on the LCRC server to be
cached locally for future use and the "filename" is the output file.
``compass`` will know which files on the server correspond to which output
files via a python dictionary, as described in :ref:`des_unique`.

.. _des_select:

Design: selecting whether to use cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

A ``compass`` suite can indicate cached steps in two ways.
If all steps in a test case should have cached output, the following notation
is used:

.. code-block:: none

    ocean/global_ocean/QU240/mesh
        cached
    ocean/global_ocean/QU240/PHC/init
        cached

If only some steps in a test case should have cached output, they need to be
listed explicitly, as follows:

.. code-block:: none

    ocean/global_ocean/QU240/mesh
        cached: mesh
    ocean/global_ocean/QU240/PHC/init
        cached: initial_state

Similarly, a user setting up test cases has two mechanisms for specifying
which test cases and steps should have cached outputs. If all steps in a test
case should have cached outputs, the suffix ``c`` can be added to the test
number:

.. code-block:: none

    compass setup -n 90c 91c 92 ...

This approach is efficient but does not provide any control of which steps use
cached outputs and which do not.

A much more verbose approach is required if some steps use cached outputs and
others do not within a given test case. Each test case must be set up on its
own with the ``-t`` and ``--cached`` flags as follows:

.. code-block:: none

    compass setup -t ocean/global_ocean/QU240/mesh --cached mesh ...
    compass setup -t ocean/global_ocean/QU240/PHC/init --cached initial_state ...
    ...

These approaches assume that we always have either the "normal" or the
"cached" version of a step within a test case or test suite (see
:ref:`des_normal_or_cached`) and developers or users are free to choose
between them, as long as cache files have been stored on the LCRC server and
added to the ``cached_files.json`` database.

.. _des_update:

Design: updating cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

A new ``compass cache`` command-line tool will be added. This will only be
available on Chrysalis and Anvil, the machines where files can be placed on
the LCRC server.
This command can be run on a work directory to copy the outputs from selected
steps into the appropriate directory on the LCRC server, and to create or
update a python dictionary in a file ``cached_files.json`` (see
:ref:`des_unique`) that maps between output files in the work directory and
those on the LCRC server. For example:

.. code-block:: bash

    compass cache -i \
        ocean/global_ocean/QU240/mesh/mesh \
        ocean/global_ocean/QU240/PHC/init/initial_state

.. _des_unique:

Design: unique identifier for cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/03

Contributors: Xylar Asay-Davis

Each cached file on the LCRC server will include a date stamp in the file
name. For example, ``culled_mesh.nc`` will become ``culled_mesh.20210730.nc``
on the server. When ``compass cache`` is called (see :ref:`des_update`), the
date stamp will default to the date that the call is being made but can be
overridden with a flag (e.g. ``--date 20210730``).

Each MPAS core in ``compass`` will optionally include a file
``cached_files.json`` that contains a python dictionary mapping between the
names of output files in the work directory and those in the
``compass_cache`` database for that MPAS core on the LCRC server. For example:

.. code-block:: none

    {
        "ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc": "global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc",
        "ocean/global_ocean/QU240/mesh/mesh/culled_graph.info": "global_ocean/QU240/mesh/mesh/culled_graph.210803.info",
        "ocean/global_ocean/QU240/mesh/mesh/critical_passages_mask_final.nc": "global_ocean/QU240/mesh/mesh/critical_passages_mask_final.210803.nc",
        "ocean/global_ocean/QU240/PHC/init/initial_state/initial_state.nc": "global_ocean/QU240/PHC/init/initial_state/initial_state.210803.nc",
        "ocean/global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.nc": "global_ocean/QU240/PHC/init/initial_state/init_mode_forcing_data.210803.nc"
    }

..
.. _des_normal_or_cached:

Design: either "normal" or "cached" versions of a step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/07/30

Contributors: Xylar Asay-Davis

A prototype implementation of output caching had separate versions of test
cases that included cached outputs or depended on earlier test cases with
cached outputs. This approach turned out to be very cumbersome. It added many
"new" test cases with unique subdirectories in the work directory and required
predetermining which steps should allow caching. But this approach *did*
allow a test suite to include a "normal" version of a step and a "cached"
version of that same step in the same work directory (and therefore in the
same test suite).

The proposed design, described in the previous sections, would allow far more
flexibility about which steps are cached and which are not. It is not clear
to me how we achieve this flexibility without requiring that a given step
either be set up as "normal" or "cached", and not both in the same work
directory.

Implementation
--------------

The implementation is on `this branch `_.

.. _imp_cached:

Implementation: cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

Each step has a boolean attribute ``cached`` that defaults to ``False`` but
which can be set to ``True`` by a process described in :ref:`imp_select`. If
``cached == True``, when inputs and outputs are being processed, the usual
inputs are ignored and instead the outputs are added as inputs. Targets in the
``compass_cache`` database are selected using the dictionary stored in the
MPAS core's ``cached_files.json``. Namelists and streams files are also not
generated.

..
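The substitution of outputs for inputs described in this section can be
sketched as follows. This is only an illustration under stated assumptions,
not the ``compass`` implementation: the function ``cached_inputs()``, its
arguments and the keys of the returned dictionaries are hypothetical, and the
mapping is a small excerpt in the format of ``cached_files.json``.

.. code-block:: python

    def cached_inputs(step_path, outputs, cached_files):
        """
        Build the inputs for a step that uses cached outputs (hypothetical).

        ``step_path`` is the step's path relative to the work directory,
        ``outputs`` is the list of output file names and ``cached_files`` is
        the dictionary read from ``cached_files.json``.
        """
        inputs = []
        for filename in outputs:
            key = f'{step_path}/{filename}'
            if key not in cached_files:
                raise ValueError(f'No cached file found for output {key}')
            # each output becomes an input whose target is the cached file
            inputs.append({'filename': filename,
                           'target': cached_files[key],
                           'database': 'compass_cache'})
        return inputs


    cached_files = {
        'ocean/global_ocean/QU240/mesh/mesh/culled_mesh.nc':
            'global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc',
        'ocean/global_ocean/QU240/mesh/mesh/culled_graph.info':
            'global_ocean/QU240/mesh/mesh/culled_graph.210803.info'}

    inputs = cached_inputs('ocean/global_ocean/QU240/mesh/mesh',
                           ['culled_mesh.nc', 'culled_graph.info'],
                           cached_files)

Raising an error for an output with no entry in the mapping is a design choice
of this sketch. It makes a missing cache file visible at setup time rather
than when the step runs.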
.. _imp_select:

Implementation: selecting whether to use cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation includes the two mechanisms for selecting cached outputs
described in :ref:`des_select`.

When setting up a test suite, a new list of lists called ``cached`` is created
along with the list of test-case paths. By default, all test cases have an
empty list of steps with cached outputs. Any line in a test suite file that is
``cached`` (once white space is stripped away) will indicate that all steps in
that test case should use cached outputs. This is accomplished by adding a
special "step" named ``_all`` as the first step in the list for the given test
case.

If a line of the test suite file starts with ``cached:`` (after stripping away
white space), the remainder of the line is a space-separated list of step
names that should be set up with cached outputs. These steps are appended to
the list of cached steps for the test case. If a test case has many steps with
cached outputs, it may be convenient to have multiple lines starting with
``cached:``, as in this example.

.. code-block:: none

    ocean/global_convergence/cosine_bell
        cached: QU60_mesh QU60_init QU90_mesh QU90_init QU120_mesh QU120_init
        cached: QU150_mesh QU150_init QU180_mesh QU180_init QU210_mesh QU210_init
        cached: QU240_mesh QU240_init

If a user is setting up individual test cases, they can indicate that all the
steps in a test case should have cached outputs with the suffix ``c`` after
the test number.

While there is also a flag ``--cached`` that can be used to list steps of a
single test case to use from cached outputs, this feature is likely to be too
cumbersome to be broadly useful. Instead, developers should probably create a
test suite for test cases where users are likely to want some steps with and
others without cached outputs, as in the Cosine Bell example above.

..
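The parsing rules for test suite files described in this section can be
sketched roughly as follows. The function ``parse_suite()`` is hypothetical
and is not the code in ``compass``. It assumes that every line that is not
blank and is not a ``cached`` line is a test-case path.

.. code-block:: python

    def parse_suite(lines):
        """
        Return the test-case paths and, for each one, the list of steps with
        cached outputs (a hypothetical helper).
        """
        tests = []
        cached = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line == 'cached' or line.startswith('cached:'):
                if not tests:
                    raise ValueError('"cached" must follow a test-case path')
                if line == 'cached':
                    # the special "_all" step goes first in the list
                    cached[-1].insert(0, '_all')
                else:
                    # the rest of the line is a space-separated list of steps
                    cached[-1].extend(line[len('cached:'):].split())
            else:
                # a new test case starts with no cached steps
                tests.append(line)
                cached.append([])
        return tests, cached


    suite = """
    ocean/global_ocean/QUwISC240/mesh
        cached
    ocean/global_convergence/cosine_bell
        cached: QU60_mesh QU60_init
        cached: QU240_mesh QU240_init
    ocean/global_ocean/QUwISC240/PHC/performance_test
    """
    tests, cached = parse_suite(suite.splitlines())

With this input, ``cached`` holds ``['_all']`` for the first test case, the
four listed steps for ``cosine_bell`` and an empty list for the last test
case.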
.. _imp_update:

Implementation: updating cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The new ``compass cache`` command has been added and is defined in the
``compass.cache`` module. It takes a list of step paths as input and optional
flags ``--dry_run`` (which doesn't copy the files to the directory on the LCRC
server) and ``--date_string``, which lets a user supply a date stamp (YYMMDD)
other than today's date.

As stated in the design, the command is only available on Chrysalis and Anvil
and should be run on a work directory.

To support caching files from multiple MPAS cores at the same time,
``compass cache`` produces an updated database file ``_cached_files.json`` in
the base of the work directory where the command is run. If this file already
exists before ``compass cache`` is run, the information for the specified
steps will be added if it is not yet in the database or will be updated, e.g.
with new date stamps, if it does exist. If no ``_cached_files.json`` exists,
the file ``cached_files.json`` from the python module ``compass.`` is used as
the starting point instead. If this file also doesn't exist, we start with an
empty dictionary.

As an example, yesterday (8/3/2021) when I made the following call:

.. code-block:: bash

    for mesh in QU60 QU90 QU120 QU150 QU180 QU210 QU240
    do
        for step in mesh init
        do
            compass cache -i ocean/global_convergence/cosine_bell/${mesh}/${step}
        done
    done

the result was a cache file ``ocean_cached_files.json`` like this:

..
.. code-block:: none

    {
        "ocean/global_convergence/cosine_bell/QU60/mesh/mesh.nc": "global_convergence/cosine_bell/QU60/mesh/mesh.210803.nc",
        "ocean/global_convergence/cosine_bell/QU60/mesh/graph.info": "global_convergence/cosine_bell/QU60/mesh/graph.210803.info",
        "ocean/global_convergence/cosine_bell/QU60/init/namelist.ocean": "global_convergence/cosine_bell/QU60/init/namelist.210803.ocean",
        "ocean/global_convergence/cosine_bell/QU60/init/initial_state.nc": "global_convergence/cosine_bell/QU60/init/initial_state.210803.nc",
        "ocean/global_convergence/cosine_bell/QU90/mesh/mesh.nc": "global_convergence/cosine_bell/QU90/mesh/mesh.210803.nc",
        "ocean/global_convergence/cosine_bell/QU90/mesh/graph.info": "global_convergence/cosine_bell/QU90/mesh/graph.210803.info",
        "ocean/global_convergence/cosine_bell/QU90/init/namelist.ocean": "global_convergence/cosine_bell/QU90/init/namelist.210803.ocean",
        "ocean/global_convergence/cosine_bell/QU90/init/initial_state.nc": "global_convergence/cosine_bell/QU90/init/initial_state.210803.nc",
        ...
    }

This file should be copied back to ``compass/ocean/cached_files.json`` in a
branch of the compass repo, committed to the branch, and updated on ``master``
with a pull request as normal.

.. _imp_unique:

Implementation: unique identifier for cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

A date string is appended to the end of files in the ``compass_cache``
database on LCRC and stored in ``cached_files.json``. The date string defaults
to the date the ``compass cache`` command is run but can be specified manually
with the ``--date_string`` flag if desired.

..
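As a rough illustration of this naming scheme, the sketch below inserts a
YYMMDD date string before the file extension and records the result in a
dictionary shaped like ``cached_files.json``. The helpers
``add_date_string()`` and ``update_cached_files()`` are hypothetical names,
not functions from ``compass.cache``. Dropping the leading MPAS core from the
path on the server is inferred from the example entries in this document.

.. code-block:: python

    import os
    from datetime import datetime


    def add_date_string(filename, date_string=None):
        """Insert a date string between the base name and the extension."""
        if date_string is None:
            # default to today's date in YYMMDD format
            date_string = datetime.now().strftime('%y%m%d')
        base, ext = os.path.splitext(filename)
        return f'{base}.{date_string}{ext}'


    def update_cached_files(cached_files, step_path, outputs,
                            date_string=None):
        """Add or update the entries for the outputs of one step."""
        for filename in outputs:
            key = f'{step_path}/{filename}'
            # drop the leading MPAS core (e.g. "ocean/") for the server path
            _, relative_path = key.split('/', 1)
            cached_files[key] = add_date_string(relative_path, date_string)
        return cached_files


    cached_files = update_cached_files(
        {}, 'ocean/global_convergence/cosine_bell/QU60/mesh',
        ['mesh.nc', 'graph.info'], date_string='210803')

Because existing keys are simply overwritten, running the update again with a
newer date string replaces the old entry for a step, which mirrors the
add-or-update behavior described for ``compass cache``.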
.. _imp_normal_or_cached:

Implementation: either "normal" or "cached" versions of a step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation leans heavily on the assumption that a given step will
either be run with cached outputs or as normal, so that both versions are not
available in the same work directory or as part of the same test suite.

Nevertheless, if a separate "cached" version of a step were desired, it would
be necessary to make symlinks from the cached files in the location of the
"uncached" version of the step to the location of the "cached" version. For
example, if the "uncached" step is

.. code-block:: none

    ocean/global_ocean/QU240/mesh/mesh

and the "cached" version of the step is

.. code-block:: none

    ocean/global_ocean/QU240/cached/mesh/mesh

symlinks could be created on the LCRC server, e.g.

.. code-block:: none

    /lcrc/group/e3sm/public_html/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/cached/mesh/mesh/culled_mesh.210803.nc
      -> /lcrc/group/e3sm/public_html/mpas_standalonedata/mpas-ocean/compass_cache/global_ocean/QU240/mesh/mesh/culled_mesh.210803.nc

and the ``cached`` attribute could be set to ``True`` in the constructor of
the cached version of the step.

Testing
-------

.. _test_cached:

Testing: cached outputs
^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

I have constructed cached versions of the following steps on the LCRC server,
using test-case runs on Chrysalis.

..
.. code-block:: none

    ocean/global_ocean/QU240/mesh/mesh/
    ocean/global_ocean/QU240/PHC/init/initial_state/
    ocean/global_ocean/QUwISC240/mesh/mesh/
    ocean/global_ocean/QUwISC240/PHC/init/initial_state/
    ocean/global_ocean/QUwISC240/PHC/init/ssh_adjustment/
    ocean/global_ocean/EC30to60/mesh/mesh/
    ocean/global_ocean/EC30to60/PHC/init/initial_state/
    ocean/global_ocean/WC14/mesh/mesh/
    ocean/global_ocean/WC14/PHC/init/initial_state/
    ocean/global_ocean/ECwISC30to60/mesh/mesh/
    ocean/global_ocean/ECwISC30to60/PHC/init/initial_state/
    ocean/global_ocean/ECwISC30to60/PHC/init/ssh_adjustment/
    ocean/global_ocean/SOwISC12to60/mesh/mesh/
    ocean/global_ocean/SOwISC12to60/PHC/init/initial_state/
    ocean/global_ocean/SOwISC12to60/PHC/init/ssh_adjustment/
    ocean/global_convergence/cosine_bell/QU60/mesh/
    ocean/global_convergence/cosine_bell/QU60/init/
    ocean/global_convergence/cosine_bell/QU90/mesh/
    ocean/global_convergence/cosine_bell/QU90/init/
    ocean/global_convergence/cosine_bell/QU120/mesh/
    ocean/global_convergence/cosine_bell/QU120/init/
    ocean/global_convergence/cosine_bell/QU180/mesh/
    ocean/global_convergence/cosine_bell/QU180/init/
    ocean/global_convergence/cosine_bell/QU210/mesh/
    ocean/global_convergence/cosine_bell/QU210/init/
    ocean/global_convergence/cosine_bell/QU240/mesh/
    ocean/global_convergence/cosine_bell/QU240/init/
    ocean/global_convergence/cosine_bell/QU150/mesh/
    ocean/global_convergence/cosine_bell/QU150/init/

I have set up and run versions of all these steps with cached outputs,
together with forward runs (``performance_test`` in the global ocean test
group, and ``forward`` steps in the ``cosine_bell`` test case) that make use
of the cached outputs as inputs. All tests ran successfully and were
bit-for-bit with a baseline that was used to produce the cached outputs.

..
.. _test_select:

Testing: selecting whether to use cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

I added the QUwISC240 test case to the ocean ``nightly`` test suite using
cached outputs for the ``mesh`` and ``init`` test cases:

.. code-block:: none

    ocean/global_ocean/QUwISC240/mesh
        cached
    ocean/global_ocean/QUwISC240/PHC/init
        cached
    ocean/global_ocean/QUwISC240/PHC/performance_test

I created a new test suite, ``cosine_bell_cached_init``, for the
``cosine_bell`` test case that uses cached outputs for the ``mesh`` and
``init`` steps at each default mesh resolution:

.. code-block:: none

    ocean/global_convergence/cosine_bell
        cached: QU60_mesh QU60_init QU90_mesh QU90_init QU120_mesh QU120_init
        cached: QU150_mesh QU150_init QU180_mesh QU180_init QU210_mesh QU210_init
        cached: QU240_mesh QU240_init

I set up the remaining steps with cached outputs mentioned in
:ref:`test_cached` as follows:

.. code-block:: bash

    compass list
    compass setup -n 40c 41c 42 60c 61c 62 80c 81c 82 85c 86c 87 90c 91c 92 \
        95c 96c 97 ...

Results were bit-for-bit with the same test cases run without cached outputs.

.. _test_update:

Testing: updating cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

All cached files used in the testing above were created with
``compass cache`` on Chrysalis. Multiple runs of this command created, then
updated the local ``ocean_cached_files.json``, as expected. The files ended up
in the expected directories on the LCRC server with the expected date strings
appended to the file basename (before the extension).

The ``--dry_run`` feature also worked as expected, updating the
``ocean_cached_files.json`` without copying files. The ``--date_string`` flag
could be used to specify an alternative suffix, as expected.

..
.. _test_unique:

Testing: unique identifier for cached outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

All files in the ``compass_cache`` database have date strings appended to them
to make them unique. No testing has been performed yet to ensure that new
cached files with new dates can be added but I don't foresee any problems.

.. _test_normal_or_cached:

Testing: either "normal" or "cached" versions of a step
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Date last modified: 2021/08/04

Contributors: Xylar Asay-Davis

The implementation that I tested is based on this requirement. However, in the
future, the requirement could be relaxed if need be using the approach I
outlined in :ref:`imp_normal_or_cached`.