Generalize Calendar supported by Analysis
Xylar Asay-Davis
date: 2017/02/09
Summary
Currently, the time variable in `xarray` data sets within MPAS-Analysis has two major shortcomings, inherited from `xarray` (through `pandas` and `numpy.datetime64`). First, only the Gregorian calendar is supported. Second, there is not support for dates outside the years 1678 to 2262. The analysis needs to support both the Gregorian ('gregorian') and the 365-day ('gregorian_noleap') calendars. It also needs to support, at a minimum, years between 0001 and 9999, and preferably arbitrary years both positive and negative. A major challenge is that it seems that xarray cannot easily be forced to use an alternative representation of dates to the troublesome `numpy.datetime64` type (see, for example, [pydata/xarray#1084](https://github.com/pydata/xarray/issues/1084)). The most obvious alternative, `datetime.datetime`, seemingly cannot be used directly in `xarray` because objects of this type are converted to `numpy.datetime64` objects at various stages when using features from pandas, raising errors when dates are out of range. While an alternative date class (e.g. `netcdftime.DatetimNoLeap`) might be used to represent dates on the 'gregorian_noleap' calendar, there is no such preexisting alternative for the 'gregorian' calendar. The solution proposed herein is to store time as floating-point days since the reference date 0001-01-01 and to convert dates in this format to `datetime.datetime` and `MpasRelativeDelta` objects whenever mathematical manipulation of dates is required. A successful implementation would produce essentially identical analysis to what is currently produced, but making use of the dates from the MPAS calendar (whether Gregorian or 365-day) without the need for artifical offsets (e.g. `yearOffset` used in the current code. Plots of horizontal fields would remain unchanged while plots of time series would have a time axis with the simulation date instead of the offset date.Requirements
Requirement: The 'Time' coordinate of xarray data sets must be consistent
with the MPAS calendar
Date last modified: 2017/02/09
Contributors: Xylar Asay-Davis
For all data sets used in the analysis, the ‘Time’ coordinate must represent dates on the appropriate MPAS calendar, either ‘gregorian’ or ‘gregorian_noleap’, depending on the namelist option ‘config_calendar_type’. There must be ways of mathematically manipulating times (e.g. adding/subtracting offsets and figuring out the amount of time between two dates) and of making plots that are consistent with these calendars.
Requirement: The 'Time' coordinate of xarray data sets must support at least years
0001 and 9999, and preferably any conceivable value
Date last modified: 2017/02/16
Contributors: Xylar Asay-Davis
For all data sets used in the analysis, the ‘Time’ coordinate must, at a minimum,
support years between 0001 and 9999 (the range of datetime.datetime
) and preferably
a broader range.
Algorithmic Formulations (optional)
Design solution: The 'Time' coordinate of xarray data sets must be consistent
with the MPAS calendar
Date last modified: 2017/02/11
Contributors: Xylar Asay-Davis, Phillip J. Wolfram
The proposed solution represents time in xarray.DataSet
objects as the number of
days since the reference date 0001-01-01.
This is reasonable because the smallest unit of time output in MPAS components is
seconds (and unlikely to ever be shorter than ms). We note that a date specified
as a 64-bit float has a precision high enough to represent seconds for dates up
to +/- 100 million years:
>>> import sys
>>> 1./(sys.float_info.epsilon*365*24*60*60)
142808207.36207813
We should have no trouble representing any number we might want (including paleo timescales) with this system.
For purposes of performing mathematical operations and plotting dates, these
values will be converted to datetime.datetime
objects (via the proposed
days_to_datetime
utility function) and back (via the proposed
datetime_to_days
).
The conversion operations within datetime_to_days
and days_to_datetime
will be
performed with the calendar-aware functions netCDF4.date2num
and
netCDF4.num2date
, respectively. Both functions will support lists/arrays of dates
(for efficiency and simplicity of calling code) in addition to single values.
Curve ploting can be supported with matplotlib.pyplot.plot_date
, which takes a date
of exactly the format used here (days since 0001-01-01). The compatibility with plot_date
was part of the reason for choosing this format for the date.
Design solution: The 'Time' coordinate of xarray data sets must support at least years
0001 and 9999, and preferably any conceivable value
Date last modified: 2017/02/09
Contributors: Xylar Asay-Davis
Same as above. In theory, the use of days since 0001-01-01 would allow any year
to be supported, not just the range from 0001 to 9999. However, the conversions
to datetime.datetime
objects for mathematical manipulation will constrain
the dates to be between datetime.min
(0001-01-01) and datetime.max
(9999-12-31).
Design and Implementation
Implementation: The 'Time' coordinate of xarray data sets must be consistent
with the MPAS calendar
Date last modified: 2017/02/16
Contributors: Xylar Asay-Davis
The proposed implementation is on the branch xylar/generalize_calendar
A helper funciton, mpas_xarray._parse_dataset_time
, computes times as days since
0001-01-01, and serves as a replacement for mpas_xarray._get_datetimes
.
Note: the current implementation breaks the convention that ``mpas_xarray`` remains separate from the rest of MPAS-Analyis by using 3 functions from ``timekeeping.utility`` in ``mpas_xarray``:
from ..timekeeping.utility import string_to_days_since_date, \
days_to_datetime, datetime_to_days
This violates the first requirement in the `Design Document: Moving variable mapping out of mpas_xarray <https://github.com/xylar/MPAS-Analysis/blob/design_doc_variable_mapping_reorg/design_docs/variable_mapping_reorg.md>`_. I am open to alternative solutions for keeping ``mpas_xarray`` separate from the rest of analysis but these 3 functions do not conceptually belong in ``mpas_xarray``. The problem is exacerbated by the fact that there are analysis-specific functions in ``timekeeping``, meaning that this cannot easily be made a submodule of ``mpas_xarray`` (nor would this make very much logical sense). Having 2 ``timekeeping`` modules, one for ``mpas_xarray`` and one for MPAS-Analysis, seems unnecessarily confunsing.
The functions generalized_reader.open_multifile_dataset
and
mpas_xarray.open_multifile_dataset
have been updated to use this method for parsing
times. This involves removing the year_offset
argument and adding an optional
simulation_start_time
argument for supplying a date to use to convert variables
like daysSinceStartOfSim
to days since 0001-01-01.
An example of opening a data set and manipulating times withe the new approach in the OHC script is:
from ..shared.timekeeping.utility import get_simulation_start_time, \
date_to_days, days_to_datetime, string_to_datetime
...
def ohc_timeseries(config, streamMap=None, variableMap=None):
...
simulationStartTime = get_simulation_start_time(streams)
...
ds = open_multifile_dataset(file_names=file_names,
calendar=calendar,
simulation_start_time=simulation_start_time,
time_variable_name='Time',
variable_list=variable_list,
variable_map=variableMap,
start_date=startDate,
end_date=endDate)
timeStart = string_to_datetime(startDate)
timeEnd = string_to_datetime(endDate)
# Select year-1 data and average it (for later computing anomalies)
timeStartFirstYear = string_to_datetime(simulation_start_time)
if timeStartFirstYear < timeStart:
startDateFirstYear = simulation_start_time
firstYear = int(startDateFirstYear[0:4])
endDateFirstYear = '{:04d}-12-31_23:59:59'.format(firstYear)
filesFirstYear = streams.readpath(streamName,
startDate=startDateFirstYear,
endDate=endDateFirstYear,
calendar=calendar)
dsFirstYear = open_multifile_dataset(
file_names=filesFirstYear,
calendar=calendar,
simulation_start_time=simulation_start_time,
time_variable_name='Time',
variable_list=variable_list,
variable_map=variableMap,
start_date=startDateFirstYear,
end_date=endDateFirstYear)
else:
dsFirstYear = ds
firstYear = timeStart.year
timeStartFirstYear = date_to_days(year=firstYear, month=1, day=1,
calendar=calendar)
timeEndFirstYear = date_to_days(year=firstYear, month=12, day=31,
hour=23, minute=59, second=59,
calendar=calendar)
dsFirstYear = dsFirstYear.sel(Time=slice(timeStartFirstYear,
timeEndFirstYear))
meanFirstYear = dsFirstYear.mean('Time')
...
yearStart = days_to_datetime(ds.Time.min()).year
yearEnd = days_to_datetime(ds.Time.max()).year
timeStart = date_to_days(year=yearStart, month=1, day=1,
calendar=calendar)
timeEnd = date_to_days(year=yearEnd, month=12, day=31,
calendar=calendar)
if preprocessedReferenceRunName != 'None':
print ' Load in OHC from preprocessed reference run...'
inFilesPreprocessed = '{}/OHC.{}.year*.nc'.format(
preprocessedInputDirectory, preprocessedReferenceRunName)
dsPreprocessed = open_multifile_dataset(
file_names=inFilesPreprocessed,
calendar=calendar,
simulation_start_time=simulation_start_time,
time_variable_name='xtime')
yearEndPreprocessed = days_to_datetime(dsPreprocessed.Time.max()).year
...
The replicate_cycles
function in sea_ice.timeseries
has been a particular
challenge with the existing calendar. Here is that function with the new ‘Time’
coordinate:
def replicate_cycle(ds, dsToReplicate, calendar):
dsStartTime = days_to_datetime(ds.Time.min(), calendar=calendar)
dsEndTime = days_to_datetime(ds.Time.max(), calendar=calendar)
repStartTime = days_to_datetime(dsToReplicate.Time.min(),
calendar=calendar)
repEndTime = days_to_datetime(dsToReplicate.Time.max(),
calendar=calendar)
repSecondTime = days_to_datetime(dsToReplicate.Time.isel(Time=1),
calendar=calendar)
period = (MpasRelativeDelta(repEndTime, repStartTime) +
MpasRelativeDelta(repSecondTime, repStartTime))
startIndex = 0
while(dsStartTime > repStartTime + (startIndex+1)*period):
startIndex += 1
endIndex = 0
while(dsEndTime > repEndTime + (endIndex+1)*period):
endIndex += 1
dsShift = dsToReplicate.copy()
times = days_to_datetime(dsShift.Time, calendar=calendar)
dsShift.coords['Time'] = ('Time',
datetime_to_days(times + startIndex*period,
calendar=calendar))
# replicate cycle:
for cycleIndex in range(startIndex, endIndex):
dsNew = dsToReplicate.copy()
dsNew.coords['Time'] = ('Time',
datetime_to_days(times + (cycleIndex+1)*period,
calendar=calendar))
dsShift = xr.concat([dsShift, dsNew], dim='Time')
return dsShift
Implementation: The 'Time' coordinate of xarray data sets must support at least years
0001 and 9999, and preferably any conceivable value
Date last modified: 2017/02/09
Contributors: Xylar Asay-Davis
Same as above.
Testing
Testing and Validation: The 'Time' coordinate of xarray data sets must be consistent
with the MPAS calendar
Date last modified: 2017/02/11
Contributors: Xylar Asay-Davis
In [xylar/generalize_calendar](https://github.com/xylar/MPAS-Analysis/tree/generalize_calendar),
unit testing has been added for `timekeeping` and `mpas_xarray` that checks both the `gregorian`
and `gregorian_noleap` calendars under simple test conditions. However, we have no data sets
that test `gregorian`, so we have a somewhat limited ability to test this calendar option.
Fortunately, there are also no immediate plans to run with `gregorian`.
I will make sure all tests with config files in the `configs/lanl` and `configs/edison`
directories produce bit-for-bit results with the current `develop`.
Testing and Validation: The 'Time' coordinate of xarray data sets must support at least years
0001 and 9999, and preferably any conceivable value
Date last modified: 2017/02/11
Contributors: Xylar Asay-Davis
Unit tests have been added to ensure that dates both close to 0001-01-01 and typical calendar dates (e.g. 2017-01-01) function as expected.
@akturner’s MPAS-SeaIce run with real dates (mentioned in #81) has been successfully run with the proposed approach. This run started in 1958, and had presented a problem for MPAS-Analysis with the previous calendar.