Incorporating the CFA convention for aggregated datasets into CF #508
David,
And below the Figure 1 you write:
Probably I am missing something here, but to me this seems contradictory? Anyway, that is a detail, and I think the more important questions are the ones you raise in the Technical Proposal Summary:
To me this is no doubt a good idea, which already has a strong community backing.
Perhaps an outline somewhere in the main text: end of Chapter 2 regarding aggregation files and their relation to the fragment files, somewhere in Chapter 3 regarding aggregation variables? And then an exhaustive description in an Appendix? This brings me to a more general thought that I have been mulling over for some time:
Thank you for your comments, Lars, and sorry that it has taken me some time to respond. Even though you are the only person to have commented here (and in support), this proposal has been scrutinised carefully at two CF workshops, with a group decision being reached in 2023 to work towards incorporating CFA into CF. I'm therefore minded to move on to writing the PR, now that Lars has made a good suggestion of how and where the content could go into the existing CF conventions. This shouldn't take too long, because it will largely be a "cut and paste" job from the existing CFA description, which was deliberately written in a CF-ish style in anticipation of this :).
Good point. The first statement applies to the reading of the data, and the second to the writing of the data. The CFA conventions do not give any guidance on the decision of how fragment files can be combined prior to creating an aggregation variable; rather, once you have an aggregation in mind, they provide a framework in which you can encode it in such a way that other people can decode it. If I give you two datasets (A and B) then the CFA conventions won't give you any help in working out if A and B can be sensibly combined into a single larger dataset (C). There are various ways in which you could work this out yourself - you could inspect the metadata and apply an aggregation algorithm (e.g. this one, or by visual inspection), or base it on file names (e.g. I know that model outputs from
I like the idea of a Chapter 2 outline. I might suggest content from Introduction, Terminology, Aggregation variables, and Aggregation instructions (without its subsections) for Chapter 2, and everything else - which is most of the existing CFA document - (Standardized aggregation instructions, Non-standardized terms, Fragment Storage and examples) for the appendix.
Just a thought - the TOC currently shows all subsections - maybe it could be restricted to just one level of subsection, so for instance Chapter 7 would go from
to
That alone would remove 71 lines from the TOC! But as you say, any more on that should be discussed elsewhere, which I would welcome.
I think this is generally a good idea and have been meaning to go over the details. A quick thought about the table of contents: Would it be easy in the web view to collapse the subsection hierarchy to 1 or 2 levels, then click on an upper level to display its subsections? That might give a newbie a more accessible overview. On the other hand, I usually just execute "find" for some key word I know is relevant to what I want to look up, and if that word becomes hidden (in a hidden low level subsection), then I may have a harder time navigating quickly to the relevant section. So I can see arguments for the current expanded table of contents.
Hello, We have finally prepared a pull request for incorporating aggregation into the CF conventions: #534 It touches on 9 files: Chapter 1: Terminology All the best,
Hello, I fully appreciate that this is a large pull request, but it would be very nice if someone who wasn't involved in its development could look it over. The PR already has the support of the original CFA authors (@JonathanGregory, @bnlawrence, @nmassey001, @sadielbartholomew and myself), but at least one "outside" perspective is necessary, I think. It would be great if this could get into CF-1.12, which means in practice that it would all have to be agreed by (roughly) the end of October. Any takers? We'd be much obliged :) Many thanks,
Hi all, I have now studied the proposal as given above and also read the CFA conventions documentation. What's the easiest way to review the pull request? When I look at the changes made to files, I can see the new text, but it isn't particularly easy to read. [I know I should know how to do this by now.] NOTE ADDED AFTER THE FACT: I HAVE NOW BEGUN TO READ THE ACTUAL PULL REQUEST, WHICH DIFFERS SUBSTANTIALLY FROM THE PROPOSAL ABOVE, SO MOST OF THE FOLLOWING COMMENTS MAY BE IRRELEVANT. I'LL PROVIDE FEEDBACK ON THE ACTUAL PULL REQUEST IN THE NEXT DAY OR TWO. Anyway, based on what I've read, I have a few comments and questions:
I understand that the data writer can name these however (s)he likes, but for the CF documentation, I think the above would be easier for users to understand (unless, of course, I've completely misrepresented what they are).
Thanks, Karl - I very much appreciate your comments. I've seen your edit about reading the latest, and will hold off responding to everything until you have done so. However, it could be useful to mention some of your comments now, and with reference to the PR text:
It has indeed. In section 2.8.2 Fragment Interpretation it says that converting the fragment to its canonical form may involve "Transforming the fragment's data to have the aggregation variable's units (e.g. as required when aggregating time fragments whose units have different reference date/times)."
That's right - it's a convenience feature. Instead of having to update the file paths of 1 million fragment file names when the files are moved, if the file names have been defined with a substitution (cf. environment variable) then you just have to update that one attribute to set the new location for the 1 million files. The new text in section 2.8.1 Aggregated Dimensions and Data aims to clarify this: "The use of substitutions can save space in the aggregation file; and in the event that the fragment locations need to be updated after the aggregation file has been created, it may be possible to achieve this by modifying the substitutions attribute rather than by changing the actual location fragment array variable values."
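As a minimal illustration of how a reading application might apply such substitutions (the parsing of the attribute and the file names here are hypothetical, not prescribed by the conventions):

```python
def apply_substitutions(location, substitutions):
    """Expand "${keyword}" substitutions in a fragment location.

    `substitutions` maps keywords such as "${path}" to replacement
    strings, as would be parsed from a "substitutions" attribute.
    """
    for keyword, replacement in substitutions.items():
        location = location.replace(keyword, replacement)
    return location


# Relocating many fragments then only requires updating this one mapping:
subs = {"${path}": "https://remote.host/data/"}
print(apply_substitutions("${path}file.nc", subs))
# https://remote.host/data/file.nc
```

The point of the mechanism is that a single attribute edit re-targets every fragment location that uses the keyword.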
If the intention were to only ever aggregate netCDF fragments, I might agree, but we'd like the conventions to allow the aggregation of non-CF datasets (2.8. Aggregation Variables: "Fragment datasets may be CF-compliant or have any other format, thereby allowing an aggregation variable to act as a
Hi David and all, [My input below is, I hope, constructive, because I think that adoption of a CF-compliant approach to creating aggregated datasets will be very useful. Thanks for all your work on this. It's likely that something quite obvious has eluded me, in which case, please excuse my ignorance, but perhaps you could provide further explanation (or examples) that might enlighten me.] I've spent some time studying the proposed pull request changes to the conventions document. I spent most of my time trying to figure out exactly how to interpret the fragment_shape array and thinking about how I might form an aggregated array from the fragment arrays, based on the (array size?) numbers it gives. I failed. Then, I thought about what alternative options there might be for providing mapping information in a concise form, which codes could use to combine fragments into a single aggregate. In the pdf file attached below, I've suggested an alternative to the pull request proposal and highlighted it in yellow. I think my approach is easier to explain to users and would facilitate the construction of aggregated variables. (It has similarities with conventional "pointer" approaches to accessing array data.) I know code has already been written based on the original proposal, so perhaps my alternative will not be popular. More likely, those of you who spent so much time coming up with the "fragment_shape" approach of describing how the fragments fit together will find an obvious problem with my suggestion. If so, perhaps all that is needed is a better explanation of your method. In particular, as a first step, I would be interested in someone telling me what the "fragment_shape" is for the example I came up with. (See the few lines of red-highlighted text below the colorful graphic in the attachment.) Perhaps that will enable me to finally "get" how this shape information can be used.
The following document contains a suggestion on how the approach might be modified and made simpler to explain to new users. Most of the "edit suggestions" contained in the file are unrelated to the new approach. Thanks again for all the thought and work that has already gone into this.
Just wanted to ask about a use case: Suppose I want to aggregate a surface temperature field provided by multiple models, all on a common grid. There is no "model axis" in the files. Can I combine the fields by defining a "model label" coordinate?
Karl asked:
Yes, provided that all of the models are on the same domain, of course. Here is a modification of the new Example L.1, with an extra "model" axis included:
We could include this example in the appendix.
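Since the modified Example L.1 is not reproduced in this thread, here is a rough numpy sketch (all array shapes, values, and model names hypothetical) of what aggregating along a new leading "model" axis amounts to:

```python
import numpy as np

# Hypothetical surface temperature fields from three models, all on
# a common 4 x 5 latitude-longitude grid; each field is one fragment.
fields = [np.full((4, 5), t) for t in (280.0, 281.5, 279.2)]

# The aggregated data is equivalent to stacking the fragments along
# a new leading "model" dimension:
aggregated = np.stack(fields, axis=0)
print(aggregated.shape)  # (3, 4, 5)

# A string-valued coordinate variable labels the new axis:
model = np.array(["MODEL-A", "MODEL-B", "MODEL-C"])
```

In the aggregation file, each model's field would be one fragment spanning the whole lat-lon domain and one element of the model dimension.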
Hi Karl, Just quickly jumping on your suggested structural change, before getting into your text suggestions and questions ... I'm intrigued by your new approach, and I can't see any problems with it just by thinking about it :) (I'd like to run it past my software implementation to be sure). I'm wondering if the super-general figure on page 12 of conventions_aggregation_PR_KET.pdf should be restricted to the cases allowed by the current "fragment array" - i.e. where all of the fragments are aligned in neat hyper-rows. This is because a) I doubt there's a use case for non-aligned fragments; b) I very much doubt that there's any software out there that can handle the fully general case whilst also applying "lazy loading" of the data, and we shouldn't encourage people to tackle this very thorny problem without need; and c) full generality could easily be allowed if a use case ever arose. What does anyone else think?
Hi David, That is very encouraging. I must admit I was unable to understand how your fragment_shape information got utilized. I didn't realize that the constraint was imposed that "all of the fragments are aligned in neat hyper-rows". (I'm not sure I still understand what that means exactly, but for now, that's o.k.) There is an important constraint even for my "super-general" example: All fragments must be logically rectangular (in hyperspace), and together they must fill a logically-rectangular aggregated array. One can think of each fragment as a block, and the blocks together are used to build a single aggregated block (without leaving any spaces). I agree we could be more restrictive (for reasons you've listed above) if that really makes a difference to those writing code. Thinking in terms of fortran-style coding (which is my default thought process), I don't think it would be difficult to handle the general case, but then I'm not familiar with what you call "lazy loading" of the data. Is that loading the data into a vector without preserving the multi-dimensional structure? In any case, I think the primary advantage offered by the alternative approach is that it seems to me to be easier to explain. Let's see what others think. A note on the examples and notation: New users might better follow what we're doing if we change the keyword (under aggregated_data) from "shape" to "map", since the values tell you how to map your fragments into the aggregated array. In the example, I named the "shape" variable "fragment_starts", but a better variable name might be "insert_at", since the fragment arrays get inserted in the aggregated array at the index values provided by the "insert_at" variable.
While I'm thinking about it, the term "address" doesn't immediately bring to mind the name of a variable, but rather a location of the variable; would "identifier" be a better term? It doesn't specifically have to be a variable name, but could be, as you noted earlier, an integer or some other kind of identifier of the variable of interest.
Hi Karl, I'm going to tackle all of your comments very soon, but first would like to try to conclude the discussion on the fragment mapping. I have thought a lot more about your suggested proposal for replacing the fragment_shape approach.
Your point that it was hard to understand the description is certainly correct, though! I propose a new, and hopefully understandable, description of the original approach: for each element of the fragment array, the map variable defines the number of elements occupied by its fragment along each of the aggregated dimensions.
The part of each aggregated dimension that is occupied by a fragment is defined by the fragment size along that dimension, offset by the sizes of the fragments that precede it in the fragment array.
When the aggregated data is scalar, the fragment array is also scalar and the map fragment array variable must be stored as a scalar variable containing the value
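The placement rule described here (each fragment's extent along a dimension is its size, offset by the sizes of the fragments that precede it) can be sketched as follows; the per-dimension size lists are hypothetical:

```python
import itertools


def fragment_slices(dim_sizes):
    """Locate each fragment within the aggregated array.

    `dim_sizes` gives, per aggregated dimension, the list of fragment
    sizes along that dimension (one row of the "map" per dimension).
    Yields (fragment_array_index, slices) pairs.
    """
    # Start offsets along each dimension are cumulative sums of sizes.
    starts = []
    for sizes in dim_sizes:
        offs, total = [], 0
        for s in sizes:
            offs.append(total)
            total += s
        starts.append(offs)
    shape = tuple(len(s) for s in dim_sizes)  # shape of the fragment array
    for idx in itertools.product(*(range(n) for n in shape)):
        yield idx, tuple(
            slice(starts[d][i], starts[d][i] + dim_sizes[d][i])
            for d, i in enumerate(idx)
        )


# Hypothetical 2-D case: a 12-element dimension split 6+6, and a
# 7-element dimension split 3+4, giving a 2 x 2 fragment array:
for idx, sl in fragment_slices([[6, 6], [3, 4]]):
    print(idx, sl)
# first line: (0, 0) (slice(0, 6, None), slice(0, 3, None))
```

Note that the offsets depend only on the sizes, which is why the map rows alone suffice to place every fragment.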
Hi David, I'm not going to be able to get to this before next week, unfortunately. A quick read through raised a question. Could you clarify what you mean by "all of the fragments are aligned in neat hyper-rows."? This apparently excludes the "super-general" case I was considering, but I'm just not sure what the actual constraints are on what kind of fragments can be aggregated. Also, above you state "For each element of the fragment array, the map variable defines the number of elements occupied by its fragment along each of the aggregated dimensions". I might be able to figure out what is meant by studying the example, but on its own I can't visualize what you have in mind. thanks, |
Hi Karl,
Yes - sorry about this made-up phrase! I'm struggling to find the correct terminology ... How about (borrowing from the processor decomposition of parallelised NWP and climate models): "The fragments comprise a regular domain decomposition of the aggregated data". Both of the following examples have a 2 x 2 fragment array, but the fragments are not fully aligned in the second case, so that is not OK.
O.K., I think I get it now. The partitioning of any given dimension into fragments must be consistent across all fragments comprising the aggregate. So, for example, you couldn't aggregate data that was originally stored on a global grid for part of a simulation with data you might have stored in two parts (say a N.H. chunk and a separate S.H. chunk) for the remainder of the simulation. Right? I suppose if it were possible to aggregate two "aggregate" files into a super-aggregate, then you could handle the above case. First you would aggregate the two hemispheres of data for the portion of the simulation where they had been separately stored. Then you would aggregate this aggregated data with the data stored originally on a global grid. [I'm not suggesting this is an important "use case", but just checking on the limits of the approach you've proposed.] |
In the conventions document, I think we need to mention that an aggregation variable can't be directly accessed through the netCDF API, but an intermediary code must be written that interprets the construction of an aggregated array. This intermediary code will obtain the fragments using the netCDF API, and enable the user then to manipulate the aggregated array. Do I have that right? One question that popped into my head is: Does your current "intermediary code" enable the user to obtain a subset of the aggregated variable without reading in the entire array? For example, if I have 100 years of a monthly, globally-gridded 3-d field (like AirTemperature(time, pressure, lat, lon)) which has been saved as 10-year chunks, and I've constructed an aggregation file spanning the 10 fragments, can I ask your intermediary code to extract just the 500 hPa pressure level of data from the aggregated dataset without first storing in memory all pressure-levels? |
Depends on your definition of intermediary code: yes, the data can be accessed through the netCDF API, but it's a two-step process - you need to use the netCDF API to work out which fragment files to open and where to put the content in the aggregated array, but both steps use the netCDF API ... and yes, both the working implementations (xarray and cf-python) are fully lazy and only extract the data you want when you want to do the computation (but of course CF itself doesn't require or say anything about that).
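A sketch of the first step of that two-step process, under the assumption that each fragment's extent along the aggregated dimensions is known from the aggregation file (all names and numbers here are hypothetical):

```python
def intersecting_fragments(fragment_extents, request):
    """Return the ids of fragments that overlap the requested region.

    `fragment_extents` maps a fragment id to a tuple of (start, stop)
    index pairs, one per aggregated dimension; `request` is a tuple of
    (start, stop) pairs for the wanted subset. Only the returned
    fragments need to be opened via the netCDF API.
    """
    hits = []
    for frag, extents in fragment_extents.items():
        if all(a0 < b1 and b0 < a1
               for (a0, a1), (b0, b1) in zip(extents, request)):
            hits.append(frag)
    return hits


# 100 years of monthly data stored as ten 10-year (120-month) chunks
# along the time dimension:
frags = {i: ((i * 120, (i + 1) * 120),) for i in range(10)}
print(intersecting_fragments(frags, ((300, 420),)))  # months 300..419
# [2, 3]
```

Extracting, say, a single pressure level would work the same way: only the part of each selected fragment that intersects the request is read, which is what makes lazy access possible.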
Dear Karl
As it happens, @davidhassell and I discussed this situation on Thursday. Yes, you could take such a hierarchical approach, if necessary. An aggregation may have aggregations as its fragments. Best wishes Jonathan |
Hi - just to let you know that I'm in the midst of preparing a detailed review of Karl's comments, and preparing a new PR that incorporates many of the suggestions made here (and mainly by Karl!). I hope to have it all ready in a day or two. |
Hi David and all, I now understand how you know how to "map" the fragments to the aggregated array given the aggregated_data's shape specifications. I hadn't realized all the constraints placed on the fragments that make this work. I think newbies would understand your approach more easily if they had the constraints in mind. [When I was first trying to figure it out, I imagined that you could aggregate fragments that had fewer constraints, and I couldn't see how your approach would handle that.] Under either of the two approaches we considered, the following constraint is imposed:

• The fragment arrays cannot be "ragged", i.e., they can, in general, be visualized as multi-dimensional rectangular solids.

For the original approach (@davidhassell et al.), there is an additional constraint (which is difficult to describe):

• Along each dimension, the same number of fragment arrays will together fill that dimension's aggregated space, independent of all other dimensions. Although along a given dimension the fragments comprising the aggregated array can occupy unequal portions of that dimension, they must be aligned consistently across all the other dimensions. None of the fragments comprising the aggregated array will be offset from any neighboring fragment.

While thinking about the generality of the approaches, the following use-cases came to mind:
Regridding can be computationally expensive, so I might choose to do that once and for all before starting my analysis. The most straightforward way of proceeding would be to consider each file individually, regrid its data, and then rewrite it. This would result in the same number of files as before, but now with all data on a common grid. Ideally, I would then simply aggregate the data across time and across models to form a 100-year multi-model gridded dataset of surface temperature. As I understand it, this would only be possible if the original model output were stored in identical temporal chunks. Since this is not the case (with, for example, some models storing data in 20-year chunks and others in 10-year chunks), how would you proceed?
Hi Karl, Thank you for your constraints and examples! Here is a sneak preview of text from the new PR, which I think encapsulates all of our requirements: The aggregated dimensions are partitioned by the fragments (in their canonical forms, see Section 2.8.2 "Fragment Interpretation"), and this partitioning is consistent across all of the fragments, i.e. any two fragments either span the same part of a given aggregated dimension, or else do not overlap along that same dimension. In addition, each fragment data value provides exactly one aggregated data value, and each aggregated data value comes from exactly one fragment. With these constraints, the fragments can be organised into a fully-populated orthogonal multidimensional array of fragments, for which the size of each dimension is equal to the number of fragments that span its corresponding aggregated dimension. The aggregated data is formed by combining the fragments in the same relative positions as they appear in the array of fragments, and with no gaps or overlaps between neighbouring fragments.
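The partitioning constraint in that text ("any two fragments either span the same part of a given aggregated dimension, or else do not overlap along that same dimension") lends itself to a mechanical check; a sketch, with hypothetical fragment extents:

```python
def consistent_partitioning(extents):
    """Check the partitioning constraint along every dimension.

    `extents` is a list of per-fragment tuples of (start, stop) index
    pairs, one pair per aggregated dimension. Along each dimension,
    every pair of fragments must either span the same interval or be
    disjoint.
    """
    ndim = len(extents[0])
    for d in range(ndim):
        spans = [e[d] for e in extents]
        for a0, a1 in spans:
            for b0, b1 in spans:
                same = (a0, a1) == (b0, b1)
                disjoint = a1 <= b0 or b1 <= a0
                if not (same or disjoint):
                    return False
    return True


# Aligned 2 x 2 decomposition of a 12 x 7 aggregated array: OK.
ok = [((0, 6), (0, 3)), ((0, 6), (3, 7)),
      ((6, 12), (0, 3)), ((6, 12), (3, 7))]
# Mis-aligned: the last fragment's first-dimension split differs.
bad = [((0, 6), (0, 3)), ((0, 6), (3, 7)),
       ((6, 12), (0, 3)), ((0, 4), (3, 7))]
print(consistent_partitioning(ok), consistent_partitioning(bad))
# True False
```

The `bad` case is exactly the "not fully aligned" 2 x 2 situation discussed earlier in the thread.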
As you say, unless the partitioning across time is consistent between models, you can't aggregate this example.
If all models have data for the same 10 year period, then they can be aggregated, otherwise not. In the latter case, the mechanics of aggregation would of course allow you to stitch them together along a "model" dimension, but what would you put as the common time coordinates? Like with many things in CF, just because you can do something, it doesn't mean that it correctly describes what you did :).
These cannot be aggregated, because at least one of the fragments will have units that are not equivalent (i.e. convertible) to the units defined on the aggregation variable.
Hi David, Just to let you know I liked the clause, "this partitioning is consistent across all of the fragments, i.e. any two fragments either span the same part of a given aggregated dimension, or else do not overlap along that same dimension". That explains it better than I could have. Regarding the use cases that can't be handled, perhaps we should consider what modifications would be needed to make it possible to handle these, since I think for CMIP data it would be quite useful. Since aggregation is mostly handled in index space, as I understand it, then shouldn't it be possible for a user to aggregate datasets based solely on indexes? When generating an aggregate file, a user could provide whatever coordinate values they wanted for 1 or more of the aggregate's dimensions. If the user elected not to provide values, then your software could read the coordinate values from the fragment files (and presumably check for consistency across fragments). I'm quite ignorant about your software package(s). Does your package help users create the aggregate files (as shown in the examples), or is that left totally up to the user? When a user accesses data through an aggregate file, does your package check that the coordinate values match the coordinate values already in the aggregate file? Still coming up to speed on this.
Hi All, I don't have any particular comments on the content of the conventions, I was fairly satisfied with the scope and coverage, and I'm very much in favour of the overall aim - to persist aggregations to disk, rather than relying on on-the-fly aggregations for which Xarray seems particularly slow. I have a few suggestions from my exposure so far:
In summary, I also agree with others contributing to this thread that this is a useful addition to the CF conventions as a whole, and I would hope they are advertised/promoted accordingly, since there are other aggregation formats with significantly more widespread awareness (Kerchunk/Zarr etc.). My main suggestions revolve around some additional documentation specifically aimed at people with no or little knowledge of the specific terminology with NetCDF or CF in general.
Hi all, I'm still preparing responses to Karl's PDF comments, and have yet to properly read the last couple of comments from Karl and Dan (I will do all of that today!), but I'd like to get my alternative PR out there. This incorporates a lot of Karl's suggestions, and thanks to those, regardless of where we end up, I think that it is a much clearer exposition.
I have been in favour of this addition since the beginning, and still am, although I have not followed this discussion in any detail. A general (in fact CF-wide) thought that again springs to mind when I read @dwest77a's comment is that I think CF should try (hard) to be independent of any language references. Do not misunderstand this as me being in any way critical of the work Daniel has done -- rather the contrary! But sometimes references to language implementations --- especially Python --- may limit the discussion. If some aspect of a language implementation of the proposal stands out in some way (difficult, simple, slow, fast, easy ...), or depends on a mechanism or library only available in that language (we are here talking about Python) that might make it difficult to implement the proposal in other languages, then I think this needs to be discussed. I have no idea if this is the case here. Anyway, in general I think it is important in the long run to keep a clear separation between the CF Conventions as a convention/standard and its implementations in different languages. Not having done that is, I personally think, one of the weaknesses of many very popular formats/tools/etc. Daniel mentions zarr and kerchunk (I have no experience of either, but they seem to be oriented towards/limited to Python). If we really want a closer connection between CF and some software tool, I think that C is (still?) the way to go, with bindings to R, Matlab, Julia, Fortran and more. But that is for another day and conversation.
Replying to Karl,
An aggregation file contains exactly the same information as its equivalent non-aggregation file that contains copies of all of the fragment data. The new PR has the line "The aggregated data is identical to the data that would be stored within a dataset that contained the equivalent non-aggregation variable.". This is near the end of the description but should, I realise, be much more prominent. This means that aggregation is not solely in index space - we can't aggregate a 1 degree horizontal grid with a 2 degree horizontal grid because we couldn't represent the data in a non-aggregation file.
There is no requirement nor expectation for CF nor any software implementation to check other parts of the dataset (e.g. the coordinates) against any equivalent entities in the fragment files. You could easily create the coordinates as aggregation variables themselves if you wanted, which would mean that their values would come only from the fragments. This has implications for accessibility (i.e. the coordinate values are not readily available to casual inspection).

[*] This aggregation algorithm is completely based on the CF data model, and should be guaranteed to work for any conceivable CF datasets. It was proposed to CF nearly a decade ago (https://cfconventions.org/Data/Trac-tickets/78.html), which was probably a little premature. But now that we are on the brink of being able to store aggregations in CF, perhaps it's time to think again about providing guidance on how to create them. This is not part of this proposal, though!
To be sure, is it possible (even if perhaps inadvisable) for me to create an aggregation file linking 10 years of monthly-mean data created by multiple models even when the models have assumed different calendars (so some include leap years and others don't)? Assume I've recorded in my aggregation file 120 values for the time dimension of the aggregated array, which I've based, let's say, on a no-leap-year calendar. Now if the software reading my aggregation file simply creates the aggregation array based on the mapping information I've provided, won't I get a properly defined aggregation array? Or is the aggregation software required to check that the time coordinate values I've defined are consistent with the coordinate values stored in each of the fragment files? Basically, can't I create an aggregated dataset that has only approximately correct time coordinate information (for some models), which I deem to be good enough for analysis of multi-model ensembles?
This is an interesting case. Let's first assume that only the field data is aggregated, and the time coordinates are not (i.e. the time coordinates are a normal variable whose data is in the file). You could create an encoded aggregation, but the time coordinate values would not match up with the data, which seems bad! If the data represented monthly means, the mismatch appears to go away, but because a given month may have different lengths across the models, averaging weights based on the time coordinates would be incorrect. If the time coordinate variable was also aggregated then you would get an error at read time, because the calendars of some fragments would not be convertible to the calendar of the aggregated (time) variable. Remember - aggregation is just a different encoding of a normal CF-netCDF variable, and doesn't allow you to store anything different from what can be stored without aggregation. We must make this clear in the text.
After some off-line discussions, we (the original CFA authors) have agreed that the functionality that allows file name string substitutions, and the ability to store multiple locations per fragment, is not needed after all. Essentially, we realised that the use we envisaged - that of using CFA files to act as indexes to large distributed archives - can be just as easily managed without these two features, and that writing software to manage them is actually really hard! The use case hasn't gone away, but the presumed mechanisms to support it are not needed, and so can be removed from the convention, thereby making the conventions considerably simpler. Removing this functionality from the text (PR #561) is as simple as removing the following two paragraphs:

The location variable may have an extra trailing dimension that allows multiple versions of fragments to be specified. Each version contains equivalent information, so that any version that exists can be selected for use in the aggregated data. This could be useful when it is known that a fragment could be stored in various locations, but it is not known which of them might exist at any given time. For instance, when remotely stored and locally cached versions of the same fragment have been defined, an application program could choose to only retrieve the remote version if the local version does not exist. Every fragment must have at least one location, but not all fragments need to have the same number of versions. Where fragments have fewer versions than others, the extra trailing dimension is padded with missing values. See [example-L.2].

A fragment dataset location may be defined with any number of string substitutions, each of which is provided by the location variable's substitutions attribute.
The substitutions attribute takes a string value comprising blank-separated elements of the form "substitution: replacement", where substitution is a case-sensitive keyword that defines part of a location variable value which is to be replaced by replacement in order to find the actual fragment dataset location. A location variable value may include any subset of zero or more of the substitution keywords. After replacements have been made, the fragment dataset location must be an absolute URI or a relative-path URI reference. The substitution keyword must have the form ${*}, where * represents any number of any characters. For instance, the fragment dataset location https://remote.host/data/file.nc could be stored as ${path}file.nc, in conjunction with substitutions="${path}: https://remote.host/data/". The order of elements in the substitutions attribute is not significant, and the substitutions for a given fragment must be such that applying them in any order will result in the same fragment dataset location. The use of substitutions can save space in the aggregation file; and in the event that the fragment locations need to be updated after the aggregation file has been created, it could be possible to achieve this by modifying the substitutions attribute rather than by changing the actual location variable values. See [example-L.3]. and their associated examples. Is everyone OK with this? If a more compelling use case ever arose, reinstating either or both of these paragraphs would be trivial. Thanks,
This is a follow-on of #508 (comment)
The reason I brought up this case is that we routinely compare model and observational datasets containing monthly means where different calendars have been imposed. If, for a leap year, we plot the February mean temperature map for 20 models, we don't care that some models might have 29 days contributing to the mean and others only 28 days. That's because the differences between models are generally large enough that the difference in averaging period has negligible impact on what we're looking at. So I think for some purposes, there will be lots of folks who will find it useful to aggregate multi-model monthly data in the way described above.

Am I correct that my "bad idea" would be possible if the field data is aggregated, but the time coordinates are not? If so, I think that's great. [With multi-model data aggregated in this way, it would be easy to calculate multi-model means and standard deviations, and carry out various other types of analysis (and not worry about mismatched time coordinates).]
Hi Karl,
That's right, provided that each time coordinate cell represents a whole calendar month.

For other sized time coordinate cells (e.g. 1 day), the time dimension sizes in the fragments won't all equal the size of the time dimension in the aggregation file (because different models have different calendars), and your software will raise an error when it tries to create the aggregated data array.
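To make the failure mode concrete, here is a minimal sketch (the function name is invented; this is not part of CFA or cf-python) of the consistency check a reading application would effectively perform:

```python
def check_fragment_sizes(fragment_sizes, aggregated_size):
    """Hypothetical reader-side check: the fragments' spans along a
    dimension must tile the aggregated dimension exactly."""
    total = sum(fragment_sizes)
    if total != aggregated_size:
        raise ValueError(
            f"fragments span {total} cells but the aggregated "
            f"dimension has {aggregated_size}"
        )


# Monthly cells: every calendar yields 12 cells per year, so one year
# of monthly means from any model tiles a 12-cell time dimension.
check_fragment_sizes([6, 6], 12)

# Daily cells: a 365-day calendar fragment cannot tile a year of a
# 360-day calendar, so assembling the aggregated data fails.
try:
    check_fragment_sizes([365], 360)
except ValueError as exc:
    print(exc)
```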
Incorporating the CFA convention for aggregated datasets into CF
Moderator
To be decided
Moderator Status Review [last updated: 2024-02-07]
Requirement Summary
This is a proposal to incorporate the CFA conventions into CF.
CFA (Climate and Forecast Aggregation) is a convention for recording aggregations of data, without copying their data.
The CFA conventions were discussed at the 2021 and 2023 annual CF workshops, the latter discussion resulting in an agreement to propose their incorporation into CF.
By an “aggregation” we mean a single dataset which has been formed by combining several datasets stored in any number of files. In the CFA convention, an aggregation is recorded by variables with a special function, called “aggregation variables”, in a single netCDF file called an “aggregation file”. The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array. An aggregation variable will almost always take up a negligible amount of disk space compared with the space taken up by the data that belongs to it, because each constituent piece, called a “fragment”, of the aggregated data array is represented solely by file and netCDF variable names and a few indices that describe where its data should be placed relative to the other fragments (see examples 1 and 2).
Example 1: A timeseries of surface air temperature from 1861 to 2100 is archived across 24 files, each spanning 10 years. It is useful to view these as if they were a single netCDF dataset spanning 240 years.
CFA has been developed since 2012 and is now a stable and versioned convention that has been fully implemented by cf-python for both aggregation file creation and reading.
Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array. That is a software implementation decision. For instance, cf-python has an algorithm for this purpose (we think that the cf-python aggregation rules are complete and consistent because they are entirely based on the CF data model).
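To give a flavour of what such an aggregability decision considers, here is a deliberately crude sketch (invented names and structure; cf-python's actual rules are far more thorough, covering the full CF data model):

```python
def could_aggregate(a, b):
    """Crude aggregability test on two variables' metadata, given as
    plain dictionaries.  A real algorithm (e.g. cf-python's) compares
    the complete CF data model representation of each variable; this
    sketch checks only a few obviously necessary conditions."""
    return (
        a.get("standard_name") == b.get("standard_name")
        and a.get("units") == b.get("units")
        and a.get("dimensions") == b.get("dimensions")
    )


# Two hypothetical monthly fields that could plausibly be aggregated
# along their time dimension:
jan = {
    "standard_name": "air_temperature",
    "units": "K",
    "dimensions": ("time", "lat", "lon"),
}
feb = dict(jan)
assert could_aggregate(jan, feb)
```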
Storing aggregations of existing datasets is useful for data analysis and archive curation. Data analysis benefits from being able to view an aggregation as a single entity and from avoiding the computational expense of creating aggregations on-the-fly; and aggregation files can act as metadata-rich archive indices that consume a very small amount of disk space.
The CFA conventions only affect the representation of a variable’s data, and thus they work alongside all CF metadata, i.e. the CFA conventions do not duplicate, extend, nor re-define any of the metadata elements defined by the CF conventions.
An aggregation file may, and often will, contain both aggregation variables and normal CF-netCDF variables i.e. those with data arrays. All kinds of CF-netCDF variables (e.g. data variables, coordinate variables, cell measures) can be aggregated using the CFA conventions. For instance an aggregated data variable (whose actual data are in other files) may have normal CF-netCDF coordinate variables (whose data are in the aggregation file).
Another approach to file aggregation without copying data is NcML Aggregation, which has been extensively used. CFA is similar in intent to NcML but is more general and efficient, because it:
Technical Proposal Summary
The CFA conventions currently have their own document (https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md) which describes in detail how to create and interpret an "aggregation variable", i.e. a netCDF variable that does not contain a data array but instead has attributes that contain instructions on how to assemble the data array as an aggregation of data from other sources.
A Pull Request to incorporate CFA into CF has not been created yet. Before starting any work on translating the content of the CFA document into the CF conventions document, it is important to get the community’s consensus that this is a good idea, and about how the new content should be structured (e.g. a new section, a new appendix, both, or something else).
The main features of CFA are summarised in example 2, a CDL view of an aggregation of two 6-month datasets into a single 12-month variable (see the CFA document for details).
Example 2: An aggregation data variable whose aggregated data comprises two fragments. Each fragment spans half of the aggregated time dimension and the whole of the other three aggregated dimensions, and is stored in an external netCDF file in a variable called temp. The fragment URIs define the file locations. Both fragment files have the same format, so the format variable can be stored as a scalar variable.
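To illustrate what a reading application does with this information, here is a schematic of the assembly step using numpy (invented names and stand-in data; a real reader would fetch each fragment's data from the file and `temp` variable named by the aggregation instructions):

```python
import numpy as np

# Hypothetical fragment records for Example 2: two fragments, each
# spanning half of a 12-month aggregated time dimension.  The arrays
# here stand in for data that would be read from the fragment files.
fragments = [
    {"data": np.full(6, 280.0), "index": slice(0, 6)},   # months 1-6
    {"data": np.full(6, 290.0), "index": slice(6, 12)},  # months 7-12
]

# Assemble the aggregated data array by placing each fragment at the
# position described by its index information
aggregated = np.empty(12)
for fragment in fragments:
    aggregated[fragment["index"]] = fragment["data"]
```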
Benefits
Aggregations persisted to disk allow users and software libraries to access pre-created aggregations with no complicated and time-consuming processing.
Status Quo
Not being able to persist fully generalised aggregations to disk means that every user/software library has to be able to create their own aggregations every time the data files are accessed. This is a complicated and time-consuming task.
Associated pull request
None yet (see above).
CFA authors
CFA has been developed by David Hassell, Jonathan Gregory, Neil Massey, Bryan Lawrence, and Sadie Bartholomew.
Contributors to CFA discussions at the CF workshops
Chris Barker, Ethan Davies, Roland Schweitzer, Karl Taylor, Charlie Zender, and Klaus Zimmermann (please let us know if we have accidentally missed you off this list).