Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read unstructured domain resample #18

Merged
merged 15 commits into from
May 22, 2024

Conversation

jeff-cole
Copy link
Collaborator

This test case extends read_domain_decomp_resample to regrid data on unstructured LFRic mesh to rectilinear grid.

@jeff-cole
Copy link
Collaborator Author

I set all the tests for unstructured meshes to be known failures as they fail the tolerance test. I feel the tolerance is set to low for these tests, I could just set a tolerance which would make all the tests pass. Would that be the best thing to do? The maximum relative error for these tests are

gulfstream max relative error = 0.6005423709670573
harmonic   max relative error = 0.3918016243246943
sinusiod   max relative error = 0.028933139088239804
vortex     max relative error = 0.27763766952570046

Which are much larger than the current tolerance value of 0.005.

@jeff-cole jeff-cole requested a review from mo-marqh May 9, 2024 12:52
Copy link
Member

@mo-marqh mo-marqh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jeff-cole

this is looking good, i have some detail comments within this review that would benefit from explaining or amending.

elif nc_method == 'data_func':
# create a netCDF file from an analytic function
os.chdir(cls.test_dir)
import xios_examples.gen_netcdf as gn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please may we keep imports at the top of the module?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved as requested


def create_ncfile_unstructured(ncmeshout, meshin_file, meshin_varname, func, add_bounds=True, data_prefix='', data_suffix=''):

# Create unstructured mesh from lfric mesh netCDF file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please may we have this as a triple-quote docstring for the function, rather than a hash-comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

def create_ncfile(ncfile, nlat, nlon, func, dim_prefix='', dim_suffix='', data_prefix='', data_suffix=''):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please may we have a triple-quote docstring for this function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

mesh_varname=defaults['mesh_varname'],
nlat=defaults['nlat'], nlon=defaults['nlon'],
nlatr=defaults['nlatr'], nlonr=defaults['nlonr']):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a docstring would help here.
One unknown on reading: can file_out be a path that includes directories and file name, or is it limited to file name within current working dir? (see shared_testing comments)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring added. Yes you can use full paths.

inf], cwd=cls.test_dir, check=True)
elif nc_method == 'data_func':
# create a netCDF file from an analytic function
os.chdir(cls.test_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd like to avoid changing directories within the test runs if possible.
could gn.run( take a full path, using cls.test_dir to avoid the os.chdir call?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this works fine, changed

# these tests are hitting exceptions with XIOS3
# but not XIOS2, so skip for XIOS3 runner
setattr(TestResampleDomain, tname,
unittest.skip(TestResampleDomain.make_a_resample_test(f)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please may i check, are there exceptions with XIOS3 for these tests, different from the known_failures?
if so, what is XIOS3 reporting?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must attempt I never tried these tests with XIOS3 just assumed they did not work as per the other tests. I not sure what the other tests failures are, is there any information on that? Anyway I tried running with XIOS3 and got this error

In file "field.cpp", function "void xios::CField::setContextClient(xios::CContextClient*)", line 1496 -> Grid not defined for odatax (if field is an input field, set read_access to true)

which didn't happen with XIOS2. The field is an input field but I don't know what set read_access to true means, I have already got mode="read", do you have any idea on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a look at the documentation and there is a read_access attribute but it is for a field not a file as I originally tried. From the user guide

In order not to waste memory, the instant data of a field can be read from the
model only if:
• it is part of a file whose attribute mode is “read”
• or its attribute read_access is set to “true”.
In any other case, trying to access the field data would cause an error.

Looks like this has changed in XIOS3

After adding the read_access attribute I no longer got the above error I now get this error

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f38a25483ff in ???
#1  0x7f38a61a4caa in _ZNKSs4_Rep12_M_is_leakedEv
        at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1715013693798/work/libstdc++-v3/include/bits/cow_string.h:214
#2  0x7f38a61a5da7 in _ZNSs4_Rep7_M_grabERKSaIcES2_
        at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1715013693798/work/libstdc++-v3/include/bits/cow_string.h:265
#3  0x7f38a61a5dee in _ZNSsC2ERKSs
        at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1715013693798/work/libstdc++-v3/include/bits/cow_string.h:546
#4  0x55c4acc707c0 in _ZN4xios20CTransformationPaths23getNextTransformationIdEv
        at /home/users/jcole/software/xios3_trunk_r2629/src/transformation/transformation_path.cpp:97
#5  0x55c4ac99c806 in _ZN4xios5CGrid24buildTransformationGraphERNS_17CGarbageCollectorEbPS0_ddRS3_bPNS_6CFieldE
        at /home/users/jcole/software/xios3_trunk_r2629/src/node/grid.cpp:2141
#6  0x55c4ac91ef0c in _ZN4xios6CField18buildWorkflowGraphERNS_17CGarbageCollectorE
        at /home/users/jcole/software/xios3_trunk_r2629/src/node/field.cpp:746
#7  0x55c4ac652500 in _ZN4xios8CContext15closeDefinitionEv
        at /home/users/jcole/software/xios3_trunk_r2629/src/node/context.cpp:1208
#8  0x55c4acb28e41 in cxios_context_close_definition
        at /home/users/jcole/software/xios3_trunk_r2629/src/interface/c/icdata.cpp:133
#9  0x55c4ac2bfe37 in __idata_MOD_xios_close_context_definition
        at /home/users/jcole/software/xios3_trunk_r2629/config_jasmin_debug/ppsrc/xios/interface/fortran/idata.f90:691
#10  0x55c4ac26f7cd in initialise
        at /home/users/jcole/software/git/jeff-cole/tcd-XIOS-demonstration/xios_examples/read_unstructured_domain_resample/resample.F90:62
#11  0x55c4ac26f45c in resample
        at /home/users/jcole/software/git/jeff-cole/tcd-XIOS-demonstration/xios_examples/read_unstructured_domain_resample/resample.F90:18
#12  0x55c4ac26fd98 in main
        at /home/users/jcole/software/git/jeff-cole/tcd-XIOS-demonstration/xios_examples/read_unstructured_domain_resample/resample.F90:9

Is this similat to what you saw for the other tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jeff-cole

Many of the tests run fine with XIOS2 & XIOS3 (there are some exceptions, but we'd like to keep these to a minimum if we can, and explore to try and have version-portable example code)

Where & how did you run these tests to get the invalid_memory_reference error please?

i've just tried branching from your branch and enabling these test for XIOS3 in a fork-branch:
https://github.com/mo-marqh/tcd-XIOS-demonstration/tree/read_unstructured_domain_resample
mo-marqh@7640c1e
I have set off the github runners within my fork seems to run these tests for XIOS3 and complete, with the same tolerance known-error, but no Segmentation fault
https://github.com/mo-marqh/tcd-XIOS-demonstration/actions/runs/9157120279/job/25172852674#step:6:92

it would be good to try and understand this environment difference if we can.
In the meantime, please may you remove these skip patterns for these tests and run them on github for this branch?

thank you
mark

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mo-marqh

I think using exceptedFailure hides errors other than not meeting the tolerance requirements. When I turned off skipping the XIOS3 test it passed with 4 known failures. When I then increased the tolerance value and removed the known failures it failed with a python netCDF4 problem, which was previously masked by using exceptedFailure.

I fixed the netCDF4 problem (I was using a feature which was added after the version being using by the github runners) and now it fails with the same error as I got above (not using read_access). I will fix this and see if I get the invalid_memory_reference error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't get the same invalid_memory_reference error, this time I got

/home/runner/work/tcd-XIOS-demonstration/tcd-XIOS-demonstration/xios_examples/read_unstructured_domain_resample/xios_client_0.err
-> error : WARNING: Unexpected request for buffer to communicate with server 0
In file "type_impl.hpp", function "void xios::CType<T>::_checkEmpty() const [with T = int]",  line 210 -> Data is not initialized

I also tried running it on archer2 (not with the trunk but 2 versions earlier) and this seemed to get further but hung.

All of the read_axis/read_domain tests skip running XIOS3, can you remember what problem they had which made you skip them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original runs were done on jasmin, this is what gave the invalid_memory_reference error,

mesh_file = 'mesh_C12.nc'


# A list of input `.cdl` files where XIOS is known to produce different
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a comment inconsistency, there are no input .cdl files here, i think, they are generated from Python code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed this to not refer to .cdl files but I don't really understand what this comment means, there is no list here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can't see where this file is being used in any of the code, is it required?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is needed to get the mesh topology information to write the LFRic unstructured netCDF file. It is referenced in file test_resample_cases.py when you define the TestResampleDomain class

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and used in the gn.run call in file shared_testing.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's clear, thank you.

given that this is an 84k binary file, i wonder whether it would be useful for repository readability to include this as a plain text .cdl file and to build it using ncgen in the test preparation? it's not much of a size change, it would be 134k

I'm a bit wary of storing binary files within the repository as it's harder to understand what they are, and not possible to track changes meaningfully

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this module being used?
I can't see how it's being called from anything else? (am I missing this?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this isn't used, I will delete it

import numpy as np

class dataFunc:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class docstring please

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mo-marqh
Copy link
Member

I set all the tests for unstructured meshes to be known failures as they fail the tolerance test. I feel the tolerance is set to low for these tests, I could just set a tolerance which would make all the tests pass. Would that be the best thing to do? The maximum relative error for these tests are

gulfstream max relative error = 0.6005423709670573
harmonic   max relative error = 0.3918016243246943
sinusiod   max relative error = 0.028933139088239804
vortex     max relative error = 0.27763766952570046

Which are much larger than the current tolerance value of 0.005.

i'd like to do a bit more digging into this and assess this tolerance value and what might be appropriate for us to target.

if we can do this as part of the review, then that's okay, but we can merge this as 'known failures' and spawn a follow up ticket to address tolerances for resampling

@mo-marqh
Copy link
Member

I set all the tests for unstructured meshes to be known failures as they fail the tolerance test. I feel the tolerance is set to low for these tests, I could just set a tolerance which would make all the tests pass. Would that be the best thing to do? The maximum relative error for these tests are

gulfstream max relative error = 0.6005423709670573
harmonic   max relative error = 0.3918016243246943
sinusiod   max relative error = 0.028933139088239804
vortex     max relative error = 0.27763766952570046

i'd like to do a bit more digging into this and assess this tolerance value and what might be appropriate for us to target.

if we can do this as part of the review, then that's okay, but we can merge this as 'known failures' and spawn a follow up ticket to address tolerances for resampling

hi @jeff-cole

i think what i would like would be to set these tolerances such that they pass for these values, and known_failures are not required.

Then i'd like to create a separate issue with the aim of justifying these tolerances, so that we can show willing, and have evidence beyond just values that worked.

does this seem reasonable to you?
ty, m

@mo-marqh mo-marqh merged commit 10cde6a into MetOffice:main May 22, 2024
4 checks passed
This was referenced May 22, 2024
@jeff-cole jeff-cole deleted the read_unstructured_domain_resample branch May 22, 2024 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants