
SUMMA interacts with more forcing files than needed for specified simulation length #501

wknoben opened this issue Jan 20, 2022 · 10 comments


@wknoben
Collaborator

wknoben commented Jan 20, 2022

Bug Reports

  • SUMMA v3; recent develop branch; hash edd328c
  • Compiler: GCC 7.3.0
  • Operating system: Linux
  • A description of relevant model settings: n/a
  • A summary of the bug or error message you are getting: SUMMA reads or checks forcing data from all files listed in forcingFileList.txt, despite only needing to read a subset of them for the specified simulation duration
@KyleKlenk
Contributor

I addressed this in SUMMA-Actors by modifying the file manager to take two more options, which lets SUMMA know how many files it should read.

The extra options were:
forcingFreq 'month' ! the frequency of forcing files (month, year)
forcingStart '1979-01-01' ! starting date of the forcing file list

If this is acceptable I can write out a broader plan for changes.
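
For illustration, here is a minimal sketch of how those two options could bound the file list. This is not the SUMMA-Actors implementation; it assumes forcingFreq = 'month', one file per month, and a time-sorted file list, and all variable names are hypothetical.

```fortran
! Hypothetical sketch: given forcingStart and the simulation window,
! compute which entries of forcingFileList.txt are actually needed.
program count_forcing_files
  implicit none
  integer :: forcYear, forcMonth          ! parsed from forcingStart
  integer :: simStartYear, simStartMonth  ! parsed from the simulation start
  integer :: simEndYear, simEndMonth      ! parsed from the simulation end
  integer :: iFirst, iLast                ! 1-based indices into the file list

  forcYear = 1979;     forcMonth = 1      ! forcingStart '1979-01-01'
  simStartYear = 1980; simStartMonth = 6  ! example simulation: Jun-Aug 1980
  simEndYear = 1980;   simEndMonth = 8

  ! months elapsed between forcingStart and the simulation window
  iFirst = (simStartYear - forcYear)*12 + (simStartMonth - forcMonth) + 1
  iLast  = (simEndYear   - forcYear)*12 + (simEndMonth   - forcMonth) + 1

  print *, 'read files ', iFirst, ' through ', iLast  ! prints 18 through 20
end program count_forcing_files
```

With forcingFreq = 'year' the same arithmetic would count years instead of months; either way, only files iFirst through iLast would need to be opened.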

@andywood
Collaborator

andywood commented Oct 29, 2022 via email

@KyleKlenk
Contributor

Hi Andy,

Thanks for the feedback. I have been experimenting with SUMMA for some time, and it is good to learn which parts of the code, like the fileManager, should be left unchanged.

I ran some tests that are hopefully helpful to this open issue.

Dataset: North America - 492 forcing files - 517,315 GRUs - chunked at 1,000 GRUs per chunk
Machine configuration: 64 CPUs - 2 TB RAM - connected to network-attached storage (NAS) over a 10 Gbit link
netCDF version: 4.5.2

Running 1 GRU for one month with 1 forcing file in the forcingFileList.txt

  • Run Time: 1 Second

Running 1 GRU for one month with all 492 forcing files listed in the forcingFileList.txt

  • Run Time: 30 Seconds for first run
  • Run Time: 1-3 Seconds for subsequent runs (looks like caching is taking effect)

I think this would be the worst-case scenario. When running more than one GRU at a time, the GRU computations themselves start to take up the majority of the runtime.

Hopefully this information is helpful. I did some extra testing on another machine that is closer to a typical consumer PC, with 4 CPUs, 16 GB of RAM, and a 1 Gbit connection. It was getting the benefit of caching from the NAS, which made the results of loading one file versus multiple files basically the same. I can share those results once the cache is cleared, but I expect they will be similar to the above, with the difference coming from the network connection speed to our NAS.

Thanks again for your comments,
Kyle

Some extra information on why I made these changes in my experimental code; this is something I should have investigated on my side before changing the code:

  • When I was testing my experimental version of SUMMA, I was using a Slurm cluster and requesting 1 CPU and 1 GB of RAM in an interactive session. This configuration may have made the problem look worse than it actually was, since there was likely little benefit from caching. During this testing I was usually running one GRU at a time with different configurations of forcing files, again making the problem look worse than it may be for the typical user.

@andywood
Collaborator

andywood commented Oct 30, 2022 via email

@wknoben
Collaborator Author

wknoben commented Oct 31, 2022

Hi Kyle,

Just dropping a few thoughts here.

  1. I agree with Andy that changing the fileManager is probably best left as a last-resort option. If we can get the same improvements by making the code cleverer, then that's probably the way to go.

  2. The caching solution is not something that can be relied on in an HPC environment, because cached memory is ephemeral. There's no guarantee that the data persists in cache, especially if you request a memory allocation that's just about enough to simply run the model. I had a long conversation with our HPC team about this that you might find interesting (see: https://jira.usask.ca/servicedesk/customer/portal/2/ISD-364972).

  3. I agree with Andy that it might be worthwhile to have a thorough check of the forcing reading code, to see if it is as optimal as it can be.

Cheers,
Wouter

@KyleKlenk
Contributor

Thanks for the comments,

I can take a look at the source code, report what I find, and propose changes here. The reasoning for not changing the fileManager makes sense to me. I applied this change naively in Summa-Actors a while ago, and it looks like it needs to be reverted. Since I need to do that revert for our experimental version anyway, looking into other solutions and proposing them here should be no problem.

That is a good point about caching on HPC systems, too. I remember you sharing that conversation a while back; it was a good refresher, and I agree it is not something to rely on.

Thanks,
Kyle

@KyleKlenk
Contributor

Hi Andy and Wouter,

I went through the source code and this is what I found inside ffile_info.f90.

forcingFileList.txt is read in, and the number of lines in it determines the length of the array of forcing files.

Then, inside a do loop, each file in that array is opened, and descriptive information such as the variable names and the number of time steps in that netCDF file is used to populate the forcFileInfo data structure. When the user lists too many files in forcingFileList.txt, the loop does not exit until it has opened and read the metadata of every netCDF file in the list.

It seems to me a check could be implemented to exit the do loop once we have passed the last forcing file that is needed, so we do not continue. I think this is what you had in mind.

The situation where the user has extra forcing files at the beginning of the list (i.e. the simulation start date falls within a forcing file that is not the first in forcingFileList.txt) could be covered by the same check, except that instead of exiting the loop we move to the next iteration. The catch is that read_force.f90 will still open those leading files unless iFile is set to the file that corresponds with the start of the simulation rather than the start of forcingFileList.txt.
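
For concreteness, here is a minimal, self-contained sketch of both checks. The names (fileStartTime, simStartTime, firstFile, and so on) are illustrative rather than the actual SUMMA identifiers, and times are expressed as days since an arbitrary reference.

```fortran
! Hypothetical sketch of the loop checks described above.
program skip_unneeded_files
  implicit none
  integer, parameter :: nFiles = 5
  ! start/end of each (monthly) file, as days since a reference date;
  ! in SUMMA these would come from each file's time axis
  real :: fileStartTime(nFiles) = [0.,  31., 59., 90.,  120.]
  real :: fileEndTime(nFiles)   = [31., 59., 90., 120., 151.]
  real :: simStartTime = 45., simEndTime = 100.
  integer :: iFile, firstFile

  firstFile = 1
  do iFile = 1, nFiles
    ! file ends before the simulation starts: skip it, but track the
    ! offset so the forcing reader starts at the right file
    if (fileEndTime(iFile) < simStartTime) then
      firstFile = iFile + 1
      cycle
    end if
    ! file starts after the simulation ends: stop scanning the list
    if (fileStartTime(iFile) > simEndTime) exit
    print *, 'would populate forcFileInfo for file ', iFile  ! files 2-4
  end do
  print *, 'forcing reader should begin at file ', firstFile ! file 2
end program skip_unneeded_files
```

The exit would avoid opening any files past the simulation end; files before the start would presumably still need to be opened to read their time bounds, but their metadata would no longer be stored, and firstFile is the offset that read_force.f90 would need so it does not reopen the leading files.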

I can clarify as needed if something doesn't make sense; I am still learning how best to communicate these issues. In the end, none of this is a problem if the user takes responsibility for ensuring the forcing file list matches their simulation period.

Best,
Kyle

@andywood
Collaborator

andywood commented Nov 3, 2022 via email

@andywood
Collaborator

andywood commented Nov 3, 2022 via email

@wknoben
Collaborator Author

wknoben commented Nov 3, 2022

Reading Andy's comment, it seems that we can agree that implementing these procedures has the potential to make debugging large-domain, long-term runs a bit more convenient. I would add that making the model's behaviour a bit more intuitive has an extra benefit for new users, who might need a while to realize that reducing the forcing file list size can speed up debugging runs.

Given that this seems like a fairly low time investment, it may be worth implementing these suggestions.
