Memory Leak Investigation #32
Comments
If you look at the LEAP run above you'll see the difference. Yes, it uses a ton of memory during rechunking, but then it levels out. That means only our kerchunk workflows show the bad memory pattern, and we can start whittling it down further. |
I was able to get Job (configured to run as one worker process on DirectRunner)

import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# Assumed import path for the two transforms used below.
from pangeo_forge_recipes.transforms import CombineReferences, OpenWithKerchunk

# pattern, earthdata_protocol, fsspec_open_kwargs, target_fsspec_open_kwargs,
# CONCAT_DIMS and IDENTICAL_DIMS are defined elsewhere in the recipe.


class PrintKeyValueFn(beam.DoFn):
    """Print every combined reference mapper that reaches the end of the pipeline."""

    def process(self, element):
        for mapper in element:
            print(f"{mapper}")
            # for key, value in mapper.items():
            #     print(f"Key: {key}, Value: {value}")


parser = argparse.ArgumentParser()
known_args, pipeline_args = parser.parse_known_args()

with beam.Pipeline(argv=pipeline_args) as p:
    (
        p
        | beam.Create(pattern.items())
        | OpenWithKerchunk(
            remote_protocol=earthdata_protocol,
            file_type=pattern.file_type,
            # lat/lon are around 5k, this is the best option for forcing kerchunk to inline them
            inline_threshold=6000,
            storage_options=fsspec_open_kwargs,
        )
        | CombineReferences(
            concat_dims=CONCAT_DIMS,
            identical_dims=IDENTICAL_DIMS,
            target_options=target_fsspec_open_kwargs,
            remote_options=target_fsspec_open_kwargs,
            remote_protocol='s3',
            mzz_kwargs={},
            precombine_inputs=False,
        )
        | "Print Key-Value Pairs" >> beam.ParDo(PrintKeyValueFn())
    )

JH Memory Distribution
Sorted by Own (late in the job)
Sorted by Total (late in the job)
Notable Other Theories
@norlandrhagen is correct that the biggest contributor is the result of the huge reduce we do in memory on the whole https://beam.apache.org/documentation/transforms/java/aggregation/combine/
☝️ |
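To make the "huge reduce in memory" point concrete, here is a minimal sketch in plain Python (no Beam import) of the accumulator protocol that Beam's `CombineFn` follows: `create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`. The class `CollectRefs`, the helper `run_global_combine`, and the toy reference dicts are all hypothetical. This is not how `CombineReferences` is actually implemented; it only illustrates why a global combine that keeps every input in its accumulator grows with the number of inputs.

```python
class CollectRefs:
    """Hypothetical combiner that mirrors the method names of beam.CombineFn."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        # Every input is retained until the final output is extracted.
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        # The combined result is only built once everything has been gathered.
        combined = {}
        for refs in accumulator:
            combined.update(refs)
        return combined


def run_global_combine(fn, bundles):
    """Simulate a runner: one accumulator per bundle, then a single merge."""
    accumulators = []
    for bundle in bundles:
        acc = fn.create_accumulator()
        for element in bundle:
            acc = fn.add_input(acc, element)
        accumulators.append(acc)
    merged = fn.merge_accumulators(accumulators)
    # Return how many inputs were held at once, plus the combined output.
    return len(merged), fn.extract_output(merged)


# Toy stand-ins for per-file reference dicts (hypothetical keys and values).
bundles = [
    [{"time/0": "file0"}, {"time/1": "file1"}],
    [{"time/2": "file2"}],
]
held, combined = run_global_combine(CollectRefs(), bundles)
```

With ~7k inputs instead of three, `held` would be ~7k reference sets resident at the same time, which is the pattern a flat memory profile would not show.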
The From how I understand it, the StoreToZarr recipe builds an Xarray schema, writes an empty Zarr store, and then inserts dataset chunks into that store as they arrive, which isn't blocking. Maybe we can rewrite the Kerchunk pipeline to act in a similar way vis-a-vis appending? |
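The "write an empty store, then insert chunks as they arrive" idea can be sketched without any Zarr or Xarray dependency. Assumptions: `make_empty_store` and `write_chunk` are hypothetical helpers, and a plain dict stands in for the Zarr store. Whether kerchunk references can be appended this way is exactly the open question in the comment above.

```python
def make_empty_store(n_steps):
    """Pre-allocate one slot per time step, analogous to writing an empty store from a schema."""
    return {"shape": n_steps, "chunks": [None] * n_steps}


def write_chunk(store, index, chunk):
    """Insert a single chunk at its known position; no other chunk is needed to do so."""
    if not 0 <= index < store["shape"]:
        raise IndexError(f"chunk index {index} is outside a store of size {store['shape']}")
    store["chunks"][index] = chunk


store = make_empty_store(4)
# Chunks may arrive in any order; each one is written and can then be released.
for index, chunk in [(2, "c2"), (0, "c0"), (3, "c3"), (1, "c1")]:
    write_chunk(store, index, chunk)
```

Because each write only needs its own chunk and its target index, nothing has to wait for (or hold on to) the full set of inputs.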
Been looking at this for a few minutes and trying to get a handle on the various pieces implicated by the
Perhaps something like this for
I'm also seeing a pre_combine flag - it doesn't look to me like it quite does the above but I'm not confident about kerchunk/multizarrtozarr behavior to be honest |
Good stuff in here to review later but closing for now |
Problem
A dependency of pangeo-forge-recipes might be producing a memory leak, based on the memory distribution we see during job runs.
Next Steps (all can be done in parallel)
Run recipes in GCP Dataflow (on an over-provisioned instance) and see if we get the same memory pattern.
Profile the jobs locally with the DirectRunner using Scalene, or see what we can do with apache-beam about profiling, as this issue talks about.
It's a guess, but this might have to do with file openers and pickling. Anyhow, some issues that might give us a bearing:
ecmwf/cfgrib#283
fsspec/filesystem_spec#825
apache/beam#28246 (comment)
Distributions
~7k time steps of MURSST data (WriteCombinedReference workflow): https://github.com/developmentseed/pangeo-forge-staging/tree/mursst-kerchunk/recipes/mursst
~5k time steps of GPM IMERG data (WriteCombinedReference workflow): https://github.com/developmentseed/pangeo-forge-staging/tree/gpm_imerg/recipes/gpm_imerg
~20 yearly time steps of LEAP data (StoreToZarr workflow): https://github.com/ranchodeluxe/leap-pgf-example/tree/main/feedstock