Get rid of dbmfile-based dogpile cache #243

Closed
amotl opened this issue Nov 22, 2020 · 17 comments · Fixed by #567
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


amotl commented Nov 22, 2020

Describe the bug
The dogpile cache based on dbmfile is brittle.

To reproduce
Run Wetterdienst from different Python environments and watch access to the shared cache break more often than not; see #217, #232, #233, #242 and #244. #236 also seems to be related.

Expected behavior
Wetterdienst should work in all circumstances, even when switching between different Python environments.

Additional context
While the documentation about pickle [1] promises that

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.

it apparently still causes problems in our context. While I currently don't know why, I suspect it stems from marshalling/unmarshalling data frames across different versions of Pandas. We can either investigate this further or use a different means of data storage and/or a different serialization protocol for the dogpile cache.

[1] https://docs.python.org/3/library/pickle.html
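
For illustration, a minimal sketch (not Wetterdienst's actual code) of how a dbm-file-backed dogpile.cache region pickles cached values such as Pandas data frames; the cache path and function are hypothetical:

import pandas as pd
from dogpile.cache import make_region

# Hypothetical cache file location.
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=3600,
    arguments={"filename": "/tmp/wetterdienst_cache.dbm"},
)

@region.cache_on_arguments()
def load_frame(key: str) -> pd.DataFrame:
    # The returned data frame is pickled into the dbm file; unpickling it
    # from a different Python/Pandas environment is where things tend to break.
    return pd.DataFrame({"key": [key]})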


amotl commented Nov 25, 2020

At #254 (comment), we found the dbmfile-based cache also has serious concurrency issues. When looking at future usage of the REST API, this is yet another reason to finally improve the situation here.

gutzbenj commented

Dear @amotl ,

as you have already mentioned here [1], we should switch to another backend: apart from the latest problems we faced, the dbm backend is not supported on Windows. The available backends of dogpile.cache are described at [2]. The dogpile.cache Redis backend offers options for concurrency and may be a suitable replacement.

distributed_lock – boolean, when True, will use a redis-lock as the dogpile lock. Use this when multiple processes will be talking to the same redis instance. When left at False, dogpile will coordinate on a regular threading mutex.

For future development, we may also want to follow the project at [3], which has shown improvements in memory consumption with other serialization methods.

[1]

# TODO: Make backend configurable, e.g. better use Redis.

[2] https://dogpilecache.sqlalchemy.org/en/latest/api.html#module-dogpile.cache.backends.memory
[3] https://github.com/jvanasco/dogpile_backend_redis_advanced
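
For reference, a hedged sketch of how the dogpile.cache Redis backend described at [2] could be configured with distributed_lock enabled; host, port and expiration values are placeholders:

from dogpile.cache import make_region

region = make_region().configure(
    "dogpile.cache.redis",
    expiration_time=3600,
    arguments={
        "host": "localhost",
        "port": 6379,
        "db": 0,
        "redis_expiration_time": 3600 * 2,
        # Use a redis-lock so multiple processes talking to the same
        # Redis instance coordinate on the dogpile lock.
        "distributed_lock": True,
    },
)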


kmuehlbauer commented Dec 10, 2020

Hi @gutzbenj and @amotl, concerning retrieving/caching of remote assets, did you already consider using fsspec (https://filesystem-spec.readthedocs.io/en/latest/index.html)?


amotl commented Dec 10, 2020

Dear Kai,

thanks for bringing this to our attention. The option to cache files from a target filesystem locally [1] would be appropriate, right? In this regard, we would have to check whether the target filesystem can simply be something accessed over HTTP through the regular Python requests module when downloading artefacts from opendata.dwd.de [2].

The feature documentation [3] says:

Consistent API to many different storage backends. The general API and functionality were proven with the projects s3fs and gcsfs (along with hdfs3 and adlfs), within the context of Dask and independently.

as well as [4]:

Any path of any file-system can be mapped to a local directory using pyfuse and fsspec.fuse.run(). This feature is experimental, but basic file listing with details, and read/write should generally be available to the extent that the remote file-system provides enough information.

So, we will have to check how to wrap fsspec's target filesystem concept around a general directory index structure as served by HTTP servers.

That would indeed be very elegant: it would shift the caching away from the read-through caching of Python function calls currently employed by dogpile.cache, towards caching based on fetching remote resources from HTTP servers.

Please correct me if you believe I am heading in the wrong direction here.

With kind regards,
Andreas.

[1] https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
[2] https://opendata.dwd.de/
[3] https://filesystem-spec.readthedocs.io/en/latest/features.html
[4] https://filesystem-spec.readthedocs.io/en/latest/features.html#mount-anything-with-fuse
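
To make the local-caching idea [1] more concrete, here is a small sketch using fsspec's "filecache" layer chained in front of HTTP; the target file and cache directory are only examples:

import fsspec

url = (
    "filecache::https://opendata.dwd.de/climate_environment/CDC/"
    "observations_germany/climate/daily/kl/historical/"
    "KL_Tageswerte_Beschreibung_Stationen.txt"  # example artefact
)

# Downloads land below cache_storage; subsequent opens are served locally.
with fsspec.open(url, filecache={"cache_storage": "/tmp/wetterdienst-fsspec-cache"}) as f:
    head = f.read(200)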


amotl commented Dec 10, 2020

How to wrap fsspec's target filesystem concept around a general directory index structure as served by HTTP servers?

Indeed, there is a generic fsspec.implementations.http.HTTPFileSystem implementation, which would probably fit the bill perfectly.

Unfortunately, it is not listed at [5], so I missed it at first. test_http.py implements the respective test case for it.

[5] https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
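
A minimal sketch of what using HTTPFileSystem against a DWD directory index could look like (URL chosen as an example):

from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
# List entries of the HTTP directory index, names only.
listing = fs.ls(
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/",
    detail=False,
)
print(listing[:5])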

kmuehlbauer commented

I have been stumbling over fsspec every once in a while for several months, but had not taken a deeper look at it until now. The user documentation still has room for improvement when it comes to specific examples. What I tried so far is fetching DWD radar data with a very simple approach:

import fsspec
import wradlib as wrl

# Directory index with DWD radar volume sweeps (HDF5).
fname = "https://opendata.dwd.de/weather/radar/sites/sweep_vol_z/ess/hdf5/"
# List the files behind the HTTP directory index.
files = list(fsspec.get_mapper(fname))
# Open the most recent file and read it with wradlib/h5netcdf.
with fsspec.open(fname + files[-1]) as f:
    ds = wrl.io.open_odim(f, loader="h5netcdf")
    display(ds[0].data)  # display() assumes an IPython/Jupyter session

This alone is really nice for just grabbing some data from a remote HTTPS server.

I also tried to grasp the full possibilities (caching, fusing into the local filesystem, etc.), but did not get as far as I wished. Still, it seems to be widely adopted across the big data science packages (pandas, xarray, pangeo).


amotl commented Dec 10, 2020

Dear Kai,

I quickly put something up at [1]. This is excellent.

Thanks again and with kind regards,
Andreas.

[1] https://gist.github.com/amotl/0a2eb63708b8a0cf4dc457b1e6a87455

meteoDaniel commented

In your code you have added a TODO and mentioned Redis.

Here are some reasons why you should not use Redis by default:

  • Running an additional service in parallel will increase the number of dependencies.
  • It is difficult to run Redis in a single-container Docker environment. It is not impossible, but the state of the art is to use docker-compose and attach the Redis container to the main container, and docker-compose is not compatible with cloud infrastructure or Kubernetes.


amotl commented Apr 21, 2021

Hi Daniel,

thanks for your feedback; we appreciate learning from people about the environments in which they run Wetterdienst.

Sure, Redis will only ever be optional, but it will save us from hassle when running Wetterdienst within a multithreaded environment, because accessing a filesystem-based cache in a thread-safe manner across platforms is not always easy.

Wetterdienst will always be able to run in interactive or batch mode as well as in daemon mode, and we will try to keep the balance and optimize runtime behavior in all of those scenarios. Introducing a strict runtime dependency, like having to run a Redis service, in a scenario like the one you outlined would be a bad idea.

In addition to that, I want to elaborate a bit more about the direction we are heading towards with respect to getting rid of the dbmfile-based cache implemented by dogpile. As suggested by @kmuehlbauer, we started looking into fsspec by @martindurant and contributors. Expanding on that, we identified all spots that needed to be fixed and extended to be able to make it an appropriate replacement for the current implementation. @gutzbenj already made a start with fsspec/filesystem_spec#507 and I recently submitted fsspec/filesystem_spec#560 and fsspec/filesystem_spec#561. I hope to be able to return to both patches soon in order to implement the suggestions by @martindurant.

With kind regards,
Andreas.


amotl commented May 13, 2021

Hi again,

with #431, we are replacing dogpile.cache with fsspec. However, this will not persistently cache all resources on disk yet. On another, slightly related note, I also want to share DiskCache [1,2] by @grantjenks and contributors here.

@gutzbenj already drew my attention to it the other day. It might even be suitable to accompany the fsspec-based caching implementation [3].

With kind regards,
Andreas.

[1] http://www.grantjenks.com/docs/diskcache/
[2] https://github.com/grantjenks/python-diskcache
[3] https://github.com/intake/filesystem_spec/blob/master/fsspec/caching.py
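
As a rough sketch of what DiskCache could contribute here, it provides a thread- and process-safe cache backed by SQLite; directory, key and payload below are placeholders:

from diskcache import Cache

cache = Cache("/tmp/wetterdienst-diskcache")

key = "dwd-station-index"
if key not in cache:
    # Placeholder payload; in practice this would be a fetched remote resource.
    cache.set(key, b"...", expire=3600)

payload = cache.get(key)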

martindurant commented

Certainly, fsspec would be happy to see alternative, more fully-featured caching schemes. Integration should generally be pretty simple.


amotl commented Aug 8, 2021

Hi there,

@gutzbenj bumped me to start working on #431 again. While doing so, I discovered an issue when upgrading to fsspec>=2021.6.0.

Problem

When using fsspec>=2021.6.0, Wetterdienst apparently hits a regression. With that version and newer, the test suite will croak:

FAILED tests/provider/dwd/observation/test_fileindex.py::test_file_index_creation
FAILED tests/provider/dwd/observation/test_metaindex.py::test_meta_index_creation

- FileNotFoundError: ['https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/kl/historical/']

Reason

The reason is that the new list_remote_files_fsspec function, in turn using filesystem.glob() and filesystem.expand_path(), will not produce the same results as before.

Solution

Downgrading to fsspec==2021.5.0 fixed the problem, see 4f6ea1c.

Further investigation

@martindurant: I will investigate this further and will let you know about the outcome. In the meanwhile, we fixed it by downgrading to fsspec==2021.5.0.

With kind regards,
Andreas.


amotl commented Aug 10, 2021

At #431 (comment), @martindurant asked whether there is already an fsspec issue linked to the regression we observed. Thank you!

I believe @gutzbenj spent some minutes trying to come up with a minimal repro, but I haven't heard back from him yet.

@martindurant: I didn't want to open an issue prematurely, because I wanted to first investigate whether the error is on our side and fsspec just got improved by properly raising exceptions where it did not do so before. If that is the case, the mitigation would rather be to make Wetterdienst adjust to the new behavior, so there would be no reason to bother you on the fsspec issue tracker.


gutzbenj commented Aug 10, 2021

Dear @amotl ,

instead of using fs.glob(), I'd suggest using fs.find() to get a directory listing, as in the following example.

from fsspec.implementations.http import HTTPFileSystem
import pandas as pd

URL = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl"

fs = HTTPFileSystem()

files = fs.find(URL)

# Apply filters via pandas
df = pd.DataFrame({"file": files})

df = df[df.file.str.endswith(".zip")]

However, when evaluating this example again, I was able to observe a regression with requests==2.26.0.

With kind regards,
Benjamin.


amotl commented Aug 10, 2021

Hi @martindurant,

I will investigate this further and will let you know about the outcome. In the meanwhile, we fixed it by downgrading to fsspec==2021.5.0.

Regarding the issue we observed with fsspec>=2021.6.0:

[...] first investigate whether the error is on us and fsspec just got improved recently by properly raising exceptions now where it did not do so before.

That's it. I simply missed handling FileNotFoundError appropriately in our code, so the tests choked on that. It got fixed with 6e2adc6. Sorry for the noise!

#492 bundles all improvements to the baseline implementation #431, thanks to @gutzbenj.

With kind regards,
Andreas.
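
For completeness, a sketch (not the actual Wetterdienst code) of the kind of handling described above, assuming that newer fsspec raises FileNotFoundError where older versions returned an empty listing:

from fsspec.implementations.http import HTTPFileSystem

def list_remote_files(url: str) -> list:
    fs = HTTPFileSystem()
    try:
        return fs.find(url)
    except FileNotFoundError:
        # Treat a missing directory index as "no files" instead of crashing.
        return []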


amotl commented Aug 12, 2021

Hi again,

in this context, I would also like to reference fsspec/filesystem_spec#639.

With kind regards,
Andreas.


amotl commented Jan 28, 2022

Dear Benjamin,

I would like to salute you for your efforts on this and, at the same time, wish the upstream improvements in fsspec/filesystem_spec#895 much fortune!

With kind regards,
Andreas.
