Get rid of dbmfile-based dogpile cache #243

Closed
amotl opened this issue Nov 22, 2020 · 17 comments · Fixed by #567
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


amotl commented Nov 22, 2020

Describe the bug
The dogpile cache based on dbmfile is brittle.

To reproduce
Run Wetterdienst from different Python environments and watch access to the shared cache break more often than not; see #217, #232, #233, #242 and #244. #236 also seems to be related.

Expected behavior
Wetterdienst should work in all circumstances, even when switching between different Python environments.

Additional context
While the documentation about pickle [1] promises that

The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.

it apparently still causes problems in our context. While I currently don't know why, I suspect it stems from marshalling/unmarshalling data frames across different versions of Pandas. We can either investigate this further or use a different means of data storage and/or a different serialization protocol for the dogpile cache.

[1] https://docs.python.org/3/library/pickle.html
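
For illustration, a minimal sketch (not Wetterdienst's actual code) of how a dbm-file-backed dogpile.cache region pickles cached values such as Pandas data frames; the cache path and function are hypothetical:

import pandas as pd
from dogpile.cache import make_region

# Hypothetical cache file location.
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=3600,
    arguments={"filename": "/tmp/wetterdienst_cache.dbm"},
)

@region.cache_on_arguments()
def load_frame(key: str) -> pd.DataFrame:
    # The returned data frame is pickled into the dbm file; unpickling it
    # from a different Python/Pandas environment is where things tend to break.
    return pd.DataFrame({"key": [key]})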


amotl commented Nov 25, 2020

At #254 (comment), we found the dbmfile-based cache also has serious concurrency issues. When looking at future usage of the REST API, this is yet another reason to finally improve the situation here.

gutzbenj commented

Dear @amotl ,

as you have already mentioned here [1], we should switch to another backend: apart from the latest problems we faced, the dbm backend is not supported on Windows. The available backends of dogpile.cache are described at [2]. The dogpile.cache Redis backend offers options for concurrency and may be a suitable replacement.

distributed_lock – boolean, when True, will use a redis-lock as the dogpile lock. Use this when multiple processes will be talking to the same redis instance. When left at False, dogpile will coordinate on a regular threading mutex.

For future development, we may also want to follow the project at [3], which has shown improvements in memory consumption with other serialization methods.

[1]

# TODO: Make backend configurable, e.g. better use Redis.

[2] https://dogpilecache.sqlalchemy.org/en/latest/api.html#module-dogpile.cache.backends.memory
[3] https://github.com/jvanasco/dogpile_backend_redis_advanced
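
For reference, a hedged sketch of how the dogpile.cache Redis backend described at [2] could be configured with distributed_lock enabled; host, port and expiration values are placeholders:

from dogpile.cache import make_region

region = make_region().configure(
    "dogpile.cache.redis",
    expiration_time=3600,
    arguments={
        "host": "localhost",
        "port": 6379,
        "db": 0,
        "redis_expiration_time": 3600 * 2,
        # Use a redis-lock so multiple processes talking to the same
        # Redis instance coordinate on the dogpile lock.
        "distributed_lock": True,
    },
)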


kmuehlbauer commented Dec 10, 2020

Hi @gutzbenj and @amotl, concerning retrieving/caching of remote assets, did you already consider using fsspec (https://filesystem-spec.readthedocs.io/en/latest/index.html)?


amotl commented Dec 10, 2020

Dear Kai,

thanks for bringing this to our attention. The option to cache files from a target filesystem locally [1] would be appropriate, right? In this regard, we would have to check whether the target filesystem can simply be something accessed over HTTP through the regular Python requests module when downloading artefacts from opendata.dwd.de [2].

The feature documentation [3] says:

Consistent API to many different storage backends. The general API and functionality were proven with the projects s3fs and gcsfs (along with hdfs3 and adlfs), within the context of Dask and independently.

as well as [4]:

Any path of any file-system can be mapped to a local directory using pyfuse and fsspec.fuse.run(). This feature is experimental, but basic file listing with details, and read/write should generally be available to the extent that the remote file-system provides enough information.

So, we will have to check how to wrap fsspec's target filesystem concept around a general directory index structure as served by HTTP servers.

That would indeed be very elegant: it would shift the caching away from the read-through caching of Python function calls currently employed by dogpile.cache, towards caching based on fetching remote resources from HTTP servers.

Please correct me if you believe I am heading in the wrong direction here.

With kind regards,
Andreas.

[1] https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
[2] https://opendata.dwd.de/
[3] https://filesystem-spec.readthedocs.io/en/latest/features.html
[4] https://filesystem-spec.readthedocs.io/en/latest/features.html#mount-anything-with-fuse
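
To make the local-caching idea [1] more concrete, here is a small sketch using fsspec's "filecache" layer chained in front of HTTP; the target file and cache directory are only examples:

import fsspec

url = (
    "filecache::https://opendata.dwd.de/climate_environment/CDC/"
    "observations_germany/climate/daily/kl/historical/"
    "KL_Tageswerte_Beschreibung_Stationen.txt"  # example artefact
)

# Downloads land below cache_storage; subsequent opens are served locally.
with fsspec.open(url, filecache={"cache_storage": "/tmp/wetterdienst-fsspec-cache"}) as f:
    head = f.read(200)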


amotl commented Dec 10, 2020

How to wrap fsspec's target filesystem concept around a general directory index structure as served by HTTP servers?

Indeed, there is a generic fsspec.implementations.http.HTTPFileSystem implementation, which would probably fit the bill perfectly.

Unfortunately, it is not listed at [5], so I missed it at first. test_http.py implements the respective test case for it.

[5] https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
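
A minimal sketch of what using HTTPFileSystem against a DWD directory index could look like (URL chosen as an example):

from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
# List entries of the HTTP directory index, names only.
listing = fs.ls(
    "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/",
    detail=False,
)
print(listing[:5])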

kmuehlbauer commented

I have been stumbling over fsspec every once in a while for several months, but had not taken a deeper look at it until now. The user documentation still has room for improvement when it comes to specific examples. What I tried so far is fetching DWD radar data with a very simple approach:

import fsspec
import wradlib as wrl

# Directory index with DWD radar volume sweeps (HDF5).
fname = "https://opendata.dwd.de/weather/radar/sites/sweep_vol_z/ess/hdf5/"
# List the files behind the HTTP directory index.
files = list(fsspec.get_mapper(fname))
# Open the most recent file and read it with wradlib/h5netcdf.
with fsspec.open(fname + files[-1]) as f:
    ds = wrl.io.open_odim(f, loader="h5netcdf")
    display(ds[0].data)  # display() assumes an IPython/Jupyter session

This alone is really nice for just grabbing some data from a remote HTTPS server.

I also tried to grasp the full possibilities (caching, fusing into the local filesystem, etc.), but did not get as far as I wished. Still, it seems to be widely adopted across the big data science packages (pandas, xarray, pangeo).


amotl commented Dec 10, 2020

Dear Kai,

I quickly put something up at [1]. This is excellent.

Thanks again and with kind regards,
Andreas.

[1] https://gist.github.com/amotl/0a2eb63708b8a0cf4dc457b1e6a87455

meteoDaniel commented

In your code you have added a TODO and mentioned Redis.

Here are some reasons why you should not use Redis by default:

  • Running an additional service in parallel will increase the number of dependencies.
  • It is difficult to run Redis in a single-container Docker environment. It is not impossible, but the state of the art is to use docker-compose and attach the Redis container to the main container, and docker-compose is not compatible with cloud infrastructure or Kubernetes.


amotl commented Apr 21, 2021

Hi Daniel,

thanks for your feedback; we appreciate learning from people about the environments in which they run Wetterdienst.

Sure, Redis will only ever be optional, but it will save us from hassle when running Wetterdienst within a multithreaded environment, because accessing a filesystem-based cache in a thread-safe manner across platforms is not always easy.

Wetterdienst will always be able to run in interactive or batch mode as well as in daemon mode, and we will try to keep the balance and optimize runtime behavior in all of those scenarios. Introducing a strict runtime dependency, like having to run a Redis service, in a scenario like the one you outlined would be a bad idea.

In addition to that, I want to elaborate a bit more about the direction we are heading towards with respect to getting rid of the dbmfile-based cache implemented by dogpile. As suggested by @kmuehlbauer, we started looking into fsspec by @martindurant and contributors. Expanding on that, we identified all spots that needed to be fixed and extended to be able to make it an appropriate replacement for the current implementation. @gutzbenj already made a start with fsspec/filesystem_spec#507 and I recently submitted fsspec/filesystem_spec#560 and fsspec/filesystem_spec#561. I hope to be able to return to both patches soon in order to implement the suggestions by @martindurant.

With kind regards,
Andreas.


amotl commented May 13, 2021

Hi again,

with #431, we are replacing dogpile.cache with fsspec. However, this will not persistently cache all resources on disk yet. On another, slightly related note, I also want to share DiskCache [1,2] by @grantjenks and contributors here.

@gutzbenj already drew my attention to it the other day. It might even be suitable to accompany the fsspec-based caching implementation [3].

With kind regards,
Andreas.

[1] http://www.grantjenks.com/docs/diskcache/
[2] https://github.com/grantjenks/python-diskcache
[3] https://github.com/intake/filesystem_spec/blob/master/fsspec/caching.py
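
As a rough sketch of what DiskCache could contribute here, it provides a thread- and process-safe cache backed by SQLite; directory, key and payload below are placeholders:

from diskcache import Cache

cache = Cache("/tmp/wetterdienst-diskcache")

key = "dwd-station-index"
if key not in cache:
    # Placeholder payload; in practice this would be a fetched remote resource.
    cache.set(key, b"...", expire=3600)

payload = cache.get(key)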

martindurant commented

Certainly, fsspec would be happy to see alternative, more fully-featured caching schemes. Integration should generally be pretty simple.


amotl commented Aug 8, 2021

Hi there,

@gutzbenj bumped me to start working on #431 again. While doing so, I discovered an issue when upgrading to fsspec>=2021.6.0.

Problem

When using fsspec>=2021.6.0, Wetterdienst apparently hits a regression. With that version and newer, the test suite will croak:

FAILED tests/provider/dwd/observation/test_fileindex.py::test_file_index_creation
FAILED tests/provider/dwd/observation/test_metaindex.py::test_meta_index_creation

- FileNotFoundError: ['https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/1_minute/kl/historical/']

Reason

The reason is that the new list_remote_files_fsspec function, in turn using filesystem.glob() and filesystem.expand_path(), will not produce the same results as before.

Solution

Downgrading to fsspec==2021.5.0 fixed the problem, see 4f6ea1c.

Further investigation

@martindurant: I will investigate this further and will let you know about the outcome. In the meanwhile, we fixed it by downgrading to fsspec==2021.5.0.

With kind regards,
Andreas.


amotl commented Aug 10, 2021

At #431 (comment), @martindurant asked whether there is already an fsspec issue linked to the regression we observed. Thank you!

I believe @gutzbenj spent some minutes trying to come up with a minimal repro, but I haven't heard back from him yet.

@martindurant: I didn't want to open an issue prematurely, because I wanted to first investigate whether the error is on our side and fsspec just got improved by properly raising exceptions where it did not do so before. If that is the case, the mitigation would rather be to make Wetterdienst adjust to the new behavior, so there would be no reason to bother you on the fsspec issue tracker.


gutzbenj commented Aug 10, 2021

Dear @amotl ,

instead of using fs.glob(), I'd suggest using fs.find() to get a directory listing, as in the following example.

from fsspec.implementations.http import HTTPFileSystem
import pandas as pd

URL = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl"

fs = HTTPFileSystem()

files = fs.find(URL)

# Apply filters via pandas
df = pd.DataFrame({"file": files})

df = df[df.file.str.endswith(".zip")]

However, when evaluating this example again, I was able to observe a regression with requests==2.26.0.

With kind regards,
Benjamin.


amotl commented Aug 10, 2021

Hi @martindurant,

I will investigate this further and will let you know about the outcome. In the meanwhile, we fixed it by downgrading to fsspec==2021.5.0.

Regarding the issue we observed with fsspec>=2021.6.0:

[...] first investigate whether the error is on us and fsspec just got improved recently by properly raising exceptions now where it did not do so before.

That's it. I simply missed handling FileNotFoundError appropriately in our code, so the tests choked on that. It got fixed with 6e2adc6. Sorry for the noise!

#492 bundles all improvements to the baseline implementation #431, thanks to @gutzbenj.

With kind regards,
Andreas.
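
For completeness, a sketch (not the actual Wetterdienst code) of the kind of handling described above, assuming that newer fsspec raises FileNotFoundError where older versions returned an empty listing:

from fsspec.implementations.http import HTTPFileSystem

def list_remote_files(url: str) -> list:
    fs = HTTPFileSystem()
    try:
        return fs.find(url)
    except FileNotFoundError:
        # Treat a missing directory index as "no files" instead of crashing.
        return []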


amotl commented Aug 12, 2021

Hi again,

in this context, I would also like to reference fsspec/filesystem_spec#639.

With kind regards,
Andreas.


amotl commented Jan 28, 2022

Dear Benjamin,

I would like to salute you for your efforts on this and, at the same time, wish the upstream improvements in fsspec/filesystem_spec#895 much fortune!

With kind regards,
Andreas.
