Get rid of dbmfile-based dogpile cache #243
At #254 (comment), we found that the dbmfile-based cache also has serious concurrency issues. Looking at future usage of the REST API, this is yet another reason to finally improve the situation here.
Dear @amotl, as you have already mentioned here [1], we should change to another backend: apart from the latest problems we faced, the dbm backend is not supported on Windows. The available backends of dogpile.cache are described at [2]. The dogpile.cache Redis backend offers options for handling concurrency and may be a suitable replacement.

For future developments, we may as well want to follow the project at [3], which has shown improvements in memory consumption with other serialization methods.

[1] wetterdienst/wetterdienst/util/cache.py, line 12 at e03a62e
[2] https://dogpilecache.sqlalchemy.org/en/latest/api.html#module-dogpile.cache.backends.memory
[3] https://github.com/jvanasco/dogpile_backend_redis_advanced
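For illustration, a minimal sketch of a dogpile.cache region configured with the Redis backend mentioned above; connection details and expiration time are assumptions for the example, not values from this project:

```python
from dogpile.cache import make_region

# Sketch only: host, port, db and expiration are illustrative.
# "distributed_lock" makes dogpile coordinate writers through a Redis-based
# lock, which addresses the concurrency problems seen with the dbm backend.
region = make_region().configure(
    "dogpile.cache.redis",
    arguments={
        "host": "localhost",
        "port": 6379,
        "db": 0,
        "redis_expiration_time": 60 * 60 * 2,
        "distributed_lock": True,
    },
)
```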
Hi @gutzbenj and @amotl, concerning retrieving/caching of remote assets, did you already consider using fsspec?
Dear Kai, thanks for bringing this to our attention. The option to cache files from a target filesystem locally [1] would be appropriate, right? In this regard, we would have to check whether the target filesystem can just be something accessed over HTTP through the regular Python HTTP machinery. The feature documentation [3] as well as [4] describe how this caching layer works.

So, we will have to check how to wrap the HTTP access accordingly. That would indeed be very elegant, in order to shift the caching away from the read-through caching currently employed through dogpile.cache. Please add further thoughts if you believe I am thinking in the wrong direction here.

With kind regards,

[1] https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
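As a rough sketch of how such local file caching might be wired up over HTTP; the URL and cache directory are illustrative assumptions, not from this thread:

```python
import fsspec

# Sketch only: URL and cache directory are illustrative.
# "filecache::" chains a local whole-file cache in front of the HTTP(S)
# filesystem, so repeated opens of the same remote resource are served
# from the local copy instead of hitting the network again.
with fsspec.open(
    "filecache::https://opendata.dwd.de/README.txt",
    filecache={"cache_storage": "/tmp/wetterdienst-http-cache"},
) as f:
    payload = f.read()
```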
Indeed, there is a generic HTTP filesystem implementation, `HTTPFileSystem`. Unfortunately, it is not listed at [5], so I missed it at first.

[5] https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations
I'm stumbling through the fsspec possibilities right now myself; a minimal example:

```python
import fsspec
import wradlib as wrl

# List the files behind an HTTP directory listing and open the latest one.
fname = "https://opendata.dwd.de/weather/radar/sites/sweep_vol_z/ess/hdf5/"
files = list(fsspec.get_mapper(fname))
with fsspec.open(fname + files[-1]) as f:
    ds = wrl.io.open_odim(f, loader="h5netcdf")
    display(ds[0].data)  # "display" is available in IPython/Jupyter sessions
```

This alone is really nice for just grabbing some data from remote HTTPS. I also tried to comprehend the full possibilities (caching, fusing into a local filesystem, etc.), but did not get as far as I wished. It seems to be widely adopted around the big data science packages (pandas, xarray, Pangeo).
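To sketch how caching could slot into this very example; the cache directory is an illustrative assumption:

```python
import fsspec

# Sketch: the same listing as above, but with "simplecache::" chained in
# front, so each remote file is downloaded once into a local cache directory.
fname = "https://opendata.dwd.de/weather/radar/sites/sweep_vol_z/ess/hdf5/"
files = list(fsspec.get_mapper(fname))
with fsspec.open(
    "simplecache::" + fname + files[-1],
    simplecache={"cache_storage": "/tmp/dwd-radar-cache"},
) as f:
    data = f.read()
```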
Dear Kai, I quickly put something up at [1]. This is excellent. Thanks again and with kind regards,

[1] https://gist.github.com/amotl/0a2eb63708b8a0cf4dc457b1e6a87455
In your code, you have added a todo mentioning Redis. Please keep in mind that not every environment can run an additional Redis service, so Redis should not be required by default.
Hi Daniel, thanks for your feedback; we appreciate learning about the environments in which people are running Wetterdienst.

Sure, Redis will only be optional, but it will save us from hassle when running Wetterdienst in a multithreaded environment, because accessing a filesystem-based cache on different platforms in a thread-safe manner is not always easy. Wetterdienst will always be able to run both in interactive or batch mode as well as in daemon mode, and we will try to keep the balance and optimize runtime behavior in all of those scenarios. Introducing a strict runtime dependency like having to run a Redis service in a scenario like the one you outlined would be a bad idea.

In addition, I want to elaborate a bit more on the direction we are heading with respect to getting rid of the dbmfile-based cache implemented through dogpile. As suggested by @kmuehlbauer, we started looking into fsspec.

With kind regards,
Hi again, with #431, we are replacing the dogpile.cache-based caching with fsspec. On a related note, there is also DiskCache [1]; @gutzbenj already drew my attention to it the other day. It might even be suitable to accompany the fsspec-based caching.

With kind regards,

[1] http://www.grantjenks.com/docs/diskcache/
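A minimal sketch of what using DiskCache could look like; key, payload, cache directory, and expiration are assumptions for the example:

```python
from diskcache import Cache

# Sketch only: key, value and cache directory are illustrative.
# DiskCache is an SQLite-backed, thread- and process-safe cache, which is
# what makes it interesting as a companion to the fsspec-based caching.
cache = Cache("/tmp/wetterdienst-diskcache")
cache.set("stations-csv", b"payload", expire=3600)
payload = cache.get("stations-csv")
```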
Certainly, fsspec would be happy to see alternative, more fully-featured caching schemes. Integration should generally be pretty simple.
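To illustrate the kind of integration meant here, a hypothetical custom caching filesystem could be registered under its own protocol name; the class and protocol name below are made up for the sketch:

```python
import fsspec
from fsspec.implementations.cached import WholeFileCacheFileSystem

# Hypothetical: a custom caching filesystem registered under its own
# protocol, so it can be used in chained URLs like "mycache::https://...".
class MyCacheFileSystem(WholeFileCacheFileSystem):
    protocol = "mycache"

fsspec.register_implementation("mycache", MyCacheFileSystem, clobber=True)
```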
Hi there, @gutzbenj bumped me to start working on #431 again. While doing so, I discovered an issue when upgrading to a more recent fsspec release.

**Problem**

When using the more recent release, we observed a regression.

**Reason**

The reason is apparently that the new release raises exceptions in situations where the previous one did not.

**Solution**

Downgrading to the previous fsspec release fixes it.

**Further investigation**

@martindurant: I will investigate this further and will let you know about the outcome. In the meanwhile, we fixed it by downgrading.

With kind regards,
At #431 (comment), @martindurant asked whether there is already a linked fsspec issue for the regression we observed. Thank you! I believe @gutzbenj spent some minutes trying to come up with a minimal reproducer, but I haven't heard back from him yet.

@martindurant: I didn't want to prematurely open an issue, because I wanted to first investigate whether the error is on us and fsspec just got improved by properly raising exceptions where it did not do so before. If that is the case, the mitigation would rather be to adjust Wetterdienst to the new behavior, so there would be no reason to bother you in any way on the fsspec issue tracker.
Dear @amotl, instead of the approach used so far, we may use `HTTPFileSystem` directly:

```python
from fsspec.implementations.http import HTTPFileSystem
import pandas as pd

URL = "https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl"
fs = HTTPFileSystem()

# Recursively list all files below the URL.
files = fs.find(URL)

# Apply filters via pandas.
df = pd.DataFrame({"file": files})
df = df[df.file.str.endswith(".zip")]
```

However, when evaluating this example again, I was able to observe a regression with the more recent fsspec release.

With kind regards,
Hi @martindurant,

regarding the issue we observed with the more recent fsspec release: that's it. It was just that I missed to appropriately handle the exceptions it now raises. #492 bundles all improvements to the baseline implementation of #431, thanks to @gutzbenj.

With kind regards,
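For illustration, the kind of defensive handling meant here might look like the following; the exact exception type and URL are assumptions, not confirmed by this thread:

```python
from fsspec.implementations.http import HTTPFileSystem

# Sketch only: newer fsspec releases raise exceptions (e.g. FileNotFoundError)
# in situations where older releases failed silently, so the caller now
# handles them explicitly instead of relying on an empty result.
fs = HTTPFileSystem()
try:
    files = fs.find("https://opendata.dwd.de/some/missing/path/")
except FileNotFoundError:
    files = []
```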
Hi again, in this context, I would also like to reference fsspec/filesystem_spec#639. With kind regards,
Dear Benjamin, I would like to salute you for your efforts on this, and, at the same time, I wish the upstream improvements on behalf of fsspec/filesystem_spec#895 much fortune! With kind regards,
Describe the bug
The dogpile cache based on dbmfile is brittle.
To reproduce
Run Wetterdienst from different Python environments and see accessing the shared cache break more often than not; see #217, #232, #233, #242, and #244. #236 seems to be related as well.
Expected behavior
Wetterdienst should work in all circumstances, even when switching between different Python environments.
Additional context
While the documentation about `pickle` [1] promises that its serialization format is backwards compatible across Python releases, it apparently still has problems in our context. While I currently don't have a clue why, I figure it might come from marshalling/unmarshalling data frames across different versions of Pandas. We can either investigate this further or use a different means of data storage and/or serialization protocol for the dogpile cache.
[1] https://docs.python.org/3/library/pickle.html
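To illustrate the suspected failure mode, a sketch of the round trip in question; the data frame content is made up:

```python
import pickle
import pandas as pd

# Sketch only: a DataFrame pickled in one environment may fail to unpickle
# (or behave oddly) in another running a different pandas version, because
# pickle serializes pandas-internal objects rather than a stable schema.
df = pd.DataFrame({"station_id": ["00001"], "value": [1.5]})
blob = pickle.dumps(df)        # written by environment A
restored = pickle.loads(blob)  # read back in environment B
```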