Create abstraction for caching #458
Conversation
A high-level question: what is the use case for the in-memory cache if it doesn't use inter-process shared memory or, easier, a multiprocessing manager? It kicks in when submitting batches of samples to a worker, correct? But what about single samples?
@mdbenito Yes, it would be used in a single step of computation in a worker, e.g. a single permutation.
Sorry, I think I wasn't clear enough. A process will only benefit from the InMemoryCache (irrespective of the sampling method) if it computes more than one marginal utility and there is a hit. For permutation sampling this can be achieved by batching two or more computations. This is not the default behaviour (remember Markus implemented it, and we left it there as a temporary hack) and the user needs to explicitly batch the samples. Otherwise, different futures will be executed in different processes, and only the use of System V-style shared memory or a managed dict solves the issue. My question is: why not go for the shared mem?
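For illustration, here is a minimal sketch of the managed-dict approach mentioned above: a multiprocessing Manager proxy dict is shared by all worker processes, so a value computed in one process can be reused by the others. The function names and the toy computation are assumptions made up for this example, not part of the library.

```python
# Illustrative sketch only: a cross-process cache backed by a managed dict, as
# suggested above. The function names and the toy computation are made up for
# this example and are not part of the library.
from multiprocessing import Manager, Pool


def expensive_computation(key):
    return key * key  # stand-in for a real marginal-utility evaluation


def cached_call(args):
    key, shared_cache = args
    if key in shared_cache:            # hit: another process already stored it
        return shared_cache[key]
    value = expensive_computation(key)
    shared_cache[key] = value          # the proxy dict is visible to all workers
    return value


if __name__ == "__main__":
    with Manager() as manager:
        shared_cache = manager.dict()  # proxy object shared across processes
        keys = [1, 2, 2, 3, 1, 3]      # duplicates are served from the cache
        with Pool(processes=2) as pool:
            results = pool.map(cached_call, [(k, shared_cache) for k in keys])
        print(results)                 # [1, 4, 4, 9, 1, 9]
```

With this setup a hit is possible even when every process computes only a single marginal, which is the scenario discussed above; the trade-off is the round trip to the manager process on every lookup.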
Left a couple of comments. But my main concern is about an in-memory cache that won't be exploited at all with the default config, since worker processes only compute one marginal. Except for semivalues, which can batch, but this is not the default behaviour and it isn't documented that one should use it.
@mdbenito Thanks for the review! I addressed the code-related comments. As for the in-memory cache: I thought about using what you suggested above (inter-process shared memory or, easier, a multiprocessing manager), but I wasn't sure about investing more time in that before making sure that caching is actually useful (#459). If you still think that the current implementation of
As discussed in our meeting, this is currently not working:

u1 = Utility(model, data, scorer)
u2 = Utility(model2, data, scorer)
v1 = compute_values(u1)  # misses cache
v2 = compute_values(u2)  # misses cache
v3 = compute_values(u1)  # hits cache
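For illustration, a hedged sketch of the kind of key derivation the annotated behaviour implies; the helper name and the attribute access below are assumptions made for this example, not the PR's implementation. The key has to incorporate the model, data and scorer so that u1 and u2 never collide, while a second call with u1 maps to the same entries.

```python
import hashlib
import pickle


def utility_cache_key(utility, indices) -> str:
    """Hypothetical helper: derive a cache key from the utility's components and
    the evaluated sample indices, so that distinct models never share entries."""
    # The attributes below mirror the Utility(model, data, scorer) call above;
    # they are assumed for this sketch.
    payload = pickle.dumps(
        (repr(utility.model), repr(utility.data), repr(utility.scorer), tuple(indices))
    )
    return hashlib.blake2b(payload, digest_size=16).hexdigest()
```

With such a key, v1 and v2 above produce disjoint entries (different models), while v3 reuses the entries written for v1.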
Description
This PR closes #189 and closes #124.
I tried at first to rely on joblib's Memory class by just implementing a new backend for Memcached, but that proved too cumbersome because Memory relies on I/O operations (opening files, writing to files, etc.). Instead, I took inspiration from their implementation while keeping some details from our previous one (e.g. CacheStats, repeated evaluations).
I created a separate issue (#459) for a notebook showcasing the use of caching and its benefits.
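To make the shape of the abstraction concrete before the list of changes, here is a minimal sketch of a backend base class and a caching wrapper in the spirit of what is described above and below. The class names mirror the PR's terminology, but all signatures and details are assumptions made for this illustration, not the code in this PR.

```python
# Minimal sketch of the caching abstraction described in this PR. Class names
# mirror the PR's terminology; every signature and detail is an assumption made
# for illustration, not the actual implementation.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional
import hashlib
import pickle


@dataclass
class CacheStats:
    """Hit/miss counters, in the spirit of the CacheStats mentioned above."""
    hits: int = 0
    misses: int = 0


class CacheBackend(ABC):
    """Base class for caching backend implementations."""

    def __init__(self) -> None:
        self.stats = CacheStats()

    @abstractmethod
    def get(self, key: str) -> Optional[Any]:
        ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None:
        ...


class InMemoryCacheBackend(CacheBackend):
    """Stores entries in a plain dict, local to the current process."""

    def __init__(self) -> None:
        super().__init__()
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._data:
            self.stats.hits += 1
            return self._data[key]
        self.stats.misses += 1
        return None

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


class CachedFunc:
    """Wraps a function so that repeated calls with the same arguments are served
    from the backend instead of being recomputed."""

    def __init__(self, func: Callable[..., Any], backend: CacheBackend) -> None:
        self._func = func
        self._backend = backend

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        key = hashlib.blake2b(
            pickle.dumps((self._func.__qualname__, args, kwargs)), digest_size=16
        ).hexdigest()
        cached = self._backend.get(key)
        if cached is not None:
            return cached
        result = self._func(*args, **kwargs)
        self._backend.set(key, result)
        return result
```

Usage would then look roughly like wrapping an expensive function with CachedFunc and an InMemoryCacheBackend, with a DiskCacheBackend or MemcachedCacheBackend dropped in without changing the call sites; that interchangeability is the point of the abstraction.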
Changes
- `CacheBackend` base class for caching backend implementations.
- `InMemoryCacheBackend`, `DiskCacheBackend`, `MemcachedCacheBackend`.
- `CachedFunc` class to wrap cached functions and methods.
- Change the default `time_threshold` from 0.3 to 0.0.
- Add a `memcached` extra and thus make `pymemcache` an optional dependency.
- Rename `MemcachedConfig` to `CachedFuncConfig` and remove the memcached client config from it.

Checklist
- If notebooks were added/changed, added boilerplate cells are tagged with `"tags": ["hide"]` or `"tags": ["hide-input"]`