Adaptation to sklearn #143
Comments
Yes, I think that would be a good idea. I'm all for better integration with scikit-learn in general (#31). When improving the API for 2.0, it would be useful to make it compatible with scikit-learn if possible: https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator |
@rth Could you draft a general workflow for the integration with sklearn? I have to learn a bit more about sklearn to get to know the dos and don'ts. |
Essentially one could start with minimal functionality along the lines of,

    from sklearn.base import BaseEstimator

    class OrdinaryKriging(BaseEstimator):
        def __init__(self, ...):
            ...
        def fit(self, X, y):
            ...
        def predict(self, X, return_std=False):
            ...

where [...]. Looking at the usage of GaussianProcessRegressor in scikit-learn could be a start. Then one would progressively add features, and from time to time run [...].

We already have a scikit-learn wrapper class BTW: Line 27 in 40a8140
|
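To make the skeleton above concrete, here is a minimal runnable sketch of an estimator with that `fit` / `predict(X, return_std=False)` shape. It is not PyKrige's implementation: the class name `SimpleOrdinaryKriging`, the `slope` parameter, and the choice of a linear variogram without nugget are all assumptions made to keep the example short. The number of dimensions is simply read from `X.shape[1]`.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class SimpleOrdinaryKriging(RegressorMixin, BaseEstimator):
    """Hypothetical minimal estimator, for illustration only."""

    def __init__(self, slope=1.0):
        # scikit-learn convention: __init__ only stores parameters.
        self.slope = slope

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        n = X.shape[0]
        # Pairwise distances between observations, any number of dimensions.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # Ordinary kriging system: variogram block plus unbiasedness constraint.
        A = np.ones((n + 1, n + 1))
        A[:n, :n] = self.slope * d  # linear variogram (assumed), gamma(0) = 0
        A[n, n] = 0.0
        self.X_fit_, self.y_fit_, self.A_ = X, y, A
        self.n_features_in_ = X.shape[1]
        return self

    def predict(self, X, return_std=False):
        X = np.asarray(X, dtype=float)
        n = self.X_fit_.shape[0]
        d = np.linalg.norm(X[:, None, :] - self.X_fit_[None, :, :], axis=-1)
        b = np.ones((n + 1, X.shape[0]))
        b[:n] = self.slope * d.T
        w = np.linalg.solve(self.A_, b)  # weights and Lagrange multiplier
        mean = w[:n].T @ self.y_fit_
        if return_std:
            var = np.sum(w * b, axis=0)  # kriging variance
            return mean, np.sqrt(np.clip(var, 0.0, None))
        return mean


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]])
y = np.array([1.0, 2.0, 3.0, 4.0, 2.5])
model = SimpleOrdinaryKriging(slope=2.0).fit(X, y)
mean, std = model.predict(X, return_std=True)
```

The sketch skips input validation, so it would probably not pass scikit-learn's `check_estimator` as written; that utility (in `sklearn.utils.estimator_checks`) is the usual way to test API conformance as features are added.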
So we should have a look at: |
Worth looking into, but some things there are also outdated. |
OK. In the end I am thinking of a single class for kriging where the dimension is just a parameter. And since Ordinary kriging is just universal kriging with a constant drift (or simply an unbiased estimator), we could combine all 4 kriging classes that are present at the moment. You once mentioned that anisotropy and rotation could be provided by pipelines (#138). |
Yeah, the implementation could be a single class for kriging. The dimension could just be determined from the
The idea is that if some of those calculations are stand-alone transformations of coordinates (i.e. X of shape (n_samples, n_features), where features = ['x', 'y'], converted to X_transformed with the same shape), then each could be a stand-alone pre-processing class. A pipeline could then be built using sklearn.pipeline.make_pipeline:

    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(Preprocessor(), OrdinaryKriging())
    pipe.fit(coords, target)
    z = pipe.predict(coords_grid)

where internally for [...]. This logic is built into scikit-learn pipelines, and it's possible to chain multiple processing steps like this, and also to combine estimators from different packages, assuming they follow the scikit-learn API. I haven't done a detailed analysis to see to what extent it's possible in pykrige. One limitation of scikit-learn pipelines is that the target variable (previously known as z) cannot be transformed in pipeline steps. But there are workarounds if it's really a blocker. |
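A runnable sketch of such a pipeline follows. Since no scikit-learn-compatible kriging class exists in this thread yet, `GaussianProcessRegressor` is used as a stand-in estimator, and `stretch_y` is a hypothetical anisotropy correction wrapped in `FunctionTransformer`; both choices are assumptions for illustration. On `fit`, the pipeline transforms the coordinates and then fits the final estimator; on `predict`, it applies the same transformation before calling the estimator's `predict`.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def stretch_y(X, factor=2.0):
    """Hypothetical anisotropy correction: rescale the second coordinate."""
    X = np.asarray(X, dtype=float).copy()
    X[:, 1] *= factor
    return X


rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(30, 2))           # (n_samples, n_features)
target = np.sin(coords[:, 0]) + 0.1 * coords[:, 1]  # synthetic values

pipe = make_pipeline(
    FunctionTransformer(stretch_y, kw_args={"factor": 2.0}),
    GaussianProcessRegressor(alpha=1e-6),  # stand-in for a kriging estimator
)
pipe.fit(coords, target)

# Regular 5 x 5 prediction grid, flattened to shape (25, 2).
gx, gy = np.meshgrid(np.linspace(0, 10, 5), np.linspace(0, 10, 5))
coords_grid = np.column_stack([gx.ravel(), gy.ravel()])
z = pipe.predict(coords_grid)

# The same result, spelled out by hand: transform first, then predict.
z_manual = pipe.steps[-1][1].predict(stretch_y(coords_grid, factor=2.0))
```

For the target-variable limitation, one possible workaround is `sklearn.compose.TransformedTargetRegressor`, which wraps a regressor and applies a transformation to the target and its inverse to the predictions.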
One question I have about the pre-processor is: What should be fitted? So we would only use |
If fit does nothing, it's fine. But it still needs to be there for API compatibility,

    from sklearn.base import BaseEstimator, TransformerMixin

    class Preprocessor(BaseEstimator, TransformerMixin):  # or give it a better name
        def __init__(self, parameter="some_default"):
            self.parameter = parameter

        def fit(self, X, y=None):
            """This estimator is stateless, fit does nothing."""
            return self

        def transform(self, X):
            # actual transformation here
            return X_transformed

The base classes are necessary to, [...]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

which in this case is equivalent to just a transform. |
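As a concrete instance of the stateless pattern above, here is a small sketch of a transformer that rotates 2D coordinates by a fixed angle. The class name `CoordinateRotation` and its `angle_deg` parameter are hypothetical; the point is only that `fit` learns nothing and that the `fit_transform` inherited from `TransformerMixin` gives the same result as a plain `transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class CoordinateRotation(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer: rotate 2D coordinates."""

    def __init__(self, angle_deg=0.0):
        self.angle_deg = angle_deg

    def fit(self, X, y=None):
        # Nothing is learned from the data; present for API compatibility.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        theta = np.deg2rad(self.angle_deg)
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta), np.cos(theta)]])
        # Each row of X is a point; apply the counterclockwise rotation.
        return X @ rotation.T


coords = np.array([[1.0, 0.0], [0.0, 2.0]])
rot = CoordinateRotation(angle_deg=90.0)
via_fit_transform = rot.fit_transform(coords)  # provided by TransformerMixin
via_transform = rot.transform(coords)          # identical for a stateless step
rot_copy = clone(rot)                          # works through BaseEstimator.get_params
```

Because the parameters are stored unchanged in `__init__`, `get_params`, `set_params`, and `clone` work without extra code, which is what lets the transformer sit inside a pipeline or a grid search.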
I was digging through the sklearn code and it seems that you can't use the KDTree directly in Cython (link). Since it would be good to avoid the Python overhead in creating the distance matrix, I would suggest forking the critical parts of sklearn.neighbors (link):
The only thing that needs to be modified is commenting out [...]
Advantages:
ATM the kriging matrix is created in Python and passed to Cython (when doing the full kriging). With n nearest neighbors, a list of points is passed to Cython to cut out the reduced kriging matrix (again, the full matrix is present). The idea would be to create a companion class to the future kriging class in Cython that holds all necessary information and does the number crunching. All we need is the variogram in Cython, which is already present. We could also cut down the functionality of the KDTree to only provide:
So we can keep the external code base at a minimum. |
Do you think there is no way of doing this without using KDTree/BallTree from Cython? E.g. doing the computation in small batches or using sparse distance matrices. See e.g.
That code is legacy and quite convoluted (cf. e.g. scikit-learn/scikit-learn#4217). I would not recommend forking (and maintaining) it unless there is no way around it.
There is a SLEP 10 to replace |
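One way the "small batches" idea might look in pure Python is sketched below, using `sklearn.metrics.pairwise_distances_chunked`. The full (n_grid, n_data) distance matrix is never held in memory; each chunk is reduced to its k nearest neighbours as soon as it is computed. The array sizes, `k = 8`, and the helper name `k_nearest` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=(200, 2))   # observation coordinates
grid = rng.uniform(0, 100, size=(1000, 2))  # prediction coordinates
k = 8


def k_nearest(dist_chunk, start):
    """Reduce one (n_chunk_rows, n_data) block to its k nearest neighbours."""
    idx = np.argpartition(dist_chunk, k - 1, axis=1)[:, :k]
    dist = np.take_along_axis(dist_chunk, idx, axis=1)
    return idx, dist


# working_memory is given in MiB; 1 MiB forces several chunks here.
chunks = pairwise_distances_chunked(
    grid, data, reduce_func=k_nearest, working_memory=1
)
idx_parts, dist_parts = zip(*chunks)
neighbor_idx = np.vstack(idx_parts)    # shape (1000, 8)
neighbor_dist = np.vstack(dist_parts)  # shape (1000, 8), unsorted within a row
```

The loop over chunks still runs in Python, but each iteration handles many grid points at once, so the per-point overhead is much smaller than looping over grid points one by one.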
If we can't use the KDTree in Cython, the loop over the grid points needs to be done in Python, or the full distance matrix needs to be constructed and passed to Cython to do the loop there. In the first case, we get a massive overhead in time; in the second case, we get a massive overhead in memory consumption. It seems that the sklearn team doesn't want to provide the KDTree for Cython, since they think it is too much of a maintenance overhead. I'll think about it. |
Not if we use a sparse distance matrix (that only stores the K nearest neighbours for each point).
You could try opening an issue about it there, but I do anticipate mostly negative feedback indeed due to maintenance costs. |
If we use only K nearest neighbors, we still can't provide a search radius for the moving window. K nearest neighbors always depend on the density of the input data. |
I think it should also be possible to create a sparse distance matrix for neighbors within a given radius: i.e. a |
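Both variants of the sparse distance matrix discussed here can be sketched with `sklearn.neighbors.NearestNeighbors`, which offers `kneighbors_graph` and `radius_neighbors_graph`. Whether these are the functions the comment above had in mind is an assumption; the array sizes, `n_neighbors=8`, and `radius=15.0` are illustrative values.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=(200, 2))  # observation coordinates
grid = rng.uniform(0, 100, size=(50, 2))   # prediction coordinates

nn = NearestNeighbors().fit(data)

# Sparse CSR matrices: rows are grid points, columns are observations,
# and the stored values are distances.
knn_graph = nn.kneighbors_graph(grid, n_neighbors=8, mode="distance")
radius_graph = nn.radius_neighbors_graph(grid, radius=15.0, mode="distance")

# Number of neighbours per grid point: fixed for k-NN, data dependent
# for the radius search.
knn_counts = np.diff(knn_graph.indptr)
radius_counts = np.diff(radius_graph.indptr)
```

With the radius search, the number of stored neighbours per row follows the local data density, which matches the moving-window behaviour asked for above; with k-NN, the effective search distance varies instead.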
sklearn provides a lot of functionality that we could use to simplify our code.
Distance matrix calculation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
KDtree:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
Search radius to optimize moving window kriging (See: #57):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
Nearest neighbors:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query
Besides that, one can specify the metric in use (see: #120):
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
Or for geo-coordinates (see: #121):
(np.deg2rad needed here)
So this could solve a lot of issues.
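A short sketch of the calls listed above, on synthetic coordinates. The array sizes, `k=8`, `r=15.0`, and the three city coordinates are illustrative assumptions. For geo-coordinates the sketch assumes the haversine metric is what is meant; in scikit-learn that metric is available in `BallTree` rather than `KDTree`, and it expects (latitude, longitude) in radians, hence the `np.deg2rad`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=(200, 2))  # observation coordinates
grid = rng.uniform(0, 100, size=(50, 2))   # prediction coordinates

# Full distance matrix, shape (n_grid, n_data).
full = pairwise_distances(grid, data)

tree = KDTree(data)

# k nearest neighbours: distances come back sorted per row.
dist, idx = tree.query(grid, k=8)

# Search radius (moving window): note the (indices, distances) order,
# and that each row is an array of different length.
idx_r, dist_r = tree.query_radius(grid, r=15.0, return_distance=True)

# Geographic coordinates: (lon, lat) in degrees for Berlin, Paris, London.
lonlat = np.array([[13.40, 52.52], [2.35, 48.86], [-0.13, 51.51]])
latlon_rad = np.deg2rad(lonlat[:, ::-1])  # haversine expects (lat, lon) in radians
geo_tree = BallTree(latlon_rad, metric="haversine")
geo_dist, geo_idx = geo_tree.query(latlon_rad[:1], k=2)
geo_dist_km = geo_dist * 6371.0  # unit sphere to kilometres, mean Earth radius
```

The haversine distances are returned on the unit sphere, so they need to be multiplied by the sphere's radius to obtain a length.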