Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What is the status / plan for vectorized APIs? #341

Open
wingkitlee0 opened this issue Jan 14, 2024 · 12 comments
Open

What is the status / plan for vectorized APIs? #341

wingkitlee0 opened this issue Jan 14, 2024 · 12 comments

Comments

@wingkitlee0
Copy link

Currently, what is the recommend way to do this (with this package)?
For example, I have two numpy arrays of lat/lon and want to convert them into an array of h3 hex

@wingkitlee0
Copy link
Author

For reference:
master...wingkitlee0:h3-py:basic-vectorized

@ajfriend
Copy link
Contributor

We tried previously having an "unstable" API to experiment with things like vector functions: #147

I'd be interested in bringing that back in the 4.0 branch, probably still as an unstable/experimental module to indicate the API may change in the future as we figure out what we want it to look like.

@wingkitlee0
Copy link
Author

@ajfriend Thanks for the info.

@wingkitlee0
Copy link
Author

wingkitlee0 commented May 19, 2024

@ajfriend I took a second look on this issue...

A few questions:

  1. It seems that h3-py was designed to be "self-contained" without numpy. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.
  2. There are a few backends: basic_str, basic_int, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g., api.experimental.vectorized) which does not use the symlink-ed __init__.py. Does it make sense?
  3. Do you have a sense of how much faster a vectorized version would be? For now, if I move the for-loop from python to C I get around 30% faster in integer representation.

Thanks

@ajfriend
Copy link
Contributor

ajfriend commented May 19, 2024

  1. It seems that h3-py was designed to be "self-contained" without numpy. But I was thinking to have a set of APIs that accept numpy arrays. This may require numpy in cython which becomes a dependency (not 100% if we need numpy in cython). I am curious about this restriction.

I haven't been deep in the Cython code in a while, but if my memory serves, numpy shouldn't be a hard dependency because it is compatible with the https://docs.python.org/dev/library/stdtypes.html#memoryview interface, which is how we interface with Cython. See some of the discussion around memoryview objects in https://uber.github.io/h3-py/api_comparison.html . To that point, h3.api.memview_int is actually the base API that the other three are built on top of.

A memoryview object is easily converted into a numpy array (and can be done without copying the data; see the docs link above). Any function expecting a memoryview object can accept a numpy array without having to "know about" (or import) numpy arrays.

Because numpy is compatible with the memoryview interface, numpy isn't a dependency in terms of the library being able to accept numpy arrays as inputs. The only reason we include numpy as an optional dependency is that you need it if you want to return numpy arrays as outputs (without expecting the user to convert a memoryview).

(I'll try to get to your other points soon.)

@ajfriend
Copy link
Contributor

  • There are a few backends: basic_str, basic_int, etc that shares the same API. Only a subset of them will be relevant for a "vectorized" version. I am thinking to put those vectorized functions in an experimental module (e.g., api.experimental.vectorized) which does not use the symlink-ed __init__.py. Does it make sense?

That makes sense. Although, recently, I've been thinking there might be another potential way to organize this rather than putting the code in api.experimental.vectorized. h3-py exposes its Cython interface (although I'm not sure we officially support it yet) and other Cython packages should be able to simply add it as a Cython dependency. We should probably come up with a short example repo showing people how to do this (#373).

This would let others write fast Cython code involving the h3-py C and Cython functions, without having to have those live inside the main repo. This vectorized API could be one such example. We don't necessarily have to do it this way, but I wanted to raise the idea and see what other folks thought.

@ajfriend
Copy link
Contributor

  1. Do you have a sense of how much faster a vectorized version would be? For now, if I move the for-loop from python to C I get around 30% faster in integer representation.

I don't off the top of my head. I think it would depend on the application. That being said, I think developing some benchmarks would be helpful. You might be able to test things fairly quickly with something like https://github.com/uber/h3-py/blob/master/tests/test_cython/cython_example.pyx

@wingkitlee0
Copy link
Author

another potential way to organize this
other Cython packages should be able to simply add it as a Cython dependency.

That's an interesting idea. I think exposing cython functions is good.

However, I think Cython is hard for most python users. Most people probably prefer pip install h3 and get all the good stuff. In other word, I would like to see more people to use H3 for their geospatial projects and it would be easier if we have only one h3 python library. Maybe a limited subset of vectorized functions is good enough to start. More advanced users can build their library with the Cython functions..

Should we continue the discussion back in Slack (for more visibility?)

BTW, thanks for answering my questions. After reading a little bit more memoryview, I agree that we do not need numpy in Cython for this.

@ajfriend
Copy link
Contributor

ajfriend commented May 20, 2024

For what I'm thinking, the average user would not need to know anything about Cython. If we produced a separate package (called whatever, maybe h3vectorized for example), users would be able to just pip install that. But development would be much easier in the h3vectorized repo, since all the hard CMake and scikit-build stuff is handled separately by the h3-py repo (in the same way that a package that cimports numpy doesn't need to build all of numpy from scratch each time).

Since we're not sure exactly what the vectorized API looks like just yet, I think that could provide a lot of benefits: You're less strongly coupled to the h3-py repo and have more freedom to experiment with API design. In a lot of ways, that's a more natural way to explore an API than having an experimental submodule. You would also not be subject to the fairly rigid versioning requirements we have for h3 bindings to match the core library version in major and minor semver numbers.

If we later finalize an API, we could migrate it into h3-py directly or keep it as a separate package.

Again, I'm not saying this is the only way to do it, but I do think it is an attractive option. But, all in all, once we get that first example package working, I'd expect this to be the much easier path.

@ajfriend
Copy link
Contributor

Also, feel free to flag the discussion in Slack, but I'd like to keep the main discussion on Github since Slack history gets lost relatively quickly.

@ajfriend
Copy link
Contributor

I threw together an example that still needs some work, but this would be the idea: https://github.com/ajfriend/h3_example_package

@wingkitlee0
Copy link
Author

Thanks for the clarifications! I think it makes all sense now. I didn't realize h3-py is more than a Cython wrapper but also build the H3 C library.

migrate it into h3-py directly or keep it as a separate package.

sounds good

Thanks also for the example.. I will try to build something on it..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants