-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a generic data type converter to the Cursor
object
#442
Conversation
👍 I really like this approach and would definitely prefer it over #437. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments regarding some of the implementation details.
I'm also not sure about the level of separation. E.g. the converter application could probably be separated from the cursor without much loss of API/UI convenience. But no strong opinion here.
On the other hand, I think something common like setting the timezone for datetime objects could be made even easier for users.
src/crate/client/converter.py
Outdated
def get(self, type_: int) -> Callable[[Optional[int]], Optional[Any]]: | ||
return self.mappings.get(type_, self._default) | ||
|
||
def set(self, type_: int, converter: Callable[[Optional[int]], Optional[Any]]) -> None: | ||
self.mappings[type_] = converter | ||
|
||
def remove(self, type_: int) -> None: | ||
self.mappings.pop(type_, None) | ||
|
||
def update(self, mappings: Dict[int, Callable[[Optional[int]], Optional[Any]]]) -> None: | ||
self.mappings.update(mappings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we want to make this mutable?
I also wonder if we should introduce an Enum
or similar to give names to the type ids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we want to make this mutable?
I also wasn't sure, but liked the flexibility in the end to make the user able to add and remove converters as they like. See also my other comment below about the blueprint I took from PyAthena.
I also wonder if we should introduce an
Enum
or similar to give names to the type ids?
Good idea. I've introduced an Enum
with 00dbb42.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wasn't sure, but liked the flexibility in the end to make the user able to add and remove converters as they like.
I'd keep it immutable for now. It reduces the API surface and makes it easier to reason about the mappings.
And if a legitimate use-case pops up, it's easy to change later. On the other hand, changing a mutable mapping to immutable later on would be a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you fine with removing mutability by 299b771?
>>> connection.client.set_next_response({ | ||
... "col_types": [4, 5, 11], | ||
... "rows":[ [ "foo", "10.10.10.1", 1658167836758 ] ], | ||
... "cols":[ "name", "address", "timestamp" ], | ||
... "rowcount":1, | ||
... "duration":123 | ||
... }) | ||
|
||
>>> cursor.execute('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would confuse users because this is a mocked cursor and client.set_next_response
isn't something that they can use.
Either we need to point this out explicitly, hide it or use real data and don't mock it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the beginning of the file cursor.txt
, there is a corresponding note about this detail.
Hardcode the next response of the mocked connection client, so we won't need a sql statement
to execute.
After that, the set_next_response()
method is already used five times within cursor.txt
, so I didn't hesitate to continue implementing the test cases this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't/shouldn't hide it for the rendered docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, cursor.txt
is not part of the rendered docs at https://crate.io/docs/python/, it is only used as a doctest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've created #448 to have a separate discussion about the topic how to improve/adjust doctests vs. rendered documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't write doctests to ensure the client functions. Doctests are to ensure examples in the documention work. The primary concern is the documentation, not the test coverage it gives. If we use this for testing, we should convert it to regular python tests.
It's much easier to run isolated cases for debugging etc. when using the regular test framework instead of having one large doctest where you can only run the full thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree with you. I think your guidance should be implemented seperately, being tracked by #448. If you want to see it sooner than later, than I will be happy to tackle it before coming back to this patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to handle this separately in a follow up
There is btw. also https://peps.python.org/pep-0249/#type-objects in the DB API definition - I didn't have a closer look at it, but we should double check to see if it affects what the PR would be doing. |
Thank you for endorsing this patch and for your valuable suggestions. |
1ea7037
to
1410547
Compare
Hi again, other than the inline comments, let me address further questions here.
Done, mostly with 1410547.
Done with b21c38e.
I also thought back and forth about this detail. In the end, I followed the path outlined by https://github.com/laughingman7743/PyAthena very closely, where I originally discovered this way of doing it. See PyAthena/converter.py.
Indeed, that would be sweet. I will think about how we could add this feature.
I also found that section in the DBAPI spec, but I think it is exclusively about input parameter conversion:
On the other hand, the current implementation is exclusively about output data conversion. That being said, the converter functions itself can well be reused for that use case, but otherwise I think it is a different area, and to be addressed within a different patch. With kind regards, |
def _to_datetime(value: Optional[float]) -> Optional[datetime]: | ||
""" | ||
https://docs.python.org/3/library/datetime.html | ||
""" | ||
if value is None: | ||
return None | ||
return datetime.utcfromtimestamp(value / 1e3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @mfussenegger, @matriv, and @seut,
regarding @mfussenegger's proposal to let the user optionally specify a time zone, and make the machinery return a timezone-aware datetime object, I see two options here.
Let's assume the user defines a custom timezone object like
>>> import datetime
>>> tz_mst = datetime.timezone(datetime.timedelta(hours=7), name="MST")
and a timestamp in Epoch like
# Fri, 16 Sep 2022 11:09:16 GMT
timestamp = 1663326556
On this spot, which variant would be semantically the right one? Note that datetime.utcfromtimestamp
does not accept the tz
keyword argument, it will always return a naive datetime object with tzinfo=None
.
-
We can either use
datetime.fromtimestamp
, which does accept thetz
keyword argument.>>> datetime.datetime.fromtimestamp(timestamp, tz=tz_mst) datetime.datetime(2022, 9, 16, 18, 9, 16, tzinfo=datetime.timezone(datetime.timedelta(seconds=25200), 'MST')) >>> datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc) datetime.datetime(2022, 9, 16, 11, 9, 16, tzinfo=datetime.timezone.utc)
-
Or we can replace the
tzinfo
on the naive datetime object returned byutcfromtimestamp
, making it timezone-aware.>>> datetime.datetime.utcfromtimestamp(timestamp).replace(tzinfo=tz_mst) datetime.datetime(2022, 9, 16, 11, 9, 16, tzinfo=datetime.timezone(datetime.timedelta(seconds=25200), 'MST'))
I think option 1. is the right choice. So the implementation would look like something along the lines of:
def set_timezone(self, tz):
"""
When the user specifies a time zone upfront, all returned datetime objects
will be timezone-aware.
"""
if tz is not None and not isinstance(tz, datetime.timezone):
raise TypeError(f"Timezone object has wrong type: {tz}")
self.timezone = tz
def _to_datetime(self, value):
"""
Convert CrateDB's `TIMESTAMP` column to the native Python `datetime` object.
When a timezone is given, return a timezone-aware object.
Otherwise, return a naive object which is meant to be the time in the UTC time zone.
"""
if self.timezone is not None:
return datetime.fromtimestamp(value / 1e3, tz=self.timezone)
else:
return datetime.utcfromtimestamp(value / 1e3)
What do you think?
With kind regards,
Andreas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another implementation could look like this, it is always using datetime.fromtimestamp(value / 1e3, tz=self.timezone)
, where self.timezone
would be datetime.timezone.utc
by default, i.e. it will always return timezone-aware datetime objects.
While definitively easier and more straight-forward, it could be less performant when not aiming to use timezone-aware datetime objects and stay in "naive" land on purpose.
def __init__(self):
# Return datetype objects in UTC by default.
self.timezone = datetime.timezone.utc
def set_timezone(self, tz):
"""
Set the time zone you want your `datetime` objects to be returned as.
"""
if not isinstance(tz, datetime.timezone):
raise TypeError(f"Timezone object has wrong type: {tz}")
self.timezone = tz
def _to_datetime(self, value):
"""
Convert CrateDB's `TIMESTAMP` column to a timezone-aware native Python `datetime` object.
"""
return datetime.fromtimestamp(value / 1e3, tz=self.timezone)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitively easier and more straight-forward
When thinking about it once more, I think I would prefer the implementation variant outlined in my latest post. In this way, it is always consistent that timezone-aware datetime objects are returned from the data converter machinery.
it could be less performant when not aiming to use timezone-aware datetime objects and stay in "naive" land on purpose.
I think when users are aiming for raw speed, they would completely opt out of the data type converter machinery at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi again. Talk is cheap. I've converged a few more thoughts about this into a proposed patch at #445, completely separating the topic of timezone-aware datetime
handling from this one.
src/crate/client/converter.py
Outdated
def get(self, type_: int) -> Callable[[Optional[int]], Optional[Any]]: | ||
return self.mappings.get(type_, self._default) | ||
|
||
def set(self, type_: int, converter: Callable[[Optional[int]], Optional[Any]]) -> None: | ||
self.mappings[type_] = converter | ||
|
||
def remove(self, type_: int) -> None: | ||
self.mappings.pop(type_, None) | ||
|
||
def update(self, mappings: Dict[int, Callable[[Optional[int]], Optional[Any]]]) -> None: | ||
self.mappings.update(mappings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wasn't sure, but liked the flexibility in the end to make the user able to add and remove converters as they like.
I'd keep it immutable for now. It reduces the API surface and makes it easier to reason about the mappings.
And if a legitimate use-case pops up, it's easy to change later. On the other hand, changing a mutable mapping to immutable later on would be a breaking change.
>>> connection.client.set_next_response({ | ||
... "col_types": [4, 5, 11], | ||
... "rows":[ [ "foo", "10.10.10.1", 1658167836758 ] ], | ||
... "cols":[ "name", "address", "timestamp" ], | ||
... "rowcount":1, | ||
... "duration":123 | ||
... }) | ||
|
||
>>> cursor.execute('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't/shouldn't hide it for the rendered docs?
0c8a1ce
to
3049ae6
Compare
# FIXME: Why does this croak on statements like ``DROP TABLE cities``? | ||
# Note: When needing to debug the test environment, you may want to | ||
# enable this logger statement. | ||
# log.exception("Executing SQL statement failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About
Tests: Add inline comment about how to debug the test environment.
Background story
It was needed because the tests suddenly started failing mysteriously on my machine.
Out of the blue, the test suite started croaking like TABLE 'locations' already exists
, only to find that CrateDB would not accept any write operations after finding that my disk usage was 96%, while I had still 60G of free disk space.
Unfortunately, I did not discover this easily, because the test suite swallowed the exception about FORBIDDEN/12/index read-only
on this spot. When enabling the log.exception
statement, the error response from CrateDB became immediately visible within the full stacktrace.
Thoughts
I like having such inline comments in place. Please let me know if you have any better advice.
src/crate/client/converter.py
Outdated
# Map data type identifier to converter function. | ||
_DEFAULT_CONVERTERS: Dict[DataType, ConverterFunction] = { | ||
DataType.IP: _to_ipaddress, | ||
DataType.TIMESTAMP_WITH_TZ: _to_datetime, | ||
DataType.TIMESTAMP_WITHOUT_TZ: _to_datetime, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the DefaultTypeConverter
should apply the _to_datetime
function to both kinds of TIMESTAMP
types?
src/crate/client/doctests/cursor.txt
Outdated
>>> from crate.client.converter import Converter, DataType | ||
|
||
>>> converter = Converter() | ||
>>> converter.set(DataType.BIT, lambda value: int(value[2:-1], 2)) | ||
>>> cursor = connection.cursor(converter=converter) | ||
|
||
Proof that the converter works correctly, ``B\'0110\'`` should be converted to | ||
``6``. CrateDB's ``BIT`` data type has the numeric identifier ``25``:: | ||
|
||
>>> connection.client.set_next_response({ | ||
... "col_types": [25], | ||
... "rows":[ [ "B'0110'" ] ], | ||
... "cols":[ "value" ], | ||
... "rowcount":1, | ||
... "duration":123 | ||
... }) | ||
|
||
>>> cursor.execute('') | ||
|
||
>>> cursor.fetchone() | ||
[6] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the type converter for the BIT
data type should be included into the DefaultTypeConverter
, like the converters for IP
and TIMESTAMP
? I am talking about this one:
>>> converter.set(DataType.BIT, lambda value: int(value[2:-1], 2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd leave it out for now. Easy enough to add later if requested by users
>>> connection.client.set_next_response({ | ||
... "col_types": [4, 5, 11], | ||
... "rows":[ [ "foo", "10.10.10.1", 1658167836758 ] ], | ||
... "cols":[ "name", "address", "timestamp" ], | ||
... "rowcount":1, | ||
... "duration":123 | ||
... }) | ||
|
||
>>> cursor.execute('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't write doctests to ensure the client functions. Doctests are to ensure examples in the documention work. The primary concern is the documentation, not the test coverage it gives. If we use this for testing, we should convert it to regular python tests.
It's much easier to run isolated cases for debugging etc. when using the regular test framework instead of having one large doctest where you can only run the full thing.
src/crate/client/converter.py
Outdated
@property | ||
def mappings(self) -> Dict[DataType, ConverterFunction]: | ||
return self._mappings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@property | |
def mappings(self) -> Dict[DataType, ConverterFunction]: | |
return self._mappings |
src/crate/client/converter.py
Outdated
def set(self, type_: DataType, converter: ConverterFunction) -> None: | ||
self.mappings[type_] = converter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def set(self, type_: DataType, converter: ConverterFunction) -> None: | |
self.mappings[type_] = converter | |
def set(self, type_: DataType, converter: ConverterFunction) -> None: | |
self.mappings[type_] = converter |
As already mentiond, I don't think we should make this mutable if it isn't required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just added 14a355d, to follow your suggestion.
src/crate/client/converter.py
Outdated
InputVal = Any | ||
ConverterFunction = Callable[[Optional[InputVal]], Optional[Any]] | ||
ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]] | ||
ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InputVal = Any | |
ConverterFunction = Callable[[Optional[InputVal]], Optional[Any]] | |
ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]] | |
ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]] | |
ConverterFunction = Callable[[Optional[Any]], Optional[Any]] | |
ConverterDefinition = Union[ConverterFunction, Tuple[ConverterFunction, "ConverterDefinition"]] | |
ColTypesDefinition = Union[int, List[Union[int, "ColTypesDefinition"]]] |
No need for custom vocabulary for Any
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 4e88a10.
src/crate/client/converter.py
Outdated
def col_type_to_converter(self, type_: ColTypesDefinition) -> ConverterDefinition: | ||
""" | ||
Resolve integer data type identifier to its corresponding converter function. | ||
Also handles nested definitions with a *list* of data type identifiers on the | ||
right hand side, describing the inner type of `ARRAY` values. | ||
|
||
It is important to resolve the converter functions first, in order not to | ||
hog the row loop with redundant lookups to the `mappings` dictionary. | ||
""" | ||
result: ConverterDefinition | ||
if isinstance(type_, list): | ||
type_, inner_type = type_ | ||
if DataType(type_) is not DataType.ARRAY: | ||
raise ValueError(f"Data type {type_} is not implemented as collection type") | ||
result = (self.get(DataType(type_)), self.col_type_to_converter(inner_type)) | ||
else: | ||
result = self.get(DataType(type_)) | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this bake the array handling also into a function? Then this could return a function that the cursor can directly call to convert the result - instead of having to go through convert
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d7fff77
to
6bf53a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left two more comments, otherwise looks good to me
src/crate/client/connection.py
Outdated
@@ -123,12 +130,16 @@ def __init__(self, | |||
self.lowest_server_version = self._lowest_server_version() | |||
self._closed = False | |||
|
|||
def cursor(self): | |||
def cursor(self, cursor=None, **kwargs) -> Cursor: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the cursor=None
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably slipped in from the role model implementation of PyAthena. I think it can be removed. Thanks for spotting it.
def cursor(self, cursor: Optional[Type[BaseCursor]] = None, **kwargs) -> BaseCursor:
src/crate/client/cursor.py
Outdated
@staticmethod | ||
def get_default_converter(mappings: Optional[ConverterMapping] = None) -> Converter: | ||
""" | ||
Return the standard converter instance. | ||
""" | ||
return DefaultTypeConverter(more_mappings=mappings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? It's just wrapping a constructor which could be called directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like get_default_converter()
is currently never called with any parameters, so I guess it can also be removed. Thanks!
This comment was marked as resolved.
This comment was marked as resolved.
1dc87ee
to
f635078
Compare
This will allow converting fetched data from CrateDB data types to Python data types in different ways. Its usage is completely optional. When not used, the feature will not incur any overhead. There is also a `DefaultTypeConverter`, which aims to become a sane default choice when looking at this kind of convenience. It will enable the user to work with native Python data types from the start, for all database types where this makes sense, without needing to establish the required set of default value converters on their own. If the `DefaultTypeConverter` does not fit the user's aims, it is easy to define custom type converters, possibly reusing specific ones from the library.
f635078
to
3f79183
Compare
Hi there,
Introduction
After converging #395 to #437, it became clear / emerged that we wanted to have the type conversion
This patch aims to resolve both aspects.
Details
This will allow converting fetched data from CrateDB data types to Python data types in different ways. Its usage is completely optional. When not used, the feature will not incur any overhead.
There is also a
DefaultTypeConverter
, which aims to become a sane default choice when looking at this kind of convenience. It will enable the user to work with native Python data types from the start, for all database types where this makes sense, without needing to establish the required set of default value converters on their own.If the
DefaultTypeConverter
does not fit the user's aims, it is easy to define custom type converters, possibly reusing specific ones from the library.Acknowledgements
The patch follows the implementation suggested by the PyAthena driver very closely, see PyAthena/converter.py. Thank you!
Synopsis
Thoughts
In the form of an early review, please let me know if you endorse this approach and which adjustments you would like to see.
Personally, I would add more test cases to
test_cursor.py
, in order to exercise the machinery in more detail and to increase code coverage, and probably make the functionality more present in the documentation.Also, as @mfussenegger mentioned at https://github.com/crate/crate-python/pull/437/files#r930822457, it might make sense to also add the important aspect of converting elements within
ARRAY
types to this patch already.With kind regards,
Andreas.
Backlog
TIMESTAMP
types as timezone-aware native Pythondatetime
objects #445. /cc @mfusseneggerBIT
data type should be included into theDefaultTypeConverter
.crate-python/src/crate/client/doctests/cursor.txt
Line 347 in 33ecbf3