Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modernize CrateDB data type implementations #65

Open
amotl opened this issue Jun 9, 2024 · 0 comments
Open

Modernize CrateDB data type implementations #65

amotl opened this issue Jun 9, 2024 · 0 comments

Comments

@amotl
Copy link
Member

amotl commented Jun 9, 2024

About

On behalf of GH-9, we added an SQLAlchemy data type implementation for CrateDB's FLOAT_VECTOR data type. On this task, we discovered a few details that made us think we should also modernize the other data type implementations, eventually.

The reason is because they have been conceived a while ago already, around SQLAlchemy 1.x, and additional mechanisms have been introduced to SQLAlchemy since then.

Implementation Notes

An improved implementation of the FloatVector data type for CrateDB,
compared to the previous implementation on behalf of the LangChain adapter.

The previous implementation, based on SQLAlchemy's UserDefinedType, didn't
respect the python_type property on backward/reverse resolution of types.
This was observed on Meltano's database connector machinery doing a
type cast, which led to a NotImplementedError.

typing.cast(type, sql_type.python_type) => NotImplementedError

The UserDefinedType approach is easier to implement, because it doesn't
need compiler support.

To get full SQLAlchemy type support, including support for forward- and
backward resolution / type casting, the custom data type should derive
from SQLAlchemy's TypeEngine base class instead.

When deriving from TypeEngine, you will need to set the __visit_name__
attribute, and add a corresponding visitor method to the CrateTypeCompiler,
in this case, visit_FLOAT_VECTOR.

Now, rendering a DDL succeeds. However, when reflecting the DDL schema back,
it doesn't work until you will establish a corresponding reverse type mapping.

By invoking SELECT DISTINCT(data_type) FROM information_schema.columns;,
you will find out that the internal type name is float_vector, so you
announce it to the dialect using TYPES_MAP["float_vector"] = FloatVector.

Still not there: NotImplementedError: Default TypeEngine.as_generic() heuristic method was unsuccessful for target_cratedb.sqlalchemy.vector.FloatVector. A custom as_generic() method must be implemented for this type class.

So, as it signals that the type implementation also needs an as_generic
property, let's supply one, returning sqltypes.ARRAY.

It looks like, in exchange to those improvements, the get_col_spec
method is not needed any longer.

TODO: Would it be a good idea to derive from SQLAlchemy's
ARRAY right away, to get a few of the features without
the need to redefine them?

Please note the outcome of this analysis and the corresponding implementation
has been derived from empirical observations, and from the feeling that we also
lack corresponding support on the other special data types of CrateDB (ARRAY and
OBJECT) within the SQLAlchemy dialect, i.e. "that something must be wrong or
incomplete". In this spirit, it is advisable to review and improve their
implementations correspondingly.

References

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant