-
Notifications
You must be signed in to change notification settings - Fork 674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Union type with dataclass ambiguous error and support superset comparison #5858
Fix Union type with dataclass ambiguous error and support superset comparison #5858
Conversation
Signed-off-by: mao3267 <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5858 +/- ##
==========================================
- Coverage 37.14% 37.07% -0.07%
==========================================
Files 1313 1316 +3
Lines 131619 132176 +557
==========================================
+ Hits 48891 49010 +119
- Misses 78481 78908 +427
- Partials 4247 4258 +11
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
…yteorg#5489-dataclass-mismatch Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
…t (one level) dataclass Signed-off-by: mao3267 <[email protected]>
One note @mao3267, this problem does not only affect dataclasses but any other type that uses protobuf struct as transport. Even combinations of different types that all use protobuf struct.
Maybe we can use the literal type's "type structure" field for this? Or we should document how transformers for other types can provide the schema in a way that they can "participate in the logic". TL;DR: It would be nice if the compiler logic in flytepropeller didn't have "special treatment" for dataclasses but general treatment for json-like structures with schemas that the dataclass transformer makes use of - but other transformers can as well. |
…yteorg#5489-dataclass-mismatch
Signed-off-by: mao3267 <[email protected]>
just dicussed with @mao3267 , he will explain how it works now, this will support both dataclass and pydantic basemodel in summary. |
Signed-off-by: mao3267 <[email protected]>
Let's get this done this week @mao3267 |
Currently, we support both Although the schemas from |
Signed-off-by: Future-Outlier <[email protected]>
the exact match part is okay and we should fix that but I'm confused about the other part. looking at this example
can you explain why this should work? Why doesn't mypy complain in this example? Or does it? I almost feel like if we're going to go down this route, it should be the other way around. If cc @fg91 if you want to take a look as well. wasn't there another pr where we were discussing an LGPL library also? |
@@ -19,6 +19,8 @@ type trivialChecker struct { | |||
} | |||
|
|||
func removeTitleFieldFromProperties(schema map[string]*structpb.Value) { | |||
// TODO: Explain why we need this | |||
// TODO: givse me example about dataclass vs. Pydantic BaseModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example comparing dataclass and Pydantic BaseModel. As shown, the schema for dataclass includes a title field that records the name of the class. Additionally, the additionalProperties
field is absent from the Pydantic BaseModel schema because its value is false
. cc @eapolinario
dataclass | Pydantic.BaseModel |
---|---|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add the comment, writing the entire schema would make it too lengthy. Would it be acceptable to use something like this instead?
class A:
a: int
Pydantic.BaseModel: {"properties": {"a": {"title": "A", "type": "integer"}}}
dataclass: {"properties": {"a": {"type": "integer"}}, "additionalProperties": false}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you proposing to preprocess the schemas so that one can mix and match dataclasses and base models given their schemas are aligned? I.e. task expects a dataclass with schema "A" and I pass a base model that has the same schema.
I personally feel this is not necessary and think it would be totally acceptable to consider a dataclass and a base model not a match by default. Especially if this makes things a lot more complicated in the backend otherwise because the schemas need to be aligned. What do you think about this?
If you are confident in the logic I'm of course not opposing the feature but if you feel this makes things complicated and brittle, I'd rather keep it simple and more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it actually make things more complicated, will remove related logic.
Class
In the discussion here, we assumed that upstream refers to the exact input type and downstream refers to the expected type for the task input. Did we misunderstand this? By the way, I’m also curious about the reason of supporting this superset matching. It will be helpful to decide our route.
The LGPL discussion applies to this PR as well, it is not mentioned because we are no longer using it. |
@Future-Outlier @mao3267 The latter would be slightly concerning to me. Glancing over the code gives me the impression that there is quite a bit of dataclass/pydantic logic we need to apply. I wonder whether this could be done in the respective flytekit type transformer so that the backend is agnostic to the respective type and as long as the type transformer provides the schema in the right way, the backend can make use of it. It would be really great if there was a tutorial in https://docs.flyte.org/en/latest/api/flytekit/types.extend.html in the end that documents how users need to provide the schema in their respective |
Oh got it, but this only works because there are defaults right? The original case you linked to should work because @dataclass
class A:
a: int
b: Optional[int] = None
c: str then this should not work correct? (because
I don't think of this as superset matching. I think of this as schema compatibility, which is why I was thinking we'd find some off-the-shelf library that can just do it for us. 'Is this schema compatible with this other schema?'
Is there a comment somewhere that explains why we no longer need, or can't use, that library or a library like it? Re @fg91's comments
I don't know the answer but it should be yes to the first question. It should be possible to easily provide a json schema and have everything on the backend just work. Isn't this the case @Future-Outlier? There should be no special logic for dataclasses or pydantic in the backend, at all. We should remove it if there is. |
Replying @wild-endeavor
Yes. This should not work.
No. I just decided to use another package that directly compares the JSON schema. This is not documented anywhere. Replying @fg91
To support any other custom types, could we limit the provided schema from a certain package like Mashumaro for the compatibility check? Without limitations, it will be hard to cover all kinds of scenarios. Or does anyone know possible solutions to support JSON schema from different versions and packages? |
This is exactly what I'm trying to say :)
Yes, I think restricting what kind of schema needs to be supplied is absolutely reasonable! I think it would be good to add a tutorial here https://docs.flyte.org/en/latest/api/flytekit/types.extend.html that states something like "if you want propeller to understand the schema of your type e.g. to distinguish in union types, you need to provide a schema in the to_literal_type method in this specific way". And I personally feel that the dataclass and pydantic type transformers should provide the schema in this general way so that the backend doesn't have to have type specific implementations for dataclasses/base models. (As a side note, as I described here, for cache checks we don't use the schema in metadata but the so-called type structure. Maybe it's difficult to fix this in hindsight but I kinda wished that there was a single unified way type transformers make the schema available to propeller that is used for everything, cache checks, union type checks, ...) |
@fg91 and @EngHabu mind chiming in on dataclass compatibility? Just thinking about it from the basics, if I have json schemas representing two dataclasses, let's say, @dataclass
class A2:
a: int
b: Optional[int] = None
c: str = "hello" and @dataclass
class A1:
a: int which of the following two are valid? def wf():
a1 = create_a1() # -> A1
use_a2(a2=a1) # a2: A2 Case 2 def wf():
a2 = create_a2() # -> A2
use_a1(a1=a2) # a1: A1 Just thinking about compatibility in the loosest sense, both should be valid. The reason is that in the first case, when calling the downstream task In the second case, the The implication here though is that if you have @task
def make_a1() -> A1: ...
@task
def use_either(a: typing.Union[A1, A2]): ... This this will fail
because A1 will match more than one variant. flytekit itself will not fail I think (right @Future-Outlier?) but we'll never get there because the compiler will fail. Should we just do exact matches only? Plus of two examples earlier (case 1 & 2), both will fail mypy type checking. |
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
9dc3fa6
to
b224a02
Compare
Signed-off-by: Future-Outlier <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just synced with @mao3267
- work
- same
- big -> small
- child -> parent
- not work
- different attribute
class A:
a: int
class B:
a: int
A -> B
- parent -> child
….com/mao3267/flyte into fix/flyteorg#5489-dataclass-mismatch
Signed-off-by: mao3267 <[email protected]>
…yteorg#5489-dataclass-mismatch
Signed-off-by: mao3267 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just synced with @mao3267 , here is my summary.
- if we support
child -> parent
case, we can't support 2 dataclass ambiguous error.
example:
@dataclass
class A:
a: int
@dataclass
class B:
a: int
# A -> B can pass and work
this is due to json schema's restriction.
we should use MRO to solve this or find other protocol.
Signed-off-by: mao3267 <[email protected]>
Signed-off-by: mao3267 <[email protected]>
"github.com/santhosh-tekuri/jsonschema" | ||
"github.com/wI2L/jsondiff" | ||
jscmp "gitlab.com/yvesf/json-schema-compare" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
santhosh-tekuri/jsonschema"
uses Apache License.
wI2L/jsondiff
uses MIT License.
yvesf/json-schema-compare
uses LGPL License.
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: mao3267 <[email protected]>
…failure Signed-off-by: mao3267 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, really enjoy to work with you.
Thank you for tackling this problem @mao3267 and @Future-Outlier 🙇 🙇
Would it be possible to document as proposed here how type transformers need to provide the schema? |
No problem! Will update the documents later. |
Summary
We only support json schema draft
Tracking issue
Related to #5489
Why are the changes needed?
When a function accepts a
Union
of two dataclasses as input, Flyte cannot distinguish which dataclass matches the user's input. This is because Flyte only compares the simple types, and both dataclasses are identified asflyte.SimpleType_STRUCT
in this scenario. As a result, there will be multiple matches, causing ambiguity and leading to an error.union_test_dataclass.py
What changes were proposed in this pull request?
To distinguish between different types using protobuf struct (dataclass, Pydantic.BaseModel), we compare their JSON schemas generated by
marshmallow_jsonschema.JSONSchema
(draft-07) ormashumaro.jsonschema.build_json_schema
(draft 2020-12) for dataclass and Pydantic.BaseModel for itself. To check equivalence, we compare the bytes from marshaling the json schemas if they are in the same draft version. For now, we only consider supporting the comparison of schemas with the same version.We plan to support superset matching for dataclass/Pydantic.BaseModel with schemas in draft 2020-12, meaning that class A and class supersetA can be a match in the following example: (Pydantic.BaseModel example is in the screenshot section)
Unit tests for different versions of json schema, including one-level, two-level, and strict subset examples in draft 2020-12 from mashumaro.
How was this patch tested? (my local testing repo here)
Setup process
Screenshots
Notes
Optional values in JSON Schemas
While handling
Optional
values in Python, bothNoneType
and the target type are accepted. However, when defining such values, default values must still be provided. This is whyOptional
properties without default values are marked asrequired
in JSON schemas.Ambiguity vs. Inheritance
Currently, we compare only the structure of the dataclass, excluding the title, as including it would block inheritance. This approach introduces ambiguity because it allows dataclasses with different names but the same structure to match. To resolve this, we may require additional information from Flytekit to determine whether two dataclasses have an inheritance relationship. The following example illustrates this challenge: B should match A, but C should not.
Json Schema Compare Package (LGPL License)
We are now using json-schem-compare package to do strict subset matching. However, since the underlying JSON Schema compiler only supports drafts up to draft-7, it does not handle the prefixItems field when tuples are used in a dataclass.
Check all the applicable boxes
Related PRs
None
Docs link
TODO