# Avro schemas containing a union where two subschemas have the same fields but different names result in incorrect serialization (#749)
Didn't know it either, but as a reference: https://avro.apache.org/docs/1.11.1/specification/#complex-types-1. #742 is exactly the same issue, and #584 is the same but for dataclasses. It's not complete, but the serialization is correct whenever the type name is added as a tuple.

Maybe we can find the unions given the schema, alter the result of `asdict` with the type information, and pass that to `serialization.serialize()`?
Nice, good to know! So we should definitely fix this problem and attach the type name. I tried your suggestion using:

```python
import dataclasses
from typing import Literal

from dataclasses_avroschema import AvroModel


@dataclasses.dataclass
class EventOne(AvroModel):
    field1: str
    schema_tag: Literal["EventOne"] = "EventOne"


@dataclasses.dataclass
class EventTwo(AvroModel):
    field1: str
    schema_tag: Literal["EventTwo"] = "EventTwo"


@dataclasses.dataclass
class Events(AvroModel):
    event: EventOne | EventTwo


# It does not perform data validation with dataclasses
data = {"event": ("EventTwo", {"field1": "hello world", "schema_tag": "EventTwo"})}
event = Events.parse_obj(data=data)
print(event)
# >>> Events(event=('EventTwo', {'field1': 'hello world', 'schema_tag': 'EventTwo'}))

serialized = event.serialize()
print(serialized)
# We have the proper bytes
# >>> b'\x02\x16hello world\x10EventTwo'

# Then when deserializing:
print(Events.deserialize(serialized))
# fastavro returns {'event': {'field1': 'hello world', 'schema_tag': 'EventTwo'}},
# so the WRONG model is picked:
# >>> Events(event=EventOne(field1='hello world', schema_tag='EventTwo'))
```

The method `deserialize` is calling a wrapper on the `deserialize` function in `serialization.py`.
So I've just had a quick look at the `deserialize` method in `serialization.py`, where a part looks like this (as reference):

```python
if serialization_type == "avro":
    input_stream: typing.Union[io.BytesIO, io.StringIO] = io.BytesIO(data)

    payload = fastavro.schemaless_reader(
        input_stream,
        writer_schema=writer_schema or schema,
        reader_schema=schema,
    )
```

The signature of fastavro's `schemaless_reader` is:

```python
def schemaless_reader(
    fo: IO,
    writer_schema: Schema,
    reader_schema: Optional[Schema] = None,
    return_record_name: bool = False,
    return_record_name_override: bool = False,
    handle_unicode_errors: str = "strict",
    return_named_type: bool = False,
    return_named_type_override: bool = False,
) -> AvroMessage:
```

I've just set `return_record_name=True` and got this result for dataclasses:

```python
print(Events.deserialize(serialized))
# Events(event=('EventTwo', {'field1': 'hello world', 'schema_tag': 'EventTwo'}))
```

and the following result for pydantic:

```python
print(Events.deserialize(serialized))
# pydantic_core._pydantic_core.ValidationError: 1 validation error for Events
# event
#   Input should be a valid dictionary or object to extract fields from [type=model_attributes_type, input_value=('EventOne', {'field1': '...chema_tag': 'EventTwo'}), input_type=tuple]
#     For further information visit https://errors.pydantic.dev/2.8/v/model_attributes_type
```

I think we need to do some extra processing in the deserialization step.
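One possible shape for that extra processing, purely as a sketch (the `resolve_union` helper and `MODELS` registry are hypothetical names, not library code): when fastavro hands back a `(record_name, payload)` tuple, use the name to pick the matching model before validation.

```python
import dataclasses


@dataclasses.dataclass
class EventOne:
    field1: str


@dataclasses.dataclass
class EventTwo:
    field1: str


# Hypothetical registry mapping Avro record names to model classes.
MODELS = {"EventOne": EventOne, "EventTwo": EventTwo}


def resolve_union(value, models_by_name):
    """If value is a (record_name, payload) tuple as produced by fastavro's
    return_record_name=True, instantiate the matching model; otherwise
    return the value unchanged."""
    if isinstance(value, tuple) and len(value) == 2 and isinstance(value[1], dict):
        name, payload = value
        model = models_by_name[name.rsplit(".", 1)[-1]]  # drop any namespace
        return model(**payload)
    return value


print(resolve_union(("EventTwo", {"field1": "hello world"}), MODELS))
# EventTwo(field1='hello world')
```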
@mauro-palsgraaf with the latest version it is fixed:

```python
from typing import Literal

from pydantic import Field

from dataclasses_avroschema.pydantic.main import AvroBaseModel


class EventOne(AvroBaseModel):
    field1: str = Field(...)
    schema_tag: Literal["EventOne"] = Field(default="EventOne")


class EventTwo(AvroBaseModel):
    field1: str = Field(...)
    schema_tag: Literal["EventTwo"] = Field(default="EventTwo")


class Events(AvroBaseModel):
    event: EventOne | EventTwo = Field(discriminator="schema_tag")


serialized = Events(event=EventTwo(field1="hello world")).serialize()
print(serialized)
# >>> b'\x02\x16hello world\x10EventTwo'

print(Events.deserialize(serialized))
# >>> event=EventTwo(field1='hello world', schema_tag='EventTwo')
```

PS: It also works without …
Really nice, thank you for fixing this so quickly! Will happily test this out on Monday 🙂
### Context

We use this library in combination with Kafka, where we have a topic containing multiple types of events. All must be on the same topic since we need the ordering.
### Describe the bug

Union types where subschemas have the same fields and types do not produce the correct binary output according to the Avro specification. This results in losing compatibility with other languages that follow the specification. The problem is caused by the following code in `AvroBaseModel`: by transforming the pydantic object to a dict and passing it on, the typing information is lost. The fastavro library will then determine the union index by finding the type in the union that has the most fields in common, and, as the example below demonstrates, when all fields are the same it just picks the first one.
A possible solution would be to iterate over the schema after turning the object into a dict, and then add the type information as the first item of a tuple. Fastavro can handle that and will use the type name to determine the int for the binary output instead of the field match. See `write_union` in fastavro below:
### To Reproduce

Consider the following schema (very minimal example):

An example that is incorrect would be:

The result is `b'\x00\x16hello world\x10EventTwo'`, but the expected result would be `b'\x02\x16hello world\x10EventTwo'` according to the Avro specification, since the int at the beginning indicates the index of the chosen type in the union. Currently, the results of serializing `Events(event=EventOne(field1="hello world")).serialize()` and `Events(event=EventTwo(field1="hello world")).serialize()` are the same.

### Expected behavior
`Events(event=EventTwo(field1="hello world")).serialize()` should result in the following value:

`b'\x02\x16hello world\x10EventTwo'`
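That leading byte is the union index as a zigzag-encoded Avro varint. A tiny sketch (the `zigzag_decode` helper is just for illustration) decoding the first byte of the two serializations:

```python
def zigzag_decode(n: int) -> int:
    """Decode Avro's zigzag int encoding (0 -> 0, 2 -> 1, 1 -> -1, ...)."""
    return (n >> 1) ^ -(n & 1)


print(zigzag_decode(0x00))  # 0 -> first union branch (EventOne)
print(zigzag_decode(0x02))  # 1 -> second union branch (EventTwo)
```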
As soon as we figure out a solution and agree on how to fix it, I can help with the implementation if necessary.