-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Composite Decoder #179
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a couple of questions to improve my understanding
|
||
@dataclass | ||
class Parser(ABC): | ||
inner_parser: Optional["Parser"] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to me that the interface exposes an inner_parser as a public field. Should this be removed from the interface and the parser implementations will decide whatever they use internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed from the interface and the parser implementations will decide whatever they use internally?
this was added to interface to use declarative approach and wrap parser-in-parser-in-parser as many times as it is needed, e.g.:
- decoder:
type: CompositeRawDecoder
parser:
type: GzipParser
inner_parser:
type: parser1
inner_parser:
type: parser2
...
In my understanding parent parsers does not know about inner_parser implementation and all they do is just pass transformed object down through the pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that it'll be part of the implementations but not the interface. So for example in declarative_component_schema.yaml
, we will have:
DefaultDecoder:
title: Decoder with a parser
type: object
required:
- type
- parser
properties:
type:
type: string
enum: [DefaultDecoder]
parser:
anyOf:
- "$ref": "#/definitions/CsvParser"
- "$ref": "#/definitions/JsonlParser"
<...>
CsvParser:
title: CSV Parser
type: object
required:
- type
properties:
type:
type: string
enum: [CsvParser]
<probably some CSV parsing options that we have for file based>
GzipParser:
title: GZIP Parser
type: object
required:
- type
- underlying_parser
properties:
type:
type: string
enum: [GzipParser]
underlying_parser:
anyOf:
- "$ref": "#/definitions/CsvParser"
- "$ref": "#/definitions/JsonlParser"
<...>
Each of these parser components will have their instantiation method in the model_to_component_factory and if they require an inner parser (like the GzipParser), we will instante that parser at that point. So the fact that a parser uses a parser underneath does not need to be exposed as part of the Python interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, refactored
|
||
@abstractmethod | ||
def parse( | ||
self, data: BufferedDataInput | Iterable, *args, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by this interface. Does that mean that all the users with have to check the data
parameter as such:
def parse(
self, data: BufferedDataInput | Iterable, *args, **kwargs
) -> Generator[MutableMapping[str, Any], None, None]:
if isinstance(data, BufferedDataInput):
< do X >
elif isinstance(data, Iterable):
< do Y >
else:
raise ValueError(f"Unexpected type {type(data)}")
I also don't see the usage of BufferedDataInput
in the sample parsers you provided below, maybe I missed it. Can you show me how this would be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean that all the users with have to check the data parameter as such:
Usually not, cause I expect user to choose valid parser sequence.
I also don't see the usage of BufferedDataInput in the sample parsers you provided below, maybe I missed it. Can you show me how this would be used?
We start processing from passing response.raw
in
self.parser.parse(data=response.raw)
which in fact is readable (buffered) object.
any decompressor, like gzip or zip returns decompressor object , which in turn is also readable (buffered).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! I really like the idea of having a buffered reader as a part of the interface. Maybe it should even be a by more meaning that the file-based CsvParser is relying on more methods than just "read". Therefore, should it be IOBase instead of BufferedDataInput
?
So I think my question now is: can we remove Iterable
from the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess yes, refactored interface to BufferedIOBase
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's cool! I think this is a good improvement
class Parser(ABC): | ||
@abstractmethod | ||
def parse( | ||
self, data: BufferedIOBase, *args, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I understand: Why do we have *args, **kwargs
in the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no reason (or real use-case) for now, can be deleted.
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
What
Resolving Source Amazon Seller Partner CDK migration
add composite decoder to apply
decompressors
|decoders
consequently