Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Composite Decoder #179

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

artem1205
Copy link
Contributor

@artem1205 artem1205 commented Dec 17, 2024

What

Resolving Source Amazon Seller Partner CDK migration

add composite decoder to apply decompressors | decoders consequently

@artem1205 artem1205 self-assigned this Dec 17, 2024
@github-actions github-actions bot added the enhancement New feature or request label Dec 17, 2024
Signed-off-by: Artem Inzhyyants <[email protected]>
@artem1205 artem1205 requested a review from maxi297 December 18, 2024 15:10
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of questions to improve my understanding


@dataclass
class Parser(ABC):
inner_parser: Optional["Parser"] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to me that the interface exposes an inner_parser as a public field. Should this be removed from the interface and the parser implementations will decide whatever they use internally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed from the interface and the parser implementations will decide whatever they use internally?

this was added to interface to use declarative approach and wrap parser-in-parser-in-parser as many times as it is needed, e.g.:

- decoder:
   type: CompositeRawDecoder
   parser: 
     type: GzipParser
     inner_parser:
       type: parser1
         inner_parser:
           type: parser2
...

In my understanding parent parsers does not know about inner_parser implementation and all they do is just pass transformed object down through the pipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that it'll be part of the implementations but not the interface. So for example in declarative_component_schema.yaml, we will have:

  DefaultDecoder:
    title: Decoder with a parser
    type: object
    required:
      - type
      - parser
    properties:
      type:
        type: string
        enum: [DefaultDecoder]
      parser:
        anyOf:
          - "$ref": "#/definitions/CsvParser"
          - "$ref": "#/definitions/JsonlParser"
          <...>

  CsvParser:
    title: CSV Parser
    type: object
    required:
      - type
    properties:
      type:
        type: string
        enum: [CsvParser]
      <probably some CSV parsing options that we have for file based>
  GzipParser:
    title: GZIP Parser
    type: object
    required:
      - type
      - underlying_parser
    properties:
      type:
        type: string
        enum: [GzipParser]  
      underlying_parser:
        anyOf:
          - "$ref": "#/definitions/CsvParser"
          - "$ref": "#/definitions/JsonlParser"
          <...>

Each of these parser components will have their instantiation method in the model_to_component_factory and if they require an inner parser (like the GzipParser), we will instante that parser at that point. So the fact that a parser uses a parser underneath does not need to be exposed as part of the Python interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, refactored


@abstractmethod
def parse(
self, data: BufferedDataInput | Iterable, *args, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this interface. Does that mean that all the users with have to check the data parameter as such:

    def parse(
        self, data: BufferedDataInput | Iterable, *args, **kwargs
    ) -> Generator[MutableMapping[str, Any], None, None]:
        if isinstance(data, BufferedDataInput):
            < do X >
        elif isinstance(data, Iterable):
            < do Y >
        else:
            raise ValueError(f"Unexpected type {type(data)}")

I also don't see the usage of BufferedDataInput in the sample parsers you provided below, maybe I missed it. Can you show me how this would be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that all the users with have to check the data parameter as such:

Usually not, cause I expect user to choose valid parser sequence.

I also don't see the usage of BufferedDataInput in the sample parsers you provided below, maybe I missed it. Can you show me how this would be used?

We start processing from passing response.raw in

self.parser.parse(data=response.raw)

which in fact is readable (buffered) object.

any decompressor, like gzip or zip returns decompressor object , which in turn is also readable (buffered).

Copy link
Contributor

@maxi297 maxi297 Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I really like the idea of having a buffered reader as a part of the interface. Maybe it should even be a by more meaning that the file-based CsvParser is relying on more methods than just "read". Therefore, should it be IOBase instead of BufferedDataInput?

So I think my question now is: can we remove Iterable from the interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess yes, refactored interface to BufferedIOBase

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's cool! I think this is a good improvement

class Parser(ABC):
@abstractmethod
def parse(
self, data: BufferedIOBase, *args, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I understand: Why do we have *args, **kwargs in the interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason (or real use-case) for now, can be deleted.

Signed-off-by: Artem Inzhyyants <[email protected]>
Signed-off-by: Artem Inzhyyants <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants