Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: make full CDK codebase compliant with mypy #105

Merged
merged 37 commits into from
Dec 4, 2024

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Dec 3, 2024

Note:

  • This is a monster of a PR and to play it safe, I tried as much as possible to not modify any core behaviors. Initially I added guard statements (if foo is None: raise ValueError(...)) to help the type checker, but in some cases, these caused the code to fail in a different way then the unit tests expected. So I've reverted those, and I've instead used the strategy of using "ignore" statements when there's no other way to resolve.
  • Note that cast in Python is a no-op and just helps the type inference. Added cast statements tell the linter what the type is expected to be, but they don't change the value or add to the runtime cost.

Summary of Code Changes

  1. Install type stub packages when missing. These are added to the 'dev' group in pyproject.toml so they should not have any effect at runtime.
  2. Turn on mypy checks for the whole codebase, excluding (for now) the unit_tests directory.
  3. Delete incremental "changed files only" version of the mypy checks, replacing with full codebase check.
  4. Replace from orjson import orjson with simply import orjson since that is required for the type checker.
  5. Replace Deprecated library with type_extensions.deprecated. This is officially part of the Python language now, as of python 3.11, and the typing_extensions package backports it.
  6. Update @deprecated decorator syntax to match the Python language deprecated syntax.
  7. Fix all obvious missing types that mypy complained about. Also added a few cast statements to help the type checker outl
  8. Add "noqa" statements (actually: type: ignore [ERRTYPE]) for anything that can not be fixed with near-zero risk.
  9. Add comments on many - but not all - noqa statements, to help readers understand what the issues are.
  10. I took a non-exhaustive review at "unused" type ignore statements, via poetry run mypy airbyte_cdk --warn-unused-ignores. When the ignore was obviously not needed (even at stricter compliance levels), I removed those. I didn't remove all because some looked valid and might only be shown as "unused" if we use a more permissive ruleset around the use of Any.

Summary by CodeRabbit

  • Bug Fixes

    • Improved type handling and error management in the ConcurrentCursor and HttpResponseFilter classes.
  • Documentation

    • Updated comments and type hints across multiple files for better clarity and type safety.
  • Chores

    • Removed unnecessary type ignore comments in several methods to enhance code clarity.

@github-actions github-actions bot added the chore label Dec 3, 2024
Copy link
Contributor

coderabbitai bot commented Dec 3, 2024

📝 Walkthrough
📝 Walkthrough

The changes in this pull request involve multiple updates across various files, primarily focusing on improving code quality through enhanced type annotations, import statement adjustments, and minor formatting updates. The modifications aim to standardize the use of the orjson library for JSON handling, refine error handling, and enhance the clarity of method signatures. Additionally, several test files have been updated to ensure robust testing coverage, particularly in scenarios involving state management and partition handling.

Changes

File Path Change Summary
.github/dependabot.yml Formatting improvements for comments and indentation adjustments for update-types.
.github/workflows/publish_sdm_connector.yml Reformatted version input description for clarity; no logic changes.
.github/workflows/pytest_matrix.yml Changed string delimiters in paths section from single to double quotes.
.github/workflows/python_lint.yml MyPy checks now run unconditionally; minor comment correction.
airbyte_cdk/cli/source_declarative_manifest/_run.py Updated import to absolute, added type annotations, improved error handling for config.
airbyte_cdk/config_observation.py Changed import to general, added ObservedDict and ConfigObserver classes.
airbyte_cdk/connector_builder/main.py Updated import to absolute for orjson; no logic changes.
airbyte_cdk/destinations/destination.py Updated import to absolute for orjson; no logic changes.
airbyte_cdk/destinations/vector_db_based/writer.py Added error handling for record property in write method.
airbyte_cdk/entrypoint.py Updated orjson import, modified airbyte_message_to_string for performance, enhanced error handling in read.
airbyte_cdk/logger.py Updated import for orjson, improved serialization in AirbyteLogFormatter.
airbyte_cdk/sources/connector_state_manager.py Modified stream descriptor handling in _extract_from_state_message.
airbyte_cdk/sources/declarative/auth/selective_authenticator.py Reformatted dpath.get call for readability in __new__ method.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml Added new components and properties, refined existing definitions.
airbyte_cdk/sources/declarative/decoders/noop_decoder.py Updated decode method signature with # type: ignore[override].
airbyte_cdk/sources/declarative/requesters/request_options/interpolated_request_options_provider.py Updated import for deprecated.
airbyte_cdk/sources/declarative/retrievers/async_retriever.py Updated import for deprecated, marked class as deprecated.
airbyte_cdk/sources/file_based/... Multiple files updated to change import for orjson to absolute, enhanced type annotations, and improved method signatures.
airbyte_cdk/utils/... Various files updated for type annotations, import adjustments, and enhanced error handling.
bin/run-mypy-on-modified-files.sh File removed.
pyproject.toml Removed Deprecated dependency, added new type dependencies, updated type-check task command.
unit_tests/... Multiple test files updated to integrate orjson, enhance error handling, and improve coverage with additional assertions.

Possibly related issues

Possibly related PRs

Suggested labels

enhancement, dependencies

Suggested reviewers

  • maxi297
  • alafanechere

What do you think about these suggestions? Would you like to proceed with any of them?


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (18)
airbyte_cdk/sources/declarative/decoders/noop_decoder.py (1)

17-20: Could we adjust the method signature to align with the base class?

Adding # type: ignore[override] suppresses the type checker warning about method override. Perhaps we can modify the decode method's signature to match the base class in Decoder and avoid the need for the ignore comment. What do you think?

airbyte_cdk/utils/spec_schema_transformations.py (1)

7-7: LGTM! The type annotations look great! 🎉

The changes properly type the JSON schema operations using dict[str, Any]. Would you consider enhancing the docstring to include the type information as well? Something like:

def resolve_refs(schema: dict[str, Any]) -> dict[str, Any]:
    """
    Resolves JSON schema references within the provided schema.

    Args:
        schema: A JSON schema dictionary that may contain $ref references

    Returns:
        A new schema dictionary with all references resolved
    """

wdyt?

Also applies to: 12-12, 24-24

airbyte_cdk/sources/streams/http/requests_native_auth/abstract_token.py (1)

8-8: The type hints are looking good! One small suggestion though 🤔

Instead of returning Any, we could be more specific since we know this method returns the modified request. How about:

def __call__(self, request: requests.PreparedRequest) -> requests.PreparedRequest:

This would make it clearer to callers what they're getting back. wdyt?

Also applies to: 15-15

airbyte_cdk/utils/event_timing.py (2)

23-25: Consider more specific type annotations for collections

The type annotations for events and stack could be more specific. What do you think about:

  • events: dict[str, Event] since it only stores Event instances
  • stack: list[Event] since it manages Event objects
-        self.events: dict[str, Any] = {}
-        self.stack: list[Any] = []
+        self.events: dict[str, Event] = {}
+        self.stack: list[Event] = []

wdyt? This would make the types more precise and help mypy catch potential type errors.


80-80: Consider more specific Generator type parameters

The Generator type could be more specific about what it yields and sends. How about:

-def create_timer(name: str) -> Generator[EventTimer, Any, None]:
+def create_timer(name: str) -> Generator[EventTimer, None, None]:

Since the generator doesn't use the send() method, we can use None instead of Any for better type safety. wdyt?

airbyte_cdk/config_observation.py (2)

Line range hint 24-29: Consider using TypeVar for better type safety in ObservedDict

Instead of using Any for the mapping's key and value types, we could make the class generic. How about:

+from typing import TypeVar, Generic

+K = TypeVar('K')
+V = TypeVar('V')

-class ObservedDict(dict):  # type: ignore
+class ObservedDict(Generic[K, V], dict[K, V]):
     def __init__(
         self,
-        non_observed_mapping: MutableMapping[Any, Any],
+        non_observed_mapping: MutableMapping[K, V],
         observer: ConfigObserver,
         update_on_unchanged_value: bool = True,
     ) -> None:

This would provide better type safety while maintaining flexibility. wdyt?


Line range hint 89-89: Consider adding return type annotation to update method

The update method is missing a return type annotation. Should we add -> None for consistency?

-    def update(self):
+    def update(self) -> None:
airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (1)

50-53: Great consistency in parameter ordering!

The consistent parameter ordering across methods (self first, followed by core parameters) improves readability. The explicit type hints for stream: AbstractFileBasedStream make the code more type-safe.

One thought: Would it make sense to add docstring type hints as well for better IDE support, wdyt? 🤔

Also applies to: 91-91, 111-114

airbyte_cdk/utils/traced_exception.py (1)

137-147: Consider consolidating type ignore comments

The repeated type ignore comments for union attributes could potentially be consolidated at the class level. Would you consider using a class-level type ignore if these union types are a known limitation of the model? This might make the code more readable. wdyt?

+ # type: ignore[union-attr]  # AirbyteMessage with MessageType.TRACE has AirbyteTraceMessage
  if error_message.trace.error.message:
-     error_message.trace.error.message = filter_secrets(  # type: ignore[union-attr] # AirbyteMessage with MessageType.TRACE has AirbyteTraceMessage
+     error_message.trace.error.message = filter_secrets(
          error_message.trace.error.message
      )
airbyte_cdk/sources/connector_state_manager.py (1)

Line range hint 134-138: Good type safety improvements in state extraction

The addition of type ignore comments with clear explanations helps mypy understand the code better. However, would it be worth considering creating a type guard function to validate these attributes at runtime? This could eliminate the need for type ignore comments. wdyt?

Example:

def has_stream_descriptor(state: AirbyteStateMessage) -> TypeGuard[AirbyteStateMessage]:
    return (hasattr(state, 'stream') and 
            hasattr(state.stream, 'stream_descriptor'))
airbyte_cdk/cli/source_declarative_manifest/_run.py (1)

163-170: Nice type safety improvements! Would you consider a small enhancement to the error message?

The type annotations and null check are great improvements. For the error message, what do you think about making it more actionable by including an example of a valid config structure? wdyt?

-                f"of the config but config only has keys: {list(config.keys() if config else [])}"
+                f"of the config but config only has keys: {list(config.keys() if config else [])}. " +
+                "Example valid config: {'__injected_declarative_manifest': {'streams': [...]}}"
airbyte_cdk/sources/utils/transform.py (1)

173-179: Consider adding a docstring for the resolve function?

The type annotations look great! Would you consider adding a brief docstring to explain what the resolve function does? wdyt?

     def resolve(subschema: dict[str, Any]) -> dict[str, Any]:
+        """
+        Resolves JSON Schema references and returns the resolved schema.
+        :param subschema: The schema that may contain a $ref
+        :return: The resolved schema
+        """
         if "$ref" in subschema:
airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (1)

33-35: LGTM! The type hints look great.

Nice work updating to the modern union syntax with |. Would you consider adding a docstring to document these optional parameters and their default values? This would help other developers understand when to use them, wdyt?

Also applies to: 38-38

airbyte_cdk/entrypoint.py (2)

18-18: LGTM! Good performance improvement using orjson.

The switch to orjson should provide better performance. Would you consider adding a try-catch block around the orjson.dumps calls to handle potential serialization errors gracefully, wdyt?


Line range hint 246-249: LGTM! Type annotation for state parameter is added.

The list[Any] type annotation is good for backward compatibility. Would you consider using a more specific type for the state items if we know their structure, wdyt? This could help catch type-related issues earlier.

airbyte_cdk/sources/streams/http/http.py (2)

13-13: LGTM! Consider standardizing the deprecation message format.

The switch to typing_extensions.deprecated improves type checking. The deprecation messages are clear but have slight format inconsistencies - some have trailing spaces. Would you like to standardize them? wdyt?

-    "Deprecated as of CDK version 3.0.0. ",
-    "You should set error_handler explicitly in HttpStream.get_error_handler() instead.",
+    "Deprecated as of CDK version 3.0.0. You should set error_handler explicitly in HttpStream.get_error_handler() instead."

Also applies to: 124-125, 135-136, 146-147, 157-158, 606-607, 622-623


Line range hint 127-162: Consider adding migration examples in docstrings.

The deprecation messages clearly indicate what to use instead, but would it be helpful to add example code in the docstrings showing how to migrate from the old to the new approach? This could make the transition smoother for users. wdyt?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2079-2079: Consider adding more context to the descriptions.

The descriptions for several OAuth-related fields could be enhanced with more context about when and why they would be used. For example:

  • extract_output: Could explain common scenarios where certain fields need to be extracted
  • state: Could explain why state parameter configuration is important for security
  • state_key: Could mention common variations in state parameter names across providers
  • auth_code_key: Could explain why some providers use different names for the authorization code

Also applies to: 2091-2091, 2124-2124, 2131-2131

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9587d4e and 8202ce5.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (53)
  • .github/dependabot.yml (2 hunks)
  • .github/workflows/publish_sdm_connector.yml (1 hunks)
  • .github/workflows/pytest_matrix.yml (1 hunks)
  • .github/workflows/python_lint.yml (1 hunks)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (2 hunks)
  • airbyte_cdk/config_observation.py (1 hunks)
  • airbyte_cdk/connector_builder/main.py (1 hunks)
  • airbyte_cdk/destinations/destination.py (1 hunks)
  • airbyte_cdk/destinations/vector_db_based/writer.py (1 hunks)
  • airbyte_cdk/entrypoint.py (1 hunks)
  • airbyte_cdk/logger.py (1 hunks)
  • airbyte_cdk/sources/connector_state_manager.py (1 hunks)
  • airbyte_cdk/sources/declarative/auth/selective_authenticator.py (1 hunks)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
  • airbyte_cdk/sources/declarative/decoders/noop_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/request_options/interpolated_request_options_provider.py (1 hunks)
  • airbyte_cdk/sources/declarative/retrievers/async_retriever.py (1 hunks)
  • airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py (3 hunks)
  • airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (4 hunks)
  • airbyte_cdk/sources/file_based/file_types/avro_parser.py (2 hunks)
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/excel_parser.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/jsonl_parser.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py (2 hunks)
  • airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (2 hunks)
  • airbyte_cdk/sources/streams/concurrent/abstract_stream.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/availability_strategy.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • airbyte_cdk/sources/streams/core.py (3 hunks)
  • airbyte_cdk/sources/streams/http/http.py (7 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/abstract_token.py (1 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (6 hunks)
  • airbyte_cdk/sources/utils/transform.py (7 hunks)
  • airbyte_cdk/test/entrypoint_wrapper.py (1 hunks)
  • airbyte_cdk/utils/airbyte_secrets_utils.py (1 hunks)
  • airbyte_cdk/utils/event_timing.py (4 hunks)
  • airbyte_cdk/utils/message_utils.py (1 hunks)
  • airbyte_cdk/utils/spec_schema_transformations.py (2 hunks)
  • airbyte_cdk/utils/traced_exception.py (3 hunks)
  • bin/run-mypy-on-modified-files.sh (0 hunks)
  • pyproject.toml (2 hunks)
  • unit_tests/connector_builder/test_connector_builder_handler.py (1 hunks)
  • unit_tests/connector_builder/test_message_grouper.py (1 hunks)
  • unit_tests/destinations/test_destination.py (6 hunks)
  • unit_tests/sources/declarative/incremental/test_per_partition_cursor_integration.py (1 hunks)
  • unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1 hunks)
  • unit_tests/sources/declarative/schema/source_test/SourceTest.py (1 hunks)
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py (1 hunks)
  • unit_tests/sources/test_source.py (1 hunks)
  • unit_tests/test/test_entrypoint_wrapper.py (1 hunks)
  • unit_tests/test_entrypoint.py (1 hunks)
  • unit_tests/utils/test_traced_exception.py (1 hunks)
💤 Files with no reviewable changes (1)
  • bin/run-mypy-on-modified-files.sh
✅ Files skipped from review due to trivial changes (15)
  • .github/workflows/pytest_matrix.yml
  • airbyte_cdk/sources/declarative/requesters/request_options/interpolated_request_options_provider.py
  • airbyte_cdk/sources/file_based/file_types/jsonl_parser.py
  • airbyte_cdk/test/entrypoint_wrapper.py
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py
  • airbyte_cdk/destinations/destination.py
  • .github/dependabot.yml
  • airbyte_cdk/sources/streams/concurrent/abstract_stream.py
  • unit_tests/test/test_entrypoint_wrapper.py
  • airbyte_cdk/sources/streams/concurrent/availability_strategy.py
  • airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py
  • unit_tests/utils/test_traced_exception.py
  • airbyte_cdk/connector_builder/main.py
  • airbyte_cdk/utils/airbyte_secrets_utils.py
  • .github/workflows/publish_sdm_connector.yml
🧰 Additional context used
📓 Learnings (1)
airbyte_cdk/cli/source_declarative_manifest/_run.py (1)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.273Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.
🪛 actionlint (1.7.4)
.github/workflows/python_lint.yml

78-78: property "changed-py-files" is not defined in object type {}

(expression)

🔇 Additional comments (37)
unit_tests/sources/declarative/incremental/test_per_partition_cursor_integration.py (1)

8-8: LGTM!

The import of orjson is appropriate and enhances the efficiency of JSON serialization in the tests.

unit_tests/connector_builder/test_message_grouper.py (1)

9-9: LGTM!

Importing orjson is beneficial for improved performance in JSON handling within the test cases.

unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1)

7-7: LGTM!

The addition of orjson import is justified as it's utilized for efficient serialization in state comparisons.

unit_tests/sources/declarative/schema/source_test/SourceTest.py (1)

7-7: LGTM!

Adding the -> None return type annotation to the __init__ method improves type clarity.

airbyte_cdk/utils/message_utils.py (1)

11-12: Nice work on the type ignore comments! 👍

The type ignore comments are well-documented with clear explanations. This is exactly how type ignores should be used - when we know more about the types than the type checker can infer.

Also applies to: 20-21

airbyte_cdk/sources/declarative/auth/selective_authenticator.py (1)

31-36: LGTM! Nice formatting improvement 👍

The multi-line formatting makes the code more readable, and the type ignore comment is well-documented explaining why it's needed (dpath's type hints being overly restrictive).

airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py (3)

25-29: LGTM! Good use of type ignore for the override

The type ignore comment is necessary here since the signature differs slightly from the parent class. The formatting also improves readability.


40-43: Nice type improvement! 🎯

The parameter type change from Stream to AbstractFileBasedStream makes the type more precise and matches the actual usage. This should help catch type-related issues earlier.


55-55: LGTM! Good addition of return type

Adding the -> None return type annotation to __init__ improves type safety and mypy compliance.

airbyte_cdk/logger.py (1)

Line range hint 75-75: Consider explicit return type annotation for orjson.dumps

The type ignore comment suggests orjson's return type isn't properly typed. Would it make sense to add a more explicit type hint?

✅ Verification successful

Let me check the orjson type hints in more detail to provide a better recommendation.


Let me try one more check to see the orjson usage pattern.


The type ignore comment seems appropriate here, wanna keep it? 🤔

Looking at the codebase usage, the type ignore comment is consistently used across multiple files when dealing with orjson.dumps().decode() operations. The comment accurately explains that it always returns a string, and this pattern is well-established throughout the project. Since there are no type stubs available for orjson (types-orjson not found), this is actually the cleanest approach to handle the typing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if orjson has type stubs
pip show types-orjson 2>/dev/null || echo "No type stubs found for orjson"

# Check if orjson is imported in other files with type annotations
rg -l "import orjson" | xargs rg "# type: ignore.*orjson"

Length of output: 544


Script:

#!/bin/bash
# Check orjson usage and type hints in the codebase
ast-grep --pattern 'orjson.dumps($$$)' -f py

# Check the full context of the logger implementation
rg -B5 -A5 'orjson.dumps.*type: ignore' airbyte_cdk/logger.py

Length of output: 928


Script:

#!/bin/bash
# Check orjson usage pattern without -f flag
ast-grep --pattern 'orjson.dumps($$$)'

# Check if there are any type annotations for orjson.dumps in the codebase
rg "orjson\.dumps.*->.*:" 

Length of output: 10229

airbyte_cdk/destinations/vector_db_based/writer.py (1)

88-90: LGTM! Nice type safety improvement.

The null check for message.record before accessing it is a great addition. The error message is clear and helps with debugging. This change aligns well with making the codebase mypy compliant.

airbyte_cdk/sources/declarative/retrievers/async_retriever.py (1)

7-7: Good move on standardizing the deprecation import!

Using typing_extensions.deprecated instead of deprecated.classic is more aligned with Python's type system and helps with mypy compliance. This change maintains the same functionality while improving type safety.

airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (1)

31-36: Nice type safety improvements!

The type: ignore[override] comment and parameter reordering make the code more mypy compliant while maintaining clear method signatures.

airbyte_cdk/utils/traced_exception.py (2)

107-108: Nice type safety improvement for variadic arguments!

Adding Any type to *args and **kwargs improves type safety. However, would it make sense to be more specific about the expected types here? For example, if we know what kinds of arguments might be passed, we could use a more restrictive type. wdyt?


133-134: Good addition of runtime null check!

Adding a null check for trace.error prevents potential runtime errors. This is a good defensive programming practice.

airbyte_cdk/sources/file_based/file_types/excel_parser.py (1)

10-10: Good choice using orjson!

Using orjson is a great choice for performance. It's typically much faster than the standard json library.

airbyte_cdk/cli/source_declarative_manifest/_run.py (1)

28-28: LGTM! Clean import statement.

The direct import of orjson aligns with the project's import style.

pyproject.toml (2)

101-104: LGTM! Comprehensive type stubs coverage.

The addition of type stubs for commonly used libraries will improve type checking accuracy.


146-146: LGTM! Simplified type checking command.

The switch to using poetry directly for type checking makes the process more straightforward.

airbyte_cdk/sources/utils/transform.py (3)

8-8: LGTM! Well-organized imports with proper type hints.

The imports are cleanly organized, and all necessary types are properly imported.

Also applies to: 10-10


70-71: LGTM! Clear type annotation for the callback.

The type annotation for the normalization callback makes the expected function signature clear to users.


144-148: LGTM! Properly typed generator function.

The type annotations for the normalizer generator function improve type safety.

unit_tests/destinations/test_destination.py (1)

9-9: LGTM! Great type safety improvements

Nice work on improving type safety with:

  • Added return type annotations
  • Added Path type hint for tmp_path
  • Consistent import style for orjson

Also applies to: 13-13, 133-133, 142-142, 153-157, 166-170, 215-220

airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)

10-10: LGTM! Better deprecation handling

Nice improvements:

  • Using typing_extensions for deprecation is more modern
  • More descriptive deprecation message format

Also applies to: 146-146

airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (2)

93-93: LGTM! Good addition of return type annotation.

The explicit None return type annotation helps with code clarity and type checking.


109-109: LGTM! Property setter return type is correctly annotated.

The explicit None return type for the setter method follows best practices for type annotations.

airbyte_cdk/sources/streams/concurrent/adapters.py (1)

11-11: Verify the deprecated decorator functionality remains consistent.

Good move switching to typing_extensions.deprecated! Let's verify this change doesn't affect existing deprecation warnings.

✅ Verification successful

The previous search might have been too specific. Let's broaden our search to understand the usage of deprecated decorators and verify the change's impact. I'm checking both the general usage of the decorator and the specific changes in the file.


The deprecated decorator change looks good and consistent with existing usage!

The switch to typing_extensions.deprecated is safe as all current usages follow the same pattern, accepting a deprecation message and optional category parameter. The functionality remains identical across the codebase, including the experimental class warnings in concurrent streams.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any custom deprecation warning implementations that might need updates
rg -g '!*.pyc' -g '!*test*' "deprecated.*classic" .

Length of output: 53


Script:

#!/bin/bash
# Let's check for any usage of deprecated decorator in the codebase
rg -g '!*.pyc' "@deprecated" .

# Also check the specific file for the old import
rg "deprecated" airbyte_cdk/sources/streams/concurrent/adapters.py -B2 -A2

Length of output: 2654

unit_tests/sources/test_source.py (1)

11-11: LGTM! Nice performance optimization.

The switch to orjson for JSON serialization is a good choice as it's known to be more performant than the standard library's json module.

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

221-223: LGTM! Good defensive programming.

The added null check for record.associated_slice prevents potential null pointer dereference. The condition is well-structured and maintains the existing logic while adding safety.

unit_tests/test_entrypoint.py (1)

14-14: LGTM! Consistent with the codebase-wide improvement.

The addition of orjson import aligns with the broader effort to standardize and optimize JSON handling across the codebase.

airbyte_cdk/sources/streams/core.py (1)

13-13: LGTM! The deprecation messages are clear and informative.

The changes align well with the mypy compliance objective. The deprecation messages provide clear guidance on migration paths.

Also applies to: 95-96, 433-436

unit_tests/connector_builder/test_connector_builder_handler.py (1)

14-14: Verify orjson compatibility with existing JSON usage patterns.

The switch to orjson for JSON handling is a good performance improvement. However, should we add tests to verify that orjson handles all edge cases the same way as the standard json module? wdyt?

Also applies to: 433-436

unit_tests/sources/declarative/test_concurrent_declarative_source.py (1)

13-13: LGTM! The import change aligns with making the codebase mypy compliant.

The change from deprecated.classic to typing_extensions is consistent with the PR objective of improving type safety. The deprecated decorator functionality should remain the same since both implementations follow the same interface.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2055-2058: LGTM! Clear and helpful example for access_token_headers.

The example demonstrates how to use base64 encoding for client credentials in the Authorization header, which is a common pattern in OAuth implementations.


2067-2072: LGTM! Well-structured example for access_token_params.

The example clearly shows how to structure the parameters for token exchange, using placeholders that will be replaced with actual values during runtime.

airbyte_cdk/sources/file_based/file_types/avro_parser.py (2)

6-6: LGTM on the typing import! 👍

Clean addition of cast to the existing typing imports.


186-186: 🛠️ Refactor suggestion

Consider adding type validation before casting

The cast to dict helps with type safety, but I notice that field["type"] could potentially be a string based on the _convert_avro_type_to_json method. What do you think about adding a type check before the cast to prevent potential runtime errors? Something like this maybe?

-    field["name"]: cast(dict, field["type"]) for field in schema["fields"]
+    field["name"]: cast(dict, field["type"]) if isinstance(field["type"], dict) else field["type"] for field in schema["fields"]

Let's verify how field["type"] is used in the codebase:

.github/workflows/python_lint.yml Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
.github/workflows/python_lint.yml (2)

75-75: Nice catch on fixing the typo! 🔍

The comment now correctly reads "Job-specific". However, I notice this typo still exists in the other jobs. Would you like me to help fix those as well? wdyt?


Line range hint 1-78: Consider optimizing the workflow structure? 🤔

I notice there's quite a bit of duplication in the common steps across all three jobs. Would you be interested in using a composite action or reusable workflow to DRY this up? This could make maintenance easier and reduce the file size significantly. Here's what I'm thinking:

  1. Create a reusable workflow for the common setup steps
  2. Or create a composite action for the Python/Poetry setup

wdyt? Happy to provide an example if you're interested!

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8202ce5 and 5c65b46.

📒 Files selected for processing (2)
  • .github/workflows/python_lint.yml (1 hunks)
  • airbyte_cdk/sources/streams/core.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/streams/core.py
🔇 Additional comments (1)
.github/workflows/python_lint.yml (1)

77-78: Simplified MyPy execution looks good! 👍

The removal of the conditional check aligns with making the codebase MyPy compliant by ensuring all files are type-checked. This matches the previous review discussion about removing the condition.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

2068-2071: Consider using a more secure example for OAuth headers?

The current example shows encoding of client credentials in the Authorization header. While technically correct, would it be better to use a placeholder like {base64Encoder:CLIENT_ID:CLIENT_SECRET} to avoid any confusion about exposing real credentials in documentation? wdyt?

-                    "Authorization": "Basic {base64Encoder:{client_id}:{client_secret}}",
+                    "Authorization": "Basic {base64Encoder:CLIENT_ID:CLIENT_SECRET}",

2080-2085: How about improving the JSON formatting in the example?

The example is clear but the formatting could be more readable. Would you consider this more readable format? wdyt?

-              - access_token_params:
-                  {
-                    "{auth_code_key}": "{{auth_code_key}}",
-                    "{client_id_key}": "{{client_id_key}}",
-                    "{client_secret_key}": "{{client_secret_key}}",
-                  }
+              - access_token_params:
+                  {
+                    "{auth_code_key}": "{{auth_code_key}}",
+                    "{client_id_key}": "{{client_id_key}}",
+                    "{client_secret_key}": "{{client_secret_key}}"
+                  }
airbyte_cdk/sources/streams/concurrent/adapters.py (2)

53-56: Consider enhancing the deprecation message with migration guidance.

The experimental warning is clear, but would it be helpful to provide guidance on the recommended alternative or future direction? This could help users prepare for eventual migration, wdyt?

 @deprecated(
-    "This class is experimental. Use at your own risk.",
+    "This class is experimental and may change in future versions. Use at your own risk. "
+    "Consider using <alternative_approach> for new implementations.",
     category=ExperimentalClassWarning,
 )

Line range hint 261-270: Consider enriching error context with slice information.

When handling exceptions, including the slice information in the error context could make debugging easier. What do you think about enhancing the error handling like this?

         except Exception as e:
             display_message = self._stream.get_error_display_message(e)
             if display_message:
-                raise ExceptionWithDisplayMessage(display_message) from e
+                raise ExceptionWithDisplayMessage(
+                    f"{display_message} (slice: {self._slice})"
+                ) from e
             else:
                 raise e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5c65b46 and 192e905.

📒 Files selected for processing (14)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (4 hunks)
  • airbyte_cdk/sources/declarative/decoders/noop_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/retrievers/async_retriever.py (2 hunks)
  • airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py (4 hunks)
  • airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (5 hunks)
  • airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/abstract_stream.py (2 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (2 hunks)
  • airbyte_cdk/sources/streams/concurrent/availability_strategy.py (3 hunks)
  • airbyte_cdk/sources/streams/core.py (3 hunks)
  • airbyte_cdk/sources/streams/http/http.py (7 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (6 hunks)
  • pyproject.toml (2 hunks)
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
  • airbyte_cdk/sources/declarative/decoders/noop_decoder.py
  • airbyte_cdk/sources/declarative/retrievers/async_retriever.py
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py
  • airbyte_cdk/sources/file_based/stream/concurrent/adapters.py
  • airbyte_cdk/sources/streams/concurrent/abstract_stream.py
  • airbyte_cdk/sources/streams/core.py
  • airbyte_cdk/sources/streams/concurrent/availability_strategy.py
  • pyproject.toml
  • airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
  • airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py
🔇 Additional comments (3)
airbyte_cdk/sources/streams/concurrent/adapters.py (1)

11-11: LGTM! Good choice on the typing_extensions import.

The switch to typing_extensions.deprecated is a more lightweight and standardized approach for deprecation handling.

airbyte_cdk/sources/streams/http/http.py (2)

124-125: LGTM! Great improvement in deprecation messaging.

The updated deprecation messages are more informative and consistent, providing clear version information and migration guidance.

Also applies to: 135-136, 146-147, 157-158


606-607: LGTM! Consistent deprecation handling.

The deprecation messages for adapter classes maintain consistency with the property deprecations, providing clear migration paths.

Also applies to: 622-623

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (9)
airbyte_cdk/sources/embedded/base_integration.py (1)

55-57: LGTM! Quick thought about the type ignore comment

The formatting looks clean and the type ignore directive is appropriately scoped. Would you consider adding a brief inline comment explaining why we need the type ignore here? Something like:

# type: ignore[union-attr, arg-type]  # stream.source_defined_primary_key can be None

This would help future maintainers understand the type complexity, wdyt? 🤔

airbyte_cdk/sources/declarative/datetime/min_max_datetime.py (2)

44-52: Consider using Optional type for better type clarity

The type ignore comments suggest we're dealing with optional values. Would it make more sense to explicitly declare min_datetime and max_datetime as Optional[Union[InterpolatedString, str]] in the class definition, wdyt? This would better represent the actual type hierarchy and might eliminate the need for type ignores.


69-75: Consider adding runtime type check to satisfy mypy

Instead of using type ignore, we could add a runtime check to help mypy understand the type. What do you think about this approach?

-            str(
-                self.datetime.eval(  # type: ignore[union-attr] # str has no attribute "eval"
-                    config,
-                    **additional_parameters,
-                )
-            ),
+            str(
+                self.datetime.eval(config, **additional_parameters)
+                if isinstance(self.datetime, InterpolatedString)
+                else self.datetime
+            ),
airbyte_cdk/sources/http_logger.py (1)

45-45: Consider using TypedDict for better type safety? 🤔

Instead of using a type ignore, we could make the types more explicit. What do you think about defining a TypedDict for the log message structure? Something like:

from typing import TypedDict

class HttpSection(TypedDict, total=False):
    title: str
    description: str
    is_auxiliary: bool

class LogMessage(TypedDict):
    http: HttpSection
    log: dict[str, str]
    url: dict[str, str]

This would give us better type safety and eliminate the need for the type ignore. Thoughts?

airbyte_cdk/sources/declarative/interpolation/jinja.py (2)

123-123: Consider explicit typing to avoid type ignore

Instead of using a type ignore comment, what if we explicitly type the undeclared variable? This would make the code more type-safe and clearer. wdyt?

-            undeclared = self._find_undeclared_variables(s)
-            undeclared_not_in_context = {var for var in undeclared}  # type: ignore [attr-defined]
+            undeclared: set[str] = self._find_undeclared_variables(s)
+            undeclared_not_in_context = {var for var in undeclared}

Line range hint 31-49: Consider adding type-related documentation

The class documentation is great, but given the focus on type safety, would it be helpful to add some documentation about the expected types and None handling? This could help future maintainers understand the type-related decisions. wdyt?

For example:

class JinjaInterpolation(Interpolation):
    """
    Interpolation strategy using the Jinja2 template engine.
    
    Type handling:
    - Input strings can be None, which will be treated as empty strings
    - Template compilation results are cached for performance
    - Undeclared variables are returned as a set of strings
    
    [rest of existing docstring...]
    """
airbyte_cdk/sources/file_based/file_types/excel_parser.py (3)

73-76: Consider addressing the underlying type issue instead of using type ignore.

I notice we're using # type: ignore [index] here. Could we perhaps make this more type-safe by explicitly typing the return value of dtype_to_json_type? For example, we could use TypedDict to define the expected structure. What do you think? 🤔


193-193: Document why the type ignore is necessary here.

The # type: ignore [arg-type] seems to be handling a typing limitation with the calamine engine. Could we add a brief comment explaining why this is needed? Something like:

# type: ignore [arg-type] # calamine engine accepts Union[IOBase, str, Path] but type stubs don't reflect this

What do you think? 🤔


Line range hint 1-193: Consider creating type stub files for better type safety.

I notice we have a few type: ignore comments throughout the file. Would it make sense to create .pyi stub files for some of these third-party interactions? This could give us better type safety without the need for inline ignores. What are your thoughts on this approach? 🤔

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 192e905 and 93e4534.

📒 Files selected for processing (13)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (3 hunks)
  • airbyte_cdk/connector_builder/message_grouper.py (2 hunks)
  • airbyte_cdk/destinations/vector_db_based/writer.py (1 hunks)
  • airbyte_cdk/sources/declarative/datetime/min_max_datetime.py (2 hunks)
  • airbyte_cdk/sources/declarative/interpolation/jinja.py (2 hunks)
  • airbyte_cdk/sources/embedded/base_integration.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/avro_parser.py (3 hunks)
  • airbyte_cdk/sources/file_based/file_types/excel_parser.py (3 hunks)
  • airbyte_cdk/sources/http_logger.py (2 hunks)
  • airbyte_cdk/sources/streams/concurrent/adapters.py (3 hunks)
  • airbyte_cdk/sources/streams/core.py (4 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (7 hunks)
  • airbyte_cdk/sources/utils/transform.py (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • airbyte_cdk/destinations/vector_db_based/writer.py
  • airbyte_cdk/sources/file_based/file_types/avro_parser.py
  • airbyte_cdk/sources/streams/core.py
  • airbyte_cdk/cli/source_declarative_manifest/_run.py
  • airbyte_cdk/sources/streams/concurrent/adapters.py
  • airbyte_cdk/sources/utils/transform.py
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
🔇 Additional comments (6)
airbyte_cdk/connector_builder/message_grouper.py (2)

277-277: LGTM! Type ignore is appropriate here

The type ignore comment is correctly handling both the argument type and union attribute access issues. The actual runtime type is compatible, but mypy can't verify this due to the dynamic nature of protocol messages.


358-358: LGTM! Type ignore is appropriate here

The type ignore comment is correctly handling the argument type mismatch. The deepcopy operation is necessary to prevent mutation of the records, and the runtime type is compatible despite mypy's concerns.

airbyte_cdk/sources/http_logger.py (2)

17-17: Nice type hint improvement! 🎯

The change from bool to bool | None makes the type hint more precise by explicitly indicating that None is a valid value. This aligns perfectly with mypy's expectations, wdyt?


48-48: 🛠️ Refactor suggestion

Should we align the return type with actual usage? 🤔

The type ignore suggests a mismatch between the expected JsonType and the actual dict[str, object]. Would it make sense to:

  1. Update the return type annotation to match the actual structure, or
  2. Ensure the dictionary values conform to JsonType?

Let's verify the JsonType definition:

airbyte_cdk/sources/file_based/file_types/excel_parser.py (2)

10-10: LGTM! Nice standardization of orjson import.

The direct import of orjson aligns well with the codebase-wide standardization effort. This change makes the code more consistent with other files in the project.


Line range hint 116-123: Great improvements to datetime handling! Let's verify the changes.

The explicit date_format="iso" and date_unit="us" parameters ensure proper datetime serialization. The error handling is also more specific now.

Would you like me to help verify that this handles all datetime edge cases correctly?

✅ Verification successful

Let me check the test coverage for Excel parser specifically to verify datetime handling.


Great news! The datetime handling looks solid and well-tested.

Looking at the test files, I can confirm that:

  • The Excel parser correctly handles various datetime formats including dates, timestamps with milliseconds and microseconds
  • The date_format="iso" and date_unit="us" parameters are working as expected, with test coverage in excel_all_types_scenario showing proper serialization of:
    • Full datetime values (e.g., "2022-05-29T00:00:00.000000")
    • Time values with microsecond precision (e.g., "12:00:00.456789")

The changes align perfectly with the pandas datetime handling and ensure consistent serialization across the codebase. Nice work! 🎉

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for datetime-related test cases to ensure coverage
rg -l "datetime|timestamp" --type py | while read -r file; do
  echo "=== $file ==="
  rg "def test.*(?:datetime|timestamp)" "$file" || true
done

Length of output: 31841


Script:

#!/bin/bash
# Check Excel parser test file content
echo "=== Excel Parser Tests ==="
cat unit_tests/sources/file_based/file_types/test_excel_parser.py

# Look for datetime-related test data in Excel scenarios
echo -e "\n=== Excel Scenarios ==="
cat unit_tests/sources/file_based/scenarios/excel_scenarios.py

Length of output: 21362

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (11)
.github/workflows/python_lint.yml (1)

Line range hint 31-31: Should we fix the "Job-specifc" typo in other jobs too? 🤔

I notice we fixed the typo in the MyPy job's comment (from "Job-specifc" to "Job-specific"), but the same typo exists in the other jobs. Wanna fix those for consistency?

-      # Job-specifc step(s):
+      # Job-specific step(s):

Also applies to: 66-66

airbyte_cdk/destinations/vector_db_based/embedder.py (2)

110-113: Consider documenting the reason for type ignore

Hey! The type ignore comment helps with mypy compliance, but would it be helpful to add a brief comment explaining why it's needed? Something like:

# type: ignore [call-arg] # langchain's OpenAIEmbeddings has incomplete type hints

Also, have you considered creating a type stub file for OpenAIEmbeddings to properly type hint these arguments? This could help avoid the type ignore, wdyt? 🤔


Line range hint 121-131: Consider making the API version configurable

I notice we're using a hardcoded API version "2023-05-15". Would it make sense to move this to the config model to make it more maintainable? This way we can update it without code changes when Azure releases new versions.

Also, the comment about the 16 document limit is dated (20230927). Should we update this or maybe make it more generic? Something like "Azure OpenAI API has a limit on documents per request"?

What are your thoughts on these suggestions? 🤔

airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py (1)

88-90: LGTM! Consider documenting type cast assumption

The type ignore comment is correctly added to handle the union attribute access. Would you consider adding a brief inline comment explaining why we're confident that field_name is always an interpolated string here? This could help future maintainers understand the type ignore rationale better, wdyt?

airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py (2)

154-157: LGTM! Consider consolidating type ignores

The type ignores for [no-any-return, union-attr] are correctly added. Since we're ignoring multiple type issues on the same line, would it make sense to add a more comprehensive comment explaining why these are safe to ignore? This could help maintain clarity about our type assumptions, wdyt?


161-165: LGTM! Consider enhancing type safety

The type ignores for predicate evaluation are necessary here. However, would you be interested in exploring the possibility of creating a custom type for the response parameter to make the type checking more explicit? This could potentially reduce our reliance on type ignores in the future, wdyt?

airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (3)

136-137: LGTM! Consider documenting state structure

The type ignore for union-attr is correctly added. Would it be helpful to add a docstring describing the expected structure of the stream_state? This could make it clearer why we're confident about the type cast, wdyt?


161-163: LGTM! Consider enhancing type safety for stream slices

The type ignores for arg-type are necessary here. However, would you be interested in creating a more specific type for stream slices that better represents their structure? This could potentially reduce the need for type ignores, wdyt?


373-376: LGTM! Consider consolidating similar patterns

The type ignores here match the pattern we see in the request options provider. Would it make sense to extract this common pattern into a helper method or base class to ensure consistent handling across the codebase, wdyt?

airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (2)

133-136: Consider refactoring to eliminate # type: ignore comments

Would it be possible to refactor the code to avoid the # type: ignore [union-attr] comment here? Perhaps using type casting or asserting that parent_config.request_option is not None might help mypy recognize the correct type. Wdyt?


300-300: Consider refining the assignment to satisfy type checking

Could we adjust the assignment of substream_state to adhere to mypy's expectations without using # type: ignore [assignment]? Maybe explicitly defining the type or restructuring the code could help. Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7f05353 and 3eff135.

📒 Files selected for processing (17)
  • .github/workflows/python_lint.yml (1 hunks)
  • airbyte_cdk/cli/source_declarative_manifest/_run.py (3 hunks)
  • airbyte_cdk/connector_builder/message_grouper.py (4 hunks)
  • airbyte_cdk/destinations/vector_db_based/embedder.py (2 hunks)
  • airbyte_cdk/entrypoint.py (2 hunks)
  • airbyte_cdk/sources/declarative/datetime/min_max_datetime.py (3 hunks)
  • airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (3 hunks)
  • airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (5 hunks)
  • airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/paginators/default_paginator.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/excel_parser.py (4 hunks)
  • airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py (1 hunks)
  • airbyte_cdk/sources/streams/core.py (6 hunks)
  • airbyte_cdk/sources/streams/http/http.py (8 hunks)
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (10 hunks)
  • airbyte_cdk/test/entrypoint_wrapper.py (2 hunks)
✅ Files skipped from review due to trivial changes (3)
  • airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py
  • airbyte_cdk/sources/declarative/requesters/paginators/default_paginator.py
  • airbyte_cdk/test/entrypoint_wrapper.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • airbyte_cdk/connector_builder/message_grouper.py
  • airbyte_cdk/entrypoint.py
  • airbyte_cdk/sources/streams/core.py
  • airbyte_cdk/cli/source_declarative_manifest/_run.py
  • airbyte_cdk/sources/file_based/file_types/excel_parser.py
  • airbyte_cdk/sources/declarative/datetime/min_max_datetime.py
  • airbyte_cdk/sources/streams/http/http.py
  • airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
🔇 Additional comments (5)
.github/workflows/python_lint.yml (1)

75-75: LGTM on the MyPy changes! 👍

The removal of the conditional check fixes the issue with the non-existent step output. The command is now more focused by targeting the airbyte_cdk directory specifically.

Also applies to: 77-78

airbyte_cdk/destinations/vector_db_based/embedder.py (1)

Line range hint 110-131: LGTM! The changes successfully address mypy compliance

The type ignore comments are appropriately placed and the functionality remains intact. The changes align well with the PR's objective of making the codebase mypy compliant. 👍

airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)

165-168: Same suggestion regarding # type: ignore comments

Again, could we refactor the code to eliminate the # type: ignore [union-attr] comment? Adjusting the type hints or using type assertions could help mypy understand the types better. What do you think?


195-198: Refactoring to address # type: ignore for argument types

Would adjusting the type annotations or using type casting help eliminate the # type: ignore [arg-type] comment here? It might make the code cleaner and more type-safe. Wdyt?


234-237: Addressing # type: ignore by refining types

Is there an opportunity to refactor this section to remove the # type: ignore [arg-type] comment? Refining the type annotations for parent_record or extra_field_path might help mypy perform accurate type checking. What are your thoughts?

@aaronsteers aaronsteers requested a review from bnchrch December 3, 2024 21:05
@aaronsteers
Copy link
Contributor Author

Note: Downstream connector test failures appear to be unrelated. Those would probably also be failing on main but they only run when certain subpackages are changed.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (19)
airbyte_cdk/sources/streams/concurrent/cursor.py (2)

476-476: Nice type annotation addition! 👍

The explicit type annotation for record_cursor_value as CursorValueType improves type safety and readability. This aligns well with the PR's goal of making the codebase mypy compliant.

Hey, while we're at it, would you consider adding a docstring to the _extract_cursor_value method to document its return type? Something like:

def _extract_cursor_value(self, record: Record) -> CursorValueType:
    """Extract and parse the cursor value from the record.
    
    Args:
        record: The record to extract the cursor value from
        
    Returns:
        CursorValueType: The parsed cursor value
        
    Raises:
        ValueError: If the cursor field is not found in the record
    """

wdyt? 🤔


Line range hint 483-483: Consider adding None checks for better type safety? 🤔

The comparison self.start <= record_cursor_value <= self._end_provider() looks clean, but what if we add some guard clauses for edge cases? Something like:

-        return self.start <= record_cursor_value <= self._end_provider()
+        if self.start is None:
+            return record_cursor_value <= self._end_provider()
+        end = self._end_provider()
+        if end is None:
+            return self.start <= record_cursor_value
+        return self.start <= record_cursor_value <= end

This would make the code more resilient to edge cases where either boundary might be None. What do you think? 🎯

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

Line range hint 648-678: Consider adding validation for token_provider type.

The changes look good with improved error handling for token_provider and api_token. However, we might want to add type validation for token_provider to prevent runtime errors. WDYT?

    def create_session_token_authenticator(
        self, model: SessionTokenAuthenticatorModel, config: Config, name: str, **kwargs: Any
    ) -> Union[ApiKeyAuthenticator, BearerAuthenticator]:
+       if token_provider is not None and not isinstance(token_provider, TokenProvider):
+           raise ValueError(f"token_provider must be an instance of TokenProvider, got {type(token_provider)}")
        decoder = (
            self._create_component_from_model(model=model.decoder, config=config)
            if model.decoder
            else JsonDecoder(parameters={})
        )

1711-1711: Consider documenting the token expiry format requirements.

The addition of token_expiry_date_format and token_expiry_is_time_of_expiration improves token handling. Would it be helpful to add a docstring explaining the expected format and behavior? WDYT?

    def create_oauth_authenticator(
        self, model: OAuthAuthenticatorModel, config: Config, **kwargs: Any
    ) -> DeclarativeOauth2Authenticator:
+       """Creates an OAuth2 authenticator.
+       
+       Args:
+           token_expiry_date_format: The format string for parsing token expiry dates.
+                                    When provided, token_expiry_is_time_of_expiration is set to True.
+                                    Example format: "%Y-%m-%dT%H:%M:%S.%fZ"
+       """
airbyte_cdk/sources/utils/record_helper.py (2)

38-38: Consider adding a type-ignore comment for clarity? 🤔

Hey! I notice this line involves in-place data transformation. Since mypy might flag the unused return value, would it make sense to add a # type: ignore[func-returns-value] comment here? This would explicitly document that we're intentionally ignoring the return value, wdyt?

-            transformer.transform(data, schema)
+            transformer.transform(data, schema)  # type: ignore[func-returns-value] # Transform modifies data in-place

Line range hint 15-24: How about enhancing the docstring for better type clarity? 📝

The function looks great! Would you be interested in adding a more detailed docstring that explains the transformer's role and type transformation behavior? Something like this perhaps?

 def stream_data_to_airbyte_message(
     stream_name: str,
     data_or_message: StreamData,
     transformer: TypeTransformer = TypeTransformer(TransformConfig.NoTransform),
     schema: Optional[Mapping[str, Any]] = None,
     is_file_transfer_message: bool = False,
 ) -> AirbyteMessage:
+    """Convert stream data into an Airbyte message with optional type transformation.
+    
+    Args:
+        stream_name: Name of the stream this message belongs to
+        data_or_message: The data to convert or an existing message
+        transformer: Handles type transformations according to schema (modifies data in-place)
+        schema: Optional JSON schema for type validation/transformation
+        is_file_transfer_message: If True, creates a file transfer message instead of a record
+    
+    Returns:
+        AirbyteMessage: The converted message
+    
+    Raises:
+        ValueError: If data_or_message is of an unexpected type
+    """
airbyte_cdk/entrypoint.py (2)

251-254: Consider adding runtime validation for record attributes?

While the type ignores are valid (these fields are guaranteed by the protocol), we could make this more robust with runtime validation, wdyt?

+        if not hasattr(message.record, 'stream') or not hasattr(message.record, 'namespace'):
+            raise ValueError("Record message missing required attributes: stream and namespace")
         stream_message_count[
             HashableStreamDescriptor(
                 name=message.record.stream,  # type: ignore[union-attr]
                 namespace=message.record.namespace,  # type: ignore[union-attr]
             )
         ] += 1.0

260-262: How about making state stats initialization more explicit?

The current approach works well, but we could make it more explicit and type-safe, wdyt?

-                message.state.sourceStats = message.state.sourceStats or AirbyteStateStats()
+                if message.state.sourceStats is None:
+                    message.state.sourceStats = AirbyteStateStats()
                 message.state.sourceStats.recordCount = stream_message_count.get(
                     stream_descriptor, 0.0
                 )
airbyte_cdk/sources/streams/http/http_client.py (3)

147-147: Consider tracking the untyped cache.clear() method.

The code comment suggests that cache.clear() is still not typed in requests_cache. While removing the type ignore comment aligns with making the codebase mypy compliant, we might want to track this for future updates. What do you think about adding a TODO to revisit this when requests_cache adds proper typing?


Line range hint 338-338: Consider making the control flow more explicit for type checking.

The type ignore comment is necessary because mypy can't infer that the error paths always raise. What do you think about making this more explicit by either:

  1. Adding a final assert response is not None before the return, or
  2. Using a helper function that handles the None case explicitly?

This could help remove the type ignore comment and make the control flow clearer. WDYT?


Line range hint 391-402: Consider improving the stream status emission approach.

The current approach of printing stream status with \n and flush=True is noted as temporary. This could cause issues with message ordering or parsing. Would you be interested in tracking this as a separate improvement task to implement a proper message queuing system once the concurrent message repository is ready?

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

Line range hint 206-206: Consider enhancing state manager type safety

I noticed this comment about using generics for the ConnectorStateManager. Since we're working on mypy compliance, would you like to tackle this improvement as well? We could make the state management more type-safe by introducing proper generic type parameters. Here's what I'm thinking:

- state_manager = ConnectorStateManager(state=self._state)  # type: ignore
+ class ConnectorStateManager(Generic[TState]):
+     def __init__(self, state: TState) -> None:
+         self._state = state
+ 
+ state_manager = ConnectorStateManager[TState](state=self._state)

What are your thoughts on this approach? 🤔

airbyte_cdk/sources/declarative/spec/spec.py (2)

Line range hint 24-24: How about adding advanced_auth to the docstring? 🤔

The class has great documentation, but the new advanced_auth attribute isn't documented in the Attributes section. Wdyt about adding it?

     """
     Returns a connection specification made up of information about the connector and how it can be configured
 
     Attributes:
         connection_specification (Mapping[str, Any]): information related to how a connector can be configured
         documentation_url (Optional[str]): The link the Airbyte documentation about this connector
+        advanced_auth (Optional[AuthFlow]): Advanced authentication configuration for the connector
     """

Line range hint 37-41: Type handling looks good, but shall we add a runtime check? 🤔

The type ignore comment is well documented, but we could make it even more robust. What do you think about adding a runtime assertion to complement the type ignore? This would help catch any potential issues early:

     if self.advanced_auth:
+        assert hasattr(self.advanced_auth, 'auth_flow_type'), "AuthFlow must have auth_flow_type attribute"
         self.advanced_auth.auth_flow_type = self.advanced_auth.auth_flow_type.value  # type: ignore # We know this is always assigned to an AuthFlow which has the auth_flow_type field
         # Map CDK AuthFlow model to protocol AdvancedAuth model
         obj["advanced_auth"] = self.advanced_auth.dict()
airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (4)

136-137: Consider using a type-safe alternative for cursor_field?

Instead of using a type ignore comment, we could improve type safety by explicitly casting cursor_field to InterpolatedString in the method signature. wdyt?

def set_initial_state(self, stream_state: StreamState) -> None:
-    self._cursor = stream_state.get(self.cursor_field.eval(self.config)) if stream_state else None  # type: ignore [union-attr]
+    cursor_field = self.cursor_field
+    if not isinstance(cursor_field, InterpolatedString):
+        raise TypeError("cursor_field must be an InterpolatedString")
+    self._cursor = stream_state.get(cursor_field.eval(self.config)) if stream_state else None

161-164: How about adding type hints for stream_slice values?

The type ignores could be replaced with proper type hints and validation. This would make the code more maintainable and type-safe. Here's a suggestion:

def _is_within_daterange_boundaries(
    self,
    record: Record,
-    start_datetime_boundary: Union[datetime.datetime, str],
-    end_datetime_boundary: Union[datetime.datetime, str],
+    start_datetime_boundary: Union[datetime.datetime, str, None],
+    end_datetime_boundary: Union[datetime.datetime, str, None],
) -> bool:

And in the observe method:

if (
    self._is_within_daterange_boundaries(
        record,
-        stream_slice.get(start_field),  # type: ignore [arg-type]
-        stream_slice.get(end_field),  # type: ignore [arg-type]
+        stream_slice.get(start_field, None),
+        stream_slice.get(end_field, None),
    )
    and is_highest_observed_cursor_value
):

373-375: Should we add type validation for end_time_option.field_name?

Similar to the cursor_field suggestion earlier, we could make this more type-safe by adding validation. Here's a thought:

if self.end_time_option and self.end_time_option.inject_into == option_type:
+    field_name = self.end_time_option.field_name
+    if not isinstance(field_name, InterpolatedString):
+        raise TypeError("end_time_option.field_name must be an InterpolatedString")
-    options[self.end_time_option.field_name.eval(config=self.config)] = stream_slice.get(  # type: ignore [union-attr]
+    options[field_name.eval(config=self.config)] = stream_slice.get(
        self._partition_field_end.eval(self.config)
    )

Line range hint 1-447: Consider a more systematic approach to type safety?

The current changes make the code mypy compliant, which is great! However, we could take this opportunity to implement a more systematic approach to type safety. This could involve:

  1. Creating type guards for InterpolatedString and other common types
  2. Adding runtime type validation in post_init for all fields that should be InterpolatedString
  3. Using TypeVar and Generic types for better type inference

Would you be interested in exploring these improvements in a follow-up PR?

airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py (1)

Line range hint 184-214: Nice implementation of the custom format converter! A few thoughts for consideration:

  1. What do you think about adding input validation for datetime_format in the constructor? This could catch invalid format strings early, wdyt?

  2. For performance optimization in parse_timestamp, we could potentially cache successful format matches. This would be particularly beneficial when processing large datasets with consistent timestamp formats. Would you be interested in exploring this optimization?

  3. Consider adding docstring examples of expected format strings? This could help users understand the expected format patterns.

Here's a potential implementation for validation and caching:

 def __init__(
     self,
     datetime_format: str,
     input_datetime_formats: Optional[List[str]] = None,
     is_sequential_state: bool = True,
     cursor_granularity: Optional[timedelta] = None,
 ):
     super().__init__(
         is_sequential_state=is_sequential_state, cursor_granularity=cursor_granularity
     )
+    # Validate format string
+    try:
+        self._parser.format(datetime.now(), datetime_format)
+    except ValueError as e:
+        raise ValueError(f"Invalid datetime format '{datetime_format}': {str(e)}")
     self._datetime_format = datetime_format
     self._input_datetime_formats = input_datetime_formats if input_datetime_formats else []
     self._input_datetime_formats += [self._datetime_format]
     self._parser = DatetimeParser()
+    self._format_cache: Dict[str, str] = {}

 def parse_timestamp(self, timestamp: str) -> datetime:
+    if timestamp in self._format_cache:
+        return self._parser.parse(timestamp, self._format_cache[timestamp])
     for datetime_format in self._input_datetime_formats:
         try:
-            return self._parser.parse(timestamp, datetime_format)
+            result = self._parser.parse(timestamp, datetime_format)
+            self._format_cache[timestamp] = datetime_format
+            return result
         except ValueError:
             pass
     raise ValueError(f"No format in {self._input_datetime_formats} matching {timestamp}")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 59d585e and 39d66af.

📒 Files selected for processing (29)
  • airbyte_cdk/connector_builder/message_grouper.py (5 hunks)
  • airbyte_cdk/destinations/vector_db_based/writer.py (1 hunks)
  • airbyte_cdk/entrypoint.py (3 hunks)
  • airbyte_cdk/logger.py (2 hunks)
  • airbyte_cdk/sources/abstract_source.py (1 hunks)
  • airbyte_cdk/sources/config.py (1 hunks)
  • airbyte_cdk/sources/connector_state_manager.py (4 hunks)
  • airbyte_cdk/sources/declarative/auth/oauth.py (1 hunks)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (3 hunks)
  • airbyte_cdk/sources/declarative/interpolation/jinja.py (3 hunks)
  • airbyte_cdk/sources/declarative/interpolation/macros.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
  • airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (5 hunks)
  • airbyte_cdk/sources/declarative/requesters/error_handlers/default_error_handler.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py (1 hunks)
  • airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py (1 hunks)
  • airbyte_cdk/sources/declarative/spec/spec.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py (3 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py (2 hunks)
  • airbyte_cdk/sources/streams/core.py (7 hunks)
  • airbyte_cdk/sources/streams/http/http.py (10 hunks)
  • airbyte_cdk/sources/streams/http/http_client.py (2 hunks)
  • airbyte_cdk/sources/utils/record_helper.py (1 hunks)
  • airbyte_cdk/sources/utils/schema_helpers.py (1 hunks)
  • airbyte_cdk/test/entrypoint_wrapper.py (3 hunks)
  • airbyte_cdk/test/mock_http/response_builder.py (1 hunks)
  • airbyte_cdk/utils/traced_exception.py (3 hunks)
✅ Files skipped from review due to trivial changes (6)
  • airbyte_cdk/test/mock_http/response_builder.py
  • airbyte_cdk/sources/declarative/requesters/error_handlers/default_error_handler.py
  • airbyte_cdk/sources/utils/schema_helpers.py
  • airbyte_cdk/sources/abstract_source.py
  • airbyte_cdk/sources/config.py
  • airbyte_cdk/sources/declarative/auth/oauth.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py
  • airbyte_cdk/destinations/vector_db_based/writer.py
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py
  • airbyte_cdk/connector_builder/message_grouper.py
  • airbyte_cdk/sources/connector_state_manager.py
  • airbyte_cdk/logger.py
  • airbyte_cdk/utils/traced_exception.py
  • airbyte_cdk/test/entrypoint_wrapper.py
  • airbyte_cdk/sources/streams/core.py
  • airbyte_cdk/sources/declarative/interpolation/jinja.py
  • airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py
  • airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
  • airbyte_cdk/sources/streams/http/http.py
🔇 Additional comments (12)
airbyte_cdk/sources/streams/concurrent/cursor.py (1)

Line range hint 477-482: Solid error handling! 💪

The error handling here is well thought out:

  • Catches ValueError gracefully
  • Uses a flag to prevent log spam
  • Returns True as a safe default
  • Clear warning message with good context
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

400-400: LGTM! Nice cleanup of message repository initialization.

The initialization is now more concise by evaluating the log level directly in the initialization line.


898-898: LGTM! Good addition of message repository to cursor.

Adding message repository to the cursor initialization improves logging capabilities and maintains consistency with the rest of the codebase.

airbyte_cdk/entrypoint.py (2)

18-18: LGTM! Clean import statement.

The direct import of orjson looks good. It's a cleaner approach than the previous version.


284-284: Nice performance optimization! 🚀

Using orjson.dumps directly is a great choice here! It's more efficient than the standard json library, and the implementation is clean and straightforward.

airbyte_cdk/sources/declarative/interpolation/macros.py (1)

119-119: LGTM! Clean type annotation improvement.

The removal of the type ignore comment is a good change as mypy can now properly infer the return type from parse_duration. The explicit return type annotation Union[datetime.timedelta, isodate.Duration] correctly matches the possible return types.

airbyte_cdk/sources/streams/http/http_client.py (1)

327-327: LGTM! Clean lambda type improvement.

The removal of the type ignore comment is safe here as the type is properly inferred from the formatter parameter.

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

131-131: Verify type safety of message_repository

Hey! I notice we're removing the type ignore comment here. This suggests that message_repository is now properly typed upstream. Would it be helpful to add a quick comment explaining how the type safety was achieved (e.g., through proper typing in the parent class or factory)? This could help future maintainers understand the typing guarantees, wdyt? 🤔

Let's verify the type safety with this script:

✅ Verification successful

Based on the search results, I can see that message_repository is properly typed across the codebase:

  1. In various files it's typed as MessageRepository or Optional[MessageRepository]
  2. The default implementation NoopMessageRepository() is used when not specified
  3. The type is consistently used in test files and production code
  4. The parameter is properly typed in parent classes and related components

Type safety is properly handled, no comment needed

The removal of the type ignore comment is justified as the message_repository parameter is properly typed throughout the codebase. The type system can now correctly verify the type safety without needing to ignore any checks. The parent classes and implementations all use proper type annotations, making the explicit type ignore unnecessary.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if message_repository is properly typed in parent classes and related files

# Look for message_repository type definitions
ast-grep --pattern 'message_repository: $_'

# Look for message_repository assignments or declarations
rg 'message_repository' -A 2 -B 2

Length of output: 78084

airbyte_cdk/sources/declarative/spec/spec.py (2)

12-12: LGTM! Clean import organization 🎯

The new import is properly grouped with related model imports.


Line range hint 34-34: Verify Python version compatibility

The use of the | operator in type annotations (dict[str, Mapping[str, Any] | str | AdvancedAuth]) requires Python 3.10+.

✅ Verification successful

The Python version requirement is compatible with the syntax! 🎉

Looking at the pyproject.toml, the package explicitly requires Python ^3.10,<3.13, which perfectly aligns with the use of the | operator in type annotations. The project's classifiers also confirm Python 3.10 support.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check minimum Python version requirement in setup.py or pyproject.toml
fd -e toml -e py -g "setup.py" -g "pyproject.toml" --exec grep -l "python"

Length of output: 172


Script:

#!/bin/bash
# Let's try a different approach to find Python version requirements
# First, let's check for setup.py
find . -name "setup.py" -type f -exec cat {} \;

# Also check for pyproject.toml
find . -name "pyproject.toml" -type f -exec cat {} \;

# As a backup, let's also check for Python version in requirements files
find . -name "*requirements*.txt" -type f -exec cat {} \;

Length of output: 8883

airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py (2)

144-144: Clean type handling improvement!

Nice work removing the type ignore comments. The direct return of dt_object with proper type checking is much cleaner and mypy-compliant. The error message is also very helpful for debugging.

Also applies to: 181-181


Line range hint 184-214: Security and performance considerations

A few thoughts on hardening this implementation:

  1. Should we consider adding a maximum limit to input_datetime_formats to prevent potential DoS through excessive format attempts?

  2. Have you considered validating all input formats in the constructor, similar to the output format? This could prevent potential issues with malformed format strings.

Let's check if there are any existing format string validation patterns in the codebase:

@aaronsteers aaronsteers changed the title chore: make codebase mypy compliant chore: make codebase compliant with mypy Dec 3, 2024
Copy link
Contributor

@bnchrch bnchrch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple questions but nothing blocking.

This PR leaves the codebase much better than before.

Well done 👏

a-man-in-a-suit-says-you-re-my-type

.github/workflows/connector-tests.yml Show resolved Hide resolved
airbyte_cdk/entrypoint.py Show resolved Hide resolved
@aaronsteers aaronsteers changed the title chore: make codebase compliant with mypy chore: make full CDK codebase compliant with mypy Dec 4, 2024
@aaronsteers aaronsteers merged commit c347760 into main Dec 4, 2024
19 checks passed
@aaronsteers aaronsteers deleted the aj/chore/make-mypy-compliant branch December 4, 2024 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants