-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: make full CDK codebase compliant with mypy
#105
Conversation
📝 Walkthrough📝 WalkthroughThe changes in this pull request involve multiple updates across various files, primarily focusing on improving code quality through enhanced type annotations, import statement adjustments, and minor formatting updates. The modifications aim to standardize the use of the Changes
Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
What do you think about these suggestions? Would you like to proceed with any of them? Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (18)
airbyte_cdk/sources/declarative/decoders/noop_decoder.py (1)
17-20
: Could we adjust the method signature to align with the base class?Adding
# type: ignore[override]
suppresses the type checker warning about method override. Perhaps we can modify thedecode
method's signature to match the base class inDecoder
and avoid the need for the ignore comment. What do you think?airbyte_cdk/utils/spec_schema_transformations.py (1)
7-7
: LGTM! The type annotations look great! 🎉The changes properly type the JSON schema operations using
dict[str, Any]
. Would you consider enhancing the docstring to include the type information as well? Something like:def resolve_refs(schema: dict[str, Any]) -> dict[str, Any]: """ Resolves JSON schema references within the provided schema. Args: schema: A JSON schema dictionary that may contain $ref references Returns: A new schema dictionary with all references resolved """wdyt?
Also applies to: 12-12, 24-24
airbyte_cdk/sources/streams/http/requests_native_auth/abstract_token.py (1)
8-8
: The type hints are looking good! One small suggestion though 🤔Instead of returning
Any
, we could be more specific since we know this method returns the modified request. How about:def __call__(self, request: requests.PreparedRequest) -> requests.PreparedRequest:This would make it clearer to callers what they're getting back. wdyt?
Also applies to: 15-15
airbyte_cdk/utils/event_timing.py (2)
23-25
: Consider more specific type annotations for collectionsThe type annotations for
events
andstack
could be more specific. What do you think about:
events: dict[str, Event]
since it only stores Event instancesstack: list[Event]
since it manages Event objects- self.events: dict[str, Any] = {} - self.stack: list[Any] = [] + self.events: dict[str, Event] = {} + self.stack: list[Event] = []wdyt? This would make the types more precise and help mypy catch potential type errors.
80-80
: Consider more specific Generator type parametersThe Generator type could be more specific about what it yields and sends. How about:
-def create_timer(name: str) -> Generator[EventTimer, Any, None]: +def create_timer(name: str) -> Generator[EventTimer, None, None]:Since the generator doesn't use the send() method, we can use
None
instead ofAny
for better type safety. wdyt?airbyte_cdk/config_observation.py (2)
Line range hint
24-29
: Consider using TypeVar for better type safety in ObservedDictInstead of using Any for the mapping's key and value types, we could make the class generic. How about:
+from typing import TypeVar, Generic +K = TypeVar('K') +V = TypeVar('V') -class ObservedDict(dict): # type: ignore +class ObservedDict(Generic[K, V], dict[K, V]): def __init__( self, - non_observed_mapping: MutableMapping[Any, Any], + non_observed_mapping: MutableMapping[K, V], observer: ConfigObserver, update_on_unchanged_value: bool = True, ) -> None:This would provide better type safety while maintaining flexibility. wdyt?
Line range hint
89-89
: Consider adding return type annotation to update methodThe
update
method is missing a return type annotation. Should we add-> None
for consistency?- def update(self): + def update(self) -> None:airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (1)
50-53
: Great consistency in parameter ordering!The consistent parameter ordering across methods (
self
first, followed by core parameters) improves readability. The explicit type hints forstream: AbstractFileBasedStream
make the code more type-safe.One thought: Would it make sense to add docstring type hints as well for better IDE support, wdyt? 🤔
Also applies to: 91-91, 111-114
airbyte_cdk/utils/traced_exception.py (1)
137-147
: Consider consolidating type ignore commentsThe repeated type ignore comments for union attributes could potentially be consolidated at the class level. Would you consider using a class-level type ignore if these union types are a known limitation of the model? This might make the code more readable. wdyt?
+ # type: ignore[union-attr] # AirbyteMessage with MessageType.TRACE has AirbyteTraceMessage if error_message.trace.error.message: - error_message.trace.error.message = filter_secrets( # type: ignore[union-attr] # AirbyteMessage with MessageType.TRACE has AirbyteTraceMessage + error_message.trace.error.message = filter_secrets( error_message.trace.error.message )airbyte_cdk/sources/connector_state_manager.py (1)
Line range hint
134-138
: Good type safety improvements in state extractionThe addition of type ignore comments with clear explanations helps mypy understand the code better. However, would it be worth considering creating a type guard function to validate these attributes at runtime? This could eliminate the need for type ignore comments. wdyt?
Example:
def has_stream_descriptor(state: AirbyteStateMessage) -> TypeGuard[AirbyteStateMessage]: return (hasattr(state, 'stream') and hasattr(state.stream, 'stream_descriptor'))airbyte_cdk/cli/source_declarative_manifest/_run.py (1)
163-170
: Nice type safety improvements! Would you consider a small enhancement to the error message?The type annotations and null check are great improvements. For the error message, what do you think about making it more actionable by including an example of a valid config structure? wdyt?
- f"of the config but config only has keys: {list(config.keys() if config else [])}" + f"of the config but config only has keys: {list(config.keys() if config else [])}. " + + "Example valid config: {'__injected_declarative_manifest': {'streams': [...]}}"airbyte_cdk/sources/utils/transform.py (1)
173-179
: Consider adding a docstring for the resolve function?The type annotations look great! Would you consider adding a brief docstring to explain what the resolve function does? wdyt?
def resolve(subschema: dict[str, Any]) -> dict[str, Any]: + """ + Resolves JSON Schema references and returns the resolved schema. + :param subschema: The schema that may contain a $ref + :return: The resolved schema + """ if "$ref" in subschema:airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (1)
33-35
: LGTM! The type hints look great.Nice work updating to the modern union syntax with
|
. Would you consider adding a docstring to document these optional parameters and their default values? This would help other developers understand when to use them, wdyt?Also applies to: 38-38
airbyte_cdk/entrypoint.py (2)
18-18
: LGTM! Good performance improvement using orjson.The switch to
orjson
should provide better performance. Would you consider adding a try-catch block around theorjson.dumps
calls to handle potential serialization errors gracefully, wdyt?
Line range hint
246-249
: LGTM! Type annotation for state parameter is added.The
list[Any]
type annotation is good for backward compatibility. Would you consider using a more specific type for the state items if we know their structure, wdyt? This could help catch type-related issues earlier.airbyte_cdk/sources/streams/http/http.py (2)
13-13
: LGTM! Consider standardizing the deprecation message format.The switch to
typing_extensions.deprecated
improves type checking. The deprecation messages are clear but have slight format inconsistencies - some have trailing spaces. Would you like to standardize them? wdyt?- "Deprecated as of CDK version 3.0.0. ", - "You should set error_handler explicitly in HttpStream.get_error_handler() instead.", + "Deprecated as of CDK version 3.0.0. You should set error_handler explicitly in HttpStream.get_error_handler() instead."Also applies to: 124-125, 135-136, 146-147, 157-158, 606-607, 622-623
Line range hint
127-162
: Consider adding migration examples in docstrings.The deprecation messages clearly indicate what to use instead, but would it be helpful to add example code in the docstrings showing how to migrate from the old to the new approach? This could make the transition smoother for users. wdyt?
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2079-2079
: Consider adding more context to the descriptions.The descriptions for several OAuth-related fields could be enhanced with more context about when and why they would be used. For example:
extract_output
: Could explain common scenarios where certain fields need to be extractedstate
: Could explain why state parameter configuration is important for securitystate_key
: Could mention common variations in state parameter names across providersauth_code_key
: Could explain why some providers use different names for the authorization codeAlso applies to: 2091-2091, 2124-2124, 2131-2131
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
poetry.lock
is excluded by!**/*.lock
📒 Files selected for processing (53)
.github/dependabot.yml
(2 hunks).github/workflows/publish_sdm_connector.yml
(1 hunks).github/workflows/pytest_matrix.yml
(1 hunks).github/workflows/python_lint.yml
(1 hunks)airbyte_cdk/cli/source_declarative_manifest/_run.py
(2 hunks)airbyte_cdk/config_observation.py
(1 hunks)airbyte_cdk/connector_builder/main.py
(1 hunks)airbyte_cdk/destinations/destination.py
(1 hunks)airbyte_cdk/destinations/vector_db_based/writer.py
(1 hunks)airbyte_cdk/entrypoint.py
(1 hunks)airbyte_cdk/logger.py
(1 hunks)airbyte_cdk/sources/connector_state_manager.py
(1 hunks)airbyte_cdk/sources/declarative/auth/selective_authenticator.py
(1 hunks)airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(4 hunks)airbyte_cdk/sources/declarative/decoders/noop_decoder.py
(1 hunks)airbyte_cdk/sources/declarative/requesters/request_options/interpolated_request_options_provider.py
(1 hunks)airbyte_cdk/sources/declarative/retrievers/async_retriever.py
(1 hunks)airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py
(3 hunks)airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py
(4 hunks)airbyte_cdk/sources/file_based/file_types/avro_parser.py
(2 hunks)airbyte_cdk/sources/file_based/file_types/csv_parser.py
(1 hunks)airbyte_cdk/sources/file_based/file_types/excel_parser.py
(1 hunks)airbyte_cdk/sources/file_based/file_types/jsonl_parser.py
(1 hunks)airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py
(2 hunks)airbyte_cdk/sources/file_based/stream/concurrent/adapters.py
(2 hunks)airbyte_cdk/sources/streams/concurrent/abstract_stream.py
(1 hunks)airbyte_cdk/sources/streams/concurrent/adapters.py
(1 hunks)airbyte_cdk/sources/streams/concurrent/availability_strategy.py
(1 hunks)airbyte_cdk/sources/streams/concurrent/cursor.py
(1 hunks)airbyte_cdk/sources/streams/core.py
(3 hunks)airbyte_cdk/sources/streams/http/http.py
(7 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/abstract_token.py
(1 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
(6 hunks)airbyte_cdk/sources/utils/transform.py
(7 hunks)airbyte_cdk/test/entrypoint_wrapper.py
(1 hunks)airbyte_cdk/utils/airbyte_secrets_utils.py
(1 hunks)airbyte_cdk/utils/event_timing.py
(4 hunks)airbyte_cdk/utils/message_utils.py
(1 hunks)airbyte_cdk/utils/spec_schema_transformations.py
(2 hunks)airbyte_cdk/utils/traced_exception.py
(3 hunks)bin/run-mypy-on-modified-files.sh
(0 hunks)pyproject.toml
(2 hunks)unit_tests/connector_builder/test_connector_builder_handler.py
(1 hunks)unit_tests/connector_builder/test_message_grouper.py
(1 hunks)unit_tests/destinations/test_destination.py
(6 hunks)unit_tests/sources/declarative/incremental/test_per_partition_cursor_integration.py
(1 hunks)unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py
(1 hunks)unit_tests/sources/declarative/schema/source_test/SourceTest.py
(1 hunks)unit_tests/sources/declarative/test_concurrent_declarative_source.py
(1 hunks)unit_tests/sources/test_source.py
(1 hunks)unit_tests/test/test_entrypoint_wrapper.py
(1 hunks)unit_tests/test_entrypoint.py
(1 hunks)unit_tests/utils/test_traced_exception.py
(1 hunks)
💤 Files with no reviewable changes (1)
- bin/run-mypy-on-modified-files.sh
✅ Files skipped from review due to trivial changes (15)
- .github/workflows/pytest_matrix.yml
- airbyte_cdk/sources/declarative/requesters/request_options/interpolated_request_options_provider.py
- airbyte_cdk/sources/file_based/file_types/jsonl_parser.py
- airbyte_cdk/test/entrypoint_wrapper.py
- airbyte_cdk/sources/file_based/file_types/csv_parser.py
- airbyte_cdk/destinations/destination.py
- .github/dependabot.yml
- airbyte_cdk/sources/streams/concurrent/abstract_stream.py
- unit_tests/test/test_entrypoint_wrapper.py
- airbyte_cdk/sources/streams/concurrent/availability_strategy.py
- airbyte_cdk/sources/file_based/stream/abstract_file_based_stream.py
- unit_tests/utils/test_traced_exception.py
- airbyte_cdk/connector_builder/main.py
- airbyte_cdk/utils/airbyte_secrets_utils.py
- .github/workflows/publish_sdm_connector.yml
🧰 Additional context used
📓 Learnings (1)
airbyte_cdk/cli/source_declarative_manifest/_run.py (1)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.273Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.
🪛 actionlint (1.7.4)
.github/workflows/python_lint.yml
78-78: property "changed-py-files" is not defined in object type {}
(expression)
🔇 Additional comments (37)
unit_tests/sources/declarative/incremental/test_per_partition_cursor_integration.py (1)
8-8
: LGTM!
The import of orjson
is appropriate and enhances the efficiency of JSON serialization in the tests.
unit_tests/connector_builder/test_message_grouper.py (1)
9-9
: LGTM!
Importing orjson
is beneficial for improved performance in JSON handling within the test cases.
unit_tests/sources/declarative/partition_routers/test_parent_state_stream.py (1)
7-7
: LGTM!
The addition of orjson
import is justified as it's utilized for efficient serialization in state comparisons.
unit_tests/sources/declarative/schema/source_test/SourceTest.py (1)
7-7
: LGTM!
Adding the -> None
return type annotation to the __init__
method improves type clarity.
airbyte_cdk/utils/message_utils.py (1)
11-12
: Nice work on the type ignore comments! 👍
The type ignore comments are well-documented with clear explanations. This is exactly how type ignores should be used - when we know more about the types than the type checker can infer.
Also applies to: 20-21
airbyte_cdk/sources/declarative/auth/selective_authenticator.py (1)
31-36
: LGTM! Nice formatting improvement 👍
The multi-line formatting makes the code more readable, and the type ignore comment is well-documented explaining why it's needed (dpath's type hints being overly restrictive).
airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py (3)
25-29
: LGTM! Good use of type ignore for the override
The type ignore comment is necessary here since the signature differs slightly from the parent class. The formatting also improves readability.
40-43
: Nice type improvement! 🎯
The parameter type change from Stream
to AbstractFileBasedStream
makes the type more precise and matches the actual usage. This should help catch type-related issues earlier.
55-55
: LGTM! Good addition of return type
Adding the -> None
return type annotation to __init__
improves type safety and mypy compliance.
airbyte_cdk/logger.py (1)
Line range hint 75-75
: Consider explicit return type annotation for orjson.dumps
The type ignore comment suggests orjson's return type isn't properly typed. Would it make sense to add a more explicit type hint?
✅ Verification successful
Let me check the orjson type hints in more detail to provide a better recommendation.
Let me try one more check to see the orjson usage pattern.
The type ignore comment seems appropriate here, wanna keep it? 🤔
Looking at the codebase usage, the type ignore comment is consistently used across multiple files when dealing with orjson.dumps().decode()
operations. The comment accurately explains that it always returns a string, and this pattern is well-established throughout the project. Since there are no type stubs available for orjson (types-orjson
not found), this is actually the cleanest approach to handle the typing.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if orjson has type stubs
pip show types-orjson 2>/dev/null || echo "No type stubs found for orjson"
# Check if orjson is imported in other files with type annotations
rg -l "import orjson" | xargs rg "# type: ignore.*orjson"
Length of output: 544
Script:
#!/bin/bash
# Check orjson usage and type hints in the codebase
ast-grep --pattern 'orjson.dumps($$$)' -f py
# Check the full context of the logger implementation
rg -B5 -A5 'orjson.dumps.*type: ignore' airbyte_cdk/logger.py
Length of output: 928
Script:
#!/bin/bash
# Check orjson usage pattern without -f flag
ast-grep --pattern 'orjson.dumps($$$)'
# Check if there are any type annotations for orjson.dumps in the codebase
rg "orjson\.dumps.*->.*:"
Length of output: 10229
airbyte_cdk/destinations/vector_db_based/writer.py (1)
88-90
: LGTM! Nice type safety improvement.
The null check for message.record
before accessing it is a great addition. The error message is clear and helps with debugging. This change aligns well with making the codebase mypy compliant.
airbyte_cdk/sources/declarative/retrievers/async_retriever.py (1)
7-7
: Good move on standardizing the deprecation import!
Using typing_extensions.deprecated
instead of deprecated.classic
is more aligned with Python's type system and helps with mypy compliance. This change maintains the same functionality while improving type safety.
airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py (1)
31-36
: Nice type safety improvements!
The type: ignore[override]
comment and parameter reordering make the code more mypy compliant while maintaining clear method signatures.
airbyte_cdk/utils/traced_exception.py (2)
107-108
: Nice type safety improvement for variadic arguments!
Adding Any
type to *args
and **kwargs
improves type safety. However, would it make sense to be more specific about the expected types here? For example, if we know what kinds of arguments might be passed, we could use a more restrictive type. wdyt?
133-134
: Good addition of runtime null check!
Adding a null check for trace.error
prevents potential runtime errors. This is a good defensive programming practice.
airbyte_cdk/sources/file_based/file_types/excel_parser.py (1)
10-10
: Good choice using orjson!
Using orjson
is a great choice for performance. It's typically much faster than the standard json library.
airbyte_cdk/cli/source_declarative_manifest/_run.py (1)
28-28
: LGTM! Clean import statement.
The direct import of orjson
aligns with the project's import style.
pyproject.toml (2)
101-104
: LGTM! Comprehensive type stubs coverage.
The addition of type stubs for commonly used libraries will improve type checking accuracy.
146-146
: LGTM! Simplified type checking command.
The switch to using poetry directly for type checking makes the process more straightforward.
airbyte_cdk/sources/utils/transform.py (3)
8-8
: LGTM! Well-organized imports with proper type hints.
The imports are cleanly organized, and all necessary types are properly imported.
Also applies to: 10-10
70-71
: LGTM! Clear type annotation for the callback.
The type annotation for the normalization callback makes the expected function signature clear to users.
144-148
: LGTM! Properly typed generator function.
The type annotations for the normalizer generator function improve type safety.
unit_tests/destinations/test_destination.py (1)
9-9
: LGTM! Great type safety improvements
Nice work on improving type safety with:
- Added return type annotations
- Added Path type hint for tmp_path
- Consistent import style for orjson
Also applies to: 13-13, 133-133, 142-142, 153-157, 166-170, 215-220
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)
10-10
: LGTM! Better deprecation handling
Nice improvements:
- Using typing_extensions for deprecation is more modern
- More descriptive deprecation message format
Also applies to: 146-146
airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py (2)
93-93
: LGTM! Good addition of return type annotation.
The explicit None
return type annotation helps with code clarity and type checking.
109-109
: LGTM! Property setter return type is correctly annotated.
The explicit None
return type for the setter method follows best practices for type annotations.
airbyte_cdk/sources/streams/concurrent/adapters.py (1)
11-11
: Verify the deprecated decorator functionality remains consistent.
Good move switching to typing_extensions.deprecated
! Let's verify this change doesn't affect existing deprecation warnings.
✅ Verification successful
The previous search might have been too specific. Let's broaden our search to understand the usage of deprecated decorators and verify the change's impact. I'm checking both the general usage of the decorator and the specific changes in the file.
The deprecated decorator change looks good and consistent with existing usage!
The switch to typing_extensions.deprecated
is safe as all current usages follow the same pattern, accepting a deprecation message and optional category parameter. The functionality remains identical across the codebase, including the experimental class warnings in concurrent streams.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any custom deprecation warning implementations that might need updates
rg -g '!*.pyc' -g '!*test*' "deprecated.*classic" .
Length of output: 53
Script:
#!/bin/bash
# Let's check for any usage of deprecated decorator in the codebase
rg -g '!*.pyc' "@deprecated" .
# Also check the specific file for the old import
rg "deprecated" airbyte_cdk/sources/streams/concurrent/adapters.py -B2 -A2
Length of output: 2654
unit_tests/sources/test_source.py (1)
11-11
: LGTM! Nice performance optimization.
The switch to orjson
for JSON serialization is a good choice as it's known to be more performant than the standard library's json module.
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
221-223
: LGTM! Good defensive programming.
The added null check for record.associated_slice
prevents potential null pointer dereference. The condition is well-structured and maintains the existing logic while adding safety.
unit_tests/test_entrypoint.py (1)
14-14
: LGTM! Consistent with the codebase-wide improvement.
The addition of orjson
import aligns with the broader effort to standardize and optimize JSON handling across the codebase.
airbyte_cdk/sources/streams/core.py (1)
13-13
: LGTM! The deprecation messages are clear and informative.
The changes align well with the mypy compliance objective. The deprecation messages provide clear guidance on migration paths.
Also applies to: 95-96, 433-436
unit_tests/connector_builder/test_connector_builder_handler.py (1)
14-14
: Verify orjson compatibility with existing JSON usage patterns.
The switch to orjson
for JSON handling is a good performance improvement. However, should we add tests to verify that orjson
handles all edge cases the same way as the standard json
module? wdyt?
Also applies to: 433-436
unit_tests/sources/declarative/test_concurrent_declarative_source.py (1)
13-13
: LGTM! The import change aligns with making the codebase mypy compliant.
The change from deprecated.classic
to typing_extensions
is consistent with the PR objective of improving type safety. The deprecated
decorator functionality should remain the same since both implementations follow the same interface.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2055-2058
: LGTM! Clear and helpful example for access_token_headers.
The example demonstrates how to use base64 encoding for client credentials in the Authorization header, which is a common pattern in OAuth implementations.
2067-2072
: LGTM! Well-structured example for access_token_params.
The example clearly shows how to structure the parameters for token exchange, using placeholders that will be replaced with actual values during runtime.
airbyte_cdk/sources/file_based/file_types/avro_parser.py (2)
6-6
: LGTM on the typing import! 👍
Clean addition of cast
to the existing typing imports.
186-186
: 🛠️ Refactor suggestion
Consider adding type validation before casting
The cast to dict
helps with type safety, but I notice that field["type"]
could potentially be a string based on the _convert_avro_type_to_json
method. What do you think about adding a type check before the cast to prevent potential runtime errors? Something like this maybe?
- field["name"]: cast(dict, field["type"]) for field in schema["fields"]
+ field["name"]: cast(dict, field["type"]) if isinstance(field["type"], dict) else field["type"] for field in schema["fields"]
Let's verify how field["type"] is used in the codebase:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
.github/workflows/python_lint.yml (2)
75-75
: Nice catch on fixing the typo! 🔍The comment now correctly reads "Job-specific". However, I notice this typo still exists in the other jobs. Would you like me to help fix those as well? wdyt?
Line range hint
1-78
: Consider optimizing the workflow structure? 🤔I notice there's quite a bit of duplication in the common steps across all three jobs. Would you be interested in using a composite action or reusable workflow to DRY this up? This could make maintenance easier and reduce the file size significantly. Here's what I'm thinking:
- Create a reusable workflow for the common setup steps
- Or create a composite action for the Python/Poetry setup
wdyt? Happy to provide an example if you're interested!
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
.github/workflows/python_lint.yml
(1 hunks)airbyte_cdk/sources/streams/core.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/streams/core.py
🔇 Additional comments (1)
.github/workflows/python_lint.yml (1)
77-78
: Simplified MyPy execution looks good! 👍
The removal of the conditional check aligns with making the codebase MyPy compliant by ensuring all files are type-checked. This matches the previous review discussion about removing the condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2068-2071
: Consider using a more secure example for OAuth headers?The current example shows encoding of client credentials in the Authorization header. While technically correct, would it be better to use a placeholder like
{base64Encoder:CLIENT_ID:CLIENT_SECRET}
to avoid any confusion about exposing real credentials in documentation? wdyt?- "Authorization": "Basic {base64Encoder:{client_id}:{client_secret}}", + "Authorization": "Basic {base64Encoder:CLIENT_ID:CLIENT_SECRET}",
2080-2085
: How about improving the JSON formatting in the example?The example is clear but the formatting could be more readable. Would you consider this more readable format? wdyt?
- - access_token_params: - { - "{auth_code_key}": "{{auth_code_key}}", - "{client_id_key}": "{{client_id_key}}", - "{client_secret_key}": "{{client_secret_key}}", - } + - access_token_params: + { + "{auth_code_key}": "{{auth_code_key}}", + "{client_id_key}": "{{client_id_key}}", + "{client_secret_key}": "{{client_secret_key}}" + }airbyte_cdk/sources/streams/concurrent/adapters.py (2)
53-56
: Consider enhancing the deprecation message with migration guidance.The experimental warning is clear, but would it be helpful to provide guidance on the recommended alternative or future direction? This could help users prepare for eventual migration, wdyt?
@deprecated( - "This class is experimental. Use at your own risk.", + "This class is experimental and may change in future versions. Use at your own risk. " + "Consider using <alternative_approach> for new implementations.", category=ExperimentalClassWarning, )
Line range hint
261-270
: Consider enriching error context with slice information.When handling exceptions, including the slice information in the error context could make debugging easier. What do you think about enhancing the error handling like this?
except Exception as e: display_message = self._stream.get_error_display_message(e) if display_message: - raise ExceptionWithDisplayMessage(display_message) from e + raise ExceptionWithDisplayMessage( + f"{display_message} (slice: {self._slice})" + ) from e else: raise e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml
(4 hunks)airbyte_cdk/sources/declarative/decoders/noop_decoder.py
(1 hunks)airbyte_cdk/sources/declarative/retrievers/async_retriever.py
(2 hunks)airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py
(4 hunks)airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py
(5 hunks)airbyte_cdk/sources/file_based/stream/concurrent/adapters.py
(3 hunks)airbyte_cdk/sources/streams/concurrent/abstract_stream.py
(2 hunks)airbyte_cdk/sources/streams/concurrent/adapters.py
(2 hunks)airbyte_cdk/sources/streams/concurrent/availability_strategy.py
(3 hunks)airbyte_cdk/sources/streams/core.py
(3 hunks)airbyte_cdk/sources/streams/http/http.py
(7 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
(6 hunks)pyproject.toml
(2 hunks)unit_tests/sources/declarative/test_concurrent_declarative_source.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (11)
- airbyte_cdk/sources/declarative/decoders/noop_decoder.py
- airbyte_cdk/sources/declarative/retrievers/async_retriever.py
- unit_tests/sources/declarative/test_concurrent_declarative_source.py
- airbyte_cdk/sources/file_based/stream/concurrent/adapters.py
- airbyte_cdk/sources/streams/concurrent/abstract_stream.py
- airbyte_cdk/sources/streams/core.py
- airbyte_cdk/sources/streams/concurrent/availability_strategy.py
- pyproject.toml
- airbyte_cdk/sources/file_based/availability_strategy/abstract_file_based_availability_strategy.py
- airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
- airbyte_cdk/sources/file_based/availability_strategy/default_file_based_availability_strategy.py
🔇 Additional comments (3)
airbyte_cdk/sources/streams/concurrent/adapters.py (1)
11-11
: LGTM! Good choice on the typing_extensions import.
The switch to typing_extensions.deprecated
is a more lightweight and standardized approach for deprecation handling.
airbyte_cdk/sources/streams/http/http.py (2)
124-125
: LGTM! Great improvement in deprecation messaging.
The updated deprecation messages are more informative and consistent, providing clear version information and migration guidance.
Also applies to: 135-136, 146-147, 157-158
606-607
: LGTM! Consistent deprecation handling.
The deprecation messages for adapter classes maintain consistency with the property deprecations, providing clear migration paths.
Also applies to: 622-623
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (9)
airbyte_cdk/sources/embedded/base_integration.py (1)
55-57
: LGTM! Quick thought about the type ignore commentThe formatting looks clean and the type ignore directive is appropriately scoped. Would you consider adding a brief inline comment explaining why we need the type ignore here? Something like:
# type: ignore[union-attr, arg-type] # stream.source_defined_primary_key can be None
This would help future maintainers understand the type complexity, wdyt? 🤔
airbyte_cdk/sources/declarative/datetime/min_max_datetime.py (2)
44-52
: Consider using Optional type for better type clarityThe type ignore comments suggest we're dealing with optional values. Would it make more sense to explicitly declare
min_datetime
andmax_datetime
asOptional[Union[InterpolatedString, str]]
in the class definition, wdyt? This would better represent the actual type hierarchy and might eliminate the need for type ignores.
69-75
: Consider adding runtime type check to satisfy mypyInstead of using type ignore, we could add a runtime check to help mypy understand the type. What do you think about this approach?
- str( - self.datetime.eval( # type: ignore[union-attr] # str has no attribute "eval" - config, - **additional_parameters, - ) - ), + str( + self.datetime.eval(config, **additional_parameters) + if isinstance(self.datetime, InterpolatedString) + else self.datetime + ),airbyte_cdk/sources/http_logger.py (1)
45-45
: Consider using TypedDict for better type safety? 🤔Instead of using a type ignore, we could make the types more explicit. What do you think about defining a TypedDict for the log message structure? Something like:
from typing import TypedDict class HttpSection(TypedDict, total=False): title: str description: str is_auxiliary: bool class LogMessage(TypedDict): http: HttpSection log: dict[str, str] url: dict[str, str]This would give us better type safety and eliminate the need for the type ignore. Thoughts?
airbyte_cdk/sources/declarative/interpolation/jinja.py (2)
123-123
: Consider explicit typing to avoid type ignoreInstead of using a type ignore comment, what if we explicitly type the
undeclared
variable? This would make the code more type-safe and clearer. wdyt?- undeclared = self._find_undeclared_variables(s) - undeclared_not_in_context = {var for var in undeclared} # type: ignore [attr-defined] + undeclared: set[str] = self._find_undeclared_variables(s) + undeclared_not_in_context = {var for var in undeclared}
Line range hint
31-49
: Consider adding type-related documentationThe class documentation is great, but given the focus on type safety, would it be helpful to add some documentation about the expected types and None handling? This could help future maintainers understand the type-related decisions. wdyt?
For example:
class JinjaInterpolation(Interpolation): """ Interpolation strategy using the Jinja2 template engine. Type handling: - Input strings can be None, which will be treated as empty strings - Template compilation results are cached for performance - Undeclared variables are returned as a set of strings [rest of existing docstring...] """airbyte_cdk/sources/file_based/file_types/excel_parser.py (3)
73-76
: Consider addressing the underlying type issue instead of using type ignore.I notice we're using
# type: ignore [index]
here. Could we perhaps make this more type-safe by explicitly typing the return value ofdtype_to_json_type
? For example, we could useTypedDict
to define the expected structure. What do you think? 🤔
193-193
: Document why the type ignore is necessary here.The
# type: ignore [arg-type]
seems to be handling a typing limitation with the calamine engine. Could we add a brief comment explaining why this is needed? Something like:# type: ignore [arg-type] # calamine engine accepts Union[IOBase, str, Path] but type stubs don't reflect this
What do you think? 🤔
Line range hint
1-193
: Consider creating type stub files for better type safety.I notice we have a few
type: ignore
comments throughout the file. Would it make sense to create.pyi
stub files for some of these third-party interactions? This could give us better type safety without the need for inline ignores. What are your thoughts on this approach? 🤔
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (13)
airbyte_cdk/cli/source_declarative_manifest/_run.py
(3 hunks)airbyte_cdk/connector_builder/message_grouper.py
(2 hunks)airbyte_cdk/destinations/vector_db_based/writer.py
(1 hunks)airbyte_cdk/sources/declarative/datetime/min_max_datetime.py
(2 hunks)airbyte_cdk/sources/declarative/interpolation/jinja.py
(2 hunks)airbyte_cdk/sources/embedded/base_integration.py
(1 hunks)airbyte_cdk/sources/file_based/file_types/avro_parser.py
(3 hunks)airbyte_cdk/sources/file_based/file_types/excel_parser.py
(3 hunks)airbyte_cdk/sources/http_logger.py
(2 hunks)airbyte_cdk/sources/streams/concurrent/adapters.py
(3 hunks)airbyte_cdk/sources/streams/core.py
(4 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
(7 hunks)airbyte_cdk/sources/utils/transform.py
(7 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- airbyte_cdk/destinations/vector_db_based/writer.py
- airbyte_cdk/sources/file_based/file_types/avro_parser.py
- airbyte_cdk/sources/streams/core.py
- airbyte_cdk/cli/source_declarative_manifest/_run.py
- airbyte_cdk/sources/streams/concurrent/adapters.py
- airbyte_cdk/sources/utils/transform.py
- airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
🔇 Additional comments (6)
airbyte_cdk/connector_builder/message_grouper.py (2)
277-277
: LGTM! Type ignore is appropriate here
The type ignore comment is correctly handling both the argument type and union attribute access issues. The actual runtime type is compatible, but mypy can't verify this due to the dynamic nature of protocol messages.
358-358
: LGTM! Type ignore is appropriate here
The type ignore comment is correctly handling the argument type mismatch. The deepcopy operation is necessary to prevent mutation of the records, and the runtime type is compatible despite mypy's concerns.
airbyte_cdk/sources/http_logger.py (2)
17-17
: Nice type hint improvement! 🎯
The change from bool
to bool | None
makes the type hint more precise by explicitly indicating that None
is a valid value. This aligns perfectly with mypy's expectations, wdyt?
48-48
: 🛠️ Refactor suggestion
Should we align the return type with actual usage? 🤔
The type ignore suggests a mismatch between the expected JsonType
and the actual dict[str, object]
. Would it make sense to:
- Update the return type annotation to match the actual structure, or
- Ensure the dictionary values conform to
JsonType
?
Let's verify the JsonType
definition:
airbyte_cdk/sources/file_based/file_types/excel_parser.py (2)
10-10
: LGTM! Nice standardization of orjson import.
The direct import of orjson aligns well with the codebase-wide standardization effort. This change makes the code more consistent with other files in the project.
Line range hint 116-123
: Great improvements to datetime handling! Let's verify the changes.
The explicit date_format="iso"
and date_unit="us"
parameters ensure proper datetime serialization. The error handling is also more specific now.
Would you like me to help verify that this handles all datetime edge cases correctly?
✅ Verification successful
Let me check the test coverage for Excel parser specifically to verify datetime handling.
Great news! The datetime handling looks solid and well-tested.
Looking at the test files, I can confirm that:
- The Excel parser correctly handles various datetime formats including dates, timestamps with milliseconds and microseconds
- The
date_format="iso"
anddate_unit="us"
parameters are working as expected, with test coverage inexcel_all_types_scenario
showing proper serialization of:- Full datetime values (e.g.,
"2022-05-29T00:00:00.000000"
) - Time values with microsecond precision (e.g.,
"12:00:00.456789"
)
- Full datetime values (e.g.,
The changes align perfectly with the pandas datetime handling and ensure consistent serialization across the codebase. Nice work! 🎉
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for datetime-related test cases to ensure coverage
rg -l "datetime|timestamp" --type py | while read -r file; do
echo "=== $file ==="
rg "def test.*(?:datetime|timestamp)" "$file" || true
done
Length of output: 31841
Script:
#!/bin/bash
# Check Excel parser test file content
echo "=== Excel Parser Tests ==="
cat unit_tests/sources/file_based/file_types/test_excel_parser.py
# Look for datetime-related test data in Excel scenarios
echo -e "\n=== Excel Scenarios ==="
cat unit_tests/sources/file_based/scenarios/excel_scenarios.py
Length of output: 21362
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (11)
.github/workflows/python_lint.yml (1)
Line range hint
31-31
: Should we fix the "Job-specifc" typo in other jobs too? 🤔I notice we fixed the typo in the MyPy job's comment (from "Job-specifc" to "Job-specific"), but the same typo exists in the other jobs. Wanna fix those for consistency?
- # Job-specifc step(s): + # Job-specific step(s):Also applies to: 66-66
airbyte_cdk/destinations/vector_db_based/embedder.py (2)
110-113
: Consider documenting the reason for type ignoreHey! The type ignore comment helps with mypy compliance, but would it be helpful to add a brief comment explaining why it's needed? Something like:
# type: ignore [call-arg] # langchain's OpenAIEmbeddings has incomplete type hints
Also, have you considered creating a type stub file for OpenAIEmbeddings to properly type hint these arguments? This could help avoid the type ignore, wdyt? 🤔
Line range hint
121-131
: Consider making the API version configurableI notice we're using a hardcoded API version "2023-05-15". Would it make sense to move this to the config model to make it more maintainable? This way we can update it without code changes when Azure releases new versions.
Also, the comment about the 16 document limit is dated (20230927). Should we update this or maybe make it more generic? Something like "Azure OpenAI API has a limit on documents per request"?
What are your thoughts on these suggestions? 🤔
airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py (1)
88-90
: LGTM! Consider documenting type cast assumptionThe type ignore comment is correctly added to handle the union attribute access. Would you consider adding a brief inline comment explaining why we're confident that
field_name
is always an interpolated string here? This could help future maintainers understand the type ignore rationale better, wdyt?airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py (2)
154-157
: LGTM! Consider consolidating type ignoresThe type ignores for [no-any-return, union-attr] are correctly added. Since we're ignoring multiple type issues on the same line, would it make sense to add a more comprehensive comment explaining why these are safe to ignore? This could help maintain clarity about our type assumptions, wdyt?
161-165
: LGTM! Consider enhancing type safetyThe type ignores for predicate evaluation are necessary here. However, would you be interested in exploring the possibility of creating a custom type for the response parameter to make the type checking more explicit? This could potentially reduce our reliance on type ignores in the future, wdyt?
airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (3)
136-137
: LGTM! Consider documenting state structureThe type ignore for union-attr is correctly added. Would it be helpful to add a docstring describing the expected structure of the stream_state? This could make it clearer why we're confident about the type cast, wdyt?
161-163
: LGTM! Consider enhancing type safety for stream slicesThe type ignores for arg-type are necessary here. However, would you be interested in creating a more specific type for stream slices that better represents their structure? This could potentially reduce the need for type ignores, wdyt?
373-376
: LGTM! Consider consolidating similar patternsThe type ignores here match the pattern we see in the request options provider. Would it make sense to extract this common pattern into a helper method or base class to ensure consistent handling across the codebase, wdyt?
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (2)
133-136
: Consider refactoring to eliminate# type: ignore
commentsWould it be possible to refactor the code to avoid the
# type: ignore [union-attr]
comment here? Perhaps using type casting or asserting thatparent_config.request_option
is notNone
might help mypy recognize the correct type. Wdyt?
300-300
: Consider refining the assignment to satisfy type checkingCould we adjust the assignment of
substream_state
to adhere to mypy's expectations without using# type: ignore [assignment]
? Maybe explicitly defining the type or restructuring the code could help. Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (17)
.github/workflows/python_lint.yml
(1 hunks)airbyte_cdk/cli/source_declarative_manifest/_run.py
(3 hunks)airbyte_cdk/connector_builder/message_grouper.py
(4 hunks)airbyte_cdk/destinations/vector_db_based/embedder.py
(2 hunks)airbyte_cdk/entrypoint.py
(2 hunks)airbyte_cdk/sources/declarative/datetime/min_max_datetime.py
(3 hunks)airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py
(3 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(5 hunks)airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py
(1 hunks)airbyte_cdk/sources/declarative/requesters/paginators/default_paginator.py
(1 hunks)airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py
(1 hunks)airbyte_cdk/sources/file_based/file_types/excel_parser.py
(4 hunks)airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py
(1 hunks)airbyte_cdk/sources/streams/core.py
(6 hunks)airbyte_cdk/sources/streams/http/http.py
(8 hunks)airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
(10 hunks)airbyte_cdk/test/entrypoint_wrapper.py
(2 hunks)
✅ Files skipped from review due to trivial changes (3)
- airbyte_cdk/sources/file_based/stream/cursor/default_file_based_cursor.py
- airbyte_cdk/sources/declarative/requesters/paginators/default_paginator.py
- airbyte_cdk/test/entrypoint_wrapper.py
🚧 Files skipped from review as they are similar to previous changes (8)
- airbyte_cdk/connector_builder/message_grouper.py
- airbyte_cdk/entrypoint.py
- airbyte_cdk/sources/streams/core.py
- airbyte_cdk/cli/source_declarative_manifest/_run.py
- airbyte_cdk/sources/file_based/file_types/excel_parser.py
- airbyte_cdk/sources/declarative/datetime/min_max_datetime.py
- airbyte_cdk/sources/streams/http/http.py
- airbyte_cdk/sources/streams/http/requests_native_auth/oauth.py
🔇 Additional comments (5)
.github/workflows/python_lint.yml (1)
75-75
: LGTM on the MyPy changes! 👍
The removal of the conditional check fixes the issue with the non-existent step output. The command is now more focused by targeting the airbyte_cdk
directory specifically.
Also applies to: 77-78
airbyte_cdk/destinations/vector_db_based/embedder.py (1)
Line range hint 110-131
: LGTM! The changes successfully address mypy compliance
The type ignore comments are appropriately placed and the functionality remains intact. The changes align well with the PR's objective of making the codebase mypy compliant. 👍
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (3)
165-168
: Same suggestion regarding # type: ignore
comments
Again, could we refactor the code to eliminate the # type: ignore [union-attr]
comment? Adjusting the type hints or using type assertions could help mypy understand the types better. What do you think?
195-198
: Refactoring to address # type: ignore
for argument types
Would adjusting the type annotations or using type casting help eliminate the # type: ignore [arg-type]
comment here? It might make the code cleaner and more type-safe. Wdyt?
234-237
: Addressing # type: ignore
by refining types
Is there an opportunity to refactor this section to remove the # type: ignore [arg-type]
comment? Refining the type annotations for parent_record
or extra_field_path
might help mypy perform accurate type checking. What are your thoughts?
Note: Downstream connector test failures appear to be unrelated. Those would probably also be failing on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (19)
airbyte_cdk/sources/streams/concurrent/cursor.py (2)
476-476
: Nice type annotation addition! 👍The explicit type annotation for
record_cursor_value
asCursorValueType
improves type safety and readability. This aligns well with the PR's goal of making the codebase mypy compliant.Hey, while we're at it, would you consider adding a docstring to the
_extract_cursor_value
method to document its return type? Something like:def _extract_cursor_value(self, record: Record) -> CursorValueType: """Extract and parse the cursor value from the record. Args: record: The record to extract the cursor value from Returns: CursorValueType: The parsed cursor value Raises: ValueError: If the cursor field is not found in the record """wdyt? 🤔
Line range hint
483-483
: Consider adding None checks for better type safety? 🤔The comparison
self.start <= record_cursor_value <= self._end_provider()
looks clean, but what if we add some guard clauses for edge cases? Something like:- return self.start <= record_cursor_value <= self._end_provider() + if self.start is None: + return record_cursor_value <= self._end_provider() + end = self._end_provider() + if end is None: + return self.start <= record_cursor_value + return self.start <= record_cursor_value <= endThis would make the code more resilient to edge cases where either boundary might be None. What do you think? 🎯
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
Line range hint
648-678
: Consider adding validation for token_provider type.The changes look good with improved error handling for token_provider and api_token. However, we might want to add type validation for token_provider to prevent runtime errors. WDYT?
def create_session_token_authenticator( self, model: SessionTokenAuthenticatorModel, config: Config, name: str, **kwargs: Any ) -> Union[ApiKeyAuthenticator, BearerAuthenticator]: + if token_provider is not None and not isinstance(token_provider, TokenProvider): + raise ValueError(f"token_provider must be an instance of TokenProvider, got {type(token_provider)}") decoder = ( self._create_component_from_model(model=model.decoder, config=config) if model.decoder else JsonDecoder(parameters={}) )
1711-1711
: Consider documenting the token expiry format requirements.The addition of token_expiry_date_format and token_expiry_is_time_of_expiration improves token handling. Would it be helpful to add a docstring explaining the expected format and behavior? WDYT?
def create_oauth_authenticator( self, model: OAuthAuthenticatorModel, config: Config, **kwargs: Any ) -> DeclarativeOauth2Authenticator: + """Creates an OAuth2 authenticator. + + Args: + token_expiry_date_format: The format string for parsing token expiry dates. + When provided, token_expiry_is_time_of_expiration is set to True. + Example format: "%Y-%m-%dT%H:%M:%S.%fZ" + """airbyte_cdk/sources/utils/record_helper.py (2)
38-38
: Consider adding a type-ignore comment for clarity? 🤔Hey! I notice this line involves in-place data transformation. Since mypy might flag the unused return value, would it make sense to add a
# type: ignore[func-returns-value]
comment here? This would explicitly document that we're intentionally ignoring the return value, wdyt?- transformer.transform(data, schema) + transformer.transform(data, schema) # type: ignore[func-returns-value] # Transform modifies data in-place
Line range hint
15-24
: How about enhancing the docstring for better type clarity? 📝The function looks great! Would you be interested in adding a more detailed docstring that explains the transformer's role and type transformation behavior? Something like this perhaps?
def stream_data_to_airbyte_message( stream_name: str, data_or_message: StreamData, transformer: TypeTransformer = TypeTransformer(TransformConfig.NoTransform), schema: Optional[Mapping[str, Any]] = None, is_file_transfer_message: bool = False, ) -> AirbyteMessage: + """Convert stream data into an Airbyte message with optional type transformation. + + Args: + stream_name: Name of the stream this message belongs to + data_or_message: The data to convert or an existing message + transformer: Handles type transformations according to schema (modifies data in-place) + schema: Optional JSON schema for type validation/transformation + is_file_transfer_message: If True, creates a file transfer message instead of a record + + Returns: + AirbyteMessage: The converted message + + Raises: + ValueError: If data_or_message is of an unexpected type + """airbyte_cdk/entrypoint.py (2)
251-254
: Consider adding runtime validation for record attributes?While the type ignores are valid (these fields are guaranteed by the protocol), we could make this more robust with runtime validation, wdyt?
+ if not hasattr(message.record, 'stream') or not hasattr(message.record, 'namespace'): + raise ValueError("Record message missing required attributes: stream and namespace") stream_message_count[ HashableStreamDescriptor( name=message.record.stream, # type: ignore[union-attr] namespace=message.record.namespace, # type: ignore[union-attr] ) ] += 1.0
260-262
: How about making state stats initialization more explicit?The current approach works well, but we could make it more explicit and type-safe, wdyt?
- message.state.sourceStats = message.state.sourceStats or AirbyteStateStats() + if message.state.sourceStats is None: + message.state.sourceStats = AirbyteStateStats() message.state.sourceStats.recordCount = stream_message_count.get( stream_descriptor, 0.0 )airbyte_cdk/sources/streams/http/http_client.py (3)
147-147
: Consider tracking the untyped cache.clear() method.The code comment suggests that
cache.clear()
is still not typed in requests_cache. While removing the type ignore comment aligns with making the codebase mypy compliant, we might want to track this for future updates. What do you think about adding a TODO to revisit this when requests_cache adds proper typing?
Line range hint
338-338
: Consider making the control flow more explicit for type checking.The type ignore comment is necessary because mypy can't infer that the error paths always raise. What do you think about making this more explicit by either:
- Adding a final
assert response is not None
before the return, or- Using a helper function that handles the None case explicitly?
This could help remove the type ignore comment and make the control flow clearer. WDYT?
Line range hint
391-402
: Consider improving the stream status emission approach.The current approach of printing stream status with
\n
andflush=True
is noted as temporary. This could cause issues with message ordering or parsing. Would you be interested in tracking this as a separate improvement task to implement a proper message queuing system once the concurrent message repository is ready?airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
Line range hint
206-206
: Consider enhancing state manager type safetyI noticed this comment about using generics for the ConnectorStateManager. Since we're working on mypy compliance, would you like to tackle this improvement as well? We could make the state management more type-safe by introducing proper generic type parameters. Here's what I'm thinking:
- state_manager = ConnectorStateManager(state=self._state) # type: ignore + class ConnectorStateManager(Generic[TState]): + def __init__(self, state: TState) -> None: + self._state = state + + state_manager = ConnectorStateManager[TState](state=self._state)What are your thoughts on this approach? 🤔
airbyte_cdk/sources/declarative/spec/spec.py (2)
Line range hint
24-24
: How about adding advanced_auth to the docstring? 🤔The class has great documentation, but the new
advanced_auth
attribute isn't documented in the Attributes section. Wdyt about adding it?""" Returns a connection specification made up of information about the connector and how it can be configured Attributes: connection_specification (Mapping[str, Any]): information related to how a connector can be configured documentation_url (Optional[str]): The link the Airbyte documentation about this connector + advanced_auth (Optional[AuthFlow]): Advanced authentication configuration for the connector """
Line range hint
37-41
: Type handling looks good, but shall we add a runtime check? 🤔The type ignore comment is well documented, but we could make it even more robust. What do you think about adding a runtime assertion to complement the type ignore? This would help catch any potential issues early:
if self.advanced_auth: + assert hasattr(self.advanced_auth, 'auth_flow_type'), "AuthFlow must have auth_flow_type attribute" self.advanced_auth.auth_flow_type = self.advanced_auth.auth_flow_type.value # type: ignore # We know this is always assigned to an AuthFlow which has the auth_flow_type field # Map CDK AuthFlow model to protocol AdvancedAuth model obj["advanced_auth"] = self.advanced_auth.dict()
airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py (4)
136-137
: Consider using a type-safe alternative for cursor_field?Instead of using a type ignore comment, we could improve type safety by explicitly casting cursor_field to InterpolatedString in the method signature. wdyt?
def set_initial_state(self, stream_state: StreamState) -> None: - self._cursor = stream_state.get(self.cursor_field.eval(self.config)) if stream_state else None # type: ignore [union-attr] + cursor_field = self.cursor_field + if not isinstance(cursor_field, InterpolatedString): + raise TypeError("cursor_field must be an InterpolatedString") + self._cursor = stream_state.get(cursor_field.eval(self.config)) if stream_state else None
161-164
: How about adding type hints for stream_slice values?The type ignores could be replaced with proper type hints and validation. This would make the code more maintainable and type-safe. Here's a suggestion:
def _is_within_daterange_boundaries( self, record: Record, - start_datetime_boundary: Union[datetime.datetime, str], - end_datetime_boundary: Union[datetime.datetime, str], + start_datetime_boundary: Union[datetime.datetime, str, None], + end_datetime_boundary: Union[datetime.datetime, str, None], ) -> bool:And in the observe method:
if ( self._is_within_daterange_boundaries( record, - stream_slice.get(start_field), # type: ignore [arg-type] - stream_slice.get(end_field), # type: ignore [arg-type] + stream_slice.get(start_field, None), + stream_slice.get(end_field, None), ) and is_highest_observed_cursor_value ):
373-375
: Should we add type validation for end_time_option.field_name?Similar to the cursor_field suggestion earlier, we could make this more type-safe by adding validation. Here's a thought:
if self.end_time_option and self.end_time_option.inject_into == option_type: + field_name = self.end_time_option.field_name + if not isinstance(field_name, InterpolatedString): + raise TypeError("end_time_option.field_name must be an InterpolatedString") - options[self.end_time_option.field_name.eval(config=self.config)] = stream_slice.get( # type: ignore [union-attr] + options[field_name.eval(config=self.config)] = stream_slice.get( self._partition_field_end.eval(self.config) )
Line range hint
1-447
: Consider a more systematic approach to type safety?The current changes make the code mypy compliant, which is great! However, we could take this opportunity to implement a more systematic approach to type safety. This could involve:
- Creating type guards for InterpolatedString and other common types
- Adding runtime type validation in post_init for all fields that should be InterpolatedString
- Using TypeVar and Generic types for better type inference
Would you be interested in exploring these improvements in a follow-up PR?
airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py (1)
Line range hint
184-214
: Nice implementation of the custom format converter! A few thoughts for consideration:
What do you think about adding input validation for
datetime_format
in the constructor? This could catch invalid format strings early, wdyt?For performance optimization in
parse_timestamp
, we could potentially cache successful format matches. This would be particularly beneficial when processing large datasets with consistent timestamp formats. Would you be interested in exploring this optimization?Consider adding docstring examples of expected format strings? This could help users understand the expected format patterns.
Here's a potential implementation for validation and caching:
def __init__( self, datetime_format: str, input_datetime_formats: Optional[List[str]] = None, is_sequential_state: bool = True, cursor_granularity: Optional[timedelta] = None, ): super().__init__( is_sequential_state=is_sequential_state, cursor_granularity=cursor_granularity ) + # Validate format string + try: + self._parser.format(datetime.now(), datetime_format) + except ValueError as e: + raise ValueError(f"Invalid datetime format '{datetime_format}': {str(e)}") self._datetime_format = datetime_format self._input_datetime_formats = input_datetime_formats if input_datetime_formats else [] self._input_datetime_formats += [self._datetime_format] self._parser = DatetimeParser() + self._format_cache: Dict[str, str] = {} def parse_timestamp(self, timestamp: str) -> datetime: + if timestamp in self._format_cache: + return self._parser.parse(timestamp, self._format_cache[timestamp]) for datetime_format in self._input_datetime_formats: try: - return self._parser.parse(timestamp, datetime_format) + result = self._parser.parse(timestamp, datetime_format) + self._format_cache[timestamp] = datetime_format + return result except ValueError: pass raise ValueError(f"No format in {self._input_datetime_formats} matching {timestamp}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (29)
airbyte_cdk/connector_builder/message_grouper.py
(5 hunks)airbyte_cdk/destinations/vector_db_based/writer.py
(1 hunks)airbyte_cdk/entrypoint.py
(3 hunks)airbyte_cdk/logger.py
(2 hunks)airbyte_cdk/sources/abstract_source.py
(1 hunks)airbyte_cdk/sources/config.py
(1 hunks)airbyte_cdk/sources/connector_state_manager.py
(4 hunks)airbyte_cdk/sources/declarative/auth/oauth.py
(1 hunks)airbyte_cdk/sources/declarative/concurrent_declarative_source.py
(1 hunks)airbyte_cdk/sources/declarative/incremental/datetime_based_cursor.py
(3 hunks)airbyte_cdk/sources/declarative/interpolation/jinja.py
(3 hunks)airbyte_cdk/sources/declarative/interpolation/macros.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(5 hunks)airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
(5 hunks)airbyte_cdk/sources/declarative/requesters/error_handlers/default_error_handler.py
(1 hunks)airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py
(1 hunks)airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py
(1 hunks)airbyte_cdk/sources/declarative/spec/spec.py
(1 hunks)airbyte_cdk/sources/file_based/file_types/csv_parser.py
(3 hunks)airbyte_cdk/sources/streams/concurrent/cursor.py
(1 hunks)airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py
(2 hunks)airbyte_cdk/sources/streams/core.py
(7 hunks)airbyte_cdk/sources/streams/http/http.py
(10 hunks)airbyte_cdk/sources/streams/http/http_client.py
(2 hunks)airbyte_cdk/sources/utils/record_helper.py
(1 hunks)airbyte_cdk/sources/utils/schema_helpers.py
(1 hunks)airbyte_cdk/test/entrypoint_wrapper.py
(3 hunks)airbyte_cdk/test/mock_http/response_builder.py
(1 hunks)airbyte_cdk/utils/traced_exception.py
(3 hunks)
✅ Files skipped from review due to trivial changes (6)
- airbyte_cdk/test/mock_http/response_builder.py
- airbyte_cdk/sources/declarative/requesters/error_handlers/default_error_handler.py
- airbyte_cdk/sources/utils/schema_helpers.py
- airbyte_cdk/sources/abstract_source.py
- airbyte_cdk/sources/config.py
- airbyte_cdk/sources/declarative/auth/oauth.py
🚧 Files skipped from review as they are similar to previous changes (13)
- airbyte_cdk/sources/declarative/requesters/request_options/datetime_based_request_options_provider.py
- airbyte_cdk/destinations/vector_db_based/writer.py
- airbyte_cdk/sources/file_based/file_types/csv_parser.py
- airbyte_cdk/connector_builder/message_grouper.py
- airbyte_cdk/sources/connector_state_manager.py
- airbyte_cdk/logger.py
- airbyte_cdk/utils/traced_exception.py
- airbyte_cdk/test/entrypoint_wrapper.py
- airbyte_cdk/sources/streams/core.py
- airbyte_cdk/sources/declarative/interpolation/jinja.py
- airbyte_cdk/sources/declarative/requesters/error_handlers/http_response_filter.py
- airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py
- airbyte_cdk/sources/streams/http/http.py
🔇 Additional comments (12)
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
Line range hint 477-482
: Solid error handling! 💪
The error handling here is well thought out:
- Catches ValueError gracefully
- Uses a flag to prevent log spam
- Returns True as a safe default
- Clear warning message with good context
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
400-400
: LGTM! Nice cleanup of message repository initialization.
The initialization is now more concise by evaluating the log level directly in the initialization line.
898-898
: LGTM! Good addition of message repository to cursor.
Adding message repository to the cursor initialization improves logging capabilities and maintains consistency with the rest of the codebase.
airbyte_cdk/entrypoint.py (2)
18-18
: LGTM! Clean import statement.
The direct import of orjson
looks good. It's a cleaner approach than the previous version.
284-284
: Nice performance optimization! 🚀
Using orjson.dumps
directly is a great choice here! It's more efficient than the standard json library, and the implementation is clean and straightforward.
airbyte_cdk/sources/declarative/interpolation/macros.py (1)
119-119
: LGTM! Clean type annotation improvement.
The removal of the type ignore comment is a good change as mypy can now properly infer the return type from parse_duration
. The explicit return type annotation Union[datetime.timedelta, isodate.Duration]
correctly matches the possible return types.
airbyte_cdk/sources/streams/http/http_client.py (1)
327-327
: LGTM! Clean lambda type improvement.
The removal of the type ignore comment is safe here as the type is properly inferred from the formatter parameter.
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
131-131
: Verify type safety of message_repository
Hey! I notice we're removing the type ignore comment here. This suggests that message_repository
is now properly typed upstream. Would it be helpful to add a quick comment explaining how the type safety was achieved (e.g., through proper typing in the parent class or factory)? This could help future maintainers understand the typing guarantees, wdyt? 🤔
Let's verify the type safety with this script:
✅ Verification successful
Based on the search results, I can see that message_repository
is properly typed across the codebase:
- In various files it's typed as
MessageRepository
orOptional[MessageRepository]
- The default implementation
NoopMessageRepository()
is used when not specified - The type is consistently used in test files and production code
- The parameter is properly typed in parent classes and related components
Type safety is properly handled, no comment needed
The removal of the type ignore comment is justified as the message_repository
parameter is properly typed throughout the codebase. The type system can now correctly verify the type safety without needing to ignore any checks. The parent classes and implementations all use proper type annotations, making the explicit type ignore unnecessary.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if message_repository is properly typed in parent classes and related files
# Look for message_repository type definitions
ast-grep --pattern 'message_repository: $_'
# Look for message_repository assignments or declarations
rg 'message_repository' -A 2 -B 2
Length of output: 78084
airbyte_cdk/sources/declarative/spec/spec.py (2)
12-12
: LGTM! Clean import organization 🎯
The new import is properly grouped with related model imports.
Line range hint 34-34
: Verify Python version compatibility
The use of the |
operator in type annotations (dict[str, Mapping[str, Any] | str | AdvancedAuth]
) requires Python 3.10+.
✅ Verification successful
The Python version requirement is compatible with the syntax! 🎉
Looking at the pyproject.toml
, the package explicitly requires Python ^3.10,<3.13
, which perfectly aligns with the use of the |
operator in type annotations. The project's classifiers also confirm Python 3.10 support.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check minimum Python version requirement in setup.py or pyproject.toml
fd -e toml -e py -g "setup.py" -g "pyproject.toml" --exec grep -l "python"
Length of output: 172
Script:
#!/bin/bash
# Let's try a different approach to find Python version requirements
# First, let's check for setup.py
find . -name "setup.py" -type f -exec cat {} \;
# Also check for pyproject.toml
find . -name "pyproject.toml" -type f -exec cat {} \;
# As a backup, let's also check for Python version in requirements files
find . -name "*requirements*.txt" -type f -exec cat {} \;
Length of output: 8883
airbyte_cdk/sources/streams/concurrent/state_converters/datetime_stream_state_converter.py (2)
144-144
: Clean type handling improvement!
Nice work removing the type ignore comments. The direct return of dt_object
with proper type checking is much cleaner and mypy-compliant. The error message is also very helpful for debugging.
Also applies to: 181-181
Line range hint 184-214
: Security and performance considerations
A few thoughts on hardening this implementation:
-
Should we consider adding a maximum limit to
input_datetime_formats
to prevent potential DoS through excessive format attempts? -
Have you considered validating all input formats in the constructor, similar to the output format? This could prevent potential issues with malformed format strings.
Let's check if there are any existing format string validation patterns in the codebase:
mypy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy
mypy
Note:
if foo is None: raise ValueError(...)
) to help the type checker, but in some cases, these caused the code to fail in a different way then the unit tests expected. So I've reverted those, and I've instead used the strategy of using "ignore" statements when there's no other way to resolve.cast
in Python is a no-op and just helps the type inference. Addedcast
statements tell the linter what the type is expected to be, but they don't change the value or add to the runtime cost.Summary of Code Changes
pyproject.toml
so they should not have any effect at runtime.mypy
checks for the whole codebase, excluding (for now) theunit_tests
directory.from orjson import orjson
with simplyimport orjson
since that is required for the type checker.Deprecated
library withtype_extensions.deprecated
. This is officially part of the Python language now, as of python 3.11, and thetyping_extensions
package backports it.@deprecated
decorator syntax to match the Python languagedeprecated
syntax.cast
statements to help the type checker outltype: ignore [ERRTYPE]
) for anything that can not be fixed with near-zero risk.poetry run mypy airbyte_cdk --warn-unused-ignores
. When the ignore was obviously not needed (even at stricter compliance levels), I removed those. I didn't remove all because some looked valid and might only be shown as "unused" if we use a more permissive ruleset around the use ofAny
.Summary by CodeRabbit
Bug Fixes
ConcurrentCursor
andHttpResponseFilter
classes.Documentation
Chores