Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(low-code cdk): add dynamic schema loader #104

Merged
merged 25 commits into from
Dec 12, 2024

Conversation

lazebnyi
Copy link
Contributor

@lazebnyi lazebnyi commented Dec 3, 2024

What

This PR add dynamic schema loader that can dynamically generate schema based on response.|
Fixed: https://github.com/airbytehq/airbyte-internal-issues/issues/10782

How

Described schema, models and data classes for DynamicSchemaLoader, SchemaTypeIdentifier and TypesPair. Defined behavior for schema resolving. Integrated the dynamic schema handling into the existing framework.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced three new components for enhanced dynamic schema handling: TypesMap, SchemaTypeIdentifier, and DynamicSchemaLoader.
    • Expanded type mappings in the manifest to support the new components.
  • Bug Fixes

    • Improved error handling in the DynamicSchemaLoader for invalid keys and types.
  • Tests

    • Added comprehensive unit tests for the DynamicSchemaLoader to validate functionality and error handling.

@github-actions github-actions bot added the enhancement New feature or request label Dec 3, 2024
@lazebnyi lazebnyi marked this pull request as ready for review December 3, 2024 03:50
@lazebnyi
Copy link
Contributor Author

lazebnyi commented Dec 3, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

coderabbitai bot commented Dec 3, 2024

Warning

Rate limit exceeded

@lazebnyi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 2 minutes and 59 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between d7e4873 and 0dbea5d.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (1 hunks)
📝 Walkthrough
📝 Walkthrough
📝 Walkthrough

Walkthrough

This pull request introduces three new components—TypesMap, SchemaTypeIdentifier, and DynamicSchemaLoader—to the declarative_component_schema.yaml and corresponding classes in declarative_component_schema.py. It also updates the DEFAULT_MODEL_TYPES mapping in manifest_component_transformer.py and enhances the ModelToComponentFactory with methods for creating these new components. Minor formatting changes were made to existing components for clarity. Additionally, a new file for unit tests of the DynamicSchemaLoader was added, ensuring robust testing of its functionality.

Changes

File Change Summary
airbyte_cdk/sources/declarative/declarative_component_schema.yaml Added definitions for TypesMap, SchemaTypeIdentifier, and DynamicSchemaLoader.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py Introduced new classes: TypesMap, SchemaTypeIdentifier, and DynamicSchemaLoader. Minor updates to OAuthConfigSpecification descriptions.
airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py Updated DEFAULT_MODEL_TYPES with new mappings for DynamicSchemaLoader.retriever and SchemaTypeIdentifier.types_map.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py Added methods: create_dynamic_schema_loader, create_schema_type_identifier, and create_types_map. Refactored _merge_stream_slicers.
airbyte_cdk/sources/declarative/schema/__init__.py Added imports for DynamicSchemaLoader, TypesMap, and SchemaTypeIdentifier. Updated __all__ list.
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py Introduced classes and methods for dynamic schema processing, including get_json_schema and various helper methods.
unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py Created unit tests for DynamicSchemaLoader, including tests for valid and invalid scenarios.

Possibly related PRs

Suggested reviewers

  • natikgadzhi
  • maxi297

Would you like to consider any additional reviewers or related PRs? Wdyt?


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Experiment)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (7)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

1575-1581: Adjust default values for name and primary_key fields

Would you consider setting the default values of name and primary_key to None instead of an empty string? This can help avoid confusion with optional fields and ensure consistency. Wdyt?

Proposed change:

name: Optional[str] = Field(
-   "", description="The stream name.", example=["Users"], title="Name"
+   None, description="The stream name.", example="Users", title="Name"
)
primary_key: Optional[PrimaryKey] = Field(
-   "", description="The primary key of the stream.", title="Primary Key"
+   None, description="The primary key of the stream.", title="Primary Key"
)

654-679: Suggestion on is_nullable field in SchemaTypeIdentifier

Would it be clearer to use is_nullable: bool = Field(True, ...) instead of Optional[bool] since it has a default value? This can improve type clarity. Wdyt?

Proposed change:

- is_nullable: Optional[bool] = Field(
+ is_nullable: bool = Field(
    True,
    description="Add null to defined field type. This field is automatically set by the CDK.",
    title="Is Nullable",
)
unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (2)

12-48: Consider adding docstrings to fixtures?

The fixtures are well-structured, but adding docstrings would make their purpose and setup more clear to other developers. Here's a suggested format, wdyt?

 @pytest.fixture
 def mock_retriever():
+    """
+    Creates a mock retriever that returns predefined schema records.
+    Returns:
+        MagicMock: A mock retriever with read_records method returning sample schema data.
+    """
     retriever = MagicMock()

51-128: The test cases look great! Consider adding a docstring?

The parameterized test cases provide excellent coverage. Adding a docstring to explain the test's purpose and parameters would enhance documentation, wdyt?

 def test_dynamic_schema_loader(dynamic_schema_loader, retriever_data, expected_schema):
+    """
+    Tests DynamicSchemaLoader's schema generation for various input scenarios.
+    
+    Args:
+        dynamic_schema_loader: Fixture providing the loader instance
+        retriever_data: Input data simulating API responses
+        expected_schema: Expected JSON schema output
+    """
     dynamic_schema_loader.retriever.read_records.return_value = retriever_data
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1666-1686: Consider adding examples for TypesPair?

The schema is well-defined, but adding examples would help users understand the expected format, wdyt?

       current_type:
         anyOf:
           - type: string
           - type: array
             items:
               type: string
+        examples:
+          - "string"
+          - ["string", "integer"]

1687-1732: Consider adding pattern validation for pointers?

The schema pointers could benefit from regex patterns to ensure valid paths, wdyt?

       schema_pointer:
         title: Schema Path
         description: List of potentially nested fields describing the full path of the field to extract.
         type: array
         items:
           - type: string
+          pattern: "^[a-zA-Z_][a-zA-Z0-9_]*$"

1733-1756: Consider enhancing experimental status documentation?

The experimental status warning could be more detailed about potential risks and limitations, wdyt?

-    description: (This component is experimental. Use at your own risk.) Loads a schema by extracting data from retrieved records.
+    description: |
+      (This component is experimental. Use at your own risk.) 
+      Loads a schema by extracting data from retrieved records.
+      
+      Note: As an experimental feature:
+      - The interface may change in non-backward compatible ways
+      - Support is limited
+      - Not recommended for production use without thorough testing
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9587d4e and edb52e7.

📒 Files selected for processing (8)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (11 hunks)
  • airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (9 hunks)
  • airbyte_cdk/sources/declarative/partition_routers/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/schema/__init__.py (1 hunks)
  • airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1 hunks)
  • unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (1 hunks)
🔇 Additional comments (6)
airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1)

106-128: Overall implementation looks solid

The get_json_schema method effectively constructs the JSON Schema dynamically. Good job on handling the iterations and data extraction properly.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

650-653: Definition of TypesPair looks good

The TypesPair class correctly models the mapping between current and target types. Nice addition!

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1570-1629: Implementation of new factory methods

The methods create_types_pair, create_schema_type_identifier, and create_dynamic_schema_loader are well-implemented and integrate seamlessly with the factory pattern. Great work!

airbyte_cdk/sources/declarative/schema/__init__.py (1)

9-11: Exporting new schema components

Including DynamicSchemaLoader, TypesPair, and SchemaTypeIdentifier in __all__ makes sense. This ensures they are available for import elsewhere.

airbyte_cdk/sources/declarative/partition_routers/__init__.py (1)

9-11: Adding PartitionRouter to exports

Adding PartitionRouter to the __all__ list appropriately exposes it as a public interface. Good job!

airbyte_cdk/sources/declarative/parsers/manifest_component_transformer.py (1)

61-64: LGTM! The new mappings look well-structured. 🎯

The additions follow the existing pattern nicely and align with the dynamic schema loader feature. Just to double-check our integration - would it make sense to verify that these components are properly registered in their respective definition files? wdyt?

✅ Verification successful

Looking good! The component mappings are correctly aligned with their implementations. 👍

The verification shows that both mappings in manifest_component_transformer.py are valid:

  • DynamicSchemaLoader.retriever maps to SimpleRetriever, which is a valid option in the component schema
  • SchemaTypeIdentifier.types_map maps to TypesPair, which is correctly defined and used in the schema type identifier class

Both components are properly defined in the codebase with their respective dependencies, and the mappings follow the established patterns. The experimental nature of these components is also consistently documented.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the component definitions exist and match the mapped types

# Check for DynamicSchemaLoader definition and SimpleRetriever type
echo "Checking DynamicSchemaLoader definition..."
rg -l "class DynamicSchemaLoader.*SimpleRetriever" 

# Check for SchemaTypeIdentifier definition and TypesPair type
echo "Checking SchemaTypeIdentifier definition..."
rg -l "class SchemaTypeIdentifier.*TypesPair"

Length of output: 292


Script:

#!/bin/bash
# Let's try a different approach with ast-grep to find these class definitions
# and their relationships with the mapped types

echo "Checking DynamicSchemaLoader definition..."
ast-grep --pattern 'class DynamicSchemaLoader {
  $$$
}'

echo -e "\nChecking SchemaTypeIdentifier definition..."
ast-grep --pattern 'class SchemaTypeIdentifier {
  $$$
}'

echo -e "\nChecking for TypesPair usage..."
rg "TypesPair" -A 2 -B 2

echo -e "\nChecking for SimpleRetriever usage..."
rg "SimpleRetriever" -A 2 -B 2

Length of output: 66058


Script:

#!/bin/bash
# Let's check for the DynamicSchemaLoader and SchemaTypeIdentifier class definitions
# and their relationship with the mapped types in the schema files

echo "Checking DynamicSchemaLoader in schema files..."
rg "DynamicSchemaLoader" airbyte_cdk/sources/declarative/schema/ -A 5 -B 5

echo -e "\nChecking SchemaTypeIdentifier in schema files..."
rg "SchemaTypeIdentifier" airbyte_cdk/sources/declarative/schema/ -A 5 -B 5

echo -e "\nChecking component schema definitions..."
rg "DynamicSchemaLoader|SchemaTypeIdentifier" airbyte_cdk/sources/declarative/declarative_component_schema.yaml -A 5 -B 5

Length of output: 7591

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

851-851: Minor grammatical corrections in descriptions.

Consider decapitalizing "Specific" in the descriptions and refining wording for clarity. For example, changing "The DeclarativeOAuth Specific list..." to "The DeclarativeOAuth specific list...". WDYT?

You could apply these diffs:

For line 851:

-        description="The DeclarativeOAuth Specific list of strings to indicate which keys should be extracted and returned back to the input config.",
+        description="The DeclarativeOAuth specific list of strings indicating which keys should be extracted and returned to the input config.",

For line 857:

-        description="The DeclarativeOAuth Specific object to provide the criteria of how the `state` query param should be constructed,\nincluding length and complexity.",
+        description="The DeclarativeOAuth specific object providing the criteria for how the `state` query parameter should be constructed, including length and complexity.",

For line 881:

-        description="The DeclarativeOAuth Specific optional override to provide the custom `state` key name, if required by data-provider.",
+        description="The DeclarativeOAuth specific optional override to provide the custom `state` key name, if required by the data provider.",

For line 887:

-        description="The DeclarativeOAuth Specific optional override to provide the custom `code` key name to something like `auth_code` or `custom_auth_code`, if required by data-provider.",
+        description="The DeclarativeOAuth specific optional override to provide the custom `code` key name, such as `auth_code` or `custom_auth_code`, if required by the data provider.",

Also applies to: 857-857, 881-881, 887-887

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between edb52e7 and 7387131.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

648-650: LGTM!

The TypesPair class definition looks correct and follows code conventions.

…tehq/airbyte-python-cdk into lazebnyi/add-dynamic-schema-loader
@lazebnyi lazebnyi requested a review from maxi297 December 3, 2024 04:45
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from sync discussion

@aldogonzalez8 can you have a look a this? You'll have to use this for google sheets

@lazebnyi lazebnyi requested a review from maxi297 December 3, 2024 17:20
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1585-1610: Consider simplifying the pointer list comprehensions.

The list comprehensions for schema_pointer, key_pointer, and type_pointer could be made more readable. WDYT about:

-        model_schema_pointer: List[Union[InterpolatedString, str]] = (
-            [x for x in model.schema_pointer] if model.schema_pointer else []
-        )
-        model_key_pointer: List[Union[InterpolatedString, str]] = [x for x in model.key_pointer]
-        model_type_pointer: Optional[List[Union[InterpolatedString, str]]] = (
-            [x for x in model.type_pointer] if model.type_pointer else None
-        )
+        model_schema_pointer = list(model.schema_pointer) if model.schema_pointer else []
+        model_key_pointer = list(model.key_pointer)
+        model_type_pointer = list(model.type_pointer) if model.type_pointer else None
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

653-656: Consider adding docstrings for better documentation?

The TypesMap class looks good, but would it be helpful to add docstrings explaining the purpose of target_type and current_type fields and their relationship? This could help other developers understand how to use this mapping. WDYT?

airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (3)

112-121: Consider handling the case when no records are retrieved.

Currently, if retrieved_record is None, raw_schema is set to an empty list, resulting in an empty properties in the generated JSON Schema. Would you consider adding error handling or logging to inform users when no records are retrieved? Wdyt?

Possible code change:

 def get_json_schema(self) -> Mapping[str, Any]:
     """
     Constructs a JSON Schema based on retrieved data.
     """
     properties = {}
     retrieved_record = next(self.retriever.read_records({}), None)

+    if retrieved_record is None:
+        # Log a warning or raise an exception
+        raise ValueError("No records retrieved. Cannot generate schema.")

     raw_schema = (
         self._extract_data(
             retrieved_record,
             self.schema_type_identifier.schema_pointer,
         )
         if retrieved_record
         else []
     )

175-177: Consider improving the error message for invalid data types.

The error message might be unclear to users. Would you consider rephrasing it to be more descriptive? For example:

raise ValueError(
-    f"Invalid data type. Available string or two items list of string. Got {mapped_field_type}."
+    f"Invalid data type '{mapped_field_type}'. Expected a string or a list of two strings."
)

Wdyt?


199-199: Question about the necessity of deepcopy.

In _get_airbyte_type, we are performing a deepcopy on the AIRBYTE_DATA_TYPES entry. If the returned mapping is not modified elsewhere, would it be acceptable to return the original mapping to improve performance? Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fd44be1 and 05e4f74.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (5 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (9 hunks)
  • airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1 hunks)
🔇 Additional comments (9)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

Line range hint 1303-1382: LGTM! Nice refactoring of the stream slicer logic.

The decomposition into smaller, focused methods improves readability and maintainability. The separation of concerns between partition router and paginator logic is particularly clean.


1581-1583: LGTM! Simple and straightforward implementation.

The create_types_map method is clean and follows the factory pattern well.


1612-1636: LGTM! Well-structured dynamic schema loader creation.

The method follows the established pattern for retriever-based components and properly handles the stream slicer creation.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

658-677: LGTM! Well-structured class with clear field definitions.

The SchemaTypeIdentifier class has appropriate field types and good defaults. The optional fields provide flexibility while maintaining required functionality through the mandatory key_pointer field.


1734-1743: LGTM! Clean implementation of the dynamic schema loader.

The DynamicSchemaLoader follows the established pattern of other schema loaders while introducing new dynamic schema loading capabilities. The required fields ensure all necessary components are provided.


1928-1928: LGTM! Proper handling of forward references.

The addition of DynamicSchemaLoader to the forward references ensures proper model resolution.

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)

1679-1699: LGTM! Clear schema definition with appropriate experimental warning.

The schema for TypesMap is well-defined and matches the Python implementation. The experimental warning is a good practice for managing user expectations.


1700-1741: LGTM! Comprehensive schema definition with clear field descriptions.

The schema for SchemaTypeIdentifier is well-documented with clear descriptions for each field. The experimental warning and field requirements match the Python implementation.


1742-1764: LGTM! Well-defined schema with appropriate requirements.

The schema for DynamicSchemaLoader properly defines the component's structure with clear requirements and descriptions. The experimental warning is consistently applied.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1613-1637: Should we provide a more descriptive name for the retriever?

In the create_dynamic_schema_loader method, the name parameter for the retriever is set to an empty string. Providing a descriptive name might help with debugging and logging. Perhaps we could use something like "dynamic_schema_retriever"? WDYT?


1303-1337: Consider refactoring duplicate logic in slicer methods?

I noticed that _build_stream_slicer_from_partition_router and _build_resumable_cursor_from_paginator have similar patterns and are used together when building stream slicers. Do you think it might be beneficial to refactor this logic into a shared helper function to reduce duplication and improve maintainability? WDYT?

Also applies to: 1382-1382

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 05e4f74 and d41f54c.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (9 hunks)
  • airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py (1 hunks)
  • unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py
  • airbyte_cdk/sources/declarative/schema/dynamic_schema_loader.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

1581-1637: Great addition of the dynamic schema components!

The implementation of create_types_map, create_schema_type_identifier, and create_dynamic_schema_loader looks solid and aligns well with the existing factory patterns. This should enhance the schema management capabilities effectively.


1615-1618: Is combined_slicers correctly assigned?

In create_dynamic_schema_loader, after calling _build_resumable_cursor_from_paginator, combined_slicers might end up being None if there's no paginator. Should we handle this case to ensure that the stream_slicer passed to the retriever is valid? Perhaps adding a check or default value could prevent potential issues. WDYT?

@aldogonzalez8
Copy link
Contributor

The changes make sense to me. Just one thing, can we include a test with a manifest and the new dynamic schema loader setup? It would be nice to have like the ones in this other PR.

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only thing for me that would be missing is a very basic test that shows how it works in a manifest. Something similar to the test_concurrent_declarative_source. It doesn't need to be exhaustive but mostly to test that the discover method will return the dynamic stream

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

905-919: Consider improving the description formatting for better readability.

The description text in oauth_user_input_from_connector_config_specification could be more readable with proper line breaks and indentation. WDYT about this format?

-            description="OAuth specific blob. This is a Json Schema used to validate Json configurations used as input to OAuth.\nMust be a valid non-nested JSON that refers to properties from ConnectorSpecification.connectionSpecification\nusing special annotation 'path_in_connector_config'.\nThese are input values the user is entering through the UI to authenticate to the connector, that might also shared\nas inputs for syncing data via the connector.\nExamples:\nif no connector values is shared during oauth flow, oauth_user_input_from_connector_config_specification=[]\nif connector values such as 'app_id' inside the top level are used to generate the API url for the oauth flow,\n  oauth_user_input_from_connector_config_specification={\n    app_id: {\n      type: string\n      path_in_connector_config: ['app_id']\n    }\n  }\nif connector values such as 'info.app_id' nested inside another object are used to generate the API url for the oauth flow,\n  oauth_user_input_from_connector_config_specification={\n    app_id: {\n      type: string\n      path_in_connector_config: ['info', 'app_id']\n    }\n  }",
+            description="""
+                OAuth specific blob. This is a Json Schema used to validate Json configurations used as input to OAuth.
+                Must be a valid non-nested JSON that refers to properties from ConnectorSpecification.connectionSpecification
+                using special annotation 'path_in_connector_config'.
+                
+                These are input values the user is entering through the UI to authenticate to the connector, that might also shared
+                as inputs for syncing data via the connector.
+                
+                Examples:
+                if no connector values is shared during oauth flow:
+                    oauth_user_input_from_connector_config_specification=[]
+                
+                if connector values such as 'app_id' inside the top level are used to generate the API url for the oauth flow:
+                    oauth_user_input_from_connector_config_specification={
+                        app_id: {
+                            type: string
+                            path_in_connector_config: ['app_id']
+                        }
+                    }
+                
+                if connector values such as 'info.app_id' nested inside another object are used to generate the API url for the oauth flow:
+                    oauth_user_input_from_connector_config_specification={
+                        app_id: {
+                            type: string
+                            path_in_connector_config: ['info', 'app_id']
+                        }
+                    }
+                """,

660-678: Consider adding more descriptive field documentation.

The SchemaTypeIdentifier fields could benefit from more detailed descriptions to help users understand their purpose. For example:

  • schema_pointer: What kind of schema fields are expected?
  • key_pointer: What kind of keys should be extracted?
  • type_pointer: How is this different from schema_pointer?

WDYT about adding more context to help users understand these fields better?

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1688-1708: Consider adding examples to TypesMap properties.

The TypesMap definition could benefit from examples showing how to map different types. WDYT about adding examples like:

examples:
  target_type: ["string", "integer"]  # or "string"
  current_type: "text"  # or ["varchar", "text"]

1709-1749: Consider improving the field descriptions in SchemaTypeIdentifier.

The descriptions could be more specific about the relationship between these fields. For example:

  • How do schema_pointer and type_pointer interact?
  • What's the expected format of the types_mapping array?

WDYT about adding this context to help users understand the component better?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b87978 and 3da287a.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (12 hunks)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

655-658: LGTM! The TypesMap model looks good.

The model is well-defined with required fields and appropriate type annotations.


1820-1828: Add Field(...) to schema_type_identifier to specify it as a required field.

Currently, schema_type_identifier is missing the Field(...) wrapper indicating it's a required field. Adding Field(...) will make it consistent with other required fields and ensure proper Pydantic validation. WDYT?

-    schema_type_identifier: SchemaTypeIdentifier
+    schema_type_identifier: SchemaTypeIdentifier = Field(
+        ...,
+        description="Specifies how to identify types within the schema.",
+        title="Schema Type Identifier",
+    )
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

1750-1773: LGTM! The DynamicSchemaLoader definition looks good.

The component is well-defined with clear descriptions and required fields.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (5)
unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (3)

12-48: Consider expanding test fixtures with more edge cases?

The fixtures look good but could we enhance them with additional test cases? For example:

  • Nested objects
  • Arrays
  • Null values
  • Custom types that need mapping

This would help ensure the schema loader handles all possible scenarios. WDYT?


50-126: How about adding more test scenarios?

The parameterized tests look solid! Would you consider adding these scenarios:

  • Multiple records with conflicting types for the same field
  • Very large numbers that might need special handling
  • Date/time fields with various formats
  • Fields with special characters in names

This would help catch edge cases early. WDYT?


318-363: Consider enhancing manifest integration test?

The manifest integration test looks good! Would you consider adding:

  1. Verification of the types_mapping conversion (e.g., singleLineText → string)
  2. Assertions for schema metadata (e.g., $schema version)
  3. Validation that the schema matches the actual records

This would provide more confidence in the integration. WDYT?

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (2)

653-656: Would you like to add more documentation for TypesMap?

The class structure looks good! Consider adding:

  • Field descriptions explaining the purpose of target_type and current_type
  • Examples showing both single type and list type mappings
  • Notes about type conversion behavior

This would make it easier for other developers to use the class correctly. WDYT?


658-677: Consider adding validation for pointer fields?

The SchemaTypeIdentifier looks well-designed! Would you consider:

  1. Adding min_length validation for key_pointer to ensure it's not empty
  2. Adding examples for each pointer field to show valid paths
  3. Adding a validator to ensure type_pointer is provided if types_mapping is used

This would help prevent configuration errors. WDYT?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3da287a and 507afb6.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (6 hunks)
  • unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (1 hunks)
🔇 Additional comments (2)
unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py (1)

146-153: Error message doesn't match the invalid type scenario

The error message "Expected key to be a string. Got None" doesn't align with testing an invalid type. Should it be more specific about the type validation failure?

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1808-1817: Add Field(...) to schema_type_identifier?

The DynamicSchemaLoader looks good! Adding Field(...) would make it consistent with other required fields and ensure proper validation. WDYT?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1688-1708: The TypesMap component could use more documentation, wdyt?

While the component structure is solid, consider enhancing the documentation with:

  1. Examples of type mappings
  2. Use cases for when you'd use string vs array of strings
  3. Common scenarios where this component would be useful

1709-1749: The field descriptions look good, but could we enhance them further?

The component is well-structured, but consider:

  1. Adding examples for each pointer type to show their differences in practice
  2. Documenting the relationship between schema_pointer, key_pointer, and type_pointer
  3. Adding examples of types_mapping usage with TypesMap
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 507afb6 and d7e4873.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

1221-1221: LGTM! The DynamicSchemaLoader is properly integrated

The addition of DynamicSchemaLoader as a schema_loader option in DeclarativeStream is well-placed and aligns with the PR's objective to support dynamic schema loading.


1750-1773: The DynamicSchemaLoader looks good, but let's verify its integration

The component structure is solid. However, we should verify:

  1. That all supported retriever types work correctly with dynamic schema loading
  2. The interaction between the retriever and schema_type_identifier
✅ Verification successful

The DynamicSchemaLoader implementation looks solid and well-tested! 🎯

The verification shows:

  • Comprehensive test coverage including various scenarios (valid types, missing types, nested types, empty records)
  • Integration with all retriever types is properly defined in the schema
  • Real-world usage example in the manifest flow test
  • Proper error handling for invalid keys and types

The component works as expected with the SimpleRetriever and has appropriate test coverage. The experimental status is properly marked in both the implementation and schema definition.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing implementations and test coverage
echo "Checking for implementations and tests..."
rg -g "test_*.py" "DynamicSchemaLoader" -A 5

Length of output: 1983


Script:

#!/bin/bash
# Let's check the implementation details and test coverage more thoroughly
echo "Checking test coverage for DynamicSchemaLoader..."
rg -g "test_*.py" -B 5 -A 10 "def test.*DynamicSchemaLoader"

echo -e "\nChecking implementation details..."
ast-grep --pattern 'class DynamicSchemaLoader {
  $$$
}'

echo -e "\nChecking for integration with different retriever types..."
rg "DynamicSchemaLoader.*retriever.*SimpleRetriever|DynamicSchemaLoader.*retriever.*AsyncRetriever|DynamicSchemaLoader.*retriever.*CustomRetriever" -A 3

Length of output: 1034


Script:

#!/bin/bash
# Let's look at the actual test file content to understand test coverage
echo "Checking test file content..."
cat unit_tests/sources/declarative/schema/test_dynamic_schema_loader.py

echo -e "\nChecking for actual implementation class..."
rg -g "!test_*" "class DynamicSchemaLoader" -B 2 -A 10

Length of output: 14970

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I'm wondering if @aldogonzalez8 has used it in source-google-sheets and can approve this as well

@aldogonzalez8
Copy link
Contributor

aldogonzalez8 commented Dec 12, 2024

LGTM! I'm wondering if @aldogonzalez8 has used it in source-google-sheets and can approve this as well

I have some advance and I confirm I've been able to use the DynamicStreams and DynamicSchema for google-sheets with a pre-dev cdk with minor changes that I think @lazebnyi has already added in last commits.

@aldogonzalez8
Copy link
Contributor

LGTM! I'm wondering if @aldogonzalez8 has used it in source-google-sheets and can approve this as well

I have some advance and I confirm I've been able to use the DynamicStreams and DynamicSchema for google-sheets with a pre-dev cdk with minor changes that I think @lazebnyi had already added in last commits.

Actualy is weird that the compare doesn't show me the latest changes in this PR but I assume is some bad cache or I did something wrong in the compare as I can see these options have already been added.

image

This is auto-generated file, so we don't need change manually anything here.

@lazebnyi, what is the command/action that re-generated the file?

Copy link
Contributor

@aldogonzalez8 aldogonzalez8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVED

@lazebnyi
Copy link
Contributor Author

lazebnyi commented Dec 12, 2024

@aldogonzalez8 poetry run poe build will re-generate models based on yaml schema

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants