
Unable to configure a connection with duplicate streams from different namespaces #105

Open
Grintas opened this issue May 20, 2024 · 6 comments


@Grintas

Grintas commented May 20, 2024

I'm configuring a connection with Terraform from Postgres to S3. The database has multiple identical schemas (namespaces), so the table names (stream_names) are not unique:

- postgres_db
    - schema_1
        - table_1
        - table_2
    - schema_2
        - table_1
        - table_2

In the destination config I have s3_path_format set as ${STREAM_NAME}/${NAMESPACE}/${YEAR}_${MONTH}_${DAY}_, so the data in S3 would be partitioned by the source schema.
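To make the intent concrete, here is a minimal sketch of how that path format should expand, assuming Airbyte's `${VAR}` placeholders behave like standard template variables (`expand_path` is my own illustrative helper, not Airbyte's actual code):

```python
# Sketch of ${VAR} substitution for an s3_path_format string.
# This is an approximation of the behavior, not Airbyte's implementation.
from string import Template

def expand_path(fmt: str, stream: str, namespace: str,
                year: int, month: int, day: int) -> str:
    # Airbyte-style ${VAR} placeholders map directly onto string.Template.
    return Template(fmt).substitute(
        STREAM_NAME=stream,
        NAMESPACE=namespace,
        YEAR=f"{year:04d}",
        MONTH=f"{month:02d}",
        DAY=f"{day:02d}",
    )

fmt = "${STREAM_NAME}/${NAMESPACE}/${YEAR}_${MONTH}_${DAY}_"
print(expand_path(fmt, "table_1", "schema_1", 2024, 5, 20))
# -> table_1/schema_1/2024_05_20_
```

So the two `table_1` streams would land under `table_1/schema_1/...` and `table_1/schema_2/...`, which only works if both namespaced streams can be configured in the first place.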
However, I cannot configure the connection resource properly since the configurations.streams block does not have a schema/namespace attribute. If I include just one entry for table_1 like this:

configurations = {
  streams = [
    {
      name         = "table_1"
      cursor_field = ["_ab_cdc_lsn"]
      sync_mode    = "incremental_append"
    }
  ]
}

then, once deployed, the connection has just one stream, from one of the source schemas.
When I enabled two streams for table_1 in the UI and ran the apply again, the plan showed me this:

Terraform will perform the following actions:

  # airbyte_connection.rds_to_s3 will be updated in-place
  ~ resource "airbyte_connection" "rds_to_s3" {
      ~ configurations                       = {
          ~ streams = [
              - {
                  - cursor_field = [
                      - "_ab_cdc_lsn",
                    ] -> null
                  - name         = "table_1" -> null
                  - primary_key  = [
                      - [
                          - "id",
                        ],
                    ] -> null
                  - sync_mode    = "incremental_append" -> null
                },
                # (1 unchanged element hidden)
            ]
        }
        # (9 unchanged attributes hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

So I added a duplicate stream entry for table_1; the plan then passed, but the apply failed with an error:

Status 400 {"detail":"The body of the request contains an invalid connection configuration. Duplicate stream found in configuration for: table_1.","type":"https://reference.airbyte.com/reference/errors","title":"bad-request","status":400}

How can I address this? Did I miss any configuration in the docs or is this not supported?

@Grintas Grintas changed the title The body of the request contains an invalid connection configuration. Duplicate stream found in configuration for <stream_name> Unable to configure a connection with duplicate streams from different namespaces May 20, 2024
@lugonthier

I had the exact same problem with Snowflake to BigQuery where the snowflake schema is the namespace.

In my case, the workaround I found was to create one source per namespace by specifying the schema param in airbyte_source_snowflake, and then create one connection per source.

It makes the provider mostly unusable at scale.
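For the record, that workaround can be stamped out with for_each rather than hand-written per schema. A sketch in HCL, where the exact resource and attribute names (e.g. `configuration.schema`, `airbyte_destination_bigquery.main`) are my assumptions about the provider's schema, not verified against it:

```hcl
locals {
  schemas = ["schema_1", "schema_2"] # one entry per namespace to replicate
}

# One source per schema -- the schema param scopes discovery to a single namespace.
resource "airbyte_source_snowflake" "per_schema" {
  for_each = toset(local.schemas)
  name     = "snowflake-${each.key}"
  configuration = {
    schema = each.key
    # ... credentials etc. ...
  }
}

# One connection per source, so stream names are unique within each connection.
resource "airbyte_connection" "per_schema" {
  for_each       = airbyte_source_snowflake.per_schema
  source_id      = each.value.source_id
  destination_id = airbyte_destination_bigquery.main.destination_id
  configurations = {
    streams = [
      { name = "table_1", sync_mode = "incremental_append" },
      { name = "table_2", sync_mode = "incremental_append" },
    ]
  }
}
```

This keeps the config DRY, but it does nothing about the per-connection overhead described below.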

@Grintas

Grintas commented Jun 12, 2024

I had the exact same problem with Snowflake to BigQuery where the snowflake schema is the namespace.

In my case, the workaround I found was to create one source per namespace by specifying the schema param in airbyte_source_snowflake, and then create one connection per source.

It makes the provider mostly unusable at scale.

Yep, we came to the same conclusion. But some of our source databases have 100+ schemas, and if I'm not mistaken, one connection per schema would require 100+ replication slots, which would essentially kill the source cluster.

@Grintas Grintas closed this as completed Jun 12, 2024
@Grintas Grintas reopened this Jun 12, 2024
@jasonmaddernstudylink

I’ve hit the exact same problem as I have circa 50 identical schemas in Postgres to sync :(

@jmaddern-fw

I would like to add that this was doable through Octavia (I had many identical Postgres streams all being synced through the one pipeline). Yes, the YAML file had some repetition, but it was workable.

Are there any plans to introduce this functionality?

@gingeard

gingeard commented Oct 19, 2024

It seems this is a limitation of the public API. While the deprecated Configuration API (server-api) includes a "namespace" field in the "stream" object:

https://airbyte-public-api-docs.s3.us-east-2.amazonaws.com/rapidoc-api-docs.html#post-/v1/connections/create

{
  ...
  "syncCatalog": {
    "streams": [
      {
        "stream": {
          "name": "...",
          "jsonSchema": {...},
          ...,
          "namespace": "string"
        },
        "config": {
          ...
        }
      }
    ]
  },
  ...
}

and you can send it to the backend. For example, in the browser I can see that the POST payload to

http://AIRBYTE_WEBAPP/api/v1/web_backend/connections/create

looks like this:

[screenshot: the POST payload, with "namespace" set on each stream]

At the same time, the public API has no equivalent parameter:
https://reference.airbyte.com/reference/createconnection

The "configurations[].streams[]" object has only the "name", "syncMode", "cursorField", "primaryKey" and "selectedFields" parameters. I have no idea why the public API is cut down compared with the server API.

This should be raised as an issue against the Airbyte Platform and its public API functionality.
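In other words, automation that needs namespaced streams currently has to target the Configuration API's catalog shape itself. A minimal Python sketch of building such a syncCatalog, where `build_sync_catalog` and the "config" field subset are my own illustration; only the name/namespace nesting mirrors the API shape quoted above:

```python
# Sketch: build a Configuration-API-style syncCatalog in which the same
# stream name may appear under different namespaces. The helper and the
# chosen "config" fields are illustrative, not the full API schema.

def build_sync_catalog(streams):
    """streams: iterable of (namespace, name) tuples."""
    return {
        "streams": [
            {
                "stream": {"name": name, "namespace": namespace},
                "config": {"selected": True, "syncMode": "incremental"},
            }
            for namespace, name in streams
        ]
    }

catalog = build_sync_catalog([("schema_1", "table_1"), ("schema_2", "table_1")])

# Two entries share the stream name but differ by namespace -- exactly the
# combination the public API's configurations.streams cannot express today.
pairs = [(s["stream"]["namespace"], s["stream"]["name"]) for s in catalog["streams"]]
print(pairs)  # [('schema_1', 'table_1'), ('schema_2', 'table_1')]
```

A payload like this would then be POSTed to the deprecated connections/create endpoint, which is workable but means bypassing both Terraform and the supported public API.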

@jmaddern-fw

Thanks so much for your help, and clear response, @gingeard

Ticket raised with Airbyte: airbytehq/airbyte#47140


5 participants