Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

sgoley · 2021-10-01T17:48:35Z

Describe the feature

Use alternative delimiters for seeds besides "," (comma)

Describe alternatives you've considered

Currently, the only option is to load a non-csv seed into database as a temp table and then reference that as a "source".

Additional context

No, not database specific.

Who will this benefit?

Anyone who uses seeds.

jtcohen6 · 2021-10-04T12:25:54Z

@sgoley When you say "non-csv file types," are you talking only about tabular data with non-comma delimiters (e.g. TSVs)? Or are you also thinking about other formats for structured data (JSON, XML, ...)?

If it's just a question of delimiters, could you provide a bit more background on the difficulty you're encountering? Is there a blocker to pre-processing your files, so as to switch from tab/semicolon/other delimiter to commas?

We've got another issue already open for newline-delimited JSON support: #2365

sgoley · 2021-10-05T02:10:14Z

Yes, in my case I am specifically talking about non-comma delimiters like TSVs, semi-colon delimited (common in some european countries given the numerical comma standard), or "|" ( 'VERTICAL LINE' U+007C ) delimited files (infrequent but used within US financial systems).

I do completely understand that pre-processing is the only current workable solution, just opening the issue here since I searched and no truly similar issue has been raised or closed.

After reading more into the agate csv reader function, it looks like that particular api does not support a "sep" / "delimiter" parameter which is why I assume this was not supported natively?

jtcohen6 · 2021-10-05T13:50:43Z

@sgoley You're right, we just call agate's from_csv method for this:
https://github.com/dbt-labs/dbt/blob/f7680379fca80f653e8e7e2d45d7165a0fd864da/core/dbt/clients/agate_helper.py#L146

The good news: in that method, agate itself is happy to pass through any/all keyword arguments supported by python's built-in reader. It accepts **kwargs and passes them into csv.reader().

So I think this could be as simple as:

adding a delimiter arg to agate_helper.from_csv
adding a delimiter config to seeds (right around here), with default value either , or None (i.e. comma)
adjusting the load_agate_table context method

        column_types = self.model.config.column_types
        delimiter = self.model.config.delimiter
        try:
            table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)

So, I'm pretty close to tagging this as a good first issue. I have just a few more questions:

What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?
Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.
Would we be better off saying "no" to all of the above, keep seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing?

jameseon · 2022-01-17T13:25:11Z

Hello @jtcohen6

Following this issue along with JSON related (following #2365 )
As a dbt community member, If I may answer your questions to @sgoley

I understand the poor use cases outlined for dbt here https://docs.getdbt.com/docs/building-a-dbt-project/seeds - for valid/good use cases seeding small amounts of data as part of a build process (sampling ) - writing tests, documentation etc

What if you need custom quoting, escape characters, etc? Should we seek to add generalized support for all csv.reader() kwargs as configs? Or do we think a configurable delimiter covers 90% of the bases?

-- I believe the community would benefit from adding generalized support for csv - COMMA is a very common character and shows up everywhere. If the data is not escaped or quoted, the data may shift even for simple files.

support for different delimiters, text enclosure and escape character at a minimum

Would you expect to be able to define these seed files with other file extensions (e.g. *.tsv)? That change would need to happen in a different part of the codebase.

-- Another community benefit, sometimes we need to work with small datasets from different formats. I believe support for CSV (if it has the configurable delimiter with text enclosure and escape character; this covers any character delimited file .txt,.tsv, pipe delimited etc - csv.reader() kwargs should handle this beautifully), json , xml, parquet and ORC: these are becoming common.

Would we be better off saying "no" to all of the above, keep seed support very simple (dbt isn't a data loader!), even at the cost of pre-processing? keeping it simple is great but this limits the entry point to dbt as only comma separated files. Going through

---- As the community grows this is going to be a growing need. Now, dbt seed is still not ideal for loading data into a warehouse, as part of the build process for a data project, sampling (maybe find size limitations to avoid functionality abuse) needs to be recommended as part of best practice to keep the functionality light as is. Yes (dbt is not a a data loader), but can it be used to build an ELT pipeline based on small samples of different file formats? Mocking the (EL) process for common data formats, build pipeline based on a small amount of data. Let's not say "no", say yes 😄 - the community needs these functionalities. The preprocess of a file before entry to dbt is a little hectic and expensive for a modern day tool.

Thank you for considering this. Keep up the great work.

github-actions · 2022-07-17T02:14:19Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

github-actions · 2022-07-24T02:14:40Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest; add a comment to notify the maintainers.

…labs#3990)

ramonvermeulen · 2023-03-17T13:55:08Z

I opened a pull request that adds this feature into dbt-core: #7186

jtcohen6 · 2023-03-20T08:38:18Z

Thanks @ramonvermeulen! Reopening, given you've picked this up

… (#7186) * Support configurable delimiter for seed files, default to comma (#3990) * Update Features-20230317-144957.yaml * Moved "delimiter" to seed config instead of node config * Update core/dbt/clients/agate_helper.py Co-authored-by: Cor <[email protected]> * Update test_contracts_graph_parsed.py * fixed integration tests * Added functional tests for seed files with a unique delimiter * Added docstrings * Added a test for an empty string configured delimiter value * whitespace * ran black * updated changie entry * Update Features-20230317-144957.yaml --------- Co-authored-by: Cor <[email protected]>

Ram-Dev7 · 2024-11-03T21:38:21Z

Hello team, when this issue will be fixed? we have both csv and tsv files in the seeds, seems like tsv is not working in seeds - I'm in version dbt-databricks 1.8.0, Thanks

ramonvermeulen · 2024-11-04T16:23:52Z

Hello team, when this issue will be fixed? we have both csv and tsv files in the seeds, seems like tsv is not working in seeds - I'm in version dbt-databricks 1.8.0, Thanks

@Ram-Dev7
So for my understanding, the issue is that tab separator still does not work as delimiter after my delimiter support contribution last year? Could you provide me with some more context, e.g. the configuration that you currently use? I assume you should configure with a \t as delimiter. Most likely if \t does not work, this has to do with the underlaying library (agate) that is being used and is not supporting this.

UPDATE: I did some tests, and this seems to be the case.

See call with delimiter being passed as argument:

dbt-core/core/dbt/context/providers.py

Line 1055 in bdf28d7

    
           table = agate_helper.from_csv(path, text_columns=column_types, delimiter=delimiter)

And function definition in dbt-common:
https://github.com/dbt-labs/dbt-common/blob/65d7ba5d71f4ebea80605408824708bf0e5a99af/dbt_common/clients/agate_helper.py#L152

In that case I think it might be a bit harder to fix, e.g. this needs to be implemented in agate, or a very big change has to happen in how dbt is handling seeds (e.g. using python internals or another library instead of agate?).

But let's first see if we can have a reproducable example, to confirm that this is really the case.

sgoley added enhancement New feature or request triage labels Oct 1, 2021

jtcohen6 removed the triage label Oct 4, 2021

jtcohen6 added the seeds Issues related to dbt's seed functionality label Oct 5, 2021

jtcohen6 added the Team: Execution label Nov 18, 2021

github-actions bot added the stale Issues that have gone stale label Jul 17, 2022

github-actions bot closed this as completed Jul 24, 2022

ramonvermeulen added a commit to ramonvermeulen/dbt-core that referenced this issue Mar 17, 2023

Support configurable delimiter for seed files, default to comma (dbt-…

596856b

…labs#3990)

ramonvermeulen mentioned this issue Mar 17, 2023

Support configurable delimiter for seed files, default to comma (#3990) #7186

Merged

6 tasks

jtcohen6 reopened this Mar 20, 2023

jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Language and removed stale Issues that have gone stale Team:Execution labels Mar 20, 2023

jtcohen6 removed the Team:Language label Jul 19, 2023

QMalcolm closed this as completed in #7186 Aug 1, 2023

tlento mentioned this issue Nov 6, 2023

Update typing-extensions version to >=4.4 #9012

Merged

5 tasks

tlento mentioned this issue Feb 26, 2024

Update dbt-semantic-interfaces dependency to compatible range #9671

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

sgoley commented Oct 1, 2021

jtcohen6 commented Oct 4, 2021

sgoley commented Oct 5, 2021

jtcohen6 commented Oct 5, 2021

jameseon commented Jan 17, 2022 •

edited

Loading

github-actions bot commented Jul 17, 2022

github-actions bot commented Jul 24, 2022

ramonvermeulen commented Mar 17, 2023

jtcohen6 commented Mar 20, 2023

Ram-Dev7 commented Nov 3, 2024

ramonvermeulen commented Nov 4, 2024 •

edited

Loading

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

Introduce "delimiter" parameter for seeds? Allow non-csv file types #3990

Comments

sgoley commented Oct 1, 2021

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

jtcohen6 commented Oct 4, 2021

sgoley commented Oct 5, 2021

jtcohen6 commented Oct 5, 2021

jameseon commented Jan 17, 2022 • edited Loading

github-actions bot commented Jul 17, 2022

github-actions bot commented Jul 24, 2022

ramonvermeulen commented Mar 17, 2023

jtcohen6 commented Mar 20, 2023

Ram-Dev7 commented Nov 3, 2024

ramonvermeulen commented Nov 4, 2024 • edited Loading

jameseon commented Jan 17, 2022 •

edited

Loading

ramonvermeulen commented Nov 4, 2024 •

edited

Loading