Using multiple delimiters #321

Open
Olgierd-Jankovski opened this issue Aug 15, 2024 · 2 comments

Comments

@Olgierd-Jankovski

Hi @tototoshi!
It's really great that you created this parser and keep fixing and updating its functionality!

At my company, we have run into an issue: multiple clients send us CSV files containing data in a predefined format, but the delimiters differ between files, and we can only assign a single delimiter (the default one is ,).

Thus, we are unable to parse the documents our clients import, since each of them may use , or # or ; as the delimiter/separator character. The parser either parses the data wrongly or, even worse, crashes, e.g. while trying to parse the following line:

"hello world";123,321

I have seen an example of overriding the default delimiter with a custom character:

implicit object MyFormat extends DefaultCSVFormat {
  override val delimiter = '#'
}

However, as far as I understand, the parser does not support multiple delimiters, e.g. something like this hypothetical setting:

implicit object MyFormat extends DefaultCSVFormat {
  override val delimiterHashset = HashSet(";", ",", "#")
}

I would appreciate it if we could discuss possible solutions to this issue!
Thank you for your time! Looking forward to your reply!

Best Regards,
Olgierd Jankovski

@tototoshi
Owner

Hi @Olgierd-Jankovski

It seems that supporting multiple delimiters in the parser would be challenging. Delimiters are treated in a special way, and allowing for multiple ones would require significant changes to the parser’s implementation, likely affecting its performance and potentially impacting other functionalities like CSV writing.

I believe that a format with multiple delimiters might differ from the standard CSV format, which is why I tend to think that support for such a feature may not be necessary for a general-purpose CSV library. However, I recognize that this is just my perspective, and there may be more situations where such formats are commonly used.

If there are CSV libraries that support multiple delimiters, I would be very interested in learning more about them as a reference.

@Olgierd-Jankovski
Author

Thank you for your response!

True, supporting multiple delimiters at once could be challenging and, even worse, could become a performance bottleneck. An alternative came to my mind: automatically detecting the delimiter (assuming it is unknown, but is certainly one of the , ; # | symbols). Once the delimiter is detected, the only thing left is to run the existing parsing flow.

Of course, several problems arise. How do we decide that the delimiter has been detected (e.g. a file of x rows was successfully parsed, and every row contains the same number of elements)? Should we scan the entire file or only a chunk of it for delimiter detection?
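
To make the idea concrete, here is a minimal sketch of such a detection step in plain Scala. It is not part of scala-csv; the object name `DelimiterSniffer`, the candidate list, and the "same non-zero count on every sampled line" rule are all assumptions made for illustration. It only looks at a sample of lines and ignores delimiters inside double quotes; quoted fields that span several lines are not handled.

```scala
// Hypothetical helper, not part of scala-csv: guesses the delimiter from a sample of lines.
object DelimiterSniffer {
  // Candidate delimiters (an assumption taken from the symbols listed above).
  val candidates: Seq[Char] = Seq(',', ';', '#', '|')

  // Count occurrences of `delim` in one line, skipping double-quoted sections.
  // An escaped quote ("") toggles the flag twice, so it leaves the state unchanged.
  def countOutsideQuotes(line: String, delim: Char): Int = {
    var inQuotes = false
    var count = 0
    line.foreach { c =>
      if (c == '"') inQuotes = !inQuotes
      else if (c == delim && !inQuotes) count += 1
    }
    count
  }

  // A candidate qualifies if it occurs at least once and the same number of
  // times on every non-empty sampled line. Returns None when no candidate
  // or more than one candidate qualifies (i.e. the sample is ambiguous).
  def detect(sample: Seq[String]): Option[Char] = {
    val lines = sample.filter(_.trim.nonEmpty)
    if (lines.isEmpty) None
    else {
      val consistent = candidates.filter { d =>
        val counts = lines.map(countOutsideQuotes(_, d)).distinct
        counts.size == 1 && counts.head > 0
      }
      if (consistent.size == 1) consistent.headOption else None
    }
  }
}
```

With this rule, `DelimiterSniffer.detect(Seq("\"hello; world\";123;321", "\"a\";1;2"))` gives `Some(';')`, while the single line `"hello world";123,321` from my first comment gives `None`, because both , and ; appear exactly once and the sample cannot tell them apart. A detected character could then be used as the `delimiter` of a custom format, as in the `MyFormat` example above.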

Indeed, implementing this could prove challenging, but I believe this feature could make the parser stand out.
As for examples of where this feature already exists, I have found only a few so far:
https://github.com/nietras/Sep - written in C# and
https://github.com/uniVocity/univocity-parsers - Java

However, even though they support delimiter detection, it is still unclear to me whether the parse results are valid and whether they require some condition to hold, e.g. that the first row contains a fixed number of headers. I have not dived deeply into their implementations.

Thank you for your time!
Best Regards,

Olgierd Jankovski
