CSV dialect detection: implementation without third party libraries #2247
Comments
Thanks @ws-garcia! This is very timely, as I was dreading taking on the csv-sniffer Python port, hence the lack of activity. Your step-by-step "new path" breakdown is certainly easier to digest than the paper :) Will be sure to loop you in as we make progress...
You can use the paper just to implement parts of the logic if porting the Python code gets confusing. So, treat the research as a backup reference when diving into the implementation.
Hi @ws-garcia, just wanted to let you know that I'm thinking of implementing your paper as a Rust library given the utility of CSV dialect detection, as other developers may want to use your CSV dialect detection algorithm, and qsv is a command-line utility. As the name I will deprecate the existing Thoughts?
Hey @jqnatividad, I am honored that you have the idea of adding my name to the library. But there is a name that would sound great and promote the amazing product that is I continue to think that adding a high-precision dialect detector to qsv would be a great milestone for the project. So, go ahead with the library and its implementation!
Great! 🎉 Will keep you posted as we make progress on implementing the library and integrating it into qsv and qsv pro.
The research paper methodology will soon be published as Open Access under the Creative Commons Attribution License (CC BY 4.0). You only need to credit the copyright ©️. Let's go make qsv as infallible as possible!
Discussed in #2246
Originally posted by ws-garcia October 25, 2024
Problem overview
Currently, this project has no stable way to detect the configuration (dialect) of a CSV file. An example is raised in #1719, where the utility fails to detect the configuration of the given files.
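For readers new to the term, detecting a CSV file's configuration means inferring details such as the delimiter and quote character from a sample of the text. The snippet below is only a minimal sketch of that task using Python's standard-library `csv.Sniffer`, the baseline detector compared later in this post; it is not the approach proposed here. The helper name `detect_dialect` and the sample data are made up for illustration.

```python
import csv


def detect_dialect(sample, candidates=";,\t|"):
    """Guess the dialect of a CSV text sample (hypothetical helper).

    Uses the standard library's csv.Sniffer, restricted to a small set of
    candidate delimiters. It raises csv.Error when no dialect can be determined.
    """
    return csv.Sniffer().sniff(sample, delimiters=candidates)


# Illustrative sample: semicolon-delimited, no quoting.
sample = "name;age;city\nAna;31;Lima\nLuis;27;Quito\n"

dialect = detect_dialect(sample)

# Parse the sample with the detected dialect.
rows = list(csv.reader(sample.splitlines(), dialect))
print(repr(dialect.delimiter), rows[1])
```

Heuristics of this kind can fail or guess wrong on messy real-world files, which is the gap the rest of this post is about.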
Details
At the moment, @jqnatividad has begun digging into the problem and claiming
He pointed
The path forward so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all tasks are under study, but none is completed.
New path
Here I will discuss a new approach to implementing dialect detection in qsv using trivial elements:
With this approach, dialect detection is as reliable as CleverCSV's and is able to obtain results with greater certainty. The process is as follows:
A Python implementation of this exact approach is described in a GitHub repository. The evaluation of this method gives:
CSVsniffer
CleverCSV
csv.Sniffer
This sheds light on one point: the presented approach clearly outperforms csv.Sniffer and also CleverCSV on the research datasets.
Hoping this can help this wonderful project!
Edit:
Code snippet will be presented in the discussion.