search function and stopwords #3104

Open
stcoats opened this issue Nov 15, 2024 · 2 comments

Comments

@stcoats

stcoats commented Nov 15, 2024

It seems that the dataset-viewer search function returns no hits if one searches for terms such as “what”, “can”, “which”, and so on. Has the indexing function removed stopwords like this? The rows are returned if one uses the SQL console, but the returned rows in the SQL console don’t give access to the column with audio, for a dataset that includes audio files. Is there a way to search for stop words like this in the default datasets Viewer? It would be really useful if all of the textual content in a column could be searchable.

@AndreaFrancis
Contributor

> Has the indexing function removed stopwords like this?

Yes, we use a default list of stopwords, which contains 571 words, including "what," "can," and "which." You can view the complete list here:
DuckDB English Stopwords List.

But, as @severo mentioned in this discussion, we now support language-specific stemmers for monolingual datasets. Using a default English stopwords list for all languages no longer makes sense. However, DuckDB currently lacks a straightforward way to assign stopwords based on language as it does for stemmers (we would need to seed a stopwords table for non-English datasets). Therefore, for now, the best approach is to set the stopwords parameter to 'none'. cc. @lhoestq

If users want to remove stopwords for specific monolingual datasets (e.g., English), this could be a candidate for a custom configuration at the dataset card level. Keep in mind that removing stopwords like "what," "can," or "which" helps focus on more meaningful terms, improving search relevance. It also reduces the size of the search index and speeds up queries, which is crucial for performance in the Datasets Viewer.
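To illustrate the trade-off described above, here is a minimal pure-Python sketch of an inverted index with an optional stopword list. It is not DuckDB's implementation or the viewer's code: `TOY_STOPWORDS`, `tokenize`, and `build_index` are hypothetical names, and the tiny stopword set merely stands in for the 571-word default list.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the default English stopword list (the real one has 571 entries).
TOY_STOPWORDS = {"what", "can", "which", "the"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(rows, stopwords=None):
    """Map each token to the set of row ids containing it.

    stopwords=None keeps every token, analogous to disabling stopword removal.
    """
    stop = stopwords or set()
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in tokenize(text):
            if token not in stop:
                index[token].add(row_id)
    return dict(index)


rows = {
    0: "What can I do about the noise?",
    1: "Which microphone was used?",
    2: "The recording is clean.",
}

with_stopwords = build_index(rows, TOY_STOPWORDS)
without_stopwords = build_index(rows)

# The filtered index is smaller, but "what" can no longer be looked up in it.
print(len(with_stopwords), len(without_stopwords))
print("what" in with_stopwords, "what" in without_stopwords)
```

The filtered index has fewer entries, which is the size and speed benefit mentioned above; the cost is that any term on the list has no entry to match against.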

@stcoats
Author

stcoats commented Nov 20, 2024

@AndreaFrancis thanks for the reply. It is true that content words are typically more meaningful than prepositions, articles, and so on, but a researcher may well be interested in filtering a dataset for function words or, especially, collocations containing function words. The way the plugin works at the moment, a search string like "never experience" is equivalent to "experience".
In addition, and as you note, the current configuration also removes words from datasets in other languages if they are homographs of English stopwords. So, for example, Swedish "by" (village) is unsearchable in a Swedish-language dataset with the current viewer parameters.
I understand that it affects speed, but I think it would be great if no stopwords were used and/or custom configurations could be made available.
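The two effects described in this comment can be reproduced with a small pure-Python sketch. It is only an illustration under stated assumptions: `ENGLISH_STOPWORDS` is a hypothetical three-word stand-in for the real list, `search` is a made-up helper, and the sketch assumes that every remaining query term must appear in a row, which may differ from how the viewer actually matches and ranks results.

```python
import re

# Hypothetical, tiny stand-in for an English stopword list.
ENGLISH_STOPWORDS = frozenset({"never", "by", "what"})


def tokenize(text, stopwords):
    """Lowercase, split into word tokens, and drop any stopwords."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]


def search(rows, query, stopwords=frozenset()):
    """Return indices of rows containing every query term left after stopword removal."""
    terms = set(tokenize(query, stopwords))
    if not terms:
        # The query consisted only of stopwords, so nothing can match.
        return []
    return [
        row_id
        for row_id, text in enumerate(rows)
        if terms <= set(tokenize(text, stopwords))
    ]


english_rows = ["I never experience delays", "They experience delays daily"]
swedish_rows = ["Vi bor i en liten by", "Staden är stor"]

# With the stopword list, "never" is dropped and both queries behave identically.
print(search(english_rows, "never experience", ENGLISH_STOPWORDS))
print(search(english_rows, "experience", ENGLISH_STOPWORDS))

# Without stopwords, the collocation narrows the result to the first row only.
print(search(english_rows, "never experience"))

# Swedish "by" collides with the English stopword and becomes unsearchable.
print(search(swedish_rows, "by", ENGLISH_STOPWORDS))
print(search(swedish_rows, "by"))
```

In this toy setup, disabling the stopword list is what lets the function-word collocation and the Swedish homograph be found again.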
