search function and stopwords #3104
Comments
Yes, we use a default list of stopwords, which contains 571 words, including "what," "can," and "which." You can view the complete list here:

But, as @severo mentioned in this discussion, we now support language-specific stemmers for monolingual datasets, so using a default English stopwords list for all languages no longer makes sense. However, DuckDB currently lacks a straightforward way to assign stopwords based on language, as it does for stemmers (we would need to seed a stopwords table for non-English datasets). Therefore, for now, the best approach is to set the stopwords parameter to 'none'. cc @lhoestq

If users want to remove stopwords for specific monolingual datasets (e.g., English), this could be a candidate for a custom configuration at the dataset card level. Keep in mind that removing stopwords like "what," "can," or "which" helps focus the search on more meaningful terms, which improves relevance. It also reduces the size of the search index and speeds up queries, which is crucial for performance in the dataset viewer.
@AndreaFrancis thanks for the reply. It is true that content words are typically more meaningful than prepositions, articles, and so on, but a researcher may well be interested in filtering a dataset for function words or, especially, for collocations containing function words. The way the plugin works at the moment, a search string like "never experience" is equivalent to "experience".
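To make the effect concrete, here is a minimal pure-Python sketch of how stopword filtering collapses the two queries. It is not the dataset-viewer or DuckDB implementation: `STOPWORDS` is a tiny hypothetical subset of the 571-word list, `tokenize` and `matches` are illustrative helpers, and matching requires every query term to be present, whereas a real full-text index ranks rows by score.

```python
import re

# Hypothetical subset for illustration only; the real default list has 571 entries.
STOPWORDS = frozenset({"what", "can", "which", "never"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]


def matches(query, document, stopwords=frozenset()):
    """Return True if every remaining query token occurs in the document.

    A query that is empty after stopword removal matches nothing.
    """
    query_tokens = tokenize(query, stopwords)
    doc_tokens = set(tokenize(document, stopwords))
    return bool(query_tokens) and all(t in doc_tokens for t in query_tokens)


docs = [
    "I never experience delays",
    "I experience delays",
    "What can I do?",
]

# With stopwords removed, "never experience" behaves like "experience".
with_stopwords = [d for d in docs if matches("never experience", d, STOPWORDS)]

# With stopwords set to none, only the row containing both words is returned.
without_stopwords = [d for d in docs if matches("never experience", d)]

# A query made only of stopwords returns no rows at all.
what_filtered = [d for d in docs if matches("what", d, STOPWORDS)]
what_unfiltered = [d for d in docs if matches("what", d)]
```

In this sketch the filtered search returns both "experience" rows and finds nothing for "what", while the unfiltered search distinguishes the collocation and finds the "What can I do?" row.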
It seems that the dataset-viewer search function returns no hits if one searches for terms such as “what”, “can”, “which”, and so on. Has the indexing function removed stopwords like these? The rows are returned if one uses the SQL console, but the rows returned there don’t give access to the audio column, for a dataset that includes audio files. Is there a way to search for stopwords like these in the default dataset viewer? It would be really useful if all of the textual content in a column were searchable.