search function and stopwords #3104

stcoats · 2024-11-15T20:11:56Z

It seems that the dataset-viewer search function returns no hits if one searches for terms such as “what”, “can”, “which”, and so on. Has the indexing function removed stopwords like these? The rows are returned if one uses the SQL console, but the SQL console results don’t give access to the audio column, for a dataset that includes audio files. Is there a way to search for stopwords like these in the default dataset viewer? It would be really useful if all of the textual content in a column were searchable.

AndreaFrancis · 2024-11-18T11:33:26Z

> Has the indexing function removed stopwords like this?

Yes, we use a default list of stopwords, which contains 571 words, including "what," "can," and "which." You can view the complete list here: DuckDB English Stopwords List.

But, as @severo mentioned in this discussion, we now support language-specific stemmers for monolingual datasets, so using a default English stopwords list for all languages no longer makes sense. However, DuckDB currently lacks a straightforward way to assign stopwords based on language as it does for stemmers (we would need to seed a stopwords table for non-English datasets). Therefore, for now, the best approach is to set the stopwords parameter to 'none'. cc @lhoestq
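As a rough illustration of the `stopwords='none'` setting, here is a minimal sketch using DuckDB's full-text search extension from Python. The table name `docs`, its columns, the toy rows, and the helper names `fts_index_pragma` and `search` are all hypothetical; this is not the dataset-viewer's actual indexing code, and it assumes the `duckdb` package is installed and the `fts` extension can be downloaded.

```python
def fts_index_pragma(table, id_column, text_columns, stopwords="none"):
    """Build the PRAGMA statement that creates a DuckDB FTS index.

    stopwords='none' keeps every token; 'english' applies DuckDB's
    built-in English stopword list.
    """
    columns = ", ".join(f"'{c}'" for c in text_columns)
    return (
        f"PRAGMA create_fts_index('{table}', '{id_column}', {columns}, "
        f"stopwords='{stopwords}', overwrite=1)"
    )


def search(stopwords, query):
    """Index three toy rows (hypothetical data) and return matching ids."""
    import duckdb  # third-party; imported lazily so the helper above has no dependencies

    con = duckdb.connect()
    con.execute("INSTALL fts")
    con.execute("LOAD fts")
    con.execute("CREATE TABLE docs (id INTEGER, text VARCHAR)")
    con.execute(
        "INSERT INTO docs VALUES "
        "(1, 'what can I do'), (2, 'never experience this'), (3, 'en liten by')"
    )
    con.execute(fts_index_pragma("docs", "id", ["text"], stopwords=stopwords))
    # match_bm25 yields NULL for rows that do not match the query.
    rows = con.execute(
        "SELECT id FROM ("
        "  SELECT id, fts_main_docs.match_bm25(id, ?) AS score FROM docs"
        ") WHERE score IS NOT NULL ORDER BY id",
        [query],
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `search("english", "what")` and `search("none", "what")` side by side should show the difference described above: with the English list the query term is dropped before matching, while with `'none'` the row containing "what" is expected to be returned.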

If users want to remove stopwords for specific monolingual datasets (e.g., English), this could be a candidate for a custom configuration at the dataset card level. Keep in mind that removing stopwords like "what," "can," or "which" helps focus on more meaningful terms, improving search relevance. It also reduces the size of the search index and speeds up queries, which is crucial for performance in the Datasets Viewer.

stcoats · 2024-11-20T08:53:08Z

@AndreaFrancis thanks for the reply. It is true that content words are typically more meaningful than prepositions, articles, and so on, but a researcher may well be interested in filtering a dataset for function words or, especially, collocations containing function words. The way the plugin works at the moment, a search string like "never experience" is equivalent to "experience".
In addition, and as you note, the current configuration actually removes words in datasets in other languages if they are homographs of English stopwords. So, for example, Swedish "by" (village) is unsearchable in a Swedish-language dataset with the current viewer parameters.
I understand that it affects speed, but I think it would be great if no stopwords were used and/or custom configurations could be made available.
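Both effects can be reproduced with a small pure-Python sketch of stopword filtering at query time. The `STOPWORDS` set below is a hypothetical five-word subset chosen for illustration, not the full 571-word list, and the `tokenize` helper is a simplification rather than DuckDB's actual tokenizer.

```python
import re

# Hypothetical subset of an English stopword list, for illustration only.
STOPWORDS = frozenset({"what", "can", "which", "never", "by"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on non-letters, and drop any stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


# With stopwords applied, the collocation collapses to a single term.
with_stopwords = tokenize("never experience", STOPWORDS)
bare_query = tokenize("experience", STOPWORDS)

# Swedish "by" (village) is a homograph of an English stopword, so the
# query ends up empty and cannot match anything.
swedish_query = tokenize("by", STOPWORDS)

# Without a stopword list, both query terms survive.
without_stopwords = tokenize("never experience")
```

Here `with_stopwords` and `bare_query` come out identical, and `swedish_query` is an empty list, which is why such queries return no hits.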
