It would be ideal for the indexer to output integer-valued document vectors of term frequencies. These could optionally be written to disk in a compressed format using https://github.com/lemire/FastPFOR, which would make it easy to experiment with different representation approaches. The indexer would also output collection-level term statistics.
This would allow the vectors to be converted quickly to weighted vectors (TF-IDF, BM25, etc.) or to signatures.
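As a rough illustration of the conversion step, here is a minimal sketch that turns integer term-frequency vectors into TF-IDF and BM25 weights. Everything in it is an assumption made for the example: the sparse `{term_id: tf}` layout, the function names, the toy data, and the particular TF-IDF and BM25 variants (with `k1=1.2`, `b=0.75`). It does not reflect how the indexer stores vectors and does not touch FastPFOR.

```python
import math

# Hypothetical toy data: each document is a sparse map of term id -> term frequency.
docs = [
    {0: 2, 1: 1},
    {1: 3, 2: 1},
    {0: 1, 2: 2, 3: 1},
]


def document_frequencies(docs):
    """Collection statistic: number of documents that contain each term."""
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return df


def tfidf(doc, df, n_docs):
    """One common TF-IDF variant: raw tf times log(N / df)."""
    return {t: tf * math.log(n_docs / df[t]) for t, tf in doc.items()}


def bm25(doc, df, n_docs, avg_len, k1=1.2, b=0.75):
    """One common BM25 variant, with document length taken as the sum of tfs."""
    doc_len = sum(doc.values())
    norm = k1 * (1 - b + b * doc_len / avg_len)
    weights = {}
    for t, tf in doc.items():
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        weights[t] = idf * tf * (k1 + 1) / (tf + norm)
    return weights


df = document_frequencies(docs)
n_docs = len(docs)
avg_len = sum(sum(d.values()) for d in docs) / n_docs

tfidf_vectors = [tfidf(d, df, n_docs) for d in docs]
bm25_vectors = [bm25(d, df, n_docs, avg_len) for d in docs]
```

Because the weighting only needs the raw frequencies and the collection statistics, swapping schemes means re-running a cheap pass like this one rather than re-indexing.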
It would be great to do the same with bigrams, and to invent or reuse a weighting scheme that incorporates pointwise mutual information (http://nlpwp.org/book/chap-ngrams.xhtml#chap-ngrams-bigrams).
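For reference, a minimal sketch of pointwise mutual information over bigram counts, using the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))) with probabilities estimated from raw counts. The function name, the whitespace tokenization, and the sample sentence are hypothetical; how PMI would be folded into a weighting scheme is left open here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(x, y): PMI} for every adjacent token pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)       # number of unigram positions
    n_bi = len(tokens) - 1    # number of bigram positions
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical sample text, tokenized on whitespace for illustration only.
tokens = "new york is a new city in new york".split()
scores = bigram_pmi(tokens)
```

PMI estimated this way is known to favour rare pairs, so a real scheme would probably need a minimum count threshold or some smoothing.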