De-duplicating of nodes via similarity & community detection #119

Open
12 tasks
cassinius opened this issue Apr 28, 2020 · 0 comments

Comments

@cassinius
Collaborator

Use something along the lines of the ingredient de-duplication pipeline demonstrated in a Neo4j tutorial that uses the BBC Good Food ingredients.

Pipeline

  • download the goodfood dataset (scraping required!) - or something equivalent
  • normal NLP preprocessing steps
    • character encodings
    • tokenization
    • stemming (plurals)
    • stopwords / length etc.
  • connect tokens to an ingredient
    • ingredient: cherry tomato => parts: cherry and tomato
  • Use string distance to create similarity edges
    • sorensenDiceSimilarity ??
  • Use phonetic similarity to create similarity edges
    • doubleMetaphone ??
  • Run a community detection algorithm (like Louvain) to cluster similar ingredients together
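
A minimal Python sketch of the steps above, to make the idea concrete. It is not the tutorial's implementation and not an existing API of this project: the function names (`tokenize`, `dice`, `cluster`), the stopword list, the crude plural stemmer and the `0.8` threshold are all assumptions made for illustration. Two deliberate simplifications: phonetic similarity (doubleMetaphone) is left out, and connected components over the similarity edges stand in for a real community detection algorithm such as Louvain.

```python
import re
from itertools import combinations

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"of", "the", "and", "fresh"}


def tokenize(name):
    """Lowercase, split on non-letters, drop stopwords and short tokens,
    and apply a very crude plural stemmer."""
    tokens = []
    for tok in re.findall(r"[a-z]+", name.lower()):
        if tok in STOPWORDS or len(tok) < 3:
            continue
        if tok.endswith("ies"):
            tok = tok[:-3] + "y"      # cherries -> cherry
        elif tok.endswith("oes"):
            tok = tok[:-2]            # tomatoes -> tomato
        elif tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]            # oils -> oil
        tokens.append(tok)
    return tokens


def bigrams(text):
    """Set of character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a, b):
    """Sorensen-Dice coefficient over character-bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def cluster(names, threshold=0.8):
    """Link names whose normalized forms are similar enough, then return
    the connected components (a simple stand-in for Louvain).
    Assumes the input names are unique."""
    norm = {n: " ".join(tokenize(n)) for n in names}
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dice(norm[a], norm[b]) >= threshold:  # similarity edge
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


if __name__ == "__main__":
    ingredients = ["cherry tomato", "Cherry Tomatoes", "chery tomato",
                   "olive oil", "Olive oils", "butter"]
    for group in cluster(ingredients):
        print(group)
```

On a real dataset, all-pairs comparison is quadratic in the number of ingredients, so the token-to-ingredient links from the pipeline would be a natural way to restrict comparisons to ingredients that share at least one token.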