
Crawler service

sreenidhithallam edited this page Dec 14, 2016 · 1 revision


  1. Input: URL, Term, Intent, Domain

    (This JSON is the output of the function generator)
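The input above might look like the following. The field names come from the spec; the concrete values and key casing are illustrative assumptions, not from the source.

```python
import json

# Hypothetical example of the crawler-service input JSON.
# Field names (URL, Term, Intent, Domain) are from the spec;
# the values and lowercase keys are assumptions.
crawler_input = {
    "url": "https://example.com/article",
    "term": "machine learning",
    "intent": "learn",
    "domain": "computer-science",
}

print(json.dumps(crawler_input))
```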

  2. Send a GET request to the URL to fetch the data (may be done with a library)

    Note: internet access is required; handle timeouts; check and update the status recorded for the URL
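A minimal sketch of this fetch step, using only the standard library. The status strings and the timeout default are assumptions; a real service would record the status back to its URL store.

```python
import urllib.request

def fetch(url, timeout=10):
    """Fetch a URL and return (status, body).

    Status strings ("ok"/"error") are assumptions; the caller would
    use them to update the status recorded for this URL.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok", resp.read().decode("utf-8", errors="replace")
    except Exception:
        # Covers no internet access, timeouts, and malformed URLs.
        return "error", None

status, body = fetch("not-a-valid-url")
print(status)  # → error
```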

  3. Filter out unwanted content from the fetched data.

    Stopwords (the stopwords list should be customisable)
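The stopword filtering could be sketched as below. The default stopword set is a small illustrative assumption; the spec only requires that it be customisable, which the `stopwords` parameter provides.

```python
import re

# Illustrative default list; the spec requires this to be customisable.
DEFAULT_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to"}

def remove_stopwords(text, stopwords=DEFAULT_STOPWORDS):
    """Tokenize text and drop stopwords (custom lists via `stopwords`)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords("The crawler is a service"))  # → ['crawler', 'service']
```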

  4. Search for the words we are interested in and compute the term density for those terms

    (Matching can be configured to be case-insensitive)
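One plausible reading of "term density" is the fraction of tokens that match the term, computed case-insensitively. The definition below is an assumption (the spec does not define the formula), and it handles single-word terms only.

```python
import re

def term_density(text, terms):
    """Fraction of tokens matching each interested term, case-insensitively.

    Assumed definition: count(term) / total tokens. Single-word terms only;
    multi-word terms would need phrase matching instead.
    """
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {t: tokens.count(t.lower()) / total for t in terms}

d = term_density("Crawler crawls the web; the Crawler indexes pages", ["crawler"])
print(d)  # → {'crawler': 0.25}
```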

  5. Search for synonyms (this improves the accuracy of crawling)
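The synonym step could expand the interested-term set before matching. The synonym table here is a hypothetical stand-in; a real service might source synonyms from WordNet or a thesaurus API.

```python
# Hypothetical synonym table; the source does not say where synonyms come from.
SYNONYMS = {"car": {"automobile", "vehicle"}}

def expand_terms(terms, synonyms=SYNONYMS):
    """Return the term set plus any known synonyms for each term."""
    expanded = set(terms)
    for t in terms:
        expanded |= synonyms.get(t, set())
    return expanded

print(expand_terms(["car"]))  # car plus its two synonyms
```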

  6. Index the URL in neo4j

    • Create a node for the web document

    • Create relationships to the concept-graph terms (ensure the concept term is related to the domain; update the relationship property with the term density)

    • If any terms are found that are not related to the domain, attach them to the URL node instead
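The indexing step above might assemble parameterized Cypher like this. The node labels (`WebDocument`, `Concept`), relationship type (`MENTIONS`), and property names are assumptions; the branching between domain-related and unrelated terms follows the two bullets above.

```python
def index_queries(url, related_densities, unrelated_terms):
    """Build (cypher, params) pairs for indexing one URL in neo4j.

    Labels, relationship types, and property names are assumptions,
    not taken from the source.
    """
    queries = [("MERGE (d:WebDocument {url: $url})", {"url": url})]
    for term, density in related_densities.items():
        # Terms related to the domain link the document into the concept graph,
        # storing the term density on the relationship.
        queries.append((
            "MATCH (c:Concept {name: $term}) "
            "MERGE (d:WebDocument {url: $url})-[r:MENTIONS]->(c) "
            "SET r.density = $density",
            {"url": url, "term": term, "density": density},
        ))
    for term in unrelated_terms:
        # Terms not related to the domain are attached to the URL node itself.
        queries.append((
            "MERGE (d:WebDocument {url: $url}) "
            "SET d.unrelated_terms = coalesce(d.unrelated_terms, []) + $term",
            {"url": url, "term": term},
        ))
    return queries

# With the official neo4j Python driver these would be executed as:
#   with driver.session() as session:
#       for cypher, params in index_queries(...):
#           session.run(cypher, params)
```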

  7. Go to MongoDB and update the document with the terms found, the terms not found, and the terms found that are not related to the domain
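The MongoDB update might be built as below. The field names are assumptions (the spec only names the three categories of terms); the actual write would go through `pymongo`, shown in the comment rather than executed.

```python
def build_terms_update(found, not_found, unrelated):
    """Build the MongoDB update document for step 7.

    Field names are assumptions; the spec only names the three categories.
    """
    return {"$set": {
        "terms_found": found,
        "terms_not_found": not_found,
        "terms_found_unrelated_to_domain": unrelated,
    }}

# With pymongo this would be applied as (sketch, not executed here):
#   collection.update_one({"url": url},
#                         build_terms_update(found, not_found, unrelated))
```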
