Crawler service
-
Input: URL, Term, Intent, Domain
(This JSON is the output of the function generator.)
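
As a rough sketch, the input message could be parsed and validated like this. The lowercase key names (`url`, `term`, `intent`, `domain`) and the `CrawlJob` class are assumptions made for illustration; the real keys are whatever the function generator emits.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlJob:
    """One unit of work for the crawler (hypothetical shape)."""
    url: str
    term: str
    intent: str
    domain: str


REQUIRED_FIELDS = ("url", "term", "intent", "domain")  # assumed key names


def parse_job(raw: str) -> CrawlJob:
    """Parse the JSON message and fail early if a field is missing."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError("missing input fields: " + ", ".join(missing))
    return CrawlJob(*(data[key] for key in REQUIRED_FIELDS))
```

Rejecting incomplete messages at this point keeps the later steps from having to check for missing fields.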
-
Send a GET request to the URL to fetch the data (this may be done by a library).
Note: this requires internet access and a timeout; check and update the status of the URL.
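
A minimal sketch of the fetch step, using only the standard library's `urllib` in place of whichever HTTP library is eventually chosen. The `FetchResult` class and the status labels (`fetched`, `http_error`, `timeout`, `unreachable`, `invalid_url`) are hypothetical names, not part of any existing design.

```python
import socket
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    url: str
    status: str                      # assumed labels, see fetch_page below
    http_code: Optional[int] = None
    body: str = ""


def fetch_page(url: str, timeout: float = 10.0) -> FetchResult:
    """GET the URL and report a status instead of raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            body = resp.read().decode(charset, errors="replace")
            return FetchResult(url, "fetched", resp.status, body)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx code.
        return FetchResult(url, "http_error", exc.code)
    except urllib.error.URLError as exc:
        # No usable answer: DNS failure, refused connection, no internet.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return FetchResult(url, "timeout")
        return FetchResult(url, "unreachable")
    except (socket.timeout, TimeoutError):
        # The timeout expired while reading the response body.
        return FetchResult(url, "timeout")
    except ValueError:
        # urlopen rejects strings that are not valid URLs.
        return FetchResult(url, "invalid_url")
```

Returning a status rather than raising makes it straightforward to record the outcome for every URL, including the ones that could not be fetched.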
-
Filter out unwanted content from the fetched data.
Stop words (we should be able to customise the stop-word list).
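
The filtering step might look like the sketch below. The tiny default stop-word list, the regex-based tag stripping, and the `filter_content` name are all simplifying assumptions; a real crawler would likely use a proper HTML parser and a fuller list.

```python
import re

# Deliberately small default list; callers can pass their own.
DEFAULT_STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})

TAG_RE = re.compile(r"<[^>]+>")        # crude markup stripper
WORD_RE = re.compile(r"[A-Za-z0-9]+")  # ASCII letters and digits only


def filter_content(html, stopwords=DEFAULT_STOPWORDS):
    """Strip tags, split into words, and drop stop words (case-insensitively)."""
    text = TAG_RE.sub(" ", html)
    return [w for w in WORD_RE.findall(text) if w.lower() not in stopwords]
```

Customising the list is then a matter of passing a different set, for example `filter_content(page, DEFAULT_STOPWORDS | {"page"})`.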
-
Search for the terms we are interested in and compute the term density for each of them.
(Matching can be configured to ignore case.)
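
One way to compute term density is sketched below. It assumes density means occurrences of a term divided by the total number of tokens, which the notes above do not define, and it handles single-word terms only. The function name is hypothetical.

```python
from collections import Counter


def term_density(tokens, terms, case_sensitive=False):
    """Return {term: occurrences / total tokens} for each term of interest."""
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    total = len(tokens)
    densities = {}
    for term in terms:
        key = term if case_sensitive else term.lower()
        # Guard against empty documents to avoid dividing by zero.
        densities[term] = counts[key] / total if total else 0.0
    return densities
```

With the tokens `["Java", "is", "fun", "java"]`, the term `java` has density 0.5 when case is ignored and 0.25 when it is not.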
-
Search for synonyms (which improves the accuracy of crawling)
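
A sketch of how synonyms could feed into the counts: occurrences of a synonym are credited to the term it stands for. The synonym table here is a hand-written, hypothetical mapping; where synonyms actually come from (the concept graph, a thesaurus, something else) is left open.

```python
from collections import Counter


def count_with_synonyms(tokens, synonyms):
    """Count each term together with its synonyms, ignoring case.

    `synonyms` maps a term to an iterable of alternative spellings or words.
    """
    counts = Counter(t.lower() for t in tokens)
    totals = {}
    for term, alternatives in synonyms.items():
        # A set avoids double counting if a synonym repeats the term itself.
        forms = {term.lower()} | {alt.lower() for alt in alternatives}
        totals[term] = sum(counts[form] for form in forms)
    return totals
```

For example, with `{"function": ["method", "procedure"]}` a page that only ever says "method" still registers hits for the term `function`.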
-
Index the URL in Neo4j:
-
Create a node for the web document.
-
Create relationships with the concept graph terms (ensure that each concept term is related to the domain, and update the relationship property with the term density).
-
If any terms are found that are not related to the domain, create a relationship between them and the URL.
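
The indexing steps above could translate into Cypher roughly as follows. This sketch only builds the statements and their parameters; it does not talk to a database. The labels (`WebDocument`, `Concept`, `Domain`, `Term`), the relationship types (`BELONGS_TO`, `FOUND_IN`), and the `density` property are hypothetical and would have to match the actual concept graph schema.

```python
def build_index_statements(url, domain, densities, unrelated_terms):
    """Return a list of (cypher, params) pairs for one crawled URL."""
    # 1. Node for the web document; MERGE avoids duplicates on re-crawl.
    statements = [("MERGE (d:WebDocument {url: $url})", {"url": url})]

    # 2. Link concept terms, but only those already related to the domain.
    for term, density in densities.items():
        statements.append((
            "MATCH (c:Concept {name: $term})-[:BELONGS_TO]->(:Domain {name: $domain}) "
            "MERGE (d:WebDocument {url: $url}) "
            "MERGE (c)-[r:FOUND_IN]->(d) "
            "SET r.density = $density",
            {"term": term, "domain": domain, "url": url, "density": density},
        ))

    # 3. Terms outside the domain are still attached to the URL.
    for term in unrelated_terms:
        statements.append((
            "MERGE (d:WebDocument {url: $url}) "
            "MERGE (t:Term {name: $term}) "
            "MERGE (t)-[:FOUND_IN]->(d)",
            {"term": term, "url": url},
        ))
    return statements
```

Each pair can then be executed through a Neo4j driver session, passing the dictionary as query parameters. Keeping the values in parameters rather than formatting them into the query string avoids quoting problems with arbitrary URLs and terms.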
-
Go to MongoDB and update the document with the terms found, the terms not found, and the terms found that are not related to the domain.
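
A sketch of the MongoDB update, again building only the filter and update documents. The field names (`terms_found`, `terms_not_found`, `terms_unrelated_to_domain`) and the choice of `url` as the lookup key are assumptions, as is the way the three groups are derived from the inputs.

```python
def build_mongo_update(url, searched_terms, found_terms, domain_terms):
    """Return (filter, update) documents describing the crawl outcome."""
    searched = set(searched_terms)   # terms we looked for
    found = set(found_terms)         # terms that actually occurred
    in_domain = set(domain_terms)    # concept terms related to the domain

    update = {
        "$set": {
            "terms_found": sorted(found),
            "terms_not_found": sorted(searched - found),
            "terms_unrelated_to_domain": sorted(found - in_domain),
        }
    }
    return {"url": url}, update
```

With PyMongo, the result can be passed to `collection.update_one(filter, update, upsert=True)`, where `collection` is whichever collection holds the crawl documents. Sorting the sets keeps the stored lists stable between runs.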