This is a Python API library for detecting stop phrases.
NLP (Natural Language Processing) techniques are very helpful in applications such as sentiment analysis and chatbots. Developing NLP models requires a large, clean corpus for learning the relations between words. One of the challenges in achieving a clean corpus is stop phrases. Stop phrases usually do not carry much information about the text, so they must be identified and removed from it.
The aim of this repo is to provide a structure for processing HTML pages (which are a valuable source of text for all languages), finding the possible combinations of words up to a certain length, and using human input to identify stop phrases.
- Make sure you have docker, docker-compose and python 3.8 or above installed.
- Create a .env file with the desired values, based on the .env.example file.
- After cloning the project, go to the project directory and run the command below.
docker-compose -f docker-compose-dev.yml build
- After the images are built successfully, run the command below to start the project.
docker-compose -f docker-compose-dev.yml up -d
- We need to create a database and a collection in mongo in order to use the API. First, open a bash shell in the mongo container.
docker exec -it db bash
- Authenticate in the mongo container.
mongo -u ${MONGO_INITDB_ROOT_USERNAME} -p ${MONGO_INITDB_ROOT_PASSWORD} --authenticationDatabase admin
- Create the database and collection based on the MONGO_PHRASE_DB and MONGO_PHRASE_COL names you provided in step 2.
// Database creation
use phrasedb
// Collection creation
db.createCollection("common_phrase");
- Now you're ready to use the API.
This API has three endpoints.
Here you can pass HTML text in the request body to process it.
The process stages are:
- Fetching all H1-H6 and p tags
- Cleaning the text
- Finding bags of words (from 1 to 5 words each)
- Counting the number of occurrences in the text
- Integrating the results into the database (updating the count field of the phrase if it already exists, otherwise inserting a new record)
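The first four stages can be sketched with the Python standard library alone. This is only an illustration under stated assumptions, not the code used in this repo: the class `TextTagParser`, the functions `clean` and `count_phrases`, and the cleaning rule (lowercasing and keeping word characters) are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TEXT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class TextTagParser(HTMLParser):
    """Collects the text found inside H1-H6 and p tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a tag of interest
        self.blocks = []  # one string per heading or paragraph

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.blocks.append("")

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.blocks[-1] += data


def clean(text):
    """Lowercase the text and split it into words (an assumed cleaning rule)."""
    return re.findall(r"\w+", text.lower())


def count_phrases(html, max_words=5):
    """Count every phrase of 1..max_words consecutive words within each block."""
    parser = TextTagParser()
    parser.feed(html)
    parser.close()
    counts = Counter()
    for block in parser.blocks:
        words = clean(block)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts
```

Calling `count_phrases("<h1>Read more</h1><p>Read more about cats</p>")` returns a `Counter` in which the phrase `read more` has a count of 2, since it appears once in the heading and once in the paragraph.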
Updates statuses.
Changing the status of a phrase to either stop or highlight.
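The count integration and the status change can be modelled in memory as follows. This is a rough sketch, not the repo's implementation: a plain dictionary stands in for the mongo collection, and the function names `integrate_counts` and `set_status` as well as the `status` field name are assumptions made for illustration.

```python
ALLOWED_STATUSES = {"stop", "highlight"}


def integrate_counts(store, counts):
    """Add new counts to existing phrases, inserting records that are missing."""
    for phrase, count in counts.items():
        # New phrases start with a zero count and no status (assumed layout).
        record = store.setdefault(phrase, {"count": 0, "status": None})
        record["count"] += count


def set_status(store, phrase, status):
    """Mark a known phrase as either 'stop' or 'highlight'."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if phrase not in store:
        raise KeyError(phrase)
    store[phrase]["status"] = status
```

Rejecting any status other than the two named values keeps the stored data consistent with the filters used when fetching phrases.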
Fetching data from the database based on status. Here you can fetch phrases for four different status situations:
- Stop phrases
- Highlight phrases
- Phrases that have a status (either stop or highlight)
- Phrases whose statuses are not yet determined
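The four situations map naturally onto MongoDB filter documents. The sketch below assumes a `status` field that holds `"stop"`, `"highlight"`, or nothing; the function name `status_filter` and the situation keys are hypothetical and may not match the repo's actual schema or parameters.

```python
def status_filter(situation):
    """Return a MongoDB-style filter document for one of the four situations."""
    filters = {
        "stop": {"status": "stop"},
        "highlight": {"status": "highlight"},
        # Any phrase that has already been given a status.
        "has_status": {"status": {"$in": ["stop", "highlight"]}},
        # In MongoDB, a null filter matches both null and missing fields.
        "no_status": {"status": None},
    }
    try:
        return filters[situation]
    except KeyError:
        raise ValueError(f"unknown situation: {situation!r}") from None
```

The returned dictionary could be passed to a pymongo `Collection.find` call to fetch the matching phrases.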
- API Base URL
127.0.0.1:8000
- API Swagger UI
127.0.0.1:8000/docs
For further details and how to make requests to each endpoint, refer to the Swagger UI of the API.