A system able to estimate the relevance of arbitrary content towards the learned categories.
The system is able to score an unseen document by its content (and potentially other attributes) based on its contextual similarity to the documents it has already seen. It also contains a score-tuning mechanism that enables a search engine to use the documents' relevance scores directly, filtering relevant from irrelevant results with a single fixed threshold while easily reaching close-to-optimal performance.
The system will be integrated into RH content search services using the DCP content indexing tool.
It might also be further extended to provide a smart content-based recommender system for web portals with a sufficient amount of training documents (regardless of categorization).
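As a minimal illustration of the threshold mechanism mentioned above (the shape of the scoring dict mirrors the API response below; the threshold value is an arbitrary assumption):

# keep only the categories whose tuned relevance score passes
# a single fixed threshold; 0.5 is an arbitrary illustrative value
THRESHOLD = 0.5

def filter_relevant(scoring, threshold=THRESHOLD):
    return {category: score for category, score in scoring.items()
            if score >= threshold}

print(filter_relevant({"softwarecollections": 0.72, "brms": 0.03}))
# {'softwarecollections': 0.72}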
The project currently contains two main components:
- Deployable search service providing an intuitive REST API for scoring arbitrary content towards the trained categories (see the example client call after this list):
Request:
{
    "sys_meta": false,
    "doc": {
        "id": "DOC_123",
        "title": "One smart doc",
        "content": "This is one dull piece of text."
    }
}
Response:
{
    "scoring": {
        "softwarecollections": 0.000060932611962771777,
        "brms": 0.00080337037910394038,
        "bpmsuite": 0.00026477703963384558,
        "...": "..."
    }
}
- Content downloader providing tools for convenient bulk download of the indexed content (of DCP and access.redhat), categorized towards the Red Hat products.
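For illustration, a call to the scoring API might look like the following sketch; the host and the endpoint path here are assumptions, not the documented route of the deployed service:

import requests

# the host and endpoint path are assumptions; check the deployed instance
payload = {
    "sys_meta": False,
    "doc": {
        "id": "DOC_123",
        "title": "One smart doc",
        "content": "This is one dull piece of text."
    }
}
response = requests.post("http://localhost:8000/score", json=payload)
print(response.json()["scoring"])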
In addition to that, the project contains an analytical part that has driven the selection of the classifier and the configuration of the system parameters.
The architecture and the technologies used are briefly introduced in the overview presentation and in the slightly more technical presentation.
If you're interested in the technical background of the project, see the technical documentation of the system.
Further evaluations of the current system against some trickier metrics are summarized in the most recent analysis.
The overall progress and objectives of the project are tracked here.
The latest deployable version of the project, in the form of a web service, exposes a REST API able to score the relevance of a provided piece of text towards the categories of the training content.
The service is based on a standalone Django application.
The data download and training processes are expected to be run locally, whereas the deployable Django application can run remotely (in our case on OpenShift), using the service image created by the local training.
Before running the download and training process, make sure to set up a compatible, isolated environment using Conda. If you have not done so yet, download Miniconda or install it using your system's package manager.
After that, from the root of the repository, run:
# create virtual env
conda create --name classifier python=2.7
# activate the new environment
source activate classifier
# update pip to the latest version inside the activated env
pip install --upgrade pip
# install requirements into the newly created env
pip install -r requirements.txt
Within this prepared env, you can run the Python scripts from the following sections.
If you want to download fresh content (for both RHD products, or other sources), proceed to the data section.
A directory of downloaded content is then passed to training procedure (see below).
See the training section for how to create a new image of a trained Service instance from the selected directory of content.
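As a rough, hypothetical sketch of that workflow (the class name, import path, and method names below are all assumptions; the actual entry points are described in the training section):

# hypothetical names throughout; see the training section for the
# actual entry points of this repository
from search_service import ScoringService  # assumed import

service = ScoringService()
service.train(content_dir="downloaded_content")  # directory from the data section
service.persist(image_dir="service_image")       # image consumed by the deployed app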
To make a deployed service score the content, the service needs to be pointed to the image of its trained instance. This can be done by either:
- passing the param image_dir={relative_path} to the service class constructor, where {relative_path} is relative to the service instance directory, or
- setting service.service_image_dir to an absolute path.
To make this change effective in the deployed instance, make the corresponding modification in Django's views.py.
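For example, the relevant part of views.py might look like the sketch below; the class name and import path are assumptions, while the image_dir param and the service_image_dir attribute are the two options listed above:

# assumed import path and class name; only image_dir and
# service_image_dir come from the options listed above
from search_service import ScoringService

# option 1: relative path, resolved against the service instance directory
service = ScoringService(image_dir="service_image")

# option 2: absolute path set on the instance directly
# service = ScoringService()
# service.service_image_dir = "/absolute/path/to/service_image"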
A pre-trained version of the image (created 14.6.2018) is available here.
See the readme of the pretrained images for a description and for other pretrained images of older gensim versions.