Contents
- Question Detector
  - How can I create a QuestionDetector?
  - So how does the QuestionDetector work?
  - What is the difference between IssueQuestion vs EmailQuestion vs IssueCommentQuestion?
  - What is this context attribute I'm seeing in the Question objects?
  - Can I use the QuestionDetector for my projects that aren't issue/email/comment related?
- Fetchers
- Search Engines
- Answer Detector
- How can I add FAQs?
- How does Donkeybot handle the text processing needed?
Almost everything in the sections of this page has a corresponding notebook. See examples for a more hands-on guide by looking at the notebooks. The functionality explained here is also what runs 'under the hood' in the scripts which use Donkeybot. So instead of explaining those scripts, I chose a more straightforward approach: looking at the code through simple examples.
## Question Detector

### How can I create a QuestionDetector?

Donkeybot's QuestionDetector must be created with one of the following types: "email", "issue" or "comment". This is so that the QuestionDetector creates the correct type of Question objects, be it an EmailQuestion, IssueQuestion or IssueCommentQuestion.
Let's create one for IssueQuestions!
from bot.question.detector import QuestionDetector
detector = QuestionDetector("issue")
text = """
What is this 'text', you ask?
Well, it's a monologue I'm having... can it help with something you still ask?
In testing the QuestionDetector of course!
Did that answer all your questions?
I sure hope so...
"""
### So how does the QuestionDetector work?

Simply use the .detect() method! The results are going to be a list of Question objects; in this specific example, IssueQuestion objects.
results = detector.detect(text)
results
[<bot.question.issues.IssueQuestion at 0x2223974d348>,
<bot.question.issues.IssueQuestion at 0x2223974d448>,
<bot.question.issues.IssueQuestion at 0x2223974d908>]
And all 3 questions from the sample text above have been identified!
[(question.question) for question in results]
[ "What is this 'text', you ask?",
"Did that answer all your questions?",
"can it help with something you still ask?"]
### What is the difference between IssueQuestion vs EmailQuestion vs IssueCommentQuestion?

The only difference is their origin and how they get their context attributes. Look at "What is this context attribute I'm seeing in the Question objects?" below for more details.
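Since only the constructor type changes, creating detectors for the other two origins looks the same:

```python
# same API, one detector per origin type
email_detector = QuestionDetector("email")      # detects EmailQuestion objects
comment_detector = QuestionDetector("comment")  # detects IssueCommentQuestion objects
```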
results[1].__dict__
{'id': '8621376d766242ab9fd740a3698f0dd2',
'question': 'Did that answer all your questions?',
'start': 188,
'end': 223,
'origin': 'issue',
'context': None}
### What is this context attribute I'm seeing in the Question objects?

Well, that's what the AnswerDetector uses to try and answer each question! To be more specific:
- When a new User Question is asked and is very similar or identical to the questions archived with the .detect() method,
- then the context of those archived questions is used as context for the new User Question,
- and Donkeybot's AnswerDetector tries to find suitable answers in it.
- For IssueQuestion objects, the context is any comments that are part of the same GitHub issue.
- For IssueCommentQuestion objects, the context is the comments that follow the specific comment where the Question was detected.
- For EmailQuestion objects, the context is the bodies of the reply emails to the email where the Question was detected.
Each Question object has its own find_context_from_table() method that sets the attribute by following the logic explained above: basically, it goes into the corresponding table in our Data Storage and SELECTs the context we want.
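As a rough sketch of how that looks in practice (the exact signature of find_context_from_table() may differ; check the Source Code):

```python
from bot.database.sqlite import Database

data_storage = Database("data_storage.db")
# each detected Question SELECTs its own context from the relevant table
for question in results:
    question.find_context_from_table(db=data_storage)
data_storage.close_connection()
```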
### Can I use the QuestionDetector for my projects that aren't issue/email/comment related?

Yes! But if you aren't following the issue/email/comment logic Donkeybot follows at the time of writing (end of GSoC '20), then Donkeybot needs to be expanded to have a Question superclass with a set_context() method, so that you can simply set the context without going into some dependent Data Storage. If you want to see this in Donkeybot, open an issue and suggest it. I'll see that you've been reading the documentation and that this functionality is needed :D
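To make the suggestion concrete, here is a minimal, hypothetical sketch of what such a method could look like (this is not in the current codebase):

```python
# hypothetical Question superclass extension
class Question:
    def set_context(self, context: str):
        """Set the context directly, no Data Storage required."""
        self.context = context
```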
## Fetchers

The scripts fetch_issues.py and fetch_rucio_docs.py do everything explained here. See scripts for the source code, and run them with the '-h' option for info on the arguments they take, e.g.:
(virt)$ python scripts/fetch_rucio_docs.py -h
Creating a fetcher is simple: use the FetcherFactory and just pick the fetcher type:
- "Issue" for a GitHub IssueFetcher
- "Rucio Documentation" for a RucioDocsFetcher

What about the EmailFetcher? Currently, as explained in How It Works, emails are fetched by different scripts run at CERN and not through Donkeybot.
from bot.fetcher.factory import FetcherFactory
Let's create a GitHub IssueFetcher.
issues_fetcher = FetcherFactory.get_fetcher("Issue")
issues_fetcher
<bot.fetcher.issues.IssueFetcher at 0x1b75c30b6c8>
You need 4 things:
- The repository whose issues we are fetching.
- A GitHub API token. To generate one, visit Personal Access Tokens and follow Creating a Personal Access Token.
- The maximum number of pages the fetcher will look through to fetch issues (default is 201).
- A couple of pandas DataFrames: one to hold the issues data and one for the issue comments data.
import pandas as pd
repository = 'rucio/rucio' # but you can use any in the format user/repo
token = "<YOUR_TOKEN>"
max_pages = 3
(issues_df, comments_df) = issues_fetcher.fetch(repo=repository, api_token=token, max_pages=max_pages)
The resulting DataFrames will look like this:
issues_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 issue_id 26 non-null object
1 title 26 non-null object
2 state 26 non-null object
3 creator 26 non-null object
4 created_at 26 non-null object
5 comments 26 non-null object
6 body 26 non-null object
dtypes: object(7)
memory usage: 1.5+ KB
comments_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 issue_id 16 non-null object
1 comment_id 16 non-null object
2 creator 16 non-null object
3 created_at 16 non-null object
4 body 16 non-null object
dtypes: object(5)
memory usage: 768.0+ bytes
It's the same process we followed with the IssueFetcher, only now the factory will create a RucioDocsFetcher.
from bot.fetcher.factory import FetcherFactory
docs_fetcher = FetcherFactory.get_fetcher("Rucio Documentation")
docs_fetcher
<bot.fetcher.docs.RucioDocsFetcher at 0x1b75c43bf48>
token = "<YOUR_TOKEN>"
docs_df = docs_fetcher.fetch(api_token=token)
To save the fetched data we need to:

Step 1. Open a connection to our Data Storage.
from bot.database.sqlite import Database
# open the connection
db_name = 'data_storage'
data_storage = Database(f"{db_name}.db")
Step 2. Save the fetched issues and comments data.
# save the fetched data
issues_fetcher.save(
db=data_storage,
issues_table_name='issues',
comments_table_name='issue_comments',
)
Step 2.1. Alternatively, save the documentation data.
# save the fetched data
docs_fetcher.save(db=data_storage, docs_table_name='docs')
Step 3. Finally, close the connection.
# close the connection
data_storage.close_connection()
Alternative: if you don't want to use Donkeybot's Data Storage, you can use the save_with_pickle() and load_with_pickle() methods to achieve the same results.
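For example (a sketch; the file name is an arbitrary choice, see the Source Code for the exact signatures):

```python
# persist the fetched data without a Database connection
issues_fetcher.save_with_pickle("issues_data.pickle")
# ...and load it back later
issues_fetcher.load_with_pickle("issues_data.pickle")
```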
## Search Engines

You can use the script query.py to query the search engines, and create_se_indexes.py is what creates the Search Engine indexes for Donkeybot. See scripts for the source code, and run them with the '-h' option for info on the arguments they take, e.g.:
(virt)$ python scripts/query.py -h
There are 3 types of Search Engines in Donkeybot at the moment:
- SearchEngine, which is used to query general documentation (in our case the Rucio Documentation).
- QuestionSearchEngine, which is used to query Question objects saved in Data Storage.
- FAQSearchEngine, which is used to query FAQs saved in Data Storage.
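The other two engines are created the same way; a sketch (the module paths here are my assumption, only bot.searcher.question is shown on this page):

```python
# hedged sketch: creating the other two engine types
from bot.searcher.base import SearchEngine    # assumed module path
from bot.searcher.faq import FAQSearchEngine  # assumed module path

docs_se = SearchEngine()
faq_se = FAQSearchEngine()
```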
Let's create a QuestionSearchEngine.
from bot.searcher.question import QuestionSearchEngine
qse = QuestionSearchEngine()
qse
<bot.searcher.question.QuestionSearchEngine at 0x2a2cf58a348>
The QuestionSearchEngine is not yet usable! We need 3 things:

Step 1. Have a pandas DataFrame with a question column that holds the information we will index. The document id for the QuestionSearchEngine will be a column named question_id in the corpus.
sidenote: A nice addition to Donkeybot would be the ability to change the names of these columns and have something more general. But this is only needed for the sqlite implementation; if in the future we move to Elasticsearch, there will be no need.

Step 2. Have an open connection to the Data Storage.

Step 3. create_index() or load_index(), which is the document term matrix of the questions.
# Step 1
import pandas as pd
# example DataFrame
corpus_df = pd.DataFrame({"question_id": [0,1,2,3],
"question":["What happened in GSoC 2020 ?",
"How can I create an index ?",
"How can I load an index ?",
"Why are there so many questions in this example?"],
"answer":["Donkeybot was created!",
"With the .create_index() method!",
"With the .load_index() method!",
"Because BM25 need enough data to create good tf-df vectors :D"]})
corpus_df
# Step 2
from bot.database.sqlite import Database
data_storage = Database('your_data_storage.db')
# Step 3 create the index!
qse.create_index(
corpus=corpus_df, db=data_storage, table_name="corpus_doc_term_matrix"
)
qse.index
data_storage.close_connection()
Now the QuestionSearchEngine is ready!
Let's try and query the QuestionSearchEngine we just created above.
query = "Anything cool that happened in this year's GSoC?" # whatever you want to ask
top_n = 1 # number of retrieved documents
And just run the .search() method.
qse.search(query, top_n)
This is pretty much the logic of the FAQ table which holds Question and Answer pairs.
## Answer Detector

Creating Donkeybot's AnswerDetector is very simple, just call the constructor!
from bot.answer.detector import AnswerDetector
answer_detector = AnswerDetector(model='distilbert-base-cased-distilled-squad',
extended_answer_size=30,
handle_impossible_answer=True,
max_answer_len=20,
max_question_len=20,
max_seq_len=256,
num_answers_to_predict=3,
doc_stride=128,
device=0)
What do all these parameters mean? Well, if you want to go deeper you can always look at the Source Code. The important ones for now are:
- model: name of the transformer model used for QuestionAnswering.
- num_answers_to_predict: number of answers predicted for each document the AnswerDetector is given. Remember, these documents are the ones retrieved by each Search Engine, so many answers are predicted before the top_k are returned.
To use it:

Step 1. Have a question.
Step 2. Have some documents in which the answer might reside.
Step 3. Make sure those documents are in a pandas DataFrame and the context used for answer detection is under the "context" column.

As of right now there is no option to simply use the AnswerDetector with plain strings. For Donkeybot, which uses different data sources, we decided to utilize pandas DataFrames. Donkeybot can always be expanded if the functionality is required.
import pandas as pd
question = "What is the aim of Donkeybot?"
documents = pd.DataFrame({
"context" : ["""
The aim of the Donkeybot project under GSoC 2020 is to use Native Language Processing (NLP)
to develop an intelligent bot prototype able to provide satisfying answers to Rucio users
and handle support requests up to a certain level of complexity,
forwarding only the remaining ones to the experts.
""",
"""
Different levels of expert support are available for users in case of problems.
When satisfying answers are not found at lower support levels, a request from a user or a group
of users can be escalated to the Rucio support. Due to the vast amount of support requests,
methods to assist the support team in answering these requests are needed.
"""],
"col_2" : ["first_doc", "second_doc"],
"col_3" : ["other", "data"]
})
answers = answer_detector.predict(question, documents, top_k=2)
So asking "What is the aim of Donkeybot?", providing the above documents and asking for 2 answers gives us:
print(question)
[(f"answer {i+1}: {answer.answer} | confidence : {answer.confidence}") for i,answer in enumerate(answers)]
What is the aim of Donkeybot?
['answer 1: assist the support team | confidence : 0.44691182870541724',
'answer 2: to use Native Language Processing (NLP) | confidence : 0.24011110691572668']
answers[1].__dict__
{'id': 'c3e44f0799b645c9b690f98e4b5e07ea',
'user_question': 'What is the aim of Donkeybot?',
'user_question_id': '2fc28e8f32',
'answer': 'to use Native Language Processing (NLP)',
'start': 69,
'end': 125,
'confidence': 0.24011110691572668,
'extended_answer': 'ot project under GSoC 2020 is to use Native Language Processing (NLP) \n to develop an intelligent bot',
'extended_start': 39,
'extended_end': 155,
'model': 'distilbert-base-cased-distilled-squad',
'origin': 'questions',
'created_at': '2020-08-26 18:08:08+00:00',
'label': None,
'metadata': {'col_2': 'first_doc', 'col_3': 'other'}}
See How it Works, where we cover the same information and explain it in more detail. Basically, the QAInterface under Donkeybot's brain.py glues together all the SearchEngines and the AnswerDetector. It is the interface used in the ask_donkeybot.py script. Take a look at the Source Code for more information.
Given that you have correctly created:
- an AnswerDetector,
- a SearchEngine,
- a QuestionSearchEngine,
- a FAQSearchEngine,

then simply load the interface:
from bot.brain import QAInterface
# load interface
qa_interface = QAInterface(
detector=answer_detector,
question_engine=question_se,
faq_engine=faq_se,
docs_engine=docs_se,
)
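Then query it for answers. A hedged usage sketch, assuming the get_answers() method that ask_donkeybot.py relies on (double-check the name and signature in the Source Code):

```python
# ask Donkeybot's brain a question; the top_k answers come back
answers = qa_interface.get_answers("What is the aim of Donkeybot?", top_k=3)
for answer in answers:
    print(answer.answer, answer.confidence)
```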
Can you use the AnswerDetector for your own projects? Yes, but it will probably require some tweaking, and if you aren't using Donkeybot to set up and curate your data it might not be worth it. Simply look under the hood and use Transformer pipelines for your needs.
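For example, a minimal standalone question-answering pipeline with the Hugging Face transformers library, using the same model as Donkeybot's default AnswerDetector:

```python
from transformers import pipeline

# question-answering pipeline with Donkeybot's default model
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(question="What is the aim of Donkeybot?",
            context="The aim of the Donkeybot project is to provide satisfying "
                    "answers to Rucio users and handle support requests.")
print(result["answer"], result["score"])
```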
## How can I add FAQs?

The easiest way to do this is to use the very simple GUI Donkeybot provides. All you need to remember is: always re-index the FAQ table after adding new FAQs. Otherwise, the FAQSearchEngine won't see them.
(Screenshot: the GUI's Main Window)
You'll see that the window's logic follows the 2-step process for adding any new data that the Search Engines need to query:

Step 1. Insert the new FAQ.
Step 2. Re-index the FAQ table with the new data.

Just make sure that the Database and the FAQ table are the same in both steps. Donkeybot will remind you to re-index in case you forget what the docs suggest 😊
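If you'd rather do the same two steps programmatically, here is a sketch using the create_index() API shown earlier (the 'faq' table name, the module path and the shared create_index() signature are my assumptions):

```python
import sqlite3
import pandas as pd
from bot.database.sqlite import Database
from bot.searcher.faq import FAQSearchEngine  # assumed module path

# Step 1 equivalent: load the FAQ rows (after inserting the new FAQs)
conn = sqlite3.connect("data_storage.db")
faq_df = pd.read_sql_query("SELECT * FROM faq", conn)  # table name assumed
conn.close()

# Step 2 equivalent: re-index so the FAQSearchEngine sees the new FAQs
faq_se = FAQSearchEngine()
data_storage = Database("data_storage.db")
faq_se.create_index(corpus=faq_df, db=data_storage, table_name="faq_doc_term_matrix")
data_storage.close_connection()
```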
## How does Donkeybot handle the text processing needed?

With the help of libraries like string, datetime, pytz and nltk 😁 See the bot.utils module for the text-processing source code.
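As an illustration, a typical cleaning helper of the kind you'd find in such a module might look like this (my sketch, not the actual bot.utils code):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer data
nltk.download("stopwords", quiet=True)  # stopword list

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation and drop English stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return " ".join(t for t in tokens if t not in stopwords.words("english"))

print(clean_text("How does Donkeybot handle the text processing needed?"))
# -> 'donkeybot handle text processing needed'
```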