---

copyright:
  years: 2017

lastupdated: "2017-10-16"

---

{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:tip: .tip}
{:pre: .pre}
{:codeblock: .codeblock}
{:screen: .screen}
{:javascript: .ph data-hd-programlang='javascript'}
{:java: .ph data-hd-programlang='java'}
{:python: .ph data-hd-programlang='python'}
{:swift: .ph data-hd-programlang='swift'}
# Migrating from {{site.data.keyword.retrieveandrankshort}} to the {{site.data.keyword.discoveryfull}} service
{: #overview}

A continuation of the Cranfield Getting Started Tutorial.
{: shortdesc}

This tutorial guides you through the process of creating and training a {{site.data.keyword.discoveryfull}} service instance with sample data. It uses the same data set as the {{site.data.keyword.retrieveandrankshort}} Getting Started Tutorial, but you can use the same approach to create a service instance that uses your own data.
The process for users migrating data from {{site.data.keyword.retrieveandrankshort}} to {{site.data.keyword.discoveryshort}} consists of two main steps:
- Migrate the collection data.
- Migrate the training data.
When you migrate your trained collection data, the most important point is to keep the document IDs the same. Your training data uses those IDs to reference the ground truth; if the IDs change when you move from {{site.data.keyword.retrieveandrankshort}} to {{site.data.keyword.discoveryshort}}, the reranking will be completely off (or training might not start at all). {{site.data.keyword.discoveryshort}} allows you to specify the document ID in the API upload process, so you can avoid this problem by following the guidelines in this document.

The {{site.data.keyword.retrieveandrankshort}} training data is usually stored in a `csv` file. In this tutorial, that `csv` file is used to upload the sample training data into {{site.data.keyword.discoveryshort}}. Migration of training data exported from the {{site.data.keyword.retrieveandrankshort}} tooling is detailed in migrating training data from the service.
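For example, when you add a document through the API, you can include the existing document ID in the request path so that {{site.data.keyword.discoveryshort}} keeps it. The following call is a sketch, assuming a file named `doc-1400.json` whose {{site.data.keyword.retrieveandrankshort}} document ID was `1400`:

```
curl -X POST -u "{username}":"{password}" -F "file=@doc-1400.json;type=application/json" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/documents/1400?version=2017-10-16"
```
{: pre}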
This tutorial assumes that {{site.data.keyword.retrieveandrankshort}} was set up similarly to the {{site.data.keyword.retrieveandrankshort}} Getting Started Tutorial, and it uses the migrate-from-source path described here. See Evaluate your migration path to Watson Discovery service for other migration options.
To complete the tutorial, you need the following:
- cURL. You can install the version of cURL for your operating system from haxx.se{: new_window}. You must install the version that supports the Secure Sockets Layer (SSL) protocol. Make sure to include the installed binary file on your `PATH` environment variable.
- Sample Cranfield data. This tutorial uses the sample collection data from the {{site.data.keyword.retrieveandrankshort}} Getting Started Tutorial: cranfield json data{: new_window}.
- Sample data ground truth. This tutorial uses the sample Cranfield ground truth from the {{site.data.keyword.retrieveandrankshort}} Getting Started Tutorial: cranfield ground truth csv{: new_window}.
- Python version 2. To check whether Python is installed, enter `python --version` at a command prompt and ensure that the version number starts with 2. If you need to install Python, see Downloading Python{: new_window}.
- Data upload script: Discovery document uploader{: new_window}
- Training data upload script: Discovery training uploader{: new_window}
The following prerequisites are necessary before you begin this tutorial:

- This tutorial assumes that you have already created a {{site.data.keyword.discoveryshort}} instance. If you need directions on how to create a {{site.data.keyword.discoveryshort}} instance, refer to the following tutorial.
- This tutorial assumes that you have your service credentials. To find them:
  - When in the Watson {{site.data.keyword.discoveryshort}} service on {{site.data.keyword.Bluemix_notm}}, click Service credentials.
  - Click View credentials under Actions.
  - Copy the `username` and `password` values, and make sure that the `url` value matches the one in the examples below. If it doesn't, replace the URL as well.
- Create an environment:

  ```
  curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "name": "my_environment", "description": "My environment" }' "https://gateway.watsonplatform.net/discovery/api/v1/environments?version=2017-10-16"
  ```
  {: pre}

  Copy the `environment_id` that is listed in the returned JSON.
- Create a collection:

  ```
  curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "name": "test_collection", "description": "My test collection", "configuration_id": "{configuration_id}", "language_code": "en" }' "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections?version=2017-10-16"
  ```
  {: pre}

  Copy the `collection_id` that is listed in the returned JSON.
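  If you don't know which `{configuration_id}` to use in the previous command, you can list the configurations in your environment and pick the ID of the default configuration. This call is a sketch that assumes the same credentials and environment as above:

  ```
  curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/configurations?version=2017-10-16"
  ```
  {: pre}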
- Add the documents that are to be searched:

  - Download the cranfield-data.json{: new_window} file if you haven't already. This file is the source of the documents that were used in {{site.data.keyword.retrieveandrankshort}}. The Cranfield collection documents are in JSON format, which is the format that {{site.data.keyword.retrieveandrankshort}} accepted and which works well for Watson {{site.data.keyword.discoveryshort}} as well. Note: {{site.data.keyword.discoveryshort}} does not require you to upload a Solr schema, because it infers the schema from the JSON structure automatically.
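    An individual record in the file looks roughly like the following. The field names shown here are illustrative of the Cranfield data and might differ slightly in your copy:

    ```
    {
      "id": "1",
      "author": "brenckman, m.",
      "title": "experimental investigation of the aerodynamics of a wing in a slipstream",
      "body": "an experimental study of a wing in a slipstream was conducted ..."
    }
    ```
    {: codeblock}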
  - Download the data upload script here{: new_window}. This script uploads the Cranfield JSON into {{site.data.keyword.discoveryshort}}. It reads through the JSON file and sends each individual JSON document to the {{site.data.keyword.discoveryshort}} service, using the default configuration in {{site.data.keyword.discoveryshort}}. Note: The default configuration in {{site.data.keyword.discoveryshort}} provides settings that are similar to the default Solr configuration in {{site.data.keyword.retrieveandrankshort}}.
  - Issue the following command to upload the `cranfield-data.json` data to the collection that you created. Replace `{username}`, `{password}`, `{path_to_file}`, `{environment_id}`, and `{collection_id}` with your information. Additional options are available: `-d` for debug and `-v` for verbose output from cURL.

    ```
    python ./disco-upload.py -u {username}:{password} -i {path_to_file}/cranfield-data.json -e {environment_id} -c {collection_id}
    ```
    {: pre}
- Once the upload process has completed, check that the documents are there by issuing the following command to view the collection details:

  ```
  curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}?version=2017-10-16"
  ```
  {: pre}
The output will look something like this:
{ "collection_id": "f1360220-ea2d-4271-9d62-89a910b13c37", "name": "democollection", "description": "this is a demo collection", "created": "2015-08-24T18:42:25.324Z", "updated": "2015-08-24T18:42:25.324Z", "status": "available", "configuration_id": "6963be41-2dea-4f79-8f52-127c63c479b0", "language": "en", "document_counts": { "available": 1000, "processing": 20, "failed": 180 }, "disk_usage": { "used_bytes": 260 }, "training": { "total_examples": 0, "available": false, "processing": false, "minimum_queries_added": false, "minimum_examples_added": true, "sufficient_label_diversity": false, "notices": 0, "successfully_trained": null, "data_updated": null } }
{: codeblock}
Look at the `document_counts` section to see how many documents were uploaded successfully. We aren't expecting any document failures with this sample data set; however, with other data sets you might see failed document counts. If you have any failed documents, you can use the notices API to see the error messages. See here{: new_window} to review the notices API command.
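For reference, a notices query looks something like the following sketch, which assumes the same credentials and IDs as the earlier commands:

```
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/notices?version=2017-10-16"
```
{: pre}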
The `training` section of the response gives you information about your training. We'll review that section after you upload your training data.
The Watson {{site.data.keyword.discoveryshort}} service uses a machine learning model to re-rank documents. To do so, you must train a model. Training occurs after you have loaded enough queries along with appropriately rated documents. By loading enough examples with enough variance into Watson {{site.data.keyword.discoveryshort}}, you teach it what a "good" document is. In this step, we will use the existing Cranfield "ground truth" that was used in {{site.data.keyword.retrieveandrankshort}} to train Watson {{site.data.keyword.discoveryshort}}.
- Download the sample Cranfield ground truth `csv` file{: new_window} from the {{site.data.keyword.retrieveandrankshort}} tutorial if you haven't already done so.

  The file is a set of questions that a user might ask about the documents. It provides the example information about questions and relevant answers that trains the ranker in {{site.data.keyword.retrieveandrankshort}} and relevancy training in {{site.data.keyword.discoveryshort}}. For each question, there is at least one identifier for an answer (the document ID), and each document ID includes a number that indicates how relevant the answer is to the question. The document ID points to the answer in the `cranfield-data.json` file that you uploaded to {{site.data.keyword.discoveryshort}} in the previous step.
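  A few rows of the file look roughly like the following: each row is a question followed by pairs of document ID and relevance label. These rows are illustrative, not copied from the actual file:

  ```
  what similarity laws must be obeyed when constructing aeroelastic models,184,3,29,3,31,2
  what are the structural and aeroelastic problems associated with flight of high speed aircraft,12,3,51,2,102,1
  ```
  {: codeblock}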
- Download the Training Data upload script{: new_window}. You will use this script to upload the training data into {{site.data.keyword.discoveryshort}}. The script transforms the `csv` file into a set of JSON queries and examples and sends them to the {{site.data.keyword.discoveryshort}} service by using the training data APIs{: new_window}. Note: {{site.data.keyword.discoveryshort}} manages training data within the service, so new training queries and examples can be stored in {{site.data.keyword.discoveryshort}} itself rather than in a separate CSV file that must be maintained.
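  Under the hood, each training query and its rated examples are uploaded to the training data endpoint. The following call is a sketch of the shape of one such request; the query text, document ID, and relevance value are illustrative:

  ```
  curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "natural_language_query": "what is the basic mechanism of the transonic aileron buzz", "examples": [ { "document_id": "496", "relevance": 3 } ] }' "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/training_data?version=2017-10-16"
  ```
  {: pre}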
- Execute the training upload script to upload the training data into {{site.data.keyword.discoveryshort}}. Replace `{username}`, `{password}`, `{path_to_file}`, `{environment_id}`, and `{collection_id}` with your information. Additional options are available: `-d` for debug and `-v` for verbose output from cURL.

  ```
  python ./disco-train.py -u {username}:{password} -i {path_to_file}/cranfield-gt.csv -e {environment_id} -c {collection_id}
  ```
  {: pre}
- Once the data is loaded, you can check the status of training by using the collection details command from the previous section. {{site.data.keyword.discoveryshort}} checks about once per hour for new training data; if it finds any, it begins processing the data and turning it into a machine learning model. While a model is training, the state in the `training` section changes from `"processing": false` to `"processing": true`. Once the model has been trained, the state changes from `"available": false` to `"available": true`, and the date for `"successfully_trained"` is updated. If there are any errors, you can view them by looking at the notices API, as described in the previous section.
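  After training has completed, the `training` section of the collection details will look something like this (the values shown here are illustrative):

  ```
  "training": {
    "total_examples": 103,
    "available": true,
    "processing": false,
    "minimum_queries_added": true,
    "minimum_examples_added": true,
    "sufficient_label_diversity": true,
    "notices": 0,
    "successfully_trained": "2017-10-16T14:02:12.345Z",
    "data_updated": "2017-10-16T13:01:01.123Z"
  }
  ```
  {: codeblock}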
## Searching your data
{: #search}
The {{site.data.keyword.discoveryshort}} service automatically uses a trained model to re-rank search results if one is available. When an API call{: new_window} is made with `natural_language_query` instead of `query`, the service checks whether a model is available. If a model is available, {{site.data.keyword.discoveryshort}} uses it to re-rank the results. First, we will do a search over unranked documents, and then we will do a search that uses the ranking model.
- You can search for documents in your collection by using a cURL command. Perform a query with the `query` parameter to see unranked results. Replace `{username}`, `{password}`, `{environment_id}`, and `{collection_id}` with your own values. The returned results are unranked and use the default {{site.data.keyword.discoveryshort}} ranking formulas. You can try other queries by opening the training data `csv` file and copying the value of the first column into the query parameter.

  ```
  curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&query=what is the basic mechanism of the transonic aileron buzz"
  ```
  {: pre}
- Now perform a search that uses the model by setting the `natural_language_query` parameter. Before you do so, make sure that you have a trained model, as described in the previous section. Paste the following command into your console, replacing `{username}`, `{password}`, `{environment_id}`, and `{collection_id}` with your values.

  ```
  curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&natural_language_query=what is the basic mechanism of the transonic aileron buzz"
  ```
  {: pre}
This command returns re-ranked results that use the model you trained earlier. Compare these results with the results of some of the other searches you tried earlier. You might see some differences from the results in {{site.data.keyword.retrieveandrankshort}}; some of the techniques that are used for search have changed to simplify the experience and improve results, but overall the quality of results will be similar.
After evaluating the re-ranked search results, you can refine them in {{site.data.keyword.discoveryshort}} by repeating the step of uploading training data with additional training queries and examples and then viewing the search results. You can also add new documents, as described in the first step, to broaden the scope of the search. As in {{site.data.keyword.retrieveandrankshort}}, improving results with training is an iterative process.