---

copyright:
  years: 2015, 2017
lastupdated: "2017-10-16"

---

{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:tip: .tip}
{:pre: .pre}
{:codeblock: .codeblock}
{:screen: .screen}
{:javascript: .ph data-hd-programlang='javascript'}
{:java: .ph data-hd-programlang='java'}
{:python: .ph data-hd-programlang='python'}
{:swift: .ph data-hd-programlang='swift'}
This topic explains how to use the Data Crawler to ingest files from your local filesystem for use with the {{site.data.keyword.discoveryfull}} service.
{: shortdesc}
Before attempting this task, create an instance of the {{site.data.keyword.discoveryshort}} service in {{site.data.keyword.Bluemix}}. To complete this guide, you need the credentials that are associated with the service instance that you created.
Use the `POST /v1/environments` method to create an environment. Think of an environment as the warehouse where you store all your boxes of documents. The following example creates an environment that is called `my-first-environment`. Replace `{username}` and `{password}` with your service credentials:
```
curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "name":"my-first-environment", "description":"exploring environments"}' "https://gateway.watsonplatform.net/discovery/api/v1/environments?version=2017-10-16"
```
{: pre}
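If you are scripting these steps, you can capture the returned environment ID for reuse. A minimal sketch, assuming the `jq` JSON processor is installed:

```bash
# Create the environment and store the returned environment_id in a
# shell variable for later steps (assumes jq is installed).
environment_id=$(curl -s -X POST -u "{username}":"{password}" \
  -H "Content-Type: application/json" \
  -d '{ "name":"my-first-environment", "description":"exploring environments"}' \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments?version=2017-10-16" \
  | jq -r '.environment_id')
echo "environment_id: ${environment_id}"
```
{: pre}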
The API returns a response that includes information such as your environment ID, environment status, and how much storage your environment is using. Do not go on to the next step until your environment status is `ready`. When you create the environment, if the status returns `status:pending`, use the `GET /v1/environments/{environment_id}` method to check the status until it is ready. In this example, replace `{username}` and `{password}` with your service credentials, and replace `{environment_id}` with the environment ID that was returned when you created the environment.
```
curl -u "{username}":"{password}" https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}?version=2017-10-16
```
{: pre}
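Rather than rerunning the command by hand, you can poll until the environment leaves the `pending` state. A sketch that assumes `jq` is installed and that `environment_id` holds your environment ID:

```bash
# Check the environment status every 5 seconds until it is no longer
# "pending" (it should reach "ready").
while true; do
  status=$(curl -s -u "{username}":"{password}" \
    "https://gateway.watsonplatform.net/discovery/api/v1/environments/${environment_id}?version=2017-10-16" \
    | jq -r '.status')
  echo "environment status: ${status}"
  [ "${status}" != "pending" ] && break
  sleep 5
done
```
{: pre}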
Next, use the `POST /v1/environments/{environment_id}/collections` method to create a collection. Think of a collection as a box where you store your documents in your environment. This example creates a collection that is called `my-first-collection` in the environment that you created in the previous step, and uses the default configuration:

- Replace `{username}` and `{password}` with your service credentials.
- Replace `{environment_id}` with the environment ID for the environment that you created in step 1.
Before creating a collection, you must get the ID of your default configuration. To find your default `configuration_id`, use the `GET /v1/environments/{environment_id}/configurations` method:
```
curl -u "{username}":"{password}" https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/configurations?version=2017-10-16
```
{: pre}
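Because the response lists every configuration in the environment, you can also pick the default one out programmatically. A sketch assuming `jq` is installed and that the default configuration carries the name `Default Configuration`:

```bash
# Extract the configuration_id of the configuration named
# "Default Configuration" from the listing.
configuration_id=$(curl -s -u "{username}":"{password}" \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments/${environment_id}/configurations?version=2017-10-16" \
  | jq -r '.configurations[] | select(.name == "Default Configuration") | .configuration_id')
echo "configuration_id: ${configuration_id}"
```
{: pre}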
Once you have the default configuration ID, use it to create your collection. Replace `{configuration_id}` with the default configuration ID for your environment:
```
curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{"name": "my-first-collection", "description": "exploring collections", "configuration_id": "{configuration_id}", "language": "en_us"}' https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections?version=2017-10-16
```
{: pre}
The API returns a response that includes information such as your collection ID, collection status, and how much storage your collection is using. Do not go on to the next step until your collection status is `online`. When you create the collection, if the status returns `status:pending`, use the `GET /v1/environments/{environment_id}/collections/{collection_id}` method to check the status until it is ready. In this example, replace `{username}` and `{password}` with your service credentials, replace `{environment_id}` with your environment ID, and replace `{collection_id}` with the collection ID that was returned earlier in this step.
```
curl -u "{username}":"{password}" https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}?version=2017-10-16
```
{: pre}
Download and install the Data Crawler:
- Verify your system prerequisites:
  - Java Runtime Environment version 8 or higher. Note: Your `JAVA_HOME` environment variable must be set correctly, or not be set at all, in order to run the Crawler.
  - Red Hat Enterprise Linux 6 or 7, or Ubuntu Linux 15 or 16. For optimal performance, the Data Crawler should run on its own instance of Linux, whether it is a virtual machine, a container, or hardware.
  - Minimum 2 GB RAM on the Linux system.
- Open a browser and log in to your {{site.data.keyword.Bluemix_notm}} account.
- From your {{site.data.keyword.Bluemix_notm}} Dashboard, select the {{site.data.keyword.discoveryshort}} service that you previously created.
- Under **Intended Use**, select the appropriate download link for your system (DEB, RPM, or ZIP) to download the Data Crawler.
- As an administrator, use the appropriate commands to install the archive file that you downloaded:
  - On systems such as Red Hat and CentOS that use rpm packages, use a command such as the following:

    ```
    rpm -i /full/path/to/rpm/package/rpm-file-name
    ```
    {: pre}

  - On systems such as Ubuntu and Debian that use deb packages, use a command such as the following:

    ```
    dpkg -i /full/path/to/deb/package/deb-file-name
    ```
    {: pre}

- The Crawler scripts are installed into `{installation_directory}/bin`; for example, `/opt/ibm/crawler/bin`. Ensure that `{installation_directory}/bin` is in your `PATH` environment variable for the Crawler commands to work correctly.

  Crawler scripts are also installed to `/usr/local/bin`, so this directory can be added to your `PATH` environment variable as well. {: tip}
Copy the contents of the `{installation_directory}/share/examples/config` directory to a working directory on your system, for example `/home/config`.
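For example, assuming the crawler is installed in the default `/opt/ibm/crawler` location mentioned above, the copy might look like this:

```bash
# Copy the shipped example configuration into a working directory so
# that the installed originals are never edited in place.
mkdir -p /home/config
cp -r /opt/ibm/crawler/share/examples/config/. /home/config/
```
{: pre}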
Warning: Do not modify the provided configuration example files directly. Copy and then edit them. If you edit the example files in-place, your configuration may be overwritten when upgrading the Data Crawler, or may be removed when uninstalling it.
Note: References in this guide to files in the `config` directory, such as `config/crawler.conf`, refer to that file in your working directory, and NOT in the installed `{installation_directory}/share/examples/config` directory.
To set up the Data Crawler to crawl your repository, you must specify which local filesystem files to crawl, and which {{site.data.keyword.discoveryshort}} service to send the crawled files to once the crawl is completed.
- `filesystem-seed.conf` - Open the `seeds/filesystem-seed.conf` file in a text editor. Modify the `value` attribute directly under the `name="url"` attribute to the file path that you want to crawl. For example: `value="sdk-fs:///TMP/MY_TEST_DATA/"`

  Note: The URLs must start with `sdk-fs://`. So to crawl, for example, `/home/watson/mydocs`, the value of this URL would be `sdk-fs:///home/watson/mydocs` - the third `/` is necessary! (See the URL sketch after this list.) Save and close the file.
- `discovery_service.conf` - Open the `discovery/discovery_service.conf` file in a text editor. Modify the following values specific to the {{site.data.keyword.discoveryshort}} service you previously created on {{site.data.keyword.Bluemix_notm}} (a sketch of a filled-in file appears after this list):
  - `environment_id` - Your {{site.data.keyword.discoveryshort}} service environment ID.
  - `collection_id` - Your {{site.data.keyword.discoveryshort}} service collection ID.
  - `configuration_id` - Your {{site.data.keyword.discoveryshort}} service configuration ID.
  - `configuration` - The full path location of this `discovery_service.conf` file, for example, `/home/config/discovery/discovery_service.conf`.
  - `username` - Username credential for your {{site.data.keyword.discoveryshort}} service.
  - `password` - Password credential for your {{site.data.keyword.discoveryshort}} service.
- `crawler.conf` - Open the `config/crawler.conf` file in a text editor. Set the `output_adapter` `class` and `config` options for the {{site.data.keyword.discoveryshort}} service as follows:

  ```
  class = "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
  config = "discovery_service"
  discovery_service {
    include "discovery/discovery_service.conf"
  }
  ```
  {: pre}
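To sanity-check the seed URL that you build for `filesystem-seed.conf`, note that prepending `sdk-fs://` to an absolute path yields the three slashes automatically, as in this small shell sketch (the path is hypothetical):

```bash
# The scheme contributes two slashes; the absolute path's own leading
# slash becomes the third one.
printf 'sdk-fs://%s\n' "/home/watson/mydocs"
# prints: sdk-fs:///home/watson/mydocs
```
{: pre}

For orientation only, a filled-in `discovery_service.conf` might look roughly like the following sketch. The option names come from the list above; treat the syntax in your own copy of the file as authoritative, and every value here as a placeholder:

```
environment_id = "{environment_id}"
collection_id = "{collection_id}"
configuration_id = "{configuration_id}"
configuration = "/home/config/discovery/discovery_service.conf"
username = "{username}"
password = "{password}"
```
{: pre}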
After modifying these files, you are ready to crawl your data. Run the following command:

```
crawler crawl --config [config/crawler.conf]
```
{: pre}

This command runs a crawl with the configuration file `crawler.conf`.

Note: The path to the configuration file passed in the `--config` option must be a qualified path; that is, either a relative path, such as `config/crawler.conf` or `./crawler.conf`, or an absolute path, such as `/path/to/config/crawler.conf`.
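For example, using the working directory suggested earlier (this assumes you copied the example configuration to `/home/config`):

```bash
# Run the crawl with an absolute path to the copied configuration file.
crawler crawl --config /home/config/crawler.conf
```
{: pre}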
Finally, use the `GET /v1/environments/{environment_id}/collections/{collection_id}/query` method to search your collection of documents. The following example returns all entities that are called `IBM`:

- Replace `{username}` and `{password}` with your service credentials.
- Replace `{environment_id}` with the environment ID for the environment that you created in step 1.
- Replace `{collection_id}` with the collection ID of the collection that you created in step 2.
```
curl -u "{username}":"{password}" 'https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&query=enriched_text.entities.text:IBM'
```
{: pre}
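If `jq` is installed, you can trim the response down to the essentials; `matching_results` is the hit count in the query response:

```bash
# Show only the hit count and the first returned document
# (assumes jq is installed).
curl -s -u "{username}":"{password}" \
  'https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&query=enriched_text.entities.text:IBM' \
  | jq '{matching_results: .matching_results, first_result: .results[0]}'
```
{: pre}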
You have now successfully queried documents in an environment and collection that you created. From here, you can begin customizing by adding more documents to the collection.