---

copyright:
  years: 2015, 2017
lastupdated: "2017-10-16"

---


{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:tip: .tip}
{:pre: .pre}
{:codeblock: .codeblock}
{:screen: .screen}
{:javascript: .ph data-hd-programlang='javascript'}
{:java: .ph data-hd-programlang='java'}
{:python: .ph data-hd-programlang='python'}
{:swift: .ph data-hd-programlang='swift'}

# Getting started with the Data Crawler

This topic explains how to use the Data Crawler to ingest files from your local file system into the {{site.data.keyword.discoveryfull}} service.
{: shortdesc}

Before attempting this task, create an instance of the {{site.data.keyword.discoveryshort}} service in {{site.data.keyword.Bluemix}}. You need the credentials that are associated with that service instance to complete this guide.

## Create an environment

Use the POST /v1/environments method to create an environment. Think of an environment as the warehouse where you store all your boxes of documents. The following example creates an environment that is called my-first-environment:

Replace {username} and {password} with your service credentials.

```bash
curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{ "name":"my-first-environment", "description":"exploring environments"}' "https://gateway.watsonplatform.net/discovery/api/v1/environments?version=2017-10-16"
```
{: pre}

The API returns a response that includes information such as your environment ID, environment status, and how much storage your environment is using. Do not go on to the next step until your environment is ready. If the status is returned as pending when you create the environment, use the GET /v1/environments/{environment_id} method to check the status until the environment is ready. In this example, replace {username} and {password} with your service credentials, and replace {environment_id} with the environment ID that was returned when you created the environment.

```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}?version=2017-10-16"
```
{: pre}
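
If you have the jq utility installed, you can poll just the status field until the environment leaves the pending state. The following loop is a minimal sketch, not part of the Data Crawler tooling; it assumes jq is available and that you substitute your own credentials and environment ID.

```bash
# Poll the environment status every 10 seconds until it is no longer "pending".
# Assumes the jq utility is installed; replace the placeholders with your values.
while true; do
  status=$(curl -s -u "{username}":"{password}" \
    "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}?version=2017-10-16" \
    | jq -r '.status')
  echo "Environment status: $status"
  [ "$status" != "pending" ] && break
  sleep 10
done
```
{: pre}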

## Create a collection

Next, use the POST /v1/environments/{environment_id}/collections method to create a collection. Think of a collection as a box in your environment where you store your documents. This example creates a collection that is called my-first-collection in the environment that you created in the previous step, and uses the default configuration:

  • Replace {username} and {password} with your service credentials.
  • Replace {environment_id} with the environment ID for the environment that you created in step 1.

Before you create a collection, you must get the ID of your default configuration. To find your default configuration_id, use the GET /v1/environments/{environment_id}/configurations method:

```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/configurations?version=2017-10-16"
```
{: pre}
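
If jq is installed, a one-liner like the following sketch extracts the ID of the first configuration in the list, which is the default configuration in a new environment (this assumes the response lists configurations in a configurations array and that the default configuration is the only one present):

```bash
# Extract the configuration_id of the first listed configuration.
# Assumes jq is installed and that only the default configuration exists.
curl -s -u "{username}":"{password}" \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/configurations?version=2017-10-16" \
  | jq -r '.configurations[0].configuration_id'
```
{: pre}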

Once you have the default configuration ID, use it to create your collection. Replace {configuration_id} with the default configuration ID for your environment.

```bash
curl -X POST -u "{username}":"{password}" -H "Content-Type: application/json" -d '{"name": "my-first-collection", "description": "exploring collections", "configuration_id":"{configuration_id}", "language": "en_us"}' "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections?version=2017-10-16"
```
{: pre}

The API returns a response that includes information such as your collection ID, collection status, and how much storage your collection is using. Do not go on to the next step until your collection is online. If the status is returned as pending when you create the collection, use the GET /v1/environments/{environment_id}/collections/{collection_id} method to check the status until it is ready. In this example, replace {username} and {password} with your service credentials, replace {environment_id} with your environment ID, and replace {collection_id} with the collection ID that was returned earlier in this step.

```bash
curl -u "{username}":"{password}" "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}?version=2017-10-16"
```
{: pre}
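
As with the environment, a jq-based check can print only the status field; this one-liner is a sketch that assumes jq is installed:

```bash
# Print just the collection status (assumes jq is installed).
curl -s -u "{username}":"{password}" \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}?version=2017-10-16" \
  | jq -r '.status'
```
{: pre}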

## Download example documents

Download these documents:

## Download and install the Data Crawler

  1. Verify your system prerequisites

    • Java Runtime Environment version 8 or higher

      Note: Your JAVA_HOME environment variable must be set correctly, or not be set at all, in order to run the Crawler.

    • Red Hat Enterprise Linux 6 or 7, or Ubuntu Linux 15 or 16. For optimal performance, the Data Crawler should run on its own instance of Linux, whether it is a virtual machine, a container, or hardware.

    • Minimum 2 GB RAM on the Linux system

  2. Open a browser and log in to your {{site.data.keyword.Bluemix_notm}} account.

  3. From your {{site.data.keyword.Bluemix_notm}} Dashboard, select the {{site.data.keyword.discoveryshort}} service you previously created.

  4. Under Intended Use, select the appropriate download link for your system (DEB, RPM, or ZIP) to download the Data Crawler.

  5. As an administrator, use the appropriate commands to install the archive file that you downloaded:

    • On systems such as Red Hat and CentOS that use rpm packages, use a command such as the following: rpm -i /full/path/to/rpm/package/rpm-file-name
    • On systems such as Ubuntu and Debian that use deb packages, use a command such as the following: dpkg -i /full/path/to/deb/package/deb-file-name
    • The Crawler scripts are installed into {installation_directory}/bin; for example, /opt/ibm/crawler/bin. Ensure that {installation_directory}/bin is in your PATH environment variable so that the Crawler commands work correctly. A short verification sketch follows this procedure.

    The Crawler scripts are also installed to /usr/local/bin, so that directory can be added to your PATH environment variable as well. {: tip}
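
After installation, you can quickly confirm the prerequisites and the install from a shell. This is a minimal sketch under the assumptions above (Java 8 or higher, and the crawler script available on your PATH):

```bash
# Confirm the Java version (the Crawler requires Java 8 or higher).
java -version

# Confirm JAVA_HOME is either unset or points to a valid Java installation.
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"

# Confirm that the crawler script is found on the PATH.
which crawler
```
{: pre}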

## Create your working directory

Copy the contents of the {installation_directory}/share/examples/config directory to a working directory on your system, for example /home/config.

Warning: Do not modify the provided configuration example files directly. Copy and then edit them. If you edit the example files in-place, your configuration may be overwritten when upgrading the Data Crawler, or may be removed when uninstalling it.

Note: References in this guide to files in the config directory, such as config/crawler.conf, refer to that file in your working directory, and NOT in the installed {installation_directory}/share/examples/config directory.
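
For example, with the /opt/ibm/crawler installation directory shown earlier and a /home/config working directory, the copy might look like the following sketch (adjust both paths for your system):

```bash
# Create a working directory and copy the example configuration files into it.
# /opt/ibm/crawler is the example installation directory; adjust as needed.
mkdir -p /home/config
cp -r /opt/ibm/crawler/share/examples/config/* /home/config/
```
{: pre}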

## Configure crawl options

To set up the Data Crawler to crawl your repository, you must specify which local file system paths you want to crawl, and which {{site.data.keyword.discoveryshort}} service the collection of crawled files is sent to after the crawl is completed.

  1. filesystem-seed.conf - Open the seeds/filesystem-seed.conf file in a text editor. Modify the value attribute directly under the name="url" attribute so that it points to the file path that you want to crawl. For example: value="sdk-fs:///TMP/MY_TEST_DATA/"

    Note: The URLs must start with sdk-fs://. So to crawl, for example, /home/watson/mydocs, the value of this URL would be sdk-fs:///home/watson/mydocs - the third / is necessary!

    Save and close the file.

  2. discovery_service.conf - Open the discovery/discovery_service.conf file in a text editor. Modify the following values, which are specific to the {{site.data.keyword.discoveryshort}} service you previously created on {{site.data.keyword.Bluemix_notm}} (an illustrative example of these settings follows this procedure):

    • environment_id - Your {{site.data.keyword.discoveryshort}} service environment ID.
    • collection_id - Your {{site.data.keyword.discoveryshort}} service collection ID.
    • configuration_id - Your {{site.data.keyword.discoveryshort}} service configuration ID.
    • configuration - The full path location of this discovery_service.conf file, for example, /home/config/discovery/discovery_service.conf.
    • username - Username credential for your {{site.data.keyword.discoveryshort}} service.
    • password - Password credential for your {{site.data.keyword.discoveryshort}} service.
  3. crawler.conf - Open the config/crawler.conf file in a text editor.

    • Set the output_adapter class and config options for the {{site.data.keyword.discoveryshort}} service as follows:

      ```
      class: "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter",
      config: "discovery_service",
      discovery_service {
        include "discovery/discovery_service.conf"
      }
      ```
      {: pre}

  4. After modifying these files, you are ready to crawl your data.
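
As a rough illustration of steps 1 and 2, the edited entries might look like the following sketch. The keys shown here are based on the descriptions above, and the exact layout of your copies of the files may differ; the bracketed placeholders stand for your own IDs and credentials, so edit the existing entries in your files rather than pasting these lines in verbatim.

```
# seeds/filesystem-seed.conf (excerpt): crawl the /home/watson/mydocs directory
name="url"
value="sdk-fs:///home/watson/mydocs"

# discovery/discovery_service.conf (excerpt)
environment_id = "{environment_id}"
collection_id = "{collection_id}"
configuration_id = "{configuration_id}"
configuration = "/home/config/discovery/discovery_service.conf"
username = "{username}"
password = "{password}"
```
{: pre}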

## Crawl your data

Run the following command: crawler crawl --config config/crawler.conf

This command runs a crawl with the configuration file crawler.conf.

Note: The path to the configuration file that is passed in the --config option must be a qualified path: either a relative path, such as config/crawler.conf or ./crawler.conf, or an absolute path, such as /path/to/config/crawler.conf.

## Search your documents

Finally, use the GET /v1/environments/{environment_id}/collections/{collection_id}/query method to search your collection of documents. The following example returns all documents that contain an entity named IBM:

  • Replace {username} and {password} with your service credentials.
  • Replace {environment_id} with the environment ID for the environment you created in step 1.
  • Replace {collection_id} with the collection ID of the collection that you created in step 2.
```bash
curl -u "{username}":"{password}" 'https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&query=enriched_text.entities.text:IBM'
```
{: pre}
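
To skim the response, a jq one-liner such as the following sketch prints the match count and the IDs of the returned documents (it assumes jq is installed; matching_results and results are fields in the query response):

```bash
# Show how many documents matched and list their document IDs (assumes jq is installed).
curl -s -u "{username}":"{password}" \
  'https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2017-10-16&query=enriched_text.entities.text:IBM' \
  | jq '{matching_results: .matching_results, ids: [.results[].id]}'
```
{: pre}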

## Results

You have now successfully queried documents in the environment and collection that you created. You can begin customizing by adding more documents to the collection.