copyright | lastupdated
---|---
 | 2017-08-18

{:shortdesc: .shortdesc}
{:new_window: target="_blank"}
{:tip: .tip}
{:pre: .pre}
{:codeblock: .codeblock}
{:screen: .screen}
{:javascript: .ph data-hd-programlang='javascript'}
{:java: .ph data-hd-programlang='java'}
{:python: .ph data-hd-programlang='python'}
{:swift: .ph data-hd-programlang='swift'}
To set up the Data Crawler to crawl your repository, you must specify the appropriate input adapter in the `crawler.conf` file, and then configure repository-specific information in the input adapter configuration files.
{: shortdesc}
Before making the changes listed in these steps, make sure that you have created your working directory by copying the contents of the `{installation_directory}/share/examples/config` directory to a working directory on your system, for example `/home/config`.
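For example, on a Linux system you might copy the example configuration into a working directory with a command like the following. This is only an illustration; `{installation_directory}` is a placeholder for your own Data Crawler installation directory, and `/home/config` is the example working directory used throughout this guide:

cp -r {installation_directory}/share/examples/config /home/config

{: pre}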
Important: Do not modify the provided configuration example files directly. Copy and then edit them. If you edit the example files in place, your configuration may be overwritten when you upgrade the Data Crawler, or removed when you uninstall it.
Note: References in this guide to files in the `config` directory, such as `config/crawler.conf`, refer to that file in your working directory, and NOT in the installed `{installation_directory}/share/examples/config` directory.
The values specified in the following steps are the defaults in `config/crawler.conf`, and configure the Filesystem connector:
1. Open the `config/crawler.conf` file in a text editor.
2. Set the `crawl_config_file` option to the `.conf` file that you previously modified, for example: `connectors/filesystem.conf`.
3. Set the `crawl_seed_file` option to the `-seed.conf` file that you previously modified, for example: `seeds/filesystem-seed.conf`.
4. Set the `output_adapter` `class` and `config` options for the {{site.data.keyword.discoveryshort}} service as follows:

   class - "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
   config - "discovery_service"
   discovery_service {
     include "discovery/discovery_service.conf"
   }
   {: codeblock}

   There are other optional settings in this file that can be set as appropriate to your environment. See Configuring crawl options, Configuring the input adapter, Configuring the output adapter, and Additional crawl management options for detailed information about setting these values.
5. Open the `discovery/discovery_service.conf` file in a text editor. Modify the following values, which are specific to the {{site.data.keyword.discoveryshort}} service that you previously created on {{site.data.keyword.Bluemix}}:
   - `environment_id` - Your {{site.data.keyword.discoveryshort}} service environment ID.
   - `collection_id` - Your {{site.data.keyword.discoveryshort}} service collection ID.
   - `configuration_id` - Your {{site.data.keyword.discoveryshort}} service configuration ID.
   - `configuration` - The full path location of this `discovery_service.conf` file, for example, `/home/config/discovery/discovery_service.conf`.
   - `username` - Username credential for your {{site.data.keyword.discoveryshort}} service.
   - `password` - Password credential for your {{site.data.keyword.discoveryshort}} service.

   There are other optional settings in this file that can be set as appropriate to your environment. See Configuring Service Options for detailed information about setting these values. A filled-in example of this file follows these steps.
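For reference, a filled-in `discovery/discovery_service.conf` might look something like the following sketch. Every value shown is a placeholder; substitute the IDs, path, and credentials from your own {{site.data.keyword.discoveryshort}} service instance, and keep the key/value syntax that your copy of the file already uses (the `option - value` style shown in the other examples in this guide):

environment_id - "your-environment-id"
collection_id - "your-collection-id"
configuration_id - "your-configuration-id"
configuration - "/home/config/discovery/discovery_service.conf"
username - "your-service-username"
password - "your-service-password"
{: codeblock}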
After modifying these files, you are ready to crawl your data. Proceed to Crawling your data repository to continue.
## Configuring crawl options
{: #configuring-crawl-options}
The `config/crawler.conf` file contains information that tells the Data Crawler which files to use for its crawl (input adapter), where to send the collection of crawled files once the crawl has been completed (output adapter), and other crawl management options.

Note: All file paths are relative to the `config` directory, except where noted.
To access the in-product manual for the `crawler.conf` file, with the most up-to-date information, type the following command from the Crawler installation directory:

man crawler.conf

{: pre} {: tip}
The options that can be set in this file are:
### Configuring the input adapter
{: #input-adapter}
- `class` - Internal use only; defines the Data Crawler input adapter class. The default value is: `com.ibm.watson.crawler.connectorframeworkinputadapter.Crawl`
- `config` - Internal use only; defines the connector framework configuration. The default configuration key within this block to pass to the chosen input adapter is: `connector_framework`

  The connector framework is what allows you to talk to your data. It could be internal data within the enterprise, or it could be external data on the web or in the cloud. The connectors allow access to a number of different data sources, while connecting is actually controlled by the crawling process.

  Important: Data retrieved by the Connector Framework Input Adapter is cached locally, and it is not stored encrypted. By default, the data is cached to a temporary directory that should be cleared on reboot, and that should be readable only by the user who executed the crawler command.

  There is a chance that this directory could outlive the crawler if the connector framework were to go away before it could clean up after itself. Carefully consider the location for your cached data: you can put it on an encrypted file system, but be aware of the performance implications of doing so. Only you can decide the appropriate balance between speed and security for your crawls.
- `crawl_config_file` - The configuration file to use for the crawl. The default value is: `connectors/filesystem.conf`
- `crawl_seed_file` - The crawl seed file to use for the crawl. The default value is: `seeds/filesystem-seed.conf`
- `id_vcrypt_file` - The keyfile used for data encryption by the Crawler; the default key included with the crawler is `id_vcrypt`. Use the `vcrypt` script in the `bin` folder if you need to generate a new `id_vcrypt` file.
- `crawler_temp_dir` - The Crawler temporary folder for connector logs. The default value, `tmp`, is provided. If it doesn't already exist, the `tmp` folder is created in the current working directory.
- `extra_jars_dir` - Adds a directory of extra JARs to the connector framework classpath. Note: This path is relative to the connector framework `lib/java` directory.
  - This value must be `oakland` when using the SharePoint connector.
  - This value must be `database` when using the Database connector.
  - You can leave this value empty (that is, an empty string `""`) when using other connectors.
- `urls_to_filter` - A whitelist of URLs to crawl, in regular expression form. The Data Crawler only crawls URLs that match one of the regular expressions provided.

  The domain list contains the most common top-level domains; add to it if necessary. The file extension-type list contains the file extensions that the Orchestration Service supports, as of this release of the Data Crawler.

  Ensure that your seed URL domain is allowed by the filter. For example, if the seed URL looks like `http://testdomain.test.in`, add "`in`" to the domain filter. Ensure that your seed URL will not be excluded by a filter, or the Crawler may hang.
- `max_text_size` - The maximum size, in bytes, that a document can be before it is written to disk by the Connector Framework. Setting this higher decreases the number of documents written to disk, but increases the memory requirement. The default value is `1048576`.
- `extra_vm_params` - Allows you to add extra Java parameters to the command used to launch the Connector Framework.
- `bootstrap_logging` - Writes the connector framework startup log; useful for advanced debugging only. Possible values are `true` or `false`. The log file is written to `crawler_temp_dir`.
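Taken together, a minimal sketch of several of these input adapter settings might look like the following. This is illustrative only: it assumes the `connector_framework` configuration key and the `option - value` style used by the other examples in this guide, and it shows only the documented default values; check the comments in your own copy of `crawler.conf` for the exact structure.

connector_framework {
  crawl_config_file - "connectors/filesystem.conf"
  crawl_seed_file - "seeds/filesystem-seed.conf"
  id_vcrypt_file - "id_vcrypt"
  crawler_temp_dir - "tmp"
  max_text_size - 1048576
}
{: codeblock}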
### Configuring the output adapter
{: #output-adapter}
- `class` - Defines the Data Crawler output adapter class.
- `config` - Defines which configuration key to pass to the output adapter. The string must correspond to a key within this configuration object. In the following code example:

  class - "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
  config - "discovery_service"
  discovery_service {
    include "discovery/discovery_service.conf"
  }
  {: codeblock}

  the configuration key is `discovery_service`.
You must select an output adapter by specifying its `class` parameter and `config` key.
- {{site.data.keyword.discoveryshort}} Service Output Adapter - Uploads crawled documents to the {{site.data.keyword.discoveryfull}} service. Select this adapter by setting the `class` parameter and `config` key as follows:
class - "com.ibm.watson.crawler.discoveryserviceoutputadapter.DiscoveryServiceOutputAdapter"
config - "discovery_service"
discovery_service {
include "discovery/discovery_service.conf"
}
{: codeblock}
- Test Output Adapter - Writes a representation of the crawled files to disk in a specified location. Select this adapter by setting the `class` parameter and `config` key as follows. An additional parameter, `output_directory`, selects the directory to which the representation of the crawled data is written:
class - "com.ibm.watson.crawler.testoutputadapter.TestOutputAdapter"
config - "test"
output_directory - "/tmp/crawler-test-output"
{: codeblock}
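If you use the Test Output Adapter, you can verify that a crawl produced output by listing the output directory after the crawl completes, for example (assuming the `output_directory` value shown above):

ls /tmp/crawler-test-output

{: pre}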
- `retry` - Specifies the options for retrying failed attempts to push documents to the output adapter.
  - `max_attempts` - Maximum number of retry attempts. The default value is `10`.
  - `delay` - Minimum amount of delay between attempts, in seconds. The default value is `2`.
  - `exponent_base` - Factor that determines the growth of the delay time over each failed attempt. The default value is `2`.

  The formula is:

  `d(nth_retry) = delay * (exponent_base ^ nth_retry)`

  For example, with a delay of 1 second and an exponent base of 2, the second retry (the third attempt) delays 2 seconds instead of 1, and the next retry delays 4 seconds:

  `d(0) = 1 * (2 ^ 0)` - 1 second
  `d(1) = 1 * (2 ^ 1)` - 2 seconds
  `d(2) = 1 * (2 ^ 2)` - 4 seconds

  So, with the default settings, a submission will be attempted up to 10 times, waiting up to approximately 1022 seconds in total (with the default 2-second delay, the nine retries wait 2 + 4 + 8 + ... + 512 seconds), which is a little more than 17 minutes. This time is approximate because additional time is added in order to avoid having multiple resubmissions execute simultaneously. This "fuzzed" time is up to 10 percent, so the delays in the previous example, which total 7 seconds, could take up to 7.7 seconds. The wait time does not include the time spent connecting to the service, uploading data, or waiting for a response.
  The `output_timeout` value takes precedence over the wait time here: if the total retry wait time exceeds that setting, the submission fails even if it would otherwise have been retried.
  {: tip}
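As a sketch, the retry options might be grouped in `crawler.conf` like this, assuming the same nested-block, `option - value` style used by the other examples in this guide; the values shown are the documented defaults:

retry {
  max_attempts - 10
  delay - 2
  exponent_base - 2
}
{: codeblock}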
### Additional crawl management options
{: #additional-crawl-management-options}
- `full_node_debugging` - Activates debugging mode; possible values are `true` or `false`. Important: Setting this to `true` puts the full data of every document crawled into the logs.
- `logging.log4j.configuration_file` - The configuration file to use for logging. In the sample `crawler.conf` file, this option is defined in `logging.log4j` and its default value is `log4j_custom.properties`. This option must be similarly defined whether using a `.properties` or a `.conf` file.
- `shutdown_timeout` - Specifies the timeout value, in minutes, before shutting down the application. The default value is `10`.
- `output_limit` - The highest number of indexable items that the Crawler tries to send simultaneously to the output adapter. This can be further limited by the number of cores available to do the work. At any given point, no more than this number of indexable items are sent to the output adapter and waiting to return. The default value is `10`.
- `input_limit` - Limits the number of URLs that can be requested from the input adapter at one time. The default value is `3`.
- `output_timeout` - The amount of time, in seconds, before the Data Crawler gives up on a request to the output adapter and removes the item from the output adapter queue to allow more processing. The default value is `1200`.

  Consider the constraints imposed by the output adapter, as those constraints may relate to the limits defined here. The `output_limit` only relates to how many indexable objects can be sent to the output adapter at once. Once an indexable object is sent to the output adapter, it is "on the clock," as defined by the `output_timeout` variable. It is possible that the output adapter itself has a throttle preventing it from processing as many inputs as it receives. For instance, the orchestration output adapter may have a connection pool, configurable for HTTP connections to the service. If that pool defaults to 8, for example, and you set the `output_limit` to a number greater than 8, then you will have processes on the clock waiting for a turn to execute, and you may experience timeouts.
- `num_threads` - The number of parallel threads that can be run at one time. This value can be either an integer, which specifies the number of parallel threads directly, or a string of the format `"xNUM"`, which specifies a multiplication factor of the number of available processors, for example `"x1.5"`. The default value is `"30"`.
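As a rough sketch, several of these crawl management options might appear in `crawler.conf` like this, using the documented default values; the exact placement and syntax in your file may differ, so check the comments in your own copy:

shutdown_timeout - 10
output_limit - 10
input_limit - 3
output_timeout - 1200
num_threads - "30"
{: codeblock}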
## Configuring Service Options
{: #configuring-service-options}
The {{site.data.keyword.discoveryshort}} service configuration tells the crawler how to manage crawled files when using the {{site.data.keyword.discoveryfull}} service.
To access the in-product manual for the `discovery_service.conf` file, with the most up-to-date information, type the following command from the Crawler installation directory:

man discovery_service.conf

{: pre} {: tip}
Default options can be changed directly by opening the `config/discovery/discovery_service.conf` file, and specifying the following values specific to your use case:
- `http_timeout` - The timeout, in seconds, for the document read/index operation; the default is `125`.
- `concurrent_upload_connection_limit` - The number of simultaneous connections allowed for uploading documents. The default is `2`.

  Note: When using the Orchestration Service Output Adapter, this number should be greater than, or equal to, the `output_limit` set when configuring crawl options.
- `base_url` - The URL to which your crawled documents are sent. For the current release of the {{site.data.keyword.discoveryshort}} service, the value is `https://gateway.watsonplatform.net/discovery/api`.
- `environment_id` - The location of your crawled document collection at the base URL.
- `collection_id` - The name of the document collection that you set up in the {{site.data.keyword.discoveryshort}} service.
- `api_version` - Internal use only. The date of the last API version change.
- `configuration_id` - The filename of the configuration file that the {{site.data.keyword.discoveryshort}} service uses.
- `username` - The username used to authenticate to the location of your crawled document collection.
- `password` - The password used to authenticate to the location of your crawled document collection.
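For illustration, the optional tuning values described above might be set in `discovery_service.conf` like this, using the documented defaults. This is a sketch only; keep whatever key/value syntax your copy of the file already uses:

http_timeout - 125
concurrent_upload_connection_limit - 2
base_url - "https://gateway.watsonplatform.net/discovery/api"
{: codeblock}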
The {{site.data.keyword.discoveryshort}} Service Output Adapter can send statistics in order for {{site.data.keyword.IBM}} to better understand and serve its users. The following options can be set for the `send_stats` variable:
- `jvm` - Java Virtual Machine (JVM) statistics sent include the Java vendor and version, as reported by the JVM used to execute the Data Crawler. The value is either `true` or `false`. The default value is `true`.
- `os` - Operating system (OS) statistics sent include the OS name, version, and architecture, as reported by the JVM used to execute the Data Crawler. The value is either `true` or `false`. The default value is `true`.